2024-06-07
Vector Database

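Contents

Chroma DB
Client
Server side
Authentication
Static API token authentication
docker
Client side
create_collection: create a collection
delete_collection: delete a collection
count_collections: count collections
list_collections: list all collections
get_collection: get a collection
get_or_create_collection
add: add data to a collection
query: query collection data
get: retrieve collection data
update: update collection data
upsert
delete: delete collection data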

Chroma DB

https://github.com/chroma-core/chroma

pip install chromadb

A lightweight vector database. It currently supports CPU computation only.

Client

python
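import chromadb

# In-memory client
chroma_client = chromadb.Client()

# Persist data to disk
chroma_client = chromadb.PersistentClient(path="./chromadb_save")

# Returns a nanosecond timestamp heartbeat; use it to check that the connection is still alive
chroma_client.heartbeat()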
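The heartbeat value is easier to read once it is converted to a date. The following is a minimal sketch, assuming `heartbeat()` returns an integer count of nanoseconds since the Unix epoch; the helper name `heartbeat_to_datetime` is hypothetical and not part of chromadb.

```python
from datetime import datetime, timezone


def heartbeat_to_datetime(heartbeat_ns):
    """Convert a nanosecond Unix timestamp into a timezone-aware UTC datetime."""
    seconds, nanos = divmod(heartbeat_ns, 1_000_000_000)
    # datetime only has microsecond resolution, so sub-microsecond digits are dropped.
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return base.replace(microsecond=nanos // 1000)


# Hard-coded sample value; in practice pass chroma_client.heartbeat() instead.
print(heartbeat_to_datetime(1_700_000_000_123_456_789).isoformat())
```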

Server side

chroma run --host 0.0.0.0 --port 8000 --path /db_path --log-path /var/log/chroma.log

Authentication
shell
# Generate the password hash
htpasswd -Bbn admin password > server.htpasswd

# Set the environment variables
export CHROMA_SERVER_AUTH_CREDENTIALS_FILE="server.htpasswd"
export CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER='chromadb.auth.providers.HtpasswdFileServerAuthCredentialsProvider'
export CHROMA_SERVER_AUTH_PROVIDER='chromadb.auth.basic.BasicAuthServerProvider'
python
import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    settings=Settings(
        chroma_client_auth_provider="chromadb.auth.basic.BasicAuthClientProvider",
        chroma_client_auth_credentials="admin:password"
    )
)
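When testing the server with a plain HTTP tool instead of the Python client, it helps to know what the Basic credentials look like on the wire. This is a sketch only, assuming the server accepts a standard HTTP Basic `Authorization` header for the `admin:password` pair used above; `basic_auth_header` is a hypothetical helper, not a chromadb function.

```python
import base64


def basic_auth_header(username, password):
    """Build a standard HTTP Basic Authorization header as a dict."""
    raw = f"{username}:{password}".encode("utf-8")
    return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}


# Same credentials as in the htpasswd example.
print(basic_auth_header("admin", "password"))
```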
Static API token authentication

TOKENS: must be alphanumeric ASCII strings. Tokens are case-sensitive.

shell
# Set the environment variables
export CHROMA_SERVER_AUTH_CREDENTIALS="test-token"
export CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER="chromadb.auth.token.TokenConfigServerAuthCredentialsProvider"
export CHROMA_SERVER_AUTH_PROVIDER="chromadb.auth.token.TokenAuthServerProvider"

# To use an authentication header of the form X-Chroma-Token: test-token, set one more environment variable
export CHROMA_SERVER_AUTH_TOKEN_TRANSPORT_HEADER="X_CHROMA_TOKEN"
python
import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    settings=Settings(
        chroma_client_auth_provider="chromadb.auth.token.TokenAuthClientProvider",
        chroma_client_auth_credentials="test-token"
    )
)
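The two header styles for token authentication can be compared side by side. This is a sketch under assumptions: the `X-Chroma-Token` form comes from the shell comment above, while the `Authorization: Bearer` form as the default is an assumption about this Chroma version. `token_auth_headers` is a hypothetical helper, not part of chromadb.

```python
def token_auth_headers(token, use_x_chroma_token=False):
    """Return the HTTP header a token-authenticated request would carry."""
    if use_x_chroma_token:
        # Form used when the server sets CHROMA_SERVER_AUTH_TOKEN_TRANSPORT_HEADER.
        return {"X-Chroma-Token": token}
    # Assumed default form when no transport header is configured.
    return {"Authorization": f"Bearer {token}"}


print(token_auth_headers("test-token"))
print(token_auth_headers("test-token", use_x_chroma_token=True))
```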
docker

docker run -d --name chromadb-container -p 8899:8000 chromadb/chroma

yml
version: '3.9'

networks:
  net:
    driver: bridge

services:
  server:
    image: ghcr.io/chroma-core/chroma:latest
    environment:
      - IS_PERSISTENT=TRUE
    volumes:
      - /chroma_data:/chroma/chroma/
    ports:
      - 11111:8000

Client side

pip install chromadb-client

python
import chromadb

# Connect to a running Chroma server
chroma_client = chromadb.HttpClient(host='localhost', port=8000)

create_collection: create a collection

python
collection = chroma_client.create_collection(
    name="my_collection",
    # Override the embedding model (https://docs.trychroma.com/guides/embeddings).
    # The default embedding model is all-MiniLM-L6-v2.
    embedding_function=emb_fn,
    # Choose how vector distance is computed:
    # 'cosine' (cosine similarity), 'ip' (inner product), 'l2' (Euclidean distance).
    # The default is 'l2'.
    metadata={"hnsw:space": "cosine"}
)
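To make the three `hnsw:space` options concrete, here is a plain-Python sketch of the distance formulas. The exact forms are an assumption based on the Chroma documentation as I understand it (squared L2, and one minus the inner product or cosine similarity, so that smaller always means closer); the function names are hypothetical.

```python
import math


def squared_l2(a, b):
    """'l2': sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip_distance(a, b):
    """'ip': one minus the inner product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """'cosine': one minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


a, b = [1.0, 0.0], [2.0, 0.0]
# Same direction, different length: cosine treats them as identical, l2 does not.
print(squared_l2(a, b), ip_distance(a, b), cosine_distance(a, b))
```

Cosine ignores vector length, which is why it is a common choice for text embeddings that are not normalized.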

delete_collection: delete a collection

python
chroma_client.delete_collection(name="my_collection")

count_collections: count collections

python
chroma_client.count_collections()

list_collections: list all collections

python
chroma_client.list_collections()

get_collection: get a collection

python
collection = chroma_client.get_collection(name="my_collection")

get_or_create_collection

Returns the collection if it exists and creates it otherwise. The parameters are the same as for create_collection.

python
collection = chroma_client.get_or_create_collection(name="my_collection")

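add: add data to a collection

python
collection.add(
    documents=["lorem ipsum...", "doc2", "doc3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)
  • documents: Chroma also stores the documents themselves. If a document is too large to embed with the chosen embedding function, an exception is raised. When embeddings are provided, documents may be omitted.
  • embeddings: document vectors can be supplied directly instead of being computed at add time (an exception is raised if the supplied embeddings do not match the collection's dimensionality).
  • metadatas: stores additional information and enables filtering.
  • ids: every document must have a unique associated ID. Calling .add twice with the same ID keeps only the initially stored value.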
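The constraints in the list above can be checked before calling `add`, so that problems surface early. A minimal sketch follows; `validate_batch` is a hypothetical helper and its checks are a simplification of the rules described here, not Chroma's own validation.

```python
def validate_batch(ids, documents=None, embeddings=None, metadatas=None):
    """Raise ValueError if the batch breaks the rules described for add()."""
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique within a batch")
    if documents is None and embeddings is None:
        raise ValueError("provide documents, embeddings, or both")
    # Every supplied list must line up one-to-one with ids.
    for name, values in (("documents", documents),
                         ("embeddings", embeddings),
                         ("metadatas", metadatas)):
        if values is not None and len(values) != len(ids):
            raise ValueError(f"{name} has {len(values)} items, expected {len(ids)}")
    if embeddings:
        # All vectors in one batch must share a single dimensionality.
        dims = {len(vector) for vector in embeddings}
        if len(dims) != 1:
            raise ValueError(f"embeddings have inconsistent dimensions: {sorted(dims)}")


validate_batch(
    ids=["id1", "id2"],
    documents=["doc1", "doc2"],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4]],
)
print("batch looks consistent")
```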

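query: query collection data

python
collection.query(
    # query_texts and query_embeddings are alternatives; Chroma expects only one of them per call
    query_texts=['xxx', 'xxx'],
    query_embeddings=[[11.1, 12.1, 13.1], [1.1, 2.3, 3.2], ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains": "search_string"},
    include=["embeddings", "metadatas", "documents", "distances"]
)
  • n_results: the number of nearest results to return for each query
  • where: filters by document metadata, https://docs.trychroma.com/guides#using-where-filters
  • where_document: filters by document content; $contains matches documents that contain the string, $not_contains matches those that do not
  • include: the fields to include in the result; defaults to "metadatas", "documents", "distances"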
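A query result is a dict of nested lists, with one inner list per query. The sketch below flattens that shape into one row per hit. The sample `result` is hand-written illustrative data, not output from a real collection, and `iter_hits` is a hypothetical helper.

```python
def iter_hits(result):
    """Yield one flat dict per hit from a query()-style result."""
    optional_keys = ("distances", "documents", "metadatas")
    for query_index, ids in enumerate(result["ids"]):
        for rank, doc_id in enumerate(ids):
            hit = {"query": query_index, "rank": rank, "id": doc_id}
            for key in optional_keys:
                column = result.get(key)
                # Fields left out of `include` are assumed to be absent or None.
                hit[key] = column[query_index][rank] if column else None
            yield hit


# Hand-written sample with two queries: the first has two hits, the second has one.
result = {
    "ids": [["id1", "id3"], ["id2"]],
    "distances": [[0.12, 0.48], [0.05]],
    "documents": [["doc1", "doc3"], ["doc2"]],
    "metadatas": [[{"chapter": "3"}, {"chapter": "29"}], [{"chapter": "3"}]],
}

for hit in iter_hits(result):
    print(hit)
```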

get: retrieve collection data

python
collection.get(
    ids=["id1", "id2", "id3", ...],
    where={"style": "style1"},
    include=["embeddings", "metadatas", "documents"]
)
  • include: the fields to include in the result; defaults to "metadatas", "documents"
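The filter semantics used by `query` and `get` can be mimicked in a few lines of plain Python. This toy sketch covers only the simple forms that appear in this article (plain equality in `where`, and `$contains` / `$not_contains` in `where_document`); the function names are hypothetical and Chroma's real filter language supports more operators.

```python
def matches_where(metadata, where):
    """True if every key in `where` has the same value in `metadata`."""
    return all(key in metadata and metadata[key] == expected
               for key, expected in where.items())


def matches_where_document(document, where_document):
    """Apply a single $contains or $not_contains condition to a document string."""
    if "$contains" in where_document:
        return where_document["$contains"] in document
    if "$not_contains" in where_document:
        return where_document["$not_contains"] not in document
    raise ValueError("this sketch only supports $contains and $not_contains")


records = [
    {"id": "id1", "document": "a search_string here", "metadata": {"style": "style1"}},
    {"id": "id2", "document": "nothing relevant", "metadata": {"style": "style2"}},
]
kept = [r["id"] for r in records
        if matches_where(r["metadata"], {"style": "style1"})
        and matches_where_document(r["document"], {"$contains": "search_string"})]
print(kept)
```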

update: update collection data

python
collection.update(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)
  • ids: if an ID is not found in the collection, an error is logged and the update for that ID is ignored
  • documents: if documents are supplied without corresponding embeddings, the embeddings are recomputed with the collection's embedding function

upsert

Updates a record if its ID already exists and adds it otherwise.

python
collection.upsert(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)
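The difference between add, update and upsert comes down to how each treats existing and missing IDs. The toy class below models only those ID rules as described in this article, using a plain dict. `ToyCollection` is hypothetical and has nothing to do with how Chroma stores data.

```python
class ToyCollection:
    """In-memory stand-in that mimics only the ID-handling rules of add/update/upsert."""

    def __init__(self):
        self.records = {}

    def add(self, ids, documents):
        for doc_id, document in zip(ids, documents):
            # An ID that already exists keeps its initial value.
            self.records.setdefault(doc_id, document)

    def update(self, ids, documents):
        for doc_id, document in zip(ids, documents):
            # An unknown ID is skipped rather than created.
            if doc_id in self.records:
                self.records[doc_id] = document

    def upsert(self, ids, documents):
        for doc_id, document in zip(ids, documents):
            # Overwrite when present, insert when absent.
            self.records[doc_id] = document


toy = ToyCollection()
toy.add(["id1"], ["first"])
toy.add(["id1"], ["second"])      # ignored: id1 already exists
toy.update(["id2"], ["ignored"])  # ignored: id2 does not exist
toy.upsert(["id1", "id2"], ["updated", "created"])
print(toy.records)
```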

delete: delete collection data

python
collection.delete(
    ids=["id1", "id2", "id3", ...],
    where={"chapter": "20"},
    where_document={"$contains": "search_string"}
)