Using Chroma DB with Python

Chroma DB

pip install chromadb

A lightweight vector database; it currently supports CPU computation only

Client

python
import chromadb

chroma_client = chromadb.Client()

# Persist data to disk
chroma_client = chromadb.PersistentClient(path="./chromadb_save")

chroma_client.heartbeat()  # Returns a heartbeat timestamp in nanoseconds; use it to check that the connection is alive
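
The raw heartbeat value is hard to read, so here is a minimal sketch that converts a nanosecond timestamp into a UTC datetime. It assumes the heartbeat is nanoseconds since the Unix epoch. `heartbeat_to_datetime` and `sample_ns` are illustrative names, and the sample value is made up rather than taken from a real server.

```python
from datetime import datetime, timezone


def heartbeat_to_datetime(heartbeat_ns: int) -> datetime:
    """Convert a nanosecond epoch timestamp to a timezone-aware UTC datetime."""
    # Split into whole seconds and leftover nanoseconds to avoid float rounding.
    seconds, nanos = divmod(heartbeat_ns, 1_000_000_000)
    moment = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime only stores microseconds, so the last three digits are dropped.
    return moment.replace(microsecond=nanos // 1000)


# Hypothetical value standing in for chroma_client.heartbeat()
sample_ns = 1_700_000_000_123_456_789
print(heartbeat_to_datetime(sample_ns).isoformat())
```

With a live client, the call would look like `heartbeat_to_datetime(chroma_client.heartbeat())`.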

Server

chroma run --host 0.0.0.0 --port 8000 --path /db_path --log-path /var/log/chroma.log

Basic authentication

shell
# Generate the password hash
htpasswd -Bbn admin password > server.htpasswd

# Set the environment variables
export CHROMA_SERVER_AUTH_CREDENTIALS_FILE="server.htpasswd"
export CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER='chromadb.auth.providers.HtpasswdFileServerAuthCredentialsProvider'
export CHROMA_SERVER_AUTH_PROVIDER='chromadb.auth.basic.BasicAuthServerProvider'

python
import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    settings=Settings(
        chroma_client_auth_provider="chromadb.auth.basic.BasicAuthClientProvider",
        chroma_client_auth_credentials="admin:password"
    )
)
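
For debugging with tools such as curl, it can help to know what the client sends on the wire. The sketch below builds a standard HTTP Basic `Authorization` header by hand, assuming the Basic auth provider follows the usual scheme (base64 of `user:password`). `basic_auth_header` is a hypothetical helper and is not part of chromadb.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build an HTTP Basic Authorization header for the given credentials."""
    raw = f"{username}:{password}".encode("utf-8")
    encoded = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {encoded}"}


# Same credentials as in the htpasswd example above
headers = basic_auth_header("admin", "password")
print(headers)
```

Base64 is an encoding, not encryption, so these credentials should only travel over a trusted network or TLS.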

Static API token authentication

Tokens must be alphanumeric ASCII strings. Tokens are case-sensitive

shell
# Set the environment variables
export CHROMA_SERVER_AUTH_CREDENTIALS="test-token"
export CHROMA_SERVER_AUTH_CREDENTIALS_PROVIDER="chromadb.auth.token.TokenConfigServerAuthCredentialsProvider"
export CHROMA_SERVER_AUTH_PROVIDER="chromadb.auth.token.TokenAuthServerProvider"

# To use an authentication header of the form X-Chroma-Token: test-token, set an additional environment variable
export CHROMA_SERVER_AUTH_TOKEN_TRANSPORT_HEADER="X_CHROMA_TOKEN"

python
import chromadb
from chromadb.config import Settings

client = chromadb.HttpClient(
    settings=Settings(
        chroma_client_auth_provider="chromadb.auth.token.TokenAuthClientProvider",
        chroma_client_auth_credentials="test-token"
    )
)
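
The two header styles for token authentication can be sketched as plain dictionaries. This assumes the default transport is `Authorization: Bearer <token>` and that `X-Chroma-Token` is used only when the extra environment variable is set on the server. `token_auth_headers` is an illustrative helper, not a chromadb function.

```python
def token_auth_headers(token: str, use_x_chroma_token: bool = False) -> dict:
    """Return the HTTP header that carries a static API token."""
    if use_x_chroma_token:
        # Alternative transport: a dedicated header holding the bare token
        return {"X-Chroma-Token": token}
    # Assumed default transport: a standard bearer token
    return {"Authorization": f"Bearer {token}"}


print(token_auth_headers("test-token"))
print(token_auth_headers("test-token", use_x_chroma_token=True))
```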

Docker

docker run -d --name chromadb-container -p 8899:8000 chromadb/chroma

yml
version: '3.9'

networks:
  net:
    driver: bridge
services:
  server:
    image: ghcr.io/chroma-core/chroma:latest
    environment:
      - IS_PERSISTENT=TRUE
    volumes:
      - /chroma_data:/chroma/chroma/
    ports:
      - 11111:8000

Client

pip install chromadb-client

python
import chromadb
chroma_client = chromadb.HttpClient(host='localhost', port=8000)

create_collection: create a collection

python
collection = chroma_client.create_collection(
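    name="my_collection",
    embedding_function=emb_fn,  # Optional: override the embedding model (https://docs.trychroma.com/guides/embeddings); emb_fn must be defined beforehand. Default model: all-MiniLM-L6-v2
    metadata={"hnsw:space": "cosine"}  # Distance function: 'cosine' (cosine similarity), 'ip' (inner product), 'l2' (Euclidean distance, computed in squared form); defaults to 'l2'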
)
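
To show how the three `hnsw:space` options differ, here is a pure-Python sketch of the distance formulas. They follow my reading of the Chroma documentation (squared L2, one minus inner product, one minus cosine similarity) and should be treated as an assumption rather than the library's actual implementation. The function names are illustrative.

```python
import math


def l2_distance(a, b):
    """Squared Euclidean distance: sum of squared component differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ip_distance(a, b):
    """Inner-product distance: 1 minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


u, v = [1.0, 0.0], [0.0, 1.0]
print(l2_distance(u, v), ip_distance(u, v), cosine_distance(u, v))
```

In all three cases a smaller value means a closer match. Cosine distance ignores vector length, while `ip` and `l2` do not.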

delete_collection: delete a collection

python
chroma_client.delete_collection(
    name="my_collection"
)

count_collections: count the collections

python
chroma_client.count_collections()

list_collections: list all collections

python
chroma_client.list_collections()

get_collection: get a collection

python
collection = chroma_client.get_collection(name="my_collection")

get_or_create_collection

Returns the collection if it exists and creates it otherwise. Takes the same parameters as create_collection

python
collection = chroma_client.get_or_create_collection(name="my_collection")

add: add data to a collection

python
collection.add(
    documents=["lorem ipsum...", "doc2", "doc3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)

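documents: Chroma also stores the documents themselves. If a document is too large to embed with the chosen embedding function, an exception is raised. documents may be omitted when embeddings are provided
embeddings: precomputed vectors can be passed in directly instead of being computed on the fly (an exception is raised if their dimension differs from the collection's)
metadatas: stores additional information and enables filtering
ids: every document must have a unique id. Calling .add twice with the same id keeps only the first value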
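
The constraints on `add` arguments can be checked up front before any data is sent. The sketch below is a hypothetical pre-flight check, not part of chromadb. It assumes that every list must be as long as `ids`, that ids must be unique, that at least one of `documents` or `embeddings` is required, and that all embeddings share one dimension.

```python
def validate_add_batch(ids, documents=None, embeddings=None, metadatas=None):
    """Raise ValueError if the batch breaks the constraints listed above."""
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique within a batch")
    if documents is None and embeddings is None:
        raise ValueError("provide documents, embeddings, or both")
    for name, values in (("documents", documents),
                         ("embeddings", embeddings),
                         ("metadatas", metadatas)):
        if values is not None and len(values) != len(ids):
            raise ValueError(f"{name} must have one entry per id")
    if embeddings is not None and len({len(vec) for vec in embeddings}) > 1:
        raise ValueError("all embeddings must have the same dimension")


# A well-formed batch passes silently
validate_add_batch(
    ids=["id1", "id2"],
    documents=["doc1", "doc2"],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4]],
)
```

This check cannot see the dimension of an existing collection, so a mismatch against stored vectors would still surface as an exception from `add` itself.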

query: search a collection by similarity

python
collection.query(
    # Pass either query_texts or query_embeddings, not both
    query_texts=['xxx', 'xxx'],
    query_embeddings=[[11.1, 12.1, 13.1], [1.1, 2.3, 3.2], ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains": "search_string"},
    include=["embeddings", "metadatas", "documents", "distances"]
)

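n_results: number of nearest results to return for each query
where: filters on document metadata, see https://docs.trychroma.com/guides#using-where-filters
where_document: filters on document content; $contains matches documents that contain the string, $not_contains matches those that do not
include: fields to include in the result; defaults to "metadatas", "documents", "distances"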
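
A query result is a dictionary of nested lists, with one inner list per query, which is awkward to iterate directly. The sketch below regroups such a dictionary into one list of hits per query. `rows_per_query` is a hypothetical helper, and `sample_result` is hand-written to mimic the shape and is not real output.

```python
def rows_per_query(result: dict) -> list:
    """Regroup a query result into one list of hit dictionaries per query."""
    rows = []
    for q, ids in enumerate(result["ids"]):
        hits = []
        for i, item_id in enumerate(ids):
            hit = {"id": item_id}
            for key in ("documents", "metadatas", "distances"):
                values = result.get(key)
                if values is not None:  # fields left out of include are None
                    hit[key[:-1]] = values[q][i]  # "documents" -> "document"
            hits.append(hit)
        rows.append(hits)
    return rows


# Hand-written sample shaped like a single-query result (not real data)
sample_result = {
    "ids": [["id1", "id2"]],
    "documents": [["doc1", "doc2"]],
    "metadatas": [[{"chapter": "3"}, {"chapter": "29"}]],
    "distances": [[0.1, 0.4]],
    "embeddings": None,
}
for hit in rows_per_query(sample_result)[0]:
    print(hit)
```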

get: fetch collection data by id or filter

python
collection.get(
    ids=["id1", "id2", "id3", ...],
    where={"style": "style1"},
    include=["embeddings", "metadatas", "documents"]
)

include: fields to include in the result; defaults to "metadatas", "documents"
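
The filter forms used in this document can be modelled locally to see which records they would select. This is only a sketch of the assumed semantics (exact equality on a single metadata field, case-sensitive substring matching on document text), not Chroma's filter engine. Both function names are illustrative.

```python
def matches_where(metadata: dict, where: dict) -> bool:
    """Single-field equality filter, e.g. {"style": "style1"}."""
    (field, expected), = where.items()  # this sketch handles exactly one field
    return field in metadata and metadata[field] == expected


def matches_where_document(document: str, condition: dict) -> bool:
    """Document filter supporting $contains and $not_contains."""
    (operator, text), = condition.items()
    if operator == "$contains":
        return text in document
    if operator == "$not_contains":
        return text not in document
    raise ValueError(f"unsupported operator: {operator}")


print(matches_where({"style": "style1"}, {"style": "style1"}))
print(matches_where_document("some search_string here", {"$contains": "search_string"}))
```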

update: update data in a collection

python
collection.update(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)

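ids: if an id is not found in the collection, an error is logged and the update is ignored
documents: if documents are provided without matching embeddings, the embeddings are recomputed with the collection's embedding function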

upsert

Updates the record if the id exists and adds it otherwise

python
collection.upsert(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)
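
The difference between add, update, and upsert comes down to how each treats an id that already exists or is missing. The toy class below models only those rules as described in these notes, using a plain dictionary. `TinyCollection` is a hypothetical stand-in and does not reflect how Chroma stores data.

```python
class TinyCollection:
    """In-memory model of the id-handling rules for add, update, and upsert."""

    def __init__(self):
        self.records = {}

    def add(self, item_id, document):
        # An existing id keeps its initial value
        self.records.setdefault(item_id, document)

    def update(self, item_id, document):
        # An unknown id is ignored
        if item_id in self.records:
            self.records[item_id] = document

    def upsert(self, item_id, document):
        # Update if present, insert otherwise
        self.records[item_id] = document


col = TinyCollection()
col.add("id1", "first")
col.add("id1", "second")      # ignored, id1 already exists
col.update("missing", "x")    # ignored, id not found
col.upsert("id1", "third")    # overwrites id1
col.upsert("id2", "new")      # inserts id2
print(col.records)
```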

delete: delete data from a collection

python
collection.delete(
    ids=["id1", "id2", "id3", ...],
    where={"chapter": "20"},
    where_document={"$contains": "search_string"}
)
