txtai supports a number of different database and vector index backends, including external databases. With modern hardware, it's amazing how far a single node index can take us: easily into the hundreds of millions and even billions of records.
txtai provides maximum flexibility in creating your own embeddings database. Sensible defaults work well out of the box, so customization isn't necessary unless you're looking for it. This article will explore the options available when you want to customize your embeddings database.
More information on embeddings configuration settings can be found here.
Install dependencies
Install txtai and all dependencies.
# Install txtai
pip install txtai[database,similarity] datasets
Load dataset
This example will use the ag_news dataset, which is a collection of news articles. We'll use a subset of 25,000 headlines.
import timeit
from datasets import load_dataset
def timer(embeddings, query="red sox"):
    elapsed = timeit.timeit(lambda: embeddings.search(query), number=250)
    print(f"{elapsed / 250} seconds per query")
dataset = load_dataset("ag_news", split="train")["text"][:25000]
NumPy
Let's start with the simplest embeddings database: a thin wrapper that vectorizes text with sentence-transformers, stores the results as a NumPy array and runs similarity queries.
from txtai.embeddings import Embeddings
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "backend": "numpy"})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
embeddings.search("red sox")
[(19831, 0.6780003309249878),
(18302, 0.6639199256896973),
(16370, 0.6617192029953003)]
embeddings.info()
{
"backend": "numpy",
"build": {
"create": "2023-05-04T12:12:02Z",
"python": "3.10.11",
"settings": {
"numpy": "1.22.4"
},
"system": "Linux (x86_64)",
"txtai": "5.6.0"
},
"dimensions": 384,
"offset": 25000,
"path": "sentence-transformers/all-MiniLM-L6-v2",
"update": "2023-05-04T12:12:02Z"
}
The embeddings instance above vectorizes the text and stores the content as a NumPy array. Array index positions are returned along with similarity scores. While the same thing could easily be done with sentence-transformers alone, the txtai framework makes it easy to swap in different options, as shown next.
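As a point of comparison, the brute-force search behind the NumPy backend can be sketched without txtai at all. The vectors below are random stand-ins; in practice they would come from encoding text with the sentence-transformers model used above.

```python
import numpy as np

# Random stand-in vectors; real ones would come from
# SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2").encode(texts)
rng = np.random.default_rng(0)
data = rng.normal(size=(25000, 384)).astype(np.float32)
query = rng.normal(size=(1, 384)).astype(np.float32)

# Normalize so that inner product equals cosine similarity
data /= np.linalg.norm(data, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Brute-force similarity search: score every row, keep the top 3
scores = (data @ query.T).ravel()
top = np.argsort(-scores)[:3]
results = list(zip(top.tolist(), scores[top].tolist()))
```

This mirrors the `(index, score)` tuples returned by `embeddings.search` when no content store is configured.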
SQLite and NumPy
The next combination we'll test is a SQLite database with a NumPy array.
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "sqlite", "backend": "numpy"})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
Now let's run a search.
embeddings.search("red sox")
[{'id': '19831',
'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
'score': 0.6780003309249878},
{'id': '18302',
'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',
'score': 0.6639199256896973},
{'id': '16370',
'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',
'score': 0.6617192029953003}]
embeddings.info()
{
"backend": "numpy",
"build": {
"create": "2023-05-04T12:12:24Z",
"python": "3.10.11",
"settings": {
"numpy": "1.22.4"
},
"system": "Linux (x86_64)",
"txtai": "5.6.0"
},
"content": "sqlite",
"dimensions": 384,
"offset": 25000,
"path": "sentence-transformers/all-MiniLM-L6-v2",
"update": "2023-05-04T12:12:24Z"
}
Same results as before. The only difference is that the content is now available via the associated SQLite database.
Let's inspect the ANN object to see what it looks like.
print(embeddings.ann.backend.shape)
print(type(embeddings.ann.backend))
(25000, 384)
<class 'numpy.memmap'>
As expected, it's a NumPy array. Let's calculate how long a search query takes to execute.
timer(embeddings)
0.03392000120000011 seconds per query
Not bad at all!
SQLite and PyTorch
Now let's try a PyTorch backend.
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "sqlite", "backend": "torch"})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
Let's run a search again.
embeddings.search("red sox")
[{'id': '19831',
'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
'score': 0.678000271320343},
{'id': '18302',
'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',
'score': 0.6639199256896973},
{'id': '16370',
'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',
'score': 0.6617191433906555}]
embeddings.info()
{
"backend": "torch",
"build": {
"create": "2023-05-04T12:12:53Z",
"python": "3.10.11",
"settings": {
"torch": "2.0.0+cu118"
},
"system": "Linux (x86_64)",
"txtai": "5.6.0"
},
"content": "sqlite",
"dimensions": 384,
"offset": 25000,
"path": "sentence-transformers/all-MiniLM-L6-v2",
"update": "2023-05-04T12:12:53Z"
}
Once again, let's inspect the ANN object.
print(embeddings.ann.backend.shape)
print(type(embeddings.ann.backend))
torch.Size([25000, 384])
<class 'torch.Tensor'>
As expected, this time the backend is a Torch tensor. Next, let's calculate the average search time.
timer(embeddings)
0.021084972200000267 seconds per query
Slightly faster, since Torch uses the GPU to compute the similarity matrix.
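The speedup comes from scoring every vector in a single matrix multiply on the device. A minimal sketch of that computation, using random stand-in vectors, looks like this; it automatically moves to the GPU when one is available.

```python
import torch

# Random stand-in vectors; real ones would come from the embeddings model
torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
data = torch.nn.functional.normalize(torch.randn(25000, 384, device=device))
query = torch.nn.functional.normalize(torch.randn(1, 384, device=device))

# A single matrix multiply scores every row at once
scores = torch.mm(data, query.T).squeeze(1)
values, indices = scores.topk(3)
```

With normalized vectors, the inner product is the cosine similarity, matching the scores returned by the search calls above.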
SQLite and Faiss
Now let's run the same code with the standard txtai setup of Faiss + SQLite.
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
embeddings.search("red sox")
[{'id': '19831',
'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
'score': 0.6780003309249878},
{'id': '18302',
'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',
'score': 0.6639199256896973},
{'id': '16370',
'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',
'score': 0.6617192029953003}]
embeddings.info()
{
"backend": "faiss",
"build": {
"create": "2023-05-04T12:13:23Z",
"python": "3.10.11",
"settings": {
"components": "IVF632,Flat"
},
"system": "Linux (x86_64)",
"txtai": "5.6.0"
},
"content": true,
"dimensions": 384,
"offset": 25000,
"path": "sentence-transformers/all-MiniLM-L6-v2",
"update": "2023-05-04T12:13:23Z"
}
timer(embeddings)
0.008729957724000087 seconds per query
Everything is consistent with the previous examples. Note that Faiss is faster, as expected given that it's a vector index. The difference is negligible for 25,000 records, but vector index performance improves rapidly for datasets in the million+ range.
SQLite and HNSW
While txtai strives to keep things as simple as possible, with sensible defaults that work well out of the box in many situations, customizing the backend options can lead to performance gains. The next example stores vectors in an HNSW index and customizes the index options.
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True, "backend": "hnsw", "hnsw": {"m": 32}})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
embeddings.search("red sox")
[{'id': '19831',
'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
'score': 0.6780003309249878},
{'id': '18302',
'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',
'score': 0.6639198660850525},
{'id': '16370',
'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',
'score': 0.6617192029953003}]
embeddings.info()
{
"backend": "hnsw",
"build": {
"create": "2023-05-04T12:13:59Z",
"python": "3.10.11",
"settings": {
"efconstruction": 200,
"m": 32,
"seed": 100
},
"system": "Linux (x86_64)",
"txtai": "5.6.0"
},
"content": true,
"deletes": 0,
"dimensions": 384,
"hnsw": {
"m": 32
},
"metric": "ip",
"offset": 25000,
"path": "sentence-transformers/all-MiniLM-L6-v2",
"update": "2023-05-04T12:13:59Z"
}
timer(embeddings)
0.006160191656000279 seconds per query
Once again, everything matches the previous examples, with a negligible performance difference vs Faiss.
hnswlib powers a number of popular vector databases. It's definitely an option worth evaluating.
Configuration storage
Configuration is passed to an embeddings instance as a dictionary. When an embeddings instance is saved, the default behavior is to store the configuration as a pickled object. JSON can alternatively be used.
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True, "format": "json"})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
# Save embeddings
embeddings.save("index")
!cat index/config.json
{
"path": "sentence-transformers/all-MiniLM-L6-v2",
"content": true,
"format": "json",
"dimensions": 384,
"backend": "faiss",
"offset": 25000,
"build": {
"create": "2023-05-04T12:14:25Z",
"python": "3.10.11",
"settings": {
"components": "IVF632,Flat"
},
"system": "Linux (x86_64)",
"txtai": "5.6.0"
},
"update": "2023-05-04T12:14:25Z"
}
Looking at the stored configuration, it's nearly identical to the embeddings.info() call. By design, the JSON configuration is human readable. This is a good option when sharing an embeddings database on the Hugging Face Hub.
SQLite vs DuckDB
The last thing we'll explore is the database backend.
SQLite is a row-oriented database, while DuckDB is column-oriented. This design difference is important and a factor to consider when evaluating the expected workload. Let's explore it.
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "sqlite"})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
timer(embeddings, "SELECT text FROM txtai where id = 3980")
0.0001413383999997677 seconds per query
timer(embeddings, "SELECT count(*), text FROM txtai group by text order by count(*) desc")
0.03718761139199978 seconds per query
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "duckdb"})
# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
timer(embeddings, "SELECT text FROM txtai where id = 3980")
0.002780103128000519 seconds per query
timer(embeddings, "SELECT count(*), text FROM txtai group by text order by count(*) desc")
0.01854579007600023 seconds per query
While this dataset of 25,000 rows is small, we can start to see the differences. SQLite has a much faster single-row retrieval time, while DuckDB does better with an aggregate query. This is an artifact of row-oriented vs column-oriented database design and a factor to consider when developing a solution.
Wrapping up
This article explored different combinations of database and vector index backends. With modern hardware, it's amazing how far a single node index can take us: easily into the hundreds of millions and even billions of records. When hardware bottlenecks become an issue, an external vector database is one option worth considering. Another is building a distributed txtai embeddings cluster.
There is power in simplicity. Many paid services try to convince us that signing up for an API account is the best place to start. In some cases, such as for teams with little to no developer resources, this is true. But for teams with developers, options like txtai should be evaluated.