Customize Your Own Embeddings Database
#showdev #python #machinelearning #nlp

txtai supports a number of different database and vector index backends, including external databases. With modern hardware, it's amazing how far a single-node index can take us. Easily into the hundreds of millions and even billions of records.

txtai provides maximum flexibility in creating your own embeddings database. Sensible defaults work well out of the box, so unless you're looking for specific behavior, this configuration isn't necessary. This article explores the options available when you do want to customize your embeddings database.

More information on embeddings configuration settings can be found here.

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai[database,similarity] datasets

Load dataset

This example will use the ag_news dataset, which is a collection of news articles. We'll use a subset of 25,000 headlines.

import timeit

from datasets import load_dataset

def timer(embeddings, query="red sox"):
  elapsed = timeit.timeit(lambda: embeddings.search(query), number=250)
  print(f"{elapsed / 250} seconds per query")

dataset = load_dataset("ag_news", split="train")["text"][:25000]

NumPy

Let's start with the simplest embeddings database possible. This will be a thin wrapper around vectorizing text with sentence-transformers, storing the results as a NumPy array and running similarity queries.

from txtai.embeddings import Embeddings

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "backend": "numpy"})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

embeddings.search("red sox")
[(19831, 0.6780003309249878),
 (18302, 0.6639199256896973),
 (16370, 0.6617192029953003)]
embeddings.info()
{
  "backend": "numpy",
  "build": {
    "create": "2023-05-04T12:12:02Z",
    "python": "3.10.11",
    "settings": {
      "numpy": "1.22.4"
    },
    "system": "Linux (x86_64)",
    "txtai": "5.6.0"
  },
  "dimensions": 384,
  "offset": 25000,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2023-05-04T12:12:02Z"
}

The embeddings instance above vectorized the text and stored the content as a NumPy array. Array index positions are returned with similarity scores. While the same thing could easily be done with sentence-transformers directly, using the txtai framework makes it easy to swap in different options, as shown next.
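To make the mechanics concrete, here is a minimal NumPy-only sketch of what this backend effectively does: normalize the vectors, compute a dot product against the query and sort. The vectors below are made-up toy values, not model output, and the 4-dimensional size is for readability (the real model produces 384 dimensions).

```python
import numpy as np

# Toy document vectors (3 docs, 4 dims) standing in for model embeddings
vectors = np.array([
    [0.1, 0.9, 0.2, 0.0],
    [0.8, 0.1, 0.1, 0.3],
    [0.2, 0.8, 0.3, 0.1],
], dtype=np.float32)

# Normalize rows so a dot product equals cosine similarity
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

def search(query, limit=3):
    # Normalize the query, score every document with one matrix product
    query = query / np.linalg.norm(query)
    scores = vectors @ query
    # Sort descending and return (id, score) tuples like txtai does
    ids = np.argsort(-scores)[:limit]
    return [(int(i), float(scores[i])) for i in ids]

results = search(np.array([0.1, 1.0, 0.2, 0.0], dtype=np.float32))
print(results)
```

This brute-force scan is exactly why a flat array backend stays simple: there is no index structure to build or tune, just one matrix product per query.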

SQLite and NumPy

The next combination we'll test is a SQLite database with a NumPy array.

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "sqlite", "backend": "numpy"})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

Now let's run a search.

embeddings.search("red sox")
[{'id': '19831',
  'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
  'score': 0.6780003309249878},
 {'id': '18302',
  'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',
  'score': 0.6639199256896973},
 {'id': '16370',
  'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',
  'score': 0.6617192029953003}]
embeddings.info()
{
  "backend": "numpy",
  "build": {
    "create": "2023-05-04T12:12:24Z",
    "python": "3.10.11",
    "settings": {
      "numpy": "1.22.4"
    },
    "system": "Linux (x86_64)",
    "txtai": "5.6.0"
  },
  "content": "sqlite",
  "dimensions": 384,
  "offset": 25000,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2023-05-04T12:12:24Z"
}

Same results as before. The only difference is the content is now available via the associated SQLite database.

Let's inspect the ANN object to see how it looks.

print(embeddings.ann.backend.shape)
print(type(embeddings.ann.backend))
(25000, 384)
<class 'numpy.memmap'>

As expected, it's a NumPy array. Let's calculate how long a search query takes to execute.

timer(embeddings)
0.03392000120000011 seconds per query

Not bad at all!

SQLite and PyTorch

Now let's try a PyTorch backend.

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "sqlite", "backend": "torch"})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

Let's run a search again.

embeddings.search("red sox")
[{'id': '19831',
  'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
  'score': 0.678000271320343},
 {'id': '18302',
  'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',
  'score': 0.6639199256896973},
 {'id': '16370',
  'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',
  'score': 0.6617191433906555}]
embeddings.info()
{
  "backend": "torch",
  "build": {
    "create": "2023-05-04T12:12:53Z",
    "python": "3.10.11",
    "settings": {
      "torch": "2.0.0+cu118"
    },
    "system": "Linux (x86_64)",
    "txtai": "5.6.0"
  },
  "content": "sqlite",
  "dimensions": 384,
  "offset": 25000,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2023-05-04T12:12:53Z"
}

Once again, let's inspect the ANN object.

print(embeddings.ann.backend.shape)
print(type(embeddings.ann.backend))
torch.Size([25000, 384])
<class 'torch.Tensor'>

As expected, this time the backend is a Torch tensor. Next we'll calculate the average search time.

timer(embeddings)
0.021084972200000267 seconds per query

A bit faster, since Torch uses the GPU to compute the similarity matrix.
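Conceptually, the Torch backend boils down to the same matrix-vector product as the NumPy version, just executed on the GPU when one is available. A rough sketch with random stand-in vectors (not the actual txtai implementation):

```python
import torch

# Random stand-in vectors; the real backend holds a (25000, 384) tensor.
# Move the tensor to the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
backend = torch.nn.functional.normalize(torch.rand(100, 384), dim=1).to(device)

def search(query, limit=3):
    # One matrix-vector product on the device scores every document
    query = torch.nn.functional.normalize(query, dim=0).to(device)
    scores = backend @ query
    values, ids = torch.topk(scores, limit)
    return list(zip(ids.tolist(), values.tolist()))

results = search(torch.rand(384))
```

The speedup comes purely from where the multiplication runs; the algorithm is still an exact brute-force scan.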

SQLite and Faiss

Now let's run the same code with the standard txtai setup of Faiss + SQLite.

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

embeddings.search("red sox")
[{'id': '19831',
  'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
  'score': 0.6780003309249878},
 {'id': '18302',
  'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',
  'score': 0.6639199256896973},
 {'id': '16370',
  'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',
  'score': 0.6617192029953003}]
embeddings.info()
{
  "backend": "faiss",
  "build": {
    "create": "2023-05-04T12:13:23Z",
    "python": "3.10.11",
    "settings": {
      "components": "IVF632,Flat"
    },
    "system": "Linux (x86_64)",
    "txtai": "5.6.0"
  },
  "content": true,
  "dimensions": 384,
  "offset": 25000,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2023-05-04T12:13:23Z"
}
timer(embeddings)
0.008729957724000087 seconds per query

Everything is consistent with the previous examples. Note that Faiss is faster, as expected, given that it is a vector index. For 25,000 records the difference is negligible, but vector index performance gains grow quickly for datasets in the million+ range.
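The `info()` output above shows that txtai auto-selected an `IVF632,Flat` Faiss factory string for this dataset size. If the auto-selected settings don't fit a workload, txtai accepts custom Faiss options in the configuration. The values below are illustrative, not benchmarked recommendations:

```python
# Hypothetical tuning: override the auto-selected factory string with a
# custom one and set the number of IVF cells probed at query time.
# Fewer probes = faster but less accurate; values here are illustrative.
config = {
    "path": "sentence-transformers/all-MiniLM-L6-v2",
    "content": True,
    "backend": "faiss",
    "faiss": {"components": "IVF100,Flat", "nprobe": 6},
}
```

This dictionary would be passed to `Embeddings(config)` the same way as the earlier examples.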

SQLite and HNSW

While txtai strives to keep things as simple as possible, with plenty of common-sense defaults out of the box, customizing backend options can lead to performance gains. The next example will store vectors in an HNSW index and customize the index options.

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True, "backend": "hnsw", "hnsw": {"m": 32}})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

embeddings.search("red sox")
[{'id': '19831',
  'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',
  'score': 0.6780003309249878},
 {'id': '18302',
  'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',
  'score': 0.6639198660850525},
 {'id': '16370',
  'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',
  'score': 0.6617192029953003}]
embeddings.info()
{
  "backend": "hnsw",
  "build": {
    "create": "2023-05-04T12:13:59Z",
    "python": "3.10.11",
    "settings": {
      "efconstruction": 200,
      "m": 32,
      "seed": 100
    },
    "system": "Linux (x86_64)",
    "txtai": "5.6.0"
  },
  "content": true,
  "deletes": 0,
  "dimensions": 384,
  "hnsw": {
    "m": 32
  },
  "metric": "ip",
  "offset": 25000,
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "update": "2023-05-04T12:13:59Z"
}
timer(embeddings)
0.006160191656000279 seconds per query

Once again, everything matches the previous examples. There is a negligible performance difference compared to Faiss.

hnswlib powers many popular vector databases. It is definitely an option worth evaluating.

Configuration storage

Configuration is passed to an embeddings instance as a dictionary. When an embeddings instance is saved, the default behavior is to save the configuration as a pickled object. JSON can alternatively be used.

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": True, "format": "json"})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))

# Save embeddings
embeddings.save("index")

!cat index/config.json
{
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "content": true,
  "format": "json",
  "dimensions": 384,
  "backend": "faiss",
  "offset": 25000,
  "build": {
    "create": "2023-05-04T12:14:25Z",
    "python": "3.10.11",
    "settings": {
      "components": "IVF632,Flat"
    },
    "system": "Linux (x86_64)",
    "txtai": "5.6.0"
  },
  "update": "2023-05-04T12:14:25Z"
}

Looking at the stored configuration, it's nearly identical to an embeddings.info() call. JSON configuration is designed to be readable. This is a good option when sharing an embeddings database on the Hugging Face Hub.
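Because the stored configuration is plain JSON, it can be inspected with nothing but the standard library, without loading txtai or the index itself. A small sketch parsing a trimmed copy of the config shown above:

```python
import json

# Trimmed copy of the config.json shown above, inlined for illustration;
# in practice this would be read from index/config.json
config = json.loads("""{
  "path": "sentence-transformers/all-MiniLM-L6-v2",
  "content": true,
  "format": "json",
  "dimensions": 384,
  "backend": "faiss"
}""")

# Check which model and backend built the index before deciding to load it
print(config["path"], config["backend"], config["dimensions"])
```

This kind of quick inspection is handy when evaluating a shared index: the model path and vector dimensions tell you what's needed to query it.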

SQLite vs DuckDB

The last thing we'll explore is the database backend.

SQLite is a row-oriented database and DuckDB is column-oriented. This design difference is important and a factor to consider when evaluating the expected workload. Let's explore.

# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "sqlite"})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
timer(embeddings, "SELECT text FROM txtai where id = 3980")
0.0001413383999997677 seconds per query
timer(embeddings, "SELECT count(*), text FROM txtai group by text order by count(*) desc")
0.03718761139199978 seconds per query
# Create embeddings instance
embeddings = Embeddings({"path": "sentence-transformers/all-MiniLM-L6-v2", "content": "duckdb"})

# Index data
embeddings.index((x, text, None) for x, text in enumerate(dataset))
timer(embeddings, "SELECT text FROM txtai where id = 3980")
0.002780103128000519 seconds per query
timer(embeddings, "SELECT count(*), text FROM txtai group by text order by count(*) desc")
0.01854579007600023 seconds per query

While a dataset of 25,000 rows is small, we can start to see the differences. SQLite has a much faster single-row retrieval time. DuckDB does better with an aggregate query. This is an artifact of row-oriented vs column-oriented databases and a factor to consider when designing a solution.
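The row-store side of this trade-off can be reproduced with the standard library alone. The sketch below uses an in-memory SQLite table with a simplified stand-in schema (not txtai's actual schema) and synthetic text:

```python
import sqlite3

# In-memory SQLite table mimicking a content store (schema simplified)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE txtai (id INTEGER PRIMARY KEY, text TEXT)")
con.executemany("INSERT INTO txtai VALUES (?, ?)",
                [(i, f"article {i % 100}") for i in range(25000)])

# Point lookup: served directly by the primary key index, SQLite's strong suit
row = con.execute("SELECT text FROM txtai WHERE id = 3980").fetchone()

# Aggregate: scans every row, the access pattern where a column store
# like DuckDB pulls ahead by reading only the needed column
top = con.execute(
    "SELECT count(*) AS c, text FROM txtai GROUP BY text ORDER BY c DESC LIMIT 1"
).fetchone()
```

The point lookup touches one page via the index, while the aggregate must visit all 25,000 rows, which matches the timing gap measured above.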

Wrapping up

This article explored different combinations of database and vector index backends. With modern hardware, it's amazing how far a single-node index can take us. Easily into the hundreds of millions and even billions of records. When hardware bottlenecks become an issue, an external vector database is one option worth considering. Another is building a distributed txtai embeddings cluster.

There is power in simplicity. Many paid services try to convince us that signing up for an API account is the best place to start. In some cases, such as teams with very few to no developers, that is true. But for teams with developers, options like txtai should be evaluated.