External database integration
#showdev #python #machinelearning #nlp

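txtai provides many default settings to help developers get started quickly. For example, metadata is stored in SQLite, dense vectors in Faiss, sparse vectors in a terms index and graph data with NetworkX.

Each of these components is customizable and can be swapped for an alternate implementation. This has been covered in a number of previous articles.

This article introduces how to store metadata in client-server RDBMS systems. In addition to SQLite and DuckDB, any SQLAlchemy-supported database with JSON support can now be used.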

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai[database] elasticsearch==7.10.1 datasets

Install Postgres

Next, we'll install Postgres and start a Postgres instance.

# Install and start Postgres
apt-get update && apt-get install postgresql
service postgresql start
sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"

Load a dataset

Now we're ready to load a dataset. We'll use the ag_news dataset, which consists of 120,000 news headlines.

from datasets import load_dataset

# Load dataset
ds = load_dataset("ag_news", split="train")

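Build an Embeddings instance with Postgres

Let's load this dataset into an embeddings database. We'll configure this instance to store metadata in Postgres. Note that the content parameter below is a SQLAlchemy connection string.

This embeddings database will use the default vector settings and build that index locally.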

import txtai

# Create embeddings
embeddings = txtai.Embeddings(
    content="postgresql+psycopg2://postgres:postgres@localhost/postgres",
)

# Index dataset
embeddings.index(ds["text"])
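The connection string passed as content above follows the usual dialect+driver://user:password@host/database layout. As a minimal sketch, the snippet below uses only the Python standard library to split the same string into its parts. It is only an illustration of the layout and is not how SQLAlchemy or txtai parse the value internally.

```python
from urllib.parse import urlsplit

# Same connection string as the example above
url = "postgresql+psycopg2://postgres:postgres@localhost/postgres"

parts = urlsplit(url)

# The scheme carries "dialect+driver"
dialect, driver = parts.scheme.split("+")

settings = {
    "dialect": dialect,
    "driver": driver,
    "username": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    # The path is "/<database>", strip the leading slash
    "database": parts.path.lstrip("/"),
}

print(settings)
```

Pointing the embeddings database at a different server or database should then only require changing the matching part of the string.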

Let's run a search query and see what comes back.

embeddings.search("red sox defeat yankees")
[{'id': '63561',
  'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory Sunday over the Yankees that avoided a four-game sweep in the AL championship series...',
  'score': 0.8104304671287537},
 {'id': '63221',
  'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory over the Yankees on Sunday night that avoided a four-game sweep in the AL championship series...',
  'score': 0.8097385168075562},
 {'id': '66861',
  'text': 'Record-Breaking Red Sox Clinch World Series Berth  NEW YORK (Reuters) - The Boston Red Sox crushed the New  York Yankees 10-3 Wednesday to complete an historic comeback  victory over their arch-rivals by four games to three in the  American League Championship Series.',
  'score': 0.8003846406936646}]

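As expected, we get the standard id, text, score fields with the top matches for the query. The difference is that all of the database metadata normally stored in a local SQLite file is now stored in a Postgres server.

This opens up a number of possibilities, such as row-level security. If the database doesn't return a row, it won't be shown here. Alternatively, this search could optionally return only the ids and scores, which would tell users that a record exists that they can't access.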
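The two behaviors described above (dropping hidden rows, or returning only ids and scores for them) can be sketched in plain Python. This is a hypothetical illustration: the `visible` function, the `allowed` set and the sample rows are assumptions made for the example. In a real deployment the database's own row-level security policy would decide which rows come back.

```python
def visible(results, allowed, mask=False):
    """
    Filters search results down to rows the caller may read.

    results: list of dicts with id, text and score keys
    allowed: set of ids the caller has access to (hypothetical)
    mask: when True, keep hidden rows but return only id and score
    """

    output = []
    for row in results:
        if row["id"] in allowed:
            # Caller can read this row, return it unchanged
            output.append(row)
        elif mask:
            # Caller can't read this row, only reveal that it exists
            output.append({"id": row["id"], "score": row["score"]})

    return output

# Placeholder rows, not real search output
results = [
    {"id": "1", "text": "first document", "score": 0.9},
    {"id": "2", "text": "second document", "score": 0.8},
]

print(visible(results, allowed={"1"}))
print(visible(results, allowed={"1"}, mask=True))
```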

As with other supported databases, functions of the underlying database can be called from txtai SQL.

embeddings.search("SELECT id, text, md5(text), score FROM txtai WHERE similar('red sox defeat yankees')")
[{'id': '63561',
  'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory Sunday over the Yankees that avoided a four-game sweep in the AL championship series...',
  'md5': '1e55a78fdf0cb3be3ef61df650f0a50f',
  'score': 0.8104304671287537},
 {'id': '63221',
  'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory over the Yankees on Sunday night that avoided a four-game sweep in the AL championship series...',
  'md5': 'a0417e1fc503a5a2945c8755b6fb18d5',
  'score': 0.8097385168075562},
 {'id': '66861',
  'text': 'Record-Breaking Red Sox Clinch World Series Berth  NEW YORK (Reuters) - The Boston Red Sox crushed the New  York Yankees 10-3 Wednesday to complete an historic comeback  victory over their arch-rivals by four games to three in the  American League Championship Series.',
  'md5': '398a8508692aed109bd8c56f067a8083',
  'score': 0.8003846406936646}]

Note that the Postgres md5 function was added to the query.
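For comparison, a similar hash can be computed on the client side. The sketch below assumes the text is hashed as UTF-8 bytes, which should match Postgres md5 when the server encoding is UTF-8. The local `md5` helper is defined only for this example.

```python
import hashlib

def md5(text):
    # Hex digest of the UTF-8 encoded text (assumes a UTF-8 server encoding)
    return hashlib.md5(text.encode("utf-8")).hexdigest()

print(md5("hello"))  # 5d41402abc4b2a76b9719d911017c592
```

Calling the function inside the query, as the article does, avoids pulling the text back to the client just to hash it.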

Let's save the embeddings database and show its files.

embeddings.save("vectors")
!ls -l vectors
total 183032
-rw-r--r-- 1 root root       355 Sep  7 16:38 config
-rw-r--r-- 1 root root 187420123 Sep  7 16:38 embeddings

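Only the configuration and the local vectors index are stored in this case.

External indexing

As mentioned previously, all of the main components of txtai can be replaced with custom components. For example, there are external integrations for storing dense vectors in Weaviate and Qdrant.

Next, we'll build an example that stores metadata in Postgres and builds a sparse index with Elasticsearch.

Scoring component for Elasticsearch

First, we need to define a custom scoring component for Elasticsearch. While we could have used an existing integration, it's important to show that creating a new component isn't a large level of effort (~70 lines of code). See below.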

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

from txtai.scoring import Scoring

class Elastic(Scoring):
  def __init__(self, config=None):
    # Scoring configuration
    self.config = config if config else {}

    # Server parameters
    self.url = self.config.get("url", "http://localhost:9200")
    self.indexname = self.config.get("indexname", "testindex")

    # Elasticsearch connection
    self.connection = Elasticsearch(self.url)

    self.terms = True
    self.normalize = self.config.get("normalize")

  def insert(self, documents, index=None):
    rows = []
    for uid, document, tags in documents:
        rows.append((index, document))

        # Increment index
        index = index + 1

    bulk(self.connection, ({"_index": self.indexname, "_id": uid, "text": text} for uid, text in rows))

  def index(self, documents=None):
    self.connection.indices.refresh(index=self.indexname)

  def search(self, query, limit=3):
    # batchsearch returns one result list per query, take the first
    return self.batchsearch([query], limit)[0]

  def batchsearch(self, queries, limit=3):
    # Generate bulk queries
    request = []
    for query in queries:
      req_head = {"index": self.indexname, "search_type": "dfs_query_then_fetch"}
      req_body = {
        "_source": False,
        "query": {"multi_match": {"query": query, "type": "best_fields", "fields": ["text"], "tie_breaker": 0.5}},
        "size": limit,
      }
      request.extend([req_head, req_body])

    # Run ES query once, after the loop, so every query is part of the request
    response = self.connection.msearch(body=request, request_timeout=600)

    # Read responses
    results = []
    for resp in response["responses"]:
      result = resp["hits"]["hits"]
      results.append([(r["_id"], r["_score"]) for r in result])

    return results

  def count(self):
    response = self.connection.cat.count(self.indexname, params={"format": "json"})
    return int(response[0]["count"])

  def load(self, path):
    # No local storage
    pass

  def save(self, path):
    # No local storage
    pass
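To make the multi-search request and response shapes used by batchsearch easier to see, here is a minimal sketch that builds the request and parses a response without needing a running server. The function names `build_msearch` and `parse_msearch` and the hand-written response are assumptions for illustration. Only the dictionary layout is taken from the component above.

```python
def build_msearch(queries, indexname="testindex", limit=3):
    # A multi-search request is a flat list of alternating header and body entries
    request = []
    for query in queries:
        header = {"index": indexname, "search_type": "dfs_query_then_fetch"}
        body = {
            "_source": False,
            "query": {"multi_match": {"query": query, "type": "best_fields", "fields": ["text"], "tie_breaker": 0.5}},
            "size": limit,
        }
        request.extend([header, body])

    return request

def parse_msearch(response):
    # One list of (id, score) tuples per query, in request order
    return [[(hit["_id"], hit["_score"]) for hit in resp["hits"]["hits"]] for resp in response["responses"]]

# Hand-written response with placeholder values, in the same layout the component reads
response = {
    "responses": [
        {"hits": {"hits": [{"_id": "1", "_score": 2.5}]}},
        {"hits": {"hits": []}},
    ]
}

request = build_msearch(["red sox", "yankees"])
print(len(request))
print(parse_msearch(response))
```

Keeping the request building and response parsing separate from the network call also makes this part of a component easy to unit test.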

Elasticsearch server

As with Postgres, we'll install and start an Elasticsearch instance.

import os

# Download and extract elasticsearch
os.system("wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz")
os.system("tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz")
os.system("chown -R daemon:daemon elasticsearch-7.10.1")
from subprocess import Popen, PIPE, STDOUT

# Start and wait for server
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))
!sleep 30
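A fixed 30 second sleep works in a notebook but can be too short or needlessly long on other machines. As an alternative, the sketch below polls until the server answers. The `ready` and `wait` helpers are hypothetical names, and the URL assumes the default local Elasticsearch port used elsewhere in this article.

```python
import time
import urllib.error
import urllib.request

def ready(url="http://localhost:9200"):
    # True once the server answers with HTTP 200
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False

def wait(check, timeout=60, interval=1):
    # Call check() until it returns True or the timeout (in seconds) elapses
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval)

    return False

# Usage: wait(ready) returns True as soon as the server responds
```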

Let's build the index. The only difference from the previous example is setting the custom scoring component.

import txtai

# Create embeddings
embeddings = txtai.Embeddings(
    keyword=True,
    content="postgresql+psycopg2://postgres:postgres@localhost/postgres",
    scoring="__main__.Elastic"
)

# Index dataset
embeddings.index(ds["text"])

The search below is the same one shown earlier.

embeddings.search("red sox defeat yankees")
[{'id': '66954',
  'text': 'Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees.',
  'score': 21.451942},
 {'id': '69577',
  'text': 'Passing thoughts on Yankees-Red Sox series The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game. The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game and it wasn #39;t even close.',
  'score': 20.923117},
 {'id': '67253',
  'text': 'Sox Victorious At Last!! BOSTON -- After suffering decades of defeat and disappointment, the 2004 Boston Red Sox made history Wednesday night, beating the Yankees in the house that Ruth built and claiming the American League championship trophy.',
  'score': 20.865997}]

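Once again, we get the top matches. This time, though, the index is in Elasticsearch. Why are the results and scores different? Because this is a keyword index that uses Elasticsearch's raw BM25 scores.

One enhancement to this component would be to add score normalization, as seen in the standard scoring components.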
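As a rough idea of what such an enhancement might look like, the sketch below scales each result list by its top score so values fall between 0 and 1. This max-based scaling is an assumption chosen for simplicity. The built-in txtai scoring components may normalize differently. The `normalize` function is a hypothetical helper, and the scores are the ones from the search output above.

```python
def normalize(results):
    # results is a list of (id, score) tuples for a single query
    if not results:
        return results

    maxscore = max(score for _, score in results)
    if maxscore <= 0:
        return results

    # Divide by the top score so the best match scores 1.0
    return [(uid, score / maxscore) for uid, score in results]

# Raw BM25 scores from the search output above
results = [("66954", 21.451942), ("69577", 20.923117), ("67253", 20.865997)]

for uid, score in normalize(results):
    print(uid, round(score, 4))
```

Note that scores scaled this way are only comparable within a single query, not across queries.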

For good measure, we can also show that the md5 function can be called here as well.

embeddings.search("SELECT id, text, md5(text), score FROM txtai WHERE similar('red sox defeat yankees')")
[{'id': '66954',
  'text': 'Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees.',
  'md5': '29084f8640d4d72e402e991bc9fdbfa0',
  'score': 21.451942},
 {'id': '69577',
  'text': 'Passing thoughts on Yankees-Red Sox series The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game. The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game and it wasn #39;t even close.',
  'md5': '056983d301975084b49a5987185f2ddf',
  'score': 20.923117},
 {'id': '67253',
  'text': 'Sox Victorious At Last!! BOSTON -- After suffering decades of defeat and disappointment, the 2004 Boston Red Sox made history Wednesday night, beating the Yankees in the house that Ruth built and claiming the American League championship trophy.',
  'md5': '7838fcf610f0b569829c9bafdf9012f2',
  'score': 20.865997}]

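Same results with the additional md5 column, as expected.

Explore the data stores

The last thing we'll do is look at where and how this data is stored in Postgres and Elasticsearch.

Let's connect to the local Postgres instance and sample content from the sections table.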

select id, text from sections where text like '%Red Sox%' and text like '%Yankees%' and text like '%defeat%' limit 3;
[('66954', 'Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees.'),
 ('62732', "BoSox, Astros Play for Crucial Game 4 Wins (AP) AP - The Boston Red Sox entered this AL championship series hoping to finally overcome their bitter r ... (50 characters truncated) ... n-game defeat last October. Instead, they've been reduced to trying to prevent the Yankees from completing a humiliating sweep in their own ballpark."),
 ('62752', "BoSox, Astros Play for Crucial Game 4 Wins The Boston Red Sox entered this AL championship series hoping to finally overcome their bitter rivals from ... (42 characters truncated) ... game defeat last October. Instead, they've been reduced to trying to prevent the Yankees from completing a humiliating sweep in their own ballpark...")]

As expected, we can see the content stored directly in Postgres!

Now let's check Elasticsearch.

import json
import requests

response = requests.get("http://localhost:9200/_search?q=red+sox+defeat+yankees&size=3")
print(json.dumps(response.json(), indent=2))
{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3297,
      "relation": "eq"
    },
    "max_score": 21.451942,
    "hits": [
      {
        "_index": "testindex",
        "_type": "_doc",
        "_id": "66954",
        "_score": 21.451942,
        "_source": {
          "text": "Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees."
        }
      },
      {
        "_index": "testindex",
        "_type": "_doc",
        "_id": "69577",
        "_score": 20.923117,
        "_source": {
          "text": "Passing thoughts on Yankees-Red Sox series The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game. The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game and it wasn #39;t even close."
        }
      },
      {
        "_index": "testindex",
        "_type": "_doc",
        "_id": "67253",
        "_score": 20.865997,
        "_source": {
          "text": "Sox Victorious At Last!! BOSTON -- After suffering decades of defeat and disappointment, the 2004 Boston Red Sox made history Wednesday night, beating the Yankees in the house that Ruth built and claiming the American League championship trophy."
        }
      }
    ]
  }
}

These are the same results as the query run through the embeddings database.

Let's save the embeddings database and review what's stored.

embeddings.save("elastic")
!ls -l elastic
total 4
-rw-r--r-- 1 root root 155 Sep  7 16:39 config

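All we have is the configuration. There are no database, embeddings or scoring files. That data is in Postgres and Elasticsearch!

Wrapping up

This article showed how external databases and other external integrations can be used with embeddings databases. This architecture ensures that txtai can easily adapt as new ways to index and store data become available.

This article also showed that creating a custom component takes little effort and can easily be done for a component without an existing integration.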