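Each of these components is customizable and can be swapped with alternate implementations. This has been covered in a number of previous articles.
This article will cover how to store metadata in client-server RDBMS systems. In addition to SQLite and DuckDB, any SQLAlchemy-supported database with JSON support can now be used.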
# Install txtai
pip install txtai[database] elasticsearch==7.10.1 datasets
# Install and start Postgres
apt-get update && apt-get install postgresql
service postgresql start
sudo -u postgres psql -U postgres -c "ALTER USER postgres PASSWORD 'postgres';"
from datasets import load_dataset
# Load dataset
ds = load_dataset("ag_news", split="train")
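Let's load this dataset into an embeddings database. We'll configure this instance to store metadata in Postgres. Note that the content parameter below is a SQLAlchemy connection string.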
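To make the shape of a SQLAlchemy connection string concrete, here is a minimal sketch that splits one into its parts using only the standard library. The `describe` helper is hypothetical (it is not part of txtai or SQLAlchemy), and the URL is an assumption based on the local Postgres setup above.

```python
from urllib.parse import urlsplit


def describe(url):
    """Split a SQLAlchemy-style database URL into its parts."""
    parts = urlsplit(url)

    # The scheme has the form "dialect+driver"; the driver part is optional
    dialect, _, driver = parts.scheme.partition("+")

    return {
        "dialect": dialect,
        "driver": driver or None,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/") or None,
    }


# Assumed URL for the local Postgres instance (user and password "postgres")
info = describe("postgresql+psycopg2://postgres:postgres@localhost/postgres")
print(info)
```

Swapping the dialect and driver prefix is what points the same configuration at a different database server.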
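import txtai
# Create embeddings, storing content in Postgres (connection string assumed from the setup above)
embeddings = txtai.Embeddings(content="postgresql+psycopg2://postgres:postgres@localhost/postgres")
# Index dataset
embeddings.index(ds["text"])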
embeddings.search("red sox defeat yankees")
[{'id': '63561',
'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory Sunday over the Yankees that avoided a four-game sweep in the AL championship series...',
'score': 0.8104304671287537},
{'id': '63221',
'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory over the Yankees on Sunday night that avoided a four-game sweep in the AL championship series...',
'score': 0.8097385168075562},
{'id': '66861',
'text': 'Record-Breaking Red Sox Clinch World Series Berth NEW YORK (Reuters) - The Boston Red Sox crushed the New York Yankees 10-3 Wednesday to complete an historic comeback victory over their arch-rivals by four games to three in the American League Championship Series.',
'score': 0.8003846406936646}]
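As expected, we get the standard id, text, score fields with the best matches for the query. The difference, though, is that all the database metadata normally stored in a local SQLite file is now stored in a Postgres server.
As with other supported databases, underlying database functions can be called from txtai SQL.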
embeddings.search("SELECT id, text, md5(text), score FROM txtai WHERE similar('red sox defeat yankees')")
[{'id': '63561',
'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory Sunday over the Yankees that avoided a four-game sweep in the AL championship series...',
'md5': '1e55a78fdf0cb3be3ef61df650f0a50f',
'score': 0.8104304671287537},
{'id': '63221',
'text': 'Red Sox Beat Yankees 6-4 in 12 Innings BOSTON - Down to their last three outs of the season, the Boston Red Sox rallied - against Mariano Rivera, the New York Yankees and decades of disappointment. Bill Mueller singled home the tying run off Rivera in the ninth inning and David Ortiz homered against Paul Quantrill in the 12th, leading Boston to a 6-4 victory over the Yankees on Sunday night that avoided a four-game sweep in the AL championship series...',
'md5': 'a0417e1fc503a5a2945c8755b6fb18d5',
'score': 0.8097385168075562},
{'id': '66861',
'text': 'Record-Breaking Red Sox Clinch World Series Berth NEW YORK (Reuters) - The Boston Red Sox crushed the New York Yankees 10-3 Wednesday to complete an historic comeback victory over their arch-rivals by four games to three in the American League Championship Series.',
'md5': '398a8508692aed109bd8c56f067a8083',
'score': 0.8003846406936646}]
Note the addition of the Postgres md5 function to the query.
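The md5 column is a 32-character hex digest computed by the database. As a rough client-side check, the same digest can be computed in Python, assuming the text is encoded as UTF-8 on both sides. The `md5_hex` helper is a hypothetical name used only for this sketch.

```python
import hashlib


def md5_hex(text):
    """Return the hex MD5 digest of text encoded as UTF-8."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


digest = md5_hex("hello")
print(digest)
```

Comparing this value against the md5 field of a search result is one way to confirm that the stored text round-trips unchanged.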
!ls -l vectors
total 183032
-rw-r--r-- 1 root root 355 Sep 7 16:38 config
-rw-r--r-- 1 root root 187420123 Sep 7 16:38 embeddings
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from txtai.scoring import Scoring

class Elastic(Scoring):
    def __init__(self, config=None):
        # Scoring configuration
        self.config = config if config else {}
        # Server parameters
        self.url = self.config.get("url", "http://localhost:9200")
        self.indexname = self.config.get("indexname", "testindex")
        # Elasticsearch connection
        self.connection = Elasticsearch(self.url)
        self.terms = True
        self.normalize = self.config.get("normalize")

    def insert(self, documents, index=None):
        rows = []
        for uid, document, tags in documents:
            rows.append((index, document))
            # Increment index
            index = index + 1
        # Bulk load rows, using the index position as the Elasticsearch document id
        bulk(self.connection, ({"_index": self.indexname, "_id": uid, "text": text} for uid, text in rows))

    def index(self, documents=None):
        # Refresh so newly inserted documents are searchable
        self.connection.indices.refresh(index=self.indexname)

    def search(self, query, limit=3):
        return self.batchsearch([query], limit)

    def batchsearch(self, queries, limit=3):
        # Generate bulk queries
        request = []
        for query in queries:
            req_head = {"index": self.indexname, "search_type": "dfs_query_then_fetch"}
            req_body = {
                "_source": False,
                "query": {"multi_match": {"query": query, "type": "best_fields", "fields": ["text"], "tie_breaker": 0.5}},
                "size": limit,
            }
            request.extend([req_head, req_body])
        # Run ES query
        response = self.connection.msearch(body=request, request_timeout=600)
        # Read responses
        results = []
        for resp in response["responses"]:
            result = resp["hits"]["hits"]
            results.append([(r["_id"], r["_score"]) for r in result])
        return results

    def count(self):
        response = self.connection.cat.count(self.indexname, params={"format": "json"})
        return int(response[0]["count"])

    def load(self, path):
        # No local storage
        pass

    def save(self, path):
        # No local storage
        pass
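The batch search above relies on the multi search request being a flat list that alternates a header and a body per query, and on the response holding one entry per query. The following sketch shows that shape without a running server. `build_msearch` and `parse_msearch` are hypothetical helpers, the query body is simplified to a plain match query, and the sample response is hand-written with made-up ids and scores.

```python
def build_msearch(queries, indexname="testindex", limit=3):
    """Build a flat multi search request: one header and one body per query."""
    request = []
    for query in queries:
        request.append({"index": indexname})
        request.append({"_source": False, "query": {"match": {"text": query}}, "size": limit})
    return request


def parse_msearch(response):
    """Convert a multi search response into a list of (id, score) lists, one per query."""
    return [[(hit["_id"], hit["_score"]) for hit in resp["hits"]["hits"]] for resp in response["responses"]]


request = build_msearch(["red sox", "yankees"])

# Hand-written response in the expected shape (ids and scores are illustrative only)
sample = {
    "responses": [
        {"hits": {"hits": [{"_id": "1", "_score": 2.5}, {"_id": "7", "_score": 1.25}]}},
        {"hits": {"hits": []}},
    ]
}
results = parse_msearch(sample)
print(len(request), results)
```

Returning plain (id, score) tuples keeps the scoring component independent of where the content itself is stored.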
import os
# Download and extract elasticsearch
os.system("wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz")
os.system("tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz")
os.system("chown -R daemon:daemon elasticsearch-7.10.1")
from subprocess import Popen, PIPE, STDOUT
# Start and wait for server
server = Popen(['elasticsearch-7.10.1/bin/elasticsearch'], stdout=PIPE, stderr=STDOUT, preexec_fn=lambda: os.setuid(1))
!sleep 30
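import txtai
# Create embeddings: keyword index scored by the Elastic class above, content stored in Postgres
# (keyword=True and the connection string are assumed from the surrounding text)
embeddings = txtai.Embeddings(
    keyword=True,
    content="postgresql+psycopg2://postgres:postgres@localhost/postgres",
    scoring="__main__.Elastic"
)
# Index dataset
embeddings.index(ds["text"])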
embeddings.search("red sox defeat yankees")
[{'id': '66954',
'text': 'Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees.',
'score': 21.451942},
{'id': '69577',
'text': 'Passing thoughts on Yankees-Red Sox series The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game. The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game and it wasn #39;t even close.',
'score': 20.923117},
{'id': '67253',
'text': 'Sox Victorious At Last!! BOSTON -- After suffering decades of defeat and disappointment, the 2004 Boston Red Sox made history Wednesday night, beating the Yankees in the house that Ruth built and claiming the American League championship trophy.',
'score': 20.865997}]
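Once again we get the top matches. This time, though, the index is in Elasticsearch. Why are the results and scores different? Because this is a keyword index, and it uses Elasticsearch's raw BM25 scores.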
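To illustrate why keyword scores are not bounded between 0 and 1 the way vector similarity scores are, here is a minimal sketch of the textbook BM25 formula with the commonly cited defaults k1=1.2 and b=0.75. It uses naive whitespace tokenization and made-up documents, and it is not Elasticsearch's exact implementation, so the numbers will not match the scores above.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.2, b=0.75):
    """Score each document against the query with textbook BM25."""
    # Naive lowercase whitespace tokenization
    tokenized = [document.lower().split() for document in documents]
    total = len(tokenized)
    avgdl = sum(len(tokens) for tokens in tokenized) / total

    # Number of documents each term appears in
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        freqs = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in freqs:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log(1 + (total - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency, dampened and normalized by document length
            tf = freqs[term]
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(tokens) / avgdl))
        scores.append(score)
    return scores


docs = [
    "red sox beat yankees in extra innings",
    "yankees win again",
    "stock markets rally on earnings",
]
scores = bm25_scores("red sox yankees", docs)
print(scores)
```

Each matching term adds to the total, so scores grow with the number of matched terms and their rarity, and a document with no matching terms scores zero.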
embeddings.search("SELECT id, text, md5(text), score FROM txtai WHERE similar('red sox defeat yankees')")
[{'id': '66954',
'text': 'Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees.',
'md5': '29084f8640d4d72e402e991bc9fdbfa0',
'score': 21.451942},
{'id': '69577',
'text': 'Passing thoughts on Yankees-Red Sox series The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game. The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game and it wasn #39;t even close.',
'md5': '056983d301975084b49a5987185f2ddf',
'score': 20.923117},
{'id': '67253',
'text': 'Sox Victorious At Last!! BOSTON -- After suffering decades of defeat and disappointment, the 2004 Boston Red Sox made history Wednesday night, beating the Yankees in the house that Ruth built and claiming the American League championship trophy.',
'md5': '7838fcf610f0b569829c9bafdf9012f2',
'score': 20.865997}]
select id, text from sections where text like '%Red Sox%' and text like '%Yankees%' and text like '%defeat%' limit 3;
[('66954', 'Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees.'),
('62732', "BoSox, Astros Play for Crucial Game 4 Wins (AP) AP - The Boston Red Sox entered this AL championship series hoping to finally overcome their bitter r ... (50 characters truncated) ... n-game defeat last October. Instead, they've been reduced to trying to prevent the Yankees from completing a humiliating sweep in their own ballpark."),
('62752', "BoSox, Astros Play for Crucial Game 4 Wins The Boston Red Sox entered this AL championship series hoping to finally overcome their bitter rivals from ... (42 characters truncated) ... game defeat last October. Instead, they've been reduced to trying to prevent the Yankees from completing a humiliating sweep in their own ballpark...")]
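The direct SQL query returns different rows than the keyword search because LIKE requires every literal substring to be present, while BM25 ranks partial matches. A small sketch of that behavior, using an in-memory SQLite table as a stand-in for the Postgres sections table and made-up rows:

```python
import sqlite3

# In-memory SQLite stand-in for the Postgres "sections" table (rows are illustrative only)
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE sections (id TEXT PRIMARY KEY, text TEXT)")
connection.executemany(
    "INSERT INTO sections VALUES (?, ?)",
    [
        ("1", "Red Sox end decades of defeat against the Yankees"),
        ("2", "Red Sox beat the Yankees in a deciding game"),
        ("3", "Markets rally after earnings reports"),
    ],
)

# Same query shape as above: every substring must appear in the text
rows = connection.execute(
    "SELECT id, text FROM sections "
    "WHERE text LIKE '%Red Sox%' AND text LIKE '%Yankees%' AND text LIKE '%defeat%' LIMIT 3"
).fetchall()
print(rows)
```

The second row says "beat" rather than "defeat", so it is filtered out here even though a keyword search could still rank it. Note that LIKE is case-insensitive for ASCII in SQLite by default and case-sensitive in Postgres.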
import json
import requests
response = requests.get("http://localhost:9200/_search?q=red+sox+defeat+yankees&size=3")
print(json.dumps(response.json(), indent=2))
{
  "took": 13,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 3297,
      "relation": "eq"
    },
    "max_score": 21.451942,
    "hits": [
      {
        "_index": "testindex",
        "_type": "_doc",
        "_id": "66954",
        "_score": 21.451942,
        "_source": {
          "text": "Boston Red Sox make history Believe it, New England -- the Boston Red Sox are in the World Series. And they got there with the most unbelievable comeback of all, with four sweet swings after decades of defeat, shaming the dreaded New York Yankees."
        }
      },
      {
        "_index": "testindex",
        "_type": "_doc",
        "_id": "69577",
        "_score": 20.923117,
        "_source": {
          "text": "Passing thoughts on Yankees-Red Sox series The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game. The Red Sox beat the Yankees at Yankee Stadium in a season-deciding game and it wasn #39;t even close."
        }
      },
      {
        "_index": "testindex",
        "_type": "_doc",
        "_id": "67253",
        "_score": 20.865997,
        "_source": {
          "text": "Sox Victorious At Last!! BOSTON -- After suffering decades of defeat and disappointment, the 2004 Boston Red Sox made history Wednesday night, beating the Yankees in the house that Ruth built and claiming the American League championship trophy."
        }
      }
    ]
  }
}
!ls -l elastic
total 4
-rw-r--r-- 1 root root 155 Sep 7 16:39 config