What's new in txtai 6.0
#showdev #python #machinelearning #nlp

txtai 6.0 brings a number of major feature enhancements. Highlights include:

  • Embeddings

    • Sparse/keyword indexes
    • Hybrid search
    • Subindexes
    • Streamlined methods
  • Large Language Models (LLMs)

    • Automatically instantiates the best available underlying model
    • Pass-through of keyword arguments, so new features are supported as soon as they are released upstream

These are just the big, high-level changes. There are also many improvements and bug fixes.

This notebook covers all the changes with examples.

Standard upgrade disclaimer

6.0 is one of the biggest releases to date, if not the biggest! While almost everything is backwards compatible, it's prudent to backup production indexes before upgrading, and to test before deploying.

Install dependencies

Install txtai and all dependencies.

# Install txtai
pip install txtai datasets

Sparse indexes

While dense vector indexes are by far the best option for semantic search systems, sparse keyword indexes can still add value. There are cases where finding exact matches is important, or we just want a fast index to quickly run an initial scan of a dataset.

Unfortunately, there aren't many great options for local, Python-based keyword index libraries. Most of the available options don't scale and are inefficient, designed only for simple cases. With 6.0, txtai adds a performant sparse index component with speed and accuracy comparable to Apache Lucene. A future article will discuss the engineering behind this.

Let's take a look. We'll use the awesome-chatgpt-prompts dataset on the Hugging Face Hub for all examples.

from datasets import load_dataset

import txtai

# Load dataset
ds = load_dataset("fka/awesome-chatgpt-prompts", split="train")

def stream():
  for row in ds:
    yield f"{row['act']} {row['prompt']}"

# Build sparse keyword index
embeddings = txtai.Embeddings(keyword=True, content=True)
embeddings.index(stream())

embeddings.search("Linux terminal", 1)
[{'id': '0',
  'text': 'Linux Terminal I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd',
  'score': 0.5932681465337526}]

And just like that, it's a keyword index!

A couple things to unpack here. First, for those familiar with txtai, notice that only the text field is yielded in the stream method. With 6.0, ids are automatically generated when none are provided.
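
For those who prefer explicit ids, the familiar (id, data, tags) tuple format is still supported. A minimal sketch using the same dataset:

# Sketch: explicit ids with (id, data, tags) tuples instead of relying on autoid
def stream_with_ids():
  for uid, row in enumerate(ds):
    yield (uid, f"{row['act']} {row['prompt']}", None)

explicit = txtai.Embeddings(keyword=True, content=True)
explicit.index(stream_with_ids())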

Next, notice the score. Those familiar with keyword scoring (TF-IDF, BM25) will notice that the score seems low. That's because with keyword indexes, the default scores are normalized between 0 and 1.
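
A quick way to sanity check this is to confirm that every returned score falls in that range. A small sketch against the keyword index built above:

# Sketch: normalized keyword scores should all fall between 0 and 1
results = embeddings.search("Linux terminal", 10)
print(all(0 <= result["score"] <= 1 for result in results))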

More on these items later.

Hybrid search

The addition of sparse indexes enables hybrid search. Hybrid search combines the results from sparse and dense vector indexes.

# Build hybrid index
embeddings = txtai.Embeddings(hybrid=True, content=True)
embeddings.index(stream())

embeddings.search("Linux terminal", 1)
[{'id': '0',
  'text': 'Linux Terminal I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd',
  'score': 0.6078515601252442}]

A simple change with a big impact. This new index now has both a sparse and a dense index (using the default sentence-transformers/all-MiniLM-L6-v2 model). Those scores are combined into the single score shown above.

Scoring weights (also known as alpha) control the weighting between the sparse and dense indexes.

embeddings.search("Linux terminal", 1, weights=1)
[{'id': '0',
  'text': 'Linux Terminal I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd',
  'score': 0.6224349737167358}]
embeddings.search("Linux terminal", 1, weights=0)
[{'id': '0',
  'text': 'Linux Terminal I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd',
  'score': 0.5932681465337526}]

A weight of 1 only uses the dense index, while 0 only uses the sparse index. Note that the score with weights=0 is the same as the earlier sparse index query.
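
Under the hood, a common way to blend the two result sets is a convex combination of the normalized scores. The sketch below is illustrative and not necessarily txtai's exact implementation, but note that it reproduces the hybrid score of ~0.6079 seen earlier:

# Illustrative sketch (assumption): convex combination of normalized scores
def combine(dense, sparse, alpha):
  # alpha=1 -> dense only, alpha=0 -> sparse only
  return alpha * dense + (1 - alpha) * sparse

print(combine(0.6224, 0.5933, 0.5))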

Subindexes

While sparse and hybrid indexes are great new features, the highlight of this release is the addition of subindexes. Subindexes open up many new ways to construct txtai embeddings instances. Let's take a brief look here.

# Build index with subindexes
embeddings = txtai.Embeddings(
    content=True,
    defaults=False,
    indexes={
        "sparse": {
            "keyword": True
        },
        "dense": {

        }
    }
)
embeddings.index(stream())

# Run search
embeddings.search("select id, text, score from txtai where similar('Linux terminal', 'sparse') and similar('Linux terminal', 'dense')", 1)
[{'id': '0',
  'text': 'Linux Terminal I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd',
  'score': 0.6078515601252442}]
embeddings.search("select id, text, score from txtai where similar('Linux terminal', 'dense')", 1)
[{'id': '0',
  'text': 'Linux Terminal I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd',
  'score': 0.6224349737167358}]
embeddings.search("select id, text, score from txtai where similar('Linux terminal', 'sparse')", 1)
[{'id': '0',
  'text': 'Linux Terminal I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd',
  'score': 0.5932681465337526}]

Note that the scores are the same as above. The three searches above run a hybrid, a dense and a sparse search, this time using subindexes. The top-level embeddings instance only has an associated database.

Each section within indexes is a full embeddings index, supporting all of the available options. For example, let's add a graph subindex.

# Build index with graph subindex
embeddings = txtai.Embeddings(
    content=True,
    defaults=False,
    functions=[
        {"name": "graph", "function": "indexes.act.graph.attribute"}
    ],
    expressions=[
        {"name": "topic", "expression": "graph(indexid, 'topic')"},
    ],
    indexes={
        "act": {
            "keyword": True,
            "columns": {
                "text": "act"
            },
            "graph": {
                "topics": {}
            }
        },
        "prompt":{
            "columns": {
                "text": "prompt"
            }
        }
    }
)
embeddings.index(ds)

# Run search
embeddings.search("select id, act, prompt, score, topic from txtai where similar('Linux terminal')", 1)
[{'id': '0',
  'act': 'Linux Terminal',
  'prompt': 'I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd',
  'score': 0.6382951796072414,
  'topic': 'terminal_linux_sql'}]

Notice the new topic field added to this query. This comes from the graph index, which runs topic modeling. Also note that two indexes over two different columns were added.

Note that graph indexes are different in that they depend on an available sparse or dense index. That is how the graph is automatically constructed. For good measure, let's add the graph to a dense index.

# Build index with graph subindex
embeddings = txtai.Embeddings(
    content=True,
    defaults=False,
    functions=[
        {"name": "graph", "function": "indexes.act.graph.attribute"}
    ],
    expressions=[
        {"name": "topic", "expression": "graph(indexid, 'topic')"},
    ],
    indexes={
        "act": {
            "path": "intfloat/e5-small-v2",
            "columns": {
                "text": "act"
            },
            "graph": {
                "topics": {}
            }
        },
        "prompt":{
            "path": "sentence-transformers/all-MiniLM-L6-v2",
            "columns": {
                "text": "prompt"
            }
        }
    }
)
embeddings.index(ds)

# Run search
embeddings.search("select id, act, prompt, score, topic from txtai where similar('Linux terminal')", 1)
[{'id': '0',
  'act': 'Linux Terminal',
  'prompt': 'I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd',
  'score': 1.0,
  'topic': 'linux_terminal'}]

Almost the same, except the topic is different. That is due to the grouping of the vector index. Note that both the act and prompt columns are vector indexes, but they specify different vector models. This opens up another possibility: weighting not only sparse vs. dense results, but also different vector models.

embeddings.search("select id, act, prompt, score from txtai where similar('Linux terminal', 'act') and similar('Linux terminal', 'prompt')", 1)
[{'id': '0',
  'act': 'Linux Terminal',
  'prompt': 'I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd',
  'score': 0.7881423830986023}]

As always, everything discussed so far is also supported with txtai application instances.

# Build index with graph subindex
app = txtai.Application("""
writable: True
embeddings:
  content: True
  defaults: False
  functions:
    - name: graph
      function: indexes.act.graph.attribute
  expressions:
    - name: topic
      expression: graph(indexid, 'topic')
  indexes:
    act:
      path: intfloat/e5-small-v2
      columns:
        text: act
      graph:
        topics:
    prompt:
      path: sentence-transformers/all-MiniLM-L6-v2
      columns:
        text: prompt
""")

app.add(ds)
app.index()

app.search("select id, act, prompt, topic, score from txtai where similar('Linux terminal', 'act') and similar('Linux terminal', 'prompt')", 1)
[{'id': '0',
  'act': 'Linux Terminal',
  'prompt': 'I want you to act as a linux terminal. I will type commands and you will reply with what the terminal should show. I want you to only reply with the terminal output inside one unique code block, and nothing else. do not write explanations. do not type commands unless I instruct you to do so. when i need to tell you something in english, i will do so by putting text inside curly brackets {like this}. my first command is pwd',
  'topic': 'linux_terminal',
  'score': 0.7881423830986023}]

Streamlined methods

Much of this has already been covered, but a number of changes were added to make it easier to search and index data. Existing interfaces are all still supported; this is all about ease of use.

See the commented code below.

# Top-level import includes Application and Embeddings
import txtai

app = txtai.Application("""writable: False""")
embeddings = txtai.Embeddings()
# Ids are automatically generated when omitted
embeddings.index(["test"])
print(embeddings.search("test"))

# UUID ids are also supported - use any of the methods in https://docs.python.org/3/library/uuid.html
embeddings = txtai.Embeddings(autoid="uuid5")
embeddings.index(["test"])
embeddings.search("test")
[(0, 0.9999998807907104)]
[('4be0643f-1d98-573b-97cd-ca98a65347dd', 0.9999998807907104)]
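
Since any of the uuid generation methods referenced above is accepted, switching to random ids is a one-line change. A quick sketch:

# Sketch: random UUIDs instead of deterministic uuid5 ids
embeddings = txtai.Embeddings(autoid="uuid4")
embeddings.index(["test"])
print(embeddings.search("test"))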

Large Language Models (LLMs)

While most of the changes in this release are in the embeddings package, LLMs also have an important change that makes them easier to use.

import torch

from txtai.pipeline import LLM

# Create model and set dtype to use 16-bit floats
llm = LLM("tiiuae/falcon-rw-1b", torch_dtype=torch.bfloat16)

print(llm("Write a short list of things to do in Paris", maxlength=55))
- Visit the Eiffel Tower.
- Visit the Louvre.
- Visit the Arc de Triomphe.
- Visit the Notre Dame Cathedral.
- Visit the Sacre Coeur Basilica.

The new LLM pipeline automatically detects the type of model and loads it using the best available method.
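
Since the model type is detected automatically, the same call should work for sequence-to-sequence models as well. A quick sketch assuming google/flan-t5-small:

# Sketch: LLM pipeline with a sequence-to-sequence model, type detected automatically
seq2seq = LLM("google/flan-t5-small")
print(seq2seq("Translate to French: Hello, how are you?", maxlength=32))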

The pipeline framework now passes keyword arguments through to the underlying methods, which automatically adds support for new Hugging Face features as they are released upstream.
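
For example, standard Hugging Face generation arguments such as temperature and do_sample should pass straight through. A hedged sketch using the falcon model loaded above:

# Sketch: extra keyword arguments are forwarded to the underlying Hugging Face calls
print(llm("Write a short list of things to do in Paris", maxlength=55, temperature=0.7, do_sample=True))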

Wrapping up

This notebook gave a quick overview of txtai 6.0. Updated documentation and more examples are coming soon. There is a lot to cover and a lot to build on!

See the links below for more information.