快速提示:将$ EuclideAndistance与Singlestore Kai用于MongoDB
#singlestoredb #mongodb #euclideandistance #pymongo

抽象的

SingleStore Kai for MongoDB的公告为涡轮增压JSON Analytics提供了令人兴奋的机会。 Preview Release -$euclideanDistance$dotProduct中提供了两个向量函数。在这篇简短的文章中,我们将使用来自previous article的一些示例数据来评估$euclideanDistance

本文中使用的笔记本文件可在GitHub中找到。

介绍

singlestoredB支持一系列vector functions。在a previous article中,我们使用了EUCLIDEAN_DISTANCEJSON_ARRAY_PACK函数。在another previous article中,我们使用了DOT_PRODUCTUNHEX函数。在这篇简短的文章中,我们将使用Singlestore Kai的$euclideanDistance进行MongoDB。

创建一个SinglestoredB云帐户

previous article展示了创建一个免费的SinglestoredB云帐户的步骤。我们将使用以下设置:

  • Workspace组名称: Iris演示组
  • 云提供商: aws
  • 地区:美国东1(N。Virginia)
  • 工作空间名称: iris-demo
  • 大小: S-00
  • 高级设置:
    • SINGLESTORE KAI选择
    • Martech申请取消选择

从左导航窗格中,我们将选择开发> SQL Editor 创建一个新数据库,如下:

CREATE DATABASE IF NOT EXISTS iris_db;

新笔记本

previous article展示了创建新笔记本的步骤。

我们将调用笔记本 kai_demo ,然后从可用选项中选择空白笔记本模板。

填写笔记本

创建表

我们将使用GitHub Gist的SQL代码作为我们的表格,如下:

%%sql

USE iris_db;
DROP TABLE IF EXISTS iris;
CREATE TABLE IF NOT EXISTS iris (
     vector BLOB,
     species VARCHAR(20)
);

加载数据

现在我们将数据加载到表中,如下:

%%sql

USE iris_db;
INSERT INTO iris VALUES
(JSON_ARRAY_PACK('[5.1,3.5,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.9,3,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.7,3.2,1.3,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.6,3.1,1.5,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5,3.6,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.4,3.9,1.7,0.4]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.6,3.4,1.4,0.3]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5,3.4,1.5,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.4,2.9,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.9,3.1,1.5,0.1]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.4,3.7,1.5,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.8,3.4,1.6,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.8,3,1.4,0.1]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.3,3,1.1,0.1]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.8,4,1.2,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.7,4.4,1.5,0.4]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.4,3.9,1.3,0.4]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.1,3.5,1.4,0.3]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.7,3.8,1.7,0.3]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.1,3.8,1.5,0.3]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.4,3.4,1.7,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.1,3.7,1.5,0.4]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.6,3.6,1,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.1,3.3,1.7,0.5]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.8,3.4,1.9,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5,3,1.6,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5,3.4,1.6,0.4]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.2,3.5,1.5,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.2,3.4,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.7,3.2,1.6,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.8,3.1,1.6,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.4,3.4,1.5,0.4]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.2,4.1,1.5,0.1]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.5,4.2,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.9,3.1,1.5,0.1]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5,3.2,1.2,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.5,3.5,1.3,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.9,3.1,1.5,0.1]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.4,3,1.3,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.1,3.4,1.5,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5,3.5,1.3,0.3]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.5,2.3,1.3,0.3]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.4,3.2,1.3,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5,3.5,1.6,0.6]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.1,3.8,1.9,0.4]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.8,3,1.4,0.3]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.1,3.8,1.6,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.6,3.2,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5.3,3.7,1.5,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5,3.3,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[7,3.2,4.7,1.4]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.4,3.2,4.5,1.5]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.9,3.1,4.9,1.5]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.5,2.3,4,1.3]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.5,2.8,4.6,1.5]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.7,2.8,4.5,1.3]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.3,3.3,4.7,1.6]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[4.9,2.4,3.3,1]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.6,2.9,4.6,1.3]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.2,2.7,3.9,1.4]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5,2,3.5,1]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.9,3,4.2,1.5]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6,2.2,4,1]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.1,2.9,4.7,1.4]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.6,2.9,3.6,1.3]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.7,3.1,4.4,1.4]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.6,3,4.5,1.5]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.8,2.7,4.1,1]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.2,2.2,4.5,1.5]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.6,2.5,3.9,1.1]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.9,3.2,4.8,1.8]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.1,2.8,4,1.3]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.3,2.5,4.9,1.5]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.1,2.8,4.7,1.2]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.4,2.9,4.3,1.3]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.6,3,4.4,1.4]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.8,2.8,4.8,1.4]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.7,3,5,1.7]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6,2.9,4.5,1.5]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.7,2.6,3.5,1]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.5,2.4,3.8,1.1]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.5,2.4,3.7,1]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.8,2.7,3.9,1.2]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6,2.7,5.1,1.6]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.4,3,4.5,1.5]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6,3.4,4.5,1.6]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.7,3.1,4.7,1.5]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.3,2.3,4.4,1.3]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.6,3,4.1,1.3]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.5,2.5,4,1.3]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.5,2.6,4.4,1.2]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.1,3,4.6,1.4]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.8,2.6,4,1.2]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5,2.3,3.3,1]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.6,2.7,4.2,1.3]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.7,3,4.2,1.2]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.7,2.9,4.2,1.3]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.2,2.9,4.3,1.3]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.1,2.5,3,1.1]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[5.7,2.8,4.1,1.3]'),'Iris-versicolor'),
(JSON_ARRAY_PACK('[6.3,3.3,6,2.5]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[5.8,2.7,5.1,1.9]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[7.1,3,5.9,2.1]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.3,2.9,5.6,1.8]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.5,3,5.8,2.2]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[7.6,3,6.6,2.1]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[4.9,2.5,4.5,1.7]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[7.3,2.9,6.3,1.8]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.7,2.5,5.8,1.8]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[7.2,3.6,6.1,2.5]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.5,3.2,5.1,2]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.4,2.7,5.3,1.9]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.8,3,5.5,2.1]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[5.7,2.5,5,2]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[5.8,2.8,5.1,2.4]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.4,3.2,5.3,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.5,3,5.5,1.8]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[7.7,3.8,6.7,2.2]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[7.7,2.6,6.9,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6,2.2,5,1.5]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.9,3.2,5.7,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[5.6,2.8,4.9,2]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[7.7,2.8,6.7,2]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.3,2.7,4.9,1.8]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.7,3.3,5.7,2.1]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[7.2,3.2,6,1.8]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.2,2.8,4.8,1.8]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.1,3,4.9,1.8]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.4,2.8,5.6,2.1]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[7.2,3,5.8,1.6]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[7.4,2.8,6.1,1.9]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[7.9,3.8,6.4,2]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.4,2.8,5.6,2.2]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.3,2.8,5.1,1.5]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.1,2.6,5.6,1.4]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[7.7,3,6.1,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.3,3.4,5.6,2.4]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.4,3.1,5.5,1.8]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6,3,4.8,1.8]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.9,3.1,5.4,2.1]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.7,3.1,5.6,2.4]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.9,3.1,5.1,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[5.8,2.7,5.1,1.9]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.8,3.2,5.9,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.7,3.3,5.7,2.5]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.7,3,5.2,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.3,2.5,5,1.9]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.5,3,5.2,2]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.2,3.4,5.4,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[5.9,3,5.1,1.8]'),'Iris-virginica');

安装库

我们将安装一个将在以后使用的库:

!pip install tabulate --quiet

导入库

接下来,我们将导入一些库,如下:

import pymongo
import struct

from pymongo import MongoClient
from tabulate import tabulate

连接到Singlestore Kai

我们现在将连接到我们的系统,如下所示:

client = MongoClient("mongodb://admin:<password>@<host>:27017/?authMechanism=PLAIN&tls=true&loadBalanced=true")

我们将用我们的singlestoredB云帐户中的值替换<password><host>

我们将切换到虹膜数据库并列出集合,如下:

db = client["iris_db"]

for coll in db.list_collection_names():
    print(coll)

输出应如下:

Iris 

现在让我们从列表中获取一个向量并将其转换为字节:

vector = [5.1, 3.5, 1.4, 0.2]

vector_bytes = struct.pack('f' * len(vector), *vector)

print(vector_bytes)

结果应如下:

b'\xcd\xcc\xbc@\x00\x00@@33\xa3@ff\xe6?'

示例查询

查询1

这是我们在上一篇文章中使用的第一个SQL查询:

SELECT species
FROM iris
WHERE EUCLIDEAN_DISTANCE(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) = 0;

结果是:

+----------------+
| species        |
+----------------+
| Iris-virginica |
+----------------+

这是一种使用Singlestore Kai的解决方案:

query = {
    "$expr": {
        "$eq": [
            { "$euclideanDistance": ["$vector", vector_bytes] },
            0
        ]
    }
}

projection = { "species": 1 }

document = db.iris.find_one(query, projection)

species = document["species"]

print(species)

由于我们使用的向量存储在数据库中,我们正在寻找一个匹配,因此结果应为:

Iris-virginica

查询2

这是我们在上一篇文章中使用的第二个SQL查询,寻找附近的其他花朵:

SELECT EUCLIDEAN_DISTANCE(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) AS euclidean_distance, species
FROM iris
ORDER BY euclidean_distance
LIMIT 5;

结果是:

+---------------------+----------------+
| euclidean_distance  | species        |
+---------------------+----------------+
|                   0 | Iris-virginica |
| 0.28284244589567653 | Iris-virginica |
| 0.31622746208231284 | Iris-virginica |
|  0.3316624219760969 | Iris-virginica |
|  0.3316624219760969 | Iris-virginica |
+---------------------+----------------+

这是一种使用Singlestore Kai的解决方案:

pipeline = [{
    "$project": {
        "euclidean_distance": {
            "$euclideanDistance": [ "$vector", vector_bytes ] },
        "species": "$species" } }, {
    "$sort": {
        "euclidean_distance": 1 } }, {
    "$limit": 5 }
]

cursor = db.iris.aggregate(pipeline)

table = []

for document in cursor:
    species = document["species"]
    euclidean_distance = document["euclidean_distance"]
    table.append([euclidean_distance, species])

print(tabulate(table, headers = ["euclidean_distance", "species"]))

结果应该是:

  euclidean_distance  species
--------------------  --------------
            0         Iris-virginica
            0.282842  Iris-virginica
            0.316227  Iris-virginica
            0.331662  Iris-virginica
            0.331662  Iris-virginica

查询3

这是我们在上一篇文章中使用的第三个SQL查询,使用一些虚拟的数据值来做出预测:

SELECT EUCLIDEAN_DISTANCE(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]')) AS euclidean_distance, species
FROM iris
ORDER BY euclidean_distance
LIMIT 5;

结果是:

+---------------------+-------------+
| euclidean_distance  | species     |
+---------------------+-------------+
| 0.14142129538778386 | Iris-setosa |
|  0.1732049874122573 | Iris-setosa |
| 0.17320510570613526 | Iris-setosa |
| 0.17320538530952567 | Iris-setosa |
| 0.19999992325900512 | Iris-setosa |
+---------------------+-------------+

这是一种使用Singlestore Kai的解决方案:

vector = [5.2, 3.6, 1.5, 0.3]

vector_bytes = struct.pack('f' * len(vector), *vector)

pipeline = [{
    "$project": {
        "euclidean_distance": {
            "$euclideanDistance": [ "$vector", vector_bytes ] },
        "species": "$species" } }, {
    "$sort": {
        "euclidean_distance": 1 } }, {
    "$limit": 5 }
]

cursor = db.iris.aggregate(pipeline)

table = []

for document in cursor:
    species = document["species"]
    euclidean_distance = document["euclidean_distance"]
    table.append([euclidean_distance, species])

print(tabulate(table, headers = ["euclidean_distance", "species"]))

结果应该是:

  euclidean_distance  species
--------------------  -----------
            0.141421  Iris-setosa
            0.173205  Iris-setosa
            0.173205  Iris-setosa
            0.173205  Iris-setosa
            0.2       Iris-setosa

查询4

最后,这是我们在上一篇文章中使用的第四个SQL查询:

SELECT species
FROM iris
ORDER BY EUCLIDEAN_DISTANCE(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]'))
LIMIT 1;

输出为:

+-------------+
| species     |
+-------------+
| Iris-setosa |
+-------------+

这是一种使用Singlestore Kai的解决方案:

pipeline = [{
    "$project": {
        "euclidean_distance": {
            "$euclideanDistance": [ "$vector", vector_bytes ] },
        "species": "$species" } }, {
    "$sort": {
        "euclidean_distance": 1 } }, {
    "$limit": 1 }
]

cursor = db.iris.aggregate(pipeline)

table = []

for document in cursor:
    species = document["species"]
    table.append([species])

print(tabulate(table, headers = ["species"]))

结果应该是:

species
-----------
Iris-setosa

将SQL结果与SINGLESTORE KAI结果进行比较,我们可以看到Singlestore Kai中的$euclideanDistance函数正在按预期工作。

概括

在这篇简短的文章中,我们使用$euclideanDistance函数测试了SQL针对SINGLESTORE KAI的测试。在将来的文章中,我们将尝试此新产品产品的其他功能。敬请期待。