How to Use the Azure OpenAI Embedding Model to Find the Most Relevant Documents
#openai #java #embedding #ada

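1. Introduction

At first, I had never worked with machine-learning concepts such as "embedding", so I was very confused when I came across the term. I did not understand what the model could do. So I decided to research it and actually use the model to understand how useful it is. In this post, I would like to share what I found.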

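2. A highly accurate and affordable model has arrived!

OpenAI offers a model for text embeddings called text-embedding-ada-002. With this model, you can find the most relevant documents at a much lower cost.

Other embedding models are available, such as the Davinci model and a few others. However, text-embedding-ada-002 delivers better accuracy on most tasks and is priced far lower than the others. It is therefore strongly recommended for tasks such as similarity search.

According to Azure OpenAI Service Pricing, the prices differed as follows as of June 8, 2023: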

Embedding model    Per 1,000 tokens
Ada                $0.0004
Babbage            $0.005
Curie              $0.02
Davinci            $0.20

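3. What is an embedding?

Before using embeddings, you need to understand the basic concept; otherwise you will not know how to use them effectively. In short, the idea of embeddings builds on the concept of vectors that we learned in high school.

In high school, we learned that two arrows are considered the same vector if they point in the same direction and have the same length. Applying this idea, we can look for the vectors whose length and direction are as close as possible, which means they are the most similar. In the figure below, the blue lines represent the same vector. If we can find the vector closest to this blue line (labeled "1"), it indicates the content most similar to it.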

Vector Image

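In practice, when you pass a natural-language string to the text-embedding-ada-002 model, you get back a high-dimensional array (vector) of 1536 floating-point numbers, as shown below.

Example: excerpt of the result for the string "Please tell me about Azure Blob"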

[-0.013197514, -0.025243968, 0.011384923, -0.015929632, -0.006410221, 0.031038966, 
-0.016921926, -0.010776317, -0.0019300125, -0.016300088, 0.01767607, -0.0047100903,
0.009691408, -0.014183193, -0.017001309, -0.014434575, 0.01902559, 0.010961545, 
0.013561356, -0.017371766, -0.007964816, 0.0026841562, 0.0019663966, -0.0019878964, 
-0.025614424, -0.0030298054, 0.020229574, -0.01455365, 0.022703694, -0.02033542, 
0.035696134, -0.002441044, -0.008057429, 0.0061191483, 0.004263558, -0.0025518502,
0.018046526, 0.011411385, 0.0063804523, -0.0021020102, 0.027572552, -0.017967142,
0.0077663567, 0.005361697, -0.0116693815, 0.004524862, -0.043581568, -0.01028017, 

.... Omitted

-0.0017265921, 0.083035186, -0.006205147, -0.008646191, 0.0070651355, -0.019052051, 
0.008374964, 0.024225213, 0.01522841, 0.019951731, -0.006516066, 0.017967142, 
0.0058082296, -0.0053253127, -0.009929558, -0.039109625, -0.031277116, -0.015863478, 
0.011040928, 0.012529369, 0.013012286, 0.022981536, -0.013706892, 0.012965979, 
0.011953839, -0.01903882, 0.015347485, 0.019052051, -0.0046538603, 0.012191989, 
-0.020983716, 0.0078722015, -0.0018605519, -0.02775778, -0.026739024, -0.010359553, 
-0.013918581, -0.011933993, 0.0066814483, 0.005196315, -0.0045744767, -2.7598185E-4, 
0.012251527, -0.018178832, -0.013276898, 0.011709073, -0.022928614, 0.002131779, 
-0.007462053, 0.0044554016]

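Using embeddings means comparing the vector of the user's input string with the vectors stored in advance in the database and searching for the closest one.

There are several ways to find the closest match, but one of the most common is to compute the cosine similarity. When you compute the cosine similarity of two vectors, you get a result between -1 and 1. This calculation is performed against each vector stored in the database.

In the end, the content whose result is closest to 1 is considered the most similar.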

Example of a user's input text:
AAA BBB CCC DDD EEE FFF GGG HHH III JJJ

Calculation results of cosine similarity with stored strings in the database:
AAA BBB CCC DDD EEE FFF GGG HHH III JJ  =       0.9942949478994986  <---- Closest to 1 (highest similarity)
AAA BBB KKK LLL MMM NNN OOO PPP QQQ RRR =       0.930036739659776   <---- Next closest to 1
Today, we are presenting at an IT event.=           0.7775105340227892  <---- Furthest from 1

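In this way, the user's query string is vectorized, and a similarity search is performed by comparing it with the vectors (arrays) saved in advance in the database.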
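To make the comparison concrete, here is a minimal sketch of a cosine-similarity search in plain Java, without a database. The class and method names (CosineSimilaritySketch, cosineSimilarity, mostSimilarIndex) and the tiny three-dimensional vectors are illustrative assumptions; real vectors from text-embedding-ada-002 have 1536 elements.

```java
final class CosineSimilaritySketch {

    // cos(a, b) = (a . b) / (|a| * |b|); the result lies between -1 and 1
    static double cosineSimilarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Cosine similarity is undefined for a zero vector");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the index of the stored vector whose similarity to the query is closest to 1
    static int mostSimilarIndex(double[] query, double[][] stored) {
        int bestIndex = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < stored.length; i++) {
            double score = cosineSimilarity(query, stored[i]);
            if (score > bestScore) {
                bestScore = score;
                bestIndex = i;
            }
        }
        return bestIndex;
    }

    public static void main(String[] args) {
        // Hypothetical sample data standing in for embeddings
        double[] query = {1.0, 2.0, 3.0};
        double[][] stored = {
            {1.0, 2.0, 2.9},
            {3.0, -1.0, 0.5},
            {-1.0, -2.0, -3.0}
        };
        int best = mostSimilarIndex(query, stored);
        System.out.println("Most similar index: " + best);
        System.out.println("Similarity: " + cosineSimilarity(query, stored[best]));
    }
}
```

A vector database does the same ranking for you, typically with an index so that it does not have to scan every row.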

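4. Points to consider when handling strings

Token limit

The maximum input that text-embedding-ada-002 can process is 8192 tokens (a token count is only roughly comparable to a character count). Text longer than roughly 8,000 characters therefore needs to be split.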
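As a rough illustration of the splitting step, the sketch below cuts a long string into pieces of at most a given number of characters, preferring to cut at a space. It counts characters, not tokens, so it only approximates the real limit; the class name TextChunkerSketch, the method splitIntoChunks, and the 8,000-character budget are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class TextChunkerSketch {

    // Splits text into pieces of at most maxChars characters,
    // cutting at a space when one is available inside the window.
    static List<String> splitIntoChunks(String text, int maxChars) {
        if (maxChars <= 0) {
            throw new IllegalArgumentException("maxChars must be positive");
        }
        List<String> chunks = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(start + maxChars, text.length());
            if (end < text.length()) {
                // Look for the last space at or before the cut position
                int lastSpace = text.lastIndexOf(' ', end);
                if (lastSpace > start) {
                    end = lastSpace;
                }
            }
            String chunk = text.substring(start, end).trim();
            if (!chunk.isEmpty()) {
                chunks.add(chunk);
            }
            start = end;
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 25,000 characters of sample text
        String longText = "word ".repeat(5000);
        for (String chunk : splitIntoChunks(longText, 8000)) {
            System.out.println(chunk.length() + " characters");
        }
    }
}
```

Text without spaces (Japanese, for example) falls back to a hard cut at the character budget. Each chunk would then be embedded and stored as its own row.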

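Preparing strings for processing

As mentioned in the "Replace newlines with a single space" section, it has been confirmed that the presence of newline characters may cause unexpected results.

Therefore, it is recommended to replace newline characters (\n) with spaces before sending the text to text-embedding-ada-002.

For example, if you have text like the following, replace all newline characters and convert it to a single-line string (see the sample code provided).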

Visual Studio Code for Java is an open-source code editor provided by Microsoft.

It is an extension on Visual Studio Code that supports the Java programming language.

It provides an environment for Java developers to efficiently write, build, debug, test, and execute code. Visual Studio Code is a lightweight text editor that supports multiple languages and is known for its high extensibility. For Java developers, Visual Studio Code for Java can be used as an excellent development environment.

Visual Studio Code for Java can be used by installing the Java Development Kit (JDK), a basic toolset for compiling and running Java programs.

To start a Java project in Visual Studio Code, first install the JDK, and then install the Java Extension Pack from the Visual Studio Code extension marketplace.

The Java Extension Pack includes features such as Java language support, debugging, testing, and project management.

The main features of Visual Studio Code for Java are as follows:

1. Syntax highlighting: Color-coding for Java code to improve readability.
2. Code completion: Suggests possible code during input, allowing for efficient code writing.
3. Code navigation: Easy jumping to classes, methods, and variables, and searching for definitions.
4. Refactoring: Features to change code structure and names, improving code quality.
5. Debugging: Setting breakpoints, step execution, and variable monitoring.
6. Testing: Supports testing frameworks like JUnit and TestNG, creating, running, and displaying test results.
7. Project management: Supports build tools like Maven and Gradle, managing project configurations and dependencies.
8. Git integration: Integrated with Git for source code version control.

Visual Studio Code for Java offers a wealth of features to improve developer productivity and provides an environment suitable for Java project development. In addition, the Visual Studio Code extension marketplace offers various Java-related extensions that can be added as needed. With these extensions, Java developers can use Visual Studio Code as an integrated development environment.
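A minimal sketch of that conversion is shown here. The class name NewlineNormalizerSketch and the method toSingleLine are hypothetical; the regular expression \R is Java's built-in matcher for any line break.

```java
final class NewlineNormalizerSketch {

    // Replaces every run of line breaks (\n, \r\n, \r, ...) with a single space
    static String toSingleLine(String text) {
        return text.replaceAll("\\R+", " ").trim();
    }

    public static void main(String[] args) {
        String multiLine = "Visual Studio Code for Java is an open-source code editor provided by Microsoft.\n\n"
                + "It is an extension on Visual Studio Code that supports the Java programming language.";
        System.out.println(toSingleLine(multiLine));
    }
}
```

The result can be passed to the embedding call as one line.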

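4. Verification

4.1 Vector database options available in Azure

As of June 8, 2023, there are several options for storing vectors in Azure. Choose the database that fits your needs.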

  1. How to enable and use pgvector on Azure Database for PostgreSQL - Flexible Server
  2. Using vector search on embeddings in Azure Cosmos DB for MongoDB vCore
  3. Azure Cognitive Search (Private Preview)
  4. Azure Cache for Redis Enterprise

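4.2 Vector search using Azure PostgreSQL Flexible Server

As mentioned above, there are several options for a vector database, and you can choose the one that suits your needs. For this verification, however, I decided to use PostgreSQL Flexible Server. The steps to set up PostgreSQL Flexible Server to handle vectors are as follows. Give it a try if you are interested. If you choose a different persistence destination, skip this section.

4.2.1 Setting environment variables

To create resources on Azure, modify and set the following environment variables accordingly.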

export RESOURCE_GROUP=PostgreSQL
export LOCATION=eastus
export POSTGRES_SERVER_NAME=yoshiopgsql3
export POSTGRES_USER_NAME=yoterada
export POSTGRES_USER_PASS='!'$(head -c 12 /dev/urandom | base64 | tr -dc '[:alpha:]'| fold -w 8 | head -n 1)$RANDOM
echo "GENERATED PASSWORD: " $POSTGRES_USER_PASS
export POSTGRES_DB_NAME=VECTOR_DB
export SUBSCRIPTION_ID=********-****-****-****-************
export PUBLIC_IP=$(curl ifconfig.io -4)

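In the configuration example above, the password is generated automatically and printed to standard output. Either set your own password or make a note of the generated one.

4.2.2 Installing Azure PostgreSQL Flexible Server

Run the following three commands. They perform the following tasks:

  1. Install Azure PostgreSQL Flexible Server
  2. Configure the firewall
  3. Create a new database
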
az postgres flexible-server create --name $POSTGRES_SERVER_NAME \
    -g $RESOURCE_GROUP \
    --location $LOCATION \
    --admin-user $POSTGRES_USER_NAME \
    --admin-password $POSTGRES_USER_PASS \
    --public-access $PUBLIC_IP \
    --yes
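# Caution: despite the rule name, 0.0.0.0-255.255.255.255 admits every IPv4 address; narrow the range outside of a quick test.
az postgres flexible-server firewall-rule create \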
    -g $RESOURCE_GROUP \
    -n $POSTGRES_SERVER_NAME \
    -r AllowAllAzureIPs \
    --start-ip-address 0.0.0.0 \
    --end-ip-address 255.255.255.255
az postgres flexible-server db create \
    -g $RESOURCE_GROUP \
    -s $POSTGRES_SERVER_NAME \
    -d $POSTGRES_DB_NAME

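4.2.3 Configuring Azure PostgreSQL Flexible Server for multilingual support

Because the data to be persisted in this case includes Japanese strings, apply the following settings to enable Japanese UTF-8 in the database.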

az postgres flexible-server parameter set \
    -g $RESOURCE_GROUP \
    --server-name $POSTGRES_SERVER_NAME \
    --subscription $SUBSCRIPTION_ID \
    --name lc_monetary --value "ja_JP.utf-8"
az postgres flexible-server parameter set \
    -g $RESOURCE_GROUP \
    --server-name $POSTGRES_SERVER_NAME \
    --subscription $SUBSCRIPTION_ID \
    --name lc_numeric --value "ja_JP.utf-8"
az postgres flexible-server parameter set \
    -g $RESOURCE_GROUP \
    --server-name $POSTGRES_SERVER_NAME \
    --subscription $SUBSCRIPTION_ID \
    --name timezone --value "Asia/Tokyo"

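4.2.4 Installing extensions on Azure PostgreSQL Flexible Server

To handle UUIDs and vectors in PostgreSQL, we will use extensions. Run the following command.

Note:

Do not put any space between the names in "VECTOR,UUID-OSSP".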

az postgres flexible-server parameter set \
    -g $RESOURCE_GROUP \
    --server-name $POSTGRES_SERVER_NAME \
    --subscription $SUBSCRIPTION_ID \
    --name azure.extensions --value "VECTOR,UUID-OSSP"

4.3 Creating a table to handle vectors in PostgreSQL

Now that the PostgreSQL configuration is complete, run the following command to connect.

> psql -U $POSTGRES_USER_NAME -d $POSTGRES_DB_NAME \
      -h $POSTGRES_SERVER_NAME.postgres.database.azure.com 

Once you are connected, enable the extensions that were added to PostgreSQL earlier. Run the CREATE EXTENSION command for each one, as shown below.

SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, bits: 256, compression: off)
Type "help" for help.

VECTOR_DB=>
VECTOR_DB=> CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE EXTENSION
VECTOR_DB=> CREATE EXTENSION IF NOT EXISTS "vector";
CREATE EXTENSION

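Finally, create a table to store the vector data. The vector is stored in the embedding VECTOR(1536) column. For simplicity, the original text is saved alongside it so that the original string of the most similar entry can be displayed. In practice, you might store a URL as a link destination, or join another table later if needed.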

VECTOR_DB=> CREATE TABLE TBL_VECTOR_TEST(
    id uuid,
    embedding VECTOR(1536),
    origntext varchar(8192),
    PRIMARY KEY (id)
    );
CREATE TABLE

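4.4 Creating the Java application

4.4.1 Adding dependencies to the Maven project

To use the Azure OpenAI library, connect to PostgreSQL, and persist data, you need to add at least the following dependencies. Add them to your pom.xml.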

        <dependency>
            <groupId>com.azure</groupId>
            <artifactId>azure-ai-openai</artifactId>
            <version>1.0.0-beta.1</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.postgresql/postgresql -->
        <dependency>
            <groupId>org.postgresql</groupId>
            <artifactId>postgresql</artifactId>
            <version>42.6.0</version>
        </dependency>

4.4.2 Creating and setting the properties file

Create an application.properties file in the src/main/resources directory.

azure.openai.url=https://YOUR_OWN_AZURE_OPENAI.openai.azure.com
azure.openai.model.name=gpt-4
azure.openai.api.key=************************************

azure.postgresql.jdbcurl=jdbc:postgresql://YOUR_POSTGRESQL.postgres.database.azure.com:5432/VECTOR_DB
azure.postgresql.user=yoterada
azure.postgresql.password=************************************

logging.group.mycustomgroup=com.yoshio3
logging.level.mycustomgroup=DEBUG
logging.level.root=INFO

4.4.3 Implementing the Java program

Finally, implement the Java code.

package com.yoshio3;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.UUID;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.azure.ai.openai.OpenAIClient;
import com.azure.ai.openai.OpenAIClientBuilder;
import com.azure.ai.openai.models.EmbeddingsOptions;
import com.azure.core.credential.AzureKeyCredential;

public class VectorTest {
    private final static Logger LOGGER = LoggerFactory.getLogger(VectorTest.class);

    // Azure OpenAI API KEY
    private String OPENAI_API_KEY = "";
    // Azure OpenAI Instance URL
    private String OPENAI_URL = "";

    private String POSTGRESQL_JDBC_URL = "";
    private String POSTGRESQL_USER = "";
    private String POSTGRESQL_PASSWORD = "";

    private final static List<String> INPUT_DATA = Arrays.asList(
            "Visual Studio Code for Java is an extension for the open-source code editor Visual Studio Code provided by Microsoft, which supports the Java programming language. It offers an environment for Java developers to efficiently write, build, debug, test, and execute code. Visual Studio Code is a lightweight text editor with multi-language support and high extensibility. For Java developers, Visual Studio Code for Java can be an excellent development environment. Visual Studio Code for Java can be used by installing the Java Development Kit (JDK), which is a basic toolset for compiling and running Java programs. To start a Java project in Visual Studio Code, install the JDK and then install the Java Extension Pack from the Visual Studio Code extension marketplace. The Java Extension Pack includes features such as Java language support, debugging, testing, and project management. The main features of Visual Studio Code for Java are as follows: 1. Syntax highlighting: Color-coding Java code to improve readability. 2. Code completion: Suggesting possible code while entering it, enabling efficient code writing. 3. Code navigation: Easy jumping to classes, methods, variables, and finding definitions. 4. Refactoring: Changing the structure and names of code to improve its quality. 5. Debugging: Setting breakpoints, stepping through code, and monitoring variables. 6. Testing: Supporting testing frameworks such as JUnit and TestNG, allowing test creation, execution, and result display. 7. Project management: Supporting build tools such as Maven and Gradle, enabling project configuration and dependency management. 8. Git integration: Integrating with Git for source code version control. Visual Studio Code for Java offers a wealth of productivity-enhancing features and provides an environment suitable for Java project development. Additionally, the Visual Studio Code extension marketplace has various Java-related extensions, which can be added as needed. With these extensions, Java developers can use Visual Studio Code as an integrated development environment.",
            "Azure App Service for Java is a fully managed platform on Microsoft's cloud platform Azure, designed for hosting, deploying, and managing Java applications. Azure App Service supports the development and execution of web applications, mobile apps, APIs, and other backend applications, allowing Java developers to quickly deploy and scale their applications. By abstracting infrastructure management, developers can focus on their application code. Azure App Service for Java includes the Java Development Kit (JDK) and a web server, supporting Java runtimes such as Tomcat, Jetty, and JBoss EAP. Furthermore, Azure App Service provides features to support the entire lifecycle of Java applications, such as building CI/CD (Continuous Integration/Continuous Delivery) pipelines, setting up custom domains, managing SSL certificates, and monitoring and diagnosing applications.",
            "Azure Container Apps is a fully managed service on Microsoft's cloud platform Azure for deploying, managing, and scaling container-based applications. Azure Container Apps is suitable for applications that implement microservices architecture, web applications, backend services, and job execution. This service abstracts container orchestration platforms like Kubernetes, freeing developers from infrastructure management and scaling, allowing them to focus on application code. Azure Container Apps deploys applications using Docker container images and provides features such as automatic scaling, rolling updates, and auto-recovery. Moreover, Azure Container Apps is platform-independent, supporting any programming language or framework. Developers can easily deploy applications using the Azure portal, Azure CLI, or CI/CD pipelines such as GitHub Actions. In terms of security, Azure Container Apps offers features like network isolation, private endpoints, and Azure Active Directory (AAD) integration, ensuring the safety of your applications. Additionally, it provides features to support application monitoring and diagnostics, enabling the identification of performance and issues using tools like Azure Monitor and Azure Application Insights.",
            "Azure Cosmos DB is a global, distributed multi-model database service from Microsoft, offering low latency, high availability, and high throughput. Cosmos DB is a NoSQL database that supports multiple data models, including key-value, document, column-family, and graph. This fully managed service caters to various applications and can be used for the development of web, mobile, IoT, and gaming solutions. The key features of Azure Cosmos DB include: Global distribution: Automatically replicates data across multiple geographic regions, providing high availability and low latency. Horizontal scaling: Utilizes partition keys to split data across multiple physical partitions, enabling flexible scaling of throughput and storage capacity. Five consistency models: Choose from five consistency models, ranging from strong to eventual consistency, depending on the consistency requirements of your globally distributed applications. Real-time analytics: Integrates with Azure Synapse Analytics and Azure Functions for real-time data processing and analysis. Additionally, Azure Cosmos DB offers multiple APIs, such as SQL API, MongoDB API, Cassandra API, Gremlin API, and Table API, enabling developers to build applications using familiar APIs. In terms of data security, Cosmos DB provides features like encryption, network isolation, and access control to ensure data protection. Furthermore, Azure Monitor and Azure Application Insights can be utilized to identify database performance and issues.",
            "Azure Kubernetes Service (AKS) is a Kubernetes cluster management service provided by Microsoft that simplifies the deployment, scaling, and operation of containerized applications. As a managed Kubernetes service, AKS automates tedious infrastructure management and update tasks, allowing developers to focus on application development. AKS offers enterprise-grade security, monitoring, and operational management features, and it easily integrates with DevOps pipelines. Additionally, it can work with other Azure services to support flexible application development. The main features and benefits of AKS are as follows: Cluster provisioning and scaling: AKS automates cluster infrastructure management, allowing you to add or remove nodes as needed, ensuring efficient resource usage and operations. Security and access control: AKS provides built-in Azure Active Directory (AD) integration and enables secure cluster access management using role-based access control (RBAC). Integration with CI/CD pipelines: AKS integrates with CI/CD tools such as Azure DevOps and Jenkins, automating application build, test, and deployment processes. Monitoring and logging: AKS integrates with monitoring tools like Azure Monitor, Prometheus, and Grafana, allowing you to monitor cluster performance and resource usage. Logs can be centrally managed through Azure Log Analytics. Networking and storage: AKS uses Azure Virtual Networks (VNet) to run clusters within private networks. It also provides persistent data storage using Azure Storage and Azure Disks. Collaboration with other Azure services: AKS can work with other Azure services such as Azure Functions and Azure Cosmos DB, extending the functionality of your applications. These features enable AKS to reduce the burden of infrastructure management and operations for developers, allowing them to focus on application development. The provision of enterprise-grade security, monitoring, and operational management features ensures a reliable user experience.",
            "Azure Cognitive Services is a cloud-based service that integrates AI capabilities provided by Microsoft, allowing you to easily add artificial intelligence features to applications, websites, and bots. Even without deep learning or machine learning expertise, you can use these features through APIs. Azure Cognitive Services is divided into the following five categories: 1. Vision: Analyzes images and videos, providing features such as facial recognition, image recognition, and Optical Character Recognition (OCR). It includes Computer Vision, Custom Vision, Face API, Form Recognizer, and Video Indexer. 2. Speech: Provides speech-related features such as speech recognition, speech synthesis, and speech translation. It includes Speech Services, Speech Translation, and Speaker Recognition. 3. Language: Offers natural language processing (NLP) capabilities, allowing text analysis, machine translation, document summarization, and keyword extraction. It includes Text Analytics, Language Understanding (LUIS), QnA Maker, and Translation. 4. Decision: Provides features to support decision-making and recommendations, enabling personalized content and actions for individual users. It includes Personalization, Anomaly Detector, and Content Moderator. 5. Web Search: Utilizes Bing's search engine to provide web search, image search, video search, news search, and map search capabilities. It includes Bing Web Search API, Bing Image Search API, Bing Video Search API, Bing News Search API, and Bing Maps API. By combining these AI features, you can improve user experience and enhance the value of applications and services. In addition, Azure Cognitive Services is designed with privacy and security in mind, allowing businesses and developers to use it with confidence.",
            "Azure Container Instances (ACI) is a service provided by Microsoft that allows quick and easy deployment of containers. With ACI, you can run application containers without managing virtual machines or orchestration, reducing the workload on infrastructure management for developers. ACI offers per-second billing, with costs generated based on resource usage. The main features and benefits are as follows: 1. Simple and fast deployment: ACI can deploy containers quickly using Docker container images. Containers running on ACI are also compatible with Docker commands and Kubernetes clusters. 2. No need to manage the operating system: ACI eliminates the need to manage and update the host OS, allowing developers to focus on application development and operations. 3. Seamless scaling: ACI enables flexible scaling of the number of containers, allowing you to increase or decrease resources according to load. 4. Security: ACI uses Azure's security features to protect containers and data, and also offers network isolation. 5. Flexible resource allocation based on requirements: With ACI, you can allocate CPU and memory individually, optimizing resources according to application requirements. 6. Event-driven container execution: ACI can be integrated with services like Azure Functions and Logic Apps to enable event-driven container execution. These features make Azure Container Instances effective for various scenarios, such as short-term workloads, batch processing, and development and testing environments. Additionally, by combining with AKS, you can also achieve container management with orchestration capabilities.",
            "Azure Data Lake Storage (ADLS) is a large-scale data lake solution provided by Microsoft, designed to efficiently store, process, and analyze petabyte-scale data. ADLS is part of Azure Storage and centrally manages unstructured, semi-structured, and structured data, enabling advanced analytics such as big data analysis, machine learning, and real-time analysis. The main features and benefits of ADLS are as follows: 1. Scalability: ADLS provides high scalability that can store petabyte-scale data, flexibly adapting to the growth of data. 2. Performance: ADLS offers optimized performance for large-scale data reads and writes, making it suitable for big data processing and real-time analysis. 3. Security and Compliance: ADLS provides security features such as data encryption, access control, and audit logs, addressing corporate compliance requirements. 4. High Compatibility: ADLS is compatible with the Hadoop Distributed File System (HDFS) and can be integrated with existing Hadoop ecosystems as well as big data analytics tools like Apache Spark and Azure Databricks. 5. Hierarchical Storage: ADLS offers three storage tiers - hot, cool, and archive - providing optimal cost-performance based on data access frequency. 6. Integration of Data Lake and Object Storage: Azure Data Lake Storage Gen2 integrates ADLS and Azure Blob Storage, offering a solution that combines the benefits of large-scale data lakes and object storage. Azure Data Lake Storage is a powerful platform for enterprises to efficiently manage large amounts of data and achieve advanced data processing, such as big data analysis and machine learning. This enables businesses to fully leverage the value of their data, gaining business insights and improving decision-making.",
            "Azure Blob Storage is an object storage service provided by Microsoft, offering a cloud-based solution for storing and managing large amounts of unstructured data. It can securely and scalably store various types of data, such as text, binary data, images, videos, and log files. The main features and benefits are as follows: 1. Scalability: Azure Blob Storage provides high scalability that can store petabyte-scale data, flexibly adapting to the growth of data. 2. Performance: Azure Blob Storage offers high performance for data reads and writes, enabling rapid processing of large amounts of data. 3. Hierarchical Storage: Azure Blob Storage provides three storage tiers - hot, cool, and archive - offering optimal cost-performance based on data access frequency. 4. Security and Compliance: Azure Blob Storage offers security features such as data encryption, access control, and audit logs, addressing corporate compliance requirements. 5. Global Access: Azure Blob Storage leverages Microsoft's Azure data centers, allowing fast and secure data access from anywhere in the world. 6. Integration and Compatibility: Azure Blob Storage can be integrated with Azure Data Lake Storage Gen2, other Azure services, and on-premises systems, enabling centralized data management and analysis. Azure Blob Storage is effectively used in various scenarios, such as web applications, backups, archives, big data analytics, and IoT devices. This enables businesses to fully leverage the value of their data, gaining business insights and improving decision-making.");

    private OpenAIClient client;

    public VectorTest() throws IOException {
        Properties properties = new Properties();
        properties.load(this.getClass().getResourceAsStream("/application.properties"));
        OPENAI_API_KEY = properties.getProperty("azure.openai.api.key");
        OPENAI_URL = properties.getProperty("azure.openai.url");
        POSTGRESQL_JDBC_URL = properties.getProperty("azure.postgresql.jdbcurl");
        POSTGRESQL_USER = properties.getProperty("azure.postgresql.user");
        POSTGRESQL_PASSWORD = properties.getProperty("azure.postgresql.password");

        client = new OpenAIClientBuilder()
                .credential(new AzureKeyCredential(OPENAI_API_KEY))
                .endpoint(OPENAI_URL)
                .buildClient();
    }


    public static void main(String[] args) {
        VectorTest test;
        try {
            test = new VectorTest();
            // Execute the insertion into the database only once.
            // test.insertDataToPostgreSQL();

            // Retrieve similar documents using vector search for the data registered in the DB
            test.findMostSimilarString("Please tell me about Azure Blob Storage");
        } catch (IOException e) {
            LOGGER.error("Error : ", e);
        }
    }

    /**
     * Invoke Azure OpenAI (text-embedding-ada-002)
     */
    private List<Double> invokeTextEmbedding(String originalText) {
        EmbeddingsOptions embeddingsOptions = new EmbeddingsOptions(Arrays.asList(originalText));
        var result = client.getEmbeddings("text-embedding-ada-002", embeddingsOptions);
        var embedding = result.getData().stream().findFirst().get().getEmbedding();
        return embedding;
    }

    private void insertDataToPostgreSQL() {
        try (var connection = DriverManager.getConnection(POSTGRESQL_JDBC_URL, POSTGRESQL_USER, POSTGRESQL_PASSWORD)) {
            var insertSql = "INSERT INTO TBL_VECTOR_TEST (id, embedding, origntext) VALUES (?, ?::vector, ?)";

            for (String originText : INPUT_DATA) {
                // Call the text embedding and obtain the vector array
                List<Double> embedding = invokeTextEmbedding(originText);
                // Sleep for 10 seconds to prevent errors
                // due to sending a large number of requests in a short period of time
                TimeUnit.SECONDS.sleep(10);

                PreparedStatement insertStatement = connection.prepareStatement(insertSql);
                insertStatement.setObject(1, UUID.randomUUID());
                insertStatement.setArray(2, connection.createArrayOf("double", embedding.toArray()));
                insertStatement.setString(3, originText);
                insertStatement.executeUpdate();
            }
        } catch (SQLException | InterruptedException e) {
            LOGGER.error("Connection failure." + e.getMessage());
        }
    }

    public void findMostSimilarString(String data) {
        try (var connection = DriverManager.getConnection(POSTGRESQL_JDBC_URL, POSTGRESQL_USER, POSTGRESQL_PASSWORD)) {
            // Create a vector array by calling the text embedding for the string the user wants to search
            List<Double> embedding = invokeTextEmbedding(data);
            String array = embedding.toString();
            LOGGER.info("Embedding: \n" + array);

            // Search with the vector array (find the closest string to the user's input)
            String querySql = "SELECT origntext FROM TBL_VECTOR_TEST ORDER BY embedding <-> ?::vector LIMIT 1;";
            PreparedStatement queryStatement = connection.prepareStatement(querySql);
            queryStatement.setString(1, array);
            ResultSet resultSet = queryStatement.executeQuery();
            while (resultSet.next()) {
                String origntext = resultSet.getString("origntext");
                LOGGER.info("Origntext: " + origntext);
            }
        } catch (SQLException e) {
            LOGGER.error("Connection failure." + e.getMessage());
        }
    }
}
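
Point 1

A list of strings named private final static List<String> INPUT_DATA is defined. It contains descriptions of the following services, stored as a list of strings. As mentioned earlier, results may be inaccurate if newline characters are included, so all newline characters have been replaced with spaces.

  1. Visual Studio Code for Java
  2. Azure App Service for Java
  3. Azure Container Apps
  4. Azure Cosmos DB
  5. Azure Kubernetes Service
  6. Azure Cognitive Services
  7. Azure Container Instances
  8. Azure Data Lake Storage
  9. Azure Blob Storage

Point 2

The following invokeTextEmbedding method calls the Azure OpenAI embedding model. Given a string, it returns a list of floating-point numbers.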

    private List<Double> invokeTextEmbedding(String originalText) {
        EmbeddingsOptions embeddingsOptions = new EmbeddingsOptions(Arrays.asList(originalText));
        var result = client.getEmbeddings("text-embedding-ada-002", embeddingsOptions);
        var embedding = result.getData().stream().findFirst().get().getEmbedding();
        return embedding;
    }
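
Point 3

The following method takes one element at a time from the prepared list of strings (INPUT_DATA), calls the Azure OpenAI embedding model, receives a multi-dimensional array (vector), and saves it to the database. This step inserts the test data, so run it only once.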

    private void insertDataToPostgreSQL() {
        try (var connection = DriverManager.getConnection(POSTGRESQL_JDBC_URL, POSTGRESQL_USER, POSTGRESQL_PASSWORD)) {
            var insertSql = "INSERT INTO TBL_VECTOR_TEST (id, embedding, origntext) VALUES (?, ?::vector, ?)";

            for (String originText : INPUT_DATA) {
                // Call the text embedding and obtain the vector array
                List<Double> embedding = invokeTextEmbedding(originText);
                // Sleep for 10 seconds to prevent errors
                // due to sending a large number of requests in a short period of time
                TimeUnit.SECONDS.sleep(10);

                PreparedStatement insertStatement = connection.prepareStatement(insertSql);
                insertStatement.setObject(1, UUID.randomUUID());
                insertStatement.setArray(2, connection.createArrayOf("double", embedding.toArray()));
                insertStatement.setString(3, originText);
                insertStatement.executeUpdate();
            }
        } catch (SQLException | InterruptedException e) {
            LOGGER.error("Connection failure." + e.getMessage());
        }
    }
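
Point 4

Finally, the information stored in the database is compared with the information entered by the user to find the most relevant document.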

    public static void main(String[] args) {
        VectorTest test;
        try {
            test = new VectorTest();
            // Retrieve similar documents using vector search for the data registered in the DB
            test.findMostSimilarString("Please tell me about Azure Blob Storage");

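In this example, the string "Please tell me about Azure Blob Storage" is passed in the main() method as shown above. Note that a multi-dimensional array is also created for this string by calling invokeTextEmbedding(data).

The multi-dimensional array is passed to the following query:

SELECT origntext FROM TBL_VECTOR_TEST ORDER BY embedding <-> ?::vector LIMIT 1;

Because LIMIT 1 is specified above, only the most relevant document is output. To return multiple results, change this value.

In addition, the <-> operator can be changed. With pgvector for PostgreSQL, you can compute similarity with one of the following three operators. Change the operator as needed.

Operator    Description
<->         Euclidean distance: the straight-line distance between two vectors in n-dimensional space
<#>         Negative inner product
<=>         Cosine distance: 1 minus the cosine of the angle between two vectors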

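Euclidean distance measures the straight-line distance between two vectors in n-dimensional space, whereas cosine similarity measures the cosine of the angle between them. In pgvector, <=> returns the cosine distance (1 minus the cosine similarity), so for all three operators a smaller value means a closer match. That is why ORDER BY combined with LIMIT 1 returns the most similar row.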
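To show what each operator computes, the sketch below implements the three measures in plain Java. It follows the mathematical definitions only and is not pgvector's own code; the class and method names are hypothetical.

```java
final class DistanceOperatorsSketch {

    // Same idea as <-> : straight-line (L2) distance
    static double euclideanDistance(double[] a, double[] b) {
        requireSameLength(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Same idea as <#> : inner product with the sign flipped,
    // so that a larger inner product sorts first in ascending order
    static double negativeInnerProduct(double[] a, double[] b) {
        requireSameLength(a, b);
        double dot = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
        }
        return -dot;
    }

    // Same idea as <=> : 1 minus the cosine similarity (NaN for a zero vector)
    static double cosineDistance(double[] a, double[] b) {
        requireSameLength(a, b);
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    private static void requireSameLength(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0};
        double[] b = {3.0, 4.0};
        System.out.println("Euclidean distance:     " + euclideanDistance(a, b));
        System.out.println("Negative inner product: " + negativeInnerProduct(a, b));
        System.out.println("Cosine distance:        " + cosineDistance(a, b));
    }
}
```

For vectors that are normalized to length 1, the three measures rank results in the same order, so the choice mostly matters when vector lengths differ.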

The method is implemented so that the query returns the original text.

    public void findMostSimilarString(String data) {
        try (var connection = DriverManager.getConnection(POSTGRESQL_JDBC_URL, POSTGRESQL_USER, POSTGRESQL_PASSWORD)) {
            // Create a vector array by calling the text embedding for the string the user wants to search
            List<Double> embedding = invokeTextEmbedding(data);
            String array = embedding.toString();
            LOGGER.info("Embedding: \n" + array);

            // Search with the vector array (find the closest string to the user's input)
            String querySql = "SELECT origntext FROM TBL_VECTOR_TEST ORDER BY embedding <-> ?::vector LIMIT 1;";
            PreparedStatement queryStatement = connection.prepareStatement(querySql);
            queryStatement.setString(1, array);
            ResultSet resultSet = queryStatement.executeQuery();
            while (resultSet.next()) {
                String origntext = resultSet.getString("origntext");
                LOGGER.info("Origntext: " + origntext);
            }
        } catch (SQLException e) {
            LOGGER.error("Connection failure." + e.getMessage());
        }
    }
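In findMostSimilarString, the vector is passed to PostgreSQL as a string (embedding.toString()) and cast with ?::vector. This works because pgvector's text form is a bracketed, comma-separated list such as [1,2,3]. As a sketch, the same string can be built explicitly so that it does not depend on the formatting of List.toString(); the class VectorLiteralSketch and the method toVectorLiteral are hypothetical names for this example.

```java
import java.util.List;
import java.util.stream.Collectors;

final class VectorLiteralSketch {

    // Builds a string such as "[0.5,-1.25,2.0]" from a list of numbers
    static String toVectorLiteral(List<Double> embedding) {
        if (embedding.isEmpty()) {
            throw new IllegalArgumentException("A vector needs at least one element");
        }
        return embedding.stream()
                .map(String::valueOf)
                .collect(Collectors.joining(",", "[", "]"));
    }

    public static void main(String[] args) {
        // Short sample vector; a real embedding would have 1536 elements
        System.out.println(toVectorLiteral(List.of(0.5, -1.25, 2.0)));
    }
}
```

The resulting string could be bound with queryStatement.setString(1, ...) in place of embedding.toString().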

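By using multi-dimensional vector arrays in this way, you can find highly relevant documents.
If you have a use case like this, please give it a try.

Reference information

Reference information is listed below. Please take a look as needed.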