
Embeddings and Vector Databases With ChromaDB

by Harrison Hoffman Nov 15, 2023

The era of large language models (LLMs) is here, bringing with it rapidly evolving libraries like ChromaDB that help augment LLM applications. You’ve most likely heard of chatbots like OpenAI’s ChatGPT, and perhaps you’ve even experienced their remarkable ability to reason about natural language processing (NLP) problems.

Modern LLMs, while imperfect, can accurately solve a wide range of problems and provide correct answers to many questions. But, due to the limits of their training and the number of text tokens they can process, LLMs aren’t a silver bullet for all tasks.

You wouldn’t expect an LLM to provide relevant responses about topics that don’t appear in their training data. For example, if you asked ChatGPT to summarize information in confidential company documents, then you’d be out of luck. You could show some of these documents to ChatGPT, but there’s a limited number of documents that you can upload before you exceed ChatGPT’s maximum number of tokens. How would you select documents to show ChatGPT?

To address these shortcomings and scale your LLM applications, one great option is to use a vector database like ChromaDB. A vector database allows you to store encoded unstructured objects, like text, as lists of numbers that you can compare to one another. You can, for example, find a collection of documents relevant to a question that you want an LLM to answer.

Represent Data as Vectors

Before diving into embeddings and vector databases, you should understand what vectors are and what they represent.

Vector Basics

You can describe vectors at varying levels of complexity, but one great starting place is to think of a vector as an array of numbers. For example, you could represent vectors using NumPy arrays as follows:

import numpy as np
vector = np.array([1, 2, 3])

In this case, vector is a one-dimensional array with three elements: 1, 2, and 3. Vectors can have any number of dimensions, depending on the problem you’re trying to solve.

Vector Similarity

Vectors can also be used to measure the similarity between objects. One common measure is cosine similarity, which compares the angle between two vectors.

To calculate the cosine similarity between two vectors a and b, you can use the following formula:

cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

The returned value ranges from -1 to 1, with higher values indicating greater similarity. A value of 1 means the vectors point in the same direction, while a value of -1 means they point in opposite directions.
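For example, here’s a quick check with three small vectors whose values are illustrative only:

import numpy as np

a = np.array([1, 2, 3])
b = np.array([2, 4, 6])     # same direction as a, twice as long
c = np.array([-1, -2, -3])  # opposite direction to a

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(a, b))  # 1.0: same direction
print(cosine_similarity(a, c))  # -1.0: opposite directions

Notice that b is just a scaled copy of a, yet their similarity is 1.0: cosine similarity ignores magnitude and only compares direction.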

Encode Objects in Embeddings

In the context of NLP, embeddings are dense vector representations of words or texts. These vectors capture semantic relationships, allowing you to perform tasks such as measuring word similarity and classifying text.

Word Embeddings

Word embeddings are vector representations of words in a high-dimensional space. They are learned from large text corpora using techniques like Word2Vec, GloVe, or FastText.

With Python, you can use libraries like Gensim or spaCy to generate word embeddings. Here’s an example using Gensim:

from gensim.models import Word2Vec

# Train a tiny Word2Vec model on two tokenized sentences
sentences = [["I", "love", "python"], ["Python", "is", "great"]]
model = Word2Vec(sentences, vector_size=100, min_count=1)

# Look up the learned embedding for "python" (lookups are case-sensitive)
embedding = model.wv["python"]

In this example, we create a Word2Vec model from two tokenized sentences. We can then access the word embedding for “python” by indexing the model’s wv attribute.
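Once trained, the same model can also score similarity between words directly. Two sentences are far too little data to learn meaningful relationships, so treat the following only as a sketch of the API:

# Cosine similarity between the learned vectors for two words
print(model.wv.similarity("python", "great"))

# The three nearest words to "python" in this tiny vocabulary
print(model.wv.most_similar("python", topn=3))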

Text Embeddings

While word embeddings are useful for representing individual words, text embeddings are designed to capture the meaning of entire sentences or documents. They are learned in a similar way, using techniques like Doc2Vec or the Universal Sentence Encoder.

To generate text embeddings, you can use libraries like Gensim, TensorFlow, or Hugging Face’s Transformers. Here’s an example using TensorFlow Hub and the Universal Sentence Encoder:

import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TensorFlow Hub
encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# Encode a batch of texts into 512-dimensional embeddings
texts = ["I love python", "Python is great"]
embeddings = encoder(texts)

# Get the text embedding for "I love python"
embedding = embeddings[0]

In this example, we load the Universal Sentence Encoder and use it to encode a list of texts. The result is one 512-dimensional vector per text, which you can use for tasks like semantic search or clustering.
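Because the Universal Sentence Encoder produces approximately unit-length vectors, you can compare texts with a simple dot product. As a small sketch continuing the example above:

import numpy as np

# Pairwise similarity matrix for the encoded texts; values close
# to 1.0 indicate semantically similar texts
similarity = np.inner(embeddings, embeddings)
print(similarity)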

Get Started With ChromaDB, an Open-Source Vector Database

Now that you understand the basics of vectors and embeddings, let’s dive into using ChromaDB, an open-source vector database.

What Is a Vector Database?

A vector database is a database designed to store and efficiently query vector data. Unlike traditional relational databases, which store data in tabular form, vector databases store data as vectors, allowing for advanced similarity search and retrieval.

ChromaDB is one such vector database. It provides a simple and efficient way to embed, store, and search vectors, making it a good fit for LLM applications.

Meet ChromaDB for LLM Applications

ChromaDB is specifically designed to augment LLM applications like ChatGPT. With ChromaDB, you can store encoded textual information as vectors and perform similarity searches to retrieve relevant documents or context.

To get started with ChromaDB, you’ll need to install the ChromaDB Python library and set up a database. You can do this by following the official ChromaDB documentation. Once installed and configured, you can start encoding and querying your vectors.
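You can install the library with pip install chromadb. As a minimal sketch, the in-memory client is enough for local experiments; the collection name and documents below are arbitrary:

import chromadb

# An in-memory client; no server required for local experiments
client = chromadb.Client()

# A collection stores documents together with their embeddings
collection = client.create_collection("quickstart")

# ChromaDB embeds the documents with its default embedding function
collection.add(
    ids=["1", "2"],
    documents=["I love python", "Python is great"],
)

# Queries are embedded the same way, then matched by similarity
results = collection.query(query_texts=["favorite language"], n_results=1)
print(results["documents"])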

Practical Example: Add Context for a Large Language Model (LLM)

In this practical example, we’ll walk through the process of adding context for a large language model using ChromaDB. We’ll start by preparing and inspecting our dataset, then we’ll create a collection and add reviews to it. Next, we’ll connect to an LLM service and provide context to the LLM using ChromaDB.

Prepare and Inspect Your Dataset

Before we can add context to our LLM, we need a dataset to work with. In this example, let’s assume we have a dataset of product reviews.

First, we need to load and preprocess the dataset. This may involve cleaning the text, removing stopwords, and tokenizing the text into individual words or sentences.

import pandas as pd

# Load the dataset
dataset = pd.read_csv("reviews.csv")

# Clean the raw review text; preprocess_reviews() is a user-defined
# helper, sketched below
preprocessed_reviews = preprocess_reviews(dataset["review"])
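The preprocess_reviews() helper isn’t a library function; it stands in for whatever cleaning your data needs. A minimal sketch might look like this:

import re

def preprocess_reviews(reviews):
    # Lowercase, strip punctuation, and collapse whitespace
    cleaned = []
    for review in reviews:
        text = re.sub(r"[^a-z0-9\s]", " ", str(review).lower())
        cleaned.append(" ".join(text.split()))
    return cleaned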

Create a Collection and Add Reviews

Now that we have our preprocessed reviews, we can create a collection in ChromaDB and add our reviews to it.

import chromadb

# Connect to a running ChromaDB server
client = chromadb.HttpClient(host="localhost", port=8000)

# Create a collection
collection = client.create_collection("product_reviews")

# Add the reviews; ChromaDB embeds each document with the
# collection's embedding function and stores the resulting vectors
collection.add(
    ids=[str(i) for i in range(len(preprocessed_reviews))],
    documents=list(preprocessed_reviews),
)

In this example, we first connect to a running ChromaDB server with chromadb.HttpClient(). We then create a collection called “product_reviews” and add each preprocessed review as a document, letting ChromaDB embed and store the vectors.

Connect to an LLM Service

With our collection of reviews stored in ChromaDB, we can now connect to an LLM service and provide context to the LLM.

from llm_service import LLMService  # placeholder client for your LLM provider

# Connect to the LLM service
llm = LLMService("https://llm-service.com")

# Get a query from the user
query = input("Enter your query: ")

# Find the most similar reviews; ChromaDB embeds the query text
# with the same embedding function it used for the documents
results = collection.query(query_texts=[query], n_results=5)
similar_reviews = results["documents"][0]

# Provide the retrieved reviews as context to the LLM
llm.set_context(similar_reviews)

In this example, we connect to the LLM service, ask the user for a query, and use the collection’s query() method to retrieve the five most similar reviews. Finally, we provide those reviews as context to the LLM through the set_context() method.
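Note that llm_service is a placeholder rather than a real library. With an actual provider, such as OpenAI’s Python SDK, the same final step might look like this sketch, where the model name and prompt wording are assumptions:

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Interpolate the retrieved reviews into the prompt as context
context = "\n".join(similar_reviews)
response = openai_client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[
        {"role": "system", "content": f"Answer using only these reviews:\n{context}"},
        {"role": "user", "content": query},
    ],
)
print(response.choices[0].message.content)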

Conclusion

In this tutorial, you learned about the power of vector databases and how they can enhance large language models like ChatGPT. You also learned about representing data as vectors, generating word and text embeddings, and getting started with ChromaDB.

With ChromaDB, you can store encoded unstructured objects as vectors and perform advanced similarity searches. This allows you to add context to your LLM applications and scale them to handle larger datasets.

By combining the strengths of LLMs and vector databases like ChromaDB, you can overcome the limitations of LLMs and unlock new possibilities for NLP applications.