How to Use and Fix Python Vector Databases
Embeddings and Vector Databases With ChromaDB
by Harrison Hoffman Nov 15, 2023
The era of large language models (LLMs) is here, bringing with it rapidly evolving libraries like ChromaDB that help augment LLM applications. You’ve most likely heard of chatbots like OpenAI’s ChatGPT, and perhaps you’ve even experienced their remarkable ability to reason about natural language processing (NLP) problems.
Modern LLMs, while imperfect, can accurately solve a wide range of problems and provide correct answers to many questions. But, due to the limits of their training and the number of text tokens they can process, LLMs aren’t a silver bullet for all tasks.
You wouldn’t expect an LLM to provide relevant responses about topics that don’t appear in its training data. For example, if you asked ChatGPT to summarize information in confidential company documents, then you’d be out of luck. You could show some of these documents to ChatGPT, but there’s a limited number of documents that you can upload before you exceed ChatGPT’s maximum number of tokens. How would you select documents to show ChatGPT?
To address these shortcomings and scale your LLM applications, one great option is to use a vector database like ChromaDB. A vector database allows you to store encoded unstructured objects, like text, as lists of numbers that you can compare to one another. You can, for example, find a collection of documents relevant to a question that you want an LLM to answer.
Represent Data as Vectors
Before diving into embeddings and vector databases, you should understand what vectors are and what they represent.
Vector Basics
You can describe vectors with varying levels of complexity, but one great starting place is to think of a vector as an array of numbers. For example, you could represent a vector as a one-dimensional NumPy array with three elements: 1, 2, and 3. Vectors can have any number of dimensions (that is, elements), depending on the problem you’re trying to solve.
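As a minimal sketch of the vector described above, assuming NumPy is installed, the snippet below builds the three-element vector and a second one with more elements. The variable names and the values in the second vector are illustrative only.

```python
import numpy as np

# A one-dimensional array with three elements: 1, 2, and 3.
vector = np.array([1, 2, 3])

# Vectors can have any number of elements; this one has five.
longer_vector = np.array([0.5, -1.2, 3.3, 0.0, 2.1])

print(vector.shape)         # (3,)
print(longer_vector.shape)  # (5,)
```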
Vector Similarity
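Vectors can also be used to measure the similarity between different objects. For example, you can use cosine similarity to compare two vectors.
To calculate the cosine similarity between two vectors a and b, you can use the following formula:
$$\text{cosine similarity}(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$
Here, a · b is the dot product of the two vectors, and ‖a‖ and ‖b‖ are their lengths (Euclidean norms).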
The returned value ranges from -1 to 1, with a higher value indicating greater similarity. A value of 1 indicates that the vectors point in the same direction, a value of 0 indicates that they are orthogonal, and a value of -1 indicates that they point in opposite directions.
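The cosine similarity calculation can be sketched in a few lines of NumPy. This is a minimal illustration that assumes both vectors are non-zero. The function name `cosine_similarity` and the sample vectors are chosen here for illustration.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Return the cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


a = np.array([1.0, 2.0, 3.0])

same_direction = cosine_similarity(a, 2 * a)  # scaled copy of a
opposite = cosine_similarity(a, -a)           # a pointing the other way
orthogonal = cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0]))

print(round(same_direction, 6), round(opposite, 6), round(orthogonal, 6))
# 1.0 -1.0 0.0
```

Note that `2 * a` scores 1 even though it is not identical to `a`: cosine similarity compares direction only and ignores length.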
Encode Objects in Embeddings
In the context of NLP, embeddings are representations of words or texts in the form of dense vectors. These vectors capture semantic relationships between words or texts, allowing us to perform various tasks such as word similarity and text classification.
Word Embeddings
Word embeddings are vector representations of words in a high-dimensional space. They are learned from large text corpora using techniques like Word2Vec, GloVe, or FastText.
With Python, you can use libraries like Gensim or spaCy to work with word embeddings. For example, with Gensim you can train a Word2Vec model on a small corpus of two tokenized sentences. You can then access the word embedding for “python” by indexing the model’s wv attribute, which holds the trained word vectors.
Text Embeddings
While word embeddings are useful for representing individual words, text embeddings are designed to capture the meaning of entire texts. They are learned in a similar way to word embeddings, but on a larger scale using techniques like Doc2Vec or Universal Sentence Encoder.
To generate text embeddings, you can use libraries like Gensim, TensorFlow, or Hugging Face’s Transformers. For example, with TensorFlow you can load the Universal Sentence Encoder and use it to encode a list of texts. The resulting embeddings can then be used for various text analysis tasks.
Get Started With ChromaDB, an Open-Source Vector Database
Now that you understand the basics of vectors and embeddings, let’s dive into using ChromaDB, an open-source vector database.
What Is a Vector Database?
A vector database is a database designed to store and efficiently query vector data. Unlike traditional relational databases, which store data in tabular form, vector databases store data as vectors, allowing for advanced similarity search and retrieval.
ChromaDB is one such vector database that specializes in handling large-scale vector data. It provides a simple and efficient way to encode, store, and search for vectors, making it an ideal choice for LLM applications.
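To make the store-and-query idea concrete, here is a toy in-memory sketch written with NumPy. It is not how ChromaDB is implemented: the `TinyVectorStore` class, its methods, and the sample vectors are hypothetical, and it searches by exact brute-force comparison, whereas production vector databases typically rely on approximate indexes to stay fast at scale.

```python
import numpy as np


class TinyVectorStore:
    """Toy in-memory store with brute-force cosine-similarity search."""

    def __init__(self) -> None:
        self.ids: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, item_id: str, vector: list[float]) -> None:
        v = np.asarray(vector, dtype=float)
        # Store unit-length vectors so a dot product equals cosine similarity.
        self.ids.append(item_id)
        self.vectors.append(v / np.linalg.norm(v))

    def query(self, vector: list[float], k: int = 2) -> list[tuple[str, float]]:
        q = np.asarray(vector, dtype=float)
        q = q / np.linalg.norm(q)
        scores = np.stack(self.vectors) @ q
        top = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
        return [(self.ids[i], float(scores[i])) for i in top]


store = TinyVectorStore()
store.add("doc-a", [1.0, 0.0, 0.0])
store.add("doc-b", [0.9, 0.1, 0.0])
store.add("doc-c", [0.0, 0.0, 1.0])

results = store.query([1.0, 0.05, 0.0], k=2)
print([item_id for item_id, _ in results])  # ['doc-a', 'doc-b']
```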
Meet ChromaDB for LLM Applications
ChromaDB is specifically designed to augment LLM applications like ChatGPT. With ChromaDB, you can store encoded textual information as vectors and perform similarity searches to retrieve relevant documents or context.
To get started with ChromaDB, you’ll need to install the ChromaDB Python library and set up a database. You can do this by following the official ChromaDB documentation. Once installed and configured, you can start encoding and querying your vectors.
Practical Example: Add Context for a Large Language Model (LLM)
In this practical example, we’ll walk through the process of adding context for a large language model using ChromaDB. We’ll start by preparing and inspecting our dataset, then we’ll create a collection and add reviews to it. Next, we’ll connect to an LLM service and provide context to the LLM using ChromaDB.
Prepare and Inspect Your Dataset
Before we can add context to our LLM, we need a dataset to work with. In this example, let’s assume we have a dataset of product reviews.
First, we need to load and preprocess the dataset. This may involve cleaning the text, removing stopwords, and tokenizing the text into individual words or sentences.
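The preprocessing steps mentioned above could be sketched with the standard library alone. The sample reviews, the short stopword list, and the `preprocess` helper are all made up for illustration. A real project would load its own dataset and might use a fuller stopword list or skip stopword removal entirely.

```python
import re

# Small illustrative stopword list; real projects often use a fuller one.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "of", "i"}


def preprocess(review: str) -> list[str]:
    """Lowercase a review, strip punctuation, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9']+", review.lower())
    return [token for token in tokens if token not in STOPWORDS]


# Made-up reviews standing in for a real dataset.
reviews = [
    "This blender is GREAT, and it crushes ice easily!",
    "The battery died after a week. I returned it.",
]

cleaned = [preprocess(review) for review in reviews]
print(cleaned[0])  # ['blender', 'great', 'crushes', 'ice', 'easily']
```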
Create a Collection and Add Reviews
Now that we have our preprocessed reviews, we can create a collection in ChromaDB and add our reviews to it.
To do this, we first connect to ChromaDB using the client library. We then create a collection called “product_reviews” and add each preprocessed review to the collection, where it is stored as a vector.
Connect to an LLM Service
With our collection of reviews stored in ChromaDB, we can now connect to an LLM service and provide context to the LLM.
In this example, we first connect to the LLM service using the LLMService library. We then ask the user for a query and use ChromaDB to find similar reviews. Finally, we provide the similar reviews as context to the LLM using the set_context method.
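The interface of the LLM service isn't shown here, so the sketch below doesn't call one. Instead, it illustrates the general idea of turning retrieved reviews into context: a hypothetical `build_prompt` helper combines the reviews and the user's question into a single string that you could then pass to whichever LLM client you use. The reviews and the question are made up.

```python
def build_prompt(question: str, context_reviews: list[str]) -> str:
    """Combine retrieved reviews and a user question into one prompt string."""
    context = "\n".join(f"- {review}" for review in context_reviews)
    return (
        "Answer the question using only the customer reviews below.\n\n"
        f"Reviews:\n{context}\n\n"
        f"Question: {question}"
    )


# Stand-in for the reviews that a ChromaDB query would return.
retrieved_reviews = [
    "The blender crushes ice easily.",
    "The blender is powerful but loud.",
]

prompt = build_prompt("Is the blender good at crushing ice?", retrieved_reviews)
print(prompt)
```

Keeping the prompt construction in a plain function makes it easy to inspect exactly what context the LLM receives and to cap the number of reviews so that you stay within the model's token limit.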
Conclusion
In this tutorial, you learned about the power of vector databases and how they can enhance large language models like ChatGPT. You also learned about representing data as vectors, generating word and text embeddings, and getting started with ChromaDB.
With ChromaDB, you can store encoded unstructured objects as vectors and perform advanced similarity searches. This allows you to add context to your LLM applications and scale them to handle larger datasets.
By combining the strengths of LLMs and vector databases like ChromaDB, you can overcome the limitations of LLMs and unlock new possibilities for NLP applications.