In this tutorial, we’ll demonstrate how to use Hugging Face embeddings with Upstash Vector and LangChain to perform a similarity search. We will upload a few sample documents, embed them using Hugging Face, and then perform a search query to find the most semantically similar documents.

Important Note on Python Version

Recent Python versions may cause compatibility issues with torch, a dependency of the Hugging Face models. We therefore recommend using Python 3.9 to avoid installation issues.
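
For example, you can pin the interpreter by creating a dedicated virtual environment before installing anything. A minimal sketch, assuming python3.9 is available on your PATH:

python3.9 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate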

Installation and Setup

First, we need to set up our environment. Install the required dependencies by running the following command:

pip install langchain sentence_transformers upstash-vector python-dotenv langchain-community langchain-huggingface

Next, create a .env file in your project directory with the following content, replacing your_upstash_url and your_upstash_token with your actual Upstash credentials:

UPSTASH_VECTOR_REST_URL=your_upstash_url
UPSTASH_VECTOR_REST_TOKEN=your_upstash_token

This configuration file will allow us to load the required environment variables.

Code

We will load our environment variables and initialize the Hugging Face embeddings model along with the Upstash Vector store.

# Load environment variables for API keys and Upstash configuration
from dotenv import load_dotenv
import os
load_dotenv()

# Import required libraries
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores.upstash import UpstashVectorStore

# Initialize Hugging Face embeddings model
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

# Set up Upstash Vector Store (automatically uses the environment variables)
vector_store = UpstashVectorStore(embedding=embeddings)
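
If you prefer not to rely on the environment variables alone, the credentials can also be passed in explicitly. A minimal sketch, assuming the index_url and index_token keyword arguments of UpstashVectorStore:

# Alternatively, pass the Upstash credentials explicitly
vector_store = UpstashVectorStore(
    embedding=embeddings,
    index_url=os.environ["UPSTASH_VECTOR_REST_URL"],
    index_token=os.environ["UPSTASH_VECTOR_REST_TOKEN"],
)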

Next, we will create sample documents and embed them using Hugging Face embeddings, then store them in Upstash Vector.

# Import the required Document class from LangChain
from langchain.schema import Document

# Sample documents to embed and store as Document objects
documents = [
    Document(page_content="Global warming is causing sea levels to rise."),
    Document(page_content="Artificial intelligence is transforming many industries."),
    Document(page_content="Renewable energy is vital for sustainable development.")
]

# Embed documents and store in Upstash Vector with batching
vector_store.add_documents(
    documents=documents,
    batch_size=100,             # number of vectors sent per HTTP request
    embedding_chunk_size=200    # number of documents embedded at a time
)

print("Documents with embeddings have been stored in Upstash Vector.")

When documents are inserted, they are first embedded using the Embeddings object. Many embedding models, including the Hugging Face models, support embedding multiple documents at once, which allows the documents to be batched and processed efficiently.

  • The embedding_chunk_size parameter controls how many documents are embedded together in each call to the embedding model.

Once the embeddings are created, they are stored in Upstash Vector. To reduce the number of HTTP requests, the vectors are also sent in batches.

  • The batch_size parameter controls the number of vectors included in each HTTP request when sending to Upstash Vector.

In the Upstash Vector free tier, there is a limit of 1000 vectors per batch.

Now, we can perform a semantic search.

# Querying Vectors using a text query
query_text = "What are the effects of global warming?"
query_embedding = embeddings.embed_query(query_text)

# Perform similarity search with the query text
result_text_query = vector_store.similarity_search(
    query=query_text,
    k=5  # Number of top results to return
)

print("Results for text-based similarity search:")
for res in result_text_query:
    print(res.page_content)

Here’s the output of our text-based similarity search:

Results for text-based similarity search:
Global warming is causing sea levels to rise.
Renewable energy is vital for sustainable development.
Artificial intelligence is transforming many industries.

Alternatively, you can perform a similarity search using the vector directly.

# Querying Vectors using a vector directly
result_vector_query = vector_store.similarity_search_by_vector(
    embedding=query_embedding,
    k=5
)

print("Results for vector-based similarity search:")
for res in result_vector_query:
    print(res.page_content)

This outputs the same results, since the search is based on the embedding of the same query.
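
If you also want to see how close each match is, you can use the scored variant of the search. A short sketch, assuming the similarity_search_with_score method of UpstashVectorStore:

# Similarity search that also returns a relevance score per document
results_with_scores = vector_store.similarity_search_with_score(
    query=query_text,
    k=5
)

for doc, score in results_with_scores:
    print(f"{score:.4f} - {doc.page_content}")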

Notes

  • You can specify batch sizes and chunk sizes to control the efficiency of document processing and storage in Upstash Vector.
  • Upstash Vector supports namespaces for organizing different types of documents. You can set a namespace when creating the UpstashVectorStore instance, as shown in the sketch below.
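
A minimal sketch of namespace usage, assuming the namespace keyword argument of UpstashVectorStore and a hypothetical namespace name "climate-docs":

# Store and query documents under a dedicated namespace
climate_store = UpstashVectorStore(
    embedding=embeddings,
    namespace="climate-docs"
)
climate_store.add_documents(documents=documents)

# Queries are scoped to the same namespace
results = climate_store.similarity_search(query="rising sea levels", k=3)

Documents added this way are isolated from the default namespace, so queries against climate_store will only match documents stored under "climate-docs".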

To learn more about LangChain and its integration with Upstash Vector, visit the LangChain documentation.