LangChain

LangChain is an open-source framework for developing applications powered by language models. LangChain offers a series of modular, easy-to-use components that can be pieced together into a chain for building language-based applications.

LangChain components can be used to preprocess data or break it into chunks, embed the chunks using embedding models, and run similarity search on those embeddings with vector databases. LangChain offers a number of features that make managing and optimizing the use of language models easy:

  • Access to pre-trained LLMs from OpenAI, Hugging Face, Cohere, and more
  • Tools for preprocessing text and code
  • Vector stores, including Rockset, for application serving
  • Off-the-shelf chains to build applications

As a real-time search and analytics database, Rockset uses indexing to deliver scalable and performant personalization, product search, semantic search, chatbot applications, and more. Since Rockset is purpose-built for real-time, you can build these responsive applications on constantly updating, streaming data. By integrating Rockset with LangChain, you can easily use LLMs on your own real-time data for production-ready vector search applications.

We'll walk through a demonstration of how to use Rockset as a vector store in LangChain. To get started, make sure you have a Rockset account and an API key available.

Setting Up Your Environment

  1. Leverage the Rockset console to create a Collection with the Write API as your source. In this walkthrough, we create a collection named langchain_demo. Configure the following Ingest Transformation with VECTOR_ENFORCE to define your embeddings field and take advantage of performance and storage optimizations:
SELECT _input.* EXCEPT(_meta),
    VECTOR_ENFORCE(
        _input.description_embedding,
        1536,
        'float'
    ) AS description_embedding
FROM _input
  2. Create and save a new API Key by navigating to the API Keys tab of the Rockset Console. For this example, we assume you are using the Oregon (us-west-2) region.

  3. Install the Rockset Python client and additional dependencies to work with LangChain and OpenAI.

pip install rockset langchain openai tiktoken
  4. This tutorial uses OpenAI to create embeddings. You will need to create an OpenAI account and get an API key. Set the API key as the OPENAI_API_KEY environment variable, as in the sketch after this list.
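
A minimal way to set both keys from Python (the values below are placeholders; you can also export them in your shell instead):

import os

# Placeholder values; substitute your real keys
os.environ["OPENAI_API_KEY"] = "<your OpenAI API key>"
os.environ["ROCKSET_API_KEY"] = "<your Rockset API key>"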

Using Rockset as a Vector Store

The following sections outline how to generate and store vector embeddings in Rockset, and how to search across those embeddings to find documents similar to your search queries.

1. Define Key Variables

import os
import rockset

ROCKSET_API_KEY = os.environ.get("ROCKSET_API_KEY") # Verify ROCKSET_API_KEY environment variable
ROCKSET_API_SERVER = rockset.Regions.usw2a1 # Verify Rockset region
rockset_client = rockset.RocksetClient(ROCKSET_API_SERVER, ROCKSET_API_KEY)

COLLECTION_NAME = 'langchain_demo'
WORKSPACE = 'langchain_demo_ws'
TEXT_KEY = 'description'
EMBEDDING_KEY = 'description_embedding'

2. Prepare Documents

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.vectorstores.rocksetdb import Rockset

# file located in https://github.com/langchain-ai/langchain/blob/master/docs/extras/modules/state_of_the_union.txt
loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

3. Embed and Insert Documents

embeddings = OpenAIEmbeddings() # Verify OPENAI_API_KEY environment variable

docsearch = Rockset(
    client=rockset_client,
    embeddings=embeddings,
    collection_name=COLLECTION_NAME,
    workspace=WORKSPACE,
    text_key=TEXT_KEY,
    embedding_key=EMBEDDING_KEY
)

ids = docsearch.add_texts(
    texts=[d.page_content for d in docs],
    metadatas=[d.metadata for d in docs],
)

4. Search for Similar Documents

query = "What did the president say about Ketanji Brown Jackson?"
output = docsearch.similarity_search_with_relevance_scores(query, 4, Rockset.DistanceFunction.COSINE_SIM)

print("output length:", len(output))
for d, dist in output:
    print(dist, d.metadata, d.page_content[:20] + '...')
# output length: 4
# 0.764990692109871 {'source': '../../../state_of_the_union.txt'} Madam Speaker, Madam...
# 0.7485416901622112 {'source': '../../../state_of_the_union.txt'} And I'm taking robus...
# 0.7468678973398306 {'source': '../../../state_of_the_union.txt'} And so many families...
# 0.7436231261419488 {'source': '../../../state_of_the_union.txt'} Groups of citizens b...
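
If you only need the documents and not the relevance scores, LangChain vector stores also expose similarity_search, which takes the same arguments; a minimal sketch:

docs_only = docsearch.similarity_search(query, 4, Rockset.DistanceFunction.COSINE_SIM)
for d in docs_only:
    print(d.metadata, d.page_content[:20] + '...')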

5. Search for Similar Documents with Metadata Filtering

output = docsearch.similarity_search_with_relevance_scores(
    query,
    4,
    Rockset.DistanceFunction.COSINE_SIM,
    where_str="{} NOT LIKE '%citizens%'".format(TEXT_KEY)
)

print("output length:", len(output))
for d, dist in output:
    print(dist, d.metadata, d.page_content[:20] + '...')
# output length: 4
# 0.7651359650263554 {'source': '../../../state_of_the_union.txt'} Madam Speaker, Madam...
# 0.7486265516824893 {'source': '../../../state_of_the_union.txt'} And I’m taking robus...
# 0.7469625542348115 {'source': '../../../state_of_the_union.txt'} And so many families...
# 0.7344177777547739 {'source': '../../../state_of_the_union.txt'} We see the unity amo...

6. Delete Inserted Documents [Optional]

You must have the unique ID associated with each document to delete it from your collection. You can define IDs when inserting documents with Rockset.add_texts(); otherwise, Rockset generates a unique ID for each document. Either way, Rockset.add_texts() returns the IDs of the inserted documents.
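
For example, a sketch of supplying your own IDs at insert time (the doc-{i} naming scheme here is purely illustrative):

ids = docsearch.add_texts(
    texts=[d.page_content for d in docs],
    metadatas=[d.metadata for d in docs],
    ids=[f"doc-{i}" for i in range(len(docs))],  # illustrative IDs, one per document
)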

To delete these documents, simply use the Rockset.delete_texts() function.

docsearch.delete_texts(ids)

Using Rockset as a Data Source

LangChain document loaders expose a load method for loading data as documents from a source, and Rockset can be configured as a data source. The following sections demonstrate how to use Rockset as a document loader in LangChain.

Executing Queries

The RocksetLoader class allows you to create LangChain documents from Rockset collections through SQL queries.

Start by initializing a RocksetLoader with the following sample code:

from langchain.document_loaders import RocksetLoader
from rockset import RocksetClient, Regions, models

loader = RocksetLoader(
    RocksetClient(api_key="<api key>"),
    models.QueryRequestSql(
        query="SELECT * FROM langchain_demo LIMIT 3" # SQL query
    ),
    ["text"],  # content columns
    metadata_keys=["author", "date"],  # metadata columns
)

Here, you can see that the following query is run:

SELECT * FROM langchain_demo LIMIT 3

The text column in the collection is used as the page content, and the author and date columns are used as metadata. If you do not specify metadata_keys, the whole Rockset document is used as metadata.

To execute the query and access an iterator over the resulting Documents, run the following:

loader.lazy_load()
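
For example, a short sketch that streams the results one Document at a time, using the loader defined above:

for doc in loader.lazy_load():
    print(doc.page_content[:50])  # first 50 characters of each document's content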

To execute the query and access all resulting Documents at once, run the following:

loader.load() 

Here is an example response of loader.load():

[
    Document(
        page_content="Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas a libero porta, dictum ipsum eget, hendrerit neque. Morbi blandit, ex ut suscipit viverra, enim velit tincidunt tellus, a tempor velit nunc et ex. Proin hendrerit odio nec convallis lobortis. Aenean in purus dolor. Vestibulum orci orci, laoreet eget magna in, commodo euismod justo.", 
        metadata={"author": "Joe Biden", "date": "2022-11-13T18:26:45.000000Z"}
    ),
    Document(
        page_content="Integer at finibus odio. Nam sit amet enim cursus lacus gravida feugiat vestibulum sed libero. Aenean eleifend est quis elementum tincidunt. Curabitur sit amet ornare erat. Nulla id dolor ut magna volutpat sodales fringilla vel ipsum. Donec ultricies, lacus sed fermentum dignissim, lorem elit aliquam ligula, sed suscipit sapien purus nec ligula.", 
        metadata={"author": "Donald Trump", "date": "2022-11-13T18:28:53.000000Z"}
    ),
    Document(
        page_content="Morbi tortor enim, commodo id efficitur vitae, fringilla nec mi. Nullam molestie faucibus aliquet. Praesent a est facilisis, condimentum justo sit amet, viverra erat. Fusce volutpat nisi vel purus blandit, et facilisis felis accumsan. Phasellus luctus ligula ultrices tellus tempor hendrerit. Donec at ultricies leo.", 
        metadata={"author": "Barack Obama", "date": "2022-11-13T18:49:04.000000Z"}
    )
]

Content Columns

You can choose to use multiple columns as content:

from langchain.document_loaders import RocksetLoader
from rockset import RocksetClient, Regions, models

loader = RocksetLoader(
    RocksetClient(Regions.usw2a1, "<api key>"),
    models.QueryRequestSql(query="SELECT * FROM langchain_demo WHERE id=38 LIMIT 1"),
    ["sentence1", "sentence2"],  # TWO content columns
)

If the "sentence1" field is "This is the first sentence." and the "sentence2" field is "This is the second sentence.", the page_content of the resulting Document would be:

This is the first sentence.
This is the second sentence.

You can define your own function to join content columns by setting the content_columns_joiner argument in the RocksetLoader constructor. content_columns_joiner is a function that takes a List[Tuple[str, Any]], representing a list of (column name, column value) tuples. By default, it joins each column value with a new line.

For example, if you wanted to join sentence1 and sentence2 with a space instead of a new line, you could set content_columns_joiner like so:

from langchain.document_loaders import RocksetLoader
from rockset import models

RocksetLoader(
    RocksetClient(Regions.usw2a1, "<api key>"),
    models.QueryRequestSql(query="SELECT * FROM langchain_demo WHERE id=38 LIMIT 1"),
    ["sentence1", "sentence2"],
    content_columns_joiner=lambda docs: " ".join(
        [doc[1] for doc in docs]
    ),  # join with a space instead of \n
)

The page_content of the resulting Document would be:

This is the first sentence. This is the second sentence.

Often you will want to include the column name in the page_content as well. You can do this by running:

from langchain.document_loaders import RocksetLoader
from rockset import models

RocksetLoader(
    RocksetClient(Regions.usw2a1, "<api key>"),
    models.QueryRequestSql(query="SELECT * FROM langchain_demo WHERE id=38 LIMIT 1"),
    ["sentence1", "sentence2"],
    content_columns_joiner=lambda docs: "\n".join(
        [f"{doc[0]}: {doc[1]}" for doc in docs]
    ),
)

This would result in the following page_content:

sentence1: This is the first sentence.
sentence2: This is the second sentence.

Using Rockset for Chat History

Rockset can be used to store chat history. LangChain's RocksetChatMessageHistory class is responsible for remembering chat interactions that can be passed into a model.

Construct a RocksetChatMessageHistory object:

from langchain.memory.chat_message_histories import RocksetChatMessageHistory
from rockset import RocksetClient

history = RocksetChatMessageHistory(
    session_id="MySession",
    client=RocksetClient(),
    collection="langchain_demo",
    sync=True
)

If the collection langchain_demo does not exist in the commons workspace, LangChain will create it.

Add chat messages:

history.add_user_message("hi!")
history.add_ai_message("whats up?")

Get message history:

print(history.messages)

Clear chat history:

history.clear()
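
To pass this history into a model, one option is to wrap it in LangChain's general-purpose ConversationBufferMemory and use it with a chain; the sketch below assumes the history object from above and an OPENAI_API_KEY environment variable:

from langchain.chains import ConversationChain
from langchain.chat_models import ChatOpenAI
from langchain.memory import ConversationBufferMemory

# Back the chain's memory with the Rockset-stored chat history
memory = ConversationBufferMemory(chat_memory=history)

chain = ConversationChain(llm=ChatOpenAI(), memory=memory)
chain.run("What was my first message?")  # prior messages are loaded from Rockset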