LangChain
LangChain is an open-source framework for developing applications powered by language models. LangChain offers a series of modular, easy-to-use components that can be pieced together into a chain for building language-based applications.
LangChain components can be used to preprocess data or break it into chunks, embed the chunks using embedding models, and run similarity search on those embeddings with vector databases. LangChain offers a number of features that make managing and optimizing the use of language models easy:
- Access to pre-trained LLMs from OpenAI, Hugging Face, Cohere, and more
- Tools for preprocessing text and code
- Vector stores, including Rockset, for application serving
- Off-the-shelf chains to build applications
As a real-time search and analytics database, Rockset uses indexing to deliver scalable and performant personalization, product search, semantic search, chatbot applications, and more. Since Rockset is purpose-built for real-time, you can build these responsive applications on constantly updating, streaming data. By integrating Rockset with LangChain, you can easily use LLMs on your own real-time data for production-ready vector search applications.
We'll walk through a demonstration of how to use Rockset as a vector store in LangChain. To get started, make sure you have access to a Rockset account and an API key available.
Setting Up Your Environment
- Leverage the Rockset console to create a Collection with the Write API as your source. In this walkthrough, we create a collection named `langchain_demo`. Configure the following Ingest Transformation with `VECTOR_ENFORCE` to define your embeddings field and take advantage of performance and storage optimizations:

```sql
SELECT _input.* EXCEPT(_meta),
    VECTOR_ENFORCE(_input.description_embedding,
                   1536,
                   'float') AS description_embedding
FROM _input
```
- Create and save a new API Key by navigating to the API Keys tab of the Rockset console. For this example, we assume you are using the `Oregon (us-west-2)` region.
- Install the Rockset Python client and the additional dependencies needed to work with LangChain and OpenAI:

```shell
pip install rockset langchain openai tiktoken
```

- This tutorial uses OpenAI to create embeddings. You will need to create an OpenAI account and get an API key. Set the API key as the `OPENAI_API_KEY` environment variable.
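A missing key typically surfaces later as an opaque authentication error, so it can help to fail fast up front. Here is a minimal helper for that check (a hypothetical convenience, not part of the Rockset or LangChain APIs):

```python
import os

def require_env(name: str) -> str:
    # Fail fast with a clear message if a required key is missing.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running this demo.")
    return value

# openai_api_key = require_env("OPENAI_API_KEY")
# rockset_api_key = require_env("ROCKSET_API_KEY")
```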
Using Rockset as a Vector Store
The following sections outline how to generate and store vector embeddings in Rockset and search across embeddings to find similar documents to your search queries.
1. Define Key Variables
```python
import os
import rockset

ROCKSET_API_KEY = os.environ.get("ROCKSET_API_KEY")  # verify the ROCKSET_API_KEY environment variable is set
ROCKSET_API_SERVER = rockset.Regions.usw2a1  # verify this matches your Rockset region
rockset_client = rockset.RocksetClient(ROCKSET_API_SERVER, ROCKSET_API_KEY)

COLLECTION_NAME = 'langchain_demo'
WORKSPACE = 'langchain_demo_ws'
TEXT_KEY = 'description'
EMBEDDING_KEY = 'description_embedding'
```
2. Prepare Documents
```python
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.vectorstores.rocksetdb import Rockset

# File available at https://github.com/langchain-ai/langchain/blob/master/docs/extras/modules/state_of_the_union.txt
loader = TextLoader('../../../state_of_the_union.txt')
documents = loader.load()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
```
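To build intuition for what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified character-window splitter (a sketch only; LangChain's `CharacterTextSplitter` additionally splits on separators and handles edge cases):

```python
def split_chars(text, chunk_size, chunk_overlap):
    # Slide a fixed-size window over the text; consecutive chunks
    # share `chunk_overlap` characters.
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_chars("x" * 2500, chunk_size=1000, chunk_overlap=0)
# With chunk_overlap=0 the windows tile the text: 1000 + 1000 + 500 characters.
```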
3. Embed and Insert Documents
```python
embeddings = OpenAIEmbeddings()  # verify the OPENAI_API_KEY environment variable is set

docsearch = Rockset(
    client=rockset_client,
    embeddings=embeddings,
    collection_name=COLLECTION_NAME,
    workspace=WORKSPACE,
    text_key=TEXT_KEY,
    embedding_key=EMBEDDING_KEY,
)

ids = docsearch.add_texts(
    texts=[d.page_content for d in docs],
    metadatas=[d.metadata for d in docs],
)
```
4. Search for Similar Documents
```python
query = "What did the president say about Ketanji Brown Jackson?"
output = docsearch.similarity_search_with_relevance_scores(
    query, 4, Rockset.DistanceFunction.COSINE_SIM
)

print("output length:", len(output))
for d, dist in output:
    print(dist, d.metadata, d.page_content[:20] + '...')
```
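`COSINE_SIM` ranks stored embeddings by their cosine similarity to the query embedding. Rockset computes this server-side in SQL; for reference, the metric itself is just:

```python
import math

def cosine_sim(a, b):
    # Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|).
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))
```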
5. Search for Similar Documents with Metadata Filtering
```python
output = docsearch.similarity_search_with_relevance_scores(
    query,
    4,
    Rockset.DistanceFunction.COSINE_SIM,
    where_str="{} NOT LIKE '%citizens%'".format(TEXT_KEY),
)

print("output length:", len(output))
for d, dist in output:
    print(dist, d.metadata, d.page_content[:20] + '...')
```
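Note that `where_str` is spliced into the SQL that Rockset runs, so quote characters in a filter value can break the generated query. A small hypothetical helper (not part of LangChain) that escapes single quotes when building a `NOT LIKE` filter:

```python
def not_like_filter(column, pattern):
    # Escape single quotes so the pattern cannot terminate the SQL string literal.
    escaped = pattern.replace("'", "''")
    return "{} NOT LIKE '%{}%'".format(column, escaped)

# not_like_filter("description", "citizens") can then be passed as where_str.
```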
6. Delete Inserted Documents [Optional]
You must have the unique ID associated with each document to delete it from your collection. You can define IDs when inserting documents with `Rockset.add_texts()`; otherwise, Rockset generates a unique ID for each document. Either way, `Rockset.add_texts()` returns the IDs of the inserted documents.

To delete these documents, use the `Rockset.delete_texts()` function:

```python
docsearch.delete_texts(ids)
```
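If you plan to delete or re-ingest documents later, one option (an illustrative pattern, not a Rockset requirement) is to derive deterministic IDs from chunk content and pass them to `add_texts()` via its `ids` argument, assuming your LangChain version supports it:

```python
import hashlib

def content_id(text):
    # Deterministic ID derived from the chunk text, so re-inserting the
    # same chunk reuses the same ID instead of minting a new one.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:24]

texts = ["first chunk of the speech", "second chunk of the speech"]
ids = [content_id(t) for t in texts]

# ids = docsearch.add_texts(texts=texts, ids=ids)
# ...later:
# docsearch.delete_texts(ids)
```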
Using Rockset as a Data Source
LangChain document loaders expose a `load` method for loading data as documents from a source, and Rockset can be configured as a data source. The following sections demonstrate how to use Rockset as a document loader in LangChain.
Executing Queries
The `RocksetLoader` class allows you to create LangChain documents from Rockset collections through SQL queries.

Start by initializing a `RocksetLoader` with the following sample code:
```python
from langchain.document_loaders import RocksetLoader
from rockset import RocksetClient, Regions, models

loader = RocksetLoader(
    RocksetClient(api_key="<api key>"),
    models.QueryRequestSql(
        query="SELECT * FROM langchain_demo LIMIT 3"  # SQL query
    ),
    ["text"],  # content columns
    metadata_keys=["author", "date"],  # metadata columns
)
```
Here, you can see that the following query is run:

```sql
SELECT * FROM langchain_demo LIMIT 3
```

The `text` column in the collection is used as the page content, and the `author` and `date` columns are used as metadata. If you do not specify `metadata_keys`, the whole Rockset document is used as metadata.
To execute the query and access an iterator over the resulting `Document`s, run:

```python
loader.lazy_load()
```

To execute the query and access all resulting `Document`s at once, run:

```python
loader.load()
```
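The difference between the two is evaluation: `lazy_load()` yields `Document`s one at a time as the result set is consumed, while `load()` materializes them all up front. The pattern in generic form (a sketch, not the loader's actual implementation):

```python
def lazy_rows(rows):
    # Generator: each row is produced only when the caller asks for it.
    for row in rows:
        yield row

def eager_rows(rows):
    # Drain the generator into a list, like load() vs. lazy_load().
    return list(lazy_rows(rows))

stream = lazy_rows([1, 2, 3])
first = next(stream)  # only one element has been produced so far
```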
Here is an example response from `loader.load()`:

```python
[
    Document(
        page_content="Lorem ipsum dolor sit amet, consectetur adipiscing elit. Maecenas a libero porta, dictum ipsum eget, hendrerit neque. Morbi blandit, ex ut suscipit viverra, enim velit tincidunt tellus, a tempor velit nunc et ex. Proin hendrerit odio nec convallis lobortis. Aenean in purus dolor. Vestibulum orci orci, laoreet eget magna in, commodo euismod justo.",
        metadata={"author": "Joe Biden", "date": "2022-11-13T18:26:45.000000Z"}
    ),
    Document(
        page_content="Integer at finibus odio. Nam sit amet enim cursus lacus gravida feugiat vestibulum sed libero. Aenean eleifend est quis elementum tincidunt. Curabitur sit amet ornare erat. Nulla id dolor ut magna volutpat sodales fringilla vel ipsum. Donec ultricies, lacus sed fermentum dignissim, lorem elit aliquam ligula, sed suscipit sapien purus nec ligula.",
        metadata={"author": "Donald Trump", "date": "2022-11-13T18:28:53.000000Z"}
    ),
    Document(
        page_content="Morbi tortor enim, commodo id efficitur vitae, fringilla nec mi. Nullam molestie faucibus aliquet. Praesent a est facilisis, condimentum justo sit amet, viverra erat. Fusce volutpat nisi vel purus blandit, et facilisis felis accumsan. Phasellus luctus ligula ultrices tellus tempor hendrerit. Donec at ultricies leo.",
        metadata={"author": "Barack Obama", "date": "2022-11-13T18:49:04.000000Z"}
    )
]
```
Content Columns
You can choose to use multiple columns as content:

```python
from langchain.document_loaders import RocksetLoader
from rockset import RocksetClient, Regions, models

loader = RocksetLoader(
    RocksetClient(Regions.usw2a1, "<api key>"),
    models.QueryRequestSql(query="SELECT * FROM langchain_demo WHERE id=38 LIMIT 1"),
    ["sentence1", "sentence2"],  # TWO content columns
)
```

If the "sentence1" field is "This is the first sentence." and the "sentence2" field is "This is the second sentence.", the `page_content` of the resulting `Document` would be:

```
This is the first sentence.
This is the second sentence.
```
You can define your own function to join content columns by setting the `content_columns_joiner` argument in the `RocksetLoader` constructor. `content_columns_joiner` is a method that takes a `List[Tuple[str, Any]]` as its argument, representing a list of (column name, column value) tuples. By default, this method joins each column value with a new line.
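Because `content_columns_joiner` is just a callable over (column name, column value) tuples, you can prototype one without a Rockset connection. Here the default newline-joining behavior described above is modeled directly (sample data, not fetched from Rockset):

```python
rows = [
    ("sentence1", "This is the first sentence."),
    ("sentence2", "This is the second sentence."),
]

# Model of the default joiner: keep only the values, one per line.
default_joiner = lambda docs: "\n".join(str(value) for _, value in docs)
page_content = default_joiner(rows)
```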
For example, if you wanted to join sentence1 and sentence2 with a space instead of a new line, you could set `content_columns_joiner` like so:

```python
from langchain.document_loaders import RocksetLoader
from rockset import RocksetClient, Regions, models

RocksetLoader(
    RocksetClient(Regions.usw2a1, "<api key>"),
    models.QueryRequestSql(query="SELECT * FROM langchain_demo WHERE id=38 LIMIT 1"),
    ["sentence1", "sentence2"],
    content_columns_joiner=lambda docs: " ".join(
        [doc[1] for doc in docs]
    ),  # join with a space instead of "\n"
)
```
The `page_content` of the resulting `Document` would be:

```
This is the first sentence. This is the second sentence.
```
Oftentimes you want to include the column name in the `page_content`. You can do this by running:

```python
from langchain.document_loaders import RocksetLoader
from rockset import RocksetClient, Regions, models

RocksetLoader(
    RocksetClient(Regions.usw2a1, "<api key>"),
    models.QueryRequestSql(query="SELECT * FROM langchain_demo WHERE id=38 LIMIT 1"),
    ["sentence1", "sentence2"],
    content_columns_joiner=lambda docs: "\n".join(
        [f"{doc[0]}: {doc[1]}" for doc in docs]
    ),
)
```
This would result in the following `page_content`:

```
sentence1: This is the first sentence.
sentence2: This is the second sentence.
```
Using Rockset for Chat History
Rockset can be used to store chat history. LangChain's `RocksetChatMessageHistory` class is responsible for remembering chat interactions that can be passed into a model.

Construct a `RocksetChatMessageHistory` object:
```python
from langchain.memory.chat_message_histories import RocksetChatMessageHistory
from rockset import RocksetClient

history = RocksetChatMessageHistory(
    session_id="MySession",
    client=RocksetClient(),
    collection="langchain_demo",
    sync=True,
)
```
If the collection `langchain_demo` does not exist in the `commons` workspace, LangChain will create it.

Add chat messages:

```python
history.add_user_message("hi!")
history.add_ai_message("whats up?")
```

Get message history:

```python
print(history.messages)
```

Clear chat history:

```python
history.clear()
```
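Conceptually, each message is appended under its `session_id`, and `messages` returns that session's transcript in order. A toy in-memory model of this behavior (for intuition only; the real class persists messages to the Rockset collection):

```python
from collections import defaultdict

class ToyChatHistory:
    # In-memory stand-in for per-session chat history.
    def __init__(self, session_id):
        self.session_id = session_id
        self._store = defaultdict(list)

    def add_user_message(self, text):
        self._store[self.session_id].append(("human", text))

    def add_ai_message(self, text):
        self._store[self.session_id].append(("ai", text))

    @property
    def messages(self):
        return list(self._store[self.session_id])

    def clear(self):
        self._store[self.session_id].clear()

history_demo = ToyChatHistory("MySession")
history_demo.add_user_message("hi!")
history_demo.add_ai_message("whats up?")
```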