LlamaIndex

LlamaIndex is an open-source framework for developing applications powered by language models. LlamaIndex offers tools that facilitate data ingestion, structuring, and storage for LLM-backed apps.

We'll walk through a demonstration of how to use Rockset as a vector store in LlamaIndex.

Tutorial

In this example, we'll use OpenAI's text-embedding-ada-002 model to generate embeddings and Rockset as a vector store to store them. We'll ingest text from a file and ask questions about its content.

Setting Up Your Environment

  1. Create an API key in the Rockset console and set the ROCKSET_API_KEY environment variable.
    Find your API server here and set the ROCKSET_API_SERVER environment variable.
    Set the OPENAI_API_KEY environment variable.
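These variables can be exported in your shell before running the tutorial. The values below are placeholders; substitute your own credentials and region:

```shell
# Placeholder values -- replace with your own credentials.
export ROCKSET_API_KEY="<your Rockset API key>"
export ROCKSET_API_SERVER="https://api.use1a1.rockset.com"  # your region's API server
export OPENAI_API_KEY="<your OpenAI API key>"
```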

  2. Install the dependencies.

pip3 install llama_index rockset 
  3. LlamaIndex allows you to ingest data from a variety of sources.
    For this example, we'll read from a text file named constitution.txt, a transcript of the United States Constitution, found here.

Data ingestion

  1. Use LlamaIndex's SimpleDirectoryReader class to convert the text file to a list of Document objects.
from llama_index import SimpleDirectoryReader

docs = SimpleDirectoryReader(input_files=["{path to}/constitution.txt"]).load_data()
  2. Instantiate the LLM and service context.
from llama_index import ServiceContext
from llama_index.llms import OpenAI

llm = OpenAI(temperature=0.8, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)
  3. Instantiate the vector store and storage context.
from llama_index import StorageContext
from llama_index.vector_stores import RocksetVectorStore

vector_store = RocksetVectorStore.with_new_collection(
    collection="llamaindex_demo",
    dimensions=1536  # optional param to configure the ingest transformation
)                    # https://rockset.com/docs/vector-functions/#vector_enforce
storage_context = StorageContext.from_defaults(vector_store=vector_store)
  4. Add documents to the llamaindex_demo collection and create an index.
from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    docs,
    storage_context=storage_context,
    service_context=service_context
)

Querying

  1. Ask a question about your document and generate a response.
response = (
    index
    .as_query_engine(service_context=service_context)
    .query("What is the duty of the president?")
)

print(str(response))
  2. Run the program.
$ python3 main.py
The duty of the president is to faithfully execute the Office of President of the United States, preserve, protect and defend the Constitution of the United States, serve as the Commander in Chief of the Army and Navy, grant reprieves and pardons for offenses against the United States (except in cases of impeachment), make treaties and appoint ambassadors and other public ministers, take care that the laws be faithfully executed, and commission all the officers of the United States.

Metadata Filtering

Metadata filtering allows you to retrieve documents that match specific filters.

  1. Add nodes to your vector store and create an index.
from llama_index.vector_stores import RocksetVectorStore
from llama_index import VectorStoreIndex, StorageContext
from llama_index.vector_stores.types import NodeWithEmbedding
from llama_index.schema import TextNode

nodes = [
    NodeWithEmbedding(
        node=TextNode(
            text="Apples are blue",
            metadata={"type": "fruit"}, 
        ),
        embedding=[...],
    )
]
index = VectorStoreIndex(
    nodes, 
    storage_context=StorageContext.from_defaults(
        vector_store=RocksetVectorStore(
            collection="llamaindex_demo"
        )
    )
)
  2. Define metadata filters.
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(filters=[
    ExactMatchFilter(
        key="type", 
        value="fruit"
    )
])
  3. Retrieve relevant documents that satisfy the filters.
retriever = index.as_retriever(filters=filters)
retriever.retrieve("What colors are apples?")

Indexing from Collections

If nodes already exist in a collection, you can create an index from the collection.

  1. Instantiate the vector store.
from llama_index.vector_stores import RocksetVectorStore

vector_store = RocksetVectorStore(collection="llamaindex_demo")
  2. Instantiate the LLM and service context.
from llama_index import ServiceContext
from llama_index.llms import OpenAI

llm = OpenAI(temperature=0.8, model="gpt-3.5-turbo")
service_context = ServiceContext.from_defaults(llm=llm)
  3. Create the index.
from llama_index import VectorStoreIndex

index = VectorStoreIndex.from_vector_store(
    vector_store, 
    service_context=service_context
)
  4. Ask a question.
response = index.as_query_engine().query("What is the duty of the president?")
print(str(response))

Configuration

  • collection: Name of the collection to query (required).
RocksetVectorStore(collection="my_collection")
  • workspace: Name of the workspace containing the collection. Defaults to "commons".
RocksetVectorStore(workspace="my_workspace")
  • api_key: The API key to use to authenticate Rockset requests. Ignored if client is passed in. Defaults to the ROCKSET_API_KEY environment variable.
RocksetVectorStore(api_key="<my key>")
  • api_server: The API server to use for Rockset requests. Ignored if client is passed in. Defaults to the ROCKSET_API_SERVER environment variable, or "https://api.use1a1.rockset.com" if ROCKSET_API_SERVER is not set.
from rockset import Regions
RocksetVectorStore(api_server=Regions.euc1a1)
  • client: Rockset client object used to execute Rockset requests. If not specified, a client object is constructed internally from the api_key parameter (or ROCKSET_API_KEY environment variable) and the api_server parameter (or ROCKSET_API_SERVER environment variable).
from rockset import RocksetClient
RocksetVectorStore(client=RocksetClient(api_key="<my key>"))
  • embedding_col: The name of the database field containing embeddings. Defaults to "embedding".
RocksetVectorStore(embedding_col="my_embedding")
  • metadata_col: The name of the database field containing node data. Defaults to "metadata".
RocksetVectorStore(metadata_col="node")
  • distance_func: The distance function used to measure vector similarity. Defaults to cosine similarity.
RocksetVectorStore(distance_func=RocksetVectorStore.DistanceFunc.DOT_PRODUCT)