Running Vector Search with Rockset

Loading Vector Data

The first step in enabling Vector Search is creating a collection. When creating the collection, wrap your vector fields in a call to VECTOR_ENFORCE inside the collection's ingest transformation. VECTOR_ENFORCE ensures that all incoming vectors are uniform in length and type, returning a NULL value on failure. In addition to performing uniformity checks, it signals to Rockset that the array should be treated as a vector, allowing Rockset to make indexing optimizations such as storing the array compactly for fast access and skipping the creation of an inverted index entry for each vector element.

    SELECT
        title,
        author,
        VECTOR_ENFORCE(book_embedding, 5, 'float') as book_embedding
    FROM
        _input
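
To see the enforcement behavior in isolation, you can evaluate VECTOR_ENFORCE directly. A minimal sketch, assuming the 5-dimension float enforcement from the transformation above:

    -- A 5-element float vector passes enforcement and is returned unchanged
    SELECT VECTOR_ENFORCE([0.1, 0.2, 0.3, 0.4, 0.5], 5, 'float') AS ok_vector

    -- A 3-element vector fails the length check, so NULL is returned instead
    SELECT VECTOR_ENFORCE([0.1, 0.2, 0.3], 5, 'float') AS bad_vector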

Generating Embeddings

Once your Collection has been created, you will need to generate Embeddings for your data if you have not done so already. Depending on your use case, you can develop your own models or use open-source and proprietary models provided by third parties. If you plan to perform vector search on text data, OpenAI offers an easy-to-use text embedding API, and model-hosting platforms like Hugging Face offer free, open-source text embedding models.

Popular third-party language models typically have simple integrations with Python orchestration tools like LangChain. You can use a model of your choice along with Rockset's LangChain Integration to easily generate and store your embeddings.

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.vectorstores.rocksetdb import Rockset

from rockset import RocksetClient

# Set up using one of the embedding generator integrations available at
# langchain.embeddings.*.
embeddings = OpenAIEmbeddings()  # requires the OPENAI_API_KEY environment variable

# Connect to Rockset. ROCKSET_API_KEY, COLLECTION_NAME, WORKSPACE, TEXT_KEY,
# and EMBEDDING_KEY are placeholders for your own values.
rockset_client = RocksetClient(api_key=ROCKSET_API_KEY)

docsearch = Rockset(
    client=rockset_client,
    embeddings=embeddings,
    collection_name=COLLECTION_NAME,
    workspace=WORKSPACE,
    text_key=TEXT_KEY,            # field storing each document's raw text
    embedding_key=EMBEDDING_KEY,  # field wrapped in VECTOR_ENFORCE above
)

# `docs` is a list of LangChain Documents, e.g. produced by TextLoader
# and CharacterTextSplitter (see imports above)
ids = docsearch.add_texts(
    texts=[d.page_content for d in docs],
    metadatas=[d.metadata for d in docs],
)

As an alternative to using LangChain or Rockset's Python client to upload your embeddings, you can load data into another primary store like S3 and ingest it in real time using one of Rockset's data source integrations.

KNN Search

Performing a K-Nearest Neighbors (KNN) search on your embeddings directly in Rockset is as simple as adding an ORDER BY clause that sorts by the distance to your query vector. To calculate the distance between two vectors, use one of Rockset's distance/similarity functions: EUCLIDEAN_DIST, COSINE_SIM, or DOT_PRODUCT. Cosine similarity is just the normalized dot product of two vectors, so if your vectors are already normalized to unit length (an L2 norm of 1), these two functions are equivalent.

    SELECT
        title,
        author
    FROM
        book_dataset
    ORDER BY
        COSINE_SIM([0, 1, 0, 1, 0], book_embedding) DESC
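
Because the equivalence for normalized vectors is easy to verify by hand, here is a small sketch comparing the two functions on a pair of unit-length vectors:

    -- [0.6, 0.8] and [0.8, 0.6] both have an L2 norm of 1,
    -- so cosine similarity and dot product agree (both 0.96)
    SELECT
        COSINE_SIM([0.6, 0.8], [0.8, 0.6]) AS cos_sim,
        DOT_PRODUCT([0.6, 0.8], [0.8, 0.6]) AS dot_prod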

In practice, we should never need to type a query vector in by hand, as the meaning of each encoded value is typically opaque and understood only by the encoding model. To pass in query vectors, turn your query into a Query Lambda and make the query vector a parameter. This lets us generate embeddings on the fly and perform similarity search on them.

SELECT
    title,
    author
FROM
    book_dataset
ORDER BY
    COSINE_SIM(:target_embedding, book_embedding) DESC

When creating the :target_embedding parameter for the Query Lambda, set its value to the result of a query like the one below, using the title of whichever book you want to search against:

SELECT
    book_embedding
FROM
    book_dataset
WHERE
    title = 'Sapiens: A Brief History of Humankind'
LIMIT
    1

An important benefit of using Rockset for vector search is that joining tables and checking complex predicates can easily coexist with a similarity search: simply update the WHERE clause and JOIN in any extra information you are interested in. In vector search parlance, this is commonly referred to as metadata filtering.

SELECT
    ds.title,
    ds.author
FROM
    book_dataset ds
    JOIN book_metadata m ON ds.isbn = m.isbn
WHERE
    m.publish_date > DATE(2010, 12, 26)
    AND m.rating >= 4
    AND m.price < 50
ORDER BY
    COSINE_SIM(:target_embedding, ds.book_embedding) DESC
LIMIT
    30

ANN Search

πŸ› οΈ

ANN Search is currently in Beta.

Collections must have been created after October 19, 2023 to utilize ANN Search.

When the dataset is small, it is easy to scan the whole dataset when performing a KNN search: often all of the vectors that need to be examined fit in memory, allowing for fast scans and comparisons. This is especially true when the query includes selective predicates. In some cases, however, you may need to query a billion or more vectors (at 1536 dimensions of 4-byte floats, a billion vectors is roughly 6 TB of raw vector data). At that scale, the cost of looking up and comparing every stored vector is too high and leads to significant latency increases; an exact K-Nearest Neighbor search becomes too expensive. Thankfully, in most cases exact ordering is not necessary, and a best-effort, or Approximate Nearest Neighbor (ANN), search is enough.

Creating the Index

In order to perform approximate scans, we need to add extra structure to our data. To do this, run the DDL command CREATE SIMILARITY INDEX, which creates an index on your embedding field path. Once the command finishes, vector search queries will be able to transparently take advantage of the new index, vastly speeding up nearest neighbor search.

Index creation has the form:

CREATE
    <SIMILARITY|DISTANCE> INDEX <Name>
ON
    FIELD <Collection | RRN>:<FieldPath> DIMENSION <Vector Dimension> AS <Factory String>

Note the option to create a DISTANCE index, which orders vectors by a distance metric rather than a similarity metric. Currently the inner product of two vectors is the only supported similarity metric, and L2 distance is the only supported distance metric. The metric must be decided at creation time and affects which functions can utilize the index.

On creation, we specify the name of the new index along with its associated Collection, either by name or RRN, and the dimension of the vectors that the index will be scanning and training on. The <Factory String> is a formatted string that specifies underlying type information and configuration for the new index.

Factory String

The factory string has the form: <Provider>:<Parameters>:<ProviderConfig>


Provider: The library or creator of the index implementation.

Available providers include:

  • faiss: The FAISS library for similarity search.

Parameters: Rockset-specific parameters for querying and maintaining the index. Assigning a parameter value has the syntax param=value. Setting parameters is optional; if unset, the default value is used.

Available parameters include:

  • nprobe: The number of centroids (posting lists) that will be visited during a query (default: 1).
  • minprobe, maxprobe (not yet available): Specify a minimum number of posting lists to traverse, expanding the search as necessary to fulfill a passed-in limit until maxprobe lists have been traversed.

ProviderConfig: Provider-specific index construction string.

Construction strings for providers:

  • faiss: This string is defined by the FAISS library for index factory construction. A formal outline of the grammar for the factory string can be found in the FAISS documentation. FAISS supports several families of indexes, but Rockset currently supports only the IVF family.

Putting it together we get:

CREATE
    SIMILARITY INDEX book_catalogue_embeddings_ann_index
ON
    FIELD commons.book_dataset:book_embedding DIMENSION 1536 AS 'faiss::IVF256,Flat';

Here we are creating an index named "book_catalogue_embeddings_ann_index" on our "book_embedding" field path. We specify the dimension of the input vectors along with a factory string that indicates what type of index to create.

We specified that we want to use the FAISS IVF index with 256 centroids. The centroid value should be selected based on your specific collection (see specifying a centroid value below). Each centroid will be used to generate a posting list which Rockset will be able to use to quickly look up documents similar to a query vector.

Flat indicates that vectors within each posting list are stored uncompressed, using FAISS's flat encoding. Note we did not specify any parameters, so the default nprobe value for the index will be 1.
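
The DISTANCE variant follows the same form. A sketch, with a hypothetical index name, that would back L2-distance queries on the same field:

CREATE
    DISTANCE INDEX book_catalogue_embeddings_l2_index
ON
    FIELD commons.book_dataset:book_embedding DIMENSION 1536 AS 'faiss::IVF256,Flat';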

πŸ’‘

Specifying a centroid value

You must specify a centroid value that gives you a sufficient number of samples in each cluster based on your total number of vectors. We recommend a value around 16 × sqrt(total number of vectors / 64), as long as this value is no more than (total number of vectors / 64); if it is, use a smaller factor. This has to do with how we partition documents for the clustering of the collection on the backend.

For example, if you have 64,000 vectors in a collection, you may use a centroid value of 505 and specify IVF505, since 16 × sqrt(64000 / 64) ≈ 505.

If you are receiving an 'Insufficient samples' error, this is an indicator that you need to adjust your centroid value; a quick way to compute a suggested value is sketched below.
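
As a sanity check, you can compute the suggested centroid count directly from your data. A sketch, assuming the book_dataset collection used throughout this page:

-- Recommended centroids: 16 * sqrt(N / 64), where N is the number of vectors
SELECT
    CAST(16 * SQRT(COUNT(*) / 64.0) AS int) AS suggested_centroids
FROM
    book_dataset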

Rockset's Vector Search Indexing

The factory string previously discussed currently supports one form of ANN architecture: Rockset's index filtering on top of FAISS Inverted File (IVF) indexes. Rockset trains a set of centroids on the collection's vector data, with the number of centroids given by the N in the IVF{N} term of the factory string. Each centroid acts as a representative for a cluster of similar vectors and their associated documents. Together the centroids form an index on the vector data that can be used at query time; for each centroid, Rockset maintains a posting list that can be iterated.

At query time, FAISS gathers the set of centroids (clusters/posting lists) in the index that have the smallest coarse distance from the query vector. The number of centroids that are retrieved and searched is based on the nprobe parameter, whose default is set at index creation but can also be specified for each individual query. From there, Rockset simply iterates the posting lists and returns the closest vectors.

🚧

nprobe Note

When searching nprobe posting lists, fewer results may be returned than the requested limit due to a selective predicate. In this case, increasing the number of centroids probed may increase the number of returned results.

Building the Index

After running the CREATE command, an RRN (Rockset's globally unique identifier) will be returned as the result. At this point Rockset has started building the index. Building the index requires Rockset to scan the vector data for training and then perform updates to index all documents.

πŸ’‘

Index Building Tip

Building an index is a memory intensive operation as clustering is performed in memory. For large collections it is recommended that only one similarity index be trained at a time.

You can check the state of the index by querying Rockset's _system workspace.

SELECT
    *
FROM
    _system.similarity_index
WHERE
    rrn = '<RRN>'

This will return information about the default nprobe for the index, the factory string it is associated with, and other information that was provided at creation time. It also includes the index_status, which can be in state 'TRAINING' or 'READY'. The index is only usable once it has reached the 'READY' state.
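
For example, a quick readiness check filtering on the index_status column described above (a sketch):

SELECT
    rrn,
    index_status
FROM
    _system.similarity_index
WHERE
    index_status = 'READY'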

Given the number of centroids C, Rockset must read at least C * 64 vectors before training completes (for IVF256, that is at least 256 * 64 = 16,384 vectors). If not enough vectors are available to train on, the index will stay in the 'TRAINING' state and wait until more vectors are ingested.

Querying on the Index

Querying using the index happens as before, but we must specify that approximate distance results are acceptable. This lets Rockset's Cost Based Optimizer (CBO) decide whether a KNN search or an ANN search will be more efficient. So long as the CBO is enabled, the choice to use the index is completely transparent to the user.

πŸ’‘

Force index use with HINTs

There are situations where the CBO may not have collected enough stats to confidently choose the index, so to force the index to be used you may add HINT(access_path=index_similarity_search) after the FROM clause.

If the index is not yet ready, approximate similarity search queries will return an error. If you would still like to query without an index you may force the optimizer to perform a brute force search using the column index with HINT(access_path=column_scan).

If there are multiple usable indexes on the same field of the collection, the oldest available index is used for the query by default. You can override this behavior with HINT(similarity_index='<Name>').
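
Putting these hints together, a sketch that forces the index created earlier (hint placement follows the guidance above):

SELECT
    title,
    author
FROM
    book_dataset HINT(access_path=index_similarity_search)
ORDER BY
    APPROX_DOT_PRODUCT(:target_embedding, book_embedding) DESC
LIMIT
    10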

SELECT
    ds.title,
    ds.author
FROM
    book_dataset ds
    JOIN book_metadata m ON ds.isbn = m.isbn
WHERE
    m.publish_date > DATE(2010, 12, 26)
    AND m.rating >= 4
    AND m.price < 50
ORDER BY
    APPROX_DOT_PRODUCT(:target_embedding, ds.book_embedding) OPTION(nprobe=2) DESC
LIMIT
    30

The functions APPROX_DOT_PRODUCT and APPROX_EUCLIDEAN_DIST are approximate versions of DOT_PRODUCT and EUCLIDEAN_DIST respectively. Each will try to use an applicable index if available; if there is no index for the required field and distance type, the exact sister function is invoked, resulting in a brute force KNN search. To change the number of posting lists iterated in an ANN search, set OPTION(nprobe=<# of posting lists>) on a per-query basis. You cannot specify more posting lists to query than there are centroids in the index.
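
For the distance case, a corresponding sketch; note the ascending sort, since smaller distances mean closer vectors (this assumes a DISTANCE index exists on the field, like the hypothetical L2 index sketched earlier):

SELECT
    title,
    author
FROM
    book_dataset
ORDER BY
    APPROX_EUCLIDEAN_DIST(:target_embedding, book_embedding) ASC
LIMIT
    10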

When searching, Rockset tries to push predicate checks into the similarity index scan itself to avoid having to "pre" or "post" filter results. A result limit is required for the index to be used, since without a limit a full scan of the collection would be performed through the index. Rockset will try to push the limit down to the similarity index, but if it fails due to a predicate that cannot be resolved within the index, the index will not be used and the optimizer falls back to a brute force KNN search.

Dropping an Index

To delete an index, issue the DROP command on it. Once performed, queries will no longer use the index and Rockset will begin the process of cleaning it up.

DROP <SIMILARITY|DISTANCE> INDEX <Index Name>

For example, to clear the index "book_catalogue_embeddings_ann_index" that we created, we would run the following query:

DROP SIMILARITY INDEX book_catalogue_embeddings_ann_index

πŸ“˜

Want to learn more?

Check out this workshop from one of our Solutions Engineers to learn more about utilizing Vector Search in Rockset.

Check out this blog for more information on how Vector Search was integrated into Rockset.