Running Vector Search with Rockset
Loading Vector Data
The first step in enabling Vector Search is creating a collection. When creating the collection, you should make sure to wrap your vector fields in a call to VECTOR_ENFORCE inside the collection's ingest transformation. VECTOR_ENFORCE ensures that all incoming vectors are uniform in length and type and will return a NULL value on failure. In addition to performing uniformity checks, it signals to Rockset that this array should be treated as a vector, allowing Rockset to make indexing performance optimizations like making sure the array is stored compactly for fast access and avoiding the creation of an inverted index entry for each vector element.
SELECT
title,
author,
VECTOR_ENFORCE(book_embedding, 5, 'float') as book_embedding
FROM
_input
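To make the uniformity check concrete, here is a minimal Python sketch of the kind of validation VECTOR_ENFORCE performs. It is a conceptual model only, not Rockset's implementation: the helper name `vector_enforce`, the strict handling of element types, the set of accepted type names, and the use of `None` to stand in for SQL NULL are all assumptions made for illustration.

```python
def vector_enforce(value, length, elem_type):
    """Return `value` if it is a list of `length` elements of the expected
    type, otherwise None (standing in for SQL NULL).

    Hypothetical helper for illustration; Rockset's actual coercion rules
    may differ (for example, in how ints are treated in a 'float' vector).
    """
    expected = {"float": float, "int": int}[elem_type]  # assumed type names
    if not isinstance(value, list) or len(value) != length:
        return None
    # bool is a subclass of int in Python, so exclude it explicitly.
    if any(isinstance(x, bool) or not isinstance(x, expected) for x in value):
        return None
    return value


ok = vector_enforce([0.1, 0.2, 0.3, 0.4, 0.5], 5, "float")
too_short = vector_enforce([0.1, 0.2], 5, "float")
wrong_type = vector_enforce([0.1, "x", 0.3, 0.4, 0.5], 5, "float")
```

Here `ok` keeps its vector, while `too_short` and `wrong_type` both come back as `None`, mirroring the NULL-on-failure behavior described above.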
Generating embeddings
Once your Collection has been created, you will need to generate Embeddings for your data if you have not done so already. Depending on your use case, you can develop your own models or use open-source and proprietary models provided by third parties. If you are planning on performing a vector search on text data, OpenAI offers a very easy to use text embedding generation API. Other model hosting platforms like Hugging Face offer free open-source text embedding models for use.
Popular third party language models typically have simple to use integrations with Python orchestration tools like LangChain. You can use a model of your choice along with Rockset's LangChain Integration to easily generate and store your embeddings.
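from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
from langchain.vectorstores.rocksetdb import Rockset

# NOTE: rockset_client, COLLECTION_NAME, WORKSPACE, TEXT_KEY and EMBEDDING_KEY
# are assumed to be defined earlier in your script.

# Load a text file and split it into chunks.
# "books.txt" is a placeholder path; the chunk sizes are example values.
documents = TextLoader("books.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Set up using one of the embedding generator integrations available at
# langchain.embeddings.*.
embeddings = OpenAIEmbeddings()  # Verify OPENAI_API_KEY environment variable

docsearch = Rockset(
    client=rockset_client,
    embeddings=embeddings,
    collection_name=COLLECTION_NAME,
    workspace=WORKSPACE,
    text_key=TEXT_KEY,
    embedding_key=EMBEDDING_KEY,
)

# Generate an embedding for each chunk and store it, with its metadata, in Rockset.
ids = docsearch.add_texts(
    texts=[d.page_content for d in docs],
    metadatas=[d.metadata for d in docs],
)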
As an alternative to using LangChain or Rockset's Python client to upload your embeddings, you can load data into another primary store like S3 and ingest it in real time using one of Rockset's data source integrations.
KNN Search
Performing a K-Nearest Neighbors (KNN) search on your embeddings directly in Rockset is as simple as adding an ORDER BY clause to sort by the distance to your query vector. To calculate the distance between two vectors, we can use one of Rockset's distance/similarity functions: EUCLIDEAN_DIST, COSINE_SIM, or DOT_PRODUCT. Cosine similarity is just the normalized dot product of two vectors, so if the vectors are already normalized (each has unit length, meaning the squares of its elements sum to 1), these two functions are equivalent.
SELECT
title,
author
FROM
book_dataset
ORDER BY
COSINE_SIM([0, 1, 0, 1, 0], book_embedding) DESC
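The ordering performed by the query above, and the relationship between cosine similarity and dot product, can be sketched in plain Python. This is a brute-force illustration under stated assumptions, not how Rockset evaluates the query: the helper names and the three sample rows are invented for the example.

```python
import math


def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_sim(a, b):
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


def normalize(v):
    """Scale v to unit length (L2 norm of 1)."""
    norm = math.sqrt(dot_product(v, v))
    return [x / norm for x in v]


def knn(query, rows, k):
    """Brute-force KNN: rank every (title, embedding) row by cosine
    similarity to the query, highest first, and keep the top k titles."""
    ranked = sorted(rows, key=lambda row: cosine_sim(query, row[1]), reverse=True)
    return [title for title, _ in ranked[:k]]


query = [0, 1, 0, 1, 0]
rows = [  # made-up sample data
    ("Book A", [0.0, 0.9, 0.1, 0.8, 0.0]),
    ("Book B", [1.0, 0.0, 1.0, 0.0, 1.0]),
    ("Book C", [0.2, 0.5, 0.2, 0.5, 0.2]),
]
top_two = knn(query, rows, k=2)

# For unit-length vectors, cosine similarity and dot product agree.
q_unit, a_unit = normalize(query), normalize(rows[0][1])
same = math.isclose(cosine_sim(q_unit, a_unit), dot_product(q_unit, a_unit))
```

With this sample data, `top_two` is `["Book A", "Book C"]` and `same` is `True`.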
In practice, we should never have a reason to type a query vector in by hand as typically the meaning of each encoded value is opaque and understood only by the encoding model. To pass in query vectors, turn your query into a Query Lambda and make the query vector a parameter. This lets us generate embeddings on the fly and perform similarity search on them.
SELECT
title,
author
FROM
book_dataset
ORDER BY
COSINE_SIM(:target_embedding, book_embedding) DESC
The value of the :target_embedding parameter will be the result of a query like the one below, with the title of whichever book you want to use:
SELECT
book_dataset.book_embedding
FROM
book_dataset
WHERE
book_dataset.title = 'Sapiens: A Brief History of Humankind'
LIMIT
1
An important benefit to using Rockset for vector search is that joining tables and checking complex predicates can easily coexist with a similarity search. You simply update the WHERE clause and JOIN any extra information you are interested in. In vector search parlance this is commonly referred to as metadata filtering.
SELECT
ds.title,
ds.author
FROM
book_dataset ds
JOIN book_metadata m ON ds.isbn = m.isbn
WHERE
m.publish_date > DATE(2010, 12, 26)
and m.rating >= 4
and m.price < 50
ORDER BY
COSINE_SIM(:target_embedding, ds.book_embedding) DESC
LIMIT
30
ANN Search
ANN Search is currently in Beta.
Collections must have been created after October 19, 2023 to utilize ANN Search.
When the dataset is small, it is easy to scan the whole dataset when performing a KNN search. Many times all of the vectors that need to be looked at can fit in memory allowing for fast scans and comparisons. This is especially true if we add many selective predicates to our query. However, in some cases it may be inescapable that we want to query over a billion or more vectors. In these cases the cost of looking up and comparing every single one of our stored vectors is too high and will lead to significant latency increases. At a certain point an exact K Nearest Neighbor search becomes too expensive to scale. Thankfully in most cases exact ordering is not necessary and we really only need a best effort, or Approximate Nearest Neighbor (ANN) search.
Creating the index
In order to perform approximate scans we need to add extra structure to our data. To do this we will run the DDL command CREATE SIMILARITY INDEX, which will create an index on our embedding field path. Once the command finishes, vector search queries will be able to transparently take advantage of the new index. This vastly speeds up nearest neighbor search.
Index creation has the form:
CREATE
<SIMILARITY|DISTANCE> INDEX <NAME>
ON
FIELD <Collection | RRN>:<FieldPath> DIMENSION <Vector Dimension> AS <Factory String>
Note the option to create a DISTANCE index, which will create an index that orders vectors based on a distance metric rather than a similarity metric. Currently the inner product of two vectors is the only supported similarity metric, while L2 distance is the only supported distance metric. This metric must be decided at creation time and will affect which functions can utilize the index.
On creation, we specify the name of our new index along with its associated Collection, either by name or RRN, and the dimension of the vectors that the index will be scanning and training on. The <Factory String> is a formatted string that specifies underlying type information and configuration for the new index.
Factory String
The factory string has the form: <Provider>:<Parameters>:<ProviderConfig>
Provider: The library or creator of the index implementation.
Available providers include:
- faiss: The FAISS library for similarity search.
Parameters: Rockset-specific parameters for querying and maintaining the index. Assigning a parameter value has the syntax param=value. Setting parameters is optional; if a parameter is unset, its default value will be used.
Available parameters include:
- nprobe: The number of centroids (posting lists) that will be visited during a query (default: 1).
- *minprobe, maxprobe: Specify a minimum number of posting lists to traverse and expand the search as necessary to fulfill a passed in limit up until maxprobe lists have been traversed. *Not yet available.
ProviderConfig: Provider specific index construction string.
Construction strings for providers:
- faiss: This string is defined by the FAISS library for index factory construction. A formal outline of the grammar for the factory string can be found here. FAISS supports several forms of indexes, but Rockset only supports the IVF family of indexes mentioned here.
Putting it together we get:
CREATE
SIMILARITY INDEX book_catalogue_embeddings_ann_index
ON
FIELD commons.book_dataset:book_embedding DIMENSION 1536 AS 'faiss::IVF256,Flat';
Here we are creating an index named "book_catalogue_embeddings_ann_index" on our "book_embedding" field path. We specify the dimension of the input vectors along with a factory string that indicates what type of index to create.
We specified that we want to use the FAISS IVF index with 256 centroids. The centroid value should be selected based on your specific collection (see specifying a centroid value below). Each centroid will be used to generate a posting list which Rockset will be able to use to quickly look up documents similar to a query vector.
Flat indicates we want to use the Flat coarse quantizer for our IVF index. Note that we did not specify any parameters, so the default nprobe value for the index will be 1.
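The three-part layout of the factory string can be illustrated by taking the example value apart in Python. This is only a sketch for reading such strings, not a Rockset or FAISS API: the helper names are hypothetical, and the comma used to separate multiple parameters is an assumption, since only the single param=value form is described here.

```python
import re


def parse_factory_string(factory):
    """Split '<Provider>:<Parameters>:<ProviderConfig>' into its three parts."""
    provider, params_text, provider_config = factory.split(":", 2)
    params = {}
    # ASSUMPTION: multiple parameters are comma-separated; only the
    # single-assignment form param=value is documented.
    for assignment in filter(None, params_text.split(",")):
        name, value = assignment.split("=", 1)
        params[name.strip()] = value.strip()
    return provider, params, provider_config


def ivf_centroids(provider_config):
    """Extract N from an 'IVF{N},...' construction string, or None."""
    match = re.match(r"IVF(\d+)", provider_config)
    return int(match.group(1)) if match else None


provider, params, config = parse_factory_string("faiss::IVF256,Flat")
centroids = ivf_centroids(config)
nprobe = int(params.get("nprobe", 1))  # nprobe defaults to 1 when unset
```

For 'faiss::IVF256,Flat' this yields the provider `faiss`, an empty parameter set (so `nprobe` falls back to 1), and 256 centroids.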
Specifying a centroid value
You must specify a centroid value that gives you a sufficient number of samples in each cluster based on your total number of vectors. We recommend a value of around 16 times the square root of (total number of vectors / 64), as long as this value is no more than (total number of vectors / 64). If it is, use a smaller factor. This has to do with how we partition documents for the clustering of the collection on the backend.
For example, if you have 64,000 vectors in a collection, you may use a centroid value of 505 and specify IVF505, since 16 x sqrt(64000 / 64) ≈ 505.96, which rounds down to 505. If you are receiving an 'Insufficient samples' error, this is an indicator that you need to adjust your centroid value.
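The recommendation above can be written as a small Python helper. It is a sketch under stated assumptions rather than an official formula implementation: the function name is hypothetical, the result is rounded down, and "use a smaller factor" is interpreted as lowering the factor one step at a time until the value fits under the cap.

```python
import math


def suggested_centroids(total_vectors, factor=16):
    """Suggest N for IVF{N}: factor * sqrt(total / 64), kept at or below total / 64."""
    cap = total_vectors // 64
    if cap < 1:
        raise ValueError("need at least 64 vectors to train a single centroid")
    root = math.sqrt(total_vectors / 64)
    # If the value exceeds the cap, fall back to a smaller factor.
    while factor > 1 and math.floor(factor * root) > cap:
        factor -= 1
    return max(1, math.floor(factor * root))


example = suggested_centroids(64_000)  # 16 * sqrt(1000), rounded down
small = suggested_centroids(6_400)     # 16 * 10 = 160 exceeds the cap of 100
```

For 64,000 vectors this gives 505, matching the example above. For 6,400 vectors the factor is reduced until the value reaches the cap of 100.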
Rockset's Vector Search Indexing
The factory string previously discussed currently only supports one form of ANN architecture which is Rockset's index filtering sitting on top of FAISS Inverted File Indexes. Rockset will train a set of centroids on the collection's vector data with the number of these centroids being N specified in the term IVF{N} used in the factory string. Each centroid acts as a representative for a cluster of similar vectors and their associated documents. Together the centroids form an index on the vector data that can be used at query time. For each centroid Rockset will have a posting list that can be used for iteration.
At query time FAISS will gather the set of centroids (clusters/posting lists) in the index that have the smallest coarse distance from the query vector. The number of centroids that are retrieved and searched is based on the nprobe parameter, whose default is set at index creation but can also be specified for each individual query. From here Rockset must simply iterate the posting lists and return the closest vectors.
nprobe Note
When searching nprobe posting lists, the results returned may be fewer than the requested limit due to a selective predicate. In this case, increasing the number of centroids to probe may increase the number of returned results.
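The interaction between posting lists, nprobe, and a selective predicate can be seen in a toy inverted-file search written in Python. This is a deliberately simplified illustration, not Rockset's or FAISS's implementation: the centroids are fixed by hand instead of being trained, and all names and data are invented for the example.

```python
import math


def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_posting_lists(centroids, vectors):
    """Assign each document id to the posting list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for doc_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: l2(centroids[i], vec))
        lists[nearest].append(doc_id)
    return lists


def ann_search(query, centroids, lists, vectors, limit, nprobe=1,
               predicate=lambda doc_id: True):
    """Probe the nprobe closest posting lists, filter, then rank by distance."""
    probed = sorted(range(len(centroids)),
                    key=lambda i: l2(centroids[i], query))[:nprobe]
    candidates = [d for i in probed for d in lists[i] if predicate(d)]
    candidates.sort(key=lambda d: l2(vectors[d], query))
    return candidates[:limit]


centroids = [(0.0, 0.0), (10.0, 10.0)]  # hand-picked, not trained
vectors = {"a": (0.0, 1.0), "b": (1.0, 0.0), "c": (9.0, 10.0), "d": (10.0, 9.0)}
lists = build_posting_lists(centroids, vectors)

query = (1.0, 1.0)
selective = lambda doc_id: doc_id in {"a", "c"}  # stands in for a WHERE clause
few = ann_search(query, centroids, lists, vectors, limit=2, nprobe=1,
                 predicate=selective)
more = ann_search(query, centroids, lists, vectors, limit=2, nprobe=2,
                  predicate=selective)
```

With nprobe=1 only the nearest posting list is visited, so `few` contains a single result even though the limit is 2. Probing both lists with nprobe=2 lets `more` satisfy the limit.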
Building the Index
After running the CREATE command, an RRN (Rockset's globally unique identifier) will be returned as the result. At this point Rockset has started building the index. Building the index requires Rockset to scan the vector data for training and then perform updates to index all documents.
Index Building Tip
Building an index is a memory intensive operation as clustering is performed in memory. For large collections it is recommended that only one similarity index be trained at a time.
You can check the state of the index by querying Rockset's _system workspace.
SELECT
*
FROM
_system.similarity_index
WHERE
rrn = <RRN>
This will return information about the default nprobe for the index, the factory string it is associated with, and other information that was provided at creation time. It will also return the index_status, which can be either 'TRAINING' or 'READY'. The index is only usable once it has reached the 'READY' state.
Given the number of centroids C, Rockset must read at least C * 64 vectors before training completes. If not enough vectors are available to be trained on, the index will stay in the 'TRAINING' state and wait until more vectors are ingested.
Querying on the index
Querying using the index happens as before, but we must specify that approximate distance results are acceptable. This lets Rockset's Cost Based Optimizer (CBO) decide whether to use a KNN search or whether an ANN search will be more efficient. So long as Rockset's CBO is enabled, the choice to use the index happens completely transparently to the user.
Force index use with HINTs
There are situations where the CBO may not have collected enough stats to feel confident in using the index. To force the index to be used, you may add HINT(access_path=index_similarity_search) after the FROM clause.
If the index is not yet ready, approximate similarity search queries will return an error. If you would still like to query without an index, you may force the optimizer to perform a brute force search using the column index with HINT(access_path=column_scan).
If there are multiple usable indexes on the same field of the collection, the oldest available index will be used for the query by default. You can override this default behavior by using HINT(similarity_index='<Name>').
SELECT
ds.title,
ds.author
FROM
book_dataset ds
JOIN book_metadata m ON ds.isbn = m.isbn
WHERE
m.publish_date > DATE(2010, 12, 26)
and m.rating >= 4
and m.price < 50
ORDER BY
APPROX_DOT_PRODUCT(:target_embedding, ds.book_embedding) option(nprobe=2) DESC
LIMIT
30
The functions APPROX_DOT_PRODUCT and APPROX_EUCLIDEAN_DIST are approximate versions of DOT_PRODUCT and EUCLIDEAN_DIST respectively. Each will try to use an applicable index if one is available. If there is no index on the field for the required distance type, the sister function will be invoked, resulting in a brute force KNN search. To change the number of posting lists iterated in an ANN search, you may set the nprobe parameter per query with option(nprobe=<# of posting lists>), as in the query above. You cannot specify more posting lists to query than there are centroids in the index.
When searching, Rockset will try to push any predicate checks into the similarity index scan itself to avoid having to "pre" or "post" filter results. When querying the similarity index, a result limit is required for the index to be used, since without a limit a full scan of the collection would be performed using the index. Rockset will try to push a limit down to the similarity index, but if it fails due to a predicate that cannot be resolved within the index, then the index will not be used and the optimizer will fall back to a brute force KNN search.
Dropping an index
To delete an index you must issue the DROP command on the index. Once performed, queries will no longer use the index and Rockset will begin the process of cleaning up the index.
DROP <SIMILARITY|DISTANCE> INDEX <Index Name>
For example, to clear the index "book_catalogue_embeddings_ann_index" that we created we would run the following query:
DROP SIMILARITY INDEX book_catalogue_embeddings_ann_index
Want to learn more?
Check out this workshop from one of our Solutions Engineers to learn more about utilizing Vector Search in Rockset.
Check out this blog for more information on how Vector Search was integrated into Rockset.