## Loading Vector Data

The first step in enabling <<glossary:Vector Search>> is creating a collection. When creating the collection, make sure to wrap your vector fields in a call to [`VECTOR_ENFORCE`](🔗) inside the collection's ingest transformation. `VECTOR_ENFORCE` ensures that all incoming vectors are uniform in length and type, and returns `NULL` on failure. In addition to performing uniformity checks, it signals to Rockset that the array should be treated as a vector, allowing Rockset to make indexing optimizations such as storing the array compactly for fast access and avoiding the creation of an inverted index entry for each vector element.
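As a sketch, an ingest transformation that enforces uniform float vectors might look like the following. The field name `book_embedding` and the dimension `1536` are illustrative; use your own field name and your embedding model's output dimension.

```sql
-- _input refers to the incoming documents in an ingest transformation.
-- Documents whose book_embedding is not a 1536-element float vector
-- will have the field set to NULL.
SELECT *
EXCEPT (book_embedding),
       VECTOR_ENFORCE(book_embedding, 1536, 'float') AS book_embedding
FROM _input
```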

## Generating embeddings

Once your [<<glossary:Collection>>](🔗) has been created, you will need to generate <<glossary:Embeddings>> for your data if you have not done so already. Depending on your use case, you can develop your own models or use open-source and proprietary models provided by third parties. If you are planning to perform vector search on text data, [OpenAI](🔗) offers an easy-to-use [text embedding generation API](🔗). Other model hosting platforms like [Hugging Face](🔗) offer [free open-source text embedding models](🔗).

Popular third-party language models typically have straightforward integrations with Python orchestration tools like [LangChain](🔗). You can use a model of your choice along with [Rockset's LangChain <<glossary:Integration>>](🔗) to easily generate and store your embeddings.

As an alternative to using LangChain or [Rockset's Python client](🔗) to upload your embeddings, you can load data into another primary store like S3 and ingest it in real time using one of Rockset's [data source integrations](🔗).

## KNN Search

Performing a K-Nearest Neighbors (KNN) search on your embeddings directly in Rockset is as simple as adding an `ORDER BY` clause to sort by the distance to your query vector. To calculate the distance between two vectors, we can use one of Rockset's distance/similarity functions: [`EUCLIDEAN_DIST`](🔗), [`COSINE_SIM`](🔗), or [`DOT_PRODUCT`](🔗). Cosine similarity is just the dot product of two vectors divided by the product of their magnitudes, so if the vectors are already normalized to unit length (each has an L2 norm of 1), these two functions are equivalent.
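A minimal KNN query might look like the following sketch. The collection name, field name, and tiny 3-dimensional literal vector are purely illustrative; real embeddings typically have hundreds or thousands of dimensions.

```sql
-- Brute-force KNN: compute the distance to every stored vector
-- and return the 5 closest documents.
SELECT title,
       EUCLIDEAN_DIST(book_embedding, [0.1, 0.2, 0.3]) AS distance
FROM commons.book_catalogue
ORDER BY distance ASC
LIMIT 5
```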

In practice, we should never need to type a query vector in by hand, as the meaning of each encoded value is typically opaque and understood only by the encoding model. To pass in query vectors, turn your query into a [<<glossary:Query Lambda>>](🔗) and make the query vector a parameter. This lets us generate embeddings on the fly and perform similarity search on them.

Creating the `:target_embedding` parameter looks like this:
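The parameter is declared when saving the Query Lambda and then referenced by name in its SQL body. A sketch of such a body, with collection and field names carried over from the earlier examples as assumptions:

```sql
-- :target_embedding is bound at execution time with the query vector.
SELECT title,
       COSINE_SIM(book_embedding, :target_embedding) AS similarity
FROM commons.book_catalogue
ORDER BY similarity DESC
LIMIT 10
```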

The parameter value will be the result of a query like the one below, with the title of whichever book you want to use:
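For instance, a query that fetches the stored embedding for a given title; the title value and collection name here are illustrative:

```sql
-- Look up the embedding for the book we want to find neighbors of.
SELECT book_embedding
FROM commons.book_catalogue
WHERE title = 'The Great Gatsby'
LIMIT 1
```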

An important benefit to using Rockset for vector search is that joining tables and checking complex predicates can easily coexist with a similarity search. You simply update the `WHERE` clause and JOIN any extra information you are interested in. In vector search parlance this is commonly referred to as metadata filtering.
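A sketch of metadata filtering combined with similarity search; the `authors` collection, join keys, and predicate values are hypothetical:

```sql
SELECT b.title,
       a.author_name,
       COSINE_SIM(b.book_embedding, :target_embedding) AS similarity
FROM commons.book_catalogue b
JOIN commons.authors a ON b.author_id = a.author_id
WHERE b.genre = 'science fiction'  -- metadata filter
ORDER BY similarity DESC
LIMIT 10
```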

## ANN Search

ANN Search is currently in Beta.

Collections must have been created after October 19, 2023 to utilize ANN Search.

When the dataset is small, it is easy to scan the whole dataset when performing a KNN search. Often, all of the vectors that need to be examined fit in memory, allowing for fast scans and comparisons. This is especially true if we add selective predicates to our query. However, in some cases we may need to query over a billion or more vectors. At that scale, the cost of looking up and comparing every single stored vector is too high and leads to significant latency increases. At a certain point an exact K-Nearest Neighbors search becomes too expensive to scale. Thankfully, in most cases exact ordering is not necessary; we really only need a best-effort, or Approximate Nearest Neighbor (ANN), search.

## Creating the index

In order to perform approximate scans we need to add extra structure to our data. To do this we run the DDL command `CREATE SIMILARITY INDEX`, which creates an index on our embedding field path. Once the command finishes, vector search queries can transparently take advantage of the new index, vastly speeding up nearest neighbor search.

Index creation has the form:
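A sketch of the general shape, reconstructed from the component descriptions below; exact keyword placement is an assumption:

```sql
CREATE SIMILARITY INDEX <Index Name>
ON FIELD <Collection>:<Field Path>  -- collection by name or RRN
DIMENSION <Dimension> AS '<Factory String>'
```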

Note the option to create a `DISTANCE` index, which orders vectors based on a distance metric rather than a similarity metric. Currently, the inner product of two vectors is the only supported similarity metric, while L2 distance is the only supported distance metric. The metric must be chosen at creation time and determines which functions can utilize the index.

On creation, we specify the name of our new index along with its associated Collection, either by name or RRN, and the dimension of the vectors that the index will scan and train on. The `<Factory String>` is a formatted string that specifies underlying type information and configuration for the new index.

### Factory String

The factory string has the form: **\<Provider\>:\<Parameters\>:\<ProviderConfig\>**

**Provider:** The library or creator of the index implementation.

Available providers include:

  • _faiss_: The [FAISS](🔗) library for similarity search.

**Parameters:** Rockset-specific parameters for querying and maintaining the index. Assigning a parameter value has the syntax `param=value`. Setting parameters is optional; if unset, the default value will be used.

Available parameters include:

  • _nprobe_: The number of centroids (posting lists) that will be visited during a query (default: 1).

  • _minprobe_, _maxprobe_ (not yet available): Specify a minimum number of posting lists to traverse, expanding the search as necessary to fulfill a passed-in limit until _maxprobe_ lists have been traversed.

**ProviderConfig:** Provider specific index construction string.

Construction strings for providers:

  • _faiss_: This string is defined by the FAISS library for [index factory construction](🔗). A formal outline of the grammar for the factory string can be found [here](🔗). FAISS supports several families of indexes, but Rockset supports only the IVF family mentioned here.

Putting it together we get:
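A sketch of the concrete command described below; the workspace name and vector dimension are assumptions:

```sql
CREATE SIMILARITY INDEX book_catalogue_embeddings_ann_index
ON FIELD commons.book_catalogue:book_embedding
DIMENSION 1536 AS 'faiss::IVF256,Flat'
-- Factory string: provider "faiss", no parameters (middle segment empty),
-- provider config "IVF256,Flat" (256 centroids, Flat coarse quantizer).
```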

Here we are creating an index named "book_catalogue_embeddings_ann_index" on our "book_embedding" field path. We specify the dimension of the input vectors along with a factory string that indicates what type of index to create. Here we specified that we want to use the FAISS IVF index with 256 centroids. Each centroid will be used to generate a posting list which Rockset will be able to use to quickly look up documents similar to a query vector. `Flat` here indicates we want to use the Flat coarse quantizer for our [IVF index](🔗). Note we did not specify any parameters so the default `nprobe` value for the index will be 1.

### Rockset's Vector Search Indexing

The factory string previously discussed currently supports only one ANN architecture: Rockset's index filtering on top of FAISS Inverted File (IVF) indexes. Rockset will train a set of centroids on the collection's vector data, with the number of centroids, _N_, specified in the term _IVF{N}_ of the factory string. Each centroid acts as a representative for a cluster of similar vectors and their associated documents. Together the centroids form an index on the vector data that can be used at query time. For each centroid, Rockset maintains a posting list that can be used for iteration.

At query time, FAISS gathers the set of centroids (clusters/posting lists) in the index that have the smallest _coarse distance_ from the query vector. The number of centroids retrieved and searched is based on the `nprobe` parameter, whose default is set at index creation but can also be specified for each individual query. From there, Rockset simply iterates the posting lists and returns the closest vectors.

`nprobe` Note

When searching `nprobe` posting lists, the results returned may be fewer than the requested limit due to a selective predicate. In this case, increasing the number of centroids to probe may increase the number of returned results.

## Building the Index

After running the `CREATE` command, an RRN (Rockset's globally unique identifier) will be returned as the result. At this point Rockset has started building the index. Building the index requires Rockset to scan the vector data for training and then perform updates to index all documents.

Index Building Tip

Building an index is a memory intensive operation as clustering is performed in memory. For large collections it is recommended that only _one_ similarity index be trained at a time.

You can check the state of the index by querying Rockset's _\_system_ workspace.
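For example, a status query might look like the following sketch; the system table name is an assumption:

```sql
-- Inspect the index's configuration and build status.
SELECT *
FROM _system.similarity_index
WHERE name = 'book_catalogue_embeddings_ann_index'
```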

This will return information about the default `nprobe` for the index, the factory string it is associated with, and other information that was provided at creation time. It will also print the `index_status`, which can be **'TRAINING'** or **'READY'**. The index is only usable once it has reached the **'READY'** state.

Given the number of centroids `C`, Rockset must read at least `C * 64` vectors before training completes. If not enough vectors are available to be trained on, the index will stay in the **'TRAINING'** state and wait until more vectors are ingested.

If the index is not yet ready, approximate similarity search queries will return an error.

If you would still like to query without an index you may force the optimizer to perform a brute force search using a [HINT](🔗), `HINT(access_path=column_scan)`.

If there are multiple usable indexes on the same field of the collection, the oldest available index will be used for the query by default.

You can override this default behavior with a HINT `HINT(similarity_index = '<Name>')`. This allows you to create and train a new index in the background while the old index is used for queries.

## Querying on the index

Querying using the index happens as before, but we must specify that approximate distance results are acceptable. This lets Rockset's [Cost Based Optimizer](🔗) (CBO) decide whether a KNN search or an ANN search will be more efficient. So long as the CBO is enabled, choosing the index is completely transparent to the user.

Force index use with HINT

There are situations where the CBO may not have collected enough statistics to confidently use the index, so to force the index to be used you may add the hint `HINT(access_path=index_similarity_search)` after your `FROM` clause.

The functions `APPROX_DOT_PRODUCT` and `APPROX_EUCLIDEAN_DIST` are approximate versions of [`DOT_PRODUCT`](🔗) and [`EUCLIDEAN_DIST`](🔗), respectively. Each will try to use an applicable index if available; if there is no index on the required field and distance type, the exact sister function will be invoked instead, resulting in a brute force KNN search. To change the number of posting lists iterated in an ANN search, you may update the `nprobe` parameter per query by setting the option `options(nprobe=<# of posting lists>)`. You cannot specify more posting lists to query than there are centroids in the index.
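Putting these pieces together, a sketch of an ANN query that forces the index and raises `nprobe`; the placement of the options clause, and the collection and field names, are assumptions:

```sql
SELECT title,
       APPROX_DOT_PRODUCT(book_embedding, :target_embedding) AS similarity
FROM commons.book_catalogue HINT(access_path=index_similarity_search)
ORDER BY similarity DESC
LIMIT 10        -- a limit is required for the similarity index to be used
OPTIONS(nprobe = 8)  -- probe 8 posting lists instead of the default
```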

When searching, Rockset will try to push any predicate checks into the similarity index scan itself to avoid having to "pre" or "post" filter results. A result `limit` is required for the similarity index to be used, since without one a full scan of the collection would be performed through the index. Rockset will try to push the limit down to the similarity index, but if it fails due to a predicate that cannot be resolved within the index, the index will not be used and the optimizer will fall back to a brute force KNN search.

## Dropping an index

To delete an index you must issue the `DROP` command on the index. Once performed, queries will no longer use the index and Rockset will begin the process of cleaning up the index.

For example, to clear the index "book_catalogue_embeddings_ann_index" that we created we would run the following query:
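A sketch of the drop command; the exact syntax is assumed to mirror the `CREATE` form:

```sql
DROP SIMILARITY INDEX book_catalogue_embeddings_ann_index
```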

Want to learn more?

Check out this [workshop](🔗) from one of our Solutions Engineers to learn more about utilizing Vector Search in Rockset.

Check out this [blog](🔗) for more information on how Vector Search was integrated into Rockset.