Client

Usage

Client objects allow you to connect securely to the Rockset service. All other API calls require a valid Client object.

In order to create a Client object, you will need a valid Rockset API key. If you have access to the Rockset Console, you can use the console to create an API key. If not, please contact the Rockset team at support@rockset.io.

from rockset import Client

# connect securely to Rockset production API servers
client = Client(api_server='api.rs2.usw2.rockset.com',
                api_key='XKQL6YCU0zDUglhWHPMDDmDYyMxDHrASGk5apCnn3A07twh')

You can manage your api_key credentials using the rock command-line tool. Run rock configure to set up one or more api_key credentials and select the one that you want all rock commands and the Python Rockset Client to use. Once set up, you should expect the following to work.

from rockset import Client

# connect to the active credentials profile
# you can see the active profile by running ``rock configure ls``
rs = Client()

# connect to credentials profile 'prod' as defined by ``rock configure``
rs = Client(profile='prod')

Example

Connect to the Rockset API server and then use the client object to retrieve collections.

from rockset import Client

# connect securely to Rockset dev API server
rs = Client(api_server='api-us-west-2.rockset.io',
            api_key='adkjf234rksjfa23waejf2')

# list all collections in the account that I have access to
all_collections = rs.Collection.list()

# create a new collection; returns a collection object
new_collection = rs.Collection.create('customer_info')

# get details of an existing collection as a collection object
users = rs.Collection.retrieve('users')

Reference

class rockset.Client(api_key=None, api_server=None, profile=None, driver=None, **kwargs)[source]

Securely connect to Rockset using an API key.

Optionally, an alternate API server host can also be provided. If you have configured credentials using the rock configure command, then those credentials will act as fallback values when none of the api_key/api_server parameters are specified.

Parameters
  • api_key (str) – API key

  • api_server (str) – API server URL. Will default to https if URL does not specify a scheme.

  • profile (str) – Optionally, you can also specify the name of a credentials profile set up using rock configure

Returns

Client object

Return type

Client

Raises

ValueError – when the API key is not specified and could not be fetched from rock CLI credentials, or when the api_server URL is invalid.
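
For instance, a minimal sketch of handling this error when no credentials are available:

from rockset import Client

try:
    rs = Client()  # no api_key passed and no rock CLI credentials configured
except ValueError as e:
    print('could not create client:', e)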

classmethod config_dir()[source]

Returns name of the directory where Rockset credentials, config, and logs are stored.

Defaults to "~/.rockset/"

Can be overridden via the ROCKSET_CONFIG_HOME env variable.
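
For example, a quick sketch:

from rockset import Client

# prints the config directory: "~/.rockset/" unless ROCKSET_CONFIG_HOME overrides it
print(Client.config_dir())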

sql(q, **kwargs)[source]

Execute a query against Rockset.

This method prepares the given query object, binds it to a Cursor object, and returns that Cursor. The request is not actually dispatched to the backend until the results are fetched from the cursor.

Input query needs to be supplied as a Query object.

Cursor objects are iterable, and you can iterate through a cursor to fetch the results. The entire result data set can also be retrieved from the cursor object using a single results() call.

When you iterate through the cursor in a loop, the cursor object implements automatic pagination behind the scenes. If the query returns a large number of results, only a portion of them is buffered into the cursor at a time. As the cursor iterator reaches the end of the current batch, it will automatically issue a new query to fetch the next batch and seamlessly resume. The cursor's default iterator uses a batch size of 10,000, and you can create an iterator with a different batch size using the iter() method on the cursor object.

Example:

...
rs = Client()
cursor = rs.sql(q)

# fetch all results in 1 go
all_results = cursor.results()

# iterate through all results;
# automatic pagination with default iterator batch size of 10,000
# if len(all_results) == 21,442, then as part of looping
# through the results, three distinct queries would be
# issued with (limit, skip) of (10000, 0), (10000, 10000),
# (10000, 20000)
for result in cursor:
    print(result)

# iterate through all results;
# automatic pagination with iterator batch size of 20,000
# if len(all_results) == 21,442, then as part of looping
# through the results, two distinct queries would have
# been issued with (limit, skip) of (20000, 0), (20000, 20000).
for result in cursor.iter(20000):
    print(result)
...
Parameters
  • q (Query) – Input Query object

  • timeout (int) – Client side timeout. When specified, RequestTimeout exception will be thrown upon timeout expiration. By default, the client will wait indefinitely until it receives results or an error from the server.

Returns

returns a cursor that can fetch query results with or without automatic pagination

Return type

Cursor

MAX_DOCUMENT_SIZE_BYTES = 41943040

Maximum allowed size of a single document

MAX_FIELD_NAME_LENGTH = 10240

Maximum allowed length of a field name

MAX_FIELD_VALUE_BYTES = 4194304

Maximum allowed size of a field value

MAX_ID_VALUE_LENGTH = 10240

Maximum allowed length of _id field value

MAX_NAME_LENGTH = 2048

Maximum allowed length of a collection name

MAX_NESTED_FIELD_DEPTH = 30

Maximum allowed levels of depth for nested documents


Collection

Usage

Collection objects represent a single Rockset collection. These objects are generally created via a Rockset Client object using methods such as:

from rockset import Client

# connect to Rockset
rs = Client(api_key=...)

# create a new collection
user_events = rs.Collection.create('user-events')

# retrieve an existing collection
users = rs.Collection.retrieve('users')

You can add documents to the collection using the add_docs() method. Each document in a collection is uniquely identified by its _id field.

If documents added have _id fields that match existing documents, then their contents will be merged. Otherwise, the new documents will be added to the collection.

You can remove documents from a collection using the remove_docs() method.

Refer to the Query module for documentation and examples on how to query collections.

Example

from rockset import Client, Q, F

# connect securely to Rockset
rs = Client()

# retrieve the relevant collection
emails = rs.Collection.retrieve('emails')

# look for all emails to johndoe that contains the term 'secret'
johndoe_secret_q = Q('emails').where(
    (F["to"].startswith('johndoe@')) & (F["body"][:] == 'secret')
)

# query the collection
docs = rs.sql(johndoe_secret_q).results()

Create a new collection

Creating a collection using the Client object is as simple as calling client.Collection.create("my-new-collection"):

from rockset import Client
rs = Client()
new_collection = rs.Collection.create("my-new-collection")

# create a collection in a workspace
event-data-collection = rs.Collection.create("leads",
                                             workspace="marketing")

# create a collection and map timestamp field to event-time
field_mappings = [
  rs.FieldMapping.mapping(
    name="transformation1",
    input_fields=[
      rs.FieldMapping.input_field(
        field_name="ts",
          if_missing="PASS",
          is_drop=True,
          param="ts"
      )
    ],
    output_field=rs.FieldMapping.output_field(
        field_name="_event_time",
        sql_expression="CAST(:ts AS TIMESTAMP)",
        on_error="SKIP"
    )
  )
]

event-data-collection = rs.Collection.create("event-data-collection",
                                             field_mappings=field_mappings)

Creating a collection with a retention duration of 10 days:

from rockset import Client
rs = Client()
new_collection_with_retention = rs.Collection.create("my-event-collection",
                                                retention_secs=10*24*60*60)

List all collections

List all collections using the Client object:

from rockset import Client
rs = Client()

# list all collections
collections = rs.Collection.list()

Retrieve an existing collection

Retrieve a collection to run various operations on it, such as adding or removing documents or executing queries:

from rockset import Client
rs = Client()
users = rs.Collection.retrieve('users')

# retrieve a collection in a workspace
users = rs.Collection.retrieve('users', workspace='marketing')

Describe an existing collection

The describe method can be used to fetch all the details about the collection, such as which data sets act as the collection's sources, along with various performance and usage statistics:

from rockset import Client
rs = Client()
users = rs.Collection.retrieve('users')
print(users.describe())

Drop a collection

Use the drop() method to remove a collection permanently from Rockset.

Note

This is a permanent and non-recoverable operation. Beware.

from rockset import Client
rs = Client()
users = rs.Collection.retrieve('users')
users.drop()

Add documents to a collection

Python dicts can be added as documents to a collection using the add_docs method. Documents are uniquely identified by the _id field. If an input document does not have an _id field, then a unique id will be assigned by Rockset.

If the _id field of an input document does not match an existing document, then a new document will be created.

If the _id field of an input document matches an existing document, then the new document will be merged with the existing document:

from rockset import Client
import json

rs = Client()
with open('my-json-array-of-dicts.json') as data_fh:
    ret = rs.Collection.add_docs('users', json.load(data_fh))

Delete documents from a collection

Remove documents from a collection using the remove_docs method:

from rockset import Client

rs = Client()
users_to_remove = ['user007', 'user042', 'user435']
docs_to_remove = [{'_id': u} for u in users_to_remove]
ret = rs.Collection.remove_docs('users', docs_to_remove)

Reference

class rockset.collection.Collection(*args, **kwargs)[source]

Collection objects represent a single Rockset collection.

Objects of this class are never instantiated directly and are generally returned by methods such as:

from rockset import Client
rs = Client()
first = rs.Collection.create('my-first-collection')
another = rs.Collection.retrieve('another-collection')

See more examples and documentation in the Usage section above.

add_docs(docs, timeout=None)[source]

Adds or merges documents to the collection. Provides document level atomicity.

Documents within a collection are uniquely identified by the _id field. If an input document does not specify _id, then a unique UUID will be assigned to the document.

If the _id field of an input document does not match any existing collection document, then the input document will be inserted.

If the _id field of an input document matches an existing collection document, then the input document will be merged atomically, as described below:

  • All fields present in both the input document and the collection document will be updated to values from the input document.

  • Fields present in the input document but not the collection document will be inserted.

  • Fields present in the collection document but not the input document will be left untouched.

All fields within every input document will be inserted or updated atomically. No atomicity guarantees are provided across two different documents added.

Example:

from rockset import Client

rs = Client()
docs = [
    {"_id": "u42", "name": {"first": "Jim", "last": "Gray"}},
    {"_id": "u1201", "name": {"first": "David", "last": "DeWitt"}},
]
results = rs.Collection.add_docs("my-favorite-collection", docs)
...
Parameters
  • name (str) – Name of the collection

  • docs (list of dicts) – New documents to be added or merged

  • timeout (int) – Client side timeout. When specified, RequestTimeout exception will be thrown upon timeout expiration. By default, the client will wait indefinitely until it receives results or an error from the server.

Returns

The response dict will have 1 field: data

The data field will be a list of document status records, one for each input document indexed in the same order as the list of input documents provided as part of the request. Each of those document status records will have fields such as the document _id, _collection name, status describing if that particular document add request succeeded or not, and an optional error field with more details.

Return type

Dict
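
Continuing the example above, a sketch of inspecting these per-document status records (the exact status values are not enumerated here, and the error check is an assumption based on the description above):

results = rs.Collection.add_docs("my-favorite-collection", docs)
for doc_status in results["data"]:
    # each record carries the document _id, _collection, status,
    # and an optional error field with more details
    if doc_status.get("error"):
        print(doc_status["_id"], doc_status["status"], doc_status["error"])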

classmethod create(name, workspace='commons', description=None, sources=None, retention_secs=None, field_mappings=None, **kwargs)[source]

Creates a new Rockset collection.

Use it via rockset.Client().Collection.create()

Only alphanumeric characters, _, and - are allowed in collection names.

Parameters
  • name (str) – name of the collection to be created.

  • description (str) – a human readable description of the collection

  • sources (Source) – array of Source objects that defines the set of input data sources for this collection

  • retention_secs (int) – the minimum duration (in seconds) for which documents are retained in this collection before being automatically deleted.

  • field_mappings (FieldMapping) – array of FieldMapping objects that defines the set of transformations to apply on all documents

Returns

Collection object

Return type

Collection

describe()[source]

Returns all properties of the collection as a dict.

Returns

properties of the collection

Return type

dict

drop()[source]

Deletes the collection represented by this object.

If successful, the current object will contain a property named dropped with value True

Example:

...
print(my_coll.asdict())
my_coll.drop()
print(my_coll.dropped)       # will print True
...
classmethod list(**kwargs)[source]

Returns list of all collections.

Use it via rockset.Client().Collection.list()

Returns

A list of Collection objects

Return type

List

patch_docs(docpatches, timeout=None)[source]

Updates documents in the collection.

This method expects a list of docpatches, where each docpatch describes a set of updates that need to be applied to a single document in the collection.

All updates specified in a single docpatch will be applied atomically to the document. If a single patch operation specified in a patch fails, the entire patch operation will be aborted.

Each docpatch is a dict that is required to have 2 fields:

  1. _id that holds the _id field of the document to be updated

  2. patch that holds a list of patch operations to be applied to that document, following the JSONPatch standard defined at http://jsonpatch.com

Example:

from rockset import Client
rs = Client()

docpatch = {
    "_id": "u42",
    "patch": [
        {"op": "replace", "path": "/name/middle", "value": "Nicholas"}
    ]
}
rs.Collection.patch_docs('my-collection', [docpatch])

Each patch is a list of individual patch operations, where each patch operation specifies how a particular field or field path within the target document should be updated.

Each patch operation is a dict with a key called “op” (for operation) and a few more keys that act as arguments to the “op”, which differ from one operation type to another. The JSONPatch standard defines several types of patch operations, their arguments and their behavior. Refer to http://jsonpatch.com/#operations for more details.

If a patch operation's argument is a field path, then it is specified using the JSON Pointer standard defined at https://tools.ietf.org/html/rfc6901. In essence, field paths are represented as a string of tokens separated by / characters. These tokens either specify keys in objects or indexes into arrays, and arrays are 0-based.

For example, in this document:

{
    "biscuits": [
        { "name": "Digestive" },
        { "name": "Choco Leibniz" }
    ]
}

"/biscuits" would point to the array of biscuits
"/biscuits/1/name" would point to "Choco Leibniz".

Here are some examples of individual patch operations:

  • Add

    Example:

    { "op": "add", "path": "/biscuits/1", "value": { "name": "Ginger Nut" } }
    

    Adds a value to an object or inserts it into an array. In the case of an array, the value is inserted before the given index. The - character can be used instead of an index to insert at the end of an array.

  • Remove

    Example:

    { "op": "remove", "path": "/biscuits" }
    

    Removes a value from an object or array.

    Another Example:

    { "op": "remove", "path": "/biscuits/0" }
    

    Removes the first element of the array at biscuits (or just removes the “0” key if biscuits is an object)

  • Replace

    Example:

    { "op": "replace", "path": "/biscuits/0/name", "value": "Chocolate Digestive" }
    

    Replaces a value. Equivalent to a “remove” followed by an “add”.

  • Copy

    Example:

    { "op": "copy", "from": "/biscuits/0", "path": "/best_biscuit" }
    

    Copies a value from one location to another within the JSON document. Both “from” and “path” are JSON Pointers.

  • Move

    Example:

    { "op": "move", "from": "/biscuits", "path": "/cookies" }
    

    Moves a value from one location to the other. Both “from” and “path” are JSON Pointers.

  • Test

    Example:

    { "op": "test", "path": "/best_biscuit/name", "value": "Choco Leibniz" }
    

    Tests that the specified value is set in the document. If the test fails, then the patch as a whole will not apply.
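
As a combined sketch, several of the operations above can be applied atomically in a single docpatch (the collection name, document _id and field paths here are hypothetical):

from rockset import Client
rs = Client()

docpatch = {
    "_id": "u42",
    "patch": [
        # abort the entire patch unless the last name matches
        {"op": "test", "path": "/name/last", "value": "Gray"},
        # append a value at the end of the "tags" array
        {"op": "add", "path": "/tags/-", "value": "turing award"},
        # drop the middle name
        {"op": "remove", "path": "/name/middle"}
    ]
}
rs.Collection.patch_docs('my-collection', [docpatch])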

Parameters
  • name (str) – Name of the collection

  • docpatches (list of dicts) – List of document patches to be applied.

  • timeout (int) – Client side timeout. When specified, RequestTimeout exception will be thrown upon timeout expiration. By default, the client will wait indefinitely until it receives results or an error from the server.

Returns

The response dict will have 1 field: data.

The data field will be a list of document status records, one for each input document indexed in the same order as the list of input documents provided as part of the request. Each of those document status records will have fields such as the document _id, _collection name, status describing if that particular document add request succeeded or not, and an optional error field with more details.

Return type

Dict

remove_docs(docs, timeout=None)[source]

Deletes documents from the collection. The _id field needs to be populated in each input document. Other fields in each document will be ignored.

Parameters
  • name (str) – Name of the collection

  • docs (list of dicts) – Documents to be deleted.

  • timeout (int) – Client side timeout. When specified, RequestTimeout exception will be thrown upon timeout expiration. By default, the client will wait indefinitely until it receives results or an error from the server.

Returns

The response dict will have 1 field: data.

The data field will be a list of document status records, one for each input document indexed in the same order as the list of input documents provided as part of the request. Each of those document status records will have fields such as the document _id, _collection name, status describing if that particular document add request succeeded or not, and an optional error field with more details.

Return type

Dict

classmethod retrieve(name, **kwargs)[source]

Retrieves details of a single collection

Use it via rockset.Client().Collection.retrieve()

Parameters

name (str) – Name of the collection

Returns

Collection object

Return type

Collection


Field Mappings

Introduction

FieldMapping objects allow you to specify transformations to be applied to all documents inserted into a collection. Any valid SQL can be applied to any field in a document.

For more information on field mappings, refer to the official documentation.

Example of basic field mapping

from rockset import Client

rs = Client()

mappings = [
    rs.FieldMapping.mapping(
        name="anonymize_name",
        input_fields=[
            rs.FieldMapping.input_field(
                field_name="name",
                if_missing="SKIP",
                is_drop=True,
                param="name"
            )
        ],
        output_field=rs.FieldMapping.output_field(
            field_name="name", sql_expression="SHA256(:name)", on_error="FAIL"
        )
    )
]

# This collection will have all its `name` fields anonymized through the SQL expression
# in the output field above.
collection = rs.Collection.create(name="collection", field_mappings=mappings)

Example of field whitelisting

from rockset import Client

rs = Client()

mappings = [
    rs.FieldMapping.mapping(name="drop_all_fields", is_drop_all_fields=True),
    rs.FieldMapping.mapping(
        name="whitelist_name",
        input_fields=[
            rs.FieldMapping.input_field(
                field_name="name",
                if_missing="SKIP",
                is_drop=True,
                param="name"
            )
        ],
        output_field=rs.FieldMapping.output_field(
            field_name="name", sql_expression=":name", on_error="FAIL"
        )
    )
]

# This collection will have `name` as a whitelisted field, essentially dropping all fields
# except for `name`.
collection = rs.Collection.create(name="collection", field_mappings=mappings)

Reference

class rockset.field_mapping.FieldMapping[source]
classmethod input_field(field_name, param=None, if_missing=None, is_drop=None)[source]

Create a new InputField object

Parameters
  • field_name (str) – The name of the field, parsed as a SQL qualified name

  • param (str) – SQL parameter name (default: same as field name; required if the field name is nested)

  • if_missing (str) – Define the behavior if the field is missing from the document or is NULL, one of:

    • SKIP: skip the SQL expression in the output field, i.e. acts as if the field mapping does not exist

    • PASS: pass NULL to the SQL expression specified in the output field

    Default is SKIP.

  • is_drop (boolean) – Set to true if the input field needs to be dropped

classmethod mapping(name, is_drop_all_fields=None, input_fields=None, output_field=None)[source]

Creates a new mapping object

Parameters
  • name (str) – Name of the mapping

  • is_drop_all_fields (bool) – Whether to drop all the fields in a document. This can only be set once in a list of field mappings, and a mapping specifying is_drop_all_fields cannot have any input or output fields

  • input_fields – An Array of InputField objects

  • output_field – An OutputField object

classmethod output_field(field_name, sql_expression, on_error=None)[source]

Create a new OutputField object

Parameters
  • field_name (str) – The name of the field, parsed as a SQL qualified name

  • sql_expression (str) – SQL expression

  • on_error (str) –

    Define the behavior if the SQL expression fails, one of:

    • SKIP: skip the SQL expression, i.e. acts as if the mapping does not exist

    • FAIL: fail the entire mapping, i.e. acts as if the document does not exist

    Default is SKIP.

Cursor

Introduction

Cursor objects are returned by the Client.sql() API. A cursor binds the query to a particular collection, and the query is not executed server-side until the application starts to fetch results from the cursor.

Note

Cursors are never instantiated directly by the application and are always instantiated by Rockset Client sql() APIs.

Fetch all results

Use the cursor’s results() method to fetch all the results of the query in one shot:

results = mycollection.sql(q).results()

The above technique works well if the number of results returned by the query is limited, say because it employs a LIMIT clause.

For queries that return a large number of results, please use the cursor iterators as described below.

Iterators with automatic pagination

Cursor objects are iterables, so you can do something like:

results = mycollection.sql(q)
for r in results:
    print(r)

Cursor objects support seamless automatic pagination to iterate over large result sets. The cursor iterator will fetch and buffer a small portion of the results and as the iterator reaches the end of the current batch of buffered results, it will automatically issue the query with the appropriate pagination parameters to buffer the next batch and seamlessly continue results iteration.

The default cursor iterator uses a batch size of 10,000. You can create a cursor iterator with a different batch size by using the iter() method.

Example using the default cursor iterator:

results = mycollection.sql(q)
for r in results:
    print(r)

Example using a custom cursor iterator with batch size 200:

results = mycollection.sql(q)
for r in results.iter(200):
    print(r)
Cursor.iter(batch=10000)[source]

Returns an iterator that does seamless automatic pagination behind the scenes while fetching no more than the specified batch size number of results at a time.

Parameters

batch (int) – maximum number of results fetched at a time

Returns

Iterator that will return all results one Document object at a time with automatic pagination

Return type

Iterator Object

Async requests

Cursors support asyncio.Future to schedule and run queries concurrently along with other async events in your application.

One can create an asyncio.Future from a cursor object using the Cursor.async_request() method. These futures are not scheduled in any async threads; the application has to schedule them in an asyncio event loop. Once the futures are scheduled and run to completion, the results of their respective queries can be accessed by calling future.result(). The return value of future.result() will be identical to calling the Cursor.results() API on the original query.

For example:

import asyncio

jims_future = people.sql(F["first_name"] == "Jim").async_request()
sfmerch_future = merchants.sql(F["zipcode"] == "94107").async_request()

# can schedule these asyncio.futures along with other futures
# issue both queries concurrently and block until both of them finish
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(jims_future, sfmerch_future))

all_jims = jims_future.result()
all_sfmerchants = sfmerch_future.result()

Even if a future was originally created by an async_request() API call, one can still call the blocking Cursor.results() API to fetch the results synchronously. Cursor.results() will schedule the future, block until the future finishes execution, and then return those results. A subsequent future.result() call will return the query results immediately.

For example:

jims_cursor = people.sql(F["first_name"] == "Jim")
jims_future = jims_cursor.async_request()

# a blocking results() call will schedule and block on the future behind the scenes
results = jims_cursor.results()

# this will return immediately without incurring any server round-trip
results2 = jims_future.result()
Cursor.async_request()[source]

Returns an asyncio.Future object that can be scheduled in an asyncio event loop. Once scheduled and run to completion, the results can be fetched via the future.result() API. The return value of future.result() will be the same as the return value of Cursor.results()

Returns

Returns a Future object that can be scheduled in an asyncio event loop and future.result() will hold the same return value as Cursor.results()

Return type

asyncio.Future

Reference

class rockset.cursor.Cursor(q=None, client=None, generate_warnings=False)[source]

Fetch the results of a query executed against a collection

async_request()[source]

Returns an asyncio.Future object that can be scheduled in an asyncio event loop. Once scheduled and run to completion, the results can be fetched via the future.result() API. The return value of future.result() will be the same as the return value of Cursor.results()

Returns

Returns a Future object that can be scheduled in an asyncio event loop and future.result() will hold the same return value as Cursor.results()

Return type

asyncio.Future

fields()[source]

Once the query has been executed, this method will return all the fields that are present in the query result, while preserving the order in which those fields were selected in the input query.

NOTE: this method will return None when called before the query has actually been executed.

Returns

List of column headers in the query result, preserving the ordering from the input query.

Return type

list[dicts]

iter(batch=10000)[source]

Returns an iterator that does seamless automatic pagination behind the scenes while fetching no more than the specified batch size number of results at a time.

Parameters

batch (int) – maximum number of results fetched at a time

Returns

Iterator that will return all results one Document object at a time with automatic pagination

Return type

Iterator Object

results()[source]

Execute the query and fetch all the results in one shot.

Returns

Results of the query represented as a list of Document objects

Return type

list[Document]

rowcount()[source]

Number of rows returned by the last query

Returns

The number of rows that the last query produced. Returns None in case no query has been executed yet.

Return type

int

stats()[source]

Query execution stats

Returns

Some key execution stats. Returns None in case no query has been executed yet.

Return type

dict

warnings()[source]

Retrieve warnings that occurred during query execution, if any.

Returns

The warnings from the query. Returns None if no warnings exist.

Return type

list[str]
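
Putting these together, a sketch of inspecting cursor metadata after a query has executed (rs and q are assumed to be a Client and a Query from the earlier examples):

cursor = rs.sql(q)
results = cursor.results()   # executes the query

print(cursor.rowcount())     # number of rows the query produced
print(cursor.fields())       # column headers, in input query order
print(cursor.stats())        # key execution stats
print(cursor.warnings())     # warnings, or None if there are none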


Document

Introduction

A Document object represents a single record or row in the result retrieved from executing a query.

Note

Document objects are generally instantiated by the Cursor object and do not need to be instantiated directly by the application while retrieving results of a query.

This class adapts Rockset’s SQL types to standard Python types for all the top level fields retrieved by the query.

Rockset Data Type      Python Data Type     Comments

SQL NULL               None
JSON NULL              None
bool                   bool
int                    int
float                  float
string                 str
bytes                  str                  NOTE: will be changed to bytes in the future
array                  list
object                 dict
date                   datetime.date
datetime               datetime.datetime    tzinfo will be None
time                   datetime.time
timestamp              str                  ISO8601 format in UTC timezone, e.g. '2019-11-09T23:14:31.561512Z'; NOTE: will be changed to datetime.datetime with tzinfo in the future
month_interval         dict                 INTERVAL 10 MONTH will return {'value': 10}; INTERVAL '3-4' YEAR TO MONTH will return {'value': 40}
microsecond_interval   datetime.timedelta
geography.Point        geojson.Point
geography.LineString   geojson.LineString
geography.Polygon      geojson.Polygon

Note

Please note that this type adaptation is only done for the top level fields returned in a query. If a top level field retrieved by the query is a map or an array, then fields nested within that map or array are not adapted to the respective Python data types.

Reference

class rockset.document.Document(*args, **kwargs)[source]

Represents a single record or row in query results.

This is a sub-class of dict. So, treat this object as a dict for all practical purposes.

Only the constructor is overridden to handle the type adaptations shown in the table above.
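
For instance, since Document is a dict subclass, a result row can be handled with ordinary dict operations (the collection and field names here are hypothetical):

from rockset import Client, Q

rs = Client()
for doc in rs.sql(Q('logins').limit(5)):
    # standard dict access works on each Document
    print(doc.get('user_id'), sorted(doc.keys()))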


Exceptions

Introduction

Various Python exceptions thrown by the rockset module are explained in this section, along with possible reasons and remedies to assist in troubleshooting.
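
As a general pattern, these exceptions can be caught individually around any API call; a minimal sketch (the query and timeout here are illustrative):

from rockset import Client, Q
from rockset.exception import AuthError, InputError, RequestTimeout

rs = Client()
try:
    results = rs.sql(Q('users').limit(10), timeout=30).results()
except AuthError as e:
    print('authentication failed:', e.message)
except InputError as e:
    print('malformed request:', e.message)
except RequestTimeout as e:
    print('request timed out after', e.timeout, 'seconds')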

Authentication Errors

The server is rejecting your request because you have either an expired or an invalid token. Ensure you have a valid API key or generate a new one using the Rockset Console before trying your request again.

class rockset.exception.AuthError(**kwargs)[source]

API key or access token is missing, expired or invalid. Re-authenticating with a valid API key should normally fix this.

code

HTTP status code obtained from server

Type

int

message

error message with more details

Type

str

Input Errors

The server is unable to understand the API request as it was sent. This most likely means the request was badly formed (for example, the input query has a syntax error). When you encounter this error, please refer to the relevant documentation and verify that the request is constructed properly and that the resource is still present.

class rockset.exception.InputError(**kwargs)[source]

User request has a missing or invalid parameter and cannot be processed as is. Syntax errors in queries fall in this category.

code

HTTP status code obtained from server

Type

int

message

error message with more details

Type

str

type

error sub-category

Type

str

Limit Reached

The server could understand the input request but refuses to execute it. This commonly happens when an account limit has been reached. Please reach out to Rockset Support with more details to alter your account limit.

class rockset.exception.LimitReached(**kwargs)[source]

The API request has exceeded some user-defined limit (such as max deadline set for a query) or a system limit. Refer to documentation to increase the limit or reach out to Rockset support with more details to alter your account limit.

code

HTTP status code obtained from server

Type

int

message

error message with more details

Type

str

type

error sub-category

Type

str

Not Yet Implemented

Your API request requires a feature that is not present in your cluster. Either your cluster needs an upgrade, or this feature is on our roadmap and we haven't gotten around to implementing it yet. Please reach out to Rockset support with more details to help us prioritize this feature.

class rockset.exception.NotYetImplemented(**kwargs)[source]

Your request is expecting a feature that has not been deployed in your cluster or has not yet been implemented. Please reach out to Rockset support with more details to help us prioritize this feature. Thank you.

code

HTTP status code obtained from server

Type

int

message

error message with more details

Type

str

type

error sub-category

Type

str

Request Timeouts

The server did not complete the API request before the timeout you set for the request expired. To troubleshoot, see if your request succeeds when you don't set a timeout. If it does, then you need to recalibrate your timeout value. If it doesn't, then debug the issue based on the new error you receive.

class rockset.exception.RequestTimeout(**kwargs)[source]

Request timed out.

Many API calls allow a client side timeout to be specified. When specified, this exception will be thrown when the timeout expires and the API call has not received a valid response or an error from the servers.

message

timeout error message

Type

str

timeout

timeout specified with the API call in seconds

Type

int

Server Errors

These errors mean the server correctly parsed the input request but couldn't process it for some reason. If a particular request or application is seeing this while other requests are fine, then you have probably uncovered a bug in Rockset. Please contact Rockset support to report the bug, and we will provide a time estimate for resolution and send you a t-shirt.

class rockset.exception.ServerError(**kwargs)[source]

Something totally unexpected happened on our servers while processing your request and most likely you have encountered a bug in Rockset. Please contact Rockset support and provide all the details you received along with the error for quick diagnosis, resolution, and to collect your t-shirt.

code

HTTP status code obtained from server

Type

int

message

error message with more details

Type

str

type

error sub-category

Type

str

Transient Server Errors

When many of your requests are failing with TransientServerErrors, it means our servers are going through a period of instability or unplanned downtime. This always means our alerts are firing, our pagers are ringing, phones are buzzing, and little adorable kittens are getting lost in the woods. We are actively investigating and fixing this issue. Look for updates on our status page with estimates on time to resolution. Sorry.

class rockset.exception.TransientServerError(**kwargs)[source]

Some transient hiccup made us fail this request. This means our on-call engineers are actively working on the issue and should resolve it soon. Please retry after some time. Sorry.

code

HTTP status code obtained from server

Type

int

message

error message with more details

Type

str

type

error sub-category

Type

str


Limits

This section lists all the system-wide limits, such as the largest document that can be added to a collection, and other limits relating to field sizes.
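
For example, a client-side sanity check against these constants might look like the sketch below (using the JSON-serialized length as a proxy for document size is an assumption):

import json

from rockset import Client

doc = {"_id": "u42", "name": "Jim Gray"}
doc_size = len(json.dumps(doc).encode('utf-8'))
if doc_size > Client.MAX_DOCUMENT_SIZE_BYTES:
    raise ValueError("document exceeds the maximum allowed size")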

Client.MAX_DOCUMENT_SIZE_BYTES = 41943040

Maximum allowed size of a single document

Client.MAX_FIELD_NAME_LENGTH = 10240

Maximum allowed length of a field name

Client.MAX_FIELD_VALUE_BYTES = 4194304

Maximum allowed size of a field value

Client.MAX_ID_VALUE_LENGTH = 10240

Maximum allowed length of _id field value

Client.MAX_NAME_LENGTH = 2048

Maximum allowed length of a collection name

Client.MAX_NESTED_FIELD_DEPTH = 30

Maximum allowed levels of depth for nested documents


Q for Query and F for FieldRef

Usage

The Query module contains a set of APIs that allow you to compose powerful queries over collections.

This module comprises two major components:

Q: Query Builder. Used to compose complex and powerful queries.

>>> from rockset import Q
>>> q = Q('hello-world').limit(10)
>>> (sqltxt, sqlargs) = q.sql()
>>> print(sqltxt)
SELECT *
FROM "hello-world"
LIMIT 10
>>>

F: Field Reference. Used to construct field expressions that refer to particular fields within a document.

>>> from rockset import F
>>> (F['answer'] == 42).sqlexpression()
'"answer" = 42'
>>>

Example

from rockset import Client, Q, F

# connect to Rockset
rs = Client(api_key='aB35kDjg931J5nsf4GjwMeErAVd832F7ahsW1S02kfZiab42s11TsfW5Sxt25asT')

# fetch user whose "_id" == "u42"
u = rs.sql(Q('users').where(F["_id"] == "u42"))

# fetch the 100 oldest users in the 'users' collection
q = Q('users').highest(100, F["age"])
old100 = rs.sql(q)

# find the average rating of all songs by "The Beatles"
q = Q('songs').where(F["artist"] == "The Beatles").select(F["rating"].avg())
avg_rating = rs.sql(q)

Query Operators: Overview

rockset.Q(query, alias=None)[source]

All query objects are constructed using the Q(<collection-name>) query builder construct and are then followed by a chain of query operators to build the full query expression.

  • Constructor

    • Q: Specify the collection to be queried

    >>> # return all documents in the logins collection.
    >>> q = Q('logins')
    
  • Filter queries

    • where: Classic selection operator to only return documents that match the given criteria. Use F to construct field expressions to specify the selection criteria.

    >>> # return all docs in logins where field "user_id" is equal to "u42"
    >>> q = Q('logins').where(F['user_id'] == 'u42')
    
  • Projection

    • select: Specify the list of desired fields to be returned

    >>> # will return the fields "user_id" and "login_ip" from all docs in logins
    >>> q = Q('logins').select(F['user_id'], F['login_ip'])
    
  • Pagination

    • limit : Specify limit with skip support

    >>> # return 10 documents from logins after skipping the first 40 results
    >>> q = Q('logins').limit(10, skip=40)
    
  • Sorting

    • highest, lowest: Find the top N or the bottom N

    >>> # will return 10 documents with the most recent "login_time"
    >>> q = Q('logins').highest(10, F['login_time'])
    
  • Aggregation

    • aggregate: Group by and aggregate fields

    >>> # will aggregate all documents in logins by "user_id",
    >>> # and return "user_id", max("login_time") and count(*) after aggregation.
    >>> Q('logins').aggregate(F['user_id'], F['login_time'].max(), F.count())
    

Query Operator: Filters

Where operator

Syntax: <Query>.where(<Query>)

where allows you to chain a new query object as a conjunction. In most cases, field reference expressions are sufficient, but where comes in especially handy when you want to sub-select documents following another operation such as a sort or an aggregation.

Examples:

# find all "Jim"s who are in the top 100 highest scorers
Q('players')  \
.highest(100, F["score"])  \
.where(F["first_name"] == "Jim")
Query.where(query)[source]

Returns a new query object that when executed will only return documents that match the current query object AND the query object provided as input.

Parameters

query (Query) – the conjunct query object

Returns

new query object that returns documents in self AND query

Return type

Query

Query Operator: Projection

Select operator

Syntax: <Query>.select(<field_ref> [, <field_ref> [, ...]])

Allows you to specify the fields that you wish to include in the query results.

Examples:

Q('authors') \
.where(F["last_name"] == "Gray")  \
.select(F["first_name"], F["last_name"], F["age"])
Query.select(*fields)[source]

Returns a new query object that when executed will only include the list of fields provided as input.

Parameters

fields (list of FieldRef) – fields you wish to select

Returns

new query object that includes the desired field selection

Return type

Query

Query Operator: Pagination

Limit operator

Syntax: <Query>.limit(<max_results> [, <skip_count>])

The limit operator allows you to perform pagination queries and apply positional filters.

Examples:

# find the "_id" field of the 5 most recently uploaded documents
# since the default sorting is most recently updated first,
# this query will simply be:
Q('uploads').limit(5)

# fetch a third batch of 100 results, for all users older than 18
# i.e., skip the first 200 results
Q('uploads').where(F["age"] >= 18).limit(100, skip=200)
Query.limit(limit, skip=0)[source]

Returns a new query object that when executed will only return a subset of the results. The query when executed will return no more than limit results after skipping the first skip number of results. The limit operator is most commonly used for pagination.

Parameters
  • limit (int) – maximum number of results to return

  • skip (int) – the number of results to skip

Returns

new query object that only returns the desired subset

Return type

Query

Query Operator: Sorting

Highest, Lowest operators

Syntax: <Query>.highest(N, <field_ref> [, <field_ref> [, ...]]), <Query>.lowest(N, <field_ref> [, <field_ref> [, ...]])

Examples:

Q(F["last_name"] == "Gray").highest(5, F["score"], F["first_name"])
Q(F["last_name"] == "Gray").lowest(10, F["salary"])
Query.highest(limit, *fields)[source]

Returns a new query object that when executed will sort the results from the current query object by the list of fields provided as input in descending order and return top N defined by the limit parameter.

Parameters
  • limit (int) – top N results you wish to fetch

  • fields (list of FieldRef) – fields you wish to sort by (descending)

Returns

new query object that returns top N descending

Return type

Query

Query.lowest(limit, *fields)[source]

Returns a new query object that when executed will sort the results from the current query object by the list of fields provided as input in ascending order and return top N defined by the limit parameter.

Parameters
  • limit (int) – top N results you wish to fetch

  • fields (list of FieldRef) – fields you wish to sort by (ascending)

Returns

new query object that returns top N ascending

Return type

Query

Query Operator: JOINs

JOINS

Syntax: <Query>.join(<Query>, on=<field_ref expression>) <Query>.left_outer_join(<Query>, on=<field_ref expression>) <Query>.right_outer_join(<Query>, on=<field_ref expression>)

Examples:

Q('emails') \
.join(Q('users'), on=(F['emails']['from'] == F['users']['email']))

Q('emails', alias='e') \
.left_outer_join(Q('users', alias='u'), on=(F['e']['from'] == F['u']['email']))

Query Operator: Unnest - to expand nested array fields

Unnest operators

Syntax: <Query>.unnest(<field_ref>)

Examples:

Q('emails').unnest(F['to'])
(Q('linkedin_people', 'p')
.unnest(F['p']['experiences'], alias='exp')
.where(F['exp']['company_name'] == 'Rockset'))
Query.unnest(field, field_alias=None, alias=None)[source]

Returns a new query object that when executed will unnest the specified array field present in the results of the current query object.

Parameters
  • field (FieldRef) – array field that you wish to unnest

  • field_alias (str) – This field is required if the specified array field holds an array of scalar values, and is optional if the specified array field holds an array of objects.

  • alias (str) – subquery alias name for the unnested fields. This field is optional and defaults to the input array field name.

Returns

new query object for the unnested query

Return type

Query

Query Operator: Aggregation

Aggregate operator and field ref aggregate functions

Syntax: <Query>.aggregate(<field_ref> [, <field_ref> [, ...]])

Field reference objects can also include any of the following aggregation functions:

  • min

  • max

  • avg

  • sum

  • count

  • countdistinct

  • approximatecountdistinct

  • collect

You can also optionally provide a field name alias in the field reference using the named function. This comes in especially handy for aggregated fields.

Examples:

# find min and max salaries broken down by age
Q('employees').aggregate(F["age"], F["salary"].min(), F["salary"].max())
# will return documents such as:
# {"age", "18", "min(salary)": 50000, "max(salary)": 150000}
# {"age", "19", "min(salary)": 50000, "max(salary)": 152000}

# example using field name alias
Q('employees').aggregate(F["age"], F["salary"].avg().named("avg_salary"))
# will return documents such as:
# {"age", "18", "avg_salary": 82732}
Query.aggregate(*fields)[source]

Returns a new query object that when executed will aggregate results from the current query object by the list of fields provided as input.

Field reference objects can include one of the supported aggregate functions such as max, min, avg, sum, count, countdistinct, approximatecountdistinct, collect as follows: <field_ref>.max(), <field_ref>.min(), … .

The list of fields provided as input can contain a mix of field references that include an aggregate function and field references that do not.

Parameters

fields (list of FieldRef) – fields you wish to aggregate by

Returns

new query object that includes the desired field aggregations

Return type

Query

FieldRef.min()[source]

Returns a new FieldRef that represents a MIN() aggregation of the given field.

Returns

FieldRef object that represents the desired min aggregation.

Return type

AggFieldRef

FieldRef.max()[source]

Returns a new FieldRef that represents a MAX() aggregation of the given field.

Returns

FieldRef object that represents the desired max aggregation.

Return type

AggFieldRef

FieldRef.avg()[source]

Returns a new FieldRef that represents an AVG() aggregation of the given field.

Returns

FieldRef object that represents the desired avg aggregation.

Return type

AggFieldRef

FieldRef.sum()[source]

Returns a new FieldRef that represents a SUM() aggregation of the given field.

Returns

FieldRef object that represents the desired sum aggregation.

Return type

AggFieldRef

FieldRef.count()[source]

Returns a new FieldRef that represents a COUNT() aggregation of the given field.

When called from a field, say F[‘username’].count(), then the SQL expression generated will be of the following form, which when executed will return the number of rows where “username” IS NOT NULL:

COUNT(“username”)

When called on the root field, say F.count(), then the SQL expression generated will be of the form:

COUNT(*)

Returns

FieldRef object that represents the desired count aggregation.

Return type

AggFieldRef

FieldRef.countdistinct()[source]

Returns a new FieldRef that represents COUNT(DISTINCT <field_ref>) aggregation of the given field.

Returns

FieldRef object that represents the desired countdistinct aggregation.

Return type

AggFieldRef


Field Expressions Overview

rockset.F = <rockset.query.FieldRef object>

F is a field reference object that helps in building query expressions natively using Python language expressions. F uses Python operator overloading heavily, and operations on field references generate Query objects that can be used in conjunction with Q to compose complex queries.

  • Value comparators

    • ==, !=, <, <=, >, >=:

    >>> # match all docs where "first_name" is equal to "Jim"
    >>> F["first_name"] == "Jim"
    >>> # match all docs where "rating" is greater than or equal to 4.5
    >>> F["rating"] >= 4.5
    >>> # match all docs where "title" text is lexographcially greater than "Star Wars"
    >>> F["title"] >= "Star Wars"
    
  • String functions

    • startswith, like: Prefix and classic SQL LIKE expressions

    >>> # match all docs where "title" starts with "Star Wars"
    >>> F["title"].startswith("Stars Wars")
    >>> # match all docs where "title" contains the word "Wars"
    >>> F["title"].like("% Wars %")
    
  • Boolean compositions

    • &, |, ~: AND, OR and NOT expressions

    >>> # match all records with "rating" >= 4.5 AND "title" starts with "Star Wars"
    >>> e1 = (F["rating"] >= 4.5) & F["title"].startswith("Star Wars")
    >>> # match all records with "director" == "George Lucas" OR "title" starts with "Star Wars"
    >>> e2 = (F["director"] == 'George Lucas') | F["title"].startswith("Star Wars")
    >>> # match all records that are not included in expressions e1 or e2
    >>> e1e2_complement = ~(e1 | e2)
    
  • IN and EXISTS operators

    • exists: When called on a field reference, constructs a query of the form SELECT ... WHERE <field> IN <subquery>

    • exists: When called on the root F, constructs a query of the form SELECT ... WHERE EXISTS (<subquery>). Can also be used to just check if a subquery returns one or more results.

    >>> # find records where "source_ip" is one of the "ip"s in "wolves" collection
    >>> e1 = F["source_ip"].exists(Q("wolves").select(F["ip"]))
    >>> # match all such records in the "logins" collection
    >>> q = Q("logins").where(e1)
    >>> # another way to write the same query
    >>> q = (Q("logins")
    >>>         .where(
    >>>             F.exists(
    >>>                 Q("wolves")
    >>>                 .where(F["wolves"]["ip"] = F["logins"]["source_ip"])
    >>>             )
    >>>         )
    >>>     )
    
  • Field aggregations

    • avg, collect, count, countdistinct, max, min, sum

    >>> # count(*)
    >>> F.count()
    >>> # min(login_time)
    >>> F["login_time"].min()
    >>> # max(login_time) as last_login_time
    >>> F["login_time"].max().named('last_login_time')
    
  • Nested documents and arrays

    • []: The [] notation can be used to refer to fields within nested documents and arrays.

    • Consider a collection where documents looked like this example below.

    {
      "_id": {"u42"},
      "name": {
          "first": "James",
          "middle": "Nicholas",
          "last": "Gray" },
      "tags": [
          "ACID",
          "database locking",
          "two phase commits",
          "five-minute rule",
          "data cube",
          "turing award" ]
     }
    
    • Example field references to access nested documents and arrays:

    >>> # expression to find all documents where field "name" contains a
    >>> # nested field "middle" with value equal to "Nicholas"
    >>> F["name"]["middle"] == "Nicholas"
    >>>
    >>> # similarly, for array fields, you can specify the array offset.
    >>> # expression to find all documents where the first "tags" field
    >>> # is equal to "ACID"
    >>> F["tags"][0] == "ACID"
    
    • In order to match against any element within an array field, you can use Python’s empty slice [:] notation.

    >>> # expression to find all documents where the "tags" array field
    >>> # contains "ACID" as one of the elements
    >>> F["tags"][:] == "ACID"
    >>> # find all documents where one of the "tags" is "turing award"
    >>> F["tags"][:] == "turing award"
    

Field Expression: Value Comparators

Equality operator: ==

Supported types: int, float, bool, str

Syntax: <field_ref> == <value>

Examples:

F["first_name"] == "Jim"
F["year"] == 2017
F["score"] == 5.0
F["tags"][:] == 'critical'

Value comparators: <, <=, >=, >

Supported types: int, float, str

Syntax: <field_ref> < <value>, <field_ref> <= <value>, <field_ref> >= <value>, <field_ref> > <value>

Examples:

F["year"] < 2000
F["year"] >= 2007
F["rating"] >= 4.5
F["title"] >= "Star Wars"

Prefix operator: startswith

Supported types: str

Syntax: <field_ref>.startswith(<prefix>)

Examples:

F["first_name"].startswith("Ben")

Like operator: like

Supported types: str

Syntax: <field_ref>.like(<pattern>)

Examples:

F["address"].like("%State St%")

Field alias: named

Supported types: All field references

Syntax: <field_ref>.named(<new-field-name>)

Examples:

F["full_name"].named("name")
F["login_time"].max().named("last_login_time")

Field existence: is_defined

Supported types: All

Syntax: <field_ref>.is_defined()

Field existence tested with <field_ref>.is_defined() will match all documents where the field is defined, even if it has a null value.
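
For example (the field name is illustrative):

# match all documents where "email" is defined, even if its value is null
F["email"].is_defined()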

Null comparison: is_not_null

Supported types: All

Syntax: <field_ref>.is_not_null()

Field expression <field_ref>.is_not_null() will match all documents where the field is defined and has a non-null value.

Tip

There is no is_null() because of the potential confusion of calling is_null() on an undefined field. Use ~<field_ref>.is_not_null() or <field_ref>.is_defined() & ~<field_ref>.is_not_null() depending on your use case.

Field Expression: Boolean Compositions

Three different boolean operators (&, |, and ~) are overloaded to allow easy composition of boolean operators.

Note

The boolean operators are NOT and, or, and not, as those are special and cannot be overridden in Python.

AND operator: & (intersection)

Syntax: <Query object> & <Query object>

Examples:

# find all documents where field tags contains the "turing award"
# and the age is greater than 40
(F["tags"][:] == "turing award") & (F["age"] > 40)

OR operator: | (union)

Syntax: <Query object> | <Query object>

Examples:

# find all documents where the first_name is "jim"
# or last_name is "gray"
(F["first_name"] == "jim") | (F["last_name"] == "gray")

NOT operator: ~ (negation)

Syntax: ~<Query object>

Examples:

# find all documents whose title does not contain the term "confidential"
~F["title"][:] == "confidential"

Field Expression: Array operators

Nested operator

Syntax: <field_ref>.nested(<query>)

The nested operator makes it easy to work with nested arrays of documents.

Examples:

# find all books authored by 'jim gray'
F["authors"].nested((F["first_name"] == "jim") & (F["last_name"] == "gray"))

# find all users who logged in from given ip on june 06, 2006
F["logins"].nested((F["ipv4"] == "10.6.6.6") & (F["login_date"] == "2006-06-06"))

Unnest operator

Syntax: <field_ref>.unnest(alias=None)

The unnest operator should be called only on array fields and will return a query object that represents a SQL expression of the form UNNEST(<field_ref>).

Example:

# unnest all authors from the books collection, so that
# there is a record for every (book x book's author)
Q("books).join(F["authors"].unnest())

# find all books authored by 'jim gray'
(Q("books")
    .join(Q(F["authors"].unnest(), alias="a"))
    .where((F["a"]["first_name"] == "jim") & (F["a"]["last_name"] == "gray"))
)

Reference for Q and Query

rockset.query.Q(query, alias=None)[source]

All query objects are constructed using the Q(<collection-name>) query builder construct and are then followed by a chain of query operators to build the full query expression.

class rockset.query.Query(source=None, alias=None, child=None, children=None)[source]
aggregate(*fields)[source]

Returns a new query object that when executed will aggregate results from the current query object by the list of fields provided as input.

Field reference objects can include one of the supported aggregate functions such as max, min, avg, sum, count, countdistinct, approximatecountdistinct, collect as follows: <field_ref>.max(), <field_ref>.min(), … .

The list of fields provided as input can contain a mix of field references that include an aggregate function and field references that do not.

Parameters

fields (list of FieldRef) – fields you wish to aggregate by

Returns

new query object that includes the desired field aggregations

Return type

Query

highest(limit, *fields)[source]

Returns a new query object that when executed will sort the results from the current query object by the list of fields provided as input in descending order and return top N defined by the limit parameter.

Parameters
  • limit (int) – top N results you wish to fetch

  • fields (list of FieldRef) – fields you wish to sort by, descending

Returns

new query object that returns top N descending

Return type

Query
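Example (collection and field names are illustrative):

# the ten oldest users
Q("users").highest(10, F["age"])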

join(query, on=None, join_type='JOIN')[source]

Returns a new query object that when executed will do an INNER JOIN of the current query object with the input query object based on the given join predicate.

Parameters
  • query (Query) – the right hand side of the JOIN

  • on (Query) – the join predicate expressed as a field expression. This field is optional and does not need to be provided for a CROSS JOIN.

Returns

new query object that incorporates the join

Return type

Query
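A sketch, assuming field-to-field equality is supported in the join predicate (collection, alias, and field names are illustrative):

# inner join users to their orders
(Q("users", alias="u")
    .join(
        Q("orders", alias="o"),
        on=(F["u"]["id"] == F["o"]["user_id"]))
)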

left_outer_join(query, on)[source]

Returns a new query object that when executed will do a LEFT OUTER JOIN of the current query object with the input query object based on the given join predicate.

Parameters
  • query (Query) – the right hand side of the LEFT OUTER JOIN

  • on (Query) – the join predicate expressed as a field expression

Returns

new query object that incorporates the left outer join

Return type

Query

limit(limit, skip=0)[source]

Returns a new query object that when executed will only return a subset of the results. The query when executed will return no more than limit results after skipping the first skip number of results. The limit operator is most commonly used for pagination.

Parameters
  • limit (int) – maximum number of results to return

  • skip (int) – the number of results to skip

Returns

new query object that only returns the desired subset

Return type

Query
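Example (collection name and page size are illustrative):

# fetch the third page of results, 20 results per page
Q("users").limit(20, skip=40)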

lowest(limit, *fields)[source]

Returns a new query object that when executed will sort the results from the current query object by the list of fields provided as input, in ascending order, and return the top N results as defined by the limit parameter.

Parameters
  • limit (int) – top N results you wish to fetch

  • fields (list of FieldRef) – fields you wish to sort by, ascending

Returns

new query object that returns top N ascending

Return type

Query

right_outer_join(query, on)[source]

Returns a new query object that when executed will do a RIGHT OUTER JOIN of the current query object with the input query object based on the given join predicate.

Parameters
  • query (Query) – the right hand side of the RIGHT OUTER JOIN

  • on (Query) – the join predicate expressed as a field expression

Returns

new query object that incorporates the right outer join

Return type

Query

select(*fields)[source]

Returns a new query object that when executed will only include the list of fields provided as input.

Parameters

fields (list of FieldRef) – fields you wish to select

Returns

new query object that includes the desired field selection

Return type

Query
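Example (collection and field names are illustrative):

# only return the name and email fields
Q("users").select(F["name"], F["email"])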

sql(**kwargs)[source]

Returns a tuple of (SQL, params) for the underlying query expression.
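For example, to inspect the SQL and parameters generated for a query (collection and field names are illustrative):

sql_text, params = Q("users").where(F["age"] > 21).sql()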

sqlbuild(sqlsel)[source]

Returns an SQLSelect object, which can be used to generate the SQL text for the query.

sqlexpression(**kwargs)[source]

Returns a text SQL fragment for the underlying query expression.

sqlprepare(sqlsel)[source]

Returns an SQLSelect object, which can be used to build the SQL version of the query.

unnest(field, field_alias=None, alias=None)[source]

Returns a new query object that when executed will unnest the specified array field present in the results of the current query object.

Parameters
  • field (FieldRef) – array field that you wish to unnest

  • field_alias (str) – This field is required if the specified array field holds an array of scalar values. This field is optional if the specified array field holds an array of objects.

  • alias (str) – subquery alias name for the unnested fields. This field is optional and defaults to the input array field name.

Returns

new query object for the unnested query

Return type

Query

where(query)[source]

Returns a new query object that when executed will only return documents that match the current query object AND the query object provided as input.

Parameters

query (Query) – the conjunct query object

Returns

new query object that returns documents in self AND query

Return type

Query
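Example (collection and field names are illustrative):

# conjunction of two predicates
Q("users").where((F["age"] >= 21) & (F["state"] == "CA"))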

Reference for F and FieldRef

rockset.query.F = <rockset.query.FieldRef object>

F is a field reference object that helps in building query expressions natively using Python language expressions. F uses Python operator overloading heavily, and operations on field references generate Query objects that can be used in conjunction with Q to compose complex queries.

class rockset.query.FieldRef(name=None, parent=None, source=None)[source]
array_agg()[source]

Returns a new FieldRef that represents an ARRAY_AGG() aggregation of the given field.

Returns

FieldRef object that represents the desired array_agg aggregation.

Return type

AggFieldRef

avg()[source]

Returns a new FieldRef that represents an AVG() aggregation of the given field.

Returns

FieldRef object that represents the desired avg aggregation.

Return type

AggFieldRef

count()[source]

Returns a new FieldRef that represents a COUNT() aggregation of the given field.

When called from a field, say F['username'].count(), then the SQL expression generated will be of the following form, which when executed will return the number of rows where "username" IS NOT NULL:

COUNT("username")

When called on the root field, say F.count(), then the SQL expression generated will be of the form:

COUNT(*)

Returns

FieldRef object that represents the desired count aggregation.

Return type

AggFieldRef
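For example, contrasting the two forms within an aggregation (collection and field names are illustrative):

# total row count alongside the count of rows with non-null emails
Q("users").aggregate(F.count(), F["email"].count())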

countdistinct()[source]

Returns a new FieldRef that represents COUNT(DISTINCT <field_ref>) aggregation of the given field.

Returns

FieldRef object that represents the desired countdistinct aggregation.

Return type

AggFieldRef

exists(inner_query)[source]

Returns a new query object that represents a SQL expression as an IN clause or an EXISTS clause.

If the current field ref is just F and represents the root field, then the SQL expression will be of the form:

EXISTS (<inner_query>)

If the current field ref represents a field within the collection, then the SQL expression will be of the form:

<field_ref> IN (<inner_query>)

Example usage:

Q("logins").where(F["source_ip"].exists(Q("wolves").select(F["ip"])))

will construct the following SQL:

SELECT * FROM "logins" WHERE "source_ip" IN (SELECT "ip" FROM "wolves")

Parameters

inner_query (Query) – query object that represents the nested subquery

Returns

query object that represents the desired SQL expression

Return type

Query

map_agg(fvalue)[source]

Returns a new FieldRef that represents a MAP_AGG(key=<self>, value=<fvalue>) aggregation of the given fields.

Returns

FieldRef object that represents the desired map_agg aggregation.

Return type

AggFieldRef
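A sketch with illustrative field names:

# aggregate into a map keyed by city, with ages as values
F["city"].map_agg(F["age"])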

max()[source]

Returns a new FieldRef that represents a MAX() aggregation of the given field.

Returns

FieldRef object that represents the desired max aggregation.

Return type

AggFieldRef

min()[source]

Returns a new FieldRef that represents a MIN() aggregation of the given field.

Returns

FieldRef object that represents the desired min aggregation.

Return type

AggFieldRef

nested(nested_query, field_alias=None, alias=None)[source]

Returns a new query object that matches all documents where the given inner query matches on one or more individual nested documents present within the field path of the given field.

Useful to run complex query expressions on fields that contain a nested array of documents.

Example

Say you have a collection where every document describes a book, and each document has an "authors" field that is a nested array of documents describing each author:

{"title": "Transaction Processing: Concepts and Techniques",
 "authors": [
     {"first_name": "Jim", "last_name": "Gray"},
     {"first_name": "Andreas", "last_name": "Reuter"},
 ],
 "publisher": ... }

If you want to find all books where 'Jim Gray' was one of the authors, you can use the following query:

F["authors"].nested((F["first_name"] == 'Jim') & (F["last_name"] == 'Gray'))

Note: Constructing the same query as follows is incorrect:

# CAUTION: This is not same as the query above
(F["authors"][:]["first_name"] == 'Jim') & (F["authors"][:]["last_name"] == 'Gray')

The incorrect version will return all books that have at least one author with first name 'Jim' and at least one author with last name 'Gray', but they need not be the same author. A book with two authors named 'Jim Smith' and 'Alice Gray' will also match, which is not what is intended.

Parameters

nested_query (Query) – query expression to run on every nested document present within the given field path

Returns

query object that represents desired nested operations

Return type

Query

sum()[source]

Returns a new FieldRef that represents a SUM() aggregation of the given field.

Returns

FieldRef object that represents the desired sum aggregation.

Return type

AggFieldRef

unnest(alias=None)[source]

Returns a new query object that represents a SQL expression of the form:

UNNEST(<field_ref>)

or, when alias is not None

UNNEST(<field_ref> as <alias>)

Example usage:

Q("books").join(F["authors"].unnest())

will construct the following SQL:

SELECT * FROM "books" CROSS JOIN UNNEST("books"."authors")

Parameters
  • alias (str) – Required when the field ref is an array of scalars, such as ['vacation', 'beach', 'sand', 'dog']. Not required if the field is an array of objects.

Returns

query object that represents the desired UNNEST() expression

Return type

Query


Query Lambda

Usage

Query Lambdas are named, parameterized, and versioned SQL queries stored in Rockset that can be executed through a dedicated HTTPS endpoint.

Example

from rockset import Client, Q

# connect securely to Rockset
rs = Client()

# create a Query Lambda
qlambda = rs.QueryLambda.create(
    'myQueryLambda',
    query=Q('SELECT 1'))

# print details about Query Lambda
print(qlambda.workspace, qlambda.name, qlambda.version, qlambda.query)

# execute a Query Lambda
results = qlambda.execute()
print(results)

Create a new Query Lambda

Creating a Query Lambda using the Client object is as simple as calling rs.QueryLambda.create('myQueryLambda', query=Q('SELECT...')):

from rockset import Client, Q, F, P
rs = Client()

# construct a parameterized query
q = Q('_events').where(F['type'] == P['target_type']).limit(100)

# set default value for query parameter
q.P['target_type'] = 'INFO'

# create a new Query Lambda
qlambda = rs.QueryLambda.create(
    'myQueryLambda',
    workspace='commons',
    query=q,
)

List all Query Lambdas

List all Query Lambdas using the Client object:

from rockset import Client
rs = Client()

# List latest version of all Query Lambdas across all workspaces
qlambdas = rs.QueryLambda.list()

# List latest version of all Query Lambdas in a given workspace
qlambdas = rs.QueryLambda.list(workspace='commons')

Retrieve an existing Query Lambda version by a given tag

Retrieve the version of a given Query Lambda associated with a given tag:

from rockset import Client
rs = Client()

# retrieve Query Lambda
qlambda = rs.QueryLambda.retrieveByTag(
    'myQueryLambda',
    tag='dev',
    workspace='commons')

Retrieve an existing Query Lambda version

Retrieve a particular version of a given Query Lambda:

from rockset import Client
rs = Client()

# retrieve Query Lambda
qlambda = rs.QueryLambda.retrieveByVersion(
    'myQueryLambda',
    version='ac34bfg234ee',
    workspace='commons')

Describe an existing Query Lambda version

Fetch all details about a particular version of a given Query Lambda:

from rockset import Client
rs = Client()

# retrieve Query Lambda
qlambda = rs.QueryLambda.retrieveByTag(
    'myQueryLambda',
    tag='dev',
    workspace='commons')

# print all details about this Query Lambda version
print(qlambda.describe())

Execute a specific Query Lambda version

Execute a Query Lambda version, optionally passing in parameters:

from rockset import Client, ParamDict
rs = Client()

# retrieve Query Lambda
qlambda = rs.QueryLambda.retrieveByTag(
    'myQueryLambda',
    tag='dev',
    workspace='commons')

params = ParamDict()
for target_type in ['INFO', 'DEBUG', 'WARN', 'ERROR']:
    params['target_type'] = target_type
    for result in qlambda.execute(parameters=params).results:
        print(target_type, result)

Update a Query Lambda by creating a new version

Update the query associated with the Query Lambda by creating a new version of it and optionally giving it a tag:

from rockset import Client, Q, F, P, ParamDict
rs = Client()

# retrieve Query Lambda
ql = rs.QueryLambda.retrieveByTag(
    'myQueryLambda',
    tag='dev',
    workspace='commons')

# construct a new query
q = (Q('_events')
    .where(
        (F['type'] == P['target_type']) &
        (F['userEmail'] == P['user_email'])
    )
    .limit(100)
)

# optionally, set default value for some or all of the query parameters
q.P['target_type'] = 'INFO'

# update Query Lambda
# optionally, add a version tag at this time
ql_new_version = ql.update(q, tag='dev')

# execute the new Query Lambda for different user_emails
params = ParamDict()
for email in ['scott@rockset.com', 'veeve@rockset.com']:
    params['user_email'] = email
    results = ql_new_version.execute(parameters=params).results
    print(email, results)

Tag a version of a Query Lambda

Apply a tag to a specific Query Lambda version:

from rockset import Client
rs = Client()

# retrieve Query Lambda
ql = rs.QueryLambda.retrieveByVersion(
    'myQueryLambda',
    version='ac34bfg234ee',
    workspace='commons')

# add tag 'dev' to this Query Lambda version
ql.tag('dev')

List all versions of a Query Lambda

Fetch all versions of a given Query Lambda:

from rockset import Client
rs = Client()

# retrieve Query Lambda
ql = rs.QueryLambda.retrieveByTag(
    'myQueryLambda',
    tag='dev',
    workspace='commons')

# fetch all versions of this Query Lambda
all_qlambda_versions = ql.history()

Drop Query Lambda along with all previous versions

Use the drop() method to remove a Query Lambda permanently from Rockset.

Note

This is a permanent and non-recoverable operation. Beware.

from rockset import Client
rs = Client()

qlambda = rs.QueryLambda.retrieveByVersion(
    'myQueryLambda',
    version='ac34bfg234ee',
    workspace='commons')
qlambda.drop()

Source

Introduction

Source objects represent various data sources that could be used to create collections.

Example usage

from rockset import Client, Q, F
import os

rs = Client()

# create a collection from an AWS S3 bucket
integration = rs.Integration.retrieve('aws-rockset-read-only')
s3 = rs.Source.s3(bucket='my-s3-bucket',
    integration=integration)
newcoll = rs.Collection.create(name='newcoll', sources=[s3])

Create AWS S3 source for a collection

AWS S3 buckets can be used as a data source for collections:

from rockset import Client, Q, F
import os

rs = Client()

# create a collection from an AWS S3 bucket
integration = rs.Integration.retrieve('aws-rockset-read-only')
s3 = rs.Source.s3(bucket='my-s3-bucket',
    integration=integration)
newcoll = rs.Collection.create(name='newcoll', sources=[s3])

Reference

class rockset.source.Source(integration, format_params)[source]
classmethod csv_params(separator=None, encoding=None, first_line_as_column_names=None, column_names=None, column_types=None)[source]

Creates an object to represent options needed to parse a CSV file.

Parameters
  • separator (str) – The separator between column values in a line

  • encoding (str) – The encoding format of data, one of "UTF-8", "UTF-16", "US-ASCII" [default: "US-ASCII"]

  • first_line_as_column_names (boolean) – Set to true if the first line of a data object has the names of columns to be used. If this is set to false, the column names are auto-generated. [default: False]

  • column_names (list of strings) – The names of columns

  • column_types (list of strings) – The types of columns
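For example, a sketch that pairs csv_params() with an S3 source (bucket, integration, and collection names are illustrative):

from rockset import Client
rs = Client()

# parse options for a comma-separated file with a header row
csv = rs.Source.csv_params(separator=',',
    first_line_as_column_names=True)

integration = rs.Integration.retrieve('aws-rockset-read-only')
s3 = rs.Source.s3(bucket='my-csv-bucket',
    integration=integration,
    format_params=csv)
newcoll = rs.Collection.create(name='from_csv', sources=[s3])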

classmethod dynamo(table_name, integration=None)[source]

Creates a source object to represent an AWS DynamoDB table as a data source for a collection.

Parameters
  • table_name (str) – Name of the DynamoDB table

  • integration (rockset.Integration) – An Integration object (optional)
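For example (table, integration, and collection names are illustrative):

from rockset import Client
rs = Client()

integration = rs.Integration.retrieve('aws-rockset-read-only')
dynamo = rs.Source.dynamo(table_name='my-dynamo-table',
    integration=integration)
newcoll = rs.Collection.create(name='from_dynamo', sources=[dynamo])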

classmethod gcs(bucket=None, prefix=None, integration=None, format_params=None)[source]

Creates a source object to represent a Google Cloud Storage (GCS) bucket as a data source for a collection.

Parameters
  • bucket (str) – Name of the GCS bucket

  • prefix – selects objects whose path matches the specified prefix within the bucket

  • integration (rockset.Integration) – An Integration object (optional)

  • format_params (FormatParams) – the specifications of the format, CsvParams or XmlParams

classmethod kafka(kafka_topic_name, integration)[source]

Creates a source object to represent Apache Kafka as a data source for a collection.

Parameters
  • kafka_topic_name (str) – Kafka topic to be tailed

  • integration (rockset.Integration) – An Integration object

classmethod kinesis(stream_name, integration)[source]

Creates a source object to represent a Kinesis Stream as a data source for a collection.

Parameters
  • stream_name (str) – Name of the Kinesis Stream

  • integration (rockset.Integration) – An Integration object (optional)

classmethod mongo(database_name, collection_name, integration)[source]

Creates a source object to represent a MongoDB collection as a data source for a Rockset collection.

Parameters
  • database_name (str) – Name of the MongoDB database

  • collection_name (str) – Name of the MongoDB collection

  • integration (rockset.Integration) – An Integration object

classmethod redshift(database, schema, table_name, integration)[source]

Creates a source object to represent an AWS Redshift table as a data source for a collection.

Parameters
  • database (str) – Name of the Redshift database

  • schema (str) – Name of the Redshift schema

  • table_name (str) – Name of the Redshift table

  • integration (rockset.Integration) – An Integration object (optional)

classmethod s3(bucket, prefixes=None, prefix=None, pattern=None, integration=None, format_params=None)[source]

Creates a source object to represent an AWS S3 bucket as a data source for a collection.

Parameters
  • bucket (str) – Name of the S3 bucket

  • prefix – Path prefix to only source S3 objects that are recursively within the given path. (optional)

  • pattern – Path pattern to only source S3 objects that match the given pattern. (optional)

  • integration (rockset.Integration) – An Integration object (optional)

  • format_params (FormatParams) – the specifications of the format, CsvParams or XmlParams

classmethod xml_params(root_tag=None, doc_tag=None, encoding=None, value_tag=None, attribute_prefix=None)[source]

Creates an object to represent options needed to parse an XML file.

Parameters
  • root_tag (str) – Outermost tag within an XML file to be treated as the root. Any content outside the root tag is ignored.

  • doc_tag (str) – Every Rockset document is contained between a <doc_tag> and a </doc_tag>.

  • encoding (str) – The encoding format of data. [default: ‘UTF-8’]

  • value_tag (str) – Tag used for the value when there are attributes in the element having no child. [default: ‘value’]

  • attribute_prefix (str) – Attributes are transformed into key-value pairs in a Rockset document. This prefix is used to tell attributes apart from nested tags in a Rockset document.
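For example, a sketch that pairs xml_params() with an S3 source (tag, bucket, integration, and collection names are illustrative):

from rockset import Client
rs = Client()

xml = rs.Source.xml_params(root_tag='catalog', doc_tag='book')

integration = rs.Integration.retrieve('aws-rockset-read-only')
s3 = rs.Source.s3(bucket='my-xml-bucket',
    integration=integration,
    format_params=xml)
newcoll = rs.Collection.create(name='from_xml', sources=[s3])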


Workspace

Usage

Workspace objects represent a container of other workspaces and Rockset collections.

Example

from rockset import Client, Q, F

# connect securely to Rockset
rs = Client()

# create a workspace
rs.Workspace.create('marketing')

# create a collection in the workspace
user_events = rs.Collection.create('user_events', workspace='marketing')

Create a new workspace

Creating a workspace using the Client object is as simple as calling client.Workspace.create("my-new-workspace"):

from rockset import Client
rs = Client()
new_ws = rs.Workspace.create("my-new-workspace")

List all workspaces

List all workspaces using the Client object:

from rockset import Client
rs = Client()
workspaces = rs.Workspace.list()

Retrieve an existing workspace

Retrieve a workspace to run various operations on that workspace:

from rockset import Client
rs = Client()
marketing = rs.Workspace.retrieve('marketing')

Drop a workspace

Use the drop() method to remove a workspace permanently from Rockset.

Note

This is a permanent and non-recoverable operation. Beware.

from rockset import Client
rs = Client()
marketing = rs.Workspace.retrieve('marketing')
marketing.drop()