SQL Reference > Commands

Commands

This page documents all available SQL commands in Rockset.

In command templates shown here, all uppercase words are keywords and lowercase words are user-provided values. In practice, all keywords and function names are case insensitive; for example, describe "_events" is equivalent to DESCRIBE "_events".

DESCRIBE

DESCRIBE collection_name

The output of DESCRIBE has the following fields:

  • field: Every distinct field name in the collection
  • type: The data type of the field
  • occurrences: The number of documents that have this field in the given type
  • total: Total number of documents in the collection for top level fields, and total number of documents that have the parent field for nested fields

Note that we use the delimiter * (glob) in the field name to indicate descending into an array. It can be replaced by an integer index to construct an instance of the field path. If multiple data types appear for a certain field path, DESCRIBE will return one entry for each type.

Suppose a collection names has the following documents.

{
    "name": "John",
    "age": 31,
    "city": "New York"
},
{
    "name": "Michael",
    "age": "notfound",
    "experiences": [
        {
            "title": "XYZ", 
            "years": 4
        }
    ]
}
DESCRIBE names
+--------------------------------------+---------------+---------+-----------+
| field                                | occurrences   | total   | type      |
|--------------------------------------+---------------+---------+-----------|
| ['name']                             | 2             | 2       | string    |
| ['age']                              | 1             | 2       | string    |
| ['age']                              | 1             | 2       | int       |
| ['city']                             | 1             | 2       | string    |
| ['experiences']                      | 4             | 4       | string    |
| ['experiences', '*']                 | 1             | 1       | object    |
| ['experiences', '*', 'title']        | 1             | 1       | string    |
| ['experiences', '*', 'years']        | 1             | 1       | int       |
+--------------------------------------+---------------+---------+-----------+

Let’s look at how DESCRIBE works with collections that have documents with each of the following properties:

  • Mixed Types
  • Nested Objects
  • Sparse Fields and Null Values
  • Nested Arrays

Mixed Types

Consider a collection called zipcodes with the following documents:

{"zipcode": 94542},
{"zipcode": "91126"},
{"zipcode": 94110.0},
{"zipcode": "94020"}

Every document in this collection has the same field called zipcode, but its type is different across the collection. This is captured below.

DESCRIBE zipcodes
+-------------+-------------+-------+--------+
| field       | occurrences | total | type   |
|-------------+-------------+-------+--------|
| ['zipcode'] | 1           | 4     | int    |
| ['zipcode'] | 2           | 4     | string |
| ['zipcode'] | 1           | 4     | float  |
+-------------+-------------+-------+--------+

We can use the typeof SQL function to get all zipcodes of type 'string’:

SELECT
    zipcode
FROM
    zipcodes
WHERE
    TYPEOF(zipcode) = 'string'
+-----------+
| zipcode   |
|-----------|
| 94020     |
| 91126     |
+-----------+

And to cast field zipcode as string as we retrieve it we can do the following:

SELECT
    zipcode::string zipcode
FROM
    zipcodes
+-----------+
| zipcode   |
|-----------|
| 94020     |
| 94110     |
| 94542     |
| 91126     |
+-----------+

Nested Objects

Documents can have objects with scalars, nested arrays, or nested objects. Consider a simple collection of documents with names of Turing Award winners. We call this collection turing_award_winners.

{"name": {"first": "John", "last": "HopCroft"}},
{"name": {"first": "Robert", "last": "Tarjan"}},
{"name": {"first": "Alan", "last": "Kay"}},
{"name": {"first":"Edsger", "last": "Dijkstra"}}

Each document has an object with a field called name that has nested fields first and last. To access nested fields inside objects, we concatenate the field names with a . (dot) as separator. For example, use name.first and name.last to access the first and last names, respectively, in each of these documents. The fields name.first and name.last are present in all documents as scalars of type 'string’. Hence, the occurrences and total for each field in this document will be the same as the total number of documents.

DESCRIBE turing_award_winners
+--------------------------------------+---------------+---------+-----------+
| field                                | occurrences   | total   | type      |
|--------------------------------------+---------------+---------+-----------|
| ['name']                             | 4             | 4       | object    |
| ['name', 'first']                    | 4             | 4       | string    |
| ['name', 'last']                     | 4             | 4       | string    |
+--------------------------------------+---------------+---------+-----------+

Say we want to list all the first and last names from this collection.

SELECT
    turing_award_winners.name.first,
    turing_award_winners.name.last
FROM
    turing_award_winners
+---------+----------+
| first   | last     |
|---------+----------|
| Alan    | Kay      |
| John    | HopCroft |
| Robert  | Tarjan   |
| Edsger  | Dijkstra |
+---------+----------+

Sparse Fields and Null Values

Suppose collection turing_award_winners now has the following documents.

{"name": {"first": "John", "last": "HopCroft"}},
{"name": {"first": "Robert", "last": "Tarjan"}},
{"name": {"first": "Alan", "middle": "Curtis", "last": "Kay"}},
{"name": {"first": "Edsger", "last": "Dijkstra"}}

Notice that this is similar to the collection we were working with above with an additional nested field middle added to the third document. name.middle is a sparse field which is a string in 1 document. It is treated as a SQL NULL or undefined in the other 3 documents that it is absent in. Note that this is different from JSON NULL. This is clarified in further detail here. DESCRIBE helps capture such sparse fields as well.

DESCRIBE turing_award_winners
+--------------------------------------+---------------+---------+-----------+
| field                                | occurrences   | total   | type      |
|--------------------------------------+---------------+---------+-----------|
| ['name']                             | 4             | 4       | object    |
| ['name', 'first']                    | 4             | 4       | string    |
| ['name', 'last']                     | 4             | 4       | string    |
| ['name', 'middle']                   | 1             | 4       | string    |
+--------------------------------------+---------------+---------+-----------+

While fields with type SQL NULL are not captured in the smart schema, those with type JSON NULL are. Say the last document in the collection was {"name": {"first": "Edsger", "middle": null, "last": "Dijkstra"}} instead.

DESCRIBE turing_award_winners
+-------------------+----------------+---------+-----------+
| field              | occurrences   | total   | type      |
|------------------------------------+---------------------+
| ['name']           | 4             | 4       | object    |
| ['name', 'first']  | 4             | 4       | string    |
| ['name', 'last']   | 4             | 4       | string    |
| ['name', 'middle'] | 1             | 4       | null_type |
| ['name', 'middle'] | 1             | 4       | string    |
+----------------------------------------------+-----------+

The predicate IS/IS NOT NULL can be used to filter SQL NULLs. But it does not filter JSON NULLs (null_type).

SELECT
    COUNT(*)
FROM
    turing_award_winners
WHERE
    turing_award_winners.name.middle IS NOT NULL
+----------+
| ?count   |
|----------|
| 2        |
+----------+
SELECT
    turing_award_winners.name.middle
FROM
    turing_award_winners
WHERE
    turing_award_winners.name.middle IS NOT NULL
+----------------------+
| middle               |
|----------------------|
| Curtis               |
| <null>               |
+----------------------+

This <null> in the second row of the result is a JSON NULL. In order to filter fields with type null_type we can use the typeof SQL function or we can type cast the field using <field>::<type>.

SELECT
     turing_award_winners.name.middle
 FROM
     turing_award_winners
 WHERE
     turing_award_winners.name.middle IS NOT NULL
     and TYPEOF(turing_award_winners.name.middle) != 'null_type'
+----------+
| middle   |
|----------|
| Curtis   |
+----------+
SELECT
     turing_award_winners.name.middle
 FROM
     turing_award_winners
 WHERE
     turing_award_winners.name.middle::string IS NOT NULL
+----------+
| middle   |
|----------|
| Curtis   |
+----------+

Nested Arrays

Just like nested objects, documents can also have arrays comprising of nested fields that are scalars, objects, or arrays. Consider the following collection of documents with Turing Award winners by year. Let’s call this collection turing_award_winners_by_year.

{
    "year" : "1972",
    "winners" : [
                    {
                        "first" : "Edsger",
                        "last" : "Dijkstra",
                        "subjects" : [ "Program Verification", "Programming" ]
                    }
                ]
},
{
    "year" : "1986",
    "winners" : [
                    {
                        "first" : "John",
                        "last" : "HopCroft",
                        "subjects" : [ "Analysis of Algorithms", "Data Structures" ]
                    },
                    {
                        "first" : "Robert",
                        "last" : "Tarjan",
                        "subjects" : [ "Analysis of Algorithms", "Data Structures" ]
                    }
               ]
},
{
    "year" : "2003",
    "winners" : [
                    {
                        "first" : "Alan",
                        "last" : "Kay",
                        "subjects" : [ "Personal Computing", "Programming Languages" ]
                    }
                ]
}

This shape of this collection is succinctly captured in the smart schema below.

DESCRIBE turing_award_winners_by_year
+--------------------------------------------+---------------+---------+-----------+
  | field                                      | occurrences   | total   | type      |
  |--------------------------------------------+---------------+---------+-----------|
  | ['winners']                                | 3             | 3       | array     |
  | ['winners', '*']                           | 4             | 4       | object    |
  | ['winners', '*', 'first']                  | 4             | 4       | string    |
  | ['winners', '*', 'last']                   | 4             | 4       | string    |
  | ['winners', '*', 'subjects']               | 4             | 4       | array     |
  | ['winners', '*', 'subjects', '*']          | 8             | 8       | string    |
  | ['year']                                   | 3             | 3       | string    |
  +--------------------------------------------+---------------+---------+-----------+

The occurrence count for each of the nested elements is equal to the number of array elements of the given type, summed across all documents in the collection.

The total count for each such entry is equal to the total number of array elements at that level of nesting, summed across all documents in the collection. Note that we use the delimiter ‘*’ (glob) in the field name to indicate descending into an array. Further, Rockset treats arrays as virtual collections and allows you to DESCRIBE arrays as well. When using a nested array as a target collection in queries we use the delimiter : (colon) as a separator between the root collection and the nested fields.

DESCRIBE turing_award_winners_by_year:winners[*]
+------------------------+---------------+---------+--------+
| field                  | occurrences   | total   | type   |
|------------------------+---------------+---------+--------|
| ['*']                  | 4             | 4       | object |
| ['*', 'first']         | 4             | 4       | string |
| ['*', 'last']          | 4             | 4       | string |
| ['*', 'subjects']      | 4             | 4       | array  |
| ['*', 'subjects', '*'] | 8             | 8       | string |
+------------------------+---------------+---------+--------+
DESCRIBE turing_award_winners_by_year:winners[*].subjects
+---------+---------------+---------+--------+
| field   | occurrences   | total   | type   |
|---------+---------------+---------+--------|
| ['*']   | 8             | 8       | string |
+---------+---------------+---------+--------+

SELECT

[ WITH with_query [, ...] ]
SELECT [ ALL | DISTINCT ] select_expr [, ...]
[ FROM from_item [, ...] ]
[ WHERE condition ]
[ GROUP BY [ ALL | DISTINCT ] grouping_element [, ...] ]
[ HAVING condition]
[ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select_query ]
[ ORDER BY expression [ ASC | DESC ] [, ...] ]
[ LIMIT [ count | ALL ] ]

Returns result set from one or more collections. May include various other clauses:

  • with_query is of the form alias_name AS select_query.
  • select_expr is a valid SQL expression.
  • from_item is one of:
    • collection_name [ [ AS ] alias [ ( column_alias [, ...] ) ] ] [ TABLESAMPLE BERNOULLI (percentage) ]
      • percentage is the percentage of input to be sampled (ranges from 0 to 100)
    • from_item join_type from_item [ ON join_condition | USING ( join_column [, ...] ) ] where join_type is one of:
      • [ INNER ] JOIN
      • LEFT [ OUTER ] JOIN
      • RIGHT [ OUTER ] JOIN
      • CROSS JOIN (which should not be followed by ON/USING, all other should)
  • condition is a SQL expression that evaluates to a boolean. See the section on conditional operators.
  • grouping_element is one of:
    • ()
    • expression
    • GROUPING SETS ( ( column [, ...] ) [, ...] )
    • CUBE ( column [, ...] )
    • ROLLUP ( column [, ...] )
  • select_query is itself a SELECT command with the same recursive format.
  • expression is a valid SQL expression.
  • count is the number of results to be returned.

Below we cover common uses some clauses in more detail.

To return all available fields, instead of listing them out as select expressions, specify a single select expression *.

SELECT * FROM _events
+--------------------------------------+---------------+--------+---------------+------+
| _id                                  | eventTime     | kind   | label         | type |
+--------------------------------------+---------------+--------+---------------+------+
| cf0e193d-9ee0-4c6a-9f2b-cab9893e97e3 | 1536164080248 | QUERY  | QUERY_SUCCESS | INFO |
| 4a5e941e-7683-4db2-b8e0-ee0e4807d6a7 | 1536164079995 | QUERY  | QUERY_SUCCESS | INFO |
| 6a9c5d49-ee3b-4b78-b653-006c540f70c3 | 1536162113586 | QUERY  | QUERY_ERROR   | INFO |
+--------------------------------------+---------------+--------+---------------+------+

To return only unique results, include DISTINCT after SELECT.

SELECT DISTINCT label FROM _events
+---------------+
| label         |
+---------------+
| QUERY_SUCCESS |
| QUERY_ERROR   |
+---------------+

WITH Clause

The WITH clause can be used to chain queries by defining a named subquery that can be then used within the main query. Using WITH clauses can improve the readablility and maintainability of complex queries.

WITH x AS (SELECT _id, label FROM _events)
SELECT _id, label FROM x
+--------------------------------------+---------------+
| _id                                  | label         |
|--------------------------------------+---------------|
| 23658547-1ae5-4be8-97e5-6e043f682c31 | QUERY_SUCCESS |
| 85e16122-d749-444d-ab63-bb5a0544368b | QUERY_SUCCESS |
| 972485f2-e3b0-43c4-bb7e-c70b7df3e86c | QUERY_SUCCESS |
+--------------------------------------+---------------+

There can be multiple subqueries in a WITH clause, and one can use any of the previous ones.

WITH
    x AS (SELECT _id, label, kind FROM _events),
    y AS (SELECT _id, kind AS c FROM x)
SELECT _id, c FROM y
+--------------------------------------+-------+
| _id                                  | c     |
|--------------------------------------+-------|
| 589eaee9-0c62-46f9-8f7a-0a5a3fcf2e2c | QUERY |
| 3e2e99b9-896f-4852-9669-ab5d1a1a6ebf | QUERY |
| a2b5d818-cad6-49b8-8f9d-bcc77146e060 | QUERY |
+--------------------------------------+-------+

FROM Clause

When querying a single collection, the FROM clause consists simply of the collection name. However, the FROM clause can also be used to reshape the data that is being queried.

UNNEST

UNNEST is a function that can be used to expand arrays of values or documents to be queried.

Suppose collection companies contains documents of the form:

{
    "make": "Ford",
    "models": ["Focus", "Mustang", "F-150"],
    "offices": [
        {"city": "Dearborn", "state": "MI"},
        {"city": "Denver", "state": "CO"}
    ]
},
{
    "make": "Toyota",
    "models": ["Prius", "Highlander", "Camry"],
    "offices": [
        {"city": "Jacksonville", "state": "FL"},
        {"city": "Ann Arbor", "state": "MI"}
    ]
}

All the values in all the models arrays can be extracted and dumped as individual rows available under an alias, as shown in the example below.

SELECT 
    models.models
FROM 
    companies,
    UNNEST(companies.models AS models) AS models
+------------+
| models     |
|------------|
| Prius      |
| Highlander |
| Camry      |
| Focus      |
| Mustang    |
| F-150      |
+------------+

The index within the original array can also be extracted by appending WITH ORDINALITY AS alias_name inside the UNNEST function.

SELECT 
    models.value, models.index
FROM
    companies,
    UNNEST(companies.models AS value WITH ORDINALITY AS index) AS models
+---------+------------+
| index   | value      |
|---------+------------|
| 1       | Prius      |
| 2       | Highlander |
| 3       | Camry      |
| 1       | Focus      |
| 2       | Mustang    |
| 3       | F-150      |
+---------+------------+

When the unnested array contains objects, its fields become available for normal querying.

SELECT 
    offices.state,
    ARRAY_AGG(offices.city) AS cities
FROM 
    companies,
    UNNEST(companies.offices AS value) AS offices
GROUP BY offices.state
+---------------------------+---------+
| cities                    | state   |
|---------------------------+---------|
| ['Denver']                | CO      |
| ['Jacksonville']          | FL      |
| ['Dearborn', 'Ann Arbor'] | MI      |
+---------------------------+---------+

JOIN

Rockset supports standard SQL joins.

  • An INNER JOIN returns only results where there is a match on both sides.
  • A LEFT (or RIGHT) OUTER JOIN returns all results from the left (or right) side, populating the other side’s columns when possible and with null otherwise.
  • A CROSS JOIN returns the Cartesian product of the two sides’ results (i.e all combinations of documents in two collections).
SELECT collection1.name, collection2.name
FROM collection1 JOIN collection2
ON collection1._id = collection2._id

Cross joins can either be specified using the explicit CROSS JOIN syntax or by specifying multiple collections in the FROM clause. The queries below are equivalent.

SELECT collection1.id, collection2.id
FROM collection1, collection2
SELECT collection1.id, collection2.id
FROM collection1
CROSS JOIN collection2

TABLESAMPLE

TABLESAMPLE returns approximately the specified percentage of results.

Rockset only supports BERNOULLI sampling.

SELECT id, name FROM collection1 TABLESAMPLE BERNOULLI (50)

Returns approximately 50% of the results that would have been otherwise obtained.

SELECT avg(height) FROM people TABLESAMPLE BERNOULLI (25)

Returns the average of the height field based on approximately 25% of rows in the table.

GROUP BY Clause

Often, interesting results are obtained by grouping records together on some similar attributes.

GROUP BY is the mechanism by which the result of a SELECT is partitioned based on a particular field. Non-aggregated fields can only be included in aggregate functions in select expressions with the presense of a GROUP BY clause.

A simple GROUP BY clause may contain any expression composed of input columns.

SELECT count(*) AS c, label FROM _events GROUP BY label
+--------+------------------------------+
| c      | label                        |
|--------+------------------------------|
| 41     | API_KEY_DELETED              |
| 35     | API_KEY_CREATED              |
| 88256  | QUERY_SUCCESS                |
| 133    | QUERY_INVALID                |
| 24     | API_KEY_ERROR                |
| 72     | INGEST_WARNING               |
| 16     | COLLECTION_DROPPED           |
| 121988 | INGEST_INFO                  |
+--------+------------------------------+

HAVING is used to filter results, based on the output of aggregate functions.

SELECT count(*) AS c, label FROM _events 
GROUP BY label 
HAVING count(*) < 50
+--------+------------------------------+
| c      | label                        |
|--------+------------------------------|
| 41     | API_KEY_DELETED              |
| 35     | API_KEY_CREATED              |
| 24     | API_KEY_ERROR                |
| 16     | COLLECTION_DROPPED           |
+--------+------------------------------+

For more complex aggregations, Rockset supports GROUPING SETS, CUBE and ROLLUP. These can be used to perform aggregations on multiple fields in a single query.

Functions are not supported in GROUP BY clause. Consider using a subquery instead.

GROUPING SETS

Grouping sets allow users to specify multiple lists of fields to group on. The fields that are not part of a given sublist of grouping fields are set to null. It can logically be thought of as the UNION of a single SELECT query with a simple GROUP BY for each GROUPING SET.

SELECT label, kind FROM _events
GROUP BY GROUPING SETS (
    (label),
    (label, kind)
)
+--------------+------------------------------+
| kind         | label                        |
|--------------+------------------------------|
| null         | API_KEY_DELETED              |
| API_KEY      | API_KEY_DELETED              |
| COLLECTION   | COLLECTION_CREATED           |
| null         | COLLECTION_CREATED           |
| QUERY        | QUERY_SUCCESS                |
| QUERY        | QUERY_COLLECTION_NOT_READY   |
| INGEST       | INGEST_INFO                  |
| INGEST       | INGEST_INITIALIZED           |
+--------------+------------------------------+

CUBE

CUBE is a higher level operator that can be used to generate GROUPING SETS. It generates all possible GROUPING SETS given a set of fields.

SELECT label, kind FROM _events
GROUP BY CUBE (label, kind)

This is equivalent to:

SELECT label, kind FROM _events
GROUP BY GROUPING SETS (
    (label, kind),
    (label),
    (kind),
    ()
)

ROLLUP

The ROLLUP operator is similar to CUBE in that it’s a higher order abstraction over GROUPING SETS. It generates all possible starting subsequences for a given set of fields.

SELECT label, kind FROM _events
GROUP BY ROLLUP (label, kind)

This is equivalent to:

SELECT label, kind FROM _events
GROUP BY GROUPING SETS (
    (label, kind),
    (label),
    ()
)

ORDER BY Clause

The ORDER BY clause is used to sort a result set by one or more output expressions. By default ordering for an expression is ascending. Use DESC to get result set in descending order.

SELECT * FROM users
ORDER BY user.username DESC

Use multiple expressions, each with a different ordering, by separating with commas.

SELECT * FROM users
ORDER BY users.first_name ASC, users.last_name DESC

LIMIT Clause

The LIMIT clause restricts the number of rows in the result. The following query limits the result set to contain a maximum of 10 rows. The result set represents the top 10 customers based on customer rating.

SELECT * FROM customers
LIMIT 10
ORDER BY customers.customer_rating DESC

Combining Result Sets

UNION returns the combination of results of two SELECT queries.

SELECT id, name FROM collection1
UNION
SELECT userId, name FROM collection2

INTERSECT returns only distinct common results of two SELECT queries.

SELECT id, name FROM collection1
INTERSECT
SELECT userId, name FROM collection2

EXCEPT returns only distinct results of the first query that are not in the second query.

SELECT id, name FROM collection1
EXCEPT
SELECT userId, name FROM collection2

INSERT INTO

INSERT INTO collection_name
[ (column_alias [, ...]) ]
select_query

Inserts result of a query into a collection.

  • collection_name is the collection into which query results are ingested.
  • column_alias renames fields in the result rows. If specified, this list must contain as many items as the number of fields selected by the select_query
  • select_query is a valid SELECT command.

Note that ingestion of query results into the specified collection begins after SELECT command finishes successfully. The ingestion happens in the background and for a query that finished successfully it is guaranteed to finish eventually.

Below we cover common uses of INSERT INTO.

To update some documents in a collection. Note that following queries select _id column, so a document with specified id gets updated.

INSERT INTO collection1 SELECT _id, price + 10 AS price FROM collection1 where price < 5
INSERT INTO collection1 SELECT _id, NULL_VALUE AS age FROM collection1

To replicate a collection.

INSERT INTO collection2 SELECT * FROM collection1