Elasticsearch Query Function

🔐
Elasticsearch Query Function is in Private Preview. Contact Rockset support to enable this feature.

The ES() table function allows you to embed Elasticsearch's DSL query natively within a Rockset SQL query. When you are migrating from Elasticsearch to Rockset, you can use the ES() table function to run all of your Elasticsearch queries without having to manually translate them from Elasticsearch DSL to native Rockset SQL.

Syntax

The ES() table function takes two input parameters: the collection name as a string parameter and the Elasticsearch DSL query as an object/map. Similar to any other table function, the ES() function needs to be used in the FROM clause of your SQL query.

SELECT
    *
FROM   
    ES('<collection-name>', <Elasticsearch DSL Query>)

Examples

SELECT
    *
FROM   
    ES('_events', 
       {'query': {'term': {'userEmail': {'value': '[email protected]'}}}}
    )

Collection names have to be passed in as a string. You can use the dot-notation to specify the workspace name and use nested double-quotes to escape special characters (eg: -) inside workspace names or collection names.

Here are some examples with workspace names and collection names:

SELECT * FROM ES('_events', ...)
SELECT * FROM ES('commons._events', ...)
SELECT * FROM ES('"pre-prod"."user-logins"', ...)

If you have Elasticsearch DSL query as an map or an object in your application, then encode that as a JSON string and then use the JSON_PARSE() function in your SQL query.

Example parameterized SQL when your Elasticsearch DSL is a JSON-encoded string:

SELECT
    *
FROM   
    ES('_events', 
       JSON_PARSE('{"query": {"term": {"userEmail": {"value": "[email protected]"}}}}')
    )

Please note that in the example above how single-quotes and double-quotes are used. In JSON encoded strings, double quotes are used to denote strings. In SQL, single-quotes denote strings. So, please keep that in mind when constructing these types of queries by hand or programmatically.

Elasticsearch query example

If you have an ElasticSearch term query such as the one below:

GET /_search
{
  "query": {
    "term": {
      "user_id": {
        "value": "kimchy"
      }
    }
  }
}

You can simply run that exact same query in Rockset as follows assuming the same data is stored in a Rockset collection named foo

SELECT
    *
FROM
    ES('foo',
        {
          'query': {
            'term': {
              'user_id': {
                'value': 'kimchy'
              }
            }
          }
        }
    )

Please note the single-quotes to denote strings in the second parameter.

SQL Template for App Integration

When you want to automatically translate Elasticsearch DSL queries from your application to Rockset SQL, you can use the following parameterized query as your SQL template. This query takes two parameters, one for the collection name and the other for the Elasticsearch's DSL query encoded as a JSON string.

SELECT * FROM ES(:collection, JSON_PARSE(:esquery))

Performance

ES() function takes the input Elasticsearch DSL query and compiles it directly into Rockset’s internal IR code. This is the exact same IR code that all Rockset SQL queries gets compiled into before Rockset’s query optimization and query planning is done. So, queries that run through the ES() function will suffer no performance penalties when compared against the equivalent SQL query that natively runs on Rockset.

For example the query execution plan and hence the query performance of the following two queries will be exactly the same.

Query using the ES() function:

SELECT
    *
FROM
    ES('foo',
       {
         'query': {
           'term': {
             'user_id': {
               'value': 'kimchy'
             }
           }
         }
       }
    )

Same query written natively in SQL

SELECT
    *
FROM
    `foo` f
WHERE
    f.user_id = 'kimchy'

When you run EXPLAIN on both those queries it will yield the same execution plan as below

select *:$2
  reshuffle on_aggregators(1)
    add fields on commons.foo: fields($2=*)
      index filter on commons.foo: fields($1=user_id), query($1:string['kimchy'])

Elasticsearch Join Queries

Elasticsearch join queries (nested, has_child, has_parent) are not supported out of the box. Such queries will have to be broken down into smaller ES fragments and then combined with native SQL JOIN functionality available in Rockset.

Here is an example query in Rockset that JOINs the results of two Elasticsearch term queries in a single SQL query.

-- Find all purchases of product_id = 42 from users who live in Palo Alto 
SELECT
    *
FROM
    ES('users',
       {
         'query': {
           'term': {
             'city': {
               'value': 'Palo Alto'
             }
           }
         }
       }
    )
    JOIN 
    ES('purchases',
       {
         'query': {
           'term': {
             'product_id': {
               'value': 42
             }
           }
         }
       }
    ) ON users.user_id = purchases.customer_id

Elasticsearch Sort Search Result Queries

sort keyword is supported and the results are guaranteed to be in the desired order even if the SQL wrapper is missing an explicit ORDER BY clause.

For example, the following three queries will return the exact same results:

Query using sort keyword within Elasticsearch DSL

SELECT
    *
FROM
    ES('foo',
       {
         'sort' : [
           { 'post_date' : {'order' : 'asc'}},
           'user',
           { 'name' : 'desc' },
           { 'age' : 'desc' }
         ],
         'query' : {
           'term' : { 'user' : 'kimchy' }
         }
       }
    )

Equivalent query using an explicit ORDER BY clause instead of the sort keyword within Elasticsearch DSL

SELECT
    *
FROM
    ES('foo',
       {
         'query' : {
           'term' : { 'user' : 'kimchy' }
         }
       }
    )
ORDER BY 
    post_date ASC, user, name DESC, age DESC

Equivalent native SQL query:

SELECT 
    * 
FROM 
    foo
WHERE 
    user = 'kimchy'
ORDER BY 
    post_date ASC, user, name DESC, age DESC

Elasticsearch Aggregations

ES() function has limited support for aggs keyword. Only 1 top level aggregation is supported with only 1 level of sub-aggregation. Multiple sub-aggregations are allowed.

Here is aggregation query example that is supported by ES()

SELECT
    *
FROM
    ES('foo',
       {
         'aggs': {
           'my-agg-name': {
             'terms': {
               'field': 'my-field'
             },
             'aggs': {
               'avg-other-field': {
                 'avg': {
                   'field': 'my-other-field'
                 }
               },
               'max-third-field': {
                 'max': {
                   'field': 'my-third-field'
                 }
               }
             }
           }
         }
       }
    )

The query above will return the exact results as the following SQL query.

SELECT
    COUNT(*) AS _count,
    AVG("my-other-field") AS "avg-other-field",
    MAX("my-third-field") AS "max-third-field"
FROM 
    foo
GROUP BY 
    "my-field"

Elasticsearch Term-level Queries

Supported keywords:
- exists, ids, prefix, range, regexp, term, wildcard
Unsupported keywords:
- fuzzy, term_set

Elasticsearch Compound Queries

bool: supported without scoring
- must is supported the same way as filter
- should is only supported withminimum_should_match=1
boosting, constant_score, dis_max, function_score are not supported.

Elasticsearch Full Text Queries

Limited support: match, match_phrase, multi_match
- Requires tokenizing string data during ingestion.
- multi_match requires listing the match fields explicitly.

Known Limitations

ES function doesn’t support score, which means any keywords involves calculating scoring are not supported, such as boosting, constant_score, dis_max, or function_score.
Fuzzy search is not supported.
match requires tokenizing the string data during ingest time. Or using prefix or regex function as replacement.
multi_match includes match limitation, and requires explicitly listing the fields to be matched.
query_string and simple_query_string are not supported for Lucene queries.
should is only supported with minimum_should_match=1