Ingestion Examples
Anonymization and Casting
SELECT
*
EXCEPT
-- drop some fields
(extra_data, more_extra_data, email_address),
-- type coercion
TRY_CAST(last_updated_at AS DATETIME) AS last_updated_at,
-- PII/PHI masking
TO_HEX(SHA256(email_address)) AS email_address_hash,
FROM
_input
WHERE
-- filter input documents
email_address IS NOT NULL
This query keeps all input fields by default, except for extra_data
, more_extra_data
, and email_address
. It adds two new fields to the final document: last_updated_at
and email_address_hash
. If there are any input fields named last_updated_at
or email_address_hash
in the input document, their values in the final document will be the values of the specified expressions, not their values in the input document. It also filters out all input documents where email_address
is null
.
Data Clustering
SELECT cntry AS country, cont AS continent, capt AS capital, gdp_usd as gross_domestic_product, ROUND(population / 1e6) AS population_milliions
FROM _input
WHERE population_millions > 0
CLUSTER BY continent, country
This query drops all input fields by default since there is no *
in the SELECT
clause, and instead it explicitly renames several input fields to their cleaner final names country
, continent
, capital
, gross_domestic_product
and computes pouplation_millions
from the input field population
. The WHERE
clause filters out any documents corresponding to countries with a population less than 1M. And data clustering is configured on the Collection with clustering key (continent, country)
which will speed up queries that filter on continent
, or continent
and country
significantly.
CDC Control Records and _op
SELECT
user_id AS _id,
IF(_input.header."delete" = true, 'DELETE', 'UPSERT') AS _op,
IF(_input.header."delete" = true, null, PARSE_TIMESTAMP('%Y-%m-%d %H:%M:%E*S', created_at_ts)) AS _event_time,
* EXCEPT (user_id, header, created_at_ts)
FROM _input
This query maps user_id
from the source documents to _id
in Rockset and interprets _op
based on whether a field is set in the header. In the case of a new record coming in, header.delete
will be false (or null) and both IF
statements will execute their second branch which will set _op = 'UPSERT'
and will parse _event_time
from _created_at_ts
. All fields except the mapped ones and the header will then be inserted into Rockset. When a delete comes through and header.delete
is true, then the first branch of the IF
statements will execute which will set _op = DELETE
and leave _event_time
as null. This will issue a delete for the document corresponding to _id
.
Precomputing Fields
SELECT
twitter_handle,
follower_count,
-- search tokenization
TOKENIZE(tweet_text, 'en_US') AS tweet_text_tokens,
-- functional index
REGEXP_EXTRACT_ALL(
email_address,
'^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)',
2
) AS email_address_domain,
-- native geography type
ST_GEOGPOINT(longitude, latitude) AS geo,
FROM _input
This query drops all input fields by default and only keeps the input fields twitter_handle
and follower_count
. It also adds three new fields to our final document: tweet_text_tokens
, email_address_domain
, and geo
, each of which can be used to speed up queries that would otherwise have to run expensive operations at query time on the raw input fields.
Using UNDEFINED
SELECT
*,
IF(user IS NOT NULL, CONCAT(user, '@rockset.com'), UNDEFINED) AS email_address
FROM _input
This query keeps all input fields by default and adds one new field to the final document: email_address
.
If
user
isnull
, theemail_address
field will be set toundefined
in the final document.Setting a field’s value to
undefined
in an ingest transformation means that the field will not be present in the final document. Thus,email_address
will only be present in the final document if the input document’suser
field is notnull
. See documentation on undefined for more details.
Updated about 1 year ago