This page describes what field mappings are, and how to create collections with them.
A field mapping allows you to specify transformations to be applied to all documents inserted into a collection. Any valid SQL expression can be used to describe the transformation logic in a field mapping definition.
Every ingested document is passed through the set of field mapping transformations specified at collection creation time. Each field mapping takes a set of fields from the incoming document, evaluates some SQL expression with the values of these input fields as named parameters (think Python kwargs), and stores the result of the SQL expression as another field in the document.
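As a mental model only (this is not how Rockset implements field mappings internally), each mapping behaves like a small function applied to every document. The hypothetical apply_mapping helper below illustrates the kwargs-style parameter passing described above:

```python
# Conceptual sketch of a field mapping (apply_mapping is a hypothetical
# helper, not a Rockset API): pull the named input fields out of the
# document, evaluate an expression with them as keyword arguments, and
# store the result under the output field name.
def apply_mapping(doc, input_fields, expression, output_field):
    params = {f: doc.get(f) for f in input_fields}  # like SQL named parameters
    doc[output_field] = expression(**params)        # "think Python kwargs"
    return doc

doc = apply_mapping(
    {"first": "Ada", "last": "Lovelace"},
    input_fields=["first", "last"],
    # The lambda stands in for a SQL expression over the input fields.
    expression=lambda first, last: first + " " + last,
    output_field="full_name",
)
# doc["full_name"] == "Ada Lovelace"
```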
Input fields are optional in a field mapping specification. For example,
CURRENT_TIMESTAMP() is a valid field mapping SQL expression that can be used to record the insertion time for every input record.
The output field is also optional in a field mapping specification. This is common when the field mapping exists only to drop one or more input fields.
The same field name can be specified as both input and output field, if you wish to overwrite the field with the result of the field mapping SQL expression.
Any field may be used as the output of a field mapping. Special fields require the result of the field mapping SQL expression to be of a specific type:
_id: The field mapping SQL expression must return a string that is used as a primary key (so it must be unique across all documents in a collection).
_event_time: The field mapping SQL expression must return a timestamp.
All field mappings specified will be executed independently of each other and in parallel, so one field mapping specification should not depend on the output of another field mapping.
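To illustrate this independence, here is a small conceptual sketch (the apply_mappings helper is hypothetical, not part of any Rockset API): every mapping is evaluated against the original incoming document, and the outputs are merged in afterwards, so no mapping can observe another mapping's output.

```python
# Conceptual sketch only, not Rockset internals.
def apply_mappings(doc, mappings):
    # Evaluate every mapping against the UNMODIFIED input document.
    outputs = {out_field: fn(doc) for out_field, fn in mappings.items()}
    result = dict(doc)
    result.update(outputs)  # merge results after all mappings have run
    return result

mappings = {
    "a_plus_1": lambda d: d["a"] + 1,
    # This mapping sees the ORIGINAL value of "a",
    # never the output of "a_plus_1".
    "a_times_2": lambda d: d["a"] * 2,
}
apply_mappings({"a": 10}, mappings)
# -> {"a": 10, "a_plus_1": 11, "a_times_2": 20}
```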
Here are some of the common reasons to use field mappings.
You can map an input field to a specific advanced data type, such as mapping an input string field into a datetime.
Example field mapping expression:
CAST(:last_updated_at as DATETIME)
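As a rough Python analogue of the CAST expression above (assuming, for illustration only, that the incoming strings are ISO-8601 formatted):

```python
from datetime import datetime

# Rough analogue of CAST(:last_updated_at AS DATETIME), assuming the
# input string is ISO-8601 formatted (an assumption for this sketch).
last_updated_at = "2021-03-15T10:30:00"
parsed = datetime.fromisoformat(last_updated_at)
# parsed is now a typed datetime value rather than a plain string
```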
If your input dataset has PII (personally identifiable information) or PHI (protected health information) fields, you can apply a one-way cryptographic hash function such as SHA256() to the input field, so that Rockset only stores the hashed value and not the original PII/PHI field.
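For intuition, here is a Python sketch of what an expression like TO_HEX(SHA256(:name)) stores (assuming a lowercase hex encoding; the exact output format of Rockset's TO_HEX may differ):

```python
import hashlib

def anonymize(value: str) -> str:
    # One-way hash: digest the UTF-8 bytes with SHA-256, then hex-encode
    # the 32-byte digest, so only a 64-character string is stored and the
    # original value cannot be recovered.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()

anonymize("Jane Doe")  # 64 hex characters; deterministic but irreversible
```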
Text fields in the input can be tokenized upon ingest so that you can run search queries (still using SQL).
If your application involves queries with complex SQL functions that make query processing slow, you can use a field mapping to pre-compute the result of the expensive SQL expression at data ingest. Queries on such computed output fields are much faster than executing the SQL expression at query time.
```sql
-- extract the domain from the input email address
REGEXP_EXTRACT_ALL(:email_address, '^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)', 2)
```
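You can sanity-check the capture groups of that pattern with Python's re module (a quick check of the regular expression itself, not part of the ingest pipeline):

```python
import re

# The same pattern as the field mapping expression: group 1 captures the
# local part of the address, group 2 captures the domain.
pattern = r'^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)'
match = re.match(pattern, "jane.doe@example.com")
domain = match.group(2)
# domain == "example.com"
```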
If you have unused large fields in your input data set, you can drop them upon ingest to save cost.
```sql
-- create a geography from a longitude and latitude field
ST_GEOGPOINT(:longitude, :latitude)
```
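The following Python stand-in (illustrative only; the real ST_GEOGPOINT returns a geography value, not a tuple) shows the argument order and coordinate ranges the expression assumes:

```python
def geog_point(longitude: float, latitude: float) -> tuple:
    # Illustrative stand-in for ST_GEOGPOINT(:longitude, :latitude):
    # validate coordinate ranges and pair the values as a point.
    # Note the argument order: longitude first, then latitude.
    if not -180.0 <= longitude <= 180.0:
        raise ValueError("longitude must be in [-180, 180]")
    if not -90.0 <= latitude <= 90.0:
        raise ValueError("latitude must be in [-90, 90]")
    return (longitude, latitude)

geog_point(-122.4, 37.8)  # a (longitude, latitude) point
```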
Field mappings are specified at collection creation time. For example, they can be defined in a YAML file that is passed at collection creation, such as the one below.
```yaml
type: COLLECTION
name: users
field_mappings:
  - name: anonymize_name          # name of the mapping
    input_fields:
      - field_name: 'name'        # SQL qualified name
        # Behavior if field is NULL or UNDEFINED:
        # - skip: skip the update, drop fields are still dropped (default)
        # - pass: pass NULL to the update function
        if_missing: 'SKIP'
        is_drop: true             # drop this field from the doc
        param: 'name'             # exported name/alias for the field; this
                                  # can be referred to in the SQL expression
    output_field:
      field_name: 'name_anon'
      value:
        sql: 'TO_HEX(SHA256(:name))'  # Any SQL expression
      # Error behavior:
      # - skip: Skip this output field (default)
      # - fail: Fail the update
      # Note that fields with "is_drop: true" are always dropped
      # Error behavior must be "fail" for special output fields (_id, _event_time)
      on_error: "FAIL"
```
If you are using SHA256 or any other hashing function, be aware that these functions return bytes rather than string. So, if you need the output field to be a string (for example, if the output field is _id), you can convert the result to a hex string using TO_HEX (see string functions).
To create the collection based on the YAML specification, use this command:
```shell
$ rock create -f mappings.yaml
Collection "users" was created successfully.
```
You can specify field mappings in Python as shown below.
```python
from rockset import Client

rs = Client()

field_mappings = [
    rs.FieldMapping.mapping(
        name="anonymize_name",
        input_fields=[
            rs.FieldMapping.input_field(
                field_name="name",
                if_missing="SKIP",
                is_drop=True,
                param="name",
            )
        ],
        output_field=rs.FieldMapping.output_field(
            field_name="name_anon",
            sql_expression="TO_HEX(SHA256(:name))",
            on_error="FAIL",
        ),
    )
]

fm_collection = rs.Collection.create("users", field_mappings=field_mappings)
```