Data Format > Field Mappings

Field Mappings

This page describes how to create a collection with field mappings.

Overview

A field mapping allows you to specify transformations to be applied on all documents inserted into a collection. Any valid SQL expression can be used to describe the transformation logic in a field mapping definition.

Every ingested document is passed through the set of field mapping transformations specified at collection creation time. Each field mapping takes a set of fields from the incoming document, evaluates some SQL expression with the values of these input fields as named parameters (think Python kwargs), and stores the result of the SQL expression as another field in the document.

Input fields are optional in a field mapping specification. For example, CURRENT_TIMESTAMP() is a valid field mapping SQL expression that can be used to record the insertion time for every input record.

Output field is also optional in a field mapping specification. This is common when the field mapping was created to drop one or more input fields.

The same field name can be specified as both input and output field, if you wish to overwrite the field with the result of the field mapping SQL expression.

Any field may be used as the output of a field mapping. Special fields require the result of the field mapping SQL expression to be of a specific type:

  • _id: The field mapping SQL expression must return a string that is used as a primary key (so must be unique across all documents in a collection).
  • _event_time: The field mapping SQL expression must return a timestamp.

All field mappings specified will be executed independent of each other and in parallel. So, one field mapping specification should not depend on the output of another field mapping.

Use Cases

Here are some of the commmon reasons to use field mappings.

Type Coercion

You can map an input field to a specific advanced data type, such as mapping an input string field into a date, datetime or timestamp field.

Example field mapping expression:

CAST(:last_updated_at as DATETIME)

PII/PHI Masking

If your input dataset has PII (personally identifiable information) or PHI (protected health information) fields, you can use a one-way crypto hash function such as SHA256() on the input field, so that Rockset only stores the hashed value and not the original PII/PHI field.

Example:

TO_HEX(SHA256(:email_address))

Search Tokenization

Text fields in the input can be tokenized upon ingest so that you can run search queries (still using SQL).

Example:

TOKENIZE(:tweet_text, 'en_US')

Functional Index

If your application involves queries with complex SQL functions that makes query processing slow, you could use a field mapping to pre-compute the result of the expensive SQL expression at data ingest. Queries on such computed output fields would be much faster than executing the SQL expression at query time.

Example:

-- extract the domain from the input email address
REGEXP_EXTRACT_ALL(
    :email_address,
    '^([a-zA-Z0-9_.+-]+)@([a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)',
    2)

Dropping Fields

If you have unused large fields in your input data set, you can drop them upon ingest to save cost.

Using Console

The field mappings can be specified at collection creation time. See example below: Rockset Create Collection with Field Masking

Using CLI

The field mappings can also be defined in a YAML file specified at collection creation time, such as the one below.

type: COLLECTION
name: users
field_mappings:
- name: anonymize_name        # name of the mapping
  input_fields:
  - field_name: 'name'        # SQL qualified name

    # Behaviour if field is NULL or missing:
    # - skip: skip the update, drop fields are still dropped (default)
    # - pass: pass NULL to the update function
    if_missing: 'SKIP'
    is_drop: true             # drop this field from the doc
    param: 'name'             # exported name/alias for the field. This can be referred to in the SQL expression
  output_field:
    field_name: 'name_anon'
    value:
      sql: 'TO_HEX(SHA256(:name))'    # Any SQL expression

    # Error behavior:
    # - skip: Skip this output field (default)
    # - fail: Fail the update
    # Note that fields with "is_drop: true" are always dropped
    # Error behavior must be "fail" for special output fields (_id, _event_time)
    on_error: "FAIL"

If you are using SHA256 or any other hashing function, be aware that these functions return bytes rather than string, so, if you need the output field to be string (such as if the output field is _id), you may convert to a hex string using TO_HEX (see string functions).

To create the collection based on the YAML specification, use this command:

$ rock create -f mappings.yaml

Collection "users" was created successfully.

Using Python

You can specify field mappings in Python as shown below.

from rockset import Client
rs = Client()

field_mappings = [
    rs.FieldMapping.mapping(
            name="anonymize_name",
            input_fields=[
            rs.FieldMapping.input_field(
                field_name="name",
                if_missing="SKIP",
                is_drop=True,
                param="name"
            )
    ],
    output_field=rs.FieldMapping.output_field(
        field_name="name_anon",
        sql_expression="TO_HEX(SHA256(:name))",
        on_error="FAIL"
        )
    )
]

fm_collection=rs.Collection.create("users")