XML Data

This page describes how to configure collections to parse and ingest XML files.

Specifying XML Parameters

Using CLI

When using the rock CLI, the configuration for a new collection can be specified in a YAML file. For XML data, the YAML file should look like:

type: COLLECTION
name: 'my-xml-collection'
sources:
- integration_name: 'my-integration'
  s3:
    bucket: 'xml-bucket'
  format_params:
    xml:

      # Outermost tag within an XML file to be treated as the root.
      # Any content outside the root tag is ignored.
      # Default: None
      root_tag: 'root'

      # Each Rockset document is the content between <doc_tag> and </doc_tag>.
      # Default: None
      doc_tag: 'doc'

      # Attributes are transformed into key-value pairs in a Rockset document. 
      # This prefix is used to tell attributes apart from nested tags in a Rockset document.
      # Default: None
      attribute_prefix: '_'

      # Key used for the text content of leaf tags that have one or more attributes.
      # See the "XML with Attribute" example below.
      # Default: 'value'
      value_tag: 'my_value'

      # Supported encodings: UTF-8, UTF-16, and ISO 8859-1.
      # Default: UTF-8
      encoding: 'UTF-8'

The YAML file’s name can then be passed in when creating the collection, as shown below.

$ rock create -f xml-collection.yaml
Collection "my-xml-collection" was created successfully.

Using Console

In the console, a source’s format may be specified as XML when creating a collection. Further options may be specified in the same form:

[Figure: XML Parameters]

Examples

Basic XML Document

A basic XML file, its collection configuration, and the resulting Rockset document are shown below.

<doc attr="attr1">
  <one>val1</one>
  <two>val2</two>
</doc>

type: COLLECTION
name: 'my-xml-collection'
sources:
- integration_name: 'my-integration'
  s3:
    bucket: 'xml-bucket'
  format_params:
    xml:
      doc_tag: 'doc'

{
  "attr": "attr1",
  "one": "val1",
  "two": "val2"
}

XML with Attribute

If an XML leaf tag has attributes, the tag becomes an object in the Rockset document: each attribute name is prefixed with attribute_prefix, and the tag's contents are stored under the key given by value_tag. An example XML file, collection configuration, and resulting Rockset document are shown below.

<doc>
  <one attr="attr1">val1</one>
  <two>val2</two>
</doc>

type: COLLECTION
name: 'my-xml-collection'
sources:
- integration_name: 'my-integration'
  s3:
    bucket: 'xml-bucket'
  format_params:
    xml:
      doc_tag: 'doc'
      attribute_prefix: '_'
      value_tag: 'my_value'

{
  "one": { 
    "_attr": "attr1",
    "my_value": "val1"
  },
  "two": "val2"
}
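
To make the attribute handling concrete, here is a minimal Python sketch that reproduces the mapping shown in the two examples above. It is not Rockset's implementation: the function name element_to_value is hypothetical, the empty default for attribute_prefix stands in for "Default: None", and repeated child tags are not handled here (a later tag with the same name would simply overwrite the earlier one).

```python
import xml.etree.ElementTree as ET


def element_to_value(elem, attribute_prefix="", value_tag="value"):
    """Convert one XML element into a JSON-like value (illustrative only)."""
    children = list(elem)
    text = (elem.text or "").strip()

    # A leaf tag without attributes collapses to its text content.
    if not children and not elem.attrib:
        return text

    obj = {}
    # Attributes become key-value pairs, with the prefix prepended to each name.
    for name, val in elem.attrib.items():
        obj[attribute_prefix + name] = val

    # A leaf tag that has attributes stores its text under value_tag.
    if not children:
        obj[value_tag] = text
        return obj

    # Nested tags become keys of the enclosing object.
    # Simplification: a repeated tag name overwrites the previous entry.
    for child in children:
        obj[child.tag] = element_to_value(child, attribute_prefix, value_tag)
    return obj


xml_text = '<doc><one attr="attr1">val1</one><two>val2</two></doc>'
doc = element_to_value(
    ET.fromstring(xml_text), attribute_prefix="_", value_tag="my_value"
)
print(doc)
```

Calling the same function on the "Basic XML Document" input with the default arguments yields a flat object, because the attribute sits on the doc tag itself and no prefix is applied.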

Root and Doc Tags

The root tag specifies where to start parsing documents. If no root_tag is specified, Rockset begins ingesting at every root tag it finds.

The doc tag specifies what XML level corresponds to a document in Rockset. If no doc_tag is specified, the first tag encountered after the root_tag will be used as the doc_tag.

The example below demonstrates the behavior when neither root_tag nor doc_tag is specified. Note that the repeated <one> tags are combined into an array.

<doc>
  <one>first</one>
  <one>second</one>
</doc>

type: COLLECTION
name: 'my-xml-collection'
sources:
- integration_name: 'my-integration'
  s3:
    bucket: 'xml-bucket'
  format_params:
    xml:
      root_tag: ''
      doc_tag: ''

{
  "one": ["first", "second"] 
}
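
The following Python sketch illustrates one way to read the root and doc tag rules, under stated assumptions; it is not Rockset's parser. The names children_to_object and extract_documents are hypothetical. The sketch assumes that when neither tag is set, the file's top-level element is treated as a single document (as in the example above), and it ignores attributes to keep the focus on document boundaries and repeated tags.

```python
import xml.etree.ElementTree as ET


def children_to_object(elem):
    """Map child tags to values, collecting repeated tags into a list."""
    obj = {}
    for child in elem:
        # Recurse into nested tags; leaf tags contribute their text.
        value = children_to_object(child) if len(child) else (child.text or "").strip()
        if child.tag in obj:
            # A second occurrence of a tag turns the entry into a list.
            if not isinstance(obj[child.tag], list):
                obj[child.tag] = [obj[child.tag]]
            obj[child.tag].append(value)
        else:
            obj[child.tag] = value
    return obj


def extract_documents(xml_text, root_tag="", doc_tag=""):
    """Return one object per document found in xml_text (illustrative only)."""
    top = ET.fromstring(xml_text)
    root = top
    if root_tag and top.tag != root_tag:
        # Content outside the root tag is ignored.
        root = top.find(f".//{root_tag}")
        if root is None:
            return []
    if not doc_tag:
        if not root_tag:
            # Assumption based on the example above: with neither tag set,
            # the top-level element is the single document.
            return [children_to_object(root)]
        # No doc_tag: use the first tag encountered after the root tag.
        first = next(iter(root), None)
        if first is None:
            return []
        doc_tag = first.tag
    return [children_to_object(d) for d in root.iter(doc_tag)]


print(extract_documents("<doc><one>first</one><one>second</one></doc>"))

feed = "<root><doc><one>a</one></doc><doc><one>b</one></doc></root>"
print(extract_documents(feed, root_tag="root"))
```

The second call shows the doc_tag fallback: only root_tag is given, so the first tag inside it (doc) marks the document boundary and the file yields two documents.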