CSV Data

This page describes how to create collections from CSV files.

Rockset can parse raw CSV data.

Using Console

In this section we will create a collection from a dataset hosted on AWS S3. Click on Create Collection in the Overview tab to begin.

Choose an appropriate name and an optional description, then select Amazon S3 as the source from the Add Source dropdown. Provide the S3 bucket name and prefix (if any), then select the integration from the Integration Name dropdown, or choose None if the bucket is public.

Select ‘CSV’ from the Format dropdown. This reveals additional CSV-specific options. Configure them as follows:

  • Header

    • First line of file as column names - Select this option if the first line of the CSV source contains column names.
    • Specify Columns manually - Select this option to provide a custom name and data type for each column in the CSV data source.
    • Generate column names automatically - Rockset generates unique column names (c1, c2, …) for the CSV data source.

  • Separator - The character that separates columns in the CSV data source (default: comma).

  • Encoding - The character encoding of the source. Supported encodings are UTF-8, UTF-16, and ISO-8859-1.

  • Quote Character - A single character used to quote fields that contain special characters, such as the separator, the quote character itself, or newline characters (default: ").
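
To make the Header, Separator, and Quote Character options concrete, here is a minimal sketch that writes a small sample file and shows why quoting matters. The file name sample.csv and its contents are made up for illustration; they are not part of any Rockset dataset.

```shell
# Hypothetical sample file: a header row, comma separator, and a
# double-quoted field that itself contains the separator.
cat > sample.csv <<'EOF'
id,name,active
1,"Doe, Jane",true
2,"Smith, Sam",false
EOF

# The first line holds the column names ("First line of file as column names").
head -n 1 sample.csv

# A naive split on commas miscounts the fields in the first data row,
# because it ignores the quote character: it reports 4 instead of 3.
awk -F, 'NR == 2 { print NF }' sample.csv
```

A quote-aware parser reads the quoted name as a single field, so each data row has the same three columns as the header.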

Click Create at the top right to create the collection. The new collection appears in the Created state; it can take up to a minute for it to become Ready.

Using CLI

CSV CLI options:

  • --csv-separator - column separator for CSV files (default: ",")
  • --csv-encoding - encoding; one of "UTF-8", "UTF-16", "ISO-8859-1"
  • --csv-first-line-as-column-names - set to true if the first line of the CSV file contains column names (default: "false")
  • --csv-column-names - a comma-separated list of column names (requires a type to be specified for each column)
  • --csv-column-types - a comma-separated list of column types (one-to-one mapping to the names specified in --csv-column-names)
  • --csv-schema-file - a YAML file that specifies the schema of the data in each column. If this option is specified, --csv-column-names and --csv-column-types are ignored

If --csv-first-line-as-column-names is false and --csv-column-names is not specified, Rockset generates column names automatically.
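
Because the column names and column types must map one-to-one, it can help to check the two lists before creating the collection. The following is a dry-run sketch only: the column names (id, name, active) are hypothetical, the type names are assumed to be the same ones used in YAML schema files (INTEGER, STRING, BOOLEAN), and the script prints the command rather than running it.

```shell
# Hypothetical column names, with types assumed to match the YAML schema type names.
names='id,name,active'
types='INTEGER,STRING,BOOLEAN'

# Count the comma-separated entries in each list.
n_names=$(printf '%s\n' "$names" | awk -F, '{ print NF }')
n_types=$(printf '%s\n' "$types" | awk -F, '{ print NF }')

if [ "$n_names" -eq "$n_types" ]; then
    # Assemble the command as text so it can be reviewed before it is run.
    cmd="rock create collection my-first-csv-collection s3://csv-bucket --format CSV --csv-column-names $names --csv-column-types $types --integration my-first-integration"
    echo "$cmd"
else
    echo "column names ($n_names) and column types ($n_types) differ in count" >&2
fi
```

Once the printed command looks right, it can be pasted into the terminal and run as is.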

Use the command below to create a CSV collection from an Amazon S3 bucket using the default CSV settings:

$ rock create collection my-first-csv-collection \
    s3://csv-bucket --format CSV \
    --integration=my-first-integration

Example that creates a CSV collection with an explicit comma separator and UTF-8 encoding:

$ rock create collection my-first-csv-collection \
    s3://csv-bucket --format CSV \
    --csv-separator ',' \
    --csv-encoding UTF-8 \
    --integration my-first-integration

Example that creates a CSV collection with a tab separator (to type a literal tab in the CLI, press Control-V and then Tab; in bash or zsh, $'\t' also expands to a tab):

$ rock create collection my-first-csv-collection \
    s3://csv-bucket --format CSV \
    --csv-separator $'\t' \
    --integration my-first-integration

Example that creates a CSV collection with a schema specified in a YAML file:

Sample YAML file contents (csv-schema.yaml):

fields:
    - col1: 'INTEGER'
    - col2: 'STRING'
    - col3: 'BOOLEAN'
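
As a convenience, the schema file above can be written straight from the shell with a here-document. This is a minimal sketch; the file name csv-schema.yaml is the one passed to --csv-schema-file, and the grep check is only a quick sanity test, not something the CLI requires.

```shell
# Write the sample schema to csv-schema.yaml in the current directory.
cat > csv-schema.yaml <<'EOF'
fields:
    - col1: 'INTEGER'
    - col2: 'STRING'
    - col3: 'BOOLEAN'
EOF

# Count the declared columns as a quick check before passing the file to the CLI.
grep -c '^ *- ' csv-schema.yaml
```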

CLI command:

$ rock create collection my-first-csv-collection \
    s3://csv-bucket --format CSV \
    --csv-schema-file csv-schema.yaml \
    --integration=my-first-integration