PDF Data

This page describes how to create collections from .pdf files.

Document Format

When a PDF file is ingested into a Rockset collection, Rockset parses both metadata and text from the PDF. Important fields include:

  • text - the text content of the PDF (typically where most of the information in a PDF resides)
  • _meta - a metadata object describing the PDF upload to Rockset

For example, a collection that ingested data from PDF files could have the schema shown below.

> DESCRIBE pdf_data

+--------------------------------------------+---------------+---------+-----------+
| field                                      | occurrences   | total   | type      |
|--------------------------------------------+---------------+---------+-----------|
| ['Author']                                 | 9             | 9       | string    |
| ['CreationDate']                           | 9             | 9       | string    |
| ['Creator']                                | 9             | 9       | string    |
| ['ModDate']                                | 9             | 9       | string    |
| ['Producer']                               | 9             | 9       | string    |
| ['Subject']                                | 9             | 9       | string    |
| ['Title']                                  | 9             | 9       | string    |
| ['_event_time']                            | 9             | 9       | timestamp |
| ['_id']                                    | 9             | 9       | string    |
| ['_meta']                                  | 9             | 9       | object    |
| ['_meta', 'file_upload']                   | 9             | 9       | object    |
| ['_meta', 'file_upload', 'file']           | 9             | 9       | string    |
| ['_meta', 'file_upload', 'file_upload_id'] | 9             | 9       | string    |
| ['_meta', 'file_upload', 'upload_time']    | 9             | 9       | string    |
| ['author']                                 | 9             | 9       | string    |
| ['creation_date']                          | 9             | 9       | int       |
| ['creator']                                | 9             | 9       | string    |
| ['modification_date']                      | 9             | 9       | int       |
| ['producer']                               | 9             | 9       | string    |
| ['subject']                                | 9             | 9       | string    |
| ['text']                                   | 9             | 9       | string    |
| ['title']                                  | 9             | 9       | string    |
+--------------------------------------------+---------------+---------+-----------+

Note that most of the data in a PDF is stored in the text field, which may look like this:

+--------------------------------------------------------------+
| text                                                         |
|--------------------------------------------------------------|
| ....                                                         |
| ....                                                         |
| Statement Date: 10/11/2018                                   |
| Your Account Summary                                         |
| ....                                                         |
| Total Amount Due:                                            |
| $157.57                                                      |
| Amount Enclosed:                                             |
| ...                                                          |
+--------------------------------------------------------------+
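Because the text field is one unstructured string, values such as the statement date or amount due usually have to be pulled out with pattern matching after you query the document. The following is a minimal Python sketch of that idea, assuming the text resembles the excerpt above. The function names and the sample string are hypothetical and are not part of Rockset; real PDFs will need patterns adapted to their own layout.

```python
import re
from typing import Optional

# Hypothetical sample modeled on the excerpt above (elided lines omitted).
SAMPLE_TEXT = """Statement Date: 10/11/2018
Your Account Summary
Total Amount Due:
$157.57
Amount Enclosed:
"""


def extract_statement_date(text: str) -> Optional[str]:
    """Return the date that follows 'Statement Date:', or None if absent."""
    match = re.search(r"Statement Date:\s*(\d{1,2}/\d{1,2}/\d{4})", text)
    return match.group(1) if match else None


def extract_total_due(text: str) -> Optional[float]:
    """Return the dollar amount that follows 'Total Amount Due:', or None."""
    # \s* also matches the line break between the label and the amount.
    match = re.search(r"Total Amount Due:\s*\$([\d,]+\.\d{2})", text)
    if match is None:
        return None
    return float(match.group(1).replace(",", ""))


if __name__ == "__main__":
    print(extract_statement_date(SAMPLE_TEXT))
    print(extract_total_due(SAMPLE_TEXT))
```

The date is returned as a string because the excerpt alone does not say whether 10/11 is month-first or day-first.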

Creating a collection

Using CLI

$ rock create collection pdf_data

Collection "pdf_data" was created successfully.

Using Console

To configure a collection to ingest PDF data, choose “File Upload” as the data source and select the “PDF” option as the format.

[Screenshot: Console Create PDF]

Uploading to an existing collection

You can upload a PDF file into an existing collection using either the command line or the console.

Using CLI

$ rock upload pdf_data SampleBill.pdf
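
To upload many PDFs, one option is to script the same command once per file. The sketch below is an illustration only: it assumes the rock CLI is installed and on your PATH, and that it accepts the "rock upload <collection> <file>" form shown above. The helper names build_upload_commands and upload_all are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_upload_commands(collection: str, directory: str) -> List[List[str]]:
    """Build one 'rock upload <collection> <file>' command per PDF in directory."""
    pdfs = sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    return [["rock", "upload", collection, str(p)] for p in pdfs]


def upload_all(collection: str, directory: str) -> None:
    """Run each upload command in turn, stopping at the first failure."""
    for cmd in build_upload_commands(collection, directory):
        # Assumption: the rock CLI exits with a non-zero status on error.
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands that would run for PDFs in the current directory.
    for cmd in build_upload_commands("pdf_data", "."):
        print(" ".join(cmd))
```

Building the command list separately from running it lets you review exactly what will be uploaded before calling upload_all.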

Using Console

[Screenshot: Console Upload PDF]