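This page covers how to use a Google Cloud Storage (GCS) bucket as a data source in Rockset. This includes creating a Google Cloud Storage integration that gives Rockset access to your buckets, and creating a collection that syncs data from them.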
For the following steps, you must have access to a Google Cloud account and be able to manage Google Cloud Service Accounts and Roles. If you do not have access, please invite your GCP account administrator to Rockset.
These instructions explain how to set up a Google Cloud Storage integration using a GCP Service Account. An integration can provide access to one or more GCS buckets within your GCP account. You can use an integration to create collections that sync data from your GCS buckets.
To access your GCP resources, Rockset uses a GCP Service Account with permissioned access to your desired GCS buckets. You can either use an existing service account or create a new one for Rockset to use. Once you complete these steps, you can use the JSON key associated with the service account to create the Rockset integration in the Rockset console.
If you don’t have an existing service account or want to use a new service account, you will need to navigate to the “IAM & Admin” section in the Google Cloud Console sidebar, and then select the “Service Accounts” tab within that section.
From there, you can create a new service account by selecting the “Create Service Account” button at the top and then following the instructions on the page to complete its creation.
For more details, see the GCP documentation on creating and managing service accounts.
On the service accounts home page in the Google Cloud Console, select your desired service account (if you just created a new service account for Rockset above, select your newly created account) to view its details. Under the “Keys” section, select "Add Key", and then "Create New Key". Select “JSON” for the key type and then click "Create".
Once the key is created successfully, it should trigger an automatic download with your key’s associated JSON. This JSON will be required to create the GCP integration within Rockset Console.
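Before pasting the key into the Rockset Console, it can help to confirm that the downloaded file is a well-formed service account key. The following is a minimal sketch, assuming the usual key file fields (`type`, `project_id`, `private_key`, `client_email`). The function name `check_service_account_key` and the sample values are illustrative only and are not part of Rockset or GCP tooling.

```python
import json

# Fields that a GCP service account JSON key file is expected to contain.
REQUIRED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_service_account_key(key_json: str) -> str:
    """Parse a service account key and return its client_email.

    Raises ValueError if the text is not a usable service account key.
    """
    try:
        key = json.loads(key_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"key is not valid JSON: {exc}") from exc
    if not isinstance(key, dict):
        raise ValueError("key must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if not key.get(field)]
    if missing:
        raise ValueError(f"key is missing fields: {', '.join(missing)}")
    if key["type"] != "service_account":
        raise ValueError(f"unexpected key type: {key['type']!r}")
    return key["client_email"]


# Illustrative placeholder values, not a real key.
sample_key = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "rockset-reader@my-project.iam.gserviceaccount.com",
})
print(check_service_account_key(sample_key))
```

The returned `client_email` is the full service account email that you enter when granting bucket permissions.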
In order to access Google Cloud Storage buckets, you must provide roles to the service account that allow access to specific buckets. To do so, you will need to navigate to the “Storage” section in the Google Cloud Console sidebar, and then select the “Browser” tab within that section.
Find the GCS bucket that you would like your Rockset collection to sync data from, and then click the three dots on the right-hand side to select "Edit Bucket Permissions".
From here, select the “Add Member” button to give the service account the appropriate permissions. When adding the service account as a new member, be sure to input the full email (e.g. email@example.com) of the account.
For a set of standard roles, you can refer to the GCP IAM permissions documentation. For example, you can use the Storage Object Viewer role, which, when granted at the project level, gives read access to all GCS buckets in that project.
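Adding the service account as a member with a role corresponds to adding a binding to an IAM policy. The sketch below models that binding as plain data, assuming the role ID `roles/storage.objectViewer` for Storage Object Viewer and the `serviceAccount:` member prefix used in IAM policies. The helper `add_binding` and the email address are hypothetical, and the code does not call any Google API.

```python
from copy import deepcopy


def add_binding(policy: dict, role: str, service_account_email: str) -> dict:
    """Return a copy of an IAM policy with the service account granted `role`."""
    member = f"serviceAccount:{service_account_email}"
    updated = deepcopy(policy)
    bindings = updated.setdefault("bindings", [])
    for binding in bindings:
        if binding.get("role") == role:
            # Role already present: add the member only if it is not listed yet.
            if member not in binding.setdefault("members", []):
                binding["members"].append(member)
            return updated
    # Role not present yet: create a new binding for it.
    bindings.append({"role": role, "members": [member]})
    return updated


# Hypothetical starting policy and service account email.
policy = {"bindings": []}
updated = add_binding(
    policy,
    "roles/storage.objectViewer",
    "rockset-reader@my-project.iam.gserviceaccount.com",
)
print(updated)
```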
You can also configure individual buckets to be accessible by the service account you created. The permissions that Rockset needs are:
storage.objects.get - Required to retrieve an object from Google Cloud Storage.
storage.objects.list - Required to list objects within a given bucket in Google Cloud Storage.
You can attach a role that provides these permissions to the service account that you created, or you can grant it on a specific bucket only.
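If you use a custom role rather than a standard one, a quick check that it covers both permissions can catch setup mistakes early. This is a minimal sketch: the list of granted permissions is supplied by hand here, and in practice it might come from the role definition or from the GCS `testIamPermissions` API. The function name `missing_permissions` is illustrative.

```python
# The two permissions this page lists as required by Rockset.
REQUIRED_PERMISSIONS = frozenset({"storage.objects.get", "storage.objects.list"})


def missing_permissions(granted):
    """Return the required permissions absent from `granted`, sorted by name."""
    return sorted(REQUIRED_PERMISSIONS - set(granted))


# Hypothetical custom role that can read objects but cannot list them.
custom_role_permissions = ["storage.objects.get"]
print(missing_permissions(custom_role_permissions))
```

An empty result means the role grants everything listed above.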
Once you have set up an integration, you can go on to create a Google Cloud Storage sourced collection. When you are creating a collection, you can choose which paths you want to include in your collection by adding multiple sources with distinct path names.
In the Rockset Console, you can create a collection from Workspace > Collections > Create Collection > Add Source > Google Cloud Storage.
Using the CLI, you can run the following:
$ rock create collection my-gcs-collection \
    gs://my-bucket/my-path-1 \
    --integration=my-gcp-integration
Collection "my-gcs-collection" was created successfully.
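When a collection uses several sources with distinct paths, it can be useful to split each gs:// path into its bucket and path prefix and to reject duplicates before creating the collection. The following is a sketch under stated assumptions: the helper names and the `bucket`/`prefix` dictionary keys are illustrative and are not presented as the Rockset API schema.

```python
from urllib.parse import urlparse


def parse_gcs_path(uri: str):
    """Split a gs://bucket/path URI into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "gs" or not parsed.netloc:
        raise ValueError(f"not a gs:// URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def build_sources(uris):
    """Turn gs:// URIs into a list of source dicts, rejecting duplicate paths."""
    seen = set()
    sources = []
    for uri in uris:
        bucket, prefix = parse_gcs_path(uri)
        if (bucket, prefix) in seen:
            raise ValueError(f"duplicate source path: {uri!r}")
        seen.add((bucket, prefix))
        # Illustrative structure only; key names are assumptions.
        sources.append({"bucket": bucket, "prefix": prefix})
    return sources


print(build_sources(["gs://my-bucket/my-path-1", "gs://my-bucket/my-path-2"]))
```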
Note that any of the above operations can also be performed using Rockset Client libraries or REST APIs.