Data Sources > Apache Kafka

Apache Kafka

Introduction

This page will show you how to connect Kafka to Rockset and send data into a Rockset collection that can then be queried using SQL.

How it works

You can use Kafka Connect to send data from one or more Kafka topics into Rocket collections. Kafka Connect is a separate component from your Kafka brokers. Typically, it runs on a separate set of nodes. Kafka connect is the recommended mechanism for reading and writing to/from Kafka topics in a reliable, fault tolerant and scalable manner.

A Kafka connect installation can be configured with different type of connector plugins:

  • Source connector plugin for writing data from an external data source into a Kafka topic
  • Sink connector plugin for writing from a Kafka topic into an external data source

Rockset has a sink connector plugin that can be installed into a Kafka connect cluster and can send JSON and Avro data from Kafka topics to Rockset.

How it works

Set up Kafka Connect

As described above, in order to connect Kafka topics to Rockset, you must have a Kafka connect installation that is connected to your Kafka broker cluster.

  • If you are using Confluent Cloud or Confluent Platform, it may already come with a Kafka Connect installation. In that case, you can skip to the section that describes installing the Rockset sink connector plugin below.
  • If you are using Apache Kafka or Amazon MSK, you must set up and operate your own Kafka Connect installation.

Prototype (Standalone Mode)

To quickly connect an existing Kafka cluster to Rockset, you can run a Kafka connect process in standalone mode. Note that this is good for quickly getting started but not recommended to run in production. You can run the following steps on your local machine, in docker, or a cloud VM instance.

  1. Download the latest version of Apache Kafka binary distribution and extract it to your local directory.
  2. Download the Rockset Kafka Connect plugin.

Finally, you can edit the file $KAFKA_HOME/config/connect-standalone.properties and modify the following properties.

bootstrap.servers=broker1,broker2,broker3
plugin.path=/path/to/kafka-connect-rockset-VERSION-jar-with-dependencies.jar

We are now ready to run the standalone Kafka Connect. You can generate configuration for forwarding specific topics from Rockset into Kafka by setting up a Kafka integration.

Once you have this configuration, you can create a file containing that configuration in $KAFKA_HOME/config/connect-rockset-sink.properties and run the Kafka connect standalone cluster as follows:

$KAFKA_HOME/bin/connect-standalone.sh $KAFKA_HOME/config/connect-standalone.properties $KAFKA_HOME/config/connect-rockset-sink.properties

Production (Distributed Mode)

If you want to run Kafka connect in production, you must run it in distributed mode. This may need setting up separate cloud VM instances, or docker containers that all work together to ensure data is sent from Kafka topics into Rockset in a fault tolerant and scalable manner. You will need to separately download the Rockset Kafka Connect plugin into each node of the Kafka connect distributed cluster.

  • If you are using the official docker images, you can install the above connector into it as described here.
  • If you are using VMs and setting up Kafka Connect from the official binaries on them, you can follow the instructions here.

The configuration of the Rockset Kafka Connect plugin can be done using the REST endpoint. The cURL command to do this can be obtained at the end of setting up a Kafka integration

Download the Rockset Kafka Connect plugin

Navigate to the release page of the plugin’s GitHub repository. Download kafka-connect-rockset-VERSION-jar-with-dependencies.jar, where VERSION is the current latest version. Note that only version 1.1.0 or later is supported.

Set up a Kafka integration

Create a new Kafka integration using the Rockset console by navigating to Integrations > Add Integration > Apache Kafka. You can then name your integration and configure your data format and topics. The only currently supported data formats are JSON and Avro.

Integration details page

After saving the integration, you will see configuration for the Rockset sink plugin that can be used to configure a Kafka Connect cluster in standalone or distributed mode.

Create a Collection

Create a new collection using the Rockset console by navigating to Collections > Create Collection > Apache Kafka on the Rockset Console. In order to create a collection, you must pick an integration that you have created before. In the collection creation, you can choose one or more Kafka topics that you want to sink into the collection.

Configuration Reference

FieldDescription
nameSpecifies the name of the plugin currently running. We use the name of the integration for this, but that is not required.
connector.classUsed by the Kafka Connect framework to instantiate our plugin correctly.
task.maxThe maximum number of sink tasks that can be run concurrently. This is used by the Kafka Connect framework to tune the performance of our plugin.
topicsThe topics we want the sink plugin to consume documents from.
rockset.task.threadsThe number of threads the plugin runs per task.
rockset.apiserver.urlThe URL of the API server the plugin will write the documents to.
rockset.integration.keyUsed by our internal services to make sure your data is tied to the correct integration and thus lands in the correct collections.
formatSpecifies the format our plugin expects your data to conform to.
key.converterConverter class for key data from Kafka Connect. This controls the converter we use to deserialize data from Kafka. The converter is different depending on whether JSON or Avro is specified as the data format.
value.converterConverter class for value data from Kafka Connect. This controls the converter we use to deserialize data from Kafka. The converter is different depending on whether JSON or Avro is specified as the data format.
key.converter.schemas.enable (JSON only)Whether the JSON data in Kafka has a schema associated with it. This is false in our case, as Rockset does not require a fixed schema for your data.
value.converter.schemas.enable (JSON only)Whether the JSON data in Kafka has a schema associated with it. This is false in our case, as Rockset does not require a fixed schema for your data.
key.converter.schema.registry.url (Avro only)The schema registry instances that will be used to look up schemas for key data from Kafka Connect. For more information, see the Confluent documentation.
value.converter.schema.registry.url (Avro only)The schema registry instances that will be used to look up schemas for value data from Kafka Connect. For more information, see the Confluent documentation.