Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions TOC-tidb-cloud-lake.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,13 +32,15 @@
- [MySQL - Credentials](/tidb-cloud-lake/guides/mysql-credentials.md)
- [PostgreSQL - Credentials](/tidb-cloud-lake/guides/postgresql-credentials.md)
- [FeiShuBot](/tidb-cloud-lake/guides/feishubot.md)
- [Kafka - Credentials](/tidb-cloud-lake/guides/kafka-credentials.md) ![BETA](/media/tidb-cloud/blank_transparent_placeholder.png)
- Integration Tasks
- [Overview](/tidb-cloud-lake/guides/integration-tasks.md)
- [Task Management](/tidb-cloud-lake/guides/task-management.md)
- [Amazon S3 Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-s3.md)
- [Amazon SQS (S3) Integration Task](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md) ![BETA](/media/tidb-cloud/blank_transparent_placeholder.png)
- [MySQL Integration Task](/tidb-cloud-lake/guides/integrate-with-mysql.md)
- [PostgreSQL Integration Task](/tidb-cloud-lake/guides/integrate-with-postgresql.md)
- [Kafka Consumer Integration Task](/tidb-cloud-lake/guides/integrate-with-kafka.md) ![BETA](/media/tidb-cloud/blank_transparent_placeholder.png)
- Connect
- [Overview](/tidb-cloud-lake/guides/connection-overview.md)
- SQL Clients
Expand Down
11 changes: 6 additions & 5 deletions tidb-cloud-lake/guides/data-integration-overview.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,16 +5,16 @@ summary: The Data Integration feature in {{{ .lake }}} provides a visual, no-cod

# Data Integration Overview

The Data Integration feature in {{{ .lake }}} provides a visual, no-code interface for importing or synchronizing data from external systems into {{{ .lake }}}. The feature centers around two key concepts: **data sources** and **integration tasks**.
The Data Integration feature in {{{ .lake }}} provides a visual, no-code interface for importing, synchronizing, or consuming data from external systems into {{{ .lake }}}. The feature centers around two key concepts: **data sources** and **integration tasks**.

## Key Concepts

| Concept | Description |
|---------|-------------|
| [Data Sources](/tidb-cloud-lake/guides/data-sources.md) | Reusable connection settings or credentials used to access external systems or send notifications, such as AWS Access Key / Secret Key, MySQL hostname / username / password, SQS (S3) queue URL, or a FeiShu bot webhook. |
| [Integration Tasks](/tidb-cloud-lake/guides/integration-tasks.md) | Executable tasks that define where data comes from, which {{{ .lake }}} table it is written to, which runtime parameters are used, and how the task is started and monitored. |
| [Data Sources](/tidb-cloud-lake/guides/data-sources.md) | Reusable connection settings or credentials used to access external systems or send notifications, such as AWS Access Key / Secret Key, MySQL hostname / username / password, SQS (S3) queue URL, Kafka broker addresses, or a FeiShu bot webhook. |
| [Integration Tasks](/tidb-cloud-lake/guides/integration-tasks.md) | Executable tasks that define where data comes from, where the task writes data or how it saves results, which runtime parameters it uses, and how you start and monitor the task. |

Data sources do not move data by themselves. They only store the information required to access external systems. Integration tasks are the units that actually perform imports, snapshots, and continuous synchronization.
Data sources do not move data by themselves. They only store the information required to access external systems. Integration tasks are the units that actually perform imports, snapshots, continuous synchronization, or message consumption.

<!-- Will add back this note after the service hosting pricing is finalized and published.

Expand All @@ -34,12 +34,13 @@ Not every data source corresponds to an ingestion task. For example, `FeiShuBot`
| [Amazon SQS (S3) (Beta)](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md) | Consumes S3 object creation events from an SQS queue and writes the corresponding object data into {{{ .lake }}}. |
| [MySQL](/tidb-cloud-lake/guides/integrate-with-mysql.md) | Synchronizes table data from MySQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC` modes. |
| [PostgreSQL](/tidb-cloud-lake/guides/integrate-with-postgresql.md) | Synchronizes table data from PostgreSQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC` modes. |
| [Kafka Consumer Integration Task (Beta)](/tidb-cloud-lake/guides/integrate-with-kafka.md) | Continuously consumes messages from Kafka topics and saves the message content to internal object storage. |

## Recommended Flow

1. Create and test reusable connection settings on the [Data Sources](/tidb-cloud-lake/guides/data-sources.md) page.
2. Review supported task types and their use cases on the [Integration Tasks](/tidb-cloud-lake/guides/integration-tasks.md) page.
3. Read the task-specific guide to configure the source, preview the data, and set the target table.
3. Read the task-specific guide to configure the source, preview the data, and configure the result location or result viewing method.
4. Use the [Task Management](/tidb-cloud-lake/guides/task-management.md) page to start tasks, check status, and troubleshoot execution issues.

## Video Tour
Expand Down
3 changes: 2 additions & 1 deletion tidb-cloud-lake/guides/data-sources.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,8 +18,9 @@ Data sources do not execute synchronization by themselves. Their role is to cent
| [MySQL - Credentials](/tidb-cloud-lake/guides/mysql-credentials.md) | Stores the host, port, username, password, and database information required to access MySQL. These settings can be reused across multiple MySQL sync tasks. |
| [PostgreSQL - Credentials](/tidb-cloud-lake/guides/postgresql-credentials.md) | Stores the host, port, username, password, and database information required to access PostgreSQL. These settings can be reused across multiple PostgreSQL sync tasks. |
| [FeiShuBot](/tidb-cloud-lake/guides/feishubot.md) | Stores a FeiShu bot webhook and message template for task failure notifications and similar scenarios. |
| [Kafka - Credentials (Beta)](/tidb-cloud-lake/guides/kafka-credentials.md) | Stores the broker addresses, authentication method, and connection credentials required to access Kafka. These settings can be reused by Kafka Consumer tasks. |

Not every data source corresponds to an integration task. For example, `FeiShuBot` is used for notification configuration, while `Amazon S3 - Credentials`, `Amazon SQS (S3) - IAM Role`, `MySQL - Credentials`, and `PostgreSQL - Credentials` are referenced by actual import, synchronization, or event-consuming tasks.
Not every data source corresponds to an integration task. For example, `FeiShuBot` is used for notification configuration, while `Amazon S3 - Credentials`, `Amazon SQS (S3) - IAM Role`, `MySQL - Credentials`, `PostgreSQL - Credentials`, and `Kafka - Credentials` are referenced by actual import, synchronization, or event-consuming tasks.

## Managing Data Sources

Expand Down
131 changes: 131 additions & 0 deletions tidb-cloud-lake/guides/integrate-with-kafka.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,131 @@
---
title: Kafka Consumer Integration Task (Beta)
summary: Create a Kafka Consumer task to continuously consume messages from Kafka topics and save the message content to internal object storage (tenant Stage).
---

# Kafka Consumer Integration Task (Beta)

This page describes how to create a Kafka Consumer task that continuously consumes messages from Kafka topics and saves the message content to internal object storage (tenant Stage).

Unlike S3, MySQL, or PostgreSQL integration tasks, a Kafka Consumer task does not write directly to a regular target table. After the task is created and started, you can use the `@kafka_consumer/<task_name>/` stage path to view saved message objects and query their content with SQL.
Comment thread
lilin90 marked this conversation as resolved.

If you need to create reusable Kafka connection settings first, see [Kafka - Credentials (Beta)](/tidb-cloud-lake/guides/kafka-credentials.md).

## Use Cases
Comment thread
lilin90 marked this conversation as resolved.

- Continuously ingest JSON messages from Kafka topics
- Land Kafka messages in internal object storage first, then query or process them with downstream SQL
- Preserve raw Kafka message objects for real-time or near-real-time data pipelines

## Workflow

1. An upstream system writes messages to Kafka topics.
2. The Kafka Consumer task reads messages from the specified topics.
3. The task saves messages in batches to internal object storage (tenant Stage).
4. Users view generated objects through `@kafka_consumer/<task_name>/`.
5. Users query message content from the stage and perform downstream loading or transformation as needed.

> **Note:**
>
> Kafka Consumer tasks save object files that contain Kafka message content. If you need to write messages into a business table, run downstream `INSERT INTO ... SELECT`, `COPY INTO`, or other processing based on the stage query results.

## Prerequisites

Before creating a Kafka Consumer task, make sure:

- A **Kafka - Credentials** data source has already been created
- Platform can access the Kafka brokers over the network
- The authentication method, TLS settings, and account information in the Kafka data source are correct
- The Kafka user has permission to read the target topics
- Messages in the target topics match the **Data Format** selected in the task

## Creating a Kafka Consumer Task
Comment thread
lilin90 marked this conversation as resolved.

### Step 1: Basic Info
Comment thread
lilin90 marked this conversation as resolved.

1. Navigate to **Data** > **Data Integration** and click **Create Task**.
2. Select a Kafka data source, then configure the basic parameters:

| Field | Required | Description |
|-------|----------|-------------|
| **Data Source** | Yes | Select an existing **Kafka - Credentials** data source from the dropdown |
| **Name** | Yes | Name of the Kafka Consumer task |
| **Topics** | Yes | Kafka topics to consume. Separate multiple topics with commas, for example `topic-1,topic-2` |
| **Data Format** | Yes | Kafka message data format. Currently, this is **JSON** |
| **Start Position** | Yes | Start position when no committed offset exists. Supports **Latest** and **Earliest** |
| **Max Batch Bytes** | No | Maximum data size per batch. The default value is **16 MiB** |
| **Max Batch Wait Interval** | No | Maximum wait time per batch. The default value is **1 Minute** |

> **Note:**
>
> **Latest** consumes only new messages, while **Earliest** starts from the earliest retained messages in Kafka. This setting applies only when the Consumer Group has no committed offset and does not reset existing offsets.

### Step 2: Preview Data
Comment thread
lilin90 marked this conversation as resolved.

After completing the basic settings, click **Next** to enter **Preview Data Info**.

The system attempts to read sample messages from the specified Kafka topics. If messages are available, the page displays 1 to 2 JSON messages so you can verify the topics, data format, and message structure.

If no previewable messages are available, the page displays **No sample data available**. You can still continue creating the task, but we recommend checking whether the topics already contain messages and whether the selected **Start Position** can read sample data.

### Step 3: Result Viewing
Comment thread
lilin90 marked this conversation as resolved.

In the **Result Viewing** step, select the **Warehouse** to run the Kafka Consumer task.

After the task starts, it reads Kafka messages and saves them to internal object storage (tenant Stage). The page provides SQL examples. You can use `LIST @kafka_consumer/<task_name>/` to view generated objects and use stage queries to read message content.

```sql
-- List stage objects:
LIST @kafka_consumer/<task_name>/;

-- Query object data (replace with the correct PATTERN path):
SELECT $1
FROM @kafka_consumer (
FILE_FORMAT=>'ndjson',
PATTERN=>'<task_name>/year=YYYY/month=MM/day=DD/hour=HH/.*[.]ndjson'
);
```

Click **Create** to create the task.

## Task Behavior
Comment thread
lilin90 marked this conversation as resolved.

A Kafka Consumer task runs continuously. After it starts, it consumes messages from the specified topics and saves them in batches as object files in internal object storage until you stop it manually.

| Scenario | Behavior |
|----------|----------|
| New messages exist in the topics | Reads messages and writes them to the tenant Stage |
| The batch size reaches **Max Batch Bytes** | Writes the current batch to object storage |
| The wait time reaches **Max Batch Wait Interval** | Writes the current batch to object storage even if the batch does not reach the size limit |
| The write operation succeeds | Saves the consumption progress for later continuation |
| You stop the task manually | Stops consuming and keeps the saved message objects |

## Query Saved Messages
Comment thread
lilin90 marked this conversation as resolved.

Kafka Consumer tasks save message objects under the `@kafka_consumer/<task_name>/` path. After the task starts and writes objects, open the task details page and switch to the **Data Browsing** tab to view the object count and object list by UTC hour.

You can also use SQL to list objects first, then query their content based on the actual path:

```sql
LIST @kafka_consumer/<task_name>/;
```

```sql
SELECT $1
FROM @kafka_consumer (
FILE_FORMAT=>'ndjson',
PATTERN=>'<task_name>/year=YYYY/month=MM/day=DD/hour=HH/.*[.]ndjson'
);
```

If you need to write messages into a business table, continue with downstream transformation or loading based on the query result.

## Advanced Configuration
Comment thread
lilin90 marked this conversation as resolved.

### Runtime Size
Comment thread
lilin90 marked this conversation as resolved.

Kafka Consumer tasks support changing the runtime size. Before changing Runtime Size, stop the task, then open the edit page from the **Edit** menu, select an appropriate runtime size in the **Runtime Size** section, and save the change. After you restart the task, it runs with the new runtime size.
Comment thread
lilin90 marked this conversation as resolved.

> **Note:**
>
> The available runtime sizes and prices depend on your billing plan. Use the options shown in the console and the pricing documentation as the source of truth.
6 changes: 4 additions & 2 deletions tidb-cloud-lake/guides/integration-tasks.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,9 +5,9 @@ summary: This page provides an overview of integration tasks in {{{ .lake }}}. I

# Integration Tasks

An integration task in {{{ .lake }}} defines how data flows from a source into a target table in {{{ .lake }}}. Each task references an existing data source and specifies source settings, a target warehouse, a target database / table, and runtime parameters that are specific to the task type.
An integration task in {{{ .lake }}} defines how data flows from a source into {{{ .lake }}}. Each task references an existing data source and specifies source settings, a target location or result viewing method, and runtime parameters that are specific to the task type.

Unlike data sources, integration tasks are the executable units that actually perform data movement and synchronization. Data sources store access settings, while tasks handle scheduling, ingestion, synchronization, stopping, resuming, and monitoring.
Unlike data sources, integration tasks are the executable units that actually perform data movement, synchronization, or message consumption. Data sources store access settings, while tasks handle scheduling, ingestion, synchronization, consumption, stopping, resuming, and monitoring.

## Supported Task Types

Expand All @@ -17,6 +17,7 @@ Unlike data sources, integration tasks are the executable units that actually pe
| [Amazon SQS (S3) (Beta)](/tidb-cloud-lake/guides/integrate-with-amazon-sqs-s3.md) | Consumes S3 object creation events from an SQS queue and writes the corresponding object data into {{{ .lake }}}. |
| [MySQL](/tidb-cloud-lake/guides/integrate-with-mysql.md) | Synchronizes table data from MySQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC`. |
| [PostgreSQL](/tidb-cloud-lake/guides/integrate-with-postgresql.md) | Synchronizes table data from PostgreSQL using `Snapshot`, `CDC Only`, or `Snapshot + CDC`. |
| [Kafka Consumer Integration Task (Beta)](/tidb-cloud-lake/guides/integrate-with-kafka.md) | Continuously consumes messages from Kafka topics and saves the message content to internal object storage. |

## Reading Guide

Expand All @@ -30,3 +31,4 @@ Recommended reading order:
- S3 tasks are designed for file import scenarios and mainly focus on file path patterns, file formats, and ingestion behavior.
- SQS (S3) tasks are designed for S3 event-driven data ingestion and mainly focus on the SQS queue, S3 event filters, IAM Role, and target table.
- MySQL and PostgreSQL tasks are designed for table synchronization scenarios and mainly focus on sync modes, primary keys, incremental capture, and archive scheduling.
- Kafka Consumer tasks are designed for message consumption scenarios and mainly focus on topics, start position, batch size, batch wait interval, and tenant Stage queries.
Comment thread
lilin90 marked this conversation as resolved.
43 changes: 43 additions & 0 deletions tidb-cloud-lake/guides/kafka-credentials.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
---
title: Kafka - Credentials (Beta)
summary: Create a "Kafka - Credentials" data source to store Kafka connection information for reuse in Kafka Consumer integration tasks.
---

# Kafka - Credentials (Beta)

This page describes how to create a `Kafka - Credentials` data source. This data source stores the broker addresses, authentication method, and connection credentials required to access a Kafka cluster. You can reuse these settings across multiple Kafka Consumer integration tasks.

`Kafka - Credentials` only stores Kafka connection information. It does not consume messages by itself. The actual process of reading Kafka topic messages and writing them to internal object storage is performed by a [Kafka Consumer Integration Task (Beta)](/tidb-cloud-lake/guides/integrate-with-kafka.md).
Comment thread
lilin90 marked this conversation as resolved.

## Use Cases
Comment thread
lilin90 marked this conversation as resolved.

- Centrally manage Kafka broker addresses and authentication settings
- Reuse the same Kafka connection settings across multiple Kafka Consumer tasks
- Update the Kafka addresses, authentication method, or account information in one place when referenced by multiple tasks

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Avoid passive voice by rewriting the clause in active voice.

Suggested change
- Update the Kafka addresses, authentication method, or account information in one place when referenced by multiple tasks
- Update the Kafka addresses, authentication method, or account information in one place when multiple tasks reference them


## Create Kafka - Credentials
Comment thread
lilin90 marked this conversation as resolved.

1. Navigate to **Data** > **Data Sources**, then click **Create Data Source**.
2. Select **Kafka - Credentials** as the service type, then fill in the connection details:

| Field | Required | Description |
|-------|----------|-------------|
| **Name** | Yes | A descriptive name for the data source |
| **Brokers** | Yes | Kafka broker address list. Separate multiple addresses with commas, for example `broker-1:9092,broker-2:9093,broker-3:9092` |
| **Authentication** | Yes | Kafka authentication method. Supported options are **None** and **SASL/PLAIN** |
| **TLS encryption** | No | Whether to enable TLS encryption |
| **Username** | Required if applicable | Kafka username. Required when **SASL/PLAIN** is selected |
| **Password** | Required if applicable | Kafka password. Required when **SASL/PLAIN** is selected |
Comment thread
lilin90 marked this conversation as resolved.

3. Click **Test Connectivity** to validate the connection. If the test succeeds, click **OK** to save the data source.

## Configuration Recommendations
Comment thread
lilin90 marked this conversation as resolved.

- Create a dedicated Kafka user for the platform instead of sharing an application account.
- Enable **TLS encryption** if your Kafka cluster requires encrypted connections.
- If you select **SASL/PLAIN**, make sure the Kafka user is allowed to read the topics that will be consumed by downstream tasks.
Comment thread
lilin90 marked this conversation as resolved.
- Run **Test Connectivity** before saving the data source to verify the broker addresses, network access, and authentication settings.

## Next Steps
Comment thread
lilin90 marked this conversation as resolved.

After creating the data source, you can use it to create a [Kafka Consumer Integration Task (Beta)](/tidb-cloud-lake/guides/integrate-with-kafka.md).
Loading
Loading