Add chdb in-process backend via interface="chdb" #753

Open
wudidapaopao wants to merge 8 commits into ClickHouse:main from wudidapaopao:feat/chdb-backend


wudidapaopao commented May 21, 2026

Summary

Adds an in-process backend that uses the embedded chdb engine instead of HTTP. It is selected via clickhouse_connect.get_client(interface="chdb") and requires no ClickHouse server.

The same NativeTransform byte parser the HTTP client uses is reused verbatim, so all existing type / dtype / streaming / DB-API / SQLAlchemy code paths work unchanged.

Usage examples

In-memory (default):

client = clickhouse_connect.get_client(interface="chdb")

Persistent file path:

client = clickhouse_connect.get_client(interface="chdb", chdb_path="/var/data/mydb")

Engine startup options as a dict:

client = clickhouse_connect.get_client(
    interface="chdb",
    chdb_path="/var/data/mydb",
    chdb_options={"mode": "ro", "max_threads": 4},
)

Or inline in the path itself:

client = clickhouse_connect.get_client(
    interface="chdb",
    chdb_path="/var/data/mydb?mode=ro&max_threads=4",
)
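
The dict form and the inline form above appear to be two spellings of the same startup options. A minimal sketch of how the two could be reconciled is below. The helper name `split_chdb_path` is hypothetical and not taken from this PR, and the precedence rule (explicit `chdb_options` win over inline values) and the string normalization are assumptions, since the description does not say how conflicts are resolved.

```python
from urllib.parse import parse_qsl


def split_chdb_path(chdb_path, chdb_options=None):
    """Hypothetical helper: split 'path?k=v&k2=v2' into (path, options).

    Not part of clickhouse-connect or chdb; for illustration only.
    """
    # Everything before the first '?' is the storage path (or ':memory:').
    path, _, query = chdb_path.partition("?")
    # Inline options arrive as strings, e.g. {'mode': 'ro', 'max_threads': '4'}.
    options = dict(parse_qsl(query, keep_blank_values=True))
    # Assumption: explicit dict options override inline ones; values are
    # normalized to strings so both spellings compare equal.
    for key, value in (chdb_options or {}).items():
        options[key] = str(value)
    return path, options


inline = split_chdb_path("/var/data/mydb?mode=ro&max_threads=4")
explicit = split_chdb_path("/var/data/mydb", {"mode": "ro", "max_threads": 4})
```

Under these assumptions both calls yield the same path and the same option mapping.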

ClickHouse server settings applied for the lifetime of the client (issued via SET k=v at construction):

client = clickhouse_connect.get_client(
    interface="chdb",
    chdb_path=":memory:",
    database="analytics",
    settings={"max_block_size": 65536, "date_time_output_format": "iso"},
)
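
As a rough illustration of "issued via SET k=v at construction", the sketch below turns a settings dict into one SET statement per entry. The function name `build_set_statements` and the quoting rules are assumptions made for this example; the PR's actual implementation may format values differently.

```python
def build_set_statements(settings):
    """Hypothetical helper: render a settings dict as SET statements."""
    statements = []
    for key, value in settings.items():
        # Guard against malformed names, since the key is spliced into SQL.
        if not key.isidentifier():
            raise ValueError(f"invalid setting name: {key!r}")
        if isinstance(value, bool):
            # Check bool before int: bool is a subclass of int in Python.
            literal = "1" if value else "0"
        elif isinstance(value, (int, float)):
            literal = str(value)
        else:
            # Assumption: strings are single-quoted with backslash escaping.
            escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
            literal = f"'{escaped}'"
        statements.append(f"SET {key}={literal}")
    return statements


stmts = build_set_statements(
    {"max_block_size": 65536, "date_time_output_format": "iso"}
)
```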

Async usage is symmetric:

async with await clickhouse_connect.get_async_client(interface="chdb", chdb_path="/var/data/mydb") as c:
    r = await c.query("SELECT count() FROM events")

Checklist

Delete items not relevant to your PR:

  • Unit tests covering the common scenarios were added
  • A human-readable description of the changes was provided to include in CHANGELOG

@joe-clickhouse
Contributor

@wudidapaopao thanks! This is something I've been wanting to do for a while.

Before we merge, I want to do some more research on both sides, chdb-core and clickhouse-connect, to figure out the right architectural fit here. In its current form this adds another full client surface for us to maintain, including both sync and async paths, which is not ideal long term. Every new backend risks duplicating public methods, settings handling, streaming behavior, inserts, error handling, and tests.

In an ideal world, chdb-core could expose a loopback-only ephemeral HTTP endpoint and our existing HttpClient could consume it like any other ClickHouse server. I had an agent take a quick look, and it tells me the ClickHouse HTTP server/handler code exists in the chdb-core tree and looks structurally reusable, but it is not currently wired into EmbeddedServer, and chDB's embedded path is intentionally no-networking today. If upstream considers that viable, it would keep the clickhouse-connect side much simpler.

If that is not viable, I think we should consider this as a backend refactor on our side rather than adding a separate client subclass. Currently, I'm thinking one public client API with pluggable execution backends, so chDB support is implemented behind the existing client instead of as a parallel client family. (Separate concern, but this would also allow room for future TCP native support as well.)
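
For readers following the thread, "one public client API with pluggable execution backends" could look roughly like the sketch below. Every name here (`ExecutionBackend`, `FakeHttpBackend`, `FakeEmbeddedBackend`, `make_client`) is hypothetical and does not exist in clickhouse-connect; the backends are stand-ins that only echo the SQL so the shape of the design is visible.

```python
from typing import Protocol


class ExecutionBackend(Protocol):
    """Hypothetical contract: run SQL and return raw result bytes."""

    def execute(self, sql: str) -> bytes: ...


class FakeHttpBackend:
    # Stand-in for a backend that would talk to a server over HTTP.
    def execute(self, sql: str) -> bytes:
        return f"http:{sql}".encode()


class FakeEmbeddedBackend:
    # Stand-in for a backend that would call an in-process engine.
    def execute(self, sql: str) -> bytes:
        return f"embedded:{sql}".encode()


class Client:
    """Single public client; only the injected backend differs."""

    def __init__(self, backend: ExecutionBackend):
        self._backend = backend

    def raw_query(self, sql: str) -> bytes:
        # Shared logic (settings, parsing, errors) would live here once.
        return self._backend.execute(sql)


BACKENDS = {"http": FakeHttpBackend, "chdb": FakeEmbeddedBackend}


def make_client(interface: str = "http") -> Client:
    try:
        backend_cls = BACKENDS[interface]
    except KeyError:
        raise ValueError(f"unknown interface: {interface!r}") from None
    return Client(backend_cls())
```

In this shape the public methods exist once on `Client`, and adding a backend means adding one class and one registry entry rather than a parallel client family.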

Either way, I'd like to spend some time on this before merging. Thanks again for putting this together and for the very thorough test coverage. I'll post back here as I get through the research.

@auxten
Member

auxten commented May 22, 2026

Thanks @joe-clickhouse for the thoughtful response, and for taking the time to think through the architectural fit on both sides — really appreciate it.

I strongly agree with the overall direction. If I may add a couple of thoughts from the chDB side:

One thing we've consistently tried hard to preserve in chDB is minimizing serialization/deserialization overhead — it's arguably one of the main reasons users reach for an embedded engine in the first place. A loopback-only HTTP endpoint inside chdb-core would definitely make the clickhouse-connect side much simpler, and that's genuinely appealing. But it would also reintroduce serialize/deserialize round-trips that aren't strictly necessary in the local/embedded mode, which can noticeably hurt performance and increase memory usage — exactly the kind of cost chDB users tend to come to chDB to avoid.

Relatedly, chDB already supports zero-copy read/write for pandas DataFrames. Keeping the in-process path (i.e. not going through a server boundary) preserves that property end-to-end, and I think it also opens up nicer downstream integrations — both deeper interop with clickhouse-connect itself, and a much smoother experience for users who live in the pandas ecosystem.

So just my two cents (please take it as just a suggestion): if we can let users switch the execution engine by changing a single place — without touching any of their existing code — I think that would offer the best developer experience. Your pluggable-backend idea actually sounds very aligned with this: existing clickhouse-connect code stays untouched while the zero-copy in-process path is available under the hood.

Happy to dig deeper on either direction with you — whatever helps the research move forward. Thanks again.
