feat(converters): add parquet to csv conversion support #553
pedro3pv wants to merge 1 commit into C4illin:main
Conversation
This change introduces support for converting Parquet files to CSV using hyparquet for better version 2 support and memory-efficient streaming. It includes:
- Registration of the parquet converter in main.ts
- Implementation of row-group based processing in parquet.ts
- Comprehensive tests for the new converter
Pull request overview
Adds a new Parquet → CSV converter to the converters subsystem, wiring it into the central converter registry and introducing dependencies needed for Parquet parsing and CSV output.
Changes:
- Register the new parquet converter in src/converters/main.ts.
- Implement Parquet row-group reading and CSV streaming in src/converters/parquet.ts using hyparquet + csv-stringify.
- Add a basic Bun test for the parquet converter and update dependencies/lockfile.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/converters/parquet.test.ts | Adds initial tests for the new converter (currently minimal and has an async assertion issue). |
| src/converters/parquet.ts | New Parquet→CSV converter implementation using hyparquet metadata + row-group reads and CSV stringification. |
| src/converters/main.ts | Registers the parquet converter so it can be selected/auto-matched by the main conversion flow. |
| package.json | Adds csv-stringify, hyparquet, and (currently unused) duckdb. |
| bun.lock | Lockfile updates for the new dependencies. |
```ts
writeStream.on("finish", async () => {
  await fileHandle.close();
  resolve("Done");
});
writeStream.on("error", async (err) => {
  await fileHandle.close();
  reject(err);
});
stringifier.on("error", async (err) => {
  await fileHandle.close();
  reject(err);
});
```
The writeStream/stringifier event handlers are declared async and await fileHandle.close(). If close() rejects, the handler throws before calling resolve/reject, which can leave the Promise returned by convert() pending and trigger an unhandled rejection. Prefer non-async listeners that always settle the outer Promise (e.g., fileHandle.close().catch(() => {}).finally(() => resolve(...))), and ensure close() is only attempted once.
```ts
onComplete: (rows) => {
  if (Array.isArray(rows)) {
    for (const row of rows) {
      stringifier.write(row);
    }
  }
```
stringifier.write(row) ignores backpressure (it can return false). For large Parquet files this can buffer unbounded data in memory and defeat the intended streaming behavior. Handle the false case by awaiting a drain event (or use a pipeline/async iterator approach) before continuing to write more rows.
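A minimal sketch of honoring backpressure, using Node's `once` helper. The names `stringifier` and `rows` mirror the snippet above, but any `Writable` behaves the same way: pause whenever `write()` returns `false`, resume on `'drain'`.

```typescript
import { Writable } from "node:stream";
import { once } from "node:events";

// Pause whenever write() reports a full buffer, resume on 'drain'.
// This keeps memory bounded no matter how many rows a row group yields.
async function writeRows(stringifier: Writable, rows: unknown[]): Promise<void> {
  for (const row of rows) {
    if (!stringifier.write(row)) {
      await once(stringifier, "drain"); // buffer flushed, safe to continue
    }
  }
}
```

Because `onComplete` in the snippet above is synchronous, adopting this pattern likely means buffering per row group or switching to an async-iterator read path; that detail depends on hyparquet's API.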
```json
"csv-stringify": "^6.7.0",
"duckdb": "^1.4.4",
"elysia": "1.4.22",
"hyparquet": "^1.25.6",
```
duckdb is added as a dependency but is not referenced anywhere in the repository (no imports/usages found). Since it’s a large native dependency, please remove it unless it’s required for this feature, or add the code path that uses it.
```ts
test("convert rejects when metadata fails", async () => {
  mockParquetMetadataAsync.mockRejectedValueOnce(new Error("Metadata error"));

  expect(
    convert("invalid.parquet", "parquet", "csv", "output.csv")
  ).rejects.toThrow("Metadata error");
});
```
This rejection assertion isn’t awaited/returned. In Bun’s Jest-like API, the test can finish before the promise settles, leading to false positives. Use await expect(convert(...)).rejects.toThrow(...) (or return expect(...)).
```ts
describe("parquet converter", () => {
  test("convert resolves when process succeeds", async () => {
    const result = await convert("input.parquet", "parquet", "csv", "output.csv");
    expect(result).toBe("Done");
  });
```
The PR description mentions “comprehensive tests”, but the current tests only assert a generic success and a metadata failure. They don’t validate that row-groups are iterated correctly, that rows are actually written to CSV, or that streams are closed on error. Consider asserting written CSV content and adding a test for a parquetRead failure path.
4 issues found across 5 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
```xml
<file name="src/converters/parquet.ts">
<violation number="1" location="src/converters/parquet.ts:46">
P1: The `finish` and `error` event handlers are `async` and `await fileHandle.close()`. If `close()` rejects, the thrown error prevents `resolve`/`reject` from being called, leaving the outer Promise permanently pending and producing an unhandled rejection. Use non-async handlers that always settle the outer promise, e.g. `fileHandle.close().catch(() => {}).finally(() => resolve(...))`.</violation>
<violation number="2" location="src/converters/parquet.ts:74">
P1: Row writes ignore stream backpressure, which can cause excessive buffering/memory growth on large parquet files.</violation>
</file>
<file name="package.json">
<violation number="1" location="package.json:24">
P2: `duckdb` is added as a dependency but is not imported or used anywhere in the codebase. Since it's a large native dependency with a native build step (`node-gyp`), it should be removed to avoid unnecessary install time, binary size, and potential build failures.</violation>
</file>
<file name="tests/converters/parquet.test.ts">
<violation number="1" location="tests/converters/parquet.test.ts:67">
P2: Rejection assertion is not awaited/returned, so the async failure-path test may pass without validating the expected rejection.</violation>
</file>
```
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
```ts
return new Promise((resolve, reject) => {
  stringifier.pipe(writeStream);

  writeStream.on("finish", async () => {
```
P1: The finish and error event handlers are async and await fileHandle.close(). If close() rejects, the thrown error prevents resolve/reject from being called, leaving the outer Promise permanently pending and producing an unhandled rejection. Use non-async handlers that always settle the outer promise, e.g. fileHandle.close().catch(() => {}).finally(() => resolve(...)).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/converters/parquet.ts, line 46:
<comment>The `finish` and `error` event handlers are `async` and `await fileHandle.close()`. If `close()` rejects, the thrown error prevents `resolve`/`reject` from being called, leaving the outer Promise permanently pending and producing an unhandled rejection. Use non-async handlers that always settle the outer promise, e.g. `fileHandle.close().catch(() => {}).finally(() => resolve(...))`.</comment>
```
<file context>
@@ -0,0 +1,93 @@
+  return new Promise((resolve, reject) => {
+    stringifier.pipe(writeStream);
+
+    writeStream.on("finish", async () => {
+      await fileHandle.close();
+      resolve("Done")
</file context>
```
```ts
onComplete: (rows) => {
  if (Array.isArray(rows)) {
    for (const row of rows) {
      stringifier.write(row);
```
P1: Row writes ignore stream backpressure, which can cause excessive buffering/memory growth on large parquet files.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/converters/parquet.ts, line 74:
<comment>Row writes ignore stream backpressure, which can cause excessive buffering/memory growth on large parquet files.</comment>
```
<file context>
@@ -0,0 +1,93 @@
+  onComplete: (rows) => {
+    if (Array.isArray(rows)) {
+      for (const row of rows) {
+        stringifier.write(row);
+      }
+    }
</file context>
```
```json
"@elysiajs/static": "^1.4.7",
"@kitajs/html": "^4.2.13",
"csv-stringify": "^6.7.0",
"duckdb": "^1.4.4",
```
P2: duckdb is added as a dependency but is not imported or used anywhere in the codebase. Since it's a large native dependency with a native build step (node-gyp), it should be removed to avoid unnecessary install time, binary size, and potential build failures.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At package.json, line 24:
<comment>`duckdb` is added as a dependency but is not imported or used anywhere in the codebase. Since it's a large native dependency with a native build step (`node-gyp`), it should be removed to avoid unnecessary install time, binary size, and potential build failures.</comment>
```
<file context>
@@ -20,7 +20,10 @@
   "@elysiajs/static": "^1.4.7",
   "@kitajs/html": "^4.2.13",
+  "csv-stringify": "^6.7.0",
+  "duckdb": "^1.4.4",
   "elysia": "1.4.22",
+  "hyparquet": "^1.25.6",
</file context>
```
```ts
  ).rejects.toThrow("Metadata error");
});
```
P2: Rejection assertion is not awaited/returned, so the async failure-path test may pass without validating the expected rejection.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/converters/parquet.test.ts, line 67:
<comment>Rejection assertion is not awaited/returned, so the async failure-path test may pass without validating the expected rejection.</comment>
```
<file context>
@@ -0,0 +1,70 @@
+
+    expect(
+      convert("invalid.parquet", "parquet", "csv", "output.csv")
+    ).rejects.toThrow("Metadata error");
+  });
+
</file context>
```
Suggested change:

```ts
await expect(
  convert("invalid.parquet", "parquet", "csv", "output.csv")
).rejects.toThrow("Metadata error");
```
Summary by cubic

Adds Parquet→CSV conversion using hyparquet with row-group streaming for low memory usage and Parquet v2 support. Registers the converter and includes tests.

New Features
- parquet converter streams row groups to CSV via csv-stringify and writes a header.
- Registered in src/converters/main.ts with tests covering success and metadata error paths.

Dependencies
- Adds hyparquet, csv-stringify, and duckdb.

Written for commit 53b936c. Summary will update on new commits.