fix: emit empty RecordBatch for empty file writes #19370
kosiew left a comment
Thanks for working on this.
    }

    // if there is no batch send but with a single file, send an empty batch
    if single_file_output && !is_batch_received {
The doc comment for DataFrameWriteOptions::with_single_file_output() should also be updated to describe the empty-DataFrame behavior.
Done. Updated the doc comment for with_single_file_output().
kosiew left a comment
LGTM, with some minor comments.
    df.write_csv(&path, crate::dataframe::DataFrameWriteOptions::new(), None)
        .await?;
    // Expected the file to exist
    assert!(std::path::Path::new(&path).exists());
Curious why there is no assertion of 0 lines here, like you did for the Arrow and Parquet files?
    df.write_json(&path, crate::dataframe::DataFrameWriteOptions::new(), None)
        .await?;
    // Expected the file to exist
    assert!(std::path::Path::new(&path).exists());
Curious why there is no assertion of 0 lines here, like you did for the Arrow and Parquet files?
Which issue does this PR close?
Rationale for this change
If the input stream yields no RecordBatch at all, nothing gets sent downstream, and the writer never has a chance to produce a valid file. I added a small fallback: when single_file_output is enabled and no batches were received, we send a single empty RecordBatch with the input schema.
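The fallback described above can be sketched roughly as follows. This is a simplified, standard-library-only illustration, not the code in this PR: `Schema`, `Batch`, and `forward_batches` are hypothetical stand-ins for the Arrow schema, `RecordBatch`, and the real demux logic, and a plain `mpsc` channel stands in for the path to the writer. Only the `single_file_output` and `is_batch_received` names come from the diff.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical stand-in for an Arrow schema: just the column names.
#[derive(Debug, Clone, PartialEq)]
pub struct Schema {
    pub columns: Vec<String>,
}

/// Hypothetical stand-in for a RecordBatch: a schema plus a row count.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub schema: Schema,
    pub num_rows: usize,
}

impl Batch {
    /// A batch with zero rows that still carries the schema.
    pub fn empty(schema: Schema) -> Self {
        Batch { schema, num_rows: 0 }
    }
}

/// Forwards every input batch to the writer side of the channel.
/// If the input yielded nothing and a single output file was requested,
/// sends one empty batch so the writer can still produce a valid file.
/// Returns the number of batches successfully sent.
pub fn forward_batches(
    input: impl IntoIterator<Item = Batch>,
    schema: &Schema,
    single_file_output: bool,
    tx: &Sender<Batch>,
) -> usize {
    let mut is_batch_received = false;
    let mut sent = 0;

    for batch in input {
        is_batch_received = true;
        if tx.send(batch).is_ok() {
            sent += 1;
        }
    }

    // Fallback: no batch arrived, but a single file is still expected.
    if single_file_output && !is_batch_received {
        if tx.send(Batch::empty(schema.clone())).is_ok() {
            sent += 1;
        }
    }

    sent
}
```

In this sketch the empty batch is only sent when `single_file_output` is set, so an empty input in the multi-file case still sends nothing downstream.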
Are these changes tested?
Yes.
Are there any user-facing changes?