-
Notifications
You must be signed in to change notification settings - Fork 3.9k
GH-47918: [Format] Clarify that empty compressed buffers can omit the length header #48541
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…it the length header
|
|
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @pitrou -- I think this seems like a reasonable change to the format (if it is one) and since the C/C++ implementation has been doing it I think it could/should become the defacto standard
| /// uncompressed length may be set to -1 to indicate that the data that | ||
| /// follows is not compressed, which can be useful for cases where | ||
| /// compression does not yield appreciable savings. | ||
| /// Also, empty buffers can optionally omit the 8-byte length header. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know if there is any data file in https://github.com/apache/arrow-testing that has this property (empty body that omits the 8-byte length)?
I would like to make sure we test that the rust reader can read it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm also interested in such a file. It might be nice in this comment to take a stance on whether an implementation should do this for new implementors that don't have an opinion. It seems like the rationale is that you save 8 bytes per column in most cases (since this would probably get used for every null buffer in compressed output)...does anybody need that level of optimization?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know about the rationale, I agree that we probably don't care about such an optimization.
|
To answer the questions:
So, such IPC streams have already been emitted by the C++ IPC writer, and quite commonly so it seems (a non-null batch column is sufficient, though a zero-sized batch would also trigger it). |
|
We could add compressed files with empty batches in the |
|
Speaking of which, I don't think our integration tests actually exercise compression expect with these predefined files. |
Rationale for this change
The 8-byte length header is considered optional for empty compressed buffers in Arrow C++ and other implementations.
Are these changes tested?
Not applicable.
Are there any user-facing changes?
No.