
Conversation

@pitrou (Member) commented Dec 15, 2025

Rationale for this change

The 8-byte length header is considered optional for empty compressed buffers in Arrow C++ and other implementations.

Are these changes tested?

Not applicable.

Are there any user-facing changes?

No.

@pitrou marked this pull request as ready for review December 15, 2025 15:45
@pitrou requested a review from raulcd December 15, 2025 15:45
@github-actions bot commented

⚠️ GitHub issue #47918 has been automatically assigned in GitHub to PR creator.

@github-actions bot added the "awaiting review" label Dec 15, 2025
@alamb (Contributor) left a comment

Thanks @pitrou -- I think this seems like a reasonable change to the format (if it is one), and since the C/C++ implementation has been doing it, I think it could/should become the de facto standard.

/// uncompressed length may be set to -1 to indicate that the data that
/// follows is not compressed, which can be useful for cases where
/// compression does not yield appreciable savings.
/// Also, empty buffers can optionally omit the 8-byte length header.
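
For illustration, a reader following this wording could handle a body buffer roughly as in the sketch below (`decompress` is a hypothetical stand-in for the LZ4/ZSTD codec call, not a real API):

```rust
// Decode one compressed body buffer as described in the comment above.
fn decode_body_buffer(raw: &[u8], decompress: impl Fn(&[u8]) -> Vec<u8>) -> Vec<u8> {
    // Empty buffer: the 8-byte length header may be omitted entirely.
    if raw.is_empty() {
        return Vec::new();
    }
    // Otherwise the first 8 bytes are the little-endian uncompressed length
    // (the spec guarantees at least those 8 bytes for non-empty buffers).
    let (header, rest) = raw.split_at(8);
    let uncompressed_len = i64::from_le_bytes(header.try_into().unwrap());
    if uncompressed_len == -1 {
        // Sentinel: the remaining bytes are stored uncompressed.
        rest.to_vec()
    } else {
        let decoded = decompress(rest);
        debug_assert_eq!(decoded.len() as i64, uncompressed_len);
        decoded
    }
}
```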
Contributor commented

Do you know if there is any data file in https://github.com/apache/arrow-testing that has this property (empty body that omits the 8-byte length)?

I would like to make sure we test that the rust reader can read it

Member commented

I'm also interested in such a file. It might be nice in this comment to take a stance on whether an implementation should do this for new implementors that don't have an opinion. It seems like the rationale is that you save 8 bytes per column in most cases (since this would probably get used for every null buffer in compressed output)...does anybody need that level of optimization?

@pitrou (Member Author) commented

I don't know about the rationale; I agree that we probably don't care about such an optimization.

@github-actions bot added the "awaiting merge" and "awaiting changes" labels and removed the "awaiting review" and "awaiting merge" labels Dec 16, 2025
@pitrou (Member Author) commented Dec 17, 2025

To answer the questions:

  1. I don't think the 0-length exception was actually discussed. It was briefly mentioned in @wesm's original proposal ("Compress each non-zero-length constituent Buffer ..."), but that didn't trigger any particular reaction.
  2. I've just checked: the compressed streams in https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration/2.0.0-compression (in particular generated_lz4.stream and generated_zstd.stream) have a 0-length buffer at index 0 of their record batches (it's the null bitmap of the first column, which doesn't have any nulls).

So, such IPC streams have already been emitted by the C++ IPC writer, and quite commonly so it seems (a non-null batch column is sufficient, though a zero-sized batch would also trigger it).
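
The writer-side behaviour described here amounts to something like the sketch below (a hypothetical `compress` helper standing in for the codec, not the actual C++ or Rust writer code):

```rust
// Serialize one body buffer for a compressed record batch. A zero-length
// buffer (e.g. the absent null bitmap of a column with no nulls) produces
// no bytes at all, so the 8-byte header is omitted.
fn encode_body_buffer(data: &[u8], compress: impl Fn(&[u8]) -> Vec<u8>) -> Vec<u8> {
    if data.is_empty() {
        return Vec::new(); // no header, no payload
    }
    let compressed = compress(data);
    let mut out = Vec::with_capacity(8 + compressed.len().min(data.len()));
    if compressed.len() < data.len() {
        // 8-byte little-endian uncompressed length, then the compressed bytes.
        out.extend_from_slice(&(data.len() as i64).to_le_bytes());
        out.extend_from_slice(&compressed);
    } else {
        // Compression did not help: write -1 and the raw bytes instead.
        out.extend_from_slice(&(-1i64).to_le_bytes());
        out.extend_from_slice(data);
    }
    out
}
```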

@pitrou (Member Author) commented Dec 17, 2025

We could add compressed files with empty batches to the arrow-testing repo if the non-null column case is not sufficient to exercise all readers, by the way.

@pitrou (Member Author) commented Dec 17, 2025

Speaking of which, I don't think our integration tests actually exercise compression except with these predefined files.
