
Conversation

@pitrou (Member) commented Dec 15, 2025

Rationale for this change

The 8-byte length header is considered optional for empty compressed buffers in Arrow C++ and other implementations.

Are these changes tested?

Not applicable.

Are there any user-facing changes?

No.

@pitrou marked this pull request as ready for review December 15, 2025 15:45
@pitrou requested a review from raulcd December 15, 2025 15:45
@github-actions bot commented

⚠️ GitHub issue #47918 has been automatically assigned in GitHub to PR creator.

@github-actions bot added the "awaiting review" label Dec 15, 2025
@alamb (Contributor) left a comment

Thanks @pitrou -- I think this seems like a reasonable change to the format (if it is one), and since the C/C++ implementation has been doing it, I think it could/should become the de facto standard.

/// uncompressed length may be set to -1 to indicate that the data that
/// follows is not compressed, which can be useful for cases where
/// compression does not yield appreciable savings.
/// Also, empty buffers can optionally omit the 8-byte length header.
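
For illustration, a reader following this wording could handle a body buffer roughly as in the sketch below (`decompress` is a hypothetical stand-in for the LZ4/ZSTD codec call, not a real API):

```rust
// Decode one compressed body buffer as described in the comment above.
fn decode_body_buffer(raw: &[u8], decompress: impl Fn(&[u8]) -> Vec<u8>) -> Vec<u8> {
    // Empty buffer: the 8-byte length header may be omitted entirely.
    if raw.is_empty() {
        return Vec::new();
    }
    // Otherwise the first 8 bytes are the little-endian uncompressed length
    // (the spec guarantees at least those 8 bytes for non-empty buffers).
    let (header, rest) = raw.split_at(8);
    let uncompressed_len = i64::from_le_bytes(header.try_into().unwrap());
    if uncompressed_len == -1 {
        // Sentinel: the remaining bytes are stored uncompressed.
        rest.to_vec()
    } else {
        let decoded = decompress(rest);
        debug_assert_eq!(decoded.len() as i64, uncompressed_len);
        decoded
    }
}
```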
Contributor commented

Do you know if there is any data file in https://github.com/apache/arrow-testing that has this property (empty body that omits the 8-byte length)?

I would like to make sure we test that the rust reader can read it

Member commented

I'm also interested in such a file. It might be nice in this comment to take a stance on whether an implementation should do this for new implementors that don't have an opinion. It seems like the rationale is that you save 8 bytes per column in most cases (since this would probably get used for every null buffer in compressed output)...does anybody need that level of optimization?

@pitrou (Member Author) commented

I don't know about the rationale; I agree that we probably don't care about such an optimization.

@github-actions bot added the "awaiting merge" and "awaiting changes" labels and removed the "awaiting review" and "awaiting merge" labels Dec 16, 2025
@pitrou (Member Author) commented Dec 17, 2025

To answer the questions:

  1. I don't think the 0-length exception was actually discussed. It was briefly mentioned in @wesm's original proposal ("Compress each non-zero-length constituent Buffer ..."), but that didn't trigger any particular reaction.
  2. I've just checked: the compressed streams in https://github.com/apache/arrow-testing/tree/master/data/arrow-ipc-stream/integration/2.0.0-compression (in particular generated_lz4.stream and generated_zstd.stream) have a 0-length buffer at index 0 of their record batches (it's the null bitmap of the first column, which doesn't have any nulls).

So, such IPC streams have already been emitted by the C++ IPC writer, and quite commonly so it seems (a non-null batch column is sufficient, though a zero-sized batch would also trigger it).
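
The writer-side behaviour described here amounts to something like the sketch below (a hypothetical `compress` helper standing in for the codec, not the actual C++ or Rust writer code):

```rust
// Serialize one body buffer for a compressed record batch. A zero-length
// buffer (e.g. the absent null bitmap of a column with no nulls) produces
// no bytes at all, so the 8-byte header is omitted.
fn encode_body_buffer(data: &[u8], compress: impl Fn(&[u8]) -> Vec<u8>) -> Vec<u8> {
    if data.is_empty() {
        return Vec::new(); // no header, no payload
    }
    let compressed = compress(data);
    let mut out = Vec::with_capacity(8 + compressed.len().min(data.len()));
    if compressed.len() < data.len() {
        // 8-byte little-endian uncompressed length, then the compressed bytes.
        out.extend_from_slice(&(data.len() as i64).to_le_bytes());
        out.extend_from_slice(&compressed);
    } else {
        // Compression did not help: write -1 and the raw bytes instead.
        out.extend_from_slice(&(-1i64).to_le_bytes());
        out.extend_from_slice(data);
    }
    out
}
```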

@pitrou (Member Author) commented Dec 17, 2025

We could add compressed files with empty batches to the arrow-testing repo if the non-null column case is not sufficient to exercise all readers, by the way.

@pitrou (Member Author) commented Dec 17, 2025

Speaking of which, I don't think our integration tests actually exercise compression except with these predefined files.
