Support moved compressed rows in SAS data files #365
Open
hpoettker wants to merge 1 commit into WizardMac:dev from
Introduction
This PR adds support for "moved" compressed rows in SAS data files.
I'm not aware of a public description of this feature but I've investigated the hex dumps of SAS files that ReadStat currently cannot read, reverse-engineered the logic, and then validated the read data against exports of the data files as produced by SAS.
I don't know the exact conditions that trigger the "moving" of rows, but these conditions seem to be necessary:
The technical term "moved row" is something that I've made up on the basis of what I've observed. The naming is up for discussion, but I'll drop the quotation marks in the text below.
Compression Types
The most widely known compression types that can be read in subheader pointers of SAS data files are
- 0x00, indicating no compression of the linked content
- 0x01, indicating that the linked content can be skipped
- 0x04, indicating that the linked content contains a compressed row

With the feature of moved rows, there are three additional compression types:
- 0x03, indicating the logical position of a row that is actually on a different page of the data file
- 0x06, indicating the physical position of a row that is referred to by a 0x03 compression type subheader pointer
- 0x0d, indicating subheader pointers that can be skipped similarly to 0x01 for currently unknown reasons

A speculative interpretation of this list of compression types is that the compression type in subheader pointers is actually a bitmap with the following meanings of the bits:
- 0x01: any data that the subheader pointer may directly refer to shall be ignored
- 0x02: the subheader pointer is related to rows that have been moved
- 0x04: the subheader pointer directly refers to a compressed row
- 0x08: unknown (but I've only observed it as part of the type 0x0d, which also matches the bit 0x01 and can thus be ignored)

Compression type 0x03
The typical subheader pointer in a SAS data file contains the following pieces of information:
With compression type 0x03, the byte positions and lengths within the subheader pointer are the same but the meaning of the values is different:
The order of rows in a SAS data file is normally defined by the order in which a pass of the file encounters them. But when a subheader pointer with compression type 0x03 is encountered, this only defines the logical position of the row in the order of encounter, while the actual data is on a different page (one that, as far as I can tell, is encountered later).
For subheader pointers of compression type 0x03, the previous and the next pointer will usually refer to neighboring areas within the same page.
Compression type 0x06
The reference from a subheader pointer with compression type 0x03 is always to a subheader pointer on another page that has the compression type 0x06.
A subheader pointer with compression type 0x06 does not represent any logical position of a row in the usual order of encounter. But the compressed row at the physical position that the pointer refers to can be read exactly like rows of compression type 0x04.
Compression type 0x0d
I don't have a good explanation for this compression type. But just skipping subheaders with this type, as one does with compression type 0x01, leads to the correct result when comparing the exports of ReadStat with those of SAS itself.
Implementation alternative
The implementation proposed in this PR respects the difference between the logical and the physical order of rows in a SAS data file, and replicates the order in which SAS itself presents the rows of a data file.
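To illustrate the difference between logical and physical order, here is a toy model in C. It is not the code in this PR and does not parse real SAS pages. The types `toy_pointer_t` and `toy_page_t`, the fields `target_page` and `target_slot`, and the function `rows_in_logical_order` are all made up for this sketch. In particular, it is an assumption here that a 0x03 pointer can be resolved to a page index and a pointer index on that page.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a subheader pointer (hypothetical layout). */
typedef struct {
    uint8_t compression;  /* 0x00, 0x01, 0x03, 0x04, 0x06 or 0x0d */
    size_t  target_page;  /* assumed: only meaningful for 0x03 */
    size_t  target_slot;  /* assumed: only meaningful for 0x03 */
    int     row_id;       /* stand-in for the compressed row data */
} toy_pointer_t;

typedef struct {
    const toy_pointer_t *pointers;
    size_t count;
} toy_page_t;

/* Writes row ids to `out` in logical order.
 * Returns the number of rows, or -1 if a constraint is violated. */
static int rows_in_logical_order(const toy_page_t *pages, size_t page_count,
                                 int *out, size_t out_cap) {
    size_t n = 0;
    for (size_t p = 0; p < page_count; p++) {
        for (size_t s = 0; s < pages[p].count; s++) {
            const toy_pointer_t *ptr = &pages[p].pointers[s];
            const toy_pointer_t *row = NULL;
            if (ptr->compression == 0x04) {
                row = ptr;  /* logical and physical position coincide */
            } else if (ptr->compression == 0x03) {
                /* Logical position here, physical row on another page. */
                if (ptr->target_page >= page_count || ptr->target_page == p)
                    return -1;
                if (ptr->target_slot >= pages[ptr->target_page].count)
                    return -1;
                row = &pages[ptr->target_page].pointers[ptr->target_slot];
                if (row->compression != 0x06)
                    return -1;  /* a 0x03 pointer must refer to a 0x06 pointer */
            } else {
                /* 0x06 has no logical position of its own; 0x01 and 0x0d are
                 * skipped; 0x00 content is outside the scope of this toy. */
                continue;
            }
            if (n >= out_cap)
                return -1;
            out[n++] = row->row_id;
        }
    }
    return (int)n;
}
```

The point of the sketch is that a 0x06 pointer never emits a row during the pass. Its row only appears when the matching 0x03 pointer is reached.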
An alternative implementation that would be more efficient but would lose the faithfulness to the logical order of rows would be to treat
- 0x06 the same as 0x04
- 0x03 and 0x0d just like 0x01

For many use cases, this would be good enough.
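A minimal sketch of this simplified alternative, written in C and leaning on the speculative bitmap reading from the Compression Types section. The enum names and the function `classify_pointer` are hypothetical and are not ReadStat identifiers.

```c
#include <stdint.h>

/* Speculative bit meanings of the compression type (hypothetical names). */
enum {
    PTR_FLAG_IGNORE     = 0x01,  /* linked data shall be ignored */
    PTR_FLAG_MOVED      = 0x02,  /* related to moved rows */
    PTR_FLAG_COMPRESSED = 0x04   /* refers directly to a compressed row */
};

typedef enum {
    ACTION_READ_PLAIN,           /* 0x00: no compression */
    ACTION_READ_COMPRESSED_ROW,  /* 0x04 and 0x06 */
    ACTION_SKIP,                 /* 0x01, 0x03 and 0x0d */
    ACTION_UNKNOWN               /* anything not observed so far */
} ptr_action_t;

/* Rows come out in physical order, not in SAS's logical order. */
static ptr_action_t classify_pointer(uint8_t compression) {
    if (compression & PTR_FLAG_IGNORE)
        return ACTION_SKIP;
    if (compression & PTR_FLAG_COMPRESSED)
        return ACTION_READ_COMPRESSED_ROW;
    if (compression == 0x00)
        return ACTION_READ_PLAIN;
    return ACTION_UNKNOWN;
}
```

Checking the ignore bit first is what makes 0x03 and 0x0d behave like 0x01, while 0x06 falls through to the same branch as 0x04. Values such as 0x02 or 0x08 on their own have not been observed, so the sketch reports them as unknown rather than guessing.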
Validation
The change proposed in this PR works as expected on the SAS data files that I have access to. But I can share neither the data files themselves nor the SAS code that produces them.
As stated in the introduction, I'm not aware of what the precise trigger for SAS is to move rows to another page. So I'm also unable to provide generic SAS code that would create a toy example of a SAS data file with moved rows.
I've implemented the proposed change with the goal of introducing minimal risk. It should not affect any SAS data file that ReadStat currently reads successfully, and it contains validations against all constraints that I've observed.
I'm opening this PR in the hope of either someone from the community stepping forward with supporting information on moved rows in SAS data files or a leap of faith on the part of the maintainer.
If there is any kind of follow-up question on this proposed change, I'd be happy to engage.