-
Notifications
You must be signed in to change notification settings - Fork 1.1k
feat: support array indices in VariantPath dot notation #9012
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: support array indices in VariantPath dot notation #9012
Conversation
mhilton
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| } else { | ||
| VariantPath::new(path.split('.').map(Into::into).collect()) | ||
| } | ||
| VariantPath::new(path.split(".").flat_map(parse_path).collect()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is beautiful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the flat_map mean this will parse paths like foo..bar as [foo] [bar]?
I am not sure exactly what should happen when parsing a path like that 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that you mention, it seems like the existing code already had that bug?
In theory, that should be a malformed path that causes an error. Ditto for mismatched [ or ], or e.g. [a.b] (which this PR would incorrectly attempt to treat as [a and b]).
It seems like we'll need a more sophisticated (and integrated) parsing approach here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A bit of quality time with an LLM produced the following (along with a bunch of unit tests to validate it works):
Details
/// Create from &str with support for dot notation and array indices.
///
/// # Example
/// ```
/// # use parquet_variant::VariantPath;
/// let path: VariantPath = "foo.bar[0]".try_into().unwrap();
/// ```
impl<'a> TryFrom<&'a str> for VariantPath<'a> {
type Error = ArrowError;
fn try_from(path: &'a str) -> Result<Self, Self::Error> {
parse_path(path).map(VariantPath::new)
}
}
/// Parse a path string into a vector of [`VariantPathElement`].
///
/// Supports the following syntax:
/// - `""` - empty path
/// - `"foo"` - single field
/// - `"foo.bar"` - nested fields
/// - `"[1]"` - single array index
/// - `"[1][2]"` - multiple array indices
/// - `"foo[1]"` - field with array index
/// - `"foo[1][2]"` - field with multiple array indices
/// - `"foo[1].bar"` - mixed field and array access
/// - etc.
///
/// Field names can contain any characters except `.`, `[`, and `]`.
fn parse_path(s: &str) -> Result<Vec<VariantPathElement<'_>>, ArrowError> {
let scan_field = |start: usize| {
s[start..].find(['.', '[', ']']).map_or_else(|| s.len(), |p| start + p)
};
let bytes = s.as_bytes();
if let Some(b'.') = bytes.first() {
return Err(ArrowError::ParseError("unexpected leading '.'".into()));
}
let mut elements = Vec::new();
let mut i = 0;
while i < bytes.len() {
let (elem, end) = match bytes[i] {
b'.' => {
i += 1; // skip the dot; a field must follow
let end = scan_field(i);
if end == i {
return Err(ArrowError::ParseError(match bytes.get(i) {
None => "unexpected trailing '.'".into(),
Some(&c) => format!("unexpected '{}' at byte {i}", c as char),
}));
}
(VariantPathElement::field(&s[i..end]), end)
}
b'[' => {
let (idx, end) = parse_index(s, i)?;
(VariantPathElement::index(idx), end)
}
b']' => {
return Err(ArrowError::ParseError(format!(
"unexpected ']' at byte {i}"
)));
}
_ => {
let end = scan_field(i);
(VariantPathElement::field(&s[i..end]), end)
}
};
elements.push(elem);
i = end;
}
Ok(elements)
}
/// Parse `[digits]` starting at `i` (which points to `[`).
/// Returns (index_value, position after `]`).
fn parse_index(s: &str, i: usize) -> Result<(usize, usize), ArrowError> {
let start = i + 1; // skip '['
// Find closing ']'
let end = match s[start..].find(']') {
Some(p) => start + p,
None => return Err(ArrowError::ParseError(format!("unterminated '[' at byte {i}"))),
};
let idx = s[start..end]
.parse()
.map_err(|_| ArrowError::ParseError(format!("invalid index at byte {start}")))?;
Ok((idx, end + 1))
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I initially considered using TryFrom with explicit error boundaries , but doing so would require changing an existing public API, which would be a breaking change
arrow-rs/parquet-variant-compute/src/shred_variant.rs
Lines 709 to 717 in 240cbf4
| pub fn with_path<'a, P, F>(mut self, path: P, field: F) -> Self | |
| where | |
| P: Into<VariantPath<'a>>, | |
| F: IntoShreddingField, | |
| { | |
| let path: VariantPath<'a> = path.into(); | |
| self.root.insert_path(&path, field.into_shredding_field()); | |
| self | |
| } |
and as @scovich mentioned the current approach can also parse invalid inputs as fields, making debugging hard. So I'm a bit skeptical about how to proceed from here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, variant is still experimental, so breaking changes are ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meanwhile, even the parser I cooked up above is technically incomplete, because JSON allows field names to contain any character (including ., [ and ] which my suggestion would reject as ambiguous). The only way to avoid that is with some kind of escaping syntax. I believe Iceberg uses canonical JSONpath, for example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's an open issue regarding the escaping syntax #8954
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. It looks like the mini-parser I posted above is a strict subset of the databricks syntax linked above. Would it be a reasonable starting point to merge this PR and add the missing ['fieldName'] syntax as a follow-up? Or just expand the above suggestion into a full parser right away?
Which issue does this PR close?
VariantPath#8946What changes are included in this PR?
The PR adds support for parsing array index (eg.
foo.bar[3]) with the help of parse_path fn. Currently the parser silently parses invalid segments as Field (eg.,foo[0,[0](parsed as index),foo0],foo[0][)Feedback requested
Whether to add stricter validation (throw an error) and reject the segment ? Or to keep the current behavior ?
Are these changes tested?
yes, only for valid inputs
Are there any user-facing changes?
no