Skip to content

Conversation

@foskey51
Copy link

@foskey51 foskey51 commented Dec 17, 2025

Which issue does this PR close?

What changes are included in this PR?

The PR adds support for parsing array index (eg. foo.bar[3]) with the help of parse_path fn. Currently the parser silently parses invalid segments as Field (eg., foo[0, [0](parsed as index), foo0], foo[0][)

Feedback requested

Whether to add stricter validation (throw an error) and reject the segment ? Or to keep the current behavior ?

Are these changes tested?

yes, only for valid inputs

Are there any user-facing changes?

no

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Dec 17, 2025
Copy link
Contributor

@mhilton mhilton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like an improvement to me. Thank you @foskey51 and @mhilton

My only question is about empty path segments

cc @scovich

} else {
VariantPath::new(path.split('.').map(Into::into).collect())
}
VariantPath::new(path.split(".").flat_map(parse_path).collect())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is beautiful

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the flat_map mean this will parse paths like foo..bar as [foo] [bar]?

I am not sure exactly what should happen when parsing a path like that 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that you mention, it seems like the existing code already had that bug?

In theory, that should be a malformed path that causes an error. Ditto for mismatched [ or ], or e.g. [a.b] (which this PR would incorrectly attempt to treat as [a and b]).

It seems like we'll need a more sophisticated (and integrated) parsing approach here?

Copy link
Contributor

@scovich scovich Dec 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of quality time with an LLM produced the following (along with a bunch of unit tests to validate it works):

Details
/// Create from &str with support for dot notation and array indices.
///
/// # Example
/// ```
/// # use parquet_variant::VariantPath;
/// let path: VariantPath = "foo.bar[0]".try_into().unwrap();
/// ```
impl<'a> TryFrom<&'a str> for VariantPath<'a> {
    type Error = ArrowError;

    fn try_from(path: &'a str) -> Result<Self, Self::Error> {
        parse_path(path).map(VariantPath::new)
    }
}

/// Parse a path string into a vector of [`VariantPathElement`].
///
/// Supports the following syntax:
/// - `""` - empty path
/// - `"foo"` - single field
/// - `"foo.bar"` - nested fields
/// - `"[1]"` - single array index
/// - `"[1][2]"` - multiple array indices
/// - `"foo[1]"` - field with array index
/// - `"foo[1][2]"` - field with multiple array indices
/// - `"foo[1].bar"` - mixed field and array access
/// - etc.
///
/// Field names can contain any characters except `.`, `[`, and `]`.
fn parse_path(s: &str) -> Result<Vec<VariantPathElement<'_>>, ArrowError> {
    let scan_field = |start: usize| {
        s[start..].find(['.', '[', ']']).map_or_else(|| s.len(), |p| start + p)
    };

    let bytes = s.as_bytes();
    if let Some(b'.') = bytes.first() {
        return Err(ArrowError::ParseError("unexpected leading '.'".into()));
    }

    let mut elements = Vec::new();
    let mut i = 0;

    while i < bytes.len() {
        let (elem, end) = match bytes[i] {
            b'.' => {
                i += 1; // skip the dot; a field must follow
                let end = scan_field(i);
                if end == i {
                    return Err(ArrowError::ParseError(match bytes.get(i) {
                        None => "unexpected trailing '.'".into(),
                        Some(&c) => format!("unexpected '{}' at byte {i}", c as char),
                    }));
                }
                (VariantPathElement::field(&s[i..end]), end)
            }
            b'[' => {
                let (idx, end) = parse_index(s, i)?;
                (VariantPathElement::index(idx), end)
            }
            b']' => {
                return Err(ArrowError::ParseError(format!(
                    "unexpected ']' at byte {i}"
                )));
            }
            _ => {
                let end = scan_field(i);
                (VariantPathElement::field(&s[i..end]), end)
            }
        };
        elements.push(elem);
        i = end;
    }

    Ok(elements)
}


/// Parse `[digits]` starting at `i` (which points to `[`).
/// Returns (index_value, position after `]`).
fn parse_index(s: &str, i: usize) -> Result<(usize, usize), ArrowError> {
    let start = i + 1; // skip '['

    // Find closing ']'
    let end = match s[start..].find(']') {
        Some(p) => start + p,
        None => return Err(ArrowError::ParseError(format!("unterminated '[' at byte {i}"))),
    };

    let idx = s[start..end]
        .parse()
        .map_err(|_| ArrowError::ParseError(format!("invalid index at byte {start}")))?;

    Ok((idx, end + 1))
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially considered using TryFrom with explicit error boundaries , but doing so would require changing an existing public API, which would be a breaking change

pub fn with_path<'a, P, F>(mut self, path: P, field: F) -> Self
where
P: Into<VariantPath<'a>>,
F: IntoShreddingField,
{
let path: VariantPath<'a> = path.into();
self.root.insert_path(&path, field.into_shredding_field());
self
}

and as @scovich mentioned the current approach can also parse invalid inputs as fields, making debugging hard. So I'm a bit skeptical about how to proceed from here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, variant is still experimental, so breaking changes are ok?

Copy link
Contributor

@scovich scovich Dec 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meanwhile, even the parser I cooked up above is technically incomplete, because JSON allows field names to contain any character (including ., [ and ] which my suggestion would reject as ambiguous). The only way to avoid that is with some kind of escaping syntax. I believe Iceberg uses canonical JSONpath, for example.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an open issue regarding the escaping syntax #8954

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. It looks like the mini-parser I posted above is a strict subset of the databricks syntax linked above. Would it be a reasonable starting point to merge this PR and add the missing ['fieldName'] syntax as a follow-up? Or just expand the above suggestion into a full parser right away?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Support array access when parsing VariantPath

4 participants