Skip to content

Propagate SortedValue to Polars set_sorted() on index column#3049

Open
G-D-Petrov wants to merge 6 commits into
masterfrom
gpetrov/set_order
Open

Propagate SortedValue to Polars set_sorted() on index column#3049
G-D-Petrov wants to merge 6 commits into
masterfrom
gpetrov/set_order

Conversation

@G-D-Petrov

Copy link
Copy Markdown
Collaborator

Reference Issues/PRs

Monday ticket ref: 10659398340

What does this implement or fix?

When reading data from ArcticDB with output_format="polars", the resulting Polars DataFrame had no sorted flag set on the index column, even though ArcticDB tracks sort order internally via SortedValue in the C++ StreamDescriptor. This means Polars couldn't take advantage of optimized operations (e.g., fast joins, binary search filtering) that rely on knowing a column is sorted.

Changes

C++ layer — propagate SortedValue through the read path:

  • cpp/arcticdb/entity/read_result.hpp — Added a SortedValue sorted field to both NodeReadResult and ReadResult, defaulting to UNKNOWN.
  • cpp/arcticdb/pipeline/pipeline_utils.hpp — In create_python_read_result, extracts the sorted value from the stream descriptor and passes it into NodeReadResult and the top-level ReadResult.
  • cpp/arcticdb/python/adapt_read_dataframe.hpp — Includes sorted (cast to int) in the tuple returned to Python from adapt_read_df.
  • cpp/arcticdb/python/python_utils.hpp — Includes sorted in the tuples produced by node_results_to_python_list and adapt_read_dfs.

Python layer — receive sorted value and apply Polars flag:

  • python/arcticdb/version_store/read_result.pyReadResult and NodeReadResult now accept and store a sorted parameter (converted from int to SortedValue enum).
  • python/arcticdb/version_store/_store.py:
    • Added _get_index_col_from_norm() helper to extract the index column name from normalization metadata (handles pandas DataFrames, Series, and experimental Arrow formats).
    • Added _apply_polars_sorted_flag_to_index() static method that calls polars.col(...).set_sorted() on the index column when the stored sort order is ASCENDING or DESCENDING.
    • Integrated the flag application into _adapt_read_res() so it runs automatically on every read.

Tests:

  • python/tests/unit/arcticdb/version_store/test_polars_set_sorted.py — New test file with 6 tests covering:
    • Unnamed DatetimeIndex__index__ column gets SORTED_ASC
    • Named DatetimeIndex → named column gets SORTED_ASC
    • RangeIndex (not physically stored) → no sorted flag set
    • Pandas output path still works without issues
    • Stage + finalize workflow preserves sorted flag
    • Value columns do not get the sorted flag (only the index does)

Benchmark Results

Before vs After (stock ArcticDB 6.13.0 vs branch dev build)

The benchmark script reads 1M rows as Polars and measures various operations. On stock ArcticDB, the index has no sorted flag (SORTED_ASC: False). With the branch's dev build, the sorted flag is automatically applied (SORTED_ASC: True).

Comparing stock (6.13.0) vs branch (dev) on key index operations:

Benchmark Stock 6.13.0 (ms) Branch dev (ms) Speedup
sort_by_index 18.90 3.06 6.18x
unique_index 3.71 0.69 5.38x
upsample 25.20 8.32 3.03x
filter_range 1.93 1.89 1.02x
filter_after 1.30 1.27 1.02x
group_by_dynamic 9.40 8.83 1.06x
rolling_mean 25.59 24.87 1.03x

The branch delivers significant speedups on sort_by_index (6.2x), unique_index (5.4x), and upsample (3x). These are operations where Polars can skip work entirely when it knows the column is sorted.

Branch dev build: detailed benchmark (Polars 1.35.2, 1M rows, 10 iters)

Note: In the branch build, "Current (no flag)" already has SORTED_ASC: True on the index (the branch's feature), so applying set_sorted() again is a no-op. This confirms the feature is working automatically.

Part 2: Value Columns — Future Work

These measure the potential benefit of also setting set_sorted() on value columns, which would require populating FieldStats.sorted_ in C++. Not implemented in this branch.

Benchmark Current/branch (ms) + index + values (ms) Speedup Notes
sort_by_price 22.71 3.42 6.63x Big win
filter_price_range 0.91 0.79 1.14x
unique_price 4.07 0.70 5.82x Big win
search_sorted_price 0.11 0.10 1.10x
join_asof_on_price 5.81 6.42 0.90x
sort_by_volume 20.77 4.19 4.96x Big win
unique_volume 3.31 0.64 5.14x Big win

Overhead

The set_sorted() call adds negligible overhead (~0.3 ms) relative to the read cost (~399 ms debug build):

Operation Time (ms)
set_sorted (index only) 0.28
set_sorted (index + 2 cols) 0.65

from arcticdb_ext.version_store import PandasOutputFrame
from arcticdb_ext.version_store import PandasOutputFrame, SortedValue
from arcticdb.version_store._normalization import FrameData

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorted shadows the Python built-in sorted(). This can cause subtle bugs in any method that needs to call the built-in while also accessing self.sorted. Suggest renaming to sort_order or sort_value throughout (including the ReadResult counterpart and the C++-side parameter names).

Suggested change
def __init__(self, sym, frame_data, norm, sort_order=SortedValue.UNKNOWN):

assert isinstance(result, pl.DataFrame)
assert result["__index__"].flags["SORTED_ASC"] is True
assert result["sorted_val"].flags["SORTED_ASC"] is False
assert result["another"].flags["SORTED_ASC"] is False

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DESCENDING code path in _apply_polars_sorted_flag_to_index is exercised by production code but not covered by any test here. A simple case — write data with a descending datetime index (or use update() with sort_on_index=False to produce a reverse-sorted symbol) — would verify that result["__index__"].flags["SORTED_DESC"] is True and SORTED_ASC is False. Without this, a regression in the descending=True branch would go undetected.

Comment thread python/arcticdb/version_store/_store.py Outdated
elif input_type in ("df", "series"):
common = norm.df.common if input_type == "df" else norm.series.common
index_type = common.WhichOneof("index_type")
if index_type == "index" and common.index.is_physically_stored:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

WhichOneof("index_type") can also return "multi_index" for MultiIndex DataFrames. In that case index_type != "index" and the function returns None, so no sorted flag is set — which is probably the right conservative choice since multi-level index sorting is more complex. Worth adding a comment here (or in the PR description) acknowledging this as a known limitation so future maintainers don't think it was accidentally missed.

@claude

claude Bot commented Apr 20, 2026

Copy link
Copy Markdown
Contributor

ArcticDB Code Review Summary

PR propagates ArcticDB's internal SortedValue through the read path to call Polars' set_sorted() on the index column, and (new commits 23c0f12, b107ff3) adds O(n) Arrow index sortedness verification so validate_index/update work for Arrow input. Most previously flagged issues are resolved. One new behavioural concern raised below.

Changes since last review (post-rebase commits 23c0f12, b107ff3)

  • ✅ Removed default SortedValue::UNKNOWN args from ReadResult/NodeReadResult ctors — all construction sites now pass an explicit value (addresses IvoDD's default-value worry)
  • static_cast<int>(...) removed at the Python boundary; SortedValue passed directly via its pybind enum (python_bindings.cpp:547) and read_result.py no longer does the int→enum conversion
  • read_index / batch read_index paths now also propagate tsd.sorted()
  • _get_index_col_from_norm replaced by _has_physically_stored_index; multi_index now included (first level flagged) and index assumed first column — validated by real round-trip tests (named index, value columns, multi-index)
  • ✅ New SortednessScan enum threaded through all write paths; scan only runs when sortedness is actually needed (validate_index, update/batch_update always, write_parallel only when not auto-sorting)
  • ✅ Comprehensive new tests: test_validate_index_arrow, test_validate_index_arrow_batch, polars-flag tests for UNKNOWN/UNSORTED/multi-index

Correctness — ⚠️ one concern

  • Arrow write without validate_index now records UNKNOWN (was unconditionally ASCENDING), which append rejects. sorted_data_check_append (version_core.cpp:144-148) requires existing tsd().sorted() == ASCENDING strictly, whereas check_update_data_is_sorted (version_core.cpp:433-434) accepts ASCENDING || UNKNOWN. Result: write(arrow, validate_index=False) then append(arrow, validate_index=True) raises a spurious UnsortedDataException on genuinely-sorted data — diverging from both pandas (records ASCENDING even unvalidated) and the update path. Not covered by tests (append tests seed with validate_index=True). See inline comment on version_store_api.cpp:827.

API & Compatibility

  • ⚠️ On-disk semantic change: Arrow timeseries writes no longer unconditionally record ASCENDING; they record the computed value (when validated) or UNKNOWN. This is a correctness improvement (previously unsorted Arrow data was mislabelled ASCENDING) but should be called out in the PR description as a recorded-metadata behaviour change. Format/enum unchanged; readable by older clients.
  • ✅ No breaking changes to public Python API signatures
  • ✅ Protobuf schema unchanged

Memory & Safety

  • compute_index_sortedness is a simple single-pass scan over the index column; handles empty (→ ASCENDING) and equal-valued (→ ASCENDING) inputs correctly
  • ✅ GIL / RAII handling unchanged at the boundary

Testing

  • ✅ Behavioural paths well covered (write/append/update/stage × validate_index × sorted, single and batch)
  • ⚠️ Missing: write-without-validate followed by append/update-with-validate for Arrow (see Correctness concern)

Documentation

  • validate_index docstrings updated across _store.py / library.py ("For Arrow input data, ArcticDB checks the index column directly")
  • ✅ Polars sorted-flag note consolidated onto OutputFormat.POLARS docs (addresses IvoDD's docs-placement comment)

Possible follow-up (non-blocking)

  • NaT handling: the Arrow scan treats NaT (INT64_MIN) as the smallest value; worth confirming this matches pandas is_monotonic_increasing semantics for indexes containing NaT, for consistency with the NaT work in PR 3085.

@G-D-Petrov G-D-Petrov added patch Small change, should increase patch version enhancement New feature or request labels Apr 20, 2026
@G-D-Petrov G-D-Petrov changed the title Apply set_order for polars dataframes Propagate SortedValue to Polars set_sorted() on index column Apr 20, 2026
assert result[col].flags["SORTED_ASC"] is False, f"Column {col} should not have SORTED_ASC"


def test_sorted_flag_not_set_for_pandas_output(lmdb_library):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this test? We have a lot of other pandas coverage of pandas reads.


result = NativeVersionStore._apply_polars_sorted_flag_to_index(data, read_result)

assert result["__index__"].flags["SORTED_ASC"] is False

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A meta point worth discussing is whether we should treat UNKNOWN as SORTED_ASC.

For all other purposes we treat UNKOWN (written by old arctcdb versions before sortedness was introduced) as sorted. (e.g. for update we use this code).

I'm personally in favor of treating unknown constistently and set sorted even if unkown.

We should instead add a test for UNSORTED


class ReadResult:
def __init__(self, version, frame_data, norm, udm, mmeta, node_read_results):
def __init__(self, version, frame_data, norm, udm, mmeta, node_read_results, sort_order=SortedValue.UNKNOWN):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm slightly worried if we're using a default value we might have missed some place where this is used incorrectly.

Output format for the returned dataframe.
If `None`, uses the output format from the `Library` instance.
See `OutputFormat` documentation for details on available formats.
When using ``POLARS`` output format, the index column (if physically stored) will automatically have

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably move these docs to OutputFormat docs. The same description applies to methods like head and tail all of which reference the OutputFormat docs.

Comment thread python/arcticdb/version_store/_store.py Outdated
return True


def _get_index_col_from_norm(norm):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will get drastically simpler after #3037 as the index column will always be first.

Comment thread python/arcticdb/version_store/_store.py Outdated

@staticmethod
def _apply_polars_sorted_flag_to_index(data, read_result):
if pl is None or not isinstance(data, pl.DataFrame):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think it would be slightly cleaner if we call this only for polars DataFrames here, then this check would not be needed.

Comment thread python/arcticdb/version_store/_store.py Outdated
elif input_type in ("df", "series"):
common = norm.df.common if input_type == "df" else norm.series.common
index_type = common.WhichOneof("index_type")
# multi_index intentionally excluded: multi-level sorting semantics are more complex

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably still want to include multi_index as long as first index is sorted.

pyuser_meta,
multi_key_meta,
std::move(node_results),
static_cast<int>(ret.sorted)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why cast to int? Can't we expose SortedValue in python bindings?

Comment thread cpp/arcticdb/entity/read_result.hpp Outdated
NodeReadResult(
const StreamId& symbol, OutputFrame&& frame_data,
arcticdb::proto::descriptors::NormalizationMetadata&& norm_meta
arcticdb::proto::descriptors::NormalizationMetadata&& norm_meta, SortedValue sorted = SortedValue::UNKNOWN

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Again slightly worried about this having a default value, we might have missed some constructions which should specify a sorted value.

This only matters if we decide to tread UNKNOWN as sorted. Otherwise it's probably fine

Georgi Petrov and others added 6 commits June 17, 2026 13:19
Arrow writes previously hard-coded the index SortedValue to ASCENDING, which
could misreport sortedness (e.g. the polars read path would set_sorted() on an
unsorted column). Arrow input now stores UNKNOWN by default; when a sorted index
is required it walks the index column once (O(n)) to determine the actual sort
order, mirroring how pandas computes it during normalization.

- New SortednessScan{SKIP, SCAN_IF_UNKNOWN} threaded through py_ndf_to_frame /
  record_batches_to_frame / InputFrame::set_segment with no default, so every
  call site is explicit: SCAN_IF_UNKNOWN for write/append/update/stage and the
  batch_write/batch_append/batch_update variants; SKIP for read-from-file,
  partitioned, composite, merge and pandas-tuple paths.
- The walk sets the descriptor's SortedValue, so it both validates the index and
  persists the sort order, which lets the polars set_sorted() flag work for Arrow.
- Fixes batch_update silently skipping the sortedness check.
- Docs: drop the stale "no checks are performed for Arrow input data" note.
- Tests: validate_index x sorted/unsorted across write/append/update/stage and
  the batch variants, plus the polars flag being set only when validation ran.

Cost of the walk relative to the write (release build, single timestamp index):

In-memory (zero-copy write, so this is the worst case for relative overhead):

| rows | batches | no-verify | verify   | overhead |
|------|---------|-----------|----------|----------|
| 1M   | 1       | 3.36 ms   | 4.49 ms  | +34%     |
| 1M   | 100     | 4.12 ms   | 5.30 ms  | +29%     |
| 10M  | 1       | 5.53 ms   | 15.87 ms | +187%    |
| 10M  | 100     | 6.07 ms   | 16.16 ms | +166%    |

Real pure-S3 (network-bound, parallel segment encode/write):

| library          | rows | no-verify  | verify     | overhead |
|------------------|------|------------|------------|----------|
| default segment  | 1M   | 81.7 ms    | 82.6 ms    | +1.1%    |
| default segment  | 10M  | 214.2 ms   | 226.1 ms   | +5.6%    |
| 5000-row segment | 1M   | 329.9 ms   | 346.2 ms   | +1.8%    |
| 5000-row segment | 10M  | 1699.9 ms  | 1726.8 ms  | +1.6%    |

The walk is only paid when validate_index requests it; on real storage it is a
low single-digit percentage of the write, and more segment parallelism dilutes
it further rather than amplifying it.
@IvoDD IvoDD force-pushed the gpetrov/set_order branch from 72e11c9 to b107ff3 Compare June 18, 2026 15:23
norm,
user_meta,
cfg().write_options().empty_types(),
sortedness_scan_for(validate_index)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recording UNKNOWN for non-validated Arrow writes (the SortednessScan::SKIP default) interacts badly with append's existing-data check. sorted_data_check_append (version_core.cpp:144-148) requires the existing data's tsd().sorted() == SortedValue::ASCENDING strictly — unlike check_update_data_is_sorted (version_core.cpp:433-434), which accepts ASCENDING || UNKNOWN.

So this sequence now raises a spurious UnsortedDataException on genuinely-sorted data:

lib.write("sym", sorted_arrow_table, index_column=True)            # validate_index=False -> descriptor UNKNOWN
lib.append("sym", more_sorted_arrow, index_column=True, validate_index=True)  # "existing data must be sorted" -> raises

The equivalent pandas workflow succeeds (pandas records ASCENDING even without validate_index), and the equivalent update succeeds (UNKNOWN existing allowed). The existing append tests don't catch this because they seed the symbol with validate_index=True.

Please either (a) seed the append/update tests with a non-validated initial write to cover this case, and (b) align sorted_data_check_append to accept UNKNOWN existing data like the update path does — or confirm this stricter Arrow behaviour is intended and document the pandas/Arrow divergence.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request patch Small change, should increase patch version

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants