Propagate SortedValue to Polars set_sorted() on index column by G-D-Petrov · Pull Request #3049 · man-group/ArcticDB

G-D-Petrov · 2026-04-20T09:40:43Z

Reference Issues/PRs

Monday ticket ref: 10659398340

What does this implement or fix?

When reading data from ArcticDB with output_format="polars", the resulting Polars DataFrame had no sorted flag set on the index column, even though ArcticDB tracks sort order internally via SortedValue in the C++ StreamDescriptor. This means Polars couldn't take advantage of optimized operations (e.g., fast joins, binary search filtering) that rely on knowing a column is sorted.

Changes

C++ layer — propagate SortedValue through the read path:

cpp/arcticdb/entity/read_result.hpp — Added a SortedValue sorted field to both NodeReadResult and ReadResult, defaulting to UNKNOWN.
cpp/arcticdb/pipeline/pipeline_utils.hpp — In create_python_read_result, extracts the sorted value from the stream descriptor and passes it into NodeReadResult and the top-level ReadResult.
cpp/arcticdb/python/adapt_read_dataframe.hpp — Includes sorted (cast to int) in the tuple returned to Python from adapt_read_df.
cpp/arcticdb/python/python_utils.hpp — Includes sorted in the tuples produced by node_results_to_python_list and adapt_read_dfs.

Python layer — receive sorted value and apply Polars flag:

python/arcticdb/version_store/read_result.py — ReadResult and NodeReadResult now accept and store a sorted parameter (converted from int to SortedValue enum).
python/arcticdb/version_store/_store.py:
- Added _get_index_col_from_norm() helper to extract the index column name from normalization metadata (handles pandas DataFrames, Series, and experimental Arrow formats).
- Added _apply_polars_sorted_flag_to_index() static method that calls polars.col(...).set_sorted() on the index column when the stored sort order is ASCENDING or DESCENDING.
- Integrated the flag application into _adapt_read_res() so it runs automatically on every read.

Tests:

python/tests/unit/arcticdb/version_store/test_polars_set_sorted.py — New test file with 6 tests covering:
- Unnamed DatetimeIndex → __index__ column gets SORTED_ASC
- Named DatetimeIndex → named column gets SORTED_ASC
- RangeIndex (not physically stored) → no sorted flag set
- Pandas output path still works without issues
- Stage + finalize workflow preserves sorted flag
- Value columns do not get the sorted flag (only the index does)

Benchmark Results

Before vs After (stock ArcticDB 6.13.0 vs branch dev build)

The benchmark script reads 1M rows as Polars and measures various operations. On stock ArcticDB, the index has no sorted flag (SORTED_ASC: False). With the branch's dev build, the sorted flag is automatically applied (SORTED_ASC: True).

Comparing stock (6.13.0) vs branch (dev) on key index operations:

Benchmark	Stock 6.13.0 (ms)	Branch dev (ms)	Speedup
sort_by_index	18.90	3.06	6.18x
unique_index	3.71	0.69	5.38x
upsample	25.20	8.32	3.03x
filter_range	1.93	1.89	1.02x
filter_after	1.30	1.27	1.02x
group_by_dynamic	9.40	8.83	1.06x
rolling_mean	25.59	24.87	1.03x

The branch delivers significant speedups on sort_by_index (6.2x), unique_index (5.4x), and upsample (3x). These are operations where Polars can skip work entirely when it knows the column is sorted.

Branch dev build: detailed benchmark (Polars 1.35.2, 1M rows, 10 iters)

Note: In the branch build, "Current (no flag)" already has SORTED_ASC: True on the index (the branch's feature), so applying set_sorted() again is a no-op. This confirms the feature is working automatically.

Part 2: Value Columns — Future Work

These measure the potential benefit of also setting set_sorted() on value columns, which would require populating FieldStats.sorted_ in C++. Not implemented in this branch.

Benchmark	Current/branch (ms)	+ index + values (ms)	Speedup	Notes
sort_by_price	22.71	3.42	6.63x	Big win
filter_price_range	0.91	0.79	1.14x
unique_price	4.07	0.70	5.82x	Big win
search_sorted_price	0.11	0.10	1.10x
join_asof_on_price	5.81	6.42	0.90x
sort_by_volume	20.77	4.19	4.96x	Big win
unique_volume	3.31	0.64	5.14x	Big win

Overhead

The set_sorted() call adds negligible overhead (~0.3 ms) relative to the read cost (~399 ms debug build):

Operation	Time (ms)
`set_sorted` (index only)	0.28
`set_sorted` (index + 2 cols)	0.65

claude · 2026-04-20T09:44:53Z

-from arcticdb_ext.version_store import PandasOutputFrame
+from arcticdb_ext.version_store import PandasOutputFrame, SortedValue
 from arcticdb.version_store._normalization import FrameData



sorted shadows the Python built-in sorted(). This can cause subtle bugs in any method that needs to call the built-in while also accessing self.sorted. Suggest renaming to sort_order or sort_value throughout (including the ReadResult counterpart and the C++-side parameter names).

Suggested change

def __init__(self, sym, frame_data, norm, sort_order=SortedValue.UNKNOWN):

claude · 2026-04-20T09:44:58Z

+    assert isinstance(result, pl.DataFrame)
+    assert result["__index__"].flags["SORTED_ASC"] is True
+    assert result["sorted_val"].flags["SORTED_ASC"] is False
+    assert result["another"].flags["SORTED_ASC"] is False


The DESCENDING code path in _apply_polars_sorted_flag_to_index is exercised by production code but not covered by any test here. A simple case — write data with a descending datetime index (or use update() with sort_on_index=False to produce a reverse-sorted symbol) — would verify that result["__index__"].flags["SORTED_DESC"] is True and SORTED_ASC is False. Without this, a regression in the descending=True branch would go undetected.

claude · 2026-04-20T09:45:01Z

+    elif input_type in ("df", "series"):
+        common = norm.df.common if input_type == "df" else norm.series.common
+        index_type = common.WhichOneof("index_type")
+        if index_type == "index" and common.index.is_physically_stored:


WhichOneof("index_type") can also return "multi_index" for MultiIndex DataFrames. In that case index_type != "index" and the function returns None, so no sorted flag is set — which is probably the right conservative choice since multi-level index sorting is more complex. Worth adding a comment here (or in the PR description) acknowledging this as a known limitation so future maintainers don't think it was accidentally missed.

claude · 2026-04-20T09:45:33Z

ArcticDB Code Review Summary

PR propagates ArcticDB's internal SortedValue through the read path to call Polars' set_sorted() on the index column, and (new commits 23c0f12, b107ff3) adds O(n) Arrow index sortedness verification so validate_index/update work for Arrow input. Most previously flagged issues are resolved. One new behavioural concern raised below.

Changes since last review (post-rebase commits `23c0f12`, `b107ff3`)

✅ Removed default SortedValue::UNKNOWN args from ReadResult/NodeReadResult ctors — all construction sites now pass an explicit value (addresses IvoDD's default-value worry)
✅ static_cast<int>(...) removed at the Python boundary; SortedValue passed directly via its pybind enum (python_bindings.cpp:547) and read_result.py no longer does the int→enum conversion
✅ read_index / batch read_index paths now also propagate tsd.sorted()
✅ _get_index_col_from_norm replaced by _has_physically_stored_index; multi_index now included (first level flagged) and index assumed first column — validated by real round-trip tests (named index, value columns, multi-index)
✅ New SortednessScan enum threaded through all write paths; scan only runs when sortedness is actually needed (validate_index, update/batch_update always, write_parallel only when not auto-sorting)
✅ Comprehensive new tests: test_validate_index_arrow, test_validate_index_arrow_batch, polars-flag tests for UNKNOWN/UNSORTED/multi-index

Correctness — ⚠️ one concern

❌ Arrow write without validate_index now records UNKNOWN (was unconditionally ASCENDING), which append rejects. sorted_data_check_append (version_core.cpp:144-148) requires existing tsd().sorted() == ASCENDING strictly, whereas check_update_data_is_sorted (version_core.cpp:433-434) accepts ASCENDING || UNKNOWN. Result: write(arrow, validate_index=False) then append(arrow, validate_index=True) raises a spurious UnsortedDataException on genuinely-sorted data — diverging from both pandas (records ASCENDING even unvalidated) and the update path. Not covered by tests (append tests seed with validate_index=True). See inline comment on version_store_api.cpp:827.

API & Compatibility

⚠️ On-disk semantic change: Arrow timeseries writes no longer unconditionally record ASCENDING; they record the computed value (when validated) or UNKNOWN. This is a correctness improvement (previously unsorted Arrow data was mislabelled ASCENDING) but should be called out in the PR description as a recorded-metadata behaviour change. Format/enum unchanged; readable by older clients.
✅ No breaking changes to public Python API signatures
✅ Protobuf schema unchanged

Memory & Safety

✅ compute_index_sortedness is a simple single-pass scan over the index column; handles empty (→ ASCENDING) and equal-valued (→ ASCENDING) inputs correctly
✅ GIL / RAII handling unchanged at the boundary

Testing

✅ Behavioural paths well covered (write/append/update/stage × validate_index × sorted, single and batch)
⚠️ Missing: write-without-validate followed by append/update-with-validate for Arrow (see Correctness concern)

Documentation

✅ validate_index docstrings updated across _store.py / library.py ("For Arrow input data, ArcticDB checks the index column directly")
✅ Polars sorted-flag note consolidated onto OutputFormat.POLARS docs (addresses IvoDD's docs-placement comment)

Possible follow-up (non-blocking)

NaT handling: the Arrow scan treats NaT (INT64_MIN) as the smallest value; worth confirming this matches pandas is_monotonic_increasing semantics for indexes containing NaT, for consistency with the NaT work in PR 3085.

IvoDD · 2026-04-21T14:09:59Z

+        assert result[col].flags["SORTED_ASC"] is False, f"Column {col} should not have SORTED_ASC"
+
+
+def test_sorted_flag_not_set_for_pandas_output(lmdb_library):


Do we need this test? We have a lot of other pandas coverage of pandas reads.

IvoDD · 2026-04-21T14:37:32Z

+
+    result = NativeVersionStore._apply_polars_sorted_flag_to_index(data, read_result)
+
+    assert result["__index__"].flags["SORTED_ASC"] is False


A meta point worth discussing is whether we should treat UNKNOWN as SORTED_ASC.

For all other purposes we treat UNKOWN (written by old arctcdb versions before sortedness was introduced) as sorted. (e.g. for update we use this code).

I'm personally in favor of treating unknown constistently and set sorted even if unkown.

We should instead add a test for UNSORTED

IvoDD · 2026-04-21T14:39:17Z


 class ReadResult:
-    def __init__(self, version, frame_data, norm, udm, mmeta, node_read_results):
+    def __init__(self, version, frame_data, norm, udm, mmeta, node_read_results, sort_order=SortedValue.UNKNOWN):


I'm slightly worried if we're using a default value we might have missed some place where this is used incorrectly.

IvoDD · 2026-04-21T14:42:51Z

            Output format for the returned dataframe.
            If `None`, uses the output format from the `Library` instance.
            See `OutputFormat` documentation for details on available formats.
+            When using ``POLARS`` output format, the index column (if physically stored) will automatically have


We should probably move these docs to OutputFormat docs. The same description applies to methods like head and tail all of which reference the OutputFormat docs.

IvoDD · 2026-04-21T14:44:19Z

        return True


+def _get_index_col_from_norm(norm):


This will get drastically simpler after #3037 as the index column will always be first.

IvoDD · 2026-04-21T14:47:25Z


+    @staticmethod
+    def _apply_polars_sorted_flag_to_index(data, read_result):
+        if pl is None or not isinstance(data, pl.DataFrame):


Nit: I think it would be slightly cleaner if we call this only for polars DataFrames here, then this check would not be needed.

IvoDD · 2026-04-21T14:48:03Z

+    elif input_type in ("df", "series"):
+        common = norm.df.common if input_type == "df" else norm.series.common
+        index_type = common.WhichOneof("index_type")
+        # multi_index intentionally excluded: multi-level sorting semantics are more complex


We probably still want to include multi_index as long as first index is sorted.

IvoDD · 2026-04-21T14:49:29Z

+            pyuser_meta,
+            multi_key_meta,
+            std::move(node_results),
+            static_cast<int>(ret.sorted)


Why cast to int? Can't we expose SortedValue in python bindings?

IvoDD · 2026-04-21T14:52:16Z

    NodeReadResult(
            const StreamId& symbol, OutputFrame&& frame_data,
-            arcticdb::proto::descriptors::NormalizationMetadata&& norm_meta
+            arcticdb::proto::descriptors::NormalizationMetadata&& norm_meta, SortedValue sorted = SortedValue::UNKNOWN


Nit: Again slightly worried about this having a default value, we might have missed some constructions which should specify a sorted value.

This only matters if we decide to tread UNKNOWN as sorted. Otherwise it's probably fine

Arrow writes previously hard-coded the index SortedValue to ASCENDING, which could misreport sortedness (e.g. the polars read path would set_sorted() on an unsorted column). Arrow input now stores UNKNOWN by default; when a sorted index is required it walks the index column once (O(n)) to determine the actual sort order, mirroring how pandas computes it during normalization. - New SortednessScan{SKIP, SCAN_IF_UNKNOWN} threaded through py_ndf_to_frame / record_batches_to_frame / InputFrame::set_segment with no default, so every call site is explicit: SCAN_IF_UNKNOWN for write/append/update/stage and the batch_write/batch_append/batch_update variants; SKIP for read-from-file, partitioned, composite, merge and pandas-tuple paths. - The walk sets the descriptor's SortedValue, so it both validates the index and persists the sort order, which lets the polars set_sorted() flag work for Arrow. - Fixes batch_update silently skipping the sortedness check. - Docs: drop the stale "no checks are performed for Arrow input data" note. - Tests: validate_index x sorted/unsorted across write/append/update/stage and the batch variants, plus the polars flag being set only when validation ran. Cost of the walk relative to the write (release build, single timestamp index): In-memory (zero-copy write, so this is the worst case for relative overhead): | rows | batches | no-verify | verify | overhead | |------|---------|-----------|----------|----------| | 1M | 1 | 3.36 ms | 4.49 ms | +34% | | 1M | 100 | 4.12 ms | 5.30 ms | +29% | | 10M | 1 | 5.53 ms | 15.87 ms | +187% | | 10M | 100 | 6.07 ms | 16.16 ms | +166% | Real pure-S3 (network-bound, parallel segment encode/write): | library | rows | no-verify | verify | overhead | |------------------|------|------------|------------|----------| | default segment | 1M | 81.7 ms | 82.6 ms | +1.1% | | default segment | 10M | 214.2 ms | 226.1 ms | +5.6% | | 5000-row segment | 1M | 329.9 ms | 346.2 ms | +1.8% | | 5000-row segment | 10M | 1699.9 ms | 1726.8 ms | +1.6% | The walk is only paid when validate_index requests it; on real storage it is a low single-digit percentage of the write, and more segment parallelism dilutes it further rather than amplifying it.

claude · 2026-06-18T15:29:42Z

+                    norm,
+                    user_meta,
+                    cfg().write_options().empty_types(),
+                    sortedness_scan_for(validate_index)


Recording UNKNOWN for non-validated Arrow writes (the SortednessScan::SKIP default) interacts badly with append's existing-data check. sorted_data_check_append (version_core.cpp:144-148) requires the existing data's tsd().sorted() == SortedValue::ASCENDING strictly — unlike check_update_data_is_sorted (version_core.cpp:433-434), which accepts ASCENDING || UNKNOWN.

So this sequence now raises a spurious UnsortedDataException on genuinely-sorted data:

lib.write("sym", sorted_arrow_table, index_column=True) # validate_index=False -> descriptor UNKNOWN lib.append("sym", more_sorted_arrow, index_column=True, validate_index=True) # "existing data must be sorted" -> raises

The equivalent pandas workflow succeeds (pandas records ASCENDING even without validate_index), and the equivalent update succeeds (UNKNOWN existing allowed). The existing append tests don't catch this because they seed the symbol with validate_index=True.

Please either (a) seed the append/update tests with a non-validated initial write to cover this case, and (b) align sorted_data_check_append to accept UNKNOWN existing data like the update path does — or confirm this stricter Arrow behaviour is intended and document the pandas/Arrow divergence.

G-D-Petrov requested review from IvoDD, alexowens90 and poodlewars as code owners April 20, 2026 09:40

claude Bot reviewed Apr 20, 2026

View reviewed changes

G-D-Petrov added patch Small change, should increase patch version enhancement New feature or request labels Apr 20, 2026

G-D-Petrov changed the title ~~Apply set_order for polars dataframes~~ Propagate SortedValue to Polars set_sorted() on index column Apr 20, 2026

IvoDD reviewed Apr 21, 2026

View reviewed changes

Georgi Petrov and others added 6 commits June 17, 2026 13:19

Apply set_order for polars dataframes

b729552

Fix test failures

b21c697

Address PR comments

ea28d54

Fix tests

8d372dc

Address my (ivo's) comments

23c0f12

IvoDD force-pushed the gpetrov/set_order branch from 72e11c9 to b107ff3 Compare June 18, 2026 15:23

claude Bot reviewed Jun 18, 2026

View reviewed changes


	def __init__(self, sym, frame_data, norm, sort_order=SortedValue.UNKNOWN):

		assert result[col].flags["SORTED_ASC"] is False, f"Column {col} should not have SORTED_ASC"


		def test_sorted_flag_not_set_for_pandas_output(lmdb_library):


		result = NativeVersionStore._apply_polars_sorted_flag_to_index(data, read_result)

		assert result["__index__"].flags["SORTED_ASC"] is False

Conversation

G-D-Petrov commented Apr 20, 2026

Reference Issues/PRs

What does this implement or fix?

Changes

Benchmark Results

Before vs After (stock ArcticDB 6.13.0 vs branch dev build)

Branch dev build: detailed benchmark (Polars 1.35.2, 1M rows, 10 iters)

Part 2: Value Columns — Future Work

Overhead

Uh oh!

claude Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Apr 20, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

ArcticDB Code Review Summary

Changes since last review (post-rebase commits 23c0f12, b107ff3)

Correctness — ⚠️ one concern

API & Compatibility

Memory & Safety

Testing

Documentation

Possible follow-up (non-blocking)

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented Apr 20, 2026 •

edited

Loading

Changes since last review (post-rebase commits `23c0f12`, `b107ff3`)