[ EXPERIMENTAL ] Rename API For Col Stats To _experimental And Update Logic by markovskipetar · Pull Request #3141 · man-group/ArcticDB

markovskipetar · 2026-06-01T09:01:41Z

Experimental: rename `create_column_stats` and default to all numeric columns

Summary

Rename NativeVersionStore.create_column_stats to create_column_stats_experimental (Python-only — no C++ rename) and make the column_stats argument optional. When omitted, the call computes MINMAX for every column of the symbol, instead of forcing the user to specify columns.

Before: column dict was required

lib.write(sym, df)
lib.create_column_stats(sym, {"col_1": {"MINMAX"}, "col_2": {"MINMAX"}, ...})   # mandatory dict
stats = lib.read_column_stats(sym)

The user had to (a) remember to call it after write, and (b) hand-roll the per-column dict.

Now: dict is optional, default = every supported column

lib.write(sym, df)
lib.create_column_stats_experimental(sym)              # auto-MINMAX over every numeric/temporal col
stats = lib.read_column_stats(sym)

Explicit dicts still work unchanged:

lib.create_column_stats_experimental(sym, {"col_1": {"MINMAX"}})

Changes

`python/arcticdb/version_store/_store.py`

create_column_stats → create_column_stats_experimental.
column_stats: Dict[str, Set[str]] is now Optional[..] = None.
New private helper _auto_column_stats(symbol, as_of): reads the symbol descriptor via get_info and returns {col: {"MINMAX"}} for every column
Cross-references in drop_column_stats and get_column_stats_info docstrings updated to point at the new name.

Co-authored-by: Copilot <copilot@github.com>

github-actions · 2026-06-01T09:01:53Z

Label error. Requires exactly 1 of: patch, minor, major. Found:

claude · 2026-06-01T09:06:10Z

+        ARCTICDB_DEBUG(log::version(), "Auto column stats skipped (user input): {}", e.what());
+    }
+}
+


The catch list is too narrow. create_column_stats_impl can throw beyond SchemaException/UserInputException:

storage::NoDataFoundException if the index key read fails (e.g., transient storage error).

internal::check failures (e.g., the Could not unpack metadata of old_header util::check).

The final store->update(column_stats_key, ...).get() will throw any StorageException subtype on a network/quota/permission failure.

Because this is called after the data write has already completed, any uncaught exception here makes the user's write/append/update appear to fail even though the symbol was written successfully — a confusing partial-success state. Either:

Catch std::exception (or ArcticBaseException) at this boundary and log a warning so the data write is never observable as a failure, or

Document and test the failure semantics explicitly.

Given the goal is "experiment, don't break user writes" the broader catch is the safer call.

claude · 2026-06-01T09:06:16Z

            async_write_dataframe_impl(store, version_id, frame, options, de_dup_map, sparsify_floats, validate_index);
-    return {std::move(atom_key_fut).get()};
+    VersionedItem versioned_item{std::move(atom_key_fut).get()};
+    create_auto_column_stats_impl_experimental(store, versioned_item, frame->desc());


The hook only fires from the sync write_dataframe_impl / append_impl / update_impl paths. Batch operations in local_versioned_engine.cpp (batch_write_versioned_dataframe_internal at L1790, batch_append_internal at L1946/1958, batch_update_internal at L2031/2048) call async_write_dataframe_impl / async_append_impl / async_update_impl directly and bypass this hook entirely.

That means lib.write(sym, df) auto-creates stats, but lib.write_batch([sym], [df]) does not — same data, different outcome depending on which API the user picks. If the goal of the experiment is to measure the end-to-end impact, the inconsistency makes the measurement misleading and gives users a footgun (silently divergent on-disk state between batch and non-batch writes).

Either also hook the async paths (probably via .thenValue after the AtomKey future) or document the limitation prominently in the PR description.

claude · 2026-06-01T09:06:22Z

+    size_t pos = 0;
+
+    for (const auto& field : desc.fields()) {
+        const size_t this_pos = pos++;
+
+        if (this_pos < index_field_count) {
+            continue;
+        }
+        if (!arcticdb::is_numeric_type(field.type().data_type())) {
+            continue;
+        }
+
+        if (field.name().empty()) {
+            continue;
+        }
+
+        stats_map.emplace(std::move(std::string(field.name())), std::unordered_set<std::string>{"MINMAX"});
+    }
+
+    if (stats_map.empty()) {
+        return std::nullopt;
+    }
+
+    return ColumnStats(stats_map);
+}
+
+// EXPERIMENTAL: invoked automatically after every write/append/update. Builds
+// the auto stats from the input frame's descriptor and delegates to


No handling for dynamic_schema. When dynamic_schema=true, an append can introduce new columns whose offsets differ from the index segment's descriptor. create_column_stats_impl then resolves names against the merged TimeseriesDescriptor via ColumnStats::calculate_offsets (column_stats.cpp:183), and on a mismatch — or when stats already exist for this version with different offsets — it goes through the column_stats.drop(old_column_stats, false) merge path which assumes consistent offsets.

This isn't exercised by the new test (which uses static schema). Please add a dynamic_schema=True case with append introducing a new numeric column, and verify the auto-stats are correct (or at least that the write doesn't fail).

Also note: index_field_count is taken from the just-written frame's descriptor, but create_column_stats_impl later resolves names against the index key's TSD. If those differ (e.g., column reordering under dynamic schema), the skip count is computed against a different descriptor than the one used to look up the columns — risk of off-by-one against the wrong layout.

claude · 2026-06-01T09:06:25Z

    );
    ARCTICDB_DEBUG(
            log::version(), "updated stream_id: {} , version_id: {}", frame->desc().id(), update_info.next_version_id_
    );


Hook added to update_impl, but the new test only exercises write + append. update_impl is the most complex of the three (partial-row replacement, intersecting segments, dynamic schema branches) and is the most likely to surface edge cases — please add a test that exercises an update and verifies the resulting auto-stats reflect the post-update data (not just the updating frame's range).

claude · 2026-06-01T09:06:31Z

@@ -97,7 +157,9 @@
    );
    auto atom_key_fut =
            async_write_dataframe_impl(store, version_id, frame, options, de_dup_map, sparsify_floats, validate_index);
-    return {std::move(atom_key_fut).get()};
+    VersionedItem versioned_item{std::move(atom_key_fut).get()};
+    create_auto_column_stats_impl_experimental(store, versioned_item, frame->desc());


Always-on with no opt-out. Even framed as an experiment, this changes the cost profile of every write/append/update for every user upgrading past this commit: an extra full-symbol read, MINMAX compute, and COLUMN_STATS key write per call. For large frames or high-throughput writers this is a significant regression with no way to disable it short of patching.

Strongly recommend gating behind a runtime flag — either a WriteOptions field or a config setting (e.g. write_options.auto_create_column_stats) defaulting to false. That preserves the experimental intent (you can flip it on in benchmark runs and measure cleanly) without imposing the cost on everyone who happens to build from this branch. It also gives you a clean revert/cleanup path if the experiment doesn't pan out.

claude · 2026-06-01T09:06:59Z

ArcticDB Code Review Summary

Latest commits (delta): this delta reverts the experimental date_range/row_range range feature and the range-restricted RMW path. It removes merge_column_stats_with_range_replacement (column_stats.cpp/.hpp), the is_filtered branch in create_column_stats_impl, and the read_query parameter from the C++ create_column_stats_version / create_column_stats_version_internal / create_column_stats_internal / create_column_stats_impl signatures, plus the date_range/row_range kwargs from create_column_stats_experimental in _store.py. The C++ side is now internally consistent — but the Python revert was left incomplete.

Correctness

❌ New regression — TypeError on the headline path. _store.py:1310 still calls self.version_store.create_column_stats_version(symbol, column_stats, version_query, read_query) with 4 positional args, but this delta removed read_query from the C++ method (version_store_api.hpp:107 → now (stream_id, column_stats, version_query)), and python_bindings.cpp:663 binds the method pointer directly. Calling with 4 args raises TypeError: incompatible function arguments. The dead read_query = _PythonVersionStoreReadQuery() at _store.py:1308 must also be dropped. This breaks every call to create_column_stats_experimental (including the default column_stats=None path). (Mirror of the earlier-flagged TypeError, now re-introduced by the partial revert.)
✅ Resolved (delta). The previous new_segment / new_col_stats_segment compile error in create_column_stats_impl is gone — the revert restored consistent new_segment naming throughout version_core.cpp.
✅ Resolved (delta). The range-restricted RMW mixed-stat-presence concern is moot — the is_filtered path and merge_column_stats_with_range_replacement are removed. No dangling references remain.

Testing

❌ Still no test exercising create_column_stats_experimental. A trivial test of the default path (create_column_stats_experimental(sym)) and an explicit column_stats={...} call would have caught the TypeError above. Per the repo TDD guideline, add these before merging.

Documentation

⚠️ _store.py docstring correctly dropped the date_range/row_range parameters. Confirm docs/claude/ and any user-facing docs no longer reference the now-removed range parameters.

Style

⚠️ Trailing whitespace remains in _store.py (the if column_stats is None: line and the blank line before the assignment). Run make lint.

… all columns brnach

claude · 2026-06-04T14:04:25Z

+
+        self.version_store.create_column_stats_version(symbol, column_stats, version_query, read_query)
+
+    def get_eligible_column_stats_spec(


This method is defined as get_eligible_column_stats_spec (no leading underscore), but it is called as self._get_eligible_column_stats_spec(symbol, as_of) at line 1319. The call will raise AttributeError — and this is exactly the default code path (column_stats is None) that is the headline feature of this PR ("default to all numeric columns"). Rename the definition to match the caller (it was previously the private _auto_column_stats):

Suggested change

def get_eligible_column_stats_spec(

def _get_eligible_column_stats_spec(

…ts_version call

claude · 2026-06-05T07:10:08Z

-        self.version_store.create_column_stats_version(symbol, column_stats, version_query)
+        read_query = _PythonVersionStoreReadQuery()
+
+        self.version_store.create_column_stats_version(symbol, column_stats, version_query, read_query)


This call will raise TypeError at runtime. This commit removed the read_query parameter from the C++ PythonVersionStore::create_column_stats_version (now (stream_id, column_stats, version_query) — see version_store_api.hpp:107), and the pybind binding binds the method pointer directly (python_bindings.cpp:683). But this line still passes a 4th positional argument read_query, so every call to create_column_stats_experimental will fail with incompatible function arguments.

Drop the now-unused read_query construction (line 1308) and the argument here:

Suggested change

self.version_store.create_column_stats_version(symbol, column_stats, version_query, read_query)

self.version_store.create_column_stats_version(symbol, column_stats, version_query)

(also remove read_query = _PythonVersionStoreReadQuery() above). This is the headline default code path — a test exercising create_column_stats_experimental(sym) would have caught it.

claude · 2026-06-05T07:45:24Z

        validate_column_stats_header_version(old_header);
        arcticc::pb2::column_stats_pb2::ColumnStatsHeader new_header;
-        unpacked = new_segment.metadata()->UnpackTo(&new_header);
+        unpacked = new_col_stats_segment.metadata()->UnpackTo(&new_header);


Compile error introduced by the rename in this commit. Seven lines above (the if (!old_metadata) branch, line 2097) still references the old variable name:

store->update(column_stats_key, std::move(new_segment), update_opts).get();

new_segment no longer exists — it was renamed to new_col_stats_segment (declared at line 2051, updated everywhere else). This file will not compile. Rename this last reference to new_col_stats_segment.

…ol stats

…umn stats

…tal"

… and skip index columns

…ons of the functions and change the parameters passed to them

poodlewars · 2026-06-16T10:06:04Z

-        lib.create_column_stats(sym, column_stats_dict)
-
-
-def test_column_stats_multiple_creates(version_store_factory, lib_name, encoding_version, any_output_format):


Why removing this one?

test_column_stats_multiple_creates was removed because its purpose was to call create column stats with different column across different calls: {"col_1": {"MINMAX"}}, then {"col_2": {"MINMAX"}} and verify the merge was successful. But whit the new updated API, we cannot achieve this behavior because we create stats over all columns, without specifying particular columns.

test_column_stats_string_column_minmax was removed because its purpose was to test that creating stats over string column raises SchemaException. But with the new update API, the python layer iterates over all column and collects only the eligible columns for stats (numeric and bools)

poodlewars · 2026-06-16T10:07:35Z

-    The inner index level can be referenced by the user-facing name ("level"), the mangled name ("__idx__level"),
-    or the __fkidx__ name for unnamed index levels."""
+    """Column stats on a multiindex DataFrame: auto-discovery picks up data columns; the inner
+    index level is not included by default."""


This is wrong, we definitely need to create stats over the inner index levels.

I don't think this comment has been addressed properly, we still want to create stats over the multi-index columns,

expected = index_columns_to_pl(lib, sym).with_columns( pl.Series(f"v1_MIN({stored_col_name})", [10, 30]), pl.Series(f"v1_MAX({stored_col_name})", [20, 40]), pl.Series("v1_MIN(val)", [100, 300]), pl.Series("v1_MAX(val)", [200, 400]), )

But this test just has:

expected = index_columns_to_pl(lib, sym).with_columns( pl.Series("v1_MIN(val)", [100, 300]), pl.Series("v1_MAX(val)", [200, 400]), )

…ng col stats

markovskipetar and others added 2 commits June 1, 2026 09:52

Implement auto col stats creation

03daef3

Co-authored-by: Copilot <copilot@github.com>

Add auto col stats test

9bf8581

markovskipetar requested review from IvoDD, alexowens90 and poodlewars as code owners June 1, 2026 09:01

claude Bot reviewed Jun 1, 2026

View reviewed changes

markovskipetar added 3 commits June 2, 2026 10:20

revert version_core

4558fb7

rename col_stats to col_stats_experimental and add creating stats for…

e7fff4e

… all columns brnach

remove test

0ae504a

markovskipetar changed the title ~~[ EXPERIMENTAL ] Auto Columns Stats Creation Over All Columns~~ [ EXPERIMENTAL ] Rename API For Col Stats To _experimental And Update Logic Jun 2, 2026

claude Bot reviewed Jun 4, 2026

View reviewed changes

Refactor column_stats helpers and add read_query to create_column_sta…

1cafa99

…ts_version call

markovskipetar force-pushed the experimental_auto_columns_stats_creation branch 2 times, most recently from 18f1128 to 1cafa99 Compare June 5, 2026 07:07

claude Bot reviewed Jun 5, 2026

View reviewed changes

markovskipetar force-pushed the experimental_auto_columns_stats_creation branch from 542f033 to 1cafa99 Compare June 5, 2026 07:59

markovskipetar added 5 commits June 5, 2026 10:05

remove optional parameter column_stats and update logic from create c…

808f0e9

…ol stats

remove optional parameter column_stats and update logic from drop col…

d57bd13

…umn stats

rename all columns stats API functions to: current name + "_experimen…

181e559

…tal"

Refactor '_store.py' : Add bool columns to eligible columns for stats…

10bf231

… and skip index columns

Refactor usage of col stats API functions - use the new renamed versi…

0ffd108

…ons of the functions and change the parameters passed to them

poodlewars requested changes Jun 16, 2026

View reviewed changes

Add implementation for adding inner multi index columns for calculati…

fe05a0a

…ng col stats


		self.version_store.create_column_stats_version(symbol, column_stats, version_query, read_query)

		def get_eligible_column_stats_spec(

	def get_eligible_column_stats_spec(
	def _get_eligible_column_stats_spec(

		lib.create_column_stats(sym, column_stats_dict)


		def test_column_stats_multiple_creates(version_store_factory, lib_name, encoding_version, any_output_format):

Conversation

markovskipetar commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Experimental: rename create_column_stats and default to all numeric columns

Summary

Before: column dict was required

Now: dict is optional, default = every supported column

Changes

python/arcticdb/version_store/_store.py

Uh oh!

github-actions Bot commented Jun 1, 2026

Uh oh!

claude Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 1, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

ArcticDB Code Review Summary

Uh oh!

claude Bot Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

poodlewars Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

markovskipetar Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

poodlewars Jun 16, 2026

Choose a reason for hiding this comment

Uh oh!

poodlewars Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

markovskipetar commented Jun 1, 2026 •

edited

Loading

Experimental: rename `create_column_stats` and default to all numeric columns

`python/arcticdb/version_store/_store.py`

claude Bot commented Jun 1, 2026 •

edited

Loading