Skip to content

[ EXPERIMENTAL ] Rename API For Col Stats To _experimental And Update Logic#3141

Open
markovskipetar wants to merge 12 commits into
masterfrom
experimental_auto_columns_stats_creation
Open

[ EXPERIMENTAL ] Rename API For Col Stats To _experimental And Update Logic#3141
markovskipetar wants to merge 12 commits into
masterfrom
experimental_auto_columns_stats_creation

Conversation

@markovskipetar

@markovskipetar markovskipetar commented Jun 1, 2026

Copy link
Copy Markdown
Collaborator

Experimental: rename create_column_stats and default to all numeric columns

Summary

Rename NativeVersionStore.create_column_stats to create_column_stats_experimental (Python-only — no C++ rename) and make the column_stats argument optional. When omitted, the call computes MINMAX for every column of the symbol, instead of forcing the user to specify columns.

Before: column dict was required

lib.write(sym, df)
lib.create_column_stats(sym, {"col_1": {"MINMAX"}, "col_2": {"MINMAX"}, ...})   # mandatory dict
stats = lib.read_column_stats(sym)

The user had to (a) remember to call it after write, and (b) hand-roll the per-column dict.

Now: dict is optional, default = every supported column

lib.write(sym, df)
lib.create_column_stats_experimental(sym)              # auto-MINMAX over every numeric/temporal col
stats = lib.read_column_stats(sym)

Explicit dicts still work unchanged:

lib.create_column_stats_experimental(sym, {"col_1": {"MINMAX"}})

Changes

python/arcticdb/version_store/_store.py

  • create_column_statscreate_column_stats_experimental.
  • column_stats: Dict[str, Set[str]] is now Optional[..] = None.
  • New private helper _auto_column_stats(symbol, as_of): reads the symbol descriptor via get_info and returns {col: {"MINMAX"}} for every column
  • Cross-references in drop_column_stats and get_column_stats_info docstrings updated to point at the new name.

markovskipetar and others added 2 commits June 1, 2026 09:52
Co-authored-by: Copilot <copilot@github.com>
@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

Label error. Requires exactly 1 of: patch, minor, major. Found:

Comment thread cpp/arcticdb/version/version_core.cpp Outdated
Comment on lines +139 to +142
ARCTICDB_DEBUG(log::version(), "Auto column stats skipped (user input): {}", e.what());
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The catch list is too narrow. create_column_stats_impl can throw beyond SchemaException/UserInputException:

  • storage::NoDataFoundException if the index key read fails (e.g., transient storage error).
  • internal::check failures (e.g., the Could not unpack metadata of old_header util::check).
  • The final store->update(column_stats_key, ...).get() will throw any StorageException subtype on a network/quota/permission failure.

Because this is called after the data write has already completed, any uncaught exception here makes the user's write/append/update appear to fail even though the symbol was written successfully — a confusing partial-success state. Either:

  1. Catch std::exception (or ArcticBaseException) at this boundary and log a warning so the data write is never observable as a failure, or
  2. Document and test the failure semantics explicitly.

Given the goal is "experiment, don't break user writes" the broader catch is the safer call.

Comment thread cpp/arcticdb/version/version_core.cpp Outdated
async_write_dataframe_impl(store, version_id, frame, options, de_dup_map, sparsify_floats, validate_index);
return {std::move(atom_key_fut).get()};
VersionedItem versioned_item{std::move(atom_key_fut).get()};
create_auto_column_stats_impl_experimental(store, versioned_item, frame->desc());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The hook only fires from the sync write_dataframe_impl / append_impl / update_impl paths. Batch operations in local_versioned_engine.cpp (batch_write_versioned_dataframe_internal at L1790, batch_append_internal at L1946/1958, batch_update_internal at L2031/2048) call async_write_dataframe_impl / async_append_impl / async_update_impl directly and bypass this hook entirely.

That means lib.write(sym, df) auto-creates stats, but lib.write_batch([sym], [df]) does not — same data, different outcome depending on which API the user picks. If the goal of the experiment is to measure the end-to-end impact, the inconsistency makes the measurement misleading and gives users a footgun (silently divergent on-disk state between batch and non-batch writes).

Either also hook the async paths (probably via .thenValue after the AtomKey future) or document the limitation prominently in the PR description.

Comment thread cpp/arcticdb/version/version_core.cpp Outdated
Comment on lines +94 to +121
size_t pos = 0;

for (const auto& field : desc.fields()) {
const size_t this_pos = pos++;

if (this_pos < index_field_count) {
continue;
}
if (!arcticdb::is_numeric_type(field.type().data_type())) {
continue;
}

if (field.name().empty()) {
continue;
}

stats_map.emplace(std::move(std::string(field.name())), std::unordered_set<std::string>{"MINMAX"});
}

if (stats_map.empty()) {
return std::nullopt;
}

return ColumnStats(stats_map);
}

// EXPERIMENTAL: invoked automatically after every write/append/update. Builds
// the auto stats from the input frame's descriptor and delegates to

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No handling for dynamic_schema. When dynamic_schema=true, an append can introduce new columns whose offsets differ from the index segment's descriptor. create_column_stats_impl then resolves names against the merged TimeseriesDescriptor via ColumnStats::calculate_offsets (column_stats.cpp:183), and on a mismatch — or when stats already exist for this version with different offsets — it goes through the column_stats.drop(old_column_stats, false) merge path which assumes consistent offsets.

This isn't exercised by the new test (which uses static schema). Please add a dynamic_schema=True case with append introducing a new numeric column, and verify the auto-stats are correct (or at least that the write doesn't fail).

Also note: index_field_count is taken from the just-written frame's descriptor, but create_column_stats_impl later resolves names against the index key's TSD. If those differ (e.g., column reordering under dynamic schema), the skip count is computed against a different descriptor than the one used to look up the columns — risk of off-by-one against the wrong layout.

);
ARCTICDB_DEBUG(
log::version(), "updated stream_id: {} , version_id: {}", frame->desc().id(), update_info.next_version_id_
);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hook added to update_impl, but the new test only exercises write + append. update_impl is the most complex of the three (partial-row replacement, intersecting segments, dynamic schema branches) and is the most likely to surface edge cases — please add a test that exercises an update and verifies the resulting auto-stats reflect the post-update data (not just the updating frame's range).

Comment thread cpp/arcticdb/version/version_core.cpp Outdated
Comment on lines +88 to +161
@@ -97,7 +157,9 @@
);
auto atom_key_fut =
async_write_dataframe_impl(store, version_id, frame, options, de_dup_map, sparsify_floats, validate_index);
return {std::move(atom_key_fut).get()};
VersionedItem versioned_item{std::move(atom_key_fut).get()};
create_auto_column_stats_impl_experimental(store, versioned_item, frame->desc());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always-on with no opt-out. Even framed as an experiment, this changes the cost profile of every write/append/update for every user upgrading past this commit: an extra full-symbol read, MINMAX compute, and COLUMN_STATS key write per call. For large frames or high-throughput writers this is a significant regression with no way to disable it short of patching.

Strongly recommend gating behind a runtime flag — either a WriteOptions field or a config setting (e.g. write_options.auto_create_column_stats) defaulting to false. That preserves the experimental intent (you can flip it on in benchmark runs and measure cleanly) without imposing the cost on everyone who happens to build from this branch. It also gives you a clean revert/cleanup path if the experiment doesn't pan out.

@claude

claude Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

ArcticDB Code Review Summary

Latest commits (delta): this delta reverts the experimental date_range/row_range range feature and the range-restricted RMW path. It removes merge_column_stats_with_range_replacement (column_stats.cpp/.hpp), the is_filtered branch in create_column_stats_impl, and the read_query parameter from the C++ create_column_stats_version / create_column_stats_version_internal / create_column_stats_internal / create_column_stats_impl signatures, plus the date_range/row_range kwargs from create_column_stats_experimental in _store.py. The C++ side is now internally consistent — but the Python revert was left incomplete.

Correctness

  • New regression — TypeError on the headline path. _store.py:1310 still calls self.version_store.create_column_stats_version(symbol, column_stats, version_query, read_query) with 4 positional args, but this delta removed read_query from the C++ method (version_store_api.hpp:107 → now (stream_id, column_stats, version_query)), and python_bindings.cpp:663 binds the method pointer directly. Calling with 4 args raises TypeError: incompatible function arguments. The dead read_query = _PythonVersionStoreReadQuery() at _store.py:1308 must also be dropped. This breaks every call to create_column_stats_experimental (including the default column_stats=None path). (Mirror of the earlier-flagged TypeError, now re-introduced by the partial revert.)
  • Resolved (delta). The previous new_segment / new_col_stats_segment compile error in create_column_stats_impl is gone — the revert restored consistent new_segment naming throughout version_core.cpp.
  • Resolved (delta). The range-restricted RMW mixed-stat-presence concern is moot — the is_filtered path and merge_column_stats_with_range_replacement are removed. No dangling references remain.

Testing

  • ❌ Still no test exercising create_column_stats_experimental. A trivial test of the default path (create_column_stats_experimental(sym)) and an explicit column_stats={...} call would have caught the TypeError above. Per the repo TDD guideline, add these before merging.

Documentation

  • ⚠️ _store.py docstring correctly dropped the date_range/row_range parameters. Confirm docs/claude/ and any user-facing docs no longer reference the now-removed range parameters.

Style

  • ⚠️ Trailing whitespace remains in _store.py (the if column_stats is None: line and the blank line before the assignment). Run make lint.

@markovskipetar markovskipetar changed the title [ EXPERIMENTAL ] Auto Columns Stats Creation Over All Columns [ EXPERIMENTAL ] Rename API For Col Stats To _experimental And Update Logic Jun 2, 2026
Comment thread python/arcticdb/version_store/_store.py Outdated

self.version_store.create_column_stats_version(symbol, column_stats, version_query, read_query)

def get_eligible_column_stats_spec(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is defined as get_eligible_column_stats_spec (no leading underscore), but it is called as self._get_eligible_column_stats_spec(symbol, as_of) at line 1319. The call will raise AttributeError — and this is exactly the default code path (column_stats is None) that is the headline feature of this PR ("default to all numeric columns"). Rename the definition to match the caller (it was previously the private _auto_column_stats):

Suggested change
def get_eligible_column_stats_spec(
def _get_eligible_column_stats_spec(

@markovskipetar markovskipetar force-pushed the experimental_auto_columns_stats_creation branch 2 times, most recently from 18f1128 to 1cafa99 Compare June 5, 2026 07:07
Comment thread python/arcticdb/version_store/_store.py Outdated
self.version_store.create_column_stats_version(symbol, column_stats, version_query)
read_query = _PythonVersionStoreReadQuery()

self.version_store.create_column_stats_version(symbol, column_stats, version_query, read_query)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This call will raise TypeError at runtime. This commit removed the read_query parameter from the C++ PythonVersionStore::create_column_stats_version (now (stream_id, column_stats, version_query) — see version_store_api.hpp:107), and the pybind binding binds the method pointer directly (python_bindings.cpp:683). But this line still passes a 4th positional argument read_query, so every call to create_column_stats_experimental will fail with incompatible function arguments.

Drop the now-unused read_query construction (line 1308) and the argument here:

Suggested change
self.version_store.create_column_stats_version(symbol, column_stats, version_query, read_query)
self.version_store.create_column_stats_version(symbol, column_stats, version_query)

(also remove read_query = _PythonVersionStoreReadQuery() above). This is the headline default code path — a test exercising create_column_stats_experimental(sym) would have caught it.

Comment thread cpp/arcticdb/version/version_core.cpp Outdated
validate_column_stats_header_version(old_header);
arcticc::pb2::column_stats_pb2::ColumnStatsHeader new_header;
unpacked = new_segment.metadata()->UnpackTo(&new_header);
unpacked = new_col_stats_segment.metadata()->UnpackTo(&new_header);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Compile error introduced by the rename in this commit. Seven lines above (the if (!old_metadata) branch, line 2097) still references the old variable name:

store->update(column_stats_key, std::move(new_segment), update_opts).get();

new_segment no longer exists — it was renamed to new_col_stats_segment (declared at line 2051, updated everywhere else). This file will not compile. Rename this last reference to new_col_stats_segment.

@markovskipetar markovskipetar force-pushed the experimental_auto_columns_stats_creation branch from 542f033 to 1cafa99 Compare June 5, 2026 07:59
lib.create_column_stats(sym, column_stats_dict)


def test_column_stats_multiple_creates(version_store_factory, lib_name, encoding_version, any_output_format):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why removing this one?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_column_stats_multiple_creates was removed because its purpose was to call create column stats with different column across different calls: {"col_1": {"MINMAX"}}, then {"col_2": {"MINMAX"}} and verify the merge was successful. But whit the new updated API, we cannot achieve this behavior because we create stats over all columns, without specifying particular columns.

test_column_stats_string_column_minmax was removed because its purpose was to test that creating stats over string column raises SchemaException. But with the new update API, the python layer iterates over all column and collects only the eligible columns for stats (numeric and bools)

The inner index level can be referenced by the user-facing name ("level"), the mangled name ("__idx__level"),
or the __fkidx__ name for unnamed index levels."""
"""Column stats on a multiindex DataFrame: auto-discovery picks up data columns; the inner
index level is not included by default."""

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong, we definitely need to create stats over the inner index levels.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment has been addressed properly, we still want to create stats over the multi-index columns,

    expected = index_columns_to_pl(lib, sym).with_columns(
        pl.Series(f"v1_MIN({stored_col_name})", [10, 30]),
        pl.Series(f"v1_MAX({stored_col_name})", [20, 40]),
        pl.Series("v1_MIN(val)", [100, 300]),
        pl.Series("v1_MAX(val)", [200, 400]),
    )

But this test just has:

    expected = index_columns_to_pl(lib, sym).with_columns(
        pl.Series("v1_MIN(val)", [100, 300]),
        pl.Series("v1_MAX(val)", [200, 400]),
    )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants