[ EXPERIMENTAL ] Rename API For Col Stats To _experimental And Update Logic#3141
[ EXPERIMENTAL ] Rename API For Col Stats To _experimental And Update Logic#3141markovskipetar wants to merge 12 commits into
Conversation
Co-authored-by: Copilot <copilot@github.com>
|
Label error. Requires exactly 1 of: patch, minor, major. Found: |
| ARCTICDB_DEBUG(log::version(), "Auto column stats skipped (user input): {}", e.what()); | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
The catch list is too narrow. create_column_stats_impl can throw beyond SchemaException/UserInputException:
storage::NoDataFoundExceptionif the index key read fails (e.g., transient storage error).internal::checkfailures (e.g., theCould not unpack metadata of old_headerutil::check).- The final
store->update(column_stats_key, ...).get()will throw anyStorageExceptionsubtype on a network/quota/permission failure.
Because this is called after the data write has already completed, any uncaught exception here makes the user's write/append/update appear to fail even though the symbol was written successfully — a confusing partial-success state. Either:
- Catch
std::exception(orArcticBaseException) at this boundary and log a warning so the data write is never observable as a failure, or - Document and test the failure semantics explicitly.
Given the goal is "experiment, don't break user writes" the broader catch is the safer call.
| async_write_dataframe_impl(store, version_id, frame, options, de_dup_map, sparsify_floats, validate_index); | ||
| return {std::move(atom_key_fut).get()}; | ||
| VersionedItem versioned_item{std::move(atom_key_fut).get()}; | ||
| create_auto_column_stats_impl_experimental(store, versioned_item, frame->desc()); |
There was a problem hiding this comment.
The hook only fires from the sync write_dataframe_impl / append_impl / update_impl paths. Batch operations in local_versioned_engine.cpp (batch_write_versioned_dataframe_internal at L1790, batch_append_internal at L1946/1958, batch_update_internal at L2031/2048) call async_write_dataframe_impl / async_append_impl / async_update_impl directly and bypass this hook entirely.
That means lib.write(sym, df) auto-creates stats, but lib.write_batch([sym], [df]) does not — same data, different outcome depending on which API the user picks. If the goal of the experiment is to measure the end-to-end impact, the inconsistency makes the measurement misleading and gives users a footgun (silently divergent on-disk state between batch and non-batch writes).
Either also hook the async paths (probably via .thenValue after the AtomKey future) or document the limitation prominently in the PR description.
| size_t pos = 0; | ||
|
|
||
| for (const auto& field : desc.fields()) { | ||
| const size_t this_pos = pos++; | ||
|
|
||
| if (this_pos < index_field_count) { | ||
| continue; | ||
| } | ||
| if (!arcticdb::is_numeric_type(field.type().data_type())) { | ||
| continue; | ||
| } | ||
|
|
||
| if (field.name().empty()) { | ||
| continue; | ||
| } | ||
|
|
||
| stats_map.emplace(std::move(std::string(field.name())), std::unordered_set<std::string>{"MINMAX"}); | ||
| } | ||
|
|
||
| if (stats_map.empty()) { | ||
| return std::nullopt; | ||
| } | ||
|
|
||
| return ColumnStats(stats_map); | ||
| } | ||
|
|
||
| // EXPERIMENTAL: invoked automatically after every write/append/update. Builds | ||
| // the auto stats from the input frame's descriptor and delegates to |
There was a problem hiding this comment.
No handling for dynamic_schema. When dynamic_schema=true, an append can introduce new columns whose offsets differ from the index segment's descriptor. create_column_stats_impl then resolves names against the merged TimeseriesDescriptor via ColumnStats::calculate_offsets (column_stats.cpp:183), and on a mismatch — or when stats already exist for this version with different offsets — it goes through the column_stats.drop(old_column_stats, false) merge path which assumes consistent offsets.
This isn't exercised by the new test (which uses static schema). Please add a dynamic_schema=True case with append introducing a new numeric column, and verify the auto-stats are correct (or at least that the write doesn't fail).
Also note: index_field_count is taken from the just-written frame's descriptor, but create_column_stats_impl later resolves names against the index key's TSD. If those differ (e.g., column reordering under dynamic schema), the skip count is computed against a different descriptor than the one used to look up the columns — risk of off-by-one against the wrong layout.
| ); | ||
| ARCTICDB_DEBUG( | ||
| log::version(), "updated stream_id: {} , version_id: {}", frame->desc().id(), update_info.next_version_id_ | ||
| ); |
There was a problem hiding this comment.
Hook added to update_impl, but the new test only exercises write + append. update_impl is the most complex of the three (partial-row replacement, intersecting segments, dynamic schema branches) and is the most likely to surface edge cases — please add a test that exercises an update and verifies the resulting auto-stats reflect the post-update data (not just the updating frame's range).
| @@ -97,7 +157,9 @@ | |||
| ); | |||
| auto atom_key_fut = | |||
| async_write_dataframe_impl(store, version_id, frame, options, de_dup_map, sparsify_floats, validate_index); | |||
| return {std::move(atom_key_fut).get()}; | |||
| VersionedItem versioned_item{std::move(atom_key_fut).get()}; | |||
| create_auto_column_stats_impl_experimental(store, versioned_item, frame->desc()); | |||
There was a problem hiding this comment.
Always-on with no opt-out. Even framed as an experiment, this changes the cost profile of every write/append/update for every user upgrading past this commit: an extra full-symbol read, MINMAX compute, and COLUMN_STATS key write per call. For large frames or high-throughput writers this is a significant regression with no way to disable it short of patching.
Strongly recommend gating behind a runtime flag — either a WriteOptions field or a config setting (e.g. write_options.auto_create_column_stats) defaulting to false. That preserves the experimental intent (you can flip it on in benchmark runs and measure cleanly) without imposing the cost on everyone who happens to build from this branch. It also gives you a clean revert/cleanup path if the experiment doesn't pan out.
ArcticDB Code Review SummaryLatest commits (delta): this delta reverts the experimental date_range/row_range range feature and the range-restricted RMW path. It removes merge_column_stats_with_range_replacement (column_stats.cpp/.hpp), the is_filtered branch in create_column_stats_impl, and the read_query parameter from the C++ create_column_stats_version / create_column_stats_version_internal / create_column_stats_internal / create_column_stats_impl signatures, plus the date_range/row_range kwargs from create_column_stats_experimental in _store.py. The C++ side is now internally consistent — but the Python revert was left incomplete. Correctness
Testing
Documentation
Style
|
|
|
||
| self.version_store.create_column_stats_version(symbol, column_stats, version_query, read_query) | ||
|
|
||
| def get_eligible_column_stats_spec( |
There was a problem hiding this comment.
This method is defined as get_eligible_column_stats_spec (no leading underscore), but it is called as self._get_eligible_column_stats_spec(symbol, as_of) at line 1319. The call will raise AttributeError — and this is exactly the default code path (column_stats is None) that is the headline feature of this PR ("default to all numeric columns"). Rename the definition to match the caller (it was previously the private _auto_column_stats):
| def get_eligible_column_stats_spec( | |
| def _get_eligible_column_stats_spec( |
18f1128 to
1cafa99
Compare
| self.version_store.create_column_stats_version(symbol, column_stats, version_query) | ||
| read_query = _PythonVersionStoreReadQuery() | ||
|
|
||
| self.version_store.create_column_stats_version(symbol, column_stats, version_query, read_query) |
There was a problem hiding this comment.
This call will raise TypeError at runtime. This commit removed the read_query parameter from the C++ PythonVersionStore::create_column_stats_version (now (stream_id, column_stats, version_query) — see version_store_api.hpp:107), and the pybind binding binds the method pointer directly (python_bindings.cpp:683). But this line still passes a 4th positional argument read_query, so every call to create_column_stats_experimental will fail with incompatible function arguments.
Drop the now-unused read_query construction (line 1308) and the argument here:
| self.version_store.create_column_stats_version(symbol, column_stats, version_query, read_query) | |
| self.version_store.create_column_stats_version(symbol, column_stats, version_query) |
(also remove read_query = _PythonVersionStoreReadQuery() above). This is the headline default code path — a test exercising create_column_stats_experimental(sym) would have caught it.
| validate_column_stats_header_version(old_header); | ||
| arcticc::pb2::column_stats_pb2::ColumnStatsHeader new_header; | ||
| unpacked = new_segment.metadata()->UnpackTo(&new_header); | ||
| unpacked = new_col_stats_segment.metadata()->UnpackTo(&new_header); |
There was a problem hiding this comment.
Compile error introduced by the rename in this commit. Seven lines above (the if (!old_metadata) branch, line 2097) still references the old variable name:
store->update(column_stats_key, std::move(new_segment), update_opts).get();new_segment no longer exists — it was renamed to new_col_stats_segment (declared at line 2051, updated everywhere else). This file will not compile. Rename this last reference to new_col_stats_segment.
542f033 to
1cafa99
Compare
… and skip index columns
…ons of the functions and change the parameters passed to them
| lib.create_column_stats(sym, column_stats_dict) | ||
|
|
||
|
|
||
| def test_column_stats_multiple_creates(version_store_factory, lib_name, encoding_version, any_output_format): |
There was a problem hiding this comment.
Why removing this one?
There was a problem hiding this comment.
test_column_stats_multiple_creates was removed because its purpose was to call create column stats with different column across different calls: {"col_1": {"MINMAX"}}, then {"col_2": {"MINMAX"}} and verify the merge was successful. But whit the new updated API, we cannot achieve this behavior because we create stats over all columns, without specifying particular columns.
test_column_stats_string_column_minmax was removed because its purpose was to test that creating stats over string column raises SchemaException. But with the new update API, the python layer iterates over all column and collects only the eligible columns for stats (numeric and bools)
| The inner index level can be referenced by the user-facing name ("level"), the mangled name ("__idx__level"), | ||
| or the __fkidx__ name for unnamed index levels.""" | ||
| """Column stats on a multiindex DataFrame: auto-discovery picks up data columns; the inner | ||
| index level is not included by default.""" |
There was a problem hiding this comment.
This is wrong, we definitely need to create stats over the inner index levels.
There was a problem hiding this comment.
I don't think this comment has been addressed properly, we still want to create stats over the multi-index columns,
expected = index_columns_to_pl(lib, sym).with_columns(
pl.Series(f"v1_MIN({stored_col_name})", [10, 30]),
pl.Series(f"v1_MAX({stored_col_name})", [20, 40]),
pl.Series("v1_MIN(val)", [100, 300]),
pl.Series("v1_MAX(val)", [200, 400]),
)
But this test just has:
expected = index_columns_to_pl(lib, sym).with_columns(
pl.Series("v1_MIN(val)", [100, 300]),
pl.Series("v1_MAX(val)", [200, 400]),
)
Experimental: rename
create_column_statsand default to all numeric columnsSummary
Rename
NativeVersionStore.create_column_statstocreate_column_stats_experimental(Python-only — no C++ rename) and make thecolumn_statsargument optional. When omitted, the call computesMINMAXfor every column of the symbol, instead of forcing the user to specify columns.Before: column dict was required
The user had to (a) remember to call it after
write, and (b) hand-roll the per-column dict.Now: dict is optional, default = every supported column
Explicit dicts still work unchanged:
Changes
python/arcticdb/version_store/_store.pycreate_column_stats→create_column_stats_experimental.column_stats: Dict[str, Set[str]]is nowOptional[..] = None._auto_column_stats(symbol, as_of): reads the symbol descriptor viaget_infoand returns{col: {"MINMAX"}}for every columndrop_column_statsandget_column_stats_infodocstrings updated to point at the new name.