refactor: Overhaul to seed datasets #167
Conversation
```python
class HuggingFaceSeedConfig(SeedDatasetConfig):
    seed_type: Literal["hf"] = "hf"

    dataset: str = Field(pattern=r"^hf://datasets/*")
```
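For context on the discussion below, here is a minimal sketch of how that `pattern` constraint behaves, assuming pydantic v2 and assuming `SeedDatasetConfig` is a pydantic `BaseModel` (the base class itself is not part of the quoted diff):

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class SeedDatasetConfig(BaseModel):
    """Stand-in for the base class, which is not shown in the diff."""


class HuggingFaceSeedConfig(SeedDatasetConfig):
    seed_type: Literal["hf"] = "hf"
    dataset: str = Field(pattern=r"^hf://datasets/*")


# Accepted: the value starts with the enforced "hf://datasets" prefix.
HuggingFaceSeedConfig(dataset="hf://datasets/username/my-dataset")

try:
    # Rejected: the prefix is missing, so the pattern does not match.
    HuggingFaceSeedConfig(dataset="username/my-dataset")
except ValidationError as err:
    print(err)
```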
Thoughts on calling this `path` as well, to stay consistent with how the hf `load_dataset` kwarg has it defined?
I went with the approach here to try to minimize user confusion, e.g. asking questions like "am I supposed to include `datasets/` here or not? Am I supposed to put the `hf://` prefix?" The (enforced) answer being "yes and yes." However, I don't have a strong opinion here, and in fact leaning on `load_dataset` as the prior art and "matching" it as much as possible is probably the right way to go.
After looking some more at the HfFileSystem docs I decided to:
- call the field `path`
- not expect the `hf://` prefix
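A minimal sketch (reusing the class and imports from the earlier sketch) of what the field might look like after that change; the example value in the comment is an assumption:

```python
class HuggingFaceSeedConfig(SeedDatasetConfig):
    seed_type: Literal["hf"] = "hf"

    # Renamed from `dataset` to `path`, and the "hf://" prefix is no longer
    # expected, matching how the hf load_dataset kwarg is defined,
    # e.g. path="username/my-dataset".
    path: str
```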
johnnygreco left a comment:
Awesome work @mikeknep – thank you 🙏
I'm good to ship once @nabinchha also gives the ✅
nabinchha left a comment:
Lgtm, nice work @mikeknep! Needs one temp file dropped before merging.
nabinchha left a comment:
🚢
This PR makes some significant, breaking changes to how we work with seed datasets.

The two most important new base classes are:
- `SeedSource`: config-side, provides pointers to datasets (HF, local file, dataframe, NMP File)
- `SeedReader`: engine-side, reading seed datasets using duckdb
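As a rough illustration of that split (the actual signatures aren't shown on this page, so the method names and shapes below are assumptions), the two base classes might look roughly like:

```python
from abc import ABC, abstractmethod


class SeedSource(ABC):
    """Config-side: a pointer to a seed dataset (HF, local file, dataframe, NMP File)."""


class SeedReader(ABC):
    """Engine-side: reads a seed dataset for a given SeedSource, using duckdb."""

    @abstractmethod
    def column_names(self) -> list[str]:
        """Resolve the dataset's column names consistently, regardless of seed type."""
```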
The most significant structural change is that both seed column resolution (fetching column names) and overall builder validation are moved to the "back-end", i.e. to `engine`. By doing this we can use the new `SeedReader` objects to get the column names consistently regardless of seed type (using duckdb). I made a single `compile_data_designer_config` function so that there is a single "entrypoint" to go from builder -> fully resolved and validated `DataDesignerConfig`. This new compiler could do more of both these things ("deferred column resolution" and overall validation) in the future if needed.
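A sketch of what that single entrypoint implies: only the function name and the `DataDesignerConfig` / `SeedReader` names come from the PR description; every helper and attribute used in the body is an assumption about the real code.

```python
# Hypothetical sketch of the single "entrypoint" from builder to a fully
# resolved and validated DataDesignerConfig.
def compile_data_designer_config(builder):
    config = builder.build()                    # hypothetical builder method

    # Deferred seed column resolution: engine-side SeedReader objects fetch
    # column names via duckdb, the same way for every seed type.
    for seed_source in config.seed_sources:     # hypothetical attribute
        reader = make_seed_reader(seed_source)  # hypothetical factory
        seed_source.columns = reader.column_names()

    validate_config(config)                     # hypothetical full validation
    return config
```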
Moving validation to `engine` does mean that we no longer have a `DataDesignerConfigBuilder.validate` method. Validation happens when you call `preview` or `create` from the interface (and in NMP, validation will happen server-side in those endpoints). If this is too unpleasant and we want to keep (some form of) that `validate` method on the builder, we could move things around a bit so that "partial" validation can be done config-side, but it would not be able to do full validation because the builder will never know the seed columns.

Note: I've not yet updated docs or notebooks; I figure I'll wait until the implementation is reviewed to minimize rework there.