refactor: Overhaul to seed datasets #167
Conversation
```python
class HuggingFaceSeedConfig(SeedDatasetConfig):
    seed_type: Literal["hf"] = "hf"

    dataset: str = Field(pattern=r"^hf://datasets/*")
```
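For context on the discussion below, here is a minimal sketch of how that `pattern` constraint behaves, assuming pydantic v2 and assuming `SeedDatasetConfig` is a pydantic `BaseModel` (the base class itself is not part of the quoted diff):

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError


class SeedDatasetConfig(BaseModel):
    """Stand-in for the base class, which is not shown in the diff."""


class HuggingFaceSeedConfig(SeedDatasetConfig):
    seed_type: Literal["hf"] = "hf"
    dataset: str = Field(pattern=r"^hf://datasets/*")


# Accepted: the value starts with the enforced "hf://datasets" prefix.
HuggingFaceSeedConfig(dataset="hf://datasets/username/my-dataset")

try:
    # Rejected: the prefix is missing, so the pattern does not match.
    HuggingFaceSeedConfig(dataset="username/my-dataset")
except ValidationError as err:
    print(err)
```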
Thoughts on calling this `path` as well, to stay consistent with how the hf `load_dataset` kwarg has it defined?
I went with the approach here to try to minimize user confusion, e.g. asking questions like "am I supposed to include `datasets/` here or not? Am I supposed to put the `hf://` prefix?" The (enforced) answer being "yes and yes." However, I don't have a strong opinion here, and in fact leaning on `load_dataset` as the prior art and "matching" it as much as possible is probably the right way to go.
After looking some more at the HfFileSystem docs I decided to:
- call the field `path`
- not expect the `hf://` prefix
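A minimal sketch (reusing the class and imports from the earlier sketch) of what the field might look like after that change; the example value in the comment is an assumption:

```python
class HuggingFaceSeedConfig(SeedDatasetConfig):
    seed_type: Literal["hf"] = "hf"

    # Renamed from `dataset` to `path`, and the "hf://" prefix is no longer
    # expected, matching how the hf load_dataset kwarg is defined,
    # e.g. path="username/my-dataset".
    path: str
```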
johnnygreco left a comment:
Awesome work @mikeknep – thank you 🙏
I'm good to ship once @nabinchha also gives the ✅
nabinchha left a comment:
Lgtm, nice work @mikeknep! Needs one temp file dropped before merging.
nabinchha left a comment:
🚢
This PR makes some significant, breaking changes to how we work with seed datasets.

The two most important new base classes are:
- `SeedSource`: config-side, provides pointers to datasets (HF, local file, dataframe, NMP File)
- `SeedReader`: engine-side, reading seed datasets using duckdb
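As a rough illustration of that split (the actual signatures aren't shown on this page, so the method names and shapes below are assumptions), the two base classes might look roughly like:

```python
from abc import ABC, abstractmethod


class SeedSource(ABC):
    """Config-side: a pointer to a seed dataset (HF, local file, dataframe, NMP File)."""


class SeedReader(ABC):
    """Engine-side: reads a seed dataset for a given SeedSource, using duckdb."""

    @abstractmethod
    def column_names(self) -> list[str]:
        """Resolve the dataset's column names consistently, regardless of seed type."""
```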
The most significant structural change is that both seed column resolution (fetching column names) and overall builder validation are moved to the "back-end", i.e. to `engine`. By doing this we can use the new `SeedReader` objects to get the column names consistently regardless of seed type (using duckdb). I made a single `compile_data_designer_config` function so that there is a single "entrypoint" to go from builder -> fully resolved and validated `DataDesignerConfig`. This new compiler could do more of both these things ("deferred column resolution" and overall validation) in the future if needed.
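A sketch of what that single entrypoint implies: only the function name and the `DataDesignerConfig` / `SeedReader` names come from the PR description; every helper and attribute used in the body is an assumption about the real code.

```python
# Hypothetical sketch of the single "entrypoint" from builder to a fully
# resolved and validated DataDesignerConfig.
def compile_data_designer_config(builder):
    config = builder.build()                    # hypothetical builder method

    # Deferred seed column resolution: engine-side SeedReader objects fetch
    # column names via duckdb, the same way for every seed type.
    for seed_source in config.seed_sources:     # hypothetical attribute
        reader = make_seed_reader(seed_source)  # hypothetical factory
        seed_source.columns = reader.column_names()

    validate_config(config)                     # hypothetical full validation
    return config
```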
Moving validation to `engine` does mean that we no longer have a `DataDesignerConfigBuilder.validate` method. Validation happens when you call `preview` or `create` from the interface (and in NMP, validation will happen server-side in those endpoints). If this is too unpleasant and we want to keep (some form of) that `validate` method on the builder, we could move things around a bit so that "partial" validation can be done config-side, but it would not be able to do full validation because the builder will never know the seed columns.

Note: I've not yet updated docs or notebooks; I figure I'll wait until the implementation is reviewed to minimize rework there.