
Adding multisample feature along with testcases#740

Open
VijayVignesh1 wants to merge 29 commits into Lightning-AI:main from VijayVignesh1:feature/add_multisample_support

Conversation

Contributor

@VijayVignesh1 VijayVignesh1 commented Oct 24, 2025

Before submitting
  • Was this discussed/agreed via a GitHub issue? (no need for typos and docs improvements)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure to update the docs?
  • Did you write any new necessary tests?

What does this PR do?

Fixes #317

PR review

Added support for multi-sample items.

Added a `sample_count` parameter that, given a single transform function, creates a batch of sub-samples for each sample.

Note:
Multi-sample behavior applies only when the transform is passed to the
StreamingDataset constructor (i.e., via the `transform` argument),
not when subclassing StreamingDataset and overriding `__init__`.

Sample code:

    def transform_fn_sq(x, sample_idx, *args, **kwargs):
        """A simple transform function that doubles the input."""
        return x * sample_idx

    dataset = StreamingDataset(
        data_dir,
        cache_dir=str(cache_dir),
        shuffle=False,
        transform=[transform_fn_sq],
        sample_count=3,
    )
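The multi-sample expansion above can be modeled as simple index inflation: with `sample_count=3`, a dataset of N base items exposes 3·N virtual samples, and each virtual index maps to a (base index, sub-sample index) pair fed to the transform. A standalone sketch in plain Python (no litdata; `expand_index` is a hypothetical helper, and litdata's actual index ordering may differ):

```python
def expand_index(virtual_idx, sample_count):
    """Map a virtual sample index to (base item index, sub-sample index)."""
    return divmod(virtual_idx, sample_count)

def transform_fn_sq(x, sample_idx):
    """Scale the base item by its sub-sample index (mirrors the example above)."""
    return x * sample_idx

base_data = [10, 20]   # two base items
sample_count = 3       # each base item yields 3 sub-samples

virtual = []
for i in range(len(base_data) * sample_count):
    base_idx, sample_idx = expand_index(i, sample_count)
    virtual.append(transform_fn_sq(base_data[base_idx], sample_idx))

print(virtual)  # [0, 10, 20, 0, 20, 40]
```

Note that `len` of the streaming dataset would then report the inflated count, which is what makes the resume math discussed later in this thread subtle.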

Anyone in the community is free to review the PR once the tests have passed.
If we didn't discuss your PR in GitHub issues there's a high chance it will not be merged.

Did you have fun?

Make sure you had fun coding 🙃

@VijayVignesh1 VijayVignesh1 force-pushed the feature/add_multisample_support branch from 1b01b6f to 6a77302 on October 24, 2025 at 20:12
@VijayVignesh1
Contributor Author

@tchaton @deependujha @bhimrazy Can you verify the approach once? I can then make changes to the README.

codecov Bot commented Oct 29, 2025

Codecov Report

❌ Patch coverage is 86.48649% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 81%. Comparing base (90bd404) to head (7f95d5d).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #740   +/-   ##
===================================
- Coverage    81%    81%   -0%     
===================================
  Files        54     54           
  Lines      7613   7642   +29     
===================================
+ Hits       6140   6163   +23     
- Misses     1473   1479    +6     

@VijayVignesh1 VijayVignesh1 marked this pull request as ready for review November 3, 2025 21:19
@bhimrazy bhimrazy marked this pull request as draft January 8, 2026 10:16
@bhimrazy
Collaborator

Closing this PR due to inactivity. Please feel free to reopen or recreate it whenever convenient.
A clearer path forward can also be worked out through further discussion on the issue.

@bhimrazy bhimrazy closed this Mar 19, 2026
@VijayVignesh1
Contributor Author

Hi @bhimrazy,
I believe the required feature for this PR has already been completed. Would it be possible to reopen it so we can continue the review and move it forward? If there are any outstanding concerns, I’m happy to address them.

@deependujha deependujha reopened this Mar 23, 2026
Contributor Author

VijayVignesh1 commented Mar 24, 2026

@deependujha The pipeline is failing on test_wav_deserialization in VideoSerializer, even though my PR didn’t touch this file. Do you know why it might be failing?

UPDATE: The latest TorchVision release (v0.26) deprecates read_video. Our checks fail because CI picks up this newest version of TorchVision.
More details: https://docs.pytorch.org/vision/0.25/io.html#video-deprecated
I have pinned torchvision to versions below 0.26 for now.
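For illustration, the pin amounts to an upper bound like the following in the requirements file (the exact lower bound, if any, is up to the project; only the `<0.26` ceiling matters here):

```text
# keep read_video available; torchvision 0.26 deprecates it
torchvision <0.26
```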

@VijayVignesh1 VijayVignesh1 marked this pull request as ready for review March 24, 2026 18:56
@deependujha deependujha requested a review from Copilot March 24, 2026 20:02
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

Comments suppressed due to low confidence (1)

src/litdata/streaming/dataset.py:499

  • Resume logic appears inconsistent with the new inflated indexing. num_samples_yielded counts yielded samples (now multiplied by sample_count), but _replay_chunks_sampling subtracts raw interval sizes (not scaled by sample_count), so chunks_index / indexes will be wrong when sample_count>1. This will cause _resume() to restart from the wrong chunk/offset (repeats or skips). The replay math needs to incorporate sample_count (and ideally persist/validate sample_count in the state_dict).
        # TODO: Implement elastic sampling where the number of workers, ranks can change.
        num_samples_yielded = self._state_dict["num_samples_yielded"]

        worker_start = self.distributed_env.global_rank * num_workers
        worker_end = worker_start + num_workers

        # replay sampling from each worker / chunks using the batch size
        indexes = _replay_sampling(num_samples_yielded, batch_size, num_workers)
        chunks_index, indexes = _replay_chunks_sampling(
            workers_intervals={i: workers_intervals[j] for i, j in enumerate(range(worker_start, worker_end))},
            indexes=indexes,
        )

        # select the chunks and intervals associated to this worker
        worker_rank = self.distributed_env.global_rank * self.worker_env.world_size + self.worker_env.rank
        worker_local_rank = self.worker_env.rank

        self.num_chunks = len(workers_intervals[worker_rank])
        self.worker_next_chunk_index = chunks_index[worker_local_rank]
        self.worker_chunks = workers_chunks[worker_rank]
        self.worker_intervals = workers_intervals[worker_rank]

        if self.worker_next_chunk_index >= self.num_chunks:
            # This can happen when interrupting and resuming after some but not all workers are done.
            # Proceeding would result in an indexing error when attempting to access the next chunk.
            # To prevent this we exit early and let the worker raise a StopIteration in __next__.
            return

        # replay the indexes for the current chunks
        interval = self.worker_intervals[self.worker_next_chunk_index]

        # multiply the interval by the sample_count for multisample case
        current_indexes = np.arange(interval[1] * self.sample_count, interval[2] * self.sample_count)

        # re-shuffle the indexes
        current_indexes = self.shuffler(
            current_indexes, self.num_chunks, self.current_epoch, self.worker_next_chunk_index
        )

        # skip any indexes already consumed
        current_indexes = current_indexes[indexes[worker_local_rank] :]
        self.upcoming_indexes = current_indexes

        self.global_index = indexes[worker_local_rank]
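The inconsistency flagged above can be reproduced with a simplified, standalone model of the replay math (hypothetical `replay_chunks` helper; litdata's actual `_replay_chunks_sampling` has a different signature and interval format). The key change is scaling each interval's size by `sample_count` before subtracting it from the yielded-sample counter:

```python
def replay_chunks(intervals, samples_yielded, sample_count):
    """Walk the chunks, subtracting each chunk's *virtual* size
    (base size * sample_count) until the remainder fits inside one chunk.
    Returns (chunk_index, offset_within_chunk) in virtual samples."""
    chunk_index = 0
    remaining = samples_yielded
    for start, end in intervals:
        virtual_size = (end - start) * sample_count  # scale by sample_count
        if remaining < virtual_size:
            break
        remaining -= virtual_size
        chunk_index += 1
    return chunk_index, remaining

# Two chunks of 4 base items each; with sample_count=3 each chunk holds
# 12 virtual samples. After yielding 14 samples, resume must land in
# chunk 1 at virtual offset 2.
print(replay_chunks([(0, 4), (4, 8)], 14, sample_count=3))  # (1, 2)
```

Without the scaling (i.e., subtracting raw `end - start`), the same call would skip far too many chunks, which is the repeat/skip bug described in the review comment.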


deependujha and others added 7 commits April 3, 2026 19:19
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@deependujha
Collaborator

Doesn't support resuming with multisample. Will be good to do it in the subsequent PR.

Also, the name sample_count seems quite plain; what about num_variants, or something better? cc: @bhimrazy

Collaborator

@deependujha deependujha left a comment


Thanks for the cool work. 🥳

Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (1)

src/litdata/streaming/dataset.py:505

  • When sample_count > 1, resume logic will miscompute which chunk to resume from because _replay_chunks_sampling(...) subtracts chunk sizes in base-item units (interval[2]-interval[1]), while num_samples_yielded is counted in yielded (virtual) samples. This can cause resuming in the wrong chunk / offset. The replay logic needs to account for sample_count (e.g., treat each interval size as (interval[2]-interval[1]) * sample_count).
        # replay the indexes for the current chunks
        interval = self.worker_intervals[self.worker_next_chunk_index]

        # multiply the interval by the sample_count for multisample case
        current_indexes = np.arange(interval[1] * self.sample_count, interval[2] * self.sample_count)

        # re-shuffle the indexes
        current_indexes = self.shuffler(
            current_indexes, self.num_chunks, self.current_epoch, self.worker_next_chunk_index
        )


deependujha and others added 4 commits April 3, 2026 19:47
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@deependujha
Collaborator

requires #806



Successfully merging this pull request may close this issue:

Add support for multi sample item in optimize and yielding from the `__getitem__` of the StreamingDataset

4 participants