Reducing complexity of implementation in order to be able to add Atlas text search token based pagination #1046
base: main
Conversation
slice_size = num_params_min_chunk or 1
# If successful, continue with normal pagination
total_data = {"data": []}  # type: dict
total_data["data"].extend(data["data"])
should favor .append(...) w/itertools.chain.from_iterable(...) at the end rather than repeated calls to .extend (especially since there is a loop later).
lines: 656, 701, 732, 806
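A minimal sketch of the suggested pattern, purely for illustration; the batches and fetch_page names below are hypothetical stand-ins for the PR's actual request loop, not real client APIs:
from itertools import chain

# Hypothetical stand-ins for the PR's batching loop and request call.
batches = [["mp-1", "mp-2"], ["mp-3"]]            # illustrative batches of IDs
def fetch_page(batch):                            # placeholder for the real request
    return {"data": [{"material_id": mid} for mid in batch]}

# Suggested pattern: .append one list per page, flatten once at the end,
# instead of calling total_data["data"].extend(...) on every iteration.
pages = []
for batch in batches:
    pages.append(fetch_page(batch)["data"])

total_data = {"data": list(chain.from_iterable(pages))}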
for i in range(0, len(split_values), batch_size):
    batch = split_values[i : i + batch_size]
Might be a ways off from being the minimum py version, but in 3.12 itertools introduced batched. I've used the approximate implementation from the docs before:
from itertools import islice

def batched(iterable, n, *, strict=False):
    # batched('ABCDEFG', 2) → AB CD EF G
    if n < 1:
        raise ValueError('n must be at least one')
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        if strict and len(batch) != n:
            raise ValueError('batched(): incomplete batch')
        yield batch
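As a usage sketch only, assuming itertools.batched on Python 3.12+ (or the recipe above on older versions), the quoted range/slice loop could then become roughly:
from itertools import batched   # Python 3.12+; otherwise use the recipe above

split_values = ["mp-1", "mp-2", "mp-3", "mp-4", "mp-5"]   # illustrative values
batch_size = 2

for batch in batched(split_values, batch_size):
    # Note: batched yields tuples, whereas the slice-based loop yields lists.
    print(batch)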
tsmathis left a comment
Not really much to say on my end; I am curious, though, about the performance/execution time of this implementation vs. the parallel approach.
params_min_chunk = min(
    parallel_param_str_chunks, key=lambda x: len(x.split("%2C"))
# If we found a parameter to split, try the request first and only split on error
if split_param and split_values and len(split_values) > 1:
if split_param and len(split_values or []) > 1

r -= 1
except MPRestError as e:
    # If we get 422 or 414 error, or 0 results for comma-separated params, split into batches
    if "422" in str(e) or "414" in str(e) or "Got 0 results" in str(e):
any(trace in str(e) for trace in ("422", "414", "Got 0 results"))

]
# Batch the split values to reduce number of requests
# Use batches of up to 100 values to balance URL length and request count
batch_size = min(100, max(1, len(split_values) // 10))
Should the batch size be chosen according to the limits we (may) impose on a Query? Or alternatively, should there be a check on the length of a batch after fixing the batch size? That way, excessively long queries get rejected (e.g., if I query for 1M task IDs, 100 batches would still give me an overly long list of task IDs).
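A rough sketch of the kind of guard being suggested; MAX_VALUES_PER_REQUEST and MAX_URL_LENGTH here are assumed, illustrative limits, not actual client settings:
# Illustrative sketch only: these limits are assumptions, not real client values.
MAX_VALUES_PER_REQUEST = 100   # cap tied to whatever limit the query endpoint imposes
MAX_URL_LENGTH = 8000          # reject batches that would still produce an oversized URL

def make_batches(split_values):
    batch_size = min(MAX_VALUES_PER_REQUEST, max(1, len(split_values) // 10))
    batches = [
        split_values[i : i + batch_size]
        for i in range(0, len(split_values), batch_size)
    ]
    # Check each batch's encoded length after fixing the batch size,
    # so excessively long queries are rejected up front.
    for batch in batches:
        if len("%2C".join(batch)) > MAX_URL_LENGTH:
            raise ValueError("batch would produce an excessively long query URL")
    return batches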
Summary
Major changes: