Skip to content

Add timeouts and retries to upstream requests to mitigate connection timeouts (#163)#164

Open
Vaibhavkkm wants to merge 2 commits into
nasa:masterfrom
Vaibhavkkm:fix/issue-163-upstream-timeouts
Open

Add timeouts and retries to upstream requests to mitigate connection timeouts (#163)#164
Vaibhavkkm wants to merge 2 commits into
nasa:masterfrom
Vaibhavkkm:fix/issue-163-upstream-timeouts

Conversation

@Vaibhavkkm

Copy link
Copy Markdown

Summary

Fixes #163.

The service scrapes apod.nasa.gov on every request using requests.get / a raw urllib3.PoolManager with no timeout and no retries. When the upstream site is slow or flaky, a hung connection holds a gunicorn worker until it is killed (sync workers, default 30s timeout), so transient upstream instability cascades into the connection timeouts and 503s several users report in #163.

This change hardens the two outbound calls in apod/utility.py (the APOD page fetch and the Vimeo thumbnail lookup) so a single transient upstream failure no longer becomes a hard error for the client, while keeping the worst-case time budget under the worker timeout.

Changes

  • Shared requests.Session with connection pooling and a mounted retry policy: 3 retries with exponential backoff on 429/500/502/503/504, GET only.
  • Explicit timeouts on both upstream calls — 5s connect, 8s read. The Vimeo lookup previously used a raw PoolManager with no timeout at all.
  • Bounded worst-case hang budget that fits within gunicorn's default 30s worker timeout: connect hangs cost at most ~16s, read hangs at most ~26s.
  • Retry-After headers are deliberately ignored (respect_retry_after_header=False). urllib3 honors that header by default with a 6-hour cap; a 429/503 carrying a long Retry-After — exactly the overloaded-upstream case this fix targets — would otherwise sleep the worker past its timeout.
  • raise_for_status() after the existing 404 check, so an upstream error that persists through the retries surfaces as a clear error instead of the parser choking on an error page.
  • Offline unit tests (tests/apod/test_http.py) covering the timeout argument, the retry configuration, the worst-case hang budget, and the 404 / persistent-error paths for both the APOD and Vimeo requests. They run fully offline (mocked session), so CI stays hermetic.

Notes / scope

  • The multi-date endpoints (start_date/end_date range and count random) call the fetch per date in a loop with no shared deadline, so during a sustained upstream outage the per-date retry cost adds up. This is pre-existing loop behavior (those endpoints already iterated and skipped failed dates before this change); I kept the fix scoped to the per-request hardening rather than introducing a cross-request deadline mechanism. Happy to follow up separately if maintainers would like a bounded total budget for the loop endpoints.
  • urllib3 is already a direct dependency in pyproject.toml; this only adds requests.adapters.HTTPAdapter and urllib3.util.retry.Retry imports.

Testing

uv run pytest tests/apod/test_http.py tests/apod/test_service.py   # 8 passed
uv run ruff check .                                                # clean

Also smoke-tested end-to-end against the live apod.nasa.gov site and through the Flask app for a fixed date and today's date.

…timeouts (nasa#163)

The service fetches apod.nasa.gov on every request with no timeout and no
retries: a hung upstream connection holds a gunicorn worker indefinitely
(default 30s worker timeout, sync workers), so upstream instability
cascades into the timeouts and 503s reported in nasa#163.

- Use a shared requests.Session with connection pooling and retries
  (3 retries with exponential backoff on 429/500/502/503/504)
- Add explicit timeouts (5s connect, 8s read) to the APOD page fetch and
  the Vimeo thumbnail lookup (previously a raw PoolManager, no timeout);
  worst-case hang budgets stay under the 30s gunicorn worker timeout
- Ignore Retry-After headers so an upstream-specified delay on 429/503
  cannot sleep a worker past its timeout
- Surface upstream errors that persist after retries via raise_for_status
  instead of parsing an error page
- Add offline unit tests covering the timeout, retry config, hang budget,
  404 and error paths
Copilot AI review requested due to automatic review settings June 11, 2026 05:02

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR hardens outbound HTTP calls to the APOD site (and Vimeo thumbnail API) by adding explicit timeouts and retry behavior, and adds regression tests to validate those behaviors.

Changes:

  • Replace direct requests.get / urllib3.PoolManager usage with a shared requests.Session configured with timeouts and Retry.
  • Surface persistent upstream errors via raise_for_status() instead of parsing error pages.
  • Add offline unit tests covering timeouts, retry configuration, and error mapping behavior.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
tests/apod/test_http.py Adds regression tests validating timeouts, retries, and upstream error surfacing.
apod/utility.py Introduces shared requests.Session with retries/timeouts and updates APOD/Vimeo request paths accordingly.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/apod/test_http.py Outdated
@@ -0,0 +1,125 @@
#!/bin/sh/python
Comment thread apod/utility.py
Comment on lines +44 to +59
session = requests.Session()
session.mount(
"https://",
HTTPAdapter(
max_retries=Retry(
total=3,
connect=2,
read=1,
backoff_factor=0.5,
status_forcelist=[429, 500, 502, 503, 504],
allowed_methods=["GET"],
raise_on_status=False,
respect_retry_after_header=False,
)
),
)
The '#!/bin/sh/python' shebang is not a valid interpreter path and serves
no purpose: these modules are imported by pytest, never executed directly.
Dropped from all three test files, keeping the utf-8 encoding declaration.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Connection timeouts

2 participants