Skip to content

feat(data): serve contiguous data by content digest#753

Open
vilenarios wants to merge 2 commits into
developfrom
feat/serve-data-by-digest
Open

feat(data): serve contiguous data by content digest#753
vilenarios wants to merge 2 commits into
developfrom
feat/serve-data-by-digest

Conversation

@vilenarios
Copy link
Copy Markdown
Contributor

What

Adds GET|HEAD /ar-io/digest/:digest — serve contiguous data addressed by its SHA-256 content digest, the same base64url value the gateway already emits as the X-AR-IO-Digest response header and uses as its on-disk cache key.

The content store is already content-addressed by this hash; this PR exposes it directly as a first-class, gateway-namespaced endpoint.

Why

Lets clients fetch (and dedup/verify) bytes by content identity rather than by transaction/data-item id. Because the bytes provably hash to the requested digest, the endpoint is inherently self-verifying in a way /raw/:txid can't be for peer-sourced bytes.

Semantics

  • Self-verifying: X-AR-IO-Verified: true always.
  • Immutable: Cache-Control: public, max-age=…, immutable (the URL is the hash of the bytes).
  • Local-cache only: there is no on-demand fetch by content hash — Arweave/peers address by id, not by content hash — so a digest the node has never materialized returns 404.
  • Full signed-header parity with /raw: a representative id that resolves to the digest (cheap lookup via the existing contiguous_data_hash index) is run through the same setDataHeaders path, so the response carries the full id-scoped header set (X-AR-IO-Data-Id, tags, owner, signature, root offsets) — all then covered by the HTTPSIG signature. The served digest is pinned onto the attributes so digest/ETag/Content-Digest always describe the bytes actually streamed.
  • Range, HEAD, If-None-Match, blocklist (451), and malformed-digest (400) all handled.

Design notes

  • Path choice: /ar-io/digest/:digest (not /raw/by-digest/…). /raw and /{txid} are the cross-gateway Arweave-standard namespace; content-hash serving is AR.IO-specific, so it lives under the /ar-io/* namespace and mirrors the X-AR-IO-Digest header name for round-trip symmetry.
  • No migration: the reverse contiguous_data_hash index already exists; this only adds a read path.
  • Encoding: bare base64url SHA-256 (identical to the header and on-disk key) for v1.

Changes

  • DB: selectDataAttributesByHash + getDataAttributesByHash through the StandaloneSqlite worker / circuit-breaker / queue / message-handler chain
  • Data source: ReadThroughDataCache.getDataByHash + ByHashDataSource interface
  • Attributes: getDataAttributesByHash on DataAttributesSource + CompositeDataAttributesSource
  • Route/handler: DIGEST_DATA_PATH_REGEX, createDigestDataHandler on arIoRouter; handleRangeRequest generalized with an optional region fetcher
  • Docs: openapi.yaml path + glossary entry

Tests

getDataAttributesByHash (DB), getDataByHash (data source), and the handler (GET / HEAD / range / 404 / 451 / 400 / no-representative-id) — all green.

Note

A second commit (chore: incidental working-tree tweaks) bundles three unrelated working-tree changes (docker-compose TX_FETCHER_WORKER_COUNT passthrough, grafana foldersFromFilesStructure, a CLAUDE.md note) that were present in the checkout. Happy to drop them into their own PR if preferred.

Follow-ups (not in this PR)

  • GraphQL: expose digest field + transactions(digest:) filter (cross-DB; harder)
  • ClickHouse parity (needs a content-hash column → full auto-verify/Parquet chain)

🤖 Generated with Claude Code

Add GET|HEAD /ar-io/digest/:digest to serve contiguous data addressed by
its SHA-256 content digest — the same base64url value emitted as the
X-AR-IO-Digest response header and used as the on-disk cache key.

The gateway already stores contiguous data content-addressed by this hash;
this exposes it directly. Responses are inherently self-verifying (the bytes
provably hash to the requested digest) so X-AR-IO-Verified is always true and
Cache-Control is immutable. Local-cache only: there is no on-demand fetch by
content hash, since Arweave/peers address by id, so an unknown digest is 404.

For header parity with /raw, a representative id that resolves to the digest
is looked up (cheap, via the existing contiguous_data_hash index) and run
through the same setDataHeaders path, so the full id-scoped header set
(X-AR-IO-Data-Id, tags, owner, signature, root offsets) is present and signed
by the HTTPSIG middleware. The served digest is pinned onto the attributes so
the digest/ETag/Content-Digest headers always describe the bytes streamed.

- DB: selectDataAttributesByHash SQL + getDataAttributesByHash through the
  StandaloneSqlite worker/circuit-breaker/queue/handler chain (no migration —
  the contiguous_data_hash index already exists)
- Data source: ReadThroughDataCache.getDataByHash + ByHashDataSource interface
- Attributes: getDataAttributesByHash on DataAttributesSource + composite
- Route: DIGEST_DATA_PATH_REGEX, createDigestDataHandler on arIoRouter;
  handleRangeRequest generalized with an optional region fetcher
- Tests: handler (GET/HEAD/range/404/451/400/no-id), getDataByHash, DB method
- Docs: openapi path + glossary "Content Digest (ar-io-digest)"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: ce347f96-ad9e-4ff2-b52a-e87ba07cf9ad

📥 Commits

Reviewing files that changed from the base of the PR and between 5f3411b and 4bf33dd.

📒 Files selected for processing (7)
  • src/data/read-through-data-cache.test.ts
  • src/data/read-through-data-cache.ts
  • src/database/sql/data/content-attributes.sql
  • src/database/standalone-sqlite.test.ts
  • src/routes/data/handlers.test.ts
  • src/routes/data/handlers.ts
  • src/types.d.ts
🚧 Files skipped from review as they are similar to previous changes (7)
  • src/database/sql/data/content-attributes.sql
  • src/types.d.ts
  • src/database/standalone-sqlite.test.ts
  • src/routes/data/handlers.test.ts
  • src/data/read-through-data-cache.test.ts
  • src/data/read-through-data-cache.ts
  • src/routes/data/handlers.ts

📝 Walkthrough

Walkthrough

This PR implements content-addressed data serving by SHA-256 digest via a new /ar-io/digest/{digest} endpoint. It adds hash-keyed database lookups, cache retrieval by content hash, pluggable range-request routing, and a complete HTTP handler with response headers, error handling, and range support.

Changes

Content-Addressed Digest Data Serving

Layer / File(s) Summary
Type contracts for hash-keyed data access
src/types.d.ts
DataAttributesByHash captures minimal metadata (hash, size, optional contentType and id); ContiguousDataIndex gains getDataAttributesByHash(hash) method; ByHashDataSource defines hash-keyed retrieval with region support; ByHashData extends ContiguousData with optional representativeId.
Database hash lookup with circuit breaker
src/database/sql/data/content-attributes.sql, src/database/standalone-sqlite.ts, src/metrics.ts, src/database/standalone-sqlite.test.ts
New selectDataAttributesByHash SQL query reverse-resolves metadata from content hash via left-join, deterministically selecting representative id by verified/trusted preference; worker method maps result to typed attributes object; dedicated circuit breaker with metrics registration; RPC dispatch; tests verify indexed hashes, missing hashes, and deterministic id selection across multiple verified/unverified rows.
Cached data serving by hash
src/data/read-through-data-cache.ts, src/data/read-through-data-cache.test.ts
ReadThroughDataCache.getDataByHash(hash, region?) validates hash is indexed, fetches cached stream with optional region slicing, records full vs range metrics, and returns ByHashData with verified/trusted/cached flags forced to true; tests cover full reads, region slicing, missing hashes, and missing blobs.
Pluggable region fetch for range requests
src/routes/data/handlers.ts
handleRangeRequest refactored to make dataSource optional and accept getRegionData(region) callback; internal fetchRegion helper routes requests to callback when provided or falls back to dataSource.getData; applied in single-range and multipart-range response paths.
Digest endpoint handler implementation
src/routes/data/handlers.ts, src/routes/data/handlers.test.ts
New createDigestDataHandler for `GET
Route registration and handler wiring
src/constants.ts, src/routes/ar-io.ts, src/routes/data/index.ts
DIGEST_DATA_PATH_REGEX constant matches /ar-io/digest/{43-char-digest}; ar-io.ts imports and registers GET route; data/index.ts imports factory and exports digestDataHandler instance wired with shared dependencies (log, data source, attributes source, blocklist, rate limiter, payment, metadata resolver).
Documentation and API specification
docs/glossary.md, docs/openapi.yaml
Glossary entry for Content Digest (ar-io-digest) explains SHA-256 base64url encoding, header name, on-disk mapping, and local-cache-only semantics; OpenAPI spec documents GET and HEAD on /ar-io/digest/{digest} with base64url digest parameter, Range support, and response codes (200/206/400/404/416/451) with immutable caching and always-verified headers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • ar-io/ar-io-node#633: Introduces the trusted flag and updates content-attributes.sql, establishing the trust semantics that the digest-by-hash work leverages for deterministic representative ID selection.
  • ar-io/ar-io-node#705: Modifies setDataHeaders to emit retrieval-hint and alignment headers; the new digest handler uses setDataHeaders, creating a direct overlap in response header generation.
  • ar-io/ar-io-node#514: Refactors payment and rate-limiting enforcement in src/routes/data/handlers.ts; the new digest handler integrates into that same centralized flow.

Suggested reviewers

  • dtfiedler
  • hlolli
  • djwhitt
🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title 'feat(data): serve contiguous data by content digest' clearly and concisely summarizes the main change—adding a new endpoint to serve data by content hash digest.
Description check ✅ Passed The description comprehensively explains the PR's purpose, implementation, semantics, design choices, and test coverage, and is clearly related to the changeset.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/serve-data-by-digest

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (3)
src/routes/data/index.ts (1)

35-45: ⚡ Quick win

Use TSDoc for the exported digestDataHandler declaration.

Since this is a new exported handler, prefer a TSDoc block over inline comments to keep export-level documentation consistent.

As per coding guidelines, src/**/*.ts: Add or improve TSDoc comments on code you touch.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/routes/data/index.ts` around lines 35 - 45, Add a TSDoc block above the
exported digestDataHandler declaration describing its purpose, route mounting
(/ar-io/digest/:digest), and the key injected dependencies
(createDigestDataHandler, dataSource, dataAttributesSource,
dataBlockListValidator, rateLimiter, paymentProcessor, dataItemMetaResolver);
replace the inline comment with this TSDoc and ensure it uses the /** ... */
format immediately above the export so documentation tools pick it up and it
follows the repo's TSDoc conventions.
src/constants.ts (1)

91-95: ⚡ Quick win

Add TSDoc to the new exported regex constant.

DIGEST_DATA_PATH_REGEX is exported and newly introduced; please convert the inline comments to a symbol-level TSDoc block for consistency with the file’s documented exports.

As per coding guidelines, src/**/*.ts: Add or improve TSDoc comments on code you touch.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/constants.ts` around lines 91 - 95, Add a symbol-level TSDoc block above
the exported constant DIGEST_DATA_PATH_REGEX that replaces the inline comments:
describe that it matches content-addressed data using a base64url SHA-256 digest
(43 chars), that this value is emitted as X-AR-IO-Digest, and that the
/ar-io/digest/ prefix is distinct from /raw/:txid because a 43-char digest looks
like a txid; leave the exported declaration export const DIGEST_DATA_PATH_REGEX
= /^\/ar-io\/digest\/([a-zA-Z0-9-_]{43})\/?$/i; unchanged except for adding the
TSDoc. Ensure the TSDoc uses proper /** ... */ format and is placed immediately
above the constant.
src/routes/data/handlers.ts (1)

1394-1416: ⚡ Quick win

Remove the orphaned TSDoc block above createDigestDataHandler.

There are two consecutive doc blocks here; the first one is not attached to a symbol and reads like stale docs, which makes generated/internal docs ambiguous.

As per coding guidelines, src/**/*.ts: Add or improve TSDoc comments on code you touch.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/routes/data/handlers.ts` around lines 1394 - 1416, Remove the stray/stale
TSDoc block that sits immediately above createDigestDataHandler so only the
single relevant doc block (the one attached to createDigestDataHandler) remains;
delete the orphaned comment content and, if needed, consolidate any unique
information into the remaining TSDoc for createDigestDataHandler (update that
function's comment rather than keeping two consecutive blocks) and ensure the
file follows the project's TSDoc guidelines.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/glossary.md`:
- Around line 226-237: Update the broken anchor in the "Content Digest
(ar-io-digest)" entry: change the link target from `#contiguous-data-store` to
the existing anchor `#contiguous-data` so the reference to the contiguous data
section resolves correctly (look for the "Content Digest (ar-io-digest)"
paragraph and its `[contiguous data store](`#contiguous-data-store`)` link).

In `@src/data/read-through-data-cache.test.ts`:
- Around line 208-215: The test for readThroughDataCache.getDataByHash currently
only asserts returned metadata; update the test to also assert that the
underlying dataStore.get was called with the expected region object (offset and
size) so the region is not dropped. Locate the call to
readThroughDataCache.getDataByHash in the test and add an assertion against the
mocked/spied dataStore.get (or equivalent spy used in this spec) verifying its
arguments include a region matching { offset: 0, size: 4 } (or that the
second/appropriate parameter contains that region), ensuring the store call
receives the region.

In `@src/data/read-through-data-cache.ts`:
- Around line 472-490: The success metric is being incremented too early; remove
the immediate call to metrics.getDataStreamSuccessesTotal.inc(...) and instead
attach a one-time listener to cacheStream (use cacheStream.once('end', ...)) to
call metrics.getDataStreamSuccessesTotal.inc with the same labels (class:
this.constructor.name, source: 'cache', request_type: requestType) when the
stream actually finishes; also attach a one-time 'error' listener to cacheStream
to avoid incrementing on failure (or increment a failure metric if available).
Ensure you handle the case the stream has already ended (check
cacheStream.readableEnded or equivalent) and call the metric immediately in that
case. Use the existing symbols cacheStream, requestType,
metrics.getDataStreamSuccessesTotal, attributes, region, and hash to locate and
implement the change.

In `@src/database/standalone-sqlite.ts`:
- Around line 1227-1242: Add a TSDoc comment to the public method
getDataAttributesByHash describing the function's contract: explain that the
input hash is expected to be a base64-url encoded string, that the
implementation decodes it (via fromB64Url) and will return undefined when no row
is found, and that the returned object contains base64-url encoded hash and id
fields (using toB64Url), a numeric size, and an optional contentType; also add
equivalent TSDoc to the other newly added "by-hash" methods to document
not-found behavior, encoding/decoding expectations, and exact return
shape/nullable fields.

In `@src/routes/ar-io.ts`:
- Around line 232-235: The new route arIoRouter.get(DIGEST_DATA_PATH_REGEX,
digestDataHandler) is exposing per-digest metric labels; add a normalizePath
mapping for the digest route so metrics use a fixed template (e.g.
"/ar-io/digest/:digest") instead of the raw digest value. Locate the metrics
path normalization configuration where includePath: true or path normalization
rules are defined and add an entry mapping the DIGEST_DATA_PATH_REGEX (or the
concrete route "/ar-io/digest/:digest") to the normalized template so the
arIoRouter.get route emits metrics under a stable label rather than per-digest
values.

---

Nitpick comments:
In `@src/constants.ts`:
- Around line 91-95: Add a symbol-level TSDoc block above the exported constant
DIGEST_DATA_PATH_REGEX that replaces the inline comments: describe that it
matches content-addressed data using a base64url SHA-256 digest (43 chars), that
this value is emitted as X-AR-IO-Digest, and that the /ar-io/digest/ prefix is
distinct from /raw/:txid because a 43-char digest looks like a txid; leave the
exported declaration export const DIGEST_DATA_PATH_REGEX =
/^\/ar-io\/digest\/([a-zA-Z0-9-_]{43})\/?$/i; unchanged except for adding the
TSDoc. Ensure the TSDoc uses proper /** ... */ format and is placed immediately
above the constant.

In `@src/routes/data/handlers.ts`:
- Around line 1394-1416: Remove the stray/stale TSDoc block that sits
immediately above createDigestDataHandler so only the single relevant doc block
(the one attached to createDigestDataHandler) remains; delete the orphaned
comment content and, if needed, consolidate any unique information into the
remaining TSDoc for createDigestDataHandler (update that function's comment
rather than keeping two consecutive blocks) and ensure the file follows the
project's TSDoc guidelines.

In `@src/routes/data/index.ts`:
- Around line 35-45: Add a TSDoc block above the exported digestDataHandler
declaration describing its purpose, route mounting (/ar-io/digest/:digest), and
the key injected dependencies (createDigestDataHandler, dataSource,
dataAttributesSource, dataBlockListValidator, rateLimiter, paymentProcessor,
dataItemMetaResolver); replace the inline comment with this TSDoc and ensure it
uses the /** ... */ format immediately above the export so documentation tools
pick it up and it follows the repo's TSDoc conventions.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: c9ecfca2-aac3-4fe1-a6d7-e45cf58943bc

📥 Commits

Reviewing files that changed from the base of the PR and between 9d07dba and 5f3411b.

📒 Files selected for processing (18)
  • CLAUDE.md
  • docker-compose.yaml
  • docs/glossary.md
  • docs/openapi.yaml
  • monitoring/grafana/provisioning/dashboards/dashboards.yaml
  • src/constants.ts
  • src/data/composite-data-attributes-source.ts
  • src/data/read-through-data-cache.test.ts
  • src/data/read-through-data-cache.ts
  • src/database/sql/data/content-attributes.sql
  • src/database/standalone-sqlite.test.ts
  • src/database/standalone-sqlite.ts
  • src/metrics.ts
  • src/routes/ar-io.ts
  • src/routes/data/handlers.test.ts
  • src/routes/data/handlers.ts
  • src/routes/data/index.ts
  • src/types.d.ts

Comment thread docs/glossary.md
Comment on lines +226 to +237
<a id="content-digest"></a> **Content Digest (ar-io-digest)** - The SHA-256
hash of a piece of contiguous data, base64url-encoded. It is emitted on data
responses as the `X-AR-IO-Digest` header and is the key under which the
[contiguous data store](#contiguous-data-store) addresses bytes on disk
(`data/<h0:2>/<h2:4>/<hash>`). Because the same value identifies content
across the cache, the index, and the response header, it doubles as a stable
content address. The `GET /ar-io/digest/{digest}` endpoint serves bytes
directly by this value; such responses are inherently self-verifying (the
bytes provably hash to the requested digest) and immutable, but local-cache
only — there is no on-demand fetch by content hash, since Arweave addresses
data by [item ID](#item-id), not by content hash.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fix broken link fragment.

The anchor reference at line 229 uses #contiguous-data-store, but the actual anchor at line 99 is <a id="contiguous-data">. Update the link to #contiguous-data to match the existing anchor.

🔗 Proposed fix
 <a id="content-digest"></a> **Content Digest (ar-io-digest)** - The SHA-256
 hash of a piece of contiguous data, base64url-encoded. It is emitted on data
 responses as the `X-AR-IO-Digest` header and is the key under which the
-[contiguous data store](`#contiguous-data-store`) addresses bytes on disk
+[contiguous data store](`#contiguous-data`) addresses bytes on disk
 (`data/<h0:2>/<h2:4>/<hash>`). Because the same value identifies content
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
<a id="content-digest"></a> **Content Digest (ar-io-digest)** - The SHA-256
hash of a piece of contiguous data, base64url-encoded. It is emitted on data
responses as the `X-AR-IO-Digest` header and is the key under which the
[contiguous data store](#contiguous-data-store) addresses bytes on disk
(`data/<h0:2>/<h2:4>/<hash>`). Because the same value identifies content
across the cache, the index, and the response header, it doubles as a stable
content address. The `GET /ar-io/digest/{digest}` endpoint serves bytes
directly by this value; such responses are inherently self-verifying (the
bytes provably hash to the requested digest) and immutable, but local-cache
only — there is no on-demand fetch by content hash, since Arweave addresses
data by [item ID](#item-id), not by content hash.
<a id="content-digest"></a> **Content Digest (ar-io-digest)** - The SHA-256
hash of a piece of contiguous data, base64url-encoded. It is emitted on data
responses as the `X-AR-IO-Digest` header and is the key under which the
[contiguous data store](`#contiguous-data`) addresses bytes on disk
(`data/<h0:2>/<h2:4>/<hash>`). Because the same value identifies content
across the cache, the index, and the response header, it doubles as a stable
content address. The `GET /ar-io/digest/{digest}` endpoint serves bytes
directly by this value; such responses are inherently self-verifying (the
bytes provably hash to the requested digest) and immutable, but local-cache
only — there is no on-demand fetch by content hash, since Arweave addresses
data by [item ID](`#item-id`), not by content hash.
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 229-229: Link fragments should be valid

(MD051, link-fragments)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/glossary.md` around lines 226 - 237, Update the broken anchor in the
"Content Digest (ar-io-digest)" entry: change the link target from
`#contiguous-data-store` to the existing anchor `#contiguous-data` so the
reference to the contiguous data section resolves correctly (look for the
"Content Digest (ar-io-digest)" paragraph and its `[contiguous data
store](`#contiguous-data-store`)` link).

Comment on lines +208 to +215
it('honors a byte region', async () => {
const result = await readThroughDataCache.getDataByHash('knownHash', {
offset: 0,
size: 4,
});
assert.equal(result.size, 4);
assert.equal(result.totalSize, 100);
});
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert that the region is passed to the store call.

This test currently validates return metadata only; it can still pass if region is accidentally dropped before dataStore.get().

🧪 Suggested assertion hardening
   it('honors a byte region', async () => {
+    let receivedRegion: { offset: number; size: number } | undefined;
+    mock.method(
+      mockContiguousDataStore,
+      'get',
+      async (hash: string, region?: { offset: number; size: number }) => {
+        if (hash === 'knownHash') {
+          receivedRegion = region;
+          const stream = new Readable();
+          stream.push('simulated data');
+          stream.push(null);
+          return stream;
+        }
+        return undefined;
+      },
+    );
+
     const result = await readThroughDataCache.getDataByHash('knownHash', {
       offset: 0,
       size: 4,
     });
     assert.equal(result.size, 4);
     assert.equal(result.totalSize, 100);
+    assert.deepEqual(receivedRegion, { offset: 0, size: 4 });
   });
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
it('honors a byte region', async () => {
const result = await readThroughDataCache.getDataByHash('knownHash', {
offset: 0,
size: 4,
});
assert.equal(result.size, 4);
assert.equal(result.totalSize, 100);
});
it('honors a byte region', async () => {
let receivedRegion: { offset: number; size: number } | undefined;
mock.method(
mockContiguousDataStore,
'get',
async (hash: string, region?: { offset: number; size: number }) => {
if (hash === 'knownHash') {
receivedRegion = region;
const stream = new Readable();
stream.push('simulated data');
stream.push(null);
return stream;
}
return undefined;
},
);
const result = await readThroughDataCache.getDataByHash('knownHash', {
offset: 0,
size: 4,
});
assert.equal(result.size, 4);
assert.equal(result.totalSize, 100);
assert.deepEqual(receivedRegion, { offset: 0, size: 4 });
});
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/data/read-through-data-cache.test.ts` around lines 208 - 215, The test
for readThroughDataCache.getDataByHash currently only asserts returned metadata;
update the test to also assert that the underlying dataStore.get was called with
the expected region object (offset and size) so the region is not dropped.
Locate the call to readThroughDataCache.getDataByHash in the test and add an
assertion against the mocked/spied dataStore.get (or equivalent spy used in this
spec) verifying its arguments include a region matching { offset: 0, size: 4 }
(or that the second/appropriate parameter contains that region), ensuring the
store call receives the region.

Comment on lines +472 to +490
const requestType = region !== undefined ? 'range' : 'full';
metrics.getDataStreamSuccessesTotal.inc({
class: this.constructor.name,
source: 'cache',
request_type: requestType,
});

const totalSize = attributes.size;
return {
hash,
stream: cacheStream,
size: region?.size ?? totalSize,
totalSize,
sourceContentType: attributes.contentType,
// Content-addressed: the bytes provably hash to the requested digest.
verified: true,
trusted: true,
cached: true,
};
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Defer success metrics until the stream actually completes.

getDataByHash() increments getDataStreamSuccessesTotal before bytes are consumed. If the stream errors later, this path will still report success.

📈 Suggested fix
   const requestType = region !== undefined ? 'range' : 'full';
-  metrics.getDataStreamSuccessesTotal.inc({
-    class: this.constructor.name,
-    source: 'cache',
-    request_type: requestType,
-  });
+  cacheStream.once('error', () => {
+    metrics.getDataStreamErrorsTotal.inc({
+      class: this.constructor.name,
+      source: 'cache',
+      request_type: requestType,
+    });
+  });
+  cacheStream.once('end', () => {
+    metrics.getDataStreamSuccessesTotal.inc({
+      class: this.constructor.name,
+      source: 'cache',
+      request_type: requestType,
+    });
+  });
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
const requestType = region !== undefined ? 'range' : 'full';
metrics.getDataStreamSuccessesTotal.inc({
class: this.constructor.name,
source: 'cache',
request_type: requestType,
});
const totalSize = attributes.size;
return {
hash,
stream: cacheStream,
size: region?.size ?? totalSize,
totalSize,
sourceContentType: attributes.contentType,
// Content-addressed: the bytes provably hash to the requested digest.
verified: true,
trusted: true,
cached: true,
};
const requestType = region !== undefined ? 'range' : 'full';
cacheStream.once('error', () => {
metrics.getDataStreamErrorsTotal.inc({
class: this.constructor.name,
source: 'cache',
request_type: requestType,
});
});
cacheStream.once('end', () => {
metrics.getDataStreamSuccessesTotal.inc({
class: this.constructor.name,
source: 'cache',
request_type: requestType,
});
});
const totalSize = attributes.size;
return {
hash,
stream: cacheStream,
size: region?.size ?? totalSize,
totalSize,
sourceContentType: attributes.contentType,
// Content-addressed: the bytes provably hash to the requested digest.
verified: true,
trusted: true,
cached: true,
};
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/data/read-through-data-cache.ts` around lines 472 - 490, The success
metric is being incremented too early; remove the immediate call to
metrics.getDataStreamSuccessesTotal.inc(...) and instead attach a one-time
listener to cacheStream (use cacheStream.once('end', ...)) to call
metrics.getDataStreamSuccessesTotal.inc with the same labels (class:
this.constructor.name, source: 'cache', request_type: requestType) when the
stream actually finishes; also attach a one-time 'error' listener to cacheStream
to avoid incrementing on failure (or increment a failure metric if available).
Ensure you handle the case the stream has already ended (check
cacheStream.readableEnded or equivalent) and call the metric immediately in that
case. Use the existing symbols cacheStream, requestType,
metrics.getDataStreamSuccessesTotal, attributes, region, and hash to locate and
implement the change.

Comment on lines +1227 to +1242
getDataAttributesByHash(hash: string) {
const row = this.stmts.data.selectDataAttributesByHash.get({
hash: fromB64Url(hash),
});

if (row === undefined) {
return undefined;
}

return {
hash: row.hash ? toB64Url(row.hash) : hash,
size: row.data_size,
contentType: row.original_source_content_type ?? undefined,
id: row.id ? toB64Url(row.id) : undefined,
};
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add TSDoc for newly added by-hash methods.

Both new getDataAttributesByHash methods are touched public API surface but lack TSDoc comments. Please document contract/return semantics (not-found behavior and hash format assumptions).

As per coding guidelines: "src/**/*.ts: Add or improve TSDoc comments on code you touch".

Also applies to: 3505-3513

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/database/standalone-sqlite.ts` around lines 1227 - 1242, Add a TSDoc
comment to the public method getDataAttributesByHash describing the function's
contract: explain that the input hash is expected to be a base64-url encoded
string, that the implementation decodes it (via fromB64Url) and will return
undefined when no row is found, and that the returned object contains base64-url
encoded hash and id fields (using toB64Url), a numeric size, and an optional
contentType; also add equivalent TSDoc to the other newly added "by-hash"
methods to document not-found behavior, encoding/decoding expectations, and
exact return shape/nullable fields.

Comment thread src/routes/ar-io.ts
Comment on lines +232 to +235
// Content-addressed data: serve bytes by their SHA-256 digest (the value
// emitted as X-AR-IO-Digest). GET registration also answers HEAD. Local
// content store only — see createDigestDataHandler.
arIoRouter.get(DIGEST_DATA_PATH_REGEX, digestDataHandler);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize /ar-io/digest/:digest in metrics path mapping to prevent label-cardinality blowup.

Adding this route without a normalizePath rule causes per-digest metric labels (includePath: true), which can grow unbounded and degrade metrics stability.

💡 Suggested fix
       if (path.startsWith('/ar-io/')) {
         if (path === '/ar-io/healthcheck') return path;
         if (path === '/ar-io/info') return path;
         if (path === '/ar-io/peers') return path;
+        if (path.match(/^\/ar-io\/digest\/[a-zA-Z0-9_-]{43}\/?$/))
+          return '/ar-io/digest/:digest';
         if (path.match(/^\/ar-io\/resolver\/[^/]+$/))
           return '/ar-io/resolver/:name';
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/routes/ar-io.ts` around lines 232 - 235, The new route
arIoRouter.get(DIGEST_DATA_PATH_REGEX, digestDataHandler) is exposing per-digest
metric labels; add a normalizePath mapping for the digest route so metrics use a
fixed template (e.g. "/ar-io/digest/:digest") instead of the raw digest value.
Locate the metrics path normalization configuration where includePath: true or
path normalization rules are defined and add an entry mapping the
DIGEST_DATA_PATH_REGEX (or the concrete route "/ar-io/digest/:digest") to the
normalized template so the arIoRouter.get route emits metrics under a stable
label rather than per-digest values.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 26, 2026

Codecov Report

❌ Patch coverage is 95.61404% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 79.28%. Comparing base (9d07dba) to head (4bf33dd).
⚠️ Report is 2 commits behind head on develop.

Files with missing lines Patch % Lines
src/database/standalone-sqlite.ts 89.79% 4 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #753      +/-   ##
===========================================
+ Coverage    79.23%   79.28%   +0.05%     
===========================================
  Files          126      126              
  Lines        46653    46767     +114     
  Branches      3556     3571      +15     
===========================================
+ Hits         36966    37080     +114     
+ Misses        9633     9632       -1     
- Partials        54       55       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

- selectDataAttributesByHash: ORDER BY verified DESC, trusted DESC, id ASC so
  when several ids share a hash (byte-identical content, different signed
  envelopes) the representative is deterministic and prefers the strongest
  provenance, rather than depending on index iteration order.
- Collapse the per-request double lookup: getDataByHash now returns the
  representative id (ByHashData), so the handler no longer issues its own
  getDataAttributesByHash call. Removed the now-unused method from
  DataAttributesSource / CompositeDataAttributesSource (kept on
  ContiguousDataIndex, which getDataByHash uses).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vilenarios vilenarios force-pushed the feat/serve-data-by-digest branch from 5f3411b to 4bf33dd Compare May 27, 2026 01:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant