feat: audit any live site with npx aeo.js check <url>#63

Open
rubenmarcus wants to merge 4 commits into main from feat/remote-url-check

@rubenmarcus

Member

What

The audit no longer requires a local build — check and report now accept a URL or bare domain:

npx aeo.js check mysite.com
npx aeo.js report https://mysite.com --json

How

  • src/core/remote-crawl.ts — AEO surface discovery (robots.txt, llms.txt, llms-full.txt, sitemap.xml, ai-index.json, homepage), a 23-bot AI crawler access matrix parsed from robots.txt, and a bounded crawler (sitemap-first, homepage-link fallback; configurable timeout/concurrency/max pages, defaults 12s/5/10).
  • src/core/remote-audit.ts — the same 5-category / 100-point GEO audit driven by live data, per-page citability, platform hints (adds Claude + Gemini based on actual bot access), aeo.js usage detection, and a terminal formatter.
  • CLI — positional URL support, bare-domain normalization (mysite.com → https://mysite.com/), --json output with raw page HTML stripped, exit 1 on invalid/unreachable targets, Node 18+ guard for global fetch.
  • All of it is exported from the package root, so check.aeojs.org can replace its reimplemented lib/remote-audit.ts/lib/crawler.ts with these imports and the web checker, CLI, and future browser extension report identical scores.
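
The bare-domain normalization described in the CLI bullet can be sketched as follows. This is a minimal illustration, not the code in src/cli.ts: the name `normalizeTargetUrlSketch` and the `null` return for rejected input are assumptions, and the rejection rules (localhost, non-HTTP schemes, bare non-domain strings) are inferred from the test descriptions in this PR.

```typescript
// Hypothetical sketch; the real normalizeTargetUrl in src/cli.ts may differ.
function normalizeTargetUrlSketch(input: string): string | null {
  const raw = input.trim();
  if (raw === "") return null;

  // Prepend https:// only when the input carries no scheme of its own.
  const hasScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(raw);
  let url: URL;
  try {
    url = new URL(hasScheme ? raw : `https://${raw}`);
  } catch {
    return null; // Not parseable as a URL at all.
  }

  // Only web targets can be audited.
  if (url.protocol !== "http:" && url.protocol !== "https:") return null;
  if (url.hostname === "localhost") return null;
  // Require a dot so bare words such as "notadomain" are rejected.
  if (!url.hostname.includes(".")) return null;

  return url.toString();
}
```

A caller in the CLI would presumably print an error and exit with status 1 when the result is `null`, matching the "exit 1 on invalid targets" behavior listed above.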

This is the scoring logic that already runs in production on check.aeojs.org, upstreamed (minus checker-specific language/country detection, which stays in the app).

Verification

  • tsc --noEmit clean
  • 210 tests pass (30 new: crawler with mocked fetch, audit categories, report builder, formatter, arg parsing, URL normalization)
  • Smoke-tested against live sites: check aeojs.org → 100/100, check example.com → 32/100, invalid input and unreachable hosts exit 1

🤖 Generated with Claude Code

rubenmarcus and others added 4 commits June 10, 2026 12:57
Ports the check.aeojs.org crawler into the library: AEO surface discovery
(robots.txt, llms.txt, llms-full.txt, sitemap.xml, ai-index.json, homepage),
a 23-bot AI crawler access matrix parsed from robots.txt, and a bounded
page crawler (sitemap-first with homepage-link fallback, configurable
timeout/concurrency/max pages).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

remoteAuditSite() runs the same 5-category / 100-point GEO audit as
auditSite() but against crawled live-site data (discovery surface +
pages). buildRemoteReport() adds per-page citability scores, platform
hints (including Claude and Gemini driven by live bot access), and
aeo.js usage detection. formatRemoteReport() renders the terminal view.

This is the scoring engine check.aeojs.org reimplements today; it can
now import it from the library so web, CLI, and extension stay in sync.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

`npx aeo.js check mysite.com` (or a full https:// URL) now scans the
live site — discovery surface, bot access matrix, bounded crawl — and
prints the same 5-category GEO readiness score the local audit uses.
`report <url>` adds platform hints and per-page citability; --json
emits the full scan report with raw HTML stripped. Bare domains get
https:// prepended; invalid targets and unreachable hosts exit 1.

Also exports the remote crawl/audit API from the package root so
check.aeojs.org and the browser extension can share the same scoring.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| aeo-js | Ready | Preview, Comment | Jun 10, 2026 11:03am |

@github-actions

Docs Preview

Preview URL: https://feat-remote-url-check.aeojs.pages.dev

This preview was deployed from the latest commit on this PR.

@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Greptile Summary

This PR adds npx aeo.js check <url> / report <url> support, letting users audit any live site without a local build. It introduces two new modules (remote-crawl.ts, remote-audit.ts) that fetch a site's AEO surface files, parse a 23-bot robots.txt access matrix, crawl up to 10 pages, and run the same 5-category / 100-point GEO audit already used by check.aeojs.org.

  • remote-crawl.ts — parallel discovery (robots.txt, llms.txt, sitemap, ai-index.json, homepage) + bounded concurrent page crawler + robots.txt parser into a per-bot access matrix.
  • remote-audit.ts — 5-category remote audit, per-page citability scoring, Claude/Gemini platform hints driven by live bot-access data, and a terminal formatter.
  • CLI — positional URL support, bare-domain normalization (mysite.com → https://mysite.com/), Node 18 guard, exit 1 on invalid/unreachable targets; all new symbols exported from src/index.ts.
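
To make the "robots.txt parser into a per-bot access matrix" step concrete, here is a simplified sketch. It is not the PR's `parseRobotsTxtBotAccess`: the function names, the three-bot sample list, and the rule that only a site-wide `Disallow: /` counts as a block are assumptions made for illustration. It does cover the cases the test summary mentions (wildcard group, specific bot, grouped agents, allow-override).

```typescript
type RobotsRule = { allow: boolean; path: string };
type RobotsGroup = { agents: string[]; rules: RobotsRule[] };

// Split robots.txt into groups; consecutive User-agent lines share one group.
function parseRobotsGroups(robotsTxt: string): RobotsGroup[] {
  const groups: RobotsGroup[] = [];
  let current: RobotsGroup | null = null;
  let lastWasAgent = false;
  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // Drop comments.
    const idx = line.indexOf(":");
    if (idx === -1) continue;
    const field = line.slice(0, idx).trim().toLowerCase();
    const value = line.slice(idx + 1).trim();
    if (field === "user-agent") {
      if (!current || !lastWasAgent) {
        current = { agents: [], rules: [] };
        groups.push(current);
      }
      current.agents.push(value.toLowerCase());
      lastWasAgent = true;
    } else if ((field === "allow" || field === "disallow") && current) {
      current.rules.push({ allow: field === "allow", path: value });
      lastWasAgent = false;
    }
  }
  return groups;
}

// Simplification: a bot counts as blocked only by a site-wide "Disallow: /"
// that is not overridden by "Allow: /" in the same group.
function isBotAllowedSketch(groups: RobotsGroup[], bot: string): boolean {
  const name = bot.toLowerCase();
  // A group naming the bot takes precedence over the "*" wildcard group.
  const group =
    groups.find((g) => g.agents.includes(name)) ??
    groups.find((g) => g.agents.includes("*"));
  if (!group) return true; // No applicable group: nothing is disallowed.
  const blocksRoot = group.rules.some((r) => !r.allow && r.path === "/");
  const allowsRoot = group.rules.some((r) => r.allow && r.path === "/");
  return !blocksRoot || allowsRoot;
}

// Illustrative subset only; the PR's matrix covers 23 bots.
const SAMPLE_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot"];

function botAccessMatrixSketch(robotsTxt: string): Record<string, boolean> {
  const groups = parseRobotsGroups(robotsTxt);
  const matrix: Record<string, boolean> = {};
  for (const bot of SAMPLE_BOTS) matrix[bot] = isBotAllowedSketch(groups, bot);
  return matrix;
}
```

A full implementation would also need path-level matching (longest match wins) and merging of repeated groups for the same agent, which this sketch leaves out.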

Confidence Score: 3/5

Safe to merge for CLI use, but the unbounded response-body read in fetchText needs addressing before check.aeojs.org imports these functions server-side.

The crawler's fetchText buffers the full response body with no size limit. Because the PR explicitly designates this code for import by the check.aeojs.org web service where arbitrary user-submitted URLs are processed, a target serving a large response body could exhaust server memory. The SSRF gap (private IPs not blocked) compounds this for the server context. Both issues are straightforward to fix but should be resolved before the web service migration.

src/core/remote-crawl.ts — specifically fetchText (response body size) and the discover/crawlPages public API (SSRF validation).
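
One way to bound the response body, along the lines the review suggests, is to check `content-length` first and then stream the body with a running byte count. The sketch below assumes Node 18+ globals (`Response`, `TextDecoder`); the name `readTextCapped` and the `null`-on-overflow convention are hypothetical, and the 2 MB default is taken from the reviewer's example threshold.

```typescript
// Hypothetical helper; fetchText in the PR does not necessarily look like this.
async function readTextCapped(
  res: Response,
  maxBytes: number = 2 * 1024 * 1024,
): Promise<string | null> {
  // Fast path: an honest content-length lets us bail before reading anything.
  const declared = Number(res.headers.get("content-length"));
  if (Number.isFinite(declared) && declared > maxBytes) {
    await res.body?.cancel();
    return null;
  }
  if (!res.body) return "";

  // Slow path: stream and count, so a missing or lying header is still caught.
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let received = 0;
  let text = "";
  for (;;) {
    const chunk = await reader.read();
    if (chunk.done) break;
    received += chunk.value.byteLength;
    if (received > maxBytes) {
      await reader.cancel(); // Stop the download instead of buffering the rest.
      return null;
    }
    text += decoder.decode(chunk.value, { stream: true });
  }
  return text + decoder.decode(); // Flush any trailing partial character.
}
```

Returning `null` lets a caller treat an oversized page the same way as an unreachable one; throwing a dedicated error would be an equally reasonable choice.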

Security Review

  • SSRF (library boundary): discover and crawlPages accept any URL including private IP ranges, loopback, and cloud metadata endpoints (e.g. 169.254.169.254). The CLI's normalizeTargetUrl only blocks bare localhost; raw IPs pass through. When check.aeojs.org imports these functions directly with user-supplied URLs (per the PR description), SSRF is possible without an additional validation layer in the consuming application.
  • No secrets or credential leakage introduced.
  • No injection vulnerabilities in the audit/scoring logic.
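
A validation layer of the kind the SSRF item calls for could start with a literal-address check like the sketch below. All names here (`isBlockedIPv4`, `isBlockedHost`, `isSafeTarget`) are hypothetical and not part of the PR. The sketch only inspects the hostname as written; a DNS name that resolves to a private address would still get through, so a server-side consumer would also need to resolve the host and check the resolved IP.

```typescript
// Blocks the IPv4 ranges named in the review: private, loopback, link-local.
function isBlockedIPv4(ip: string): boolean {
  const parts = ip.split(".").map(Number);
  if (parts.length !== 4 || parts.some((p) => !Number.isInteger(p) || p < 0 || p > 255)) {
    return true; // Not a well-formed IPv4 literal: refuse rather than guess.
  }
  const [a, b] = parts;
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 169 && b === 254) || // Link-local, includes 169.254.169.254.
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168)
  );
}

function isBlockedHost(hostname: string): boolean {
  const host = hostname.toLowerCase().replace(/^\[|\]$/g, ""); // Strip IPv6 brackets.
  if (host === "localhost" || host.endsWith(".localhost")) return true;
  if (/^\d{1,3}(\.\d{1,3}){3}$/.test(host)) return isBlockedIPv4(host);
  if (host.includes(":")) {
    // IPv6 literal: loopback, unspecified, IPv4-mapped, unique-local, link-local.
    return (
      host === "::1" ||
      host === "::" ||
      host.startsWith("::ffff:") ||
      /^f[cd]/.test(host) ||
      /^fe[89ab]/.test(host)
    );
  }
  return false; // A DNS name: the caller must still check what it resolves to.
}

function isSafeTarget(rawUrl: string): boolean {
  let url: URL;
  try {
    url = new URL(rawUrl);
  } catch {
    return false;
  }
  if (url.protocol !== "http:" && url.protocol !== "https:") return false;
  return !isBlockedHost(url.hostname);
}
```

Redirects deserve the same check: a public URL can redirect to an internal one, so each hop would need to be validated before it is followed.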

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/core/remote-crawl.ts | New crawler: fetches AEO surface files, parses robots.txt into a 23-bot access matrix, and crawls up to 10 inner pages. Unbounded res.text() call in fetchText poses a DoS risk for server-side use; no SSRF guard on the public API; all exported types use interface instead of type. |
| src/core/remote-audit.ts | New 5-category / 100-point remote audit engine mirroring the local audit, plus platform hints for Claude and Gemini driven by live bot-access data. Logic looks correct; RemoteScanReport uses interface instead of type. |
| src/cli.ts | Adds positional URL support to check and report, bare-domain normalization, Node 18 guard, and remote scan dispatch. cmdCheckRemote and cmdReportRemote produce identical JSON output despite having different text-mode behavior. |
| src/index.ts | Re-exports all new crawler and audit symbols, making them available as a package-level API for consumers like check.aeojs.org. |
| src/core/remote-crawl.test.ts | Good coverage of robots.txt parsing (wildcard, specific bot, grouped agents, allow-override), sitemap URL extraction, link extraction, and discover/crawlPages with mocked fetch including an unreachable-site path. |
| src/core/remote-audit.test.ts | Covers all five audit categories, aeo.js detection, full report construction (including Claude/Gemini platform hints), and the terminal formatter. Tests are well-structured and representative. |
| src/cli.test.ts | Adds tests for positional argument capture, mixed flags/positionals, and normalizeTargetUrl including rejection of localhost, FTP, and bare non-domain strings. |

Sequence Diagram

sequenceDiagram
    participant User
    participant CLI as cli.ts (main)
    participant RC as remote-crawl.ts
    participant RA as remote-audit.ts
    participant Site as Target Site

    User->>CLI: npx aeo.js check mysite.com
    CLI->>CLI: "normalizeTargetUrl("mysite.com") -> "https://mysite.com/""
    CLI->>RC: discover(targetUrl)
    par AEO surface discovery
        RC->>Site: GET /robots.txt
        RC->>Site: GET /llms.txt
        RC->>Site: GET /llms-full.txt
        RC->>Site: GET /sitemap.xml
        RC->>Site: GET /ai-index.json
        RC->>Site: GET / (homepage)
    end
    RC->>RC: "parseRobotsTxtBotAccess(robotsTxt) -> 23-bot matrix"
    RC-->>CLI: DiscoveryResult
    CLI->>RC: crawlPages(discovery, targetUrl)
    loop "sitemap URLs (up to 10), batched by concurrency=5"
        RC->>Site: GET /page-n
    end
    RC-->>CLI: CrawledPage[]
    CLI->>RA: buildRemoteReport(url, discovery, pages)
    RA->>RA: "remoteAuditSite() -> 5 categories / 100 pts"
    RA->>RA: scorePageCitability() per page
    RA->>RA: generatePlatformHints() + Claude/Gemini hints
    RA-->>CLI: RemoteScanReport
    CLI->>User: formatRemoteReport(report) or JSON
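
The crawl loop in the diagram ("up to 10, batched by concurrency=5") can be sketched as below. The function name `crawlInBatches` and the injected `fetchPage` callback are assumptions made so the example runs without network access; the PR's `crawlPages` takes a `DiscoveryResult` and may be structured differently. The defaults mirror the documented ones (10 pages, 5 concurrent fetches).

```typescript
type PageFetcher = (url: string) => Promise<string | null>;
type FetchedPage = { url: string; html: string };

async function crawlInBatches(
  urls: string[],
  fetchPage: PageFetcher,
  maxPages: number = 10,
  concurrency: number = 5,
): Promise<FetchedPage[]> {
  const queue = urls.slice(0, maxPages); // Bound the crawl up front.
  const pages: FetchedPage[] = [];
  for (let i = 0; i < queue.length; i += concurrency) {
    const batch = queue.slice(i, i + concurrency);
    // A failed fetch becomes null so one bad page cannot abort the batch.
    const results = await Promise.all(
      batch.map((url) => fetchPage(url).catch(() => null)),
    );
    results.forEach((html, j) => {
      if (html !== null) pages.push({ url: batch[j], html });
    });
  }
  return pages;
}
```

Batching waits for the slowest page in each group before starting the next one. A worker-pool design would keep all five slots busy, at the cost of slightly more code.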

Comments Outside Diff (2)

  1. src/core/remote-crawl.ts, line 1193-1197 (link)

    P1 Unbounded response body buffering

    `res.text()` reads the entire response body into a single string with no size cap. The PR explicitly targets check.aeojs.org as a future consumer of these exports, meaning `discover` will run server-side against arbitrary user-submitted URLs. A malicious or pathological target can serve a multi-GB HTML body; `res.text()` will buffer all of it before returning, risking OOM and crashing the server.

    Consider reading the `content-length` header first and bailing early when it exceeds a reasonable threshold (e.g. 2 MB). Note that a missing or lying `content-length` won't be caught this way; streaming is safer for a hardened server deployment.

  2. src/core/remote-crawl.ts, line 1333-1378 (link)

    P2 security SSRF exposure when used as a library

    `discover` (and by extension `crawlPages`) will faithfully fetch whatever URL it is given, including private IP ranges (`10.x.x.x`, `192.168.x.x`, `172.16-31.x.x`), loopback (`127.0.0.1`, `::1`), and cloud metadata endpoints (`169.254.169.254`). The CLI's `normalizeTargetUrl` rejects bare `localhost` but does nothing for raw IP addresses or link-local hosts.

    Because the PR description explicitly calls out that check.aeojs.org will import these functions directly, any server-side integration that passes a user-supplied URL to `discover` without an additional validation layer would be vulnerable to SSRF.

Prompt To Fix All With AI
Fix the following 5 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 5
src/core/remote-crawl.ts:3-40
`interface` used for simple data structures across both new files, violating custom rules that mandate `type` for DTOs and data structures that don't use inheritance or extension. This pattern repeats for `BotAccessEntry`, `DiscoveryResult`, `CrawledPage`, and `RemoteCrawlOptions` in this file, and `RemoteScanReport` in `remote-audit.ts`.

```suggestion
export type BotAccessEntry = {
  bot: string;
  company: string;
  purpose: string;
  allowed: boolean;
};

export type DiscoveryResult = {
  robotsTxt: { exists: boolean; content: string | null; hasAiDisallow: boolean };
  llmsTxt: { exists: boolean; contentLength: number; content?: string | null };
  llmsFullTxt: { exists: boolean; contentLength: number };
  sitemap: { exists: boolean; urls: string[] };
  aiIndex: { exists: boolean; content?: string | null };
  homepage: { html: string; url: string } | null;
  botAccess: BotAccessEntry[];
};

export type CrawledPage = {
  url: string;
  pathname: string;
  html: string;
  title?: string;
  description?: string;
  content?: string;
  jsonLd?: object[];
  ogTags?: Record<string, string>;
};

export type RemoteCrawlOptions = {
  /** Per-request timeout in milliseconds. Default: 12000. */
  timeoutMs?: number;
  /** Maximum inner pages to crawl beyond the homepage. Default: 10. */
  maxPages?: number;
  /** Concurrent page fetches. Default: 5. */
  concurrency?: number;
  /** User-Agent header for all requests. */
  userAgent?: string;
};
```

### Issue 2 of 5
src/core/remote-audit.ts:9-22
`RemoteScanReport` is a pure data container with no inheritance — should be `type` per the same convention as the other DTOs in this module.

```suggestion
export type RemoteScanReport = {
  url: string;
  scannedAt: string;
  discovery: DiscoveryResult;
  pages: CrawledPage[];
  audit: AuditResult;
  citability: {
    averageScore: number;
    pages: PageCitabilityResult[];
  };
  platformHints: PlatformHint[];
  botAccess: BotAccessEntry[];
  usesAeoJs: boolean;
};
```

### Issue 3 of 5
src/core/remote-crawl.ts:1193-1197
**Unbounded response body buffering**

`res.text()` reads the entire response body into a single string with no size cap. The PR explicitly targets check.aeojs.org as a future consumer of these exports, meaning `discover` will run server-side against arbitrary user-submitted URLs. A malicious or pathological target can serve a multi-GB HTML body; `res.text()` will buffer all of it before returning, risking OOM and crashing the server.

Consider reading the `content-length` header first and bailing early when it exceeds a reasonable threshold (e.g. 2 MB). Note that a missing or lying `content-length` won't be caught this way; streaming is safer for a hardened server deployment.

### Issue 4 of 5
src/core/remote-crawl.ts:1333-1378
**SSRF exposure when used as a library**

`discover` (and by extension `crawlPages`) will faithfully fetch whatever URL it is given, including private IP ranges (`10.x.x.x`, `192.168.x.x`, `172.16-31.x.x`), loopback (`127.0.0.1`, `::1`), and cloud metadata endpoints (`169.254.169.254`). The CLI's `normalizeTargetUrl` rejects bare `localhost` but does nothing for raw IP addresses or link-local hosts.

Because the PR description explicitly calls out that check.aeojs.org will import these functions directly, any server-side integration that passes a user-supplied URL to `discover` without an additional validation layer would be vulnerable to SSRF.

### Issue 5 of 5
src/cli.ts:213-245
**`check --json` and `report --json` produce identical output**

Both `cmdCheckRemote` and `cmdReportRemote` delegate to `remoteReportJson(report)` in the JSON branch, so `npx aeo.js check mysite.com --json` and `npx aeo.js report mysite.com --json` emit byte-for-byte the same JSON. The extra detail that `cmdReportRemote` adds in text mode (platform hints loop, per-page citability list) is not reflected in the structured output. A note in the help text or docs would prevent confusion for users scripting around the output.

Reviews (1): Last reviewed commit: "docs: document URL mode for check and re..." | Re-trigger Greptile

Comment thread src/core/remote-crawl.ts
Comment on lines +3 to +40
export interface BotAccessEntry {
  bot: string;
  company: string;
  purpose: string;
  allowed: boolean;
}

export interface DiscoveryResult {
  robotsTxt: { exists: boolean; content: string | null; hasAiDisallow: boolean };
  llmsTxt: { exists: boolean; contentLength: number; content?: string | null };
  llmsFullTxt: { exists: boolean; contentLength: number };
  sitemap: { exists: boolean; urls: string[] };
  aiIndex: { exists: boolean; content?: string | null };
  homepage: { html: string; url: string } | null;
  botAccess: BotAccessEntry[];
}

export interface CrawledPage {
  url: string;
  pathname: string;
  html: string;
  title?: string;
  description?: string;
  content?: string;
  jsonLd?: object[];
  ogTags?: Record<string, string>;
}

export interface RemoteCrawlOptions {
  /** Per-request timeout in milliseconds. Default: 12000. */
  timeoutMs?: number;
  /** Maximum inner pages to crawl beyond the homepage. Default: 10. */
  maxPages?: number;
  /** Concurrent page fetches. Default: 5. */
  concurrency?: number;
  /** User-Agent header for all requests. */
  userAgent?: string;
}

P2 `interface` used for simple data structures across both new files, violating custom rules that mandate `type` for DTOs and data structures that don't use inheritance or extension. This pattern repeats for `BotAccessEntry`, `DiscoveryResult`, `CrawledPage`, and `RemoteCrawlOptions` in this file, and `RemoteScanReport` in `remote-audit.ts`.

Suggested change
export type BotAccessEntry = {
  bot: string;
  company: string;
  purpose: string;
  allowed: boolean;
};

export type DiscoveryResult = {
  robotsTxt: { exists: boolean; content: string | null; hasAiDisallow: boolean };
  llmsTxt: { exists: boolean; contentLength: number; content?: string | null };
  llmsFullTxt: { exists: boolean; contentLength: number };
  sitemap: { exists: boolean; urls: string[] };
  aiIndex: { exists: boolean; content?: string | null };
  homepage: { html: string; url: string } | null;
  botAccess: BotAccessEntry[];
};

export type CrawledPage = {
  url: string;
  pathname: string;
  html: string;
  title?: string;
  description?: string;
  content?: string;
  jsonLd?: object[];
  ogTags?: Record<string, string>;
};

export type RemoteCrawlOptions = {
  /** Per-request timeout in milliseconds. Default: 12000. */
  timeoutMs?: number;
  /** Maximum inner pages to crawl beyond the homepage. Default: 10. */
  maxPages?: number;
  /** Concurrent page fetches. Default: 5. */
  concurrency?: number;
  /** User-Agent header for all requests. */
  userAgent?: string;
};

Rule Used: Use type by default in TypeScript unless you spe... (source)

Learned From
cytonic-network/ai-frontend#48

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Comment thread src/core/remote-audit.ts
Comment on lines +9 to +22
export interface RemoteScanReport {
  url: string;
  scannedAt: string;
  discovery: DiscoveryResult;
  pages: CrawledPage[];
  audit: AuditResult;
  citability: {
    averageScore: number;
    pages: PageCitabilityResult[];
  };
  platformHints: PlatformHint[];
  botAccess: BotAccessEntry[];
  usesAeoJs: boolean;
}

P2 `RemoteScanReport` is a pure data container with no inheritance, so it should be a `type`, following the same convention as the other DTOs in this module.

Suggested change
export type RemoteScanReport = {
  url: string;
  scannedAt: string;
  discovery: DiscoveryResult;
  pages: CrawledPage[];
  audit: AuditResult;
  citability: {
    averageScore: number;
    pages: PageCitabilityResult[];
  };
  platformHints: PlatformHint[];
  botAccess: BotAccessEntry[];
  usesAeoJs: boolean;
};

Rule Used: Use type instead of interface for DTOs and sim... (source)

Learned From
cytonic-network/ai-frontend#48

Comment thread src/cli.ts
Comment on lines 213 to +245

async function scanRemote(targetUrl: string): Promise<RemoteScanReport> {
  if (typeof fetch !== 'function') {
    console.error('[aeo.js] URL checks require Node 18+ (global fetch).');
    process.exit(1);
  }

  console.error(`[aeo.js] Scanning ${targetUrl} ...`);
  const discovery = await discover(targetUrl);

  if (!discovery.homepage) {
    console.error(`[aeo.js] Could not reach ${targetUrl} — check the URL and try again.`);
    process.exit(1);
  }

  const pages = await crawlPages(discovery, targetUrl);
  console.error(`[aeo.js] Crawled ${pages.length} page(s).`);
  return buildRemoteReport(targetUrl, discovery, pages);
}

/** Report JSON for terminal output — drops raw page HTML to keep it readable. */
function remoteReportJson(report: RemoteScanReport): string {
  return JSON.stringify(
    {
      ...report,
      discovery: {
        ...report.discovery,
        homepage: report.discovery.homepage ? { url: report.discovery.homepage.url } : null,
      },
      pages: report.pages.map(({ html: _html, ...page }) => page),
    },
    null,
    2

P2 check --json and report --json produce identical output

Both `cmdCheckRemote` and `cmdReportRemote` delegate to `remoteReportJson(report)` in the JSON branch, so `npx aeo.js check mysite.com --json` and `npx aeo.js report mysite.com --json` emit byte-for-byte the same JSON. The extra detail that `cmdReportRemote` adds in text mode (platform hints loop, per-page citability list) is not reflected in the structured output. A note in the help text or docs would prevent confusion for users scripting around the output.
