Skip to content

Improve payload construction, size enforcement, and full scan correctness#23

Merged
Rishabh4275 merged 21 commits into
mainfrom
users/richawla/p4iStuff
Apr 4, 2026
Merged

Improve payload construction, size enforcement, and full scan correctness#23
Rishabh4275 merged 21 commits into
mainfrom
users/richawla/p4iStuff

Conversation

@Rishabh4275

Copy link
Copy Markdown
Collaborator

Motivation and Context

Several issues were identified during live testing of the Purview GitHub Action:

  1. Full scan included PR files — getAllRepoFiles() used filesystem glob which reads from the PR HEAD checkout, incorrectly including files added in the PR when the full scan
    should only cover the base/pre-PR state.
  2. No payload size enforcement for commits — commit-level PCA, inline PC, and upload signal requests had no size validation, causing 413 errors on large commits.
  3. PCA batch item count exceeded API limit — the API enforces a max of 64 items per processContentAsync batch, but the batching logic only checked byte size, causing
    ProcessContentBatch request has item count greater than 64 errors.
  4. Noisy logging — ~30 log statements inside nested loops (per-file, per-user, per-commit) flooded the output, making it hard to follow the actual flow.
  5. file-patterns: '' excluded subdirectories — minimatch('src/index.ts', '') returns false because * doesn't cross / boundaries, so patterns like * and *.ts silently
    excluded all nested files.
  6. accessedResources lacked structured identifiers and the isCrossPromptInjectionDetected field required by the API.
  7. dist/ tracked unnecessary build artifacts — the full tsc output (.d.ts, .js.map, subdirectories) was committed when only the ncc-bundled dist/index.js is needed at
    runtime.

Description

Full scan correctness

  • getAllRepoFiles(atRef?) now accepts an optional git ref. When provided (PR base SHA or push before SHA), it uses git ls-tree + git cat-file instead of filesystem glob,
    ensuring only files from the pre-event state are scanned.
  • Added git fetch --depth=1 origin to handle shallow clones where the base SHA isn't available locally.

Payload size enforcement (two-tier system)

  • maxContentSize (3 MB) — content data field limit; triggers chunking into multiple parts with Part: N in accessedResources names.
  • maxRequestSize (3.7 MB) — total request limit; triggers batch/request splitting. AccessedResources are included in the size calculation.
  • maxBatchItems (64) — API item count limit per PCA batch; enforced alongside byte size in both file and commit batch-splitting loops.
  • Added splitCommitRequests() helper that handles content splitting first, then accessedResources splitting for extreme cases (e.g., commits with 500+ files).
  • buildCommitProcessContentRequest now returns ProcessContentRequest[] (was singular) and buildCommitUploadSignalRequest now returns UploadSignalRequest[] (was singular),
    with all callers updated.

accessedResources improvements

  • Identifier format: PR: Commit: when PR is available; Commit: otherwise (full scans, push events).
  • File name format: Repo: File: Path: .
  • Commit name format: Repo: Commit: .
  • isCrossPromptInjectionDetected: false set on all accessed resources.
  • Part: N suffix appended to names only when content/resources are split.

Comment thread src/runner/gitHubActionsRunner.ts
@Rishabh4275 Rishabh4275 merged commit 105ff6f into main Apr 4, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants