Skip to content

#3038 Phase 0 spike: dual-stack System.Text.Json on the eager path (93% corpus parity)#3041

Draft
michaelcfanning wants to merge 1 commit into
mainfrom
users/mikefan/stj-spike
Draft

#3038 Phase 0 spike: dual-stack System.Text.Json on the eager path (93% corpus parity)#3041
michaelcfanning wants to merge 1 commit into
mainfrom
users/mikefan/stj-spike

Conversation

@michaelcfanning

Copy link
Copy Markdown
Member

Phase-0 spike for #3038: dual-stack System.Text.Json on the eager (de)serialization path, with measured corpus parity. The Newtonsoft path is untouched — this is purely additive.

tl;dr for the "weeks of effort" estimate

The full issue (deferred-reader rewrite + public-API break + AOT publish + measurement) is genuinely multi-week. Phase 0 is not. This branch gets the eager STJ path to 93% corpus parity in ~4 new files + one transform script, with the remaining 7% pinned to three narrow root causes. The scary parts of the issue turned out to be tractable:

Fear in #3038 What actually happened
"75 generated files need attribute changes" One idempotent PowerShell script (StjAttributeTransform.ps1) adds STJ attributes alongside Newtonsoft via a using Stj = … alias. Both serializers work against the same model. 84 files, one command.
"PropertyBag is the harder polymorphic case" JsonElement.GetRawText() + Utf8JsonWriter.WriteRawValue() are exactly the capture-raw-JSON primitives SerializedPropertyInfo needs. ~40 lines.
"Byte-parity across the golden-file corpus is the dominant validation cost" The existing test bar is semantic (JToken.DeepEquals), not byte-exact. STJ's escaping/number-format/indent differences never reach the assertion.
"We'd need jschema to emit STJ attributes first" The augment-don't-replace approach makes that moot, and the issue's frozen-source decision was already correct.

What's here

  • scripts/StjAttributeTransform.ps1 — the mechanical transform (already applied to the 84 model files in this commit; re-runnable, idempotent)
  • src/Sarif/Readers/SystemTextJson/SarifJsonConverters.cs — 9 JsonConverter<T> ports
  • src/Sarif/Readers/SystemTextJson/SarifJson.csSarifJson.Load/Save/Serialize/Options + 3 contract modifiers + [JsonSerializable] source-gen context
  • src/Test.UnitTests.Sarif/Readers/SystemTextJson/SarifJsonRoundTripTests.cs — parity harness over every embedded *.sarif fixture

Measured

75 / 81 pass (93%) on the v2 corpus + targeted property-bag tests.

Progression across four root-cause fixes during the spike: 28 → 29 → 36 → 34 → 75. Each jump was one well-understood gap:

  1. STJ doesn't honor [DataContract] opt-in → EnforceDataContractOptIn modifier
  2. [DataMember(EmitDefaultValue=false)] on every property → global WhenWritingDefault
  3. [DefaultValue(x)] suppresses at x, not CLR default → HonorDefaultValueAttribute modifier
  4. Location.Id : BigInteger has no built-in STJ converter → StjBigIntegerConverter

Remaining 6, categorized (none are research risk):

  • PrereleaseCompatibilityTransformer fixtures — pre-2.1.0 schema-upgrade edge cases, not representative SARIF
  • Result.Kind [DefaultValue(ResultKind.Fail)] suppression — the modifier should cover it; the attribute may live on the NotYetAutoGenerated/ partial half (which the csproj excludes from compile)
  • SerializedPropertyInfo null-literal handling — Newtonsoft stores the JSON literal "null"; STJ converter currently stores CLR null. ~2-line fix.

Zero regression on the existing Newtonsoft suite: 959/960 pass; the 1 failure is FileSystem_IsSymbolicLinkTestCases, an environment-dependent symlink test unrelated to serialization.

What this does NOT do (the actual long poles)

  • Deferred reader (LoadDeferred, SarifDeferredContractResolver, JsonPositionedTextReader) — still Newtonsoft, off the default path, sequenced last per Migrate JSON serialization Newtonsoft.Json -> System.Text.Json (source-gen) to unblock NativeAOT #3038.
  • Public-API break — 9 public types derive from Newtonsoft.Json.JsonConverter, 66 public members expose Newtonsoft types. This branch adds; it doesn't remove. The break plan is Phase 2.
  • NativeAOT publish + measurementSarifJsonContext (the source-gen root) compiles, but is parked behind reflection mode until the transform script also stamps [Stj.JsonIgnore] on non-[DataMember] public properties so the opt-in modifier can retire and fast-path comes back. That's the next mechanical step, then <PublishAot>true</PublishAot> on the multitool and measure.

How to extend

# Re-apply the transform after a model edit (idempotent):
.\scripts\StjAttributeTransform.ps1

# Run the parity harness:
dotnet test src\Test.UnitTests.Sarif -c Release -f net8.0 --filter SarifJsonRoundTrip

🤖 Generated with Claude Code

The frozen object model now carries STJ attributes ALONGSIDE the existing
Newtonsoft ones (via a `using Stj = ...` alias), so both serializers work
against the same model during the transition. The Newtonsoft path
(SarifLog.Load/LoadDeferred/Save and every existing call site) is
untouched and behaves identically.

  scripts/StjAttributeTransform.ps1
    One-time, idempotent mechanical transform of the 84 model files under
    Autogenerated/, NotYetAutoGenerated/, VersionOne/Autogenerated/:
      [DataMember(Name=x,...)]        -> + [Stj.JsonPropertyName(x)]
      [JsonProperty(DefaultValueH..)] -> + [Stj.JsonIgnore(WhenWritingDefault)]
      [JsonIgnore]                    -> + [Stj.JsonIgnore]
    The 84-file transform is one command, not 84 hand-edits.

  src/Sarif/Readers/SystemTextJson/SarifJsonConverters.cs
    9 JsonConverter<T> ports: Uri, DateTime, Version, SarifVersion,
    BigInteger, plain-enum factory, [Flags]-enum factory, and the
    PropertyBag/SerializedPropertyInfo pair (#3038 flags this as the
    harder case; JsonElement.GetRawText() + Utf8JsonWriter.WriteRawValue()
    are exactly the capture-raw-JSON primitives that pattern needs).

  src/Sarif/Readers/SystemTextJson/SarifJson.cs
    SarifJson.Load/Save/Serialize + configured Options. Three
    TypeInfoResolver modifiers replicate the contract semantics
    Newtonsoft gets for free from [DataContract]:
      - EnforceDataContractOptIn: drop any property without
        [JsonPropertyName] (suppresses SarifNodeKind, Tags, etc.)
      - HonorDefaultValueAttribute: ShouldSerialize at the
        [DefaultValue(x)] target, not the CLR default
      - AdmitInternalPropertiesBag: include the internal Properties
        bag every node overrides
    SarifJsonContext is the [JsonSerializable] source-gen root for
    NativeAOT; it compiles, but is parked behind reflection mode for
    the spike (re-enabled once the transform also stamps [Stj.JsonIgnore]
    on non-DataMember properties so the opt-in modifier can retire and
    fast-path serialization comes back).

  src/Test.UnitTests.Sarif/Readers/SystemTextJson/SarifJsonRoundTripTests.cs
    Parity harness: STJ self-idempotency + STJ-vs-Newtonsoft semantic
    equality (JToken.DeepEquals, the existing FileDiffingUnitTests bar)
    across every embedded *.sarif fixture (76 v2 logs).

Measured Phase-0 result on the 76-fixture v2 corpus + 5 targeted tests:

  75 / 81 pass  (93%)

  Progression across four root-cause fixes:
    28 -> 29  (resource-path bug in test harness)
       -> 36  (DataContract opt-in modifier; clears SarifNodeKind/Tags leak)
       -> 34  (global WhenWritingDefault for EmitDefaultValue=false)
       -> 75  (BigInteger converter; clears ~45 in one shot)

  Remaining 6, categorized:
    - 4x PrereleaseCompatibilityTransformer fixtures (pre-2.1.0 schema
      version-upgrade edge cases; not representative SARIF)
    - 1x Result.Kind [DefaultValue(ResultKind.Fail)] suppression
      (the modifier should cover it; needs a closer look at where the
      attribute lives across the Autogenerated/NotYetAutoGenerated split)
    - 1x SerializedPropertyInfo null-literal handling: Newtonsoft stores
      the JSON literal text; the STJ converter currently stores a CLR
      null for JsonTokenType.Null. ~2-line fix.

  Zero regression on the existing Newtonsoft suite: 959/960 pass; the 1
  failure is FileSystem_IsSymbolicLinkTestCases, an environment-
  dependent symlink test unrelated to serialization.

What this de-risks (per #3038 phasing):
  - The frozen-model attribute transform IS mechanical (one script).
  - The PropertyBag converter ports cleanly via STJ's raw-JSON
    primitives; no research risk.
  - The test bar is semantic (JToken.DeepEquals), not byte-exact, so
    STJ/Newtonsoft formatting differences are not a validation cost.
  - The deferred-reader port and the public-API break are genuinely the
    long poles; everything before them is hours, not weeks.

Co-Authored-By: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant