InformaticaProjectAnalysis — Design Document

Status: Implemented, v1.0.0
Author: ad25343
Created: 2026-03-09
Last Updated: 2026-03-19

Bottom line: Pre-conversion analysis for teams migrating from Informatica PowerCenter to open code. Analyzes all mapping XMLs together — not one at a time — to group structurally similar mappings into template candidates, build a dependency graph, and produce a conversion strategy for human review.


1. Context — The Informatica Migration Problem

Informatica PowerCenter is a legacy enterprise ETL (Extract, Transform, Load) platform widely used by banks, insurers, telecoms, and government agencies to move data between systems. Each unit of data transformation logic in PowerCenter is called a mapping — it defines how data flows from a source table, through transformations (lookups, expressions, filters, aggregations), and into a target table. A typical enterprise PowerCenter environment contains dozens to hundreds of these mappings, all exportable as XML files.

If your organization runs PowerCenter, you already know its gaps. The platform does what it was built to do — batch ETL on structured data — but it has not kept pace with what engineering teams are now being asked to deliver. The most acute gap is GenAI: business teams are demanding AI-enabled pipelines, and Informatica PowerCenter has no native GenAI capability. It cannot call an LLM (Large Language Model), cannot integrate with vector databases, and has no roadmap for AI-augmented data flows. Staying on PowerCenter means the data infrastructure cannot participate in what the business is trying to do next.

The destination is open code (Python scripts, dbt models, PySpark jobs): code that teams own, version-control, and deploy without platform lock-in, and that integrates with any model or API in the ecosystem. This platform-to-code migration needs to produce well-structured, maintainable source code, not just functionally equivalent scripts. The goal is to convert existing logic faithfully, not rewrite the data architecture.

Converting each mapping one at a time, in isolation, produces poor results: 14 mappings that follow the same pattern become 14 separate scripts instead of one parameterized template. Shared lookup tables are redefined independently. There is no dependency graph, no project structure. You left a proprietary tool and landed in a code mess.

InformaticaProjectAnalysis solves this by analyzing the entire collection of mappings before any conversion runs, identifying cross-mapping patterns, and producing a conversion strategy that a human reviews before conversion starts.


2. Core Principles

  1. Analyze all N mappings together, not in partitions. Cross-mapping references (a mapping that looks up a table produced by another mapping) can only be detected when the full project is visible.

  2. Pattern grouping is the primary goal. The analysis determines which mappings share enough structural similarity that one template + config replaces N separate files. A project of 50 mappings might collapse into 8 templates + configs and 12 unique files instead of 50 independent scripts.

  3. The strategy is a recommendation with evidence. For every pattern group, the document shows: the member mappings, the structural evidence (shared fingerprint), the parameter differences (what varies), the recommended template approach, and a confidence level. Tech leads validate the groupings before conversion runs.

  4. We are converting, not rewriting. The analysis observes what exists and recommends smart conversion — collapsing identical patterns into templates. It does not redesign the data architecture or suggest how the project "should have been built."

  5. Variation handling is explicit. Mappings are grouped by structural similarity, and variation within groups is surfaced transparently so humans can confirm or override.


3. Target Personas

Primary: Data Engineering Tech Lead
Reviews the strategy document in detail. Validates pattern groupings, confirms or adjusts mapping-to-group assignments, identifies edge cases. Needs structural evidence, dependency graphs, and per-mapping confidence levels.

Secondary: Engineering Leadership
Reviews the summary layer. Needs mapping count, pattern group count, complexity distribution, risk flags, and estimated conversion scope reduction. Does not need per-mapping technical detail.


4. Workflow

Input: N mapping XMLs + optional workflow XMLs + parameter files
    │
    ▼
Phase 1 — Discovery (deterministic parsing + AI-assisted interpretation)
    Parse all N mappings.
    Extract structural fingerprints.
    Build cross-mapping dependency graph.
    AI interprets: custom SQL overrides, expression logic classification,
    mapplet black boxes, implicit dependencies.
    │
    ▼
Phase 2 — Pattern Grouping (AI-assisted)
    Cluster mappings by structural fingerprint (spine + complexity profile).
    Within each cluster, diff parameters to confirm same-pattern-different-config.
    Name patterns in human terms.
    Identify which parameters externalize to config.
    Flag edge cases with confidence levels.
    │
    ▼
Phase 3 — Strategy Document Generation
    Produce PDF (leadership summary + tech lead detail) and Excel workbook
    (pattern groups, dependency graph, shared assets, per-mapping assignments).
    │
    ▼
Phase 4 — Human Gate
    Tech leads + leadership review the strategy in the UI.
    Confirm, adjust, or override groupings.
    │
    ▼
Phase 5 — Strategy Delivery
    Approved strategy available as JSON, PDF, and Excel downloads.
    Strategy JSON can be consumed by any downstream conversion workflow.

5. Phase 1 — Discovery

5.1 Deterministic Parsing (no AI)

For each of the N mappings, the parser extracts:

  • Transformation types and their order (the "spine")
  • Source tables and their connection attributes (DBDNAME)
  • Target tables and their connection attributes
  • Lookup transformation targets (which tables are used as lookups)
  • Connector edges (which transformation connects to which)
  • Expression transformation bodies
  • Mapplet instances (expanded if definitions present, flagged if not)
  • SQL overrides on Source Qualifiers
  • Parameter variables ($$VARs)

5.2 Cross-Mapping Graph (no AI)

From the per-mapping parse results, build the project-level graph:

  • Nodes: every source table, target table, and mapping
  • Edges: mapping A writes to TABLE_X, mapping B has a Lookup against TABLE_X → B depends on A
  • Shared assets: tables referenced as Lookup sources by 3+ mappings
  • Repeated expressions: expression fragments appearing verbatim in 4+ mappings
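The edge and shared-asset rules above can be sketched in a few lines. This is a minimal illustration that assumes each parse result has already been reduced to `targets` and `lookups` lists; the function names and the FACT/AGG table names in the sample data are hypothetical.

```python
from collections import defaultdict


def build_dependency_edges(parsed: dict) -> list:
    """Return (producer, consumer, table) edges: consumer looks up a table the producer writes."""
    writers = defaultdict(set)
    for name, info in parsed.items():
        for table in info["targets"]:
            writers[table].add(name)

    edges = set()
    for name, info in parsed.items():
        for table in info["lookups"]:
            for producer in writers.get(table, ()):
                if producer != name:  # a self-lookup is not a cross-mapping dependency
                    edges.add((producer, name, table))
    return sorted(edges)


def shared_lookup_tables(parsed: dict, min_refs: int = 3) -> dict:
    """Tables referenced as Lookup sources by at least `min_refs` mappings."""
    refs = defaultdict(set)
    for name, info in parsed.items():
        for table in info["lookups"]:
            refs[table].add(name)
    return {t: sorted(ms) for t, ms in refs.items() if len(ms) >= min_refs}


project = {
    "m_dim_customer": {"targets": ["DIM_CUSTOMER"], "lookups": []},
    "m_fact_daily_txn": {"targets": ["FACT_DAILY_TXN"], "lookups": ["DIM_CUSTOMER"]},
    "m_fact_loan_origination": {"targets": ["FACT_LOAN"], "lookups": ["DIM_CUSTOMER"]},
    "m_agg_monthly_summary": {"targets": ["AGG_MONTHLY"], "lookups": ["DIM_CUSTOMER"]},
}
edges = build_dependency_edges(project)
shared = shared_lookup_tables(project)
```

With this sample, DIM_CUSTOMER qualifies as a shared asset because three mappings look it up, and each of those mappings depends on m_dim_customer.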

5.3 AI-Assisted Interpretation

The parser extracts the facts; AI fills in the gaps:

  • Custom SQL in Source Qualifiers: Understanding what hand-written SQL overrides do — what tables they join, what filters apply, whether two SQL overrides are structurally equivalent despite different table/column names.
  • Expression logic classification: Determining whether two Expression transformations implement the same pattern with different parameters vs. fundamentally different logic. IIF(ISNULL(IN_STATUS), 'UNKNOWN', IN_STATUS) vs. IIF(ISNULL(IN_CATEGORY), 'DEFAULT', IN_CATEGORY) = same pattern. A 40-line DECODE cascade = different.
  • Mapplet inference: When a mapplet definition is missing, infer its likely purpose from input/output ports and wiring context.
  • Implicit dependencies: Dependencies hidden in SQL overrides or stored procedure calls that don't appear in the structured XML.
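The "same pattern, different parameters" idea for expressions can be illustrated without any AI by reducing an expression to its shape. The sketch below is a deterministic pre-check only, not the AI classification step described above; `expression_shape` is a hypothetical helper, and it treats any identifier not followed by "(" as a port, so keywords such as AND or NULL would need special handling in practice.

```python
import re

# String literals, numbers, identifiers, then any single non-space character.
_TOKEN = re.compile(r"'[^']*'|\d+(?:\.\d+)?|[A-Za-z_$][A-Za-z0-9_$]*|\S")


def expression_shape(expr: str) -> str:
    """Replace literals and port names with placeholders so only the structure remains."""
    ports = {}
    out = []
    tokens = _TOKEN.findall(expr)
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if tok.startswith("'"):
            out.append("<STR>")
        elif tok[0].isdigit():
            out.append("<NUM>")
        elif (tok[0].isalpha() or tok[0] in "_$") and nxt != "(":
            # Same port name -> same placeholder, numbered by first appearance.
            out.append(ports.setdefault(tok.upper(), f"P{len(ports)}"))
        else:
            out.append(tok.upper())  # function names and punctuation are kept
    return "".join(out)


shape_status = expression_shape("IIF(ISNULL(IN_STATUS), 'UNKNOWN', IN_STATUS)")
shape_category = expression_shape("IIF(ISNULL(IN_CATEGORY), 'DEFAULT', IN_CATEGORY)")
shape_decode = expression_shape("DECODE(IN_CODE, 1, 'A', 2, 'B', 'Z')")
```

The two IIF expressions reduce to the same shape, while the DECODE does not. Cases where the shapes differ but the intent might still match are the ones left to the AI step.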

6. Phase 2 — Pattern Grouping

6.1 Structural Fingerprinting

Each mapping's transformation topology is reduced to a canonical signature — the "spine": the ordered sequence of transformation types from source to target, ignoring branches.

Example: SQ → EXP → LKP → LKP → TARGET

Two mappings with the same spine are candidates for the same pattern group.
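A minimal sketch of grouping by spine follows. It assumes each mapping has already been linearized into an ordered list of transformation types (branches dropped); the abbreviation table and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical abbreviations; unknown types fall back to their upper-cased name.
ABBREVIATIONS = {
    "Source Qualifier": "SQ",
    "Expression": "EXP",
    "Lookup Procedure": "LKP",
    "Filter": "FIL",
    "Target": "TARGET",
}


def spine(types: list) -> str:
    """Canonical signature: ordered transformation types joined with arrows."""
    return " → ".join(ABBREVIATIONS.get(t, t.upper()) for t in types)


def group_by_spine(mappings: dict, min_group_size: int = 2):
    """Split mappings into candidate groups (shared spine) and unique mappings."""
    groups = defaultdict(list)
    for name, types in mappings.items():
        groups[spine(types)].append(name)
    candidates = {s: sorted(m) for s, m in groups.items() if len(m) >= min_group_size}
    unique = sorted(n for m in groups.values() if len(m) < min_group_size for n in m)
    return candidates, unique


sample = {
    "m_load_customer": ["Source Qualifier", "Expression", "Target"],
    "m_load_account": ["Source Qualifier", "Expression", "Target"],
    "m_scd_branch": ["Source Qualifier", "Lookup Procedure", "Router", "Update Strategy", "Target"],
}
candidates, unique = group_by_spine(sample)
```

Spine equality only nominates candidates; the variation tiers in the next section decide whether members actually share a template.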

6.2 Variation Tiers

Within a fingerprint group, variation is classified into three tiers:

Tier 1 — Parameter variation. Structurally identical. Only table names, column names, filter values, connection strings differ. One template, one config file. No question, these group together.

Tier 2 — Minor structural variation. Core flow is the same, but one mapping has an extra Expression (e.g., adds ETL_LOAD_DATE), or one has a Filter that another doesn't, or one has 3 Lookups vs. 5. The template accommodates variation via config flags (e.g., add_etl_metadata: true).

Tier 3 — Fundamental structural variation. Different transformation types, different flow shapes. These don't group. Convert individually.

The boundary between Tier 2 and Tier 3 is determined by spine + complexity profile. Two mappings match when they have the same spine AND their complexity at each step is in the same ballpark.

6.3 Per-Group Evidence

For each pattern group, the strategy document shows:

Pattern Group: Truncate & Load (14 mappings)
Core spine: SQ → EXP → TARGET
Variations found:
  - 11 mappings: exact match, differ only by table/columns
  - 2 mappings: extra Filter before TARGET
    (minor — recommend config flag `has_filter: true`)
  - 1 mapping: Expression contains 25-line business logic unlike the others
    (flag for tech lead — does this belong in this group or convert individually?)
Confidence: HIGH (11), MEDIUM (2), LOW (1)

6.4 Classification by Structural Behavior (not naming conventions)

Real-world Informatica projects do not follow consistent naming conventions. Tables may be called ACCT_LOAD, PROCESS_TRANSACTIONS, RPT_SUMMARY, or TBL_047_PROC.

Classification is based on transformation topology and graph position, not names:

| Signal | What it tells you |
| --- | --- |
| DBDNAME on SOURCE vs TARGET | Which tables are OLTP vs warehouse |
| Lookup TABLE pointing at a table | That table is a shared reference/dimension |
| Self-lookup (LKP points at own target) | SCD2 (slowly changing dimension), certain |
| Aggregator transformation present | Aggregate/summary table |
| Number of Lookup transformations | Fan-out of dimension joins → likely fact table |
| Router + Update Strategy together | SCD2 or conditional load (dimension pattern) |
| Union transformation | Multi-source merge (consolidation table) |
| Lookup in-degree across project | Shared dimension vs one-off lookup |
| Mapping produces 2+ targets | Router/split output; affects project structure |

Naming conventions are one optional hint that gets folded in if present, not relied upon.


7. Phase 3 — Strategy Document

7.1 Format

Three deliverables:

PDF — human-readable strategy document with two layers:

  • Leadership summary (page 1): mapping count, pattern group count, unique mapping count, complexity distribution, risk flags, estimated scope reduction
  • Tech lead detail (remaining pages): per-group evidence, dependency graph (DAG) visualization, shared asset catalogue, per-mapping assignments with confidence levels

Excel workbook — reviewable tabular data:

  • Sheet 1: Pattern Groups (group name, member mappings, spine, variation notes, confidence)
  • Sheet 2: Dependency Graph (source mapping, target mapping, edge type, shared table)
  • Sheet 3: Shared Assets (table/expression, referenced by which mappings, recommendation)
  • Sheet 4: Per-Mapping Assignments (mapping name, assigned group, confidence, flags, notes)
  • Sheet 5: Risk Flags (mapping name, flag type, severity, description)

Strategy JSON — machine-readable format. Schema versioned. Contains pattern groups with members, unique mappings with reasons, shared assets, dependency graph (a directed acyclic graph / DAG), and execution order (topologically sorted into parallel stages).

7.2 Honest Uncertainty

The strategy document includes an explicit uncertainty section:

  • Mappings classified with HIGH confidence
  • Mappings classified with MEDIUM confidence (tech lead should confirm)
  • Mappings classified with LOW confidence or unclassifiable (needs human review)
  • Patterns the analysis couldn't read (custom SQL too complex, missing mapplet definitions, etc.)

The output is: "here are the 38 mappings I can classify with high confidence, here are 8 that need a human to confirm, and here are 4 I couldn't read at all."


8. Phase 4 — Human Gate

The strategy is reviewed in the standalone web UI. Tech leads can:

  • View the full strategy document (PDF rendering in UI)
  • Browse pattern groups and their member mappings
  • View the dependency graph
  • Confirm or override individual mapping-to-group assignments
  • Add notes per mapping or per group
  • Approve the strategy or reject (request re-analysis)

9. Phase 5 — Strategy Delivery

The approved strategy is available in three formats:

  • JSON — machine-readable, schema-versioned. Suitable as input to any downstream conversion tool or workflow.
  • PDF — human-readable strategy document.
  • Excel — reviewable tabular data.

The strategy JSON is the canonical output. It contains: pattern groups with members and externalized parameters, unique mappings with reasons, shared assets, dependency DAG, and execution order.


10. Standalone Tool Architecture

InformaticaProjectAnalysis is a standalone application with its own:

  • Web UI (for uploading project configs, viewing strategy, human review gate)
  • REST API (for programmatic access)
  • Database (job tracking, strategy persistence, review decisions)
  • PDF + Excel generation pipeline

Architecture:

  • FastAPI backend (port 8090)
  • SSE progress streaming
  • SQLite persistence
  • Human review gate with structured decisions
  • Audit trail

10.1 Project Configuration File

The primary input is a *.project.yaml file that defines the full migration scope. This is the single source of truth for the analysis — not individual XML uploads.

project:
  name: "FirstBank DWH Migration"
  version: "1.0"
  owner: "Data Engineering"

source:
  type: folder                    # folder | repo | zip | s3
  location: "/path/to/informatica/exports/"

scope:
  mappings:
    include: ["mappings/**/*.xml"]
    exclude: ["mappings/archive/**"]
  workflows:
    include: ["workflows/**/*.xml"]
  parameters:
    include: ["parameter_files/*.xml"]
  default_parameter_env: "dev"

analysis:
  fingerprint_strictness: "moderate"
  min_group_size: 2
  confidence_threshold: 0.7

review:
  tech_lead:
    name: "Jane Smith"
    email: "jane.smith@firstbank.com"
  leadership:
    name: "Mike Johnson"
    email: "mike.johnson@firstbank.com"

output:
  strategy_format: "json"
  output_dir: "/output/firstbank/"

notifications:
  webhook_url: ""
  events:
    on_analysis_complete: true
    on_strategy_ready: true
    on_review_approved: true

Source types supported:

  • folder — local path; tool scans recursively using scope globs
  • repo — Git URL + branch + subfolder path; tool clones and scans
  • zip — uploaded ZIP archive; tool extracts and scans
  • s3 — S3 bucket path; tool pulls and scans
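For the folder source type, applying the scope globs might look like the sketch below. It uses only the standard library; `resolve_scope` is a hypothetical helper, and matching the exclude patterns with `fnmatch` (where `*` also matches path separators) is an assumption about how excludes behave.

```python
import tempfile
from fnmatch import fnmatch
from pathlib import Path


def resolve_scope(root: Path, include: list, exclude: list = ()) -> list:
    """Return paths (relative to root, POSIX style) matching include but not exclude globs."""
    matched = set()
    for pattern in include:
        matched.update(p for p in root.glob(pattern) if p.is_file())

    def rel(path: Path) -> str:
        return path.relative_to(root).as_posix()

    return sorted(
        rel(p) for p in matched
        if not any(fnmatch(rel(p), pattern) for pattern in exclude)
    )


# Build a throwaway export folder to show the include/exclude behaviour.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel_path in ["mappings/m_a.xml", "mappings/core/m_b.xml", "mappings/archive/m_old.xml"]:
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("<MAPPING/>")
    selected = resolve_scope(root, ["mappings/**/*.xml"], ["mappings/archive/**"])
```

Here the archived mapping is excluded and both remaining files are selected, including the one directly under mappings/.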

10.2 Three Operating Modes

Interactive — user uploads the project config through the UI or provides a folder/repo path. Watches the analysis run in real time. Reviews the strategy in the browser. Approves or overrides.

Watcher — tool polls a directory for *.project.yaml files. When one appears or changes, it triggers analysis automatically.

CI/CD — a pipeline step drops the project config and triggers analysis via API. Strategy document posted as a PR artifact or comment. Review happens in the PR workflow or the UI.

10.3 Sample Project Config

A working sample config is provided at: sample_data/firstbank/firstbank_migration.project.yaml

This config points at the 50-mapping FirstBank test project and is ready to use for development and testing of the analysis pipeline.


11. Strategy JSON Schema

The approved strategy is a JSON file — the canonical machine-readable output.

{
    "strategy_version": 1,
    "project_name": "FirstBank_Q1_Migration",
    "analysis_job_id": "uuid",
    "analyzed_at": "ISO datetime",

    "summary": {
        "total_mappings": 50,
        "pattern_groups": 8,
        "template_candidates": 36,
        "unique_mappings": 14,
        "scope_reduction_pct": 56
    },

    "pattern_groups": [
        {
            "group_id": "trunc_load_01",
            "group_name": "Truncate & Load",
            "spine": "SQ → EXP → TARGET",
            "members": [
                {
                    "mapping_name": "m_load_customer",
                    "confidence": "HIGH",
                    "variation_tier": 1,
                    "variation_notes": null,
                    "override": null
                },
                {
                    "mapping_name": "m_load_account",
                    "confidence": "MEDIUM",
                    "variation_tier": 2,
                    "variation_notes": "Extra Filter before target",
                    "override": "confirmed by tech lead"
                }
            ],
            "externalized_params": ["source_table", "target_table", "column_list", "filter_condition"],
            "template_hints": "Single config-driven truncate-and-load with optional filter"
        }
    ],

    "unique_mappings": [
        {
            "mapping_name": "m_complex_reconciliation",
            "reason": "Tier 3 — fundamentally different structure, no pattern match",
            "risk_flags": ["CUSTOM_SQL_OVERRIDE", "5_JOINER_TRANSFORMATIONS"]
        }
    ],

    "shared_assets": [
        {
            "table_name": "DIM_CUSTOMER",
            "referenced_by": ["m_fact_daily_txn", "m_fact_loan_origination", "m_agg_monthly_summary"],
            "reference_type": "lookup",
            "recommendation": "shared reference: referenced by 3 mappings"
        }
    ],

    "dependency_dag": [
        {"from": "m_stg_customer", "to": "m_dim_customer", "via": "STG_CUSTOMER"},
        {"from": "m_dim_customer", "to": "m_fact_daily_txn", "via": "DIM_CUSTOMER"}
    ],

    "execution_order": [
        ["m_stg_customer", "m_stg_account", "m_stg_transactions"],
        ["m_dim_customer", "m_dim_account"],
        ["m_fact_daily_txn", "m_fact_loan_origination"],
        ["m_agg_monthly_summary"]
    ],

    "review": {
        "approved_at": "ISO datetime",
        "approved_by": "reviewer_name",
        "overrides": [],
        "notes": ""
    }
}

Schema versioned via strategy_version field so the format can evolve.
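One way to derive `execution_order` from `dependency_dag` is a level-by-level topological sort. The sketch below uses the standard library's `graphlib.TopologicalSorter` (Python 3.9+); `execution_stages` is a hypothetical helper, and the tool itself may compute stages differently.

```python
from graphlib import TopologicalSorter


def execution_stages(mappings: list, dag: list) -> list:
    """Group mappings into stages; everything in a stage can run in parallel."""
    sorter = TopologicalSorter({m: set() for m in mappings})
    for edge in dag:
        sorter.add(edge["to"], edge["from"])  # "to" must wait for "from"
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic

    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all nodes whose predecessors are done
        stages.append(ready)
        sorter.done(*ready)
    return stages


dag = [
    {"from": "m_stg_customer", "to": "m_dim_customer", "via": "STG_CUSTOMER"},
    {"from": "m_dim_customer", "to": "m_fact_daily_txn", "via": "DIM_CUSTOMER"},
]
mappings = ["m_stg_customer", "m_stg_account", "m_dim_customer", "m_fact_daily_txn"]
stages = execution_stages(mappings, dag)
```

Mappings with no edges, such as m_stg_account here, land in the first stage alongside the other roots.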


12. Separation of Concerns

The analysis tool observes and surfaces. It does NOT prescribe target stacks, warehouses, or orchestration platforms. Those decisions belong to the humans reviewing the strategy and whatever conversion tools they choose.

What the analysis tool DOES surface — characteristics that inform downstream decisions:

12.1 Structural Characteristics (per mapping and per group)

  • Dependency depth: "4-layer chain — staging → dimensions → facts → aggregations"
  • Cross-mapping dependencies: "12 mappings have Lookup references to other mappings' targets"
  • Parallelism potential: "Stage 2 has 3 independent tracks that can run concurrently"
  • Complexity distribution: "15 simple, 20 medium, 15 complex"

12.2 Transformation Characteristics

  • "Pattern Group 4 (SCD2) requires merge/upsert or snapshot semantics"
  • "Pattern Group 7 (risk/regulatory) involves 3-source joins with complex expressions"
  • "7 simple dimension loads are pure SQL — no transformation framework overhead needed"
  • "1 mapping uses Union transformation across 3 heterogeneous sources"
  • "2 mappings contain custom SQL overrides that bypass the transformation layer"

12.3 Orchestration Characteristics

  • The dependency DAG itself — which mappings must run before others
  • Stage boundaries — where parallelism is safe vs. where serialization is required
  • Error propagation paths — if mapping A fails, which downstream mappings are affected
  • Volume indicators — source table sizes where available from metadata
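Error propagation paths amount to downstream reachability in the DAG. A minimal sketch, assuming edges in the `{"from", "to"}` shape used by the strategy JSON; `downstream_of` is a hypothetical name:

```python
from collections import defaultdict, deque


def downstream_of(failed: str, dag: list) -> list:
    """Every mapping that transitively depends on `failed` (breadth-first walk)."""
    children = defaultdict(list)
    for edge in dag:
        children[edge["from"]].append(edge["to"])

    seen, queue = set(), deque([failed])
    while queue:
        for child in children[queue.popleft()]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


dag = [
    {"from": "m_stg_customer", "to": "m_dim_customer"},
    {"from": "m_dim_customer", "to": "m_fact_daily_txn"},
    {"from": "m_stg_account", "to": "m_dim_account"},
]
affected = downstream_of("m_stg_customer", dag)
```

A failure in m_stg_customer affects the customer dimension and the fact table built on it, but not the independent account track.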

12.4 Risk Characteristics

  • Unmapped expressions, missing mapplet definitions, custom SQL
  • Confidence distribution across groupings
  • Mappings that resist classification
  • Patterns the tool couldn't interpret

The strategy document presents all of this as evidence for humans to act on — never as prescriptive decisions about technology choices.


13. Design Decisions (resolved)

13.1 Re-Analysis on Overrides

Two tiers of overrides:

  • Simple overrides (move a mapping between groups, confirm/reject an assignment, add notes): applied directly to the strategy JSON. No re-analysis. The structural evidence doesn't change — only the human's decision about where a mapping belongs.

  • Structural changes (split a group, merge groups, regroup): trigger a lightweight re-validation pass. Re-runs Phase 2 (pattern grouping) against the existing Phase 1 output with the overrides as constraints. No full re-parse. Validates that the new grouping holds structurally and updates variation tiers and confidence levels.

13.2 Incremental Updates

Phase 1 caches parse results keyed by file content hash (SHA-256). When new mappings are added or existing ones change:

  • Only new/changed XMLs are re-parsed (cache hit for unchanged files)
  • Phase 2 (pattern grouping) always runs on the full project — fast because parsing is cached
  • Phase 3 (strategy generation) runs on the full project

The strategy document includes a diff section: "5 new mappings added since last analysis — here's what changed in the groupings and dependency graph."

Previous human decisions (overrides, confirmations) are preserved unless the structural change invalidates them. If a confirmed mapping's XML changed, the confirmation is cleared and the mapping is re-evaluated.

13.3 UI Design

Three views, same underlying data:

Dashboard view (leadership)
Project summary on one page: total mappings, pattern groups found, unique mappings, complexity distribution (heat map), dependency depth, estimated scope reduction ("50 mappings → 8 templates + 12 unique files"). Printable. The approve/reject gate lives here: leadership makes the call after tech leads have reviewed the detail.

Pattern groups view (tech leads)
Left panel: list of all pattern groups with member count and confidence indicator. Right panel (on group select):

  • Spine visualization (transformation flow diagram)
  • Member mappings table with variation tier, confidence, and flags per mapping
  • Parameter differences table (what varies across members)
  • Evidence section (why these were grouped — structural fingerprint match details)
  • Per-mapping override controls: "confirm", "move to group...", "convert individually"
  • Notes field per mapping and per group

Dependency graph view (both audiences)
Interactive dependency graph visualization. Nodes are mappings, colored by pattern group. Edges are dependencies (target → lookup references). Click a node for details. Execution stages highlighted. Critical path shown. Error propagation paths visible (if mapping A fails, which downstream mappings are affected).

All three views are React components. PDF and Excel exports are generated from the same data model that powers the UI.

The dependency graph view uses vis-network v9.1.9 (CDN, not vis.js 4.x which is unmaintained). Nodes are colored by execution stage (hex colors); shared assets render as diamonds. Layout: { hierarchical: { direction: "UD", sortMethod: "directed" } }.


14. Design Decisions — v0.3.0 (AI Enhancement)

14.1 AI Agent Architecture

Three optional AI agents run as post-processing steps in the orchestrator thread worker. All use asyncio.run(), which is safe inside a ThreadPoolExecutor worker because no event loop is running in that thread.

  • Expression Classifier — batches non-trivial ports (>20 chars or containing function calls) into groups of ≤50. Uses a single Claude call per batch with structured output. Fallback: port classified as UNKNOWN on parse failure.

  • SQL Interpreter — one Claude call per Source Qualifier SQL override. Extracts source_tables, join_conditions, filter_logic, business_intent. JSON response only; fallback to {} on parse failure. Skips empty SQL.

  • Confidence Refiner — re-evaluates MEDIUM and LOW group members only (HIGH is already confident). Per-member Claude call asking: "Does this mapping fit this group?" Updated confidence + reason written back to GroupMember.confidence.
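The Expression Classifier's filtering and batching rule can be sketched as below. The check for "containing function calls" is approximated by looking for an opening parenthesis, which is an assumption, as are the function names and the port dictionary shape; the Claude call itself is omitted.

```python
def is_nontrivial(expression: str) -> bool:
    """Non-trivial: longer than 20 characters, or looks like it contains a function call."""
    return len(expression) > 20 or "(" in expression


def batch_ports(ports: list, batch_size: int = 50) -> list:
    """Keep non-trivial ports and split them into batches of at most `batch_size`."""
    selected = [p for p in ports if is_nontrivial(p["expression"])]
    return [selected[i:i + batch_size] for i in range(0, len(selected), batch_size)]


ports = (
    [{"name": f"OUT_{i}", "expression": f"IIF(ISNULL(IN_{i}), 'UNKNOWN', IN_{i})"} for i in range(120)]
    + [{"name": f"PASS_{i}", "expression": f"IN_{i}"} for i in range(10)]  # pass-through ports
)
batches = batch_ports(ports)
```

With 120 non-trivial ports this yields three batches (50, 50, 20), so three model calls instead of 120; the ten pass-through ports are never sent.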

14.2 AI Feature Gating

All three AI agents are gated by AnalysisSettings flags. This ensures:

  • Tests run without an ANTHROPIC_API_KEY (all flags default to false in test env)
  • Users who want deterministic results can disable AI entirely
  • ai_refine_confidence defaults to false — it's slower and most projects don't need it

14.3 vis-network Graph Design

vis-network was chosen over vis.js 4.x (unmaintained) and d3-dag (no CDN, requires build step). The CDN-only constraint (single index.html, no build step) made vis-network the only viable option for an interactive DAG with click events.

Node click opens a detail panel (upstream/downstream/shared assets) in the existing right panel — reuses the same pattern as the Pattern Groups view to minimize frontend state.


15. Design Decisions — v0.4.0 (Watcher + Incremental)

15.1 Parse Cache Design

Cache keyed by SHA-256 of file content (not path). This means:

  • Moving a file doesn't invalidate its cache entry
  • Renaming a file doesn't cause re-parse
  • Only actual content changes trigger re-parse

Cache lives in the SQLite database (same file as jobs). No separate cache store.
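A content-addressed cache of this kind can be sketched with `hashlib` and `sqlite3`. The table name, column names, and the `ParseCache` class are hypothetical; the sketch stores results as JSON text, which is an assumption about the storage format.

```python
import hashlib
import json
import sqlite3


class ParseCache:
    """Parse results keyed by the SHA-256 of file content, not by path."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS parse_cache "
            "(sha256 TEXT PRIMARY KEY, result TEXT NOT NULL)"
        )

    def get_or_parse(self, content: bytes, parse):
        """Return (result, cache_hit). `parse` runs only on a cache miss."""
        key = hashlib.sha256(content).hexdigest()
        row = self.conn.execute(
            "SELECT result FROM parse_cache WHERE sha256 = ?", (key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0]), True
        result = parse(content)
        self.conn.execute(
            "INSERT INTO parse_cache (sha256, result) VALUES (?, ?)",
            (key, json.dumps(result)),
        )
        self.conn.commit()
        return result, False


calls = []


def fake_parse(content: bytes) -> dict:
    calls.append(content)
    return {"size": len(content)}


cache = ParseCache(sqlite3.connect(":memory:"))
first, first_hit = cache.get_or_parse(b"<MAPPING/>", fake_parse)
second, second_hit = cache.get_or_parse(b"<MAPPING/>", fake_parse)  # same bytes, any path
```

Because the key depends only on the bytes, a moved or renamed file with unchanged content hits the cache.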

15.2 Override Preservation

preserved_overrides is a list of {mapping_name, action, target_group} dicts accumulated across runs. After Phase 2 (pattern grouping), the orchestrator replays these overrides — re-applying CONFIRM/INDIVIDUALIZE decisions automatically.

If a mapping's XML changed between runs, its preserved override is skipped (the underlying evidence changed, so the human decision may no longer be valid). A log entry records the skip.
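The replay step might look like the following sketch. It assumes CONFIRM pins a mapping to `target_group` and INDIVIDUALIZE pulls it out of every group; those semantics, the function name, and the returned shapes are assumptions, not the orchestrator's actual code.

```python
def replay_overrides(groups: dict, preserved_overrides: list, changed_mappings: set):
    """Re-apply stored human decisions to freshly computed groups.

    Returns (groups, individual, skipped); the input dict is not mutated.
    """
    result = {gid: list(members) for gid, members in groups.items()}
    individual, skipped = [], []

    for override in preserved_overrides:
        name = override["mapping_name"]
        action = override["action"]
        if name in changed_mappings:
            skipped.append(name)  # evidence changed, so the decision may be stale
            continue
        if action not in ("CONFIRM", "INDIVIDUALIZE"):
            continue  # unknown actions are ignored in this sketch

        # Remove the mapping from wherever Phase 2 placed it.
        for members in result.values():
            if name in members:
                members.remove(name)

        if action == "INDIVIDUALIZE":
            individual.append(name)
        else:  # CONFIRM
            result.setdefault(override["target_group"], []).append(name)

    return result, individual, skipped


groups = {"trunc_load_01": ["m_load_customer", "m_load_account", "m_load_branch"]}
overrides = [
    {"mapping_name": "m_load_account", "action": "CONFIRM", "target_group": "trunc_load_01"},
    {"mapping_name": "m_load_branch", "action": "INDIVIDUALIZE", "target_group": None},
    {"mapping_name": "m_load_customer", "action": "INDIVIDUALIZE", "target_group": None},
]
new_groups, individual, skipped = replay_overrides(groups, overrides, {"m_load_customer"})
```

In this example m_load_customer keeps its computed placement because its XML changed, and its override is reported as skipped so it can be logged.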

15.3 Webhook HMAC Signing

HMAC-SHA256 with WEBHOOK_SECRET. The signature covers the full JSON body (sorted keys, compact serialization). Signing happens before the HTTP POST in notifier.py. The receiving system verifies with the same secret.
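A minimal sketch of the signing and verification described above, using the standard `hmac`, `hashlib`, and `json` modules. The function names and the hex encoding of the signature are assumptions; how notifier.py transmits the signature (for example, which header it uses) is not specified here.

```python
import hashlib
import hmac
import json


def sign_payload(payload: dict, secret: str):
    """Serialize with sorted keys and compact separators, then sign the exact bytes sent."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    signature = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return body, signature


def verify_signature(body: bytes, signature: str, secret: str) -> bool:
    """Receiver side: recompute over the raw body and compare in constant time."""
    expected = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)


body, signature = sign_payload({"job_id": "abc", "event": "on_strategy_ready"}, "s3cret")
```

The receiver should verify against the raw request body rather than a re-serialized copy, since any difference in key order or whitespace changes the digest.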

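From the thread worker, the notification is driven with asyncio.new_event_loop().run_until_complete(notify(...)) rather than asyncio.run(). The stated rationale is that asyncio.run() creates and destroys a loop on every call, while the thread worker may be reused across calls. A loop created with new_event_loop() is not closed automatically, so it should be closed (or reused) explicitly to avoid leaking one loop per notification.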

15.4 Watcher File Lifecycle

watch/
  *.project.yaml        ← detected by watcher
  processed/
    *.project.yaml      ← moved here on success
  failed/
    *.project.yaml      ← moved here on error
    *.project.yaml.error ← error details sidecar

Files are moved atomically (same filesystem). The .error sidecar contains the exception type, message, and traceback — making failures diagnosable without server logs.
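The lifecycle can be sketched with `pathlib` alone. `process_config` is a hypothetical name and the `analyze` callable stands in for the real pipeline; `Path.replace` maps to an atomic rename when source and destination are on the same filesystem.

```python
import tempfile
import traceback
from pathlib import Path


def process_config(path: Path, analyze) -> Path:
    """Run analysis, then move the config to processed/ or failed/ (with an .error sidecar)."""
    watch_dir = path.parent
    try:
        analyze(path)
    except Exception as exc:
        failed_dir = watch_dir / "failed"
        failed_dir.mkdir(exist_ok=True)
        dest = failed_dir / path.name
        path.replace(dest)
        # Sidecar: exception type and message first, then the full traceback.
        sidecar = dest.with_name(dest.name + ".error")
        sidecar.write_text(f"{type(exc).__name__}: {exc}\n\n{traceback.format_exc()}")
        return dest

    processed_dir = watch_dir / "processed"
    processed_dir.mkdir(exist_ok=True)
    dest = processed_dir / path.name
    path.replace(dest)
    return dest


with tempfile.TemporaryDirectory() as tmp:
    watch = Path(tmp)
    good = watch / "good.project.yaml"
    bad = watch / "bad.project.yaml"
    good.write_text("project: {name: ok}\n")
    bad.write_text("project: {name: broken}\n")

    def fake_analyze(path: Path) -> None:
        if path.name.startswith("bad"):
            raise ValueError("no mappings matched scope")

    good_dest = process_config(good, fake_analyze)
    bad_dest = process_config(bad, fake_analyze)
    sidecar_first_line = (bad_dest.parent / "bad.project.yaml.error").read_text().splitlines()[0]
```

Moving the file out of watch/ in both outcomes is what stops the watcher from picking the same config up again on the next poll.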


16. Design Decisions — v0.5.0 (Extended Sources + CI/CD)

16.1 Git Source (GitPython)

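Shallow clone (depth=1) to minimize network I/O. Cloned to a temp dir registered with atexit for cleanup. atexit handlers run on normal interpreter exit only; they do not run after SIGKILL or a hard crash, so stale clone directories left behind in those cases need separate cleanup (for example, a sweep of the temp location at startup).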

Branch defaults to the remote HEAD. The path field (subfolder within repo) is joined onto the clone root before glob scanning. This lets you point at a subfolder of a monorepo: the whole repo is still cloned at depth=1, but only the subfolder is scanned.

16.2 ZIP Source

Path traversal protection: every entry path is checked for .. components before extraction. Entries that fail the check are skipped rather than causing the whole archive to be rejected, which avoids aborting on a single bad entry in a large archive.

Size cap enforced at extraction time (not just at upload). Total extracted bytes tracked across all entries; extraction aborted if cap exceeded.
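Both protections can be sketched with the standard `zipfile` module. `safe_extract` is a hypothetical name; the extra check for absolute entry paths goes beyond the `..` check described above and is an addition in this sketch.

```python
import io
import tempfile
import zipfile
from pathlib import Path, PurePosixPath


def safe_extract(zf: zipfile.ZipFile, dest: Path, max_total_bytes: int):
    """Extract entries, skipping unsafe paths and aborting once the size cap is exceeded."""
    extracted, skipped, total = [], [], 0
    for info in zf.infolist():
        if info.is_dir():
            continue
        parts = PurePosixPath(info.filename).parts
        # Skip (do not abort on) traversal attempts; absolute paths are an added guard.
        if ".." in parts or info.filename.startswith("/"):
            skipped.append(info.filename)
            continue
        target = dest.joinpath(*parts)
        target.parent.mkdir(parents=True, exist_ok=True)
        with zf.open(info) as src, open(target, "wb") as out:
            # Count bytes as they are actually decompressed, not the header's claimed size.
            while chunk := src.read(64 * 1024):
                total += len(chunk)
                if total > max_total_bytes:
                    raise ValueError("extracted size cap exceeded")
                out.write(chunk)
        extracted.append(info.filename)
    return extracted, skipped


# In-memory archive with one valid entry and one traversal attempt.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("mappings/m_a.xml", "<MAPPING/>")
    archive.writestr("../evil.xml", "<MAPPING/>")

with tempfile.TemporaryDirectory() as tmp, zipfile.ZipFile(buf) as archive:
    extracted, skipped = safe_extract(archive, Path(tmp), max_total_bytes=1_000_000)
```

Tracking decompressed bytes during extraction is what catches archives whose declared sizes understate their real content.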

16.3 API Key Auth

secrets.compare_digest() prevents timing attacks. Absent API key in the request returns HTTP 401. Invalid key returns HTTP 403. The check is a FastAPI Depends injected only on write endpoints — read endpoints (GET) remain unauthenticated for dashboard access.
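The status-code logic can be illustrated independently of FastAPI. `check_api_key` is a hypothetical helper returning the HTTP status that the dependency would produce; wiring it into a `Depends` is omitted.

```python
import secrets
from typing import Optional


def check_api_key(provided: Optional[str], expected: str) -> int:
    """Return the HTTP status for a write request: 401 missing, 403 wrong, 200 accepted."""
    if not provided:
        return 401
    # Compare as bytes: compare_digest accepts str only for ASCII, bytes always work.
    if not secrets.compare_digest(provided.encode("utf-8"), expected.encode("utf-8")):
        return 403
    return 200


status_missing = check_api_key(None, "expected-key")
status_wrong = check_api_key("wrong-key", "expected-key")
status_ok = check_api_key("expected-key", "expected-key")
```

Distinguishing 401 from 403 tells a client whether it forgot the key or sent a bad one, without revealing anything about the expected value.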


17. Design Decisions — v1.0.0 (Production Ready)

17.1 Test Architecture

Three test categories with different isolation levels:

  • Unit — fully isolated. temp_db fixture monkeypatches _DB_PATH per function. No real files needed beyond the sample XMLs in tests/fixtures/.
  • API — FastAPI TestClient with module-scoped DB patch. Tests the HTTP layer (routing, serialization, status codes). No real analysis runs.
  • Integration — exercises the full pipeline against the FirstBank sample data. Skipped (pytest.mark.skipif) when sample data is absent so CI passes on forks that don't include the sample project.

17.2 Removing from __future__ import annotations

FastAPI + Pydantic v2's TypeAdapter resolves type annotations at import time using the module's global namespace. from __future__ import annotations defers all annotations to strings, which breaks TypeAdapter resolution for locally-defined models in app/main.py. Fix: remove the future import from main.py and ensure all imports are at module top level (PEP 8 compliant anyway).

This does not affect app/backend/models/schemas.py or other modules — only the file where FastAPI route annotations and Pydantic models are defined together.

17.3 Rate Limiting

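slowapi (Starlette-compatible) at 10 requests/minute on write endpoints. The limit applies per IP, taken from the X-Forwarded-For header when a reverse proxy sets it; because clients can forge this header, it should only be trusted when the service sits behind a known proxy. In direct-connection mode the real IP is used.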

Rate limit headers are included in all responses: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset.

17.4 Parallel Parsing

ThreadPoolExecutor(max_workers=4) activated for projects with ≥8 mapping files. Below that threshold the overhead of thread pool coordination exceeds the benefit. The threshold is configurable via PARSE_WORKERS env var.

Parsing is I/O-bound (XML reads from disk) so thread-based parallelism is appropriate. ProcessPoolExecutor was considered but rejected due to macOS spawn/asyncio conflicts and the overhead of cross-process serialization for small XML files.
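The threshold behaviour can be sketched as follows. `parse_all` and its parameter names are hypothetical, and reading the configuration from the environment is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_all(paths: list, parse_one, max_workers: int = 4, threshold: int = 8) -> list:
    """Parse sequentially for small projects, with a thread pool at or above the threshold."""
    if len(paths) < threshold:
        return [parse_one(p) for p in paths]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with `paths`.
        return list(pool.map(parse_one, paths))


small = parse_all([f"m_{i}.xml" for i in range(3)], str.upper)
large = parse_all([f"m_{i}.xml" for i in range(10)], str.upper)
```

Both paths return results in input order, so callers do not need to know which one ran.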