Train AI agents on real user behavior from PostHog session recordings — not LLM persona prompts — to create persistent digital twins of user demographics. These agents learn how different people actually interact with web products, then autonomously explore and evaluate any web application from their demographic's perspective.
- Hugging Face artifacts: amaan784/agentic-world-behavioral
- Weights & Biases fine-tuning report: https://api.wandb.ai/links/amaan784-columbia-university/5kjdhb4e
- YouTube demo: https://youtu.be/XkuUeHlwi5k
- Google Drive: https://drive.google.com/drive/folders/1jrB0kd2uz1yg_ZYN0xU1qZ-EqIuYTakW?usp=drive_link
Every competitor describes personas with prompts. We clone them from real behavioral data.
Agents learn scrolling speed, click patterns, hesitation moments, feature discovery order, and engagement style from actual PostHog recordings. Each behavioral cluster is then fine-tuned into a demographic-specific model that can explore unfamiliar applications the way real users from that demographic would.
```
                       ┌─────────────────────────────────────────────────────────┐
                       │                      DATA PIPELINE                      │
                       │                                                         │
PostHog Recordings ──► │ Parse ──► Describe ──► Encode ──► Cluster ──► Train     │
   (15+ users)         │ (S1)        (S2)        (S3)      K-Means     JSONL     │
                       │ Python    Mistral      Mistral    scikit-     Training  │
                       │           Medium       Embed      learn       Data      │
                       └──────────────────────────────────────┬──────────────────┘
                                                              │
                       ┌──────────────────────────────────────▼──────────────────┐
                       │                       FINE-TUNING                       │
                       │                                                         │
                       │ Option A: Mistral API (cloud) ── W&B tracking           │
                       │ Option B: Local LoRA on Mistral-7B (A100 GPU)           │
                       │ Option C: vLLM multi-LoRA serving                       │
                       └──────────────────────────────────────┬──────────────────┘
                                                              │
                       ┌──────────────────────────────────────▼──────────────────┐
                       │                     AGENT EXECUTION                     │
                       │                                                         │
                       │ Autonomous exploration via AgentQL + Playwright         │
                       │ Per-demographic decision model ── Weave tracing         │
                       │ Stuck detection ── Session logging                      │
                       └──────────────────────────────────────┬──────────────────┘
                                                              │
                       ┌──────────────────────────────────────▼──────────────────┐
                       │                 EVALUATION & REPORTING                  │
                       │                                                         │
                       │ Quantitative metrics ── Qualitative LLM feedback        │
                       │ Weave evaluations ── Next.js comparative dashboard      │
                       └─────────────────────────────────────────────────────────┘
```
- Download — Bulk-fetch all session recordings from PostHog API
- Parse — Extract structured events (clicks, scrolls, inputs, API calls) from rrweb data
- Describe — Mistral Medium generates natural language behavioral profiles
- Encode — Mistral Embed creates 768-D vector embeddings for each session
- Cluster — K-Means groups sessions into 3 natural demographic archetypes
- Train — Convert action sequences into supervised (state -> next_action) training pairs (see the sketch after this list)
- Fine-Tune — Create 3 demographic-specific Mistral models (cloud API or local LoRA)
- Explore — Each agent autonomously explores the target app using its fine-tuned decision model
- Report — Comparative dashboard + per-demographic narrative reports with Weave evaluations
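The Train step's supervised pairs might look like the sketch below, assuming a sliding window over parsed events and Mistral's chat-format JSONL; the event field names ("type", "target") are illustrative, not the exact schema used by build_training_data.py.

```python
# Sketch: turn one parsed session into (state -> next_action) pairs.
# The window is the "state"; the following event is the "next_action" label.
import json

def to_training_pairs(events: list[dict], window_size: int = 5) -> list[dict]:
    pairs = []
    for i in range(window_size, len(events)):
        state = events[i - window_size:i]   # recent actions = "state"
        next_action = events[i]             # label = "next_action"
        pairs.append({
            "messages": [
                {"role": "user", "content": "Recent actions:\n" + "\n".join(
                    f"{e['type']} {e.get('target', '')}" for e in state)},
                {"role": "assistant", "content": json.dumps(next_action)},
            ]
        })
    return pairs

# Write one JSON object per line, as Mistral fine-tuning expects:
# with open("train.jsonl", "w") as f:
#     for pair in to_training_pairs(events):
#         f.write(json.dumps(pair) + "\n")
```

The window length corresponds to the --window-size flag used later in this README.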
```bash
# 1. Install dependencies
pip install -r requirements.txt
playwright install chromium

# 2. Configure environment
cp .env.example .env
# Fill in: MISTRAL_API_KEY, AGENTQL_API_KEY, POSTHOG_PERSONAL_API_KEY,
#          POSTHOG_PROJECT_ID, WANDB_API_KEY, TARGET_APP_URL

# 3. Download all recordings from PostHog
python download_recordings.py

# 4. Process recordings and cluster into 3 demographics
python cluster_demographics.py

# 5. Build training data from clustered sessions
python build_training_data.py

# 6. Fine-tune 3 Mistral models (one per demographic)
# Option A: Mistral API fine-tuning (cloud)
python fine_tune.py
# Option B: Local LoRA fine-tuning (requires GPU, e.g. A100)
python finetune_job.py --all-clusters --skip-inference

# 7. Run all 3 agents against the target app
python run_agents.py

# 8. View results in the comparative dashboard
cd visualizer && npm install && npx next dev -p 3333
```

If running on a remote GPU server (e.g. Shadeform, Lambda, Brev):
```bash
# Create a virtual environment and install dependencies
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
playwright install chromium

# Run local LoRA fine-tuning
python3 finetune_job.py --all-clusters --skip-inference
```

Note: Always use `python3 -m pip install` (not bare `pip install`) to ensure packages install into the correct interpreter.
- Python 3.10+
- Mistral API key — console.mistral.ai (for description, embeddings, fine-tuning, and reports)
- AgentQL API key — agentql.com (for semantic DOM querying)
- PostHog Personal API key — posthog.com (for downloading recordings)
- W&B API key — wandb.ai (for fine-tuning experiment tracking)
- Target web app — Any web application the agents should explore
The agent refuses to run against production URLs listed in BLOCKED_HOSTS. The config.py module validates the target URL at startup and exits immediately if it matches any blocked hostname.
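As a rough illustration, the startup check might look like the following sketch; the hostnames here are hypothetical, and the actual logic lives in config.py.

```python
# Illustrative startup guard; BLOCKED_HOSTS contents are hypothetical.
import sys
from urllib.parse import urlparse

BLOCKED_HOSTS = {"app.prod.example.com", "prod.example.com"}  # hypothetical

def validate_target_url(url: str) -> None:
    host = (urlparse(url).hostname or "").lower()
    if host in BLOCKED_HOSTS:
        sys.exit(f"Refusing to run against blocked production host: {host}")

validate_target_url("http://localhost:3000")  # passes; a blocked host exits
```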
```
ShadowVerse/
├── config.py                               # Env vars, URL validation, data paths
├── .env.example                            # Template for all API keys and config
├── requirements.txt                        # Python dependencies
│
├── CORE PIPELINE
├── download_recordings.py                  # Fetch all recordings from PostHog API
├── cluster_demographics.py                 # Process recordings + K-Means clustering
├── build_training_data.py                  # Generate fine-tuning JSONL from clusters
├── fine_tune.py                            # Mistral API fine-tuning + W&B tracking
├── finetune_job.py                         # Local LoRA fine-tuning (A100 optimized)
├── run_agents.py                           # Launch 3 demographic agents
├── run_pipeline.py                         # End-to-end single-session pipeline
│
├── AGENT EXECUTION
├── agent_runner.py                         # Reusable agentic loop engine
├── agentic_loop.py                         # Local model agentic loop (LoRA)
├── local_client.py                         # Browser client for remote GPU service
├── service.py                              # Unified GPU inference + online pipeline (FastAPI)
├── serve_vllm.sh                           # vLLM multi-LoRA server launcher
│
├── EVALUATION & BATCH
├── evaluation.py                           # Weave evaluation scorers
├── process_synthetic_batch.py              # Batch-process synthetic sessions
├── real_data_pipeline.py                   # Real-session-only processing pipeline
├── generate_all_policies.py                # Batch-generate behavioral policies
│
├── pipeline/                               # Core processing stages
│   ├── stage1_parse.py                     # PostHog JSON → structured events
│   ├── stage2_describe.py                  # Events → behavioral narrative (Mistral)
│   ├── stage3_encode.py                    # Narrative → vector embedding (Mistral Embed)
│   ├── stage4_policy.py                    # Description → agent policy JSON
│   ├── stage5_execute.py                   # Legacy: fixed-sequence agent
│   └── stage5_explore.py                   # Autonomous exploratory agent
│
├── feedback/                               # UX metrics and report generation
│   ├── session_logger.py                   # Records agent actions + stuck detection
│   ├── metrics.py                          # Quantitative agent vs real-user metrics
│   └── stage6_report.py                    # Qualitative + quantitative UX report
│
├── online_pipeline/                        # FastAPI server for continuous processing
│   ├── server.py                           # Webhook receiver + REST API
│   ├── poller.py                           # Background PostHog polling
│   ├── processor.py                        # Single-recording pipeline processor
│   ├── store.py                            # JSON-backed state store
│   └── retrain.py                          # Auto-retrain when new data arrives
│
├── visualizer/                             # Next.js comparative dashboard
│   ├── app/
│   │   ├── page.tsx                        # Main dashboard page
│   │   ├── components/
│   │   │   ├── ComparativeDashboard.tsx    # Multi-agent comparison view
│   │   │   ├── ActionTimeline.tsx          # Agent action timeline
│   │   │   ├── FrictionMap.tsx             # UX friction visualization
│   │   │   ├── HeroStats.tsx               # Key metric cards
│   │   │   ├── TimingComparison.tsx        # Timing analysis charts
│   │   │   ├── QualitativeReport.tsx       # Narrative report display
│   │   │   ├── PipelineRunner.tsx          # Pipeline control UI
│   │   │   └── ResultsView.tsx             # Per-agent results
│   │   ├── api/                            # Next.js API routes
│   │   └── data/                           # Sample data for development
│   └── package.json                        # React 19, Next.js 16, Recharts, Tailwind 4
│
└── data/                                   # All generated artifacts
    ├── recordings/                         # Raw PostHog exports
    ├── parsed/                             # Structured event sequences
    ├── descriptions/                       # Behavioral narratives
    ├── embeddings/                         # Vector embeddings
    ├── clusters/                           # Demographic cluster assignments
    ├── training/                           # Fine-tuning JSONL data
    ├── models/                             # Fine-tuned model registry + LoRA adapters
    ├── agent_logs/                         # Exploration session logs
    └── reports/                            # Comparative + per-agent reports
```
```bash
# Download recordings (skip already downloaded)
python download_recordings.py --min-duration 30

# Cluster with custom count
python cluster_demographics.py --clusters 3

# Build training data with larger context window
python build_training_data.py --window-size 7

# Fine-tune via Mistral API
python fine_tune.py --base-model open-mistral-nemo

# Fine-tune locally with LoRA (GPU required)
python finetune_job.py --all-clusters                 # Train all clusters
python finetune_job.py --cluster 0 --epochs 3         # Train single cluster
python finetune_job.py --all-clusters --lora-rank 64  # Higher LoRA rank
python finetune_job.py --all-clusters --no-wandb      # Disable W&B tracking

# Run only one demographic
python run_agents.py --cluster 0 --max-steps 20 --max-duration 120

# Run the agentic loop with local LoRA model
python agentic_loop.py --url http://localhost:3000 --app-description "..."

# Run end-to-end pipeline on a single recording
python run_pipeline.py <recording.json>

# Process synthetic sessions (flat-event JSON format)
python process_synthetic_batch.py --clusters 3 --concurrency 5

# Generate policies for all described sessions
python generate_all_policies.py

# Run evaluation scorers
python evaluation.py
```

| Approach | Script | Where | Requirements |
|---|---|---|---|
| Mistral API | `fine_tune.py` | Cloud (Mistral servers) | `MISTRAL_API_KEY` |
| Local LoRA | `finetune_job.py` | Your GPU server | A100 80GB recommended; PyTorch, PEFT, TRL |
| vLLM Serving | `serve_vllm.sh` | Your GPU server | vLLM, trained LoRA adapters |
The local LoRA approach fine-tunes Mistral-7B-Instruct-v0.3 with per-cluster LoRA adapters. It supports Flash Attention 2, fused AdamW, gradient checkpointing, and packing for efficient A100 training.
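A hedged sketch of that setup with PEFT and TRL is shown below; the hyperparameters, dataset path, and target modules are assumptions for illustration, not values copied from finetune_job.py.

```python
# Illustrative per-cluster LoRA fine-tuning with PEFT + TRL.
# Paths and hyperparameters are assumptions, not the repo's exact values.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset(  # JSONL produced by build_training_data.py (path assumed)
    "json", data_files="data/training/cluster0.jsonl", split="train"
)

peft_config = LoraConfig(
    r=32,                          # exposed as --lora-rank in finetune_job.py
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="data/models/cluster0",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_checkpointing=True,   # trade compute for memory
    optim="adamw_torch_fused",     # fused AdamW
    packing=True,                  # pack short samples into full sequences
    bf16=True,
    model_init_kwargs={"attn_implementation": "flash_attention_2"},
)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    train_dataset=dataset,
    peft_config=peft_config,
    args=args,
)
trainer.train()
```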
All fine-tuning runs are tracked in Weights & Biases for full experiment observability. Both fine_tune.py (cloud) and finetune_job.py (local LoRA) integrate with W&B:
- Real-time metrics — Training loss, eval loss, and learning rate logged per step
- Run config — Cluster ID, demographic label, base model, hyperparameters (LoRA rank, epochs, batch size)
- Artifacts — Training JSONL data and trained LoRA adapters are logged as versioned W&B Artifacts
- Model registry — Fine-tuned model IDs (cloud) or adapter paths (local) stored in run summary
- Run tags — Runs are tagged with `behavioral-finetuning`, `lora`, and cluster identifiers for filtering
Each cluster gets its own W&B run under the agentic-world project, making it easy to compare training dynamics across demographics. Disable tracking with --no-wandb if needed.
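For orientation, the W&B calls involved might look roughly like this; the demographic label, metric keys, and artifact names are illustrative.

```python
# Rough sketch of the W&B integration described above.
import wandb

run = wandb.init(
    project="agentic-world",
    tags=["behavioral-finetuning", "lora", "cluster-0"],
    config={
        "cluster_id": 0,
        "demographic": "power-user",  # hypothetical label
        "base_model": "mistralai/Mistral-7B-Instruct-v0.3",
        "lora_rank": 32,
        "epochs": 3,
        "batch_size": 4,
    },
)
wandb.log({"train/loss": 1.23, "eval/loss": 1.31, "lr": 2e-4})

# Version the training data as a W&B Artifact
artifact = wandb.Artifact("cluster0-training-data", type="dataset")
artifact.add_file("data/training/cluster0.jsonl")  # path assumed
run.log_artifact(artifact)

run.summary["adapter_path"] = "data/models/cluster0"  # stored in run summary
run.finish()
```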
Trained LoRA adapters, training data, and model artifacts are uploaded to Hugging Face for reproducibility and sharing. The published repository is available at amaan784/agentic-world-behavioral.
After local fine-tuning, serve all cluster adapters via vLLM:
```bash
bash serve_vllm.sh              # Default: port 8001
bash serve_vllm.sh --port 8080  # Custom port
BASE_MODEL=mistralai/Mistral-7B-Instruct-v0.3 bash serve_vllm.sh
```

The script auto-discovers trained LoRA adapters in `data/models/` and serves them as named models (`cluster0`, `cluster1`, `cluster2`, ...) via the OpenAI-compatible `/v1/chat/completions` endpoint.
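Once the server is up, any OpenAI-compatible client can target a specific demographic by model name; a minimal sketch with `requests` (prompt content is illustrative):

```python
# Query one named LoRA adapter served by serve_vllm.sh.
import requests

resp = requests.post(
    "http://localhost:8001/v1/chat/completions",
    json={
        "model": "cluster0",  # one adapter per demographic cluster
        "messages": [
            {"role": "user", "content": "Recent actions: ...\nWhat is the next action?"}
        ],
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```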
The service.py module is a unified FastAPI server that combines GPU inference with the online data pipeline. It runs on a VM with a GPU and exposes:
Inference Endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/predict` | POST | Predict next action for a cluster + page state |
| `/predict/batch` | POST | Predict for all clusters at once |
| `/clusters` | GET | List clusters and their personas |
| `/switch/{id}` | POST | Pre-load a specific cluster adapter |
| `/health` | GET | Health check |
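As a sketch, a `/predict` call might look like the following; the request and response fields and the port are assumptions (check service.py for the actual schema).

```python
# Hypothetical /predict request; field names and port are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/predict",
    json={
        "cluster": 0,
        "page_state": {
            "url": "http://localhost:3000",
            "elements": ["Sign up", "Docs", "Pricing"],
        },
    },
)
print(resp.json())  # e.g. {"action": "click", "target": "Sign up"}
```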
Online Pipeline Endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/webhook/posthog` | POST | Receive PostHog webhook events |
| `/process/{id}` | POST | Manually trigger recording processing |
| `/process/{id}/status` | GET | Check processing status |
| `/pipeline/status` | GET | Pipeline stats and cluster counts |
| `/retrain/check` | GET | Check if retraining is needed |
| `/retrain/trigger` | POST | Trigger async retraining |
| `/retrain/status` | GET | Retrain job status |
| `/reload` | POST | Hot-reload adapters after retraining |
| `/poller/start` | POST | Start background PostHog polling |
| `/poller/stop` | POST | Stop poller |
| `/poller/status` | GET | Poller status |
```bash
# Start the unified service
python service.py

# With ngrok tunnel and API key
NGROK_AUTH_TOKEN=<token> SERVICE_API_KEY=mysecret python service.py
```

Use local_client.py to drive browser automation locally against the remote GPU service, keeping the browser on your machine and inference on the GPU.
The online_pipeline/ module provides a standalone FastAPI server for continuous, webhook-driven processing (alternative to the unified service.py):
```bash
uvicorn online_pipeline.server:app --port 8100 --reload
```

- `POST /webhook/posthog` — Receive PostHog webhook events for new recordings
- `POST /process/{recording_id}` — Manually trigger processing
- `GET /status` — Pipeline status and per-cluster counts
- `POST /retrain/trigger` — Retrain clusters that have accumulated new data
- `POST /poller/start` — Start background polling (alternative to webhooks)
Agent sessions are traced with Weave for observability. The evaluation.py module provides scoring functions decorated with @weave.op():
- Exploration coverage — Unique pages/features discovered
- Demographic consistency — Whether agent behavior matches cluster profile
- Friction detection — Identification of UX pain points
Scores appear in the Weave UI alongside agent traces for debugging and analysis.
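For reference, a scorer in this style might look like the sketch below; the metric and the session-log fields are illustrative, not copied from evaluation.py.

```python
# Illustrative Weave scorer; the session_log structure is assumed.
import weave

@weave.op()
def exploration_coverage(session_log: dict) -> float:
    """Fraction of unique pages visited out of pages known to exist."""
    visited = {step["url"] for step in session_log["steps"]}
    return len(visited) / max(session_log.get("total_pages", 1), 1)
```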
After run_agents.py completes, find results in data/reports/:
- `comparative_report_latest.json` — Multi-agent comparison with recommendations
- `comparative_report_<timestamp>.json` — Timestamped report archive
The comparative report includes:
- Per-agent metrics — Steps, duration, success/failure rates, impressions
- Per-agent narrative — First-person UX report from each demographic
- Common friction points — Issues found by multiple demographics
- Demographic-specific issues — Problems unique to certain user groups
- Prioritized recommendations — Ranked by impact, tagged by affected demographics
- Engagement patterns — Which demographics engaged most/least and why
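Programmatic consumers can read the report directly; a minimal sketch, with JSON field names assumed from the list above rather than taken from the actual schema:

```python
# Illustrative: load the latest comparative report and print recommendations.
import json

with open("data/reports/comparative_report_latest.json") as f:
    report = json.load(f)

for rec in report.get("prioritized_recommendations", []):
    print(rec.get("impact"), rec.get("affected_demographics"), rec.get("summary"))
```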
The visualizer/ directory contains a Next.js 16 dashboard (React 19, Recharts, Tailwind CSS 4) for interactive comparison of agent results:
```bash
cd visualizer && npm install && npm run dev
```

Components include comparative dashboards, action timelines, friction maps, timing analysis, and qualitative report rendering.
| Category | Technologies |
|---|---|
| Language | Python 3.10+, TypeScript |
| LLM | Mistral-7B-Instruct-v0.3, Mistral API (Medium, Embed) |
| Fine-Tuning | PEFT (LoRA), TRL, Flash Attention 2, W&B |
| Serving | vLLM (multi-LoRA), FastAPI + Uvicorn |
| Browser Automation | Playwright, AgentQL |
| ML/Clustering | scikit-learn (K-Means), NumPy |
| Observability | Weave (tracing + evaluation), W&B (experiments) |
| Frontend | Next.js 16, React 19, Recharts, Tailwind CSS 4 |
| Data Source | PostHog session recordings (rrweb) |