Train AI agents on real user behavior from PostHog session recordings — not LLM persona prompts — to create persistent digital twins of user demographics. These agents learn how different people actually interact with web products, then autonomously explore and evaluate any web application from their demographic's perspective.
- Hugging Face artifacts: amaan784/agentic-world-behavioral
- Weights & Biases fine-tuning report: https://api.wandb.ai/links/amaan784-columbia-university/5kjdhb4e
- YouTube demo: https://youtu.be/XkuUeHlwi5k
- Google Drive: https://drive.google.com/drive/folders/1jrB0kd2uz1yg_ZYN0xU1qZ-EqIuYTakW?usp=drive_link
Every competitor describes personas with prompts. We clone them from real behavioral data.
Agents learn scrolling speed, click patterns, hesitation moments, feature discovery order, and engagement style from actual PostHog recordings. Each behavioral cluster is then fine-tuned into a demographic-specific model that can explore unfamiliar applications the way real users from that demographic would.
```
                       ┌─────────────────────────────────────────────────────────┐
                       │                      DATA PIPELINE                      │
                       │                                                         │
PostHog Recordings ──► │ Parse ──► Describe ──► Encode ──► Cluster ──► Train     │
   (15+ users)         │ (S1)        (S2)        (S3)      K-Means     JSONL     │
                       │ Python    Mistral      Mistral    scikit-     Training  │
                       │           Medium       Embed      learn       Data      │
                       └──────────────────────────────────────┬──────────────────┘
                                                              │
                       ┌──────────────────────────────────────▼──────────────────┐
                       │                       FINE-TUNING                       │
                       │                                                         │
                       │ Option A: Mistral API (cloud) ── W&B tracking           │
                       │ Option B: Local LoRA on Mistral-7B (A100 GPU)           │
                       │ Option C: vLLM multi-LoRA serving                       │
                       └──────────────────────────────────────┬──────────────────┘
                                                              │
                       ┌──────────────────────────────────────▼──────────────────┐
                       │                     AGENT EXECUTION                     │
                       │                                                         │
                       │ Autonomous exploration via AgentQL + Playwright         │
                       │ Per-demographic decision model ── Weave tracing         │
                       │ Stuck detection ── Session logging                      │
                       └──────────────────────────────────────┬──────────────────┘
                                                              │
                       ┌──────────────────────────────────────▼──────────────────┐
                       │                 EVALUATION & REPORTING                  │
                       │                                                         │
                       │ Quantitative metrics ── Qualitative LLM feedback        │
                       │ Weave evaluations ── Next.js comparative dashboard      │
                       └─────────────────────────────────────────────────────────┘
```
- Download — Bulk-fetch all session recordings from PostHog API
- Parse — Extract structured events (clicks, scrolls, inputs, API calls) from rrweb data
- Describe — Mistral Medium generates natural language behavioral profiles
- Encode — Mistral Embed creates 768-D vector embeddings for each session
- Cluster — K-Means groups sessions into 3 natural demographic archetypes
- Train — Convert action sequences into supervised (state -> next_action) training pairs (see the sketch after this list)
- Fine-Tune — Create 3 demographic-specific Mistral models (cloud API or local LoRA)
- Explore — Each agent autonomously explores the target app using its fine-tuned decision model
- Report — Comparative dashboard + per-demographic narrative reports with Weave evaluations
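The Train step's supervised pairs might look like the sketch below, assuming a sliding window over parsed events and Mistral's chat-format JSONL; the event field names ("type", "target") are illustrative, not the exact schema used by build_training_data.py.

```python
# Sketch: turn one parsed session into (state -> next_action) pairs.
# The window is the "state"; the following event is the "next_action" label.
import json

def to_training_pairs(events: list[dict], window_size: int = 5) -> list[dict]:
    pairs = []
    for i in range(window_size, len(events)):
        state = events[i - window_size:i]   # recent actions = "state"
        next_action = events[i]             # label = "next_action"
        pairs.append({
            "messages": [
                {"role": "user", "content": "Recent actions:\n" + "\n".join(
                    f"{e['type']} {e.get('target', '')}" for e in state)},
                {"role": "assistant", "content": json.dumps(next_action)},
            ]
        })
    return pairs

# Write one JSON object per line, as Mistral fine-tuning expects:
# with open("train.jsonl", "w") as f:
#     for pair in to_training_pairs(events):
#         f.write(json.dumps(pair) + "\n")
```

The window length corresponds to the --window-size flag used later in this README.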
```bash
# 1. Install dependencies
pip install -r requirements.txt
playwright install chromium

# 2. Configure environment
cp .env.example .env
# Fill in: MISTRAL_API_KEY, AGENTQL_API_KEY, POSTHOG_PERSONAL_API_KEY,
#          POSTHOG_PROJECT_ID, WANDB_API_KEY, TARGET_APP_URL

# 3. Download all recordings from PostHog
python download_recordings.py

# 4. Process recordings and cluster into 3 demographics
python cluster_demographics.py

# 5. Build training data from clustered sessions
python build_training_data.py

# 6. Fine-tune 3 Mistral models (one per demographic)
# Option A: Mistral API fine-tuning (cloud)
python fine_tune.py
# Option B: Local LoRA fine-tuning (requires GPU, e.g. A100)
python finetune_job.py --all-clusters --skip-inference

# 7. Run all 3 agents against the target app
python run_agents.py

# 8. View results in the comparative dashboard
cd visualizer && npm install && npx next dev -p 3333
```

If running on a remote GPU server (e.g. Shadeform, Lambda, Brev):
```bash
# Create a virtual environment and install dependencies
python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
playwright install chromium

# Run local LoRA fine-tuning
python3 finetune_job.py --all-clusters --skip-inference
```

Note: Always use `python3 -m pip install` (not bare `pip install`) to ensure packages install into the correct interpreter.
- Python 3.10+
- Mistral API key — console.mistral.ai (for description, embeddings, fine-tuning, and reports)
- AgentQL API key — agentql.com (for semantic DOM querying)
- PostHog Personal API key — posthog.com (for downloading recordings)
- W&B API key — wandb.ai (for fine-tuning experiment tracking)
- Target web app — Any web application the agents should explore
The agent refuses to run against production URLs listed in BLOCKED_HOSTS. The config.py module validates the target URL at startup and exits immediately if it matches any blocked hostname.
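As a rough illustration, the startup check might look like the following sketch; the hostnames here are hypothetical, and the actual logic lives in config.py.

```python
# Illustrative startup guard; BLOCKED_HOSTS contents are hypothetical.
import sys
from urllib.parse import urlparse

BLOCKED_HOSTS = {"app.prod.example.com", "prod.example.com"}  # hypothetical

def validate_target_url(url: str) -> None:
    host = (urlparse(url).hostname or "").lower()
    if host in BLOCKED_HOSTS:
        sys.exit(f"Refusing to run against blocked production host: {host}")

validate_target_url("http://localhost:3000")  # passes; a blocked host exits
```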
```
ShadowVerse/
├── config.py                               # Env vars, URL validation, data paths
├── .env.example                            # Template for all API keys and config
├── requirements.txt                        # Python dependencies
│
├── CORE PIPELINE
├── download_recordings.py                  # Fetch all recordings from PostHog API
├── cluster_demographics.py                 # Process recordings + K-Means clustering
├── build_training_data.py                  # Generate fine-tuning JSONL from clusters
├── fine_tune.py                            # Mistral API fine-tuning + W&B tracking
├── finetune_job.py                         # Local LoRA fine-tuning (A100 optimized)
├── run_agents.py                           # Launch 3 demographic agents
├── run_pipeline.py                         # End-to-end single-session pipeline
│
├── AGENT EXECUTION
├── agent_runner.py                         # Reusable agentic loop engine
├── agentic_loop.py                         # Local model agentic loop (LoRA)
├── local_client.py                         # Browser client for remote GPU service
├── service.py                              # Unified GPU inference + online pipeline (FastAPI)
├── serve_vllm.sh                           # vLLM multi-LoRA server launcher
│
├── EVALUATION & BATCH
├── evaluation.py                           # Weave evaluation scorers
├── process_synthetic_batch.py              # Batch-process synthetic sessions
├── real_data_pipeline.py                   # Real-session-only processing pipeline
├── generate_all_policies.py                # Batch-generate behavioral policies
│
├── pipeline/                               # Core processing stages
│   ├── stage1_parse.py                     # PostHog JSON → structured events
│   ├── stage2_describe.py                  # Events → behavioral narrative (Mistral)
│   ├── stage3_encode.py                    # Narrative → vector embedding (Mistral Embed)
│   ├── stage4_policy.py                    # Description → agent policy JSON
│   ├── stage5_execute.py                   # Legacy: fixed-sequence agent
│   └── stage5_explore.py                   # Autonomous exploratory agent
│
├── feedback/                               # UX metrics and report generation
│   ├── session_logger.py                   # Records agent actions + stuck detection
│   ├── metrics.py                          # Quantitative agent vs real-user metrics
│   └── stage6_report.py                    # Qualitative + quantitative UX report
│
├── online_pipeline/                        # FastAPI server for continuous processing
│   ├── server.py                           # Webhook receiver + REST API
│   ├── poller.py                           # Background PostHog polling
│   ├── processor.py                        # Single-recording pipeline processor
│   ├── store.py                            # JSON-backed state store
│   └── retrain.py                          # Auto-retrain when new data arrives
│
├── visualizer/                             # Next.js comparative dashboard
│   ├── app/
│   │   ├── page.tsx                        # Main dashboard page
│   │   ├── components/
│   │   │   ├── ComparativeDashboard.tsx    # Multi-agent comparison view
│   │   │   ├── ActionTimeline.tsx          # Agent action timeline
│   │   │   ├── FrictionMap.tsx             # UX friction visualization
│   │   │   ├── HeroStats.tsx               # Key metric cards
│   │   │   ├── TimingComparison.tsx        # Timing analysis charts
│   │   │   ├── QualitativeReport.tsx       # Narrative report display
│   │   │   ├── PipelineRunner.tsx          # Pipeline control UI
│   │   │   └── ResultsView.tsx             # Per-agent results
│   │   ├── api/                            # Next.js API routes
│   │   └── data/                           # Sample data for development
│   └── package.json                        # React 19, Next.js 16, Recharts, Tailwind 4
│
└── data/                                   # All generated artifacts
    ├── recordings/                         # Raw PostHog exports
    ├── parsed/                             # Structured event sequences
    ├── descriptions/                       # Behavioral narratives
    ├── embeddings/                         # Vector embeddings
    ├── clusters/                           # Demographic cluster assignments
    ├── training/                           # Fine-tuning JSONL data
    ├── models/                             # Fine-tuned model registry + LoRA adapters
    ├── agent_logs/                         # Exploration session logs
    └── reports/                            # Comparative + per-agent reports
```
```bash
# Download recordings (skip already downloaded)
python download_recordings.py --min-duration 30

# Cluster with custom count
python cluster_demographics.py --clusters 3

# Build training data with larger context window
python build_training_data.py --window-size 7

# Fine-tune via Mistral API
python fine_tune.py --base-model open-mistral-nemo

# Fine-tune locally with LoRA (GPU required)
python finetune_job.py --all-clusters                 # Train all clusters
python finetune_job.py --cluster 0 --epochs 3         # Train single cluster
python finetune_job.py --all-clusters --lora-rank 64  # Higher LoRA rank
python finetune_job.py --all-clusters --no-wandb      # Disable W&B tracking

# Run only one demographic
python run_agents.py --cluster 0 --max-steps 20 --max-duration 120

# Run the agentic loop with local LoRA model
python agentic_loop.py --url http://localhost:3000 --app-description "..."

# Run end-to-end pipeline on a single recording
python run_pipeline.py <recording.json>

# Process synthetic sessions (flat-event JSON format)
python process_synthetic_batch.py --clusters 3 --concurrency 5

# Generate policies for all described sessions
python generate_all_policies.py

# Run evaluation scorers
python evaluation.py
```

| Approach | Script | Where | Requirements |
|---|---|---|---|
| Mistral API | `fine_tune.py` | Cloud (Mistral servers) | `MISTRAL_API_KEY` |
| Local LoRA | `finetune_job.py` | Your GPU server | A100 80GB recommended; PyTorch, PEFT, TRL |
| vLLM Serving | `serve_vllm.sh` | Your GPU server | vLLM, trained LoRA adapters |
The local LoRA approach fine-tunes Mistral-7B-Instruct-v0.3 with per-cluster LoRA adapters. It supports Flash Attention 2, fused AdamW, gradient checkpointing, and packing for efficient A100 training.
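A hedged sketch of that setup with PEFT and TRL is shown below; the hyperparameters, dataset path, and target modules are assumptions for illustration, not values copied from finetune_job.py.

```python
# Illustrative per-cluster LoRA fine-tuning with PEFT + TRL.
# Paths and hyperparameters are assumptions, not the repo's exact values.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset(  # JSONL produced by build_training_data.py (path assumed)
    "json", data_files="data/training/cluster0.jsonl", split="train"
)

peft_config = LoraConfig(
    r=32,                          # exposed as --lora-rank in finetune_job.py
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

args = SFTConfig(
    output_dir="data/models/cluster0",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_checkpointing=True,   # trade compute for memory
    optim="adamw_torch_fused",     # fused AdamW
    packing=True,                  # pack short samples into full sequences
    bf16=True,
    model_init_kwargs={"attn_implementation": "flash_attention_2"},
)

trainer = SFTTrainer(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    train_dataset=dataset,
    peft_config=peft_config,
    args=args,
)
trainer.train()
```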
All fine-tuning runs are tracked in Weights & Biases for full experiment observability. Both fine_tune.py (cloud) and finetune_job.py (local LoRA) integrate with W&B:
- Real-time metrics — Training loss, eval loss, and learning rate logged per step
- Run config — Cluster ID, demographic label, base model, hyperparameters (LoRA rank, epochs, batch size)
- Artifacts — Training JSONL data and trained LoRA adapters are logged as versioned W&B Artifacts
- Model registry — Fine-tuned model IDs (cloud) or adapter paths (local) stored in run summary
- Run tags — Runs are tagged with `behavioral-finetuning`, `lora`, and cluster identifiers for filtering
Each cluster gets its own W&B run under the agentic-world project, making it easy to compare training dynamics across demographics. Disable tracking with --no-wandb if needed.
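For orientation, the W&B calls involved might look roughly like this; the demographic label, metric keys, and artifact names are illustrative.

```python
# Rough sketch of the W&B integration described above.
import wandb

run = wandb.init(
    project="agentic-world",
    tags=["behavioral-finetuning", "lora", "cluster-0"],
    config={
        "cluster_id": 0,
        "demographic": "power-user",  # hypothetical label
        "base_model": "mistralai/Mistral-7B-Instruct-v0.3",
        "lora_rank": 32,
        "epochs": 3,
        "batch_size": 4,
    },
)
wandb.log({"train/loss": 1.23, "eval/loss": 1.31, "lr": 2e-4})

# Version the training data as a W&B Artifact
artifact = wandb.Artifact("cluster0-training-data", type="dataset")
artifact.add_file("data/training/cluster0.jsonl")  # path assumed
run.log_artifact(artifact)

run.summary["adapter_path"] = "data/models/cluster0"  # stored in run summary
run.finish()
```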
Trained LoRA adapters, training data, and model artifacts are uploaded to Hugging Face for reproducibility and sharing. The published repository is available at amaan784/agentic-world-behavioral.
After local fine-tuning, serve all cluster adapters via vLLM:
```bash
bash serve_vllm.sh              # Default: port 8001
bash serve_vllm.sh --port 8080  # Custom port
BASE_MODEL=mistralai/Mistral-7B-Instruct-v0.3 bash serve_vllm.sh
```

The script auto-discovers trained LoRA adapters in `data/models/` and serves them as named models (`cluster0`, `cluster1`, `cluster2`, ...) via the OpenAI-compatible `/v1/chat/completions` endpoint.
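Once the server is up, any OpenAI-compatible client can target a specific demographic by model name; a minimal sketch with `requests` (prompt content is illustrative):

```python
# Query one named LoRA adapter served by serve_vllm.sh.
import requests

resp = requests.post(
    "http://localhost:8001/v1/chat/completions",
    json={
        "model": "cluster0",  # one adapter per demographic cluster
        "messages": [
            {"role": "user", "content": "Recent actions: ...\nWhat is the next action?"}
        ],
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```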
The service.py module is a unified FastAPI server that combines GPU inference with the online data pipeline. It runs on a VM with a GPU and exposes:
Inference Endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/predict` | POST | Predict next action for a cluster + page state |
| `/predict/batch` | POST | Predict for all clusters at once |
| `/clusters` | GET | List clusters and their personas |
| `/switch/{id}` | POST | Pre-load a specific cluster adapter |
| `/health` | GET | Health check |
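As a sketch, a `/predict` call might look like the following; the request and response fields and the port are assumptions (check service.py for the actual schema).

```python
# Hypothetical /predict request; field names and port are assumptions.
import requests

resp = requests.post(
    "http://localhost:8000/predict",
    json={
        "cluster": 0,
        "page_state": {
            "url": "http://localhost:3000",
            "elements": ["Sign up", "Docs", "Pricing"],
        },
    },
)
print(resp.json())  # e.g. {"action": "click", "target": "Sign up"}
```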
Online Pipeline Endpoints:
| Endpoint | Method | Description |
|---|---|---|
| `/webhook/posthog` | POST | Receive PostHog webhook events |
| `/process/{id}` | POST | Manually trigger recording processing |
| `/process/{id}/status` | GET | Check processing status |
| `/pipeline/status` | GET | Pipeline stats and cluster counts |
| `/retrain/check` | GET | Check if retraining is needed |
| `/retrain/trigger` | POST | Trigger async retraining |
| `/retrain/status` | GET | Retrain job status |
| `/reload` | POST | Hot-reload adapters after retraining |
| `/poller/start` | POST | Start background PostHog polling |
| `/poller/stop` | POST | Stop poller |
| `/poller/status` | GET | Poller status |
```bash
# Start the unified service
python service.py

# With ngrok tunnel and API key
NGROK_AUTH_TOKEN=<token> SERVICE_API_KEY=mysecret python service.py
```

Use local_client.py to drive browser automation locally against the remote GPU service, keeping the browser on your machine and inference on the GPU.
The online_pipeline/ module provides a standalone FastAPI server for continuous, webhook-driven processing (alternative to the unified service.py):
```bash
uvicorn online_pipeline.server:app --port 8100 --reload
```

- `POST /webhook/posthog` — Receive PostHog webhook events for new recordings
- `POST /process/{recording_id}` — Manually trigger processing
- `GET /status` — Pipeline status and per-cluster counts
- `POST /retrain/trigger` — Retrain clusters that have accumulated new data
- `POST /poller/start` — Start background polling (alternative to webhooks)
Agent sessions are traced with Weave for observability. The evaluation.py module provides scoring functions decorated with @weave.op():
- Exploration coverage — Unique pages/features discovered
- Demographic consistency — Whether agent behavior matches cluster profile
- Friction detection — Identification of UX pain points
Scores appear in the Weave UI alongside agent traces for debugging and analysis.
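For reference, a scorer in this style might look like the sketch below; the metric and the session-log fields are illustrative, not copied from evaluation.py.

```python
# Illustrative Weave scorer; the session_log structure is assumed.
import weave

@weave.op()
def exploration_coverage(session_log: dict) -> float:
    """Fraction of unique pages visited out of pages known to exist."""
    visited = {step["url"] for step in session_log["steps"]}
    return len(visited) / max(session_log.get("total_pages", 1), 1)
```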
After run_agents.py completes, find results in data/reports/:
- `comparative_report_latest.json` — Multi-agent comparison with recommendations
- `comparative_report_<timestamp>.json` — Timestamped report archive
The comparative report includes:
- Per-agent metrics — Steps, duration, success/failure rates, impressions
- Per-agent narrative — First-person UX report from each demographic
- Common friction points — Issues found by multiple demographics
- Demographic-specific issues — Problems unique to certain user groups
- Prioritized recommendations — Ranked by impact, tagged by affected demographics
- Engagement patterns — Which demographics engaged most/least and why
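Programmatic consumers can read the report directly; a minimal sketch, with JSON field names assumed from the list above rather than taken from the actual schema:

```python
# Illustrative: load the latest comparative report and print recommendations.
import json

with open("data/reports/comparative_report_latest.json") as f:
    report = json.load(f)

for rec in report.get("prioritized_recommendations", []):
    print(rec.get("impact"), rec.get("affected_demographics"), rec.get("summary"))
```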
The visualizer/ directory contains a Next.js 16 dashboard (React 19, Recharts, Tailwind CSS 4) for interactive comparison of agent results:
```bash
cd visualizer && npm install && npm run dev
```

Components include comparative dashboards, action timelines, friction maps, timing analysis, and qualitative report rendering.
| Category | Technologies |
|---|---|
| Language | Python 3.10+, TypeScript |
| LLM | Mistral-7B-Instruct-v0.3, Mistral API (Medium, Embed) |
| Fine-Tuning | PEFT (LoRA), TRL, Flash Attention 2, W&B |
| Serving | vLLM (multi-LoRA), FastAPI + Uvicorn |
| Browser Automation | Playwright, AgentQL |
| ML/Clustering | scikit-learn (K-Means), NumPy |
| Observability | Weave (tracing + evaluation), W&B (experiments) |
| Frontend | Next.js 16, React 19, Recharts, Tailwind CSS 4 |
| Data Source | PostHog session recordings (rrweb) |