A comprehensive evaluation framework for testing Vision Language Model (VLM) accuracy in pixel-level pointing tasks. This tool generates synthetic test images, evaluates multiple VLMs, and provides a visual web interface to compare model performance.
For codebase structure and module organization, see STRUCTURE.md.
This benchmark evaluates how accurately different VLMs can identify and point to specific locations in images when given natural language prompts. It's particularly useful for:
- Comparing VLM performance on pixel-accurate pointing tasks
- Testing model accuracy across different screen sizes and aspect ratios
- Visualizing model predictions with an interactive web interface
- Evaluating models for UI automation and device control applications
- Test Suite System: Modular test suite architecture supporting both synthetic and screenshot-based tests
- Synthetic Test Image Generation: Creates test images with various shapes (circles, squares, triangles, buttons, X marks) in different positions and colors
- Multi-Model Evaluation: Supports multiple VLMs including:
- Claude Sonnet 4
- Claude Opus 4
- Gemini 3 Pro
- GPT-5.2
- Claude Haiku 4
- Grok 4.1 (via OpenRouter)
- Qwen3-VL (via OpenRouter)
- GLM-4.6V (via OpenRouter)
- Gemini 2.5 Flash (via OpenRouter)
- Multiple Passes: Run evaluations multiple times to calculate statistics and standard deviation
- Non-Overwriting Results: Results are stored with timestamps, allowing multiple runs without overwriting
- Multiple Screen Sizes: Test models across different screen dimensions and aspect ratios
- Comprehensive Metrics: Calculates distance errors, extraction rates, accuracy thresholds, and standard deviation across passes
- Enhanced Visual Web Viewer: Interactive HTML interface with:
- Test suite selection
- Model and pass filtering
- Multiple pass visualization
- Improved color scheme
- Statistical summaries (mean, std dev, min/max)
- Custom Test Cases: Define your own test scenarios via JSON configuration or create custom test suites
1. Clone or download this repository

2. Create a virtual environment (recommended):

   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies:

   pip install -r requirements.txt

4. Set up environment variables: create a `.env` file in the project root with your API keys:

   ANTHROPIC_API_KEY=your_anthropic_key
   OPENAI_API_KEY=your_openai_key
   GEMINI_API_KEY=your_gemini_key
   OPENROUTER_API_KEY=your_openrouter_key

   Note: For OpenRouter models (grok-4.1, qwen3-vl, glm-4.6v, gemini-2.5-flash), you need to set `OPENROUTER_API_KEY`.
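To confirm the keys are visible to Python before running an evaluation, you can use a quick check like the one below. This is a hypothetical helper, not part of the repo, and it assumes the `python-dotenv` package is installed:

```python
# check_keys.py - hypothetical helper script; assumes `pip install python-dotenv`
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

REQUIRED_KEYS = [
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "GEMINI_API_KEY",
    "OPENROUTER_API_KEY",  # only needed for OpenRouter-hosted models
]

for key in REQUIRED_KEYS:
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```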
The codebase has been refactored with a modular architecture. Use the new evaluate.py:
# List available test suites
python evaluate.py --list-suites
# Run a suite by itself (defaults to all available models, 3 passes)
python evaluate.py --test-suite basic_shapes
# Run a specific test suite with specific models
python evaluate.py --test-suite basic_shapes --models sonnet opus gemini3
# Run with multiple passes for statistics
python evaluate.py --test-suite basic_shapes --models sonnet --num-passes 3
# Custom screen size
python evaluate.py --test-suite basic_shapes --width 1080 --height 2400

Suites available in this repo (see python evaluate.py --list-suites):
- basic_shapes: simple shapes + a few harder variants (overlap, transparency, low contrast) at your chosen --width/--height
- color_identification: basic colors → subtle differences → hex colors (square images recommended)
- shape_identification: confusable shapes (e.g., hexagon vs octagon, decagon among circles)
- size_comparison: “pick larger/smaller” comparisons (can trigger wrong-object clicks)
- resolution_test_256x256 / 512x512 / 1024x1024: explicit per-resolution suites (recommended for resolution comparisons)
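Because each resolution suite is meant to be paired with a matching --width/--height, a sweep is easy to script. The sketch below uses only the flags documented here; the model choice ("sonnet") is just an example, and the equivalent shell commands appear under Examples below:

```python
# Sweep the per-resolution suites, matching each suite to its width/height.
# Uses only the documented evaluate.py flags; "sonnet" is an example model.
import subprocess

for size in (256, 512, 1024):
    subprocess.run(
        [
            "python", "evaluate.py",
            "--test-suite", f"resolution_test_{size}x{size}",
            "--width", str(size),
            "--height", str(size),
            "--models", "sonnet",
        ],
        check=True,
    )
```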
Examples:
# Color/shape/size suites at 1024x1024
python evaluate.py --test-suite color_identification --width 1024 --height 1024
python evaluate.py --test-suite shape_identification --width 1024 --height 1024
python evaluate.py --test-suite size_comparison --width 1024 --height 1024
# Resolution sweep (run each suite at matching width/height)
python evaluate.py --test-suite resolution_test_256x256 --width 256 --height 256
python evaluate.py --test-suite resolution_test_512x512 --width 512 --height 512
python evaluate.py --test-suite resolution_test_1024x1024 --width 1024 --height 1024

Select specific models:
python evaluate.py --test-suite basic_shapes --models sonnet opus gemini3

Custom screen size:
python evaluate.py --test-suite basic_shapes --width 1080 --height 2400

Run multiple passes for statistics:
python evaluate.py --test-suite basic_shapes --models sonnet --num-passes 3

Don't save images (faster, smaller output):
python evaluate.py --test-suite basic_shapes --no-save-images

Fix consolidated results (if models list is missing):
python -m evaluation.utils fix --test-suite basic_shapes --screen-size custom

Update test suites index:
python -m evaluation.utils index

Option 1: Using the Python server (Recommended)
Run the included server script:
python serve_viewer.py

This will:
- Start a local web server on port 8000
- Automatically open the viewer in your browser
- Allow the viewer to load JSON files and images
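If you prefer not to use the script, its behavior is easy to approximate. The sketch below is a stand-in built on Python's standard library, not the actual contents of serve_viewer.py:

```python
# Minimal stand-in for a viewer server: serve the current directory and open the page.
import http.server
import socketserver
import webbrowser

PORT = 8000

with socketserver.TCPServer(("", PORT), http.server.SimpleHTTPRequestHandler) as httpd:
    webbrowser.open(f"http://localhost:{PORT}/index.html")
    print(f"Serving viewer at http://localhost:{PORT} (Ctrl+C to stop)")
    httpd.serve_forever()
```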
Option 2: Using Python's built-in server
python -m http.server 8000

Then open http://localhost:8000/index.html in your browser.
Using the Enhanced Viewer (index.html):
- Select the results directory (default: `results`)
- Select a test suite from the dropdown
- Click "Load Results"
- Use the filters to:
- Show/hide specific models
- Show/hide specific passes (for multi-pass runs)
- Browse through test images with visual overlays showing:
- Colored dots: Each model's prediction (multiple passes shown with reduced opacity)
- Legend: Click legend items to toggle model visibility
- Statistics: Mean distance, standard deviation, and min/max across passes
Results are organized in the output directory as follows:
results/
├── test_suites.json # Index of all test suites
├── basic_shapes/ # Test suite name
│ └── custom/ # Screen size name
│ ├── images/ # Test images (one per test case)
│ │ ├── simple_circle.png
│ │ ├── top_right_square.png
│ │ └── ...
│ ├── consolidated_results.json # All models' predictions per test
│ ├── runs_index.json # Index of all runs
│ ├── sonnet_pass1_*.json # Individual run results
│ ├── opus_pass1_*.json
│ └── ...
└── ...
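If you want to inspect results programmatically, the layout above is enough to locate the per-run files. The sketch below uses the suite and screen-size names from the tree above; the JSON schemas are not documented here, so only file-level access is shown:

```python
# List per-run result files and load the consolidated results for one suite/size.
import json
from pathlib import Path

run_dir = Path("results") / "basic_shapes" / "custom"

for run_file in sorted(run_dir.glob("*_pass*_*.json")):
    print(run_file.name)  # e.g. sonnet_pass1_<timestamp>.json

consolidated = run_dir / "consolidated_results.json"
if consolidated.exists():
    data = json.loads(consolidated.read_text())
    print(f"Loaded consolidated results ({type(data).__name__})")
```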
The benchmark includes 8 default test cases:
- Simple circle (center)
- Top-right square corner
- Middle of X mark
- Transparent button
- Small circle
- Bottom-left triangle
- Overlapping shapes
- Low contrast button
Create a JSON file with custom test configurations:
[
{
"name": "custom_circle",
"prompt": "Point to the center of the purple circle",
"shape": "circle",
"color": "purple",
"position": "center",
"expected_coords": [540, 1200]
},
{
"name": "custom_button",
"prompt": "Point to the center of the red button",
"shape": "button",
"color": "red",
"position": "center",
"size": "large"
}
]

Available options:
- `name`: Unique test identifier
- `prompt`: Natural language instruction for the model
- `shape`: `circle`, `square`, `triangle`, `x`, `button`
- `color`: `purple`, `red`, `blue`, `green`, `yellow`, `orange`, `gray`, `lightgray`, `transparent`
- `position`: `center`, `top_left`, `top_right`, `bottom_left`, `bottom_right`
- `size`: `small`, `medium` (default), `large`
- `expected_coords`: `[x, y]` - Optional exact coordinates
- `overlap`: `true` - For overlapping shapes test
- `background`: Background color name
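If you generate these files from code, it is worth validating the option values against the lists above before running an evaluation. The sketch below writes a custom test file; the filename custom_tests.json and the validation sets are illustrative only:

```python
# Write a custom test case file and check it against the documented options.
import json

VALID_SHAPES = {"circle", "square", "triangle", "x", "button"}
VALID_POSITIONS = {"center", "top_left", "top_right", "bottom_left", "bottom_right"}

tests = [
    {
        "name": "custom_circle",
        "prompt": "Point to the center of the purple circle",
        "shape": "circle",
        "color": "purple",
        "position": "center",
        "expected_coords": [540, 1200],
    },
]

for test in tests:
    assert test["shape"] in VALID_SHAPES, f"unknown shape: {test['shape']}"
    assert test.get("position", "center") in VALID_POSITIONS

with open("custom_tests.json", "w") as f:  # hypothetical output path
    json.dump(tests, f, indent=2)
```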
The benchmark calculates several accuracy metrics:
- Extraction Rate: Percentage of successful coordinate extractions
- Mean Distance: Average pixel distance from ground truth
- Median Distance: Median pixel distance
- Accuracy within 10px: Percentage of predictions within 10 pixels
- Accuracy within 5%: Percentage within 5% of screen diagonal
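These metrics can be reproduced directly from raw predictions. The sketch below is not the project's evaluation code; it simply implements the definitions listed above, with distances in pixels and the 5% threshold taken against the screen diagonal (computing thresholds over successfully extracted predictions is an assumption):

```python
# Compute the distance-based metrics described above from raw predictions.
import math
from statistics import mean, median

def pointing_metrics(predictions, ground_truth, width, height):
    """predictions: list of (x, y) or None; ground_truth: list of (x, y)."""
    diagonal = math.hypot(width, height)
    extracted = [(p, g) for p, g in zip(predictions, ground_truth) if p is not None]
    distances = [math.hypot(p[0] - g[0], p[1] - g[1]) for p, g in extracted]
    return {
        "extraction_rate": len(extracted) / len(predictions),
        "mean_distance": mean(distances) if distances else None,
        "median_distance": median(distances) if distances else None,
        "accuracy_within_10px": sum(d <= 10 for d in distances) / len(distances) if distances else 0.0,
        "accuracy_within_5pct": sum(d <= 0.05 * diagonal for d in distances) / len(distances) if distances else 0.0,
    }

# Example: two predictions (one failed extraction) on a 1080x2400 screen
print(pointing_metrics([(542, 1195), None], [(540, 1200), (100, 100)], 1080, 2400))
```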
The enhanced viewer (index.html) uses the following color scheme:
- Sonnet: rgb(168, 2, 15) (Red)
- Opus: rgb(255, 132, 0) (Orange)
- Gemini3: rgb(0, 255, 76) (Green)
- ChatGPT: rgb(17, 160, 207) (Blue)
- Haiku: rgb(164, 11, 224) (Purple)
- Grok-4.1: rgb(255, 20, 147) (Deep Pink)
- Qwen3-VL: rgb(34, 139, 34) (Forest Green)
- GLM-4.6V: rgb(25, 6, 133) (Indigo)
- Gemini-2.5-Flash: rgb(255, 0, 43) (Crimson)
- Ground Truth: Black circle with white center
Colors are displayed as dots next to model names in the statistics section.
- Python 3.8+
- API keys for the models you want to test
- Modern web browser for viewing results
This project is provided as-is for evaluation and research purposes.