A comprehensive evaluation framework for testing Vision Language Model (VLM) accuracy in pixel-level pointing tasks. This tool generates synthetic test images, evaluates multiple VLMs, and provides a visual web interface to compare model performance.
For codebase structure and module organization, see STRUCTURE.md.
This benchmark evaluates how accurately different VLMs can identify and point to specific locations in images when given natural language prompts. It's particularly useful for:
- Comparing VLM performance on pixel-accurate pointing tasks
- Testing model accuracy across different screen sizes and aspect ratios
- Visualizing model predictions with an interactive web interface
- Evaluating models for UI automation and device control applications
- Test Suite System: Modular test suite architecture supporting both synthetic and screenshot-based tests
- Synthetic Test Image Generation: Creates test images with various shapes (circles, squares, triangles, buttons, X marks) in different positions and colors
- Multi-Model Evaluation: Supports multiple VLMs including:
- Claude Sonnet 4
- Claude Opus 4
- Gemini 3 Pro
- GPT-5.2
- Claude Haiku 4
- Grok 4.1 (via OpenRouter)
- Qwen3-VL (via OpenRouter)
- GLM-4.6V (via OpenRouter)
- Gemini 2.5 Flash (via OpenRouter)
- Multiple Passes: Run evaluations multiple times to calculate statistics and standard deviation
- Non-Overwriting Results: Results are stored with timestamps, allowing multiple runs without overwriting
- Multiple Screen Sizes: Test models across different screen dimensions and aspect ratios
- Comprehensive Metrics: Calculates distance errors, extraction rates, accuracy thresholds, and standard deviation across passes
- Enhanced Visual Web Viewer: Interactive HTML interface with:
- Test suite selection
- Model and pass filtering
- Multiple pass visualization
- Improved color scheme
- Statistical summaries (mean, std dev, min/max)
- Custom Test Cases: Define your own test scenarios via JSON configuration or create custom test suites
1. Clone or download this repository

2. Create a virtual environment (recommended):

   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install dependencies:

   pip install -r requirements.txt

4. Set up environment variables: create a `.env` file in the project root with your API keys:

   ANTHROPIC_API_KEY=your_anthropic_key
   OPENAI_API_KEY=your_openai_key
   GEMINI_API_KEY=your_gemini_key
   OPENROUTER_API_KEY=your_openrouter_key

   Note: For OpenRouter models (grok-4.1, qwen3-vl, glm-4.6v, gemini-2.5-flash), you need to set `OPENROUTER_API_KEY`.
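To confirm the keys are visible to Python before running an evaluation, you can use a quick check like the one below. This is a hypothetical helper, not part of the repo, and it assumes the `python-dotenv` package is installed:

```python
# check_keys.py - hypothetical helper script; assumes `pip install python-dotenv`
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

REQUIRED_KEYS = [
    "ANTHROPIC_API_KEY",
    "OPENAI_API_KEY",
    "GEMINI_API_KEY",
    "OPENROUTER_API_KEY",  # only needed for OpenRouter-hosted models
]

for key in REQUIRED_KEYS:
    print(f"{key}: {'set' if os.getenv(key) else 'MISSING'}")
```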
The codebase has been refactored with a modular architecture. Use the new evaluate.py:
# List available test suites
python evaluate.py --list-suites
# Run a suite by itself (defaults to all available models, 3 passes)
python evaluate.py --test-suite basic_shapes
# Run a specific test suite with specific models
python evaluate.py --test-suite basic_shapes --models sonnet opus gemini3
# Run with multiple passes for statistics
python evaluate.py --test-suite basic_shapes --models sonnet --num-passes 3
# Custom screen size
python evaluate.py --test-suite basic_shapes --width 1080 --height 2400

Suites available in this repo (see python evaluate.py --list-suites):
- basic_shapes: simple shapes + a few harder variants (overlap, transparency, low contrast) at your chosen --width/--height
- color_identification: basic colors → subtle differences → hex colors (square images recommended)
- shape_identification: confusable shapes (e.g., hexagon vs octagon, decagon among circles)
- size_comparison: “pick larger/smaller” comparisons (can trigger wrong-object clicks)
- resolution_test_256x256 / 512x512 / 1024x1024: explicit per-resolution suites (recommended for resolution comparisons)
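Because each resolution suite is meant to be paired with a matching --width/--height, a sweep is easy to script. The sketch below uses only the flags documented here; the model choice ("sonnet") is just an example, and the equivalent shell commands appear under Examples below:

```python
# Sweep the per-resolution suites, matching each suite to its width/height.
# Uses only the documented evaluate.py flags; "sonnet" is an example model.
import subprocess

for size in (256, 512, 1024):
    subprocess.run(
        [
            "python", "evaluate.py",
            "--test-suite", f"resolution_test_{size}x{size}",
            "--width", str(size),
            "--height", str(size),
            "--models", "sonnet",
        ],
        check=True,
    )
```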
Examples:
# Color/shape/size suites at 1024x1024
python evaluate.py --test-suite color_identification --width 1024 --height 1024
python evaluate.py --test-suite shape_identification --width 1024 --height 1024
python evaluate.py --test-suite size_comparison --width 1024 --height 1024
# Resolution sweep (run each suite at matching width/height)
python evaluate.py --test-suite resolution_test_256x256 --width 256 --height 256
python evaluate.py --test-suite resolution_test_512x512 --width 512 --height 512
python evaluate.py --test-suite resolution_test_1024x1024 --width 1024 --height 1024

Select specific models:
python evaluate.py --test-suite basic_shapes --models sonnet opus gemini3

Custom screen size:
python evaluate.py --test-suite basic_shapes --width 1080 --height 2400

Run multiple passes for statistics:
python evaluate.py --test-suite basic_shapes --models sonnet --num-passes 3

Don't save images (faster, smaller output):
python evaluate.py --test-suite basic_shapes --no-save-images

Fix consolidated results (if models list is missing):
python -m evaluation.utils fix --test-suite basic_shapes --screen-size custom

Update test suites index:
python -m evaluation.utils index

Option 1: Using the Python server (Recommended)
Run the included server script:
python serve_viewer.py

This will:
- Start a local web server on port 8000
- Automatically open the viewer in your browser
- Allow the viewer to load JSON files and images
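If you prefer not to use the script, its behavior is easy to approximate. The sketch below is a stand-in built on Python's standard library, not the actual contents of serve_viewer.py:

```python
# Minimal stand-in for a viewer server: serve the current directory and open the page.
import http.server
import socketserver
import webbrowser

PORT = 8000

with socketserver.TCPServer(("", PORT), http.server.SimpleHTTPRequestHandler) as httpd:
    webbrowser.open(f"http://localhost:{PORT}/index.html")
    print(f"Serving viewer at http://localhost:{PORT} (Ctrl+C to stop)")
    httpd.serve_forever()
```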
Option 2: Using Python's built-in server
python -m http.server 8000

Then open http://localhost:8000/index.html in your browser.
Using the Enhanced Viewer (index.html):
- Select the results directory (default: `results`)
- Select a test suite from the dropdown
- Click "Load Results"
- Use the filters to:
- Show/hide specific models
- Show/hide specific passes (for multi-pass runs)
- Browse through test images with visual overlays showing:
- Colored dots: Each model's prediction (multiple passes shown with reduced opacity)
- Legend: Click legend items to toggle model visibility
- Statistics: Mean distance, standard deviation, and min/max across passes
Results are organized in the output directory as follows:
results/
├── test_suites.json # Index of all test suites
├── basic_shapes/ # Test suite name
│ └── custom/ # Screen size name
│ ├── images/ # Test images (one per test case)
│ │ ├── simple_circle.png
│ │ ├── top_right_square.png
│ │ └── ...
│ ├── consolidated_results.json # All models' predictions per test
│ ├── runs_index.json # Index of all runs
│ ├── sonnet_pass1_*.json # Individual run results
│ ├── opus_pass1_*.json
│ └── ...
└── ...
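If you want to inspect results programmatically, the layout above is enough to locate the per-run files. The sketch below uses the suite and screen-size names from the tree above; the JSON schemas are not documented here, so only file-level access is shown:

```python
# List per-run result files and load the consolidated results for one suite/size.
import json
from pathlib import Path

run_dir = Path("results") / "basic_shapes" / "custom"

for run_file in sorted(run_dir.glob("*_pass*_*.json")):
    print(run_file.name)  # e.g. sonnet_pass1_<timestamp>.json

consolidated = run_dir / "consolidated_results.json"
if consolidated.exists():
    data = json.loads(consolidated.read_text())
    print(f"Loaded consolidated results ({type(data).__name__})")
```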
The benchmark includes 8 default test cases:
- Simple circle (center)
- Top-right square corner
- Middle of X mark
- Transparent button
- Small circle
- Bottom-left triangle
- Overlapping shapes
- Low contrast button
Create a JSON file with custom test configurations:
[
{
"name": "custom_circle",
"prompt": "Point to the center of the purple circle",
"shape": "circle",
"color": "purple",
"position": "center",
"expected_coords": [540, 1200]
},
{
"name": "custom_button",
"prompt": "Point to the center of the red button",
"shape": "button",
"color": "red",
"position": "center",
"size": "large"
}
]

Available options:
- `name`: Unique test identifier
- `prompt`: Natural language instruction for the model
- `shape`: `circle`, `square`, `triangle`, `x`, `button`
- `color`: `purple`, `red`, `blue`, `green`, `yellow`, `orange`, `gray`, `lightgray`, `transparent`
- `position`: `center`, `top_left`, `top_right`, `bottom_left`, `bottom_right`
- `size`: `small`, `medium` (default), `large`
- `expected_coords`: `[x, y]` - Optional exact coordinates
- `overlap`: `true` - For overlapping shapes test
- `background`: Background color name
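If you generate these files from code, it is worth validating the option values against the lists above before running an evaluation. The sketch below writes a custom test file; the filename custom_tests.json and the validation sets are illustrative only:

```python
# Write a custom test case file and check it against the documented options.
import json

VALID_SHAPES = {"circle", "square", "triangle", "x", "button"}
VALID_POSITIONS = {"center", "top_left", "top_right", "bottom_left", "bottom_right"}

tests = [
    {
        "name": "custom_circle",
        "prompt": "Point to the center of the purple circle",
        "shape": "circle",
        "color": "purple",
        "position": "center",
        "expected_coords": [540, 1200],
    },
]

for test in tests:
    assert test["shape"] in VALID_SHAPES, f"unknown shape: {test['shape']}"
    assert test.get("position", "center") in VALID_POSITIONS

with open("custom_tests.json", "w") as f:  # hypothetical output path
    json.dump(tests, f, indent=2)
```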
The benchmark calculates several accuracy metrics:
- Extraction Rate: Percentage of successful coordinate extractions
- Mean Distance: Average pixel distance from ground truth
- Median Distance: Median pixel distance
- Accuracy within 10px: Percentage of predictions within 10 pixels
- Accuracy within 5%: Percentage within 5% of screen diagonal
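These metrics can be reproduced directly from raw predictions. The sketch below is not the project's evaluation code; it simply implements the definitions listed above, with distances in pixels and the 5% threshold taken against the screen diagonal (computing thresholds over successfully extracted predictions is an assumption):

```python
# Compute the distance-based metrics described above from raw predictions.
import math
from statistics import mean, median

def pointing_metrics(predictions, ground_truth, width, height):
    """predictions: list of (x, y) or None; ground_truth: list of (x, y)."""
    diagonal = math.hypot(width, height)
    extracted = [(p, g) for p, g in zip(predictions, ground_truth) if p is not None]
    distances = [math.hypot(p[0] - g[0], p[1] - g[1]) for p, g in extracted]
    return {
        "extraction_rate": len(extracted) / len(predictions),
        "mean_distance": mean(distances) if distances else None,
        "median_distance": median(distances) if distances else None,
        "accuracy_within_10px": sum(d <= 10 for d in distances) / len(distances) if distances else 0.0,
        "accuracy_within_5pct": sum(d <= 0.05 * diagonal for d in distances) / len(distances) if distances else 0.0,
    }

# Example: two predictions (one failed extraction) on a 1080x2400 screen
print(pointing_metrics([(542, 1195), None], [(540, 1200), (100, 100)], 1080, 2400))
```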
The enhanced viewer (index.html) uses the following color scheme:
- Sonnet: rgb(168, 2, 15) (Red)
- Opus: rgb(255, 132, 0) (Orange)
- Gemini3: rgb(0, 255, 76) (Green)
- ChatGPT: rgb(17, 160, 207) (Blue)
- Haiku: rgb(164, 11, 224) (Purple)
- Grok-4.1: rgb(255, 20, 147) (Deep Pink)
- Qwen3-VL: rgb(34, 139, 34) (Forest Green)
- GLM-4.6V: rgb(25, 6, 133) (Indigo)
- Gemini-2.5-Flash: rgb(255, 0, 43) (Crimson)
- Ground Truth: Black circle with white center
Colors are displayed as dots next to model names in the statistics section.
- Python 3.8+
- API keys for the models you want to test
- Modern web browser for viewing results
This project is provided as-is for evaluation and research purposes.