Pixel Pointing Benchmark

A comprehensive evaluation framework for testing Vision Language Model (VLM) accuracy in pixel-level pointing tasks. This tool generates synthetic test images, evaluates multiple VLMs, and provides a visual web interface to compare model performance.

For codebase structure and module organization, see STRUCTURE.md.

Overview

This benchmark evaluates how accurately different VLMs can identify and point to specific locations in images when given natural language prompts. It's particularly useful for:

  • Comparing VLM performance on pixel-accurate pointing tasks
  • Testing model accuracy across different screen sizes and aspect ratios
  • Visualizing model predictions with an interactive web interface
  • Evaluating models for UI automation and device control applications

Features

  • Test Suite System: Modular test suite architecture supporting both synthetic and screenshot-based tests
  • Synthetic Test Image Generation: Creates test images with various shapes (circles, squares, triangles, buttons, X marks) in different positions and colors
  • Multi-Model Evaluation: Supports multiple VLMs including:
    • Claude Sonnet 4
    • Claude Opus 4
    • Gemini 3 Pro
    • GPT-5.2
    • Claude Haiku 4
    • Grok 4.1 (via OpenRouter)
    • Qwen3-VL (via OpenRouter)
    • GLM-4.6V (via OpenRouter)
    • Gemini 2.5 Flash (via OpenRouter)
  • Multiple Passes: Run evaluations multiple times to calculate statistics and standard deviation
  • Non-Overwriting Results: Results are stored with timestamps, allowing multiple runs without overwriting
  • Multiple Screen Sizes: Test models across different screen dimensions and aspect ratios
  • Comprehensive Metrics: Calculates distance errors, extraction rates, accuracy thresholds, and standard deviation across passes
  • Enhanced Visual Web Viewer: Interactive HTML interface with:
    • Test suite selection
    • Model and pass filtering
    • Multiple pass visualization
    • Improved color scheme
    • Statistical summaries (mean, std dev, min/max)
  • Custom Test Cases: Define your own test scenarios via JSON configuration or create custom test suites

Installation

  1. Clone or download this repository

  2. Create a virtual environment (recommended):

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

  3. Install dependencies:

pip install -r requirements.txt

  4. Set up environment variables: Create a .env file in the project root with your API keys:

    ANTHROPIC_API_KEY=your_anthropic_key
    OPENAI_API_KEY=your_openai_key
    GEMINI_API_KEY=your_gemini_key
    OPENROUTER_API_KEY=your_openrouter_key
    

    Note: For OpenRouter models (grok-4.1, qwen3-vl, glm-4.6v, gemini-2.5-flash), you need to set OPENROUTER_API_KEY.
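
To confirm the keys are actually visible to Python before kicking off a run, a quick check like the one below can help. This is only a sketch: it assumes the python-dotenv package for loading .env, which may differ from how the benchmark itself reads these variables, and check_keys.py is a hypothetical helper, not part of the repository.

# check_keys.py -- hypothetical helper, not part of the repository
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # read .env from the project root

# Map each key to the models it unlocks (matching the model list above).
required_keys = {
    "ANTHROPIC_API_KEY": "Claude Sonnet / Opus / Haiku",
    "OPENAI_API_KEY": "GPT models",
    "GEMINI_API_KEY": "Gemini 3 Pro",
    "OPENROUTER_API_KEY": "Grok 4.1, Qwen3-VL, GLM-4.6V, Gemini 2.5 Flash",
}

for key, models in required_keys.items():
    status = "set" if os.getenv(key) else "MISSING"
    print(f"{key}: {status}  ({models})")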

Usage

New Structure (Recommended)

The codebase has been refactored with a modular architecture. Use the new evaluate.py:

# List available test suites
python evaluate.py --list-suites

# Run a suite by itself (defaults to all available models, 3 passes)
python evaluate.py --test-suite basic_shapes

# Run a specific test suite with specific models
python evaluate.py --test-suite basic_shapes --models sonnet opus gemini3

# Run with multiple passes for statistics
python evaluate.py --test-suite basic_shapes --models sonnet --num-passes 3

# Custom screen size
python evaluate.py --test-suite basic_shapes --width 1080 --height 2400

Running Different Test Suites

Suites available in this repo (see python evaluate.py --list-suites):

  • basic_shapes: simple shapes + a few harder variants (overlap, transparency, low contrast) at your chosen --width/--height
  • color_identification: basic colors → subtle differences → hex colors (square images recommended)
  • shape_identification: confusable shapes (e.g., hexagon vs octagon, decagon among circles)
  • size_comparison: “pick larger/smaller” comparisons (can trigger wrong-object clicks)
  • resolution_test_256x256 / 512x512 / 1024x1024: explicit per-resolution suites (recommended for resolution comparisons)

Examples:

# Color/shape/size suites at 1024x1024
python evaluate.py --test-suite color_identification --width 1024 --height 1024
python evaluate.py --test-suite shape_identification --width 1024 --height 1024
python evaluate.py --test-suite size_comparison --width 1024 --height 1024

# Resolution sweep (run each suite at matching width/height)
python evaluate.py --test-suite resolution_test_256x256 --width 256 --height 256
python evaluate.py --test-suite resolution_test_512x512 --width 512 --height 512
python evaluate.py --test-suite resolution_test_1024x1024 --width 1024 --height 1024
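
To script the sweep rather than typing each command, a thin Python wrapper over the documented flags is enough. This is an illustrative sketch (resolution_sweep.py is a hypothetical helper); it only shells out to evaluate.py with the suite names and options shown above.

# resolution_sweep.py -- hypothetical helper, not part of the repository
import subprocess

# Run each resolution suite at its matching width/height, as recommended above.
for size in (256, 512, 1024):
    cmd = [
        "python", "evaluate.py",
        "--test-suite", f"resolution_test_{size}x{size}",
        "--width", str(size),
        "--height", str(size),
    ]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop the sweep if any run fails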

Custom Options

Select specific models:

python evaluate.py --test-suite basic_shapes --models sonnet opus gemini3

Custom screen size:

python evaluate.py --test-suite basic_shapes --width 1080 --height 2400

Run multiple passes for statistics:

python evaluate.py --test-suite basic_shapes --models sonnet --num-passes 3

Don't save images (faster, smaller output):

python evaluate.py --test-suite basic_shapes --no-save-images

Utility Commands

Fix consolidated results (if models list is missing):

python -m evaluation.utils fix --test-suite basic_shapes --screen-size custom

Update test suites index:

python -m evaluation.utils index

Viewing Results

Option 1: Using the Python server (Recommended)

Run the included server script:

python serve_viewer.py

This will:

  • Start a local web server on port 8000
  • Automatically open the viewer in your browser
  • Allow the viewer to load JSON files and images
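
Conceptually the script amounts to little more than the snippet below; this is a rough sketch for orientation, not the actual serve_viewer.py source.

# Roughly what a viewer server needs to do: serve the repo root and open index.html.
import http.server
import socketserver
import webbrowser

PORT = 8000  # same port mentioned above

with socketserver.TCPServer(("", PORT), http.server.SimpleHTTPRequestHandler) as httpd:
    webbrowser.open(f"http://localhost:{PORT}/index.html")
    httpd.serve_forever()  # Ctrl+C to stop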

Option 2: Using Python's built-in server

python -m http.server 8000

Then open http://localhost:8000/index.html in your browser.

Using the Enhanced Viewer (index.html):

  1. Select the results directory (default: results)
  2. Select a test suite from the dropdown
  3. Click "Load Results"
  4. Use the filters to:
    • Show/hide specific models
    • Show/hide specific passes (for multi-pass runs)
  5. Browse through test images with visual overlays showing:
    • Colored dots: Each model's prediction (multiple passes shown with reduced opacity)
    • Legend: Click legend items to toggle model visibility
    • Statistics: Mean distance, standard deviation, and min/max across passes

Output Structure

Results are organized in the output directory as follows:

results/
├── test_suites.json              # Index of all test suites
├── basic_shapes/                 # Test suite name
│   └── custom/                   # Screen size name
│       ├── images/               # Test images (one per test case)
│       │   ├── simple_circle.png
│       │   ├── top_right_square.png
│       │   └── ...
│       ├── consolidated_results.json  # All models' predictions per test
│       ├── runs_index.json       # Index of all runs
│       ├── sonnet_pass1_*.json   # Individual run results
│       ├── opus_pass1_*.json
│       └── ...
└── ...
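
To post-process runs programmatically, you can walk this layout directly. The sketch below only assumes the directory structure shown above; since the exact JSON schema of consolidated_results.json is not documented here, it prints the top-level keys so you can inspect the structure before relying on specific field names.

# inspect_results.py -- hypothetical helper, not part of the repository
import json
from pathlib import Path

suite_dir = Path("results") / "basic_shapes" / "custom"  # <suite>/<screen size> as in the tree above

with open(suite_dir / "consolidated_results.json") as f:
    consolidated = json.load(f)

# Peek at the structure before writing any analysis code against it.
if isinstance(consolidated, dict):
    for key in list(consolidated)[:10]:
        print(key, "->", type(consolidated[key]).__name__)
else:
    print(type(consolidated).__name__, "with", len(consolidated), "entries")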

Test Configuration

Default Test Cases

The benchmark includes 8 default test cases:

  • Simple circle (center)
  • Top-right square corner
  • Middle of X mark
  • Transparent button
  • Small circle
  • Bottom-left triangle
  • Overlapping shapes
  • Low contrast button

Custom Test Cases

Create a JSON file with custom test configurations:

[
  {
    "name": "custom_circle",
    "prompt": "Point to the center of the purple circle",
    "shape": "circle",
    "color": "purple",
    "position": "center",
    "expected_coords": [540, 1200]
  },
  {
    "name": "custom_button",
    "prompt": "Point to the center of the red button",
    "shape": "button",
    "color": "red",
    "position": "center",
    "size": "large"
  }
]

Available options:

  • name: Unique test identifier
  • prompt: Natural language instruction for the model
  • shape: circle, square, triangle, x, button
  • color: purple, red, blue, green, yellow, orange, gray, lightgray, transparent
  • position: center, top_left, top_right, bottom_left, bottom_right
  • size: small, medium (default), large
  • expected_coords: [x, y] - Optional exact coordinates
  • overlap: true - For overlapping shapes test
  • background: Background color name
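
Before launching a run against a hand-written file, a quick sanity check against the allowed values above can catch typos. The validator below is purely illustrative (the file name custom_tests.json is a placeholder); the benchmark may perform its own validation.

# validate_tests.py -- hypothetical helper, not part of the repository
import json

ALLOWED = {
    "shape": {"circle", "square", "triangle", "x", "button"},
    "color": {"purple", "red", "blue", "green", "yellow", "orange",
              "gray", "lightgray", "transparent"},
    "position": {"center", "top_left", "top_right", "bottom_left", "bottom_right"},
    "size": {"small", "medium", "large"},
}

with open("custom_tests.json") as f:  # placeholder path to your custom test file
    tests = json.load(f)

for test in tests:
    for field, allowed in ALLOWED.items():
        value = test.get(field)
        if value is not None and value not in allowed:
            print(f"{test.get('name', '<unnamed>')}: unexpected {field} value {value!r}")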

Metrics

The benchmark calculates several accuracy metrics:

  • Extraction Rate: Percentage of successful coordinate extractions
  • Mean Distance: Average pixel distance from ground truth
  • Median Distance: Median pixel distance
  • Accuracy within 10px: Percentage of predictions within 10 pixels
  • Accuracy within 5%: Percentage within 5% of screen diagonal
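
In arithmetic terms, the distance metrics come down to the Euclidean distance between a predicted point and the ground-truth point, with the 5% threshold scaled by the screen diagonal. The function below sketches that calculation; the field names ("pred", "expected") are illustrative, not the benchmark's actual schema.

# Illustrative metric computation; field names are assumptions, not the repo's schema.
import math
import statistics

def pointing_metrics(predictions, width, height):
    diagonal = math.hypot(width, height)        # screen diagonal for the 5% threshold
    distances = []
    extracted = 0
    for p in predictions:
        if p.get("pred") is None:               # no parsable coordinates in the model response
            continue
        extracted += 1
        (px, py), (ex, ey) = p["pred"], p["expected"]
        distances.append(math.hypot(px - ex, py - ey))
    n = len(distances)
    return {
        "extraction_rate": extracted / len(predictions),
        "mean_distance": statistics.mean(distances) if n else None,
        "median_distance": statistics.median(distances) if n else None,
        "acc_within_10px": sum(d <= 10 for d in distances) / n if n else 0.0,
        "acc_within_5pct": sum(d <= 0.05 * diagonal for d in distances) / n if n else 0.0,
    }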

Model Colors in Viewer

The enhanced viewer (index.html) uses the following color scheme:

  • Sonnet: rgb(168, 2, 15) (Red)
  • Opus: rgb(255, 132, 0) (Orange)
  • Gemini3: rgb(0, 255, 76) (Green)
  • ChatGPT: rgb(17, 160, 207) (Blue)
  • Haiku: rgb(164, 11, 224) (Purple)
  • Grok-4.1: rgb(255, 20, 147) (Deep Pink)
  • Qwen3-VL: rgb(34, 139, 34) (Forest Green)
  • GLM-4.6V: rgb(25, 6, 133) (Indigo)
  • Gemini-2.5-Flash: rgb(255, 0, 43) (Crimson)
  • Ground Truth: Black circle with white center

Colors are displayed as dots next to model names in the statistics section.

Requirements

  • Python 3.8+
  • API keys for the models you want to test
  • Modern web browser for viewing results

License

This project is provided as-is for evaluation and research purposes.
