PICO: Performance Insights for Collective Operations

GitHub stars License: MIT Contributions welcome Paper: ISC 2026

💫 If you find PICO useful for your research or benchmarking work, please consider giving it a ⭐ on GitHub!


PICO is a lightweight, extensible, and reproducible benchmarking suite for evaluating and tuning collective communication operations across diverse libraries and hardware platforms.

Built for researchers, developers, and system administrators, PICO streamlines the entire benchmarking workflow, from configuration to execution, tracing, and analysis, across MPI, NCCL, and user-defined collectives.

⭐ Highlights

  • 📦 Unified micro-benchmarking of both CPU and GPU collectives, across a variety of MPI libraries (Open MPI, MPICH, Cray MPICH), NCCL, and user-defined algorithms.
  • 🎛️ Guided configuration via a fully fledged Textual TUI or CLI-driven JSON/flag workflow with per-site presets.
  • 📋 Reproducible runs through environment capture, metadata logging, and timestamped result directories.
  • 🧩 Built-in correctness checks for custom collectives and automatic ground-truth validation.
  • 🧭 Per-phase instrumentation, going beyond micro-benchmarking, hence the name PICO.
  • 🧵 Queue-friendly orchestration that compiles, ships, and archives jobs seamlessly on SLURM clusters or in local mode for debugging.
  • 📊 Bundled plotting, tracing, and scheduling utilities for streamlined post-processing and algorithm engineering.

📚 Publication

If you use PICO in your work, please cite the PICO paper:

S. Pasqualoni, T. Bonato, L. Piarulli, T. Hoefler, M. Canini and D. De Sensi, "PICO: Performance Insights for Collective Operations", ISC High Performance 2026 Research Paper Proceedings (41st International Conference), Hamburg, Germany, 2026, pp. 1-14, DOI: 10.23919/ISC.2026.11520486.

@INPROCEEDINGS{11520486,
  author={Pasqualoni, Saverio and Bonato, Tommaso and Piarulli, Lorenzo and Hoefler, Torsten and Canini, Marco and De Sensi, Daniele},
  booktitle={ISC High Performance 2026 Research Paper Proceedings (41st International Conference)}, 
  title={PICO: Performance Insights for Collective Operations}, 
  year={2026},
  volume={},
  number={},
  pages={1-14},
  keywords={Algorithms;High performance computing;Libraries;Uniform resource locators;Modeling;Management information base;Conferences;Distance measurement;Broadcasting;Graphics processing units;High performance computing;Performance analysis;Computer networks;Message passing;Software Tools},
  doi={10.23919/ISC.2026.11520486}}

Architecture at a Glance

Configuration
 ├─ 🧩 Sources: Textual TUI • JSON • CLI flags
 └─ ⚙️ Validation & module loading via submit_wrapper.sh

🚀 Orchestration
 ├─ 🧵 scripts/orchestrator.sh iterates over:
 │    • Libraries × Collectives × Message Sizes
 └─ 🏗️ Builds binaries and dispatches jobs (SLURM or local)

🧠 Execution
 ├─ pico_core / libpico executables
 ├─ ✅ Correctness checks
 └─ 🧭 Optional per-phase instrumentation

📊 Results
 ├─ results/<system>/<timestamp>/
 │    • CSV metrics
 │    • Logs
 │    • Metadata
 │    • Archives
 └─ Post-processing utilities:
      • plot/ • tracer/ • schedgen/ • selector/
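
As an illustration of the results/<system>/<timestamp>/ layout, the following sketch locates the most recent run directory for a system and lists the CSV files beneath it. It is not part of PICO: the helper names are hypothetical, and it assumes that timestamp directory names sort chronologically when compared as strings and that CSVs sit somewhere under the run directory.

```python
from pathlib import Path


def latest_run_dir(results_root, system):
    """Return the newest run directory under results_root/system, or None.

    Assumption: timestamp directory names sort chronologically as strings.
    """
    system_dir = Path(results_root) / system
    if not system_dir.is_dir():
        return None
    runs = sorted(p for p in system_dir.iterdir() if p.is_dir())
    return runs[-1] if runs else None


def csv_files(run_dir):
    """Return every CSV file found under a run directory, sorted by path."""
    return sorted(Path(run_dir).rglob("*.csv"))


# "leonardo" is the system name used in the CLI example of this README.
run = latest_run_dir("results", "leonardo")
if run is None:
    print("no runs found")
else:
    for path in csv_files(run):
        print(path)
```

If the timestamp format in your checkout does not sort lexicographically, replace the `sorted` call with a key that parses the directory name.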

🚀 Quickstart

The recommended way to use PICO is through its Textual TUI, which guides you from configuration to job submission.

βš™οΈ 1. Configure Your Environment

Ensure you have at least one valid environment definition under config/environment/ (TUI) or config/environments/ (legacy CLI).

A working local sample is provided; adapt it to your machine.

For remote clusters, mirror one of the existing environment templates and adapt it to your site (a setup wizard to simplify this configuration is on its way).

🧭 2. Create a virtual env and launch the TUI

Create and activate a Python virtual environment, then install the Python dependencies used by the TUI and analysis tools:

pip install -r requirements.txt

Start the interactive interface to configure the environment, select libraries, choose algorithms, and export the test descriptor. See tui/README.md for a full walkthrough of the four-step wizard.

python tui/main.py

🧩 3. Generate a Test Description

Within the TUI, define:

  • The target collective(s)
  • Message sizes and iteration counts
  • Backends (MPI / NCCL / custom)
  • Instrumentation and validation settings

The TUI produces a test descriptor encapsulating all these options. It is exported as tests/<name>.json (full configuration) and tests/<name>.sh (shell exports).
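
Before submitting, it can help to check what an exported descriptor actually contains. The sketch below loads a JSON file and reports the type of each top-level entry. It makes no claim about PICO's real schema: it only assumes the file holds a JSON object, and the keys written to the stand-in file are made up for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def summarize_descriptor(path):
    """Return {top-level key: type name} for a JSON test descriptor."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return {key: type(value).__name__ for key, value in data.items()}


# Stand-in descriptor with hypothetical keys; point the function at a real
# tests/<name>.json export to inspect an actual configuration.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "example.json"
    sample.write_text(json.dumps({"collectives": ["allreduce"], "nodes": 2}))
    for key, kind in summarize_descriptor(sample).items():
        print(f"{key}: {kind}")
```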

🚀 4. Run the Benchmark

Execute the generated descriptor using the wrapper script, which handles compilation, dispatch, and archival:

scripts/submit_wrapper.sh -f [path_to_test_sh_file]

This command orchestrates the full benchmarking workflow, locally or on SLURM clusters, using your defined environment.

🧰 Optional: CLI Workflow (Legacy)

You can still invoke PICO directly via the CLI to explore options or run ad-hoc tests. To do so, after step 1 run:

scripts/submit_wrapper.sh --help

⚠️ Note: The CLI path is only partially maintained; some flags may be deprecated as functionality transitions to the TUI.

Example CLI invocation:

scripts/submit_wrapper.sh \
  --location leonardo \
  --nodes 8 \
  --ntasks-per-node 32 \
  --collectives allreduce,allgather \
  --types int32,double \
  --sizes 64,1024,65536 \
  --segment-sizes 0 \
  --time 01:00:00 \
  --gpu-awareness no
  • Provide comma-separated lists for datatypes, message sizes, and segment sizes.
  • Use --gpu-awareness yes and --gpu-per-node to benchmark NCCL or CUDA-aware MPI collectives.
  • Pass --debug yes for quick validation runs with reduced iterations and debug builds.
  • When --compile-only yes is set, the script stops after building bin/pico_core and its GPU counterpart.
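
When many CLI invocations are generated from a script, building the argument list programmatically avoids quoting mistakes. The sketch below is a hypothetical helper, not part of PICO: it only reuses the flag names from the example above and joins list values with commas, as the CLI expects. It prints the command rather than running it.

```python
import shlex


def build_submit_command(options, script="scripts/submit_wrapper.sh"):
    """Turn a dict of option names into an argv list for the wrapper script.

    List and tuple values are joined with commas, matching the
    comma-separated convention for collectives, datatypes, and sizes.
    """
    argv = [script]
    for name, value in options.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)
        argv.extend([f"--{name}", str(value)])
    return argv


# Same settings as the example invocation above.
example = {
    "location": "leonardo",
    "nodes": 8,
    "ntasks-per-node": 32,
    "collectives": ["allreduce", "allgather"],
    "types": ["int32", "double"],
    "sizes": [64, 1024, 65536],
    "segment-sizes": [0],
    "time": "01:00:00",
    "gpu-awareness": "no",
}

command = build_submit_command(example)
print(shlex.join(command))
```

To actually launch the run, the list could be handed to `subprocess.run(command, check=True)` from the repository root. Since some flags may be deprecated, check them against `scripts/submit_wrapper.sh --help` first.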

💻 Dependencies

  • A C/C++ compiler and MPI implementation (Open MPI, MPICH, or Cray MPICH). CUDA-aware MPI or NCCL is optional for GPU runs.
  • (Optional) CUDA toolkit and a compatible NCCL build for GPU collectives.
  • Python 3.9+ with pip for the TUI and analysis utilities (pip install -r requirements.txt).
  • SLURM for cluster submissions; local mode is supported for functional testing.
  • Basic build tools (make) and a Bash-compatible shell.

🧠 Core Components

  • pico_core/: C benchmarking driver that allocates buffers, times collectives, checks results, and writes output.
  • libpico/: Library of custom collective algorithms and instrumentation helpers, selectable alongside vendor MPI/NCCL paths.
  • scripts/submit_wrapper.sh: Entry point that parses CLI flags or TUI exports, loads site modules, builds binaries, activates Python envs, and launches SLURM or local runs.
  • scripts/orchestrator.sh: Node-side runner that sweeps libraries, algorithm sets, GPU modes, message sizes, and datatypes while invoking metadata capture and optional compression.
  • config/: Declarative environment, library, and algorithm descriptions consumed by the TUI and CLI (modules to load, compiler wrappers, task/GPU limits). See config/environment/README.md for the environment schema.
  • tui/: Textual-based UI that guides the user through environment selection, library selection, and algorithm mix, then exports the shell/JSON bundle for later submission. See tui/README.md for usage and extension details.
  • plot/: Python package and CLI (python -m plot …) that turns CSV summaries into line charts, bar charts, heatmaps, and tables. See plot/README.md for available subcommands and data pipelines.
  • tracer/: Tools for network-awareness studies (link utilization estimates, cluster job monitoring, scatterplots/boxplots). See tracer/README.md for details.
  • schedgen/: Adapted SPCL scheduler generator used to derive algorithm schedules from communication traces. See schedgen/README.md for usage and built-in algorithms.
  • selector/: Open MPI tuning rule generation (generate_ompi_tuning.py) and dynamic rule selection helpers.
  • results/: Storage for raw outputs, metadata CSVs (per system), and helper scripts such as generate_metadata.py.

💡 What Happens During a Run

  1. Environment sourcing loads modules, compiler wrappers, MPI/NCCL paths, and queue defaults. In the CLI workflow this happens via config/environments/<location>.sh; in the TUI workflow the test descriptor (tests/<name>.sh) already carries all resolved settings.
  2. The Makefile builds libpico first, then pico_core (CPU) and optionally pico_core_cuda (GPU) or pico_core_nccl (NCCL), honouring debug and instrumentation flags.
  3. A Python virtual environment is activated and populated with plotting/tracing dependencies on demand.
  4. scripts/orchestrator.sh iterates over every selected library, collective, datatype, message size, and GPU mode. For each combination it:
    • Prepares per-collective environment variables and propagates algorithm lists to the workers.
    • Generates metadata entries through results/generate_metadata.py, capturing cluster, job, library, GPU, and note fields.
    • Runs pico_core, which allocates buffers, initializes randomized inputs (deterministic when debugging), executes warmups, measures iterations, and compares the outcome against vendor MPI results.
    • Optionally enables LibPICO instrumentation tags to time internal algorithm phases.
  5. Outputs are written under results/<location>/<timestamp>/; in non-debug runs the directory can be tarred and optionally deleted.
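
The size of a sweep grows multiplicatively with each dimension listed in step 4. The sketch below enumerates the combinations with `itertools.product` to show how quickly the run count adds up. It is an illustration of the iteration space only, not the logic of scripts/orchestrator.sh; the function name, the `"openmpi"` label, and the `"cpu"` mode label are assumptions made for the example.

```python
from itertools import product


def enumerate_runs(libraries, collectives, datatypes, sizes, gpu_modes=("cpu",)):
    """Yield one dict per (library, collective, datatype, size, GPU mode)."""
    for library, collective, datatype, size, gpu_mode in product(
        libraries, collectives, datatypes, sizes, gpu_modes
    ):
        yield {
            "library": library,
            "collective": collective,
            "datatype": datatype,
            "size": size,
            "gpu_mode": gpu_mode,
        }


runs = list(
    enumerate_runs(
        libraries=["openmpi"],  # hypothetical library label
        collectives=["allreduce", "allgather"],
        datatypes=["int32", "double"],
        sizes=[64, 1024, 65536],
    )
)

# 1 library * 2 collectives * 2 datatypes * 3 sizes * 1 GPU mode
print(len(runs))  # 12
print(runs[0])
```

Counting combinations this way before submitting gives a rough sense of how much wall-clock time to request.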

📈 Results and Analysis

  • CSV files follow the <count>_<algorithm>_<datatype>.csv naming scheme (or <count>_<algorithm>_<segsize>_<datatype>.csv for segmented collectives). Instrumented builds append _instrument before the extension. Rows contain per-iteration timing or summary statistics depending on --output-level (supported values: all, minimal).
  • Allocation maps (alloc_<tasks>.csv) record rank-to-node placement. GPU runs append _GPU.
  • SLURM logs reside alongside the CSVs (slurm_<jobid>.out/.err) unless in debug mode.
  • Metadata is appended to results/<location>_metadata.csv, enabling cross-run filtering by timestamp, collective, library version, GPU involvement, and notes.
  • Example plotting commands:
python -m plot summary --summary-file results/leonardo/<timestamp>/summary.csv
python -m plot heatmap --system leonardo --nnodes 8 --collective allreduce
python -m plot boxplot --system lumi --nnodes 8 --notes "production"
  • The tracer package (tracer/trace_communications.py) estimates traffic on global links for recorded allocations, while tracer/sinfo processes week-long job snapshots from monitored clusters.
  • selector/generate_ompi_tuning.py produces Open MPI tuning rules from benchmark results, consumable by OMPI_MCA_coll_tuned_dynamic_rules_filename. See selector/README_generate_ompi_tuning.md for usage.
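
For custom post-processing outside the plot package, the CSV naming scheme described above can be decoded with a few lines of Python. This is a sketch under stated assumptions, not PICO's own parser: it assumes the count is an integer, the datatype contains no underscore, and a segment size, when present, is a purely numeric token just before the datatype. The file names in the demonstration are invented.

```python
from pathlib import Path


def parse_result_name(filename):
    """Split a result CSV name into count, algorithm, segsize, and datatype.

    Raises ValueError for names that do not fit the scheme, such as the
    alloc_<tasks>.csv allocation maps.
    """
    stem = Path(filename).stem
    instrumented = stem.endswith("_instrument")
    if instrumented:
        stem = stem[: -len("_instrument")]
    parts = stem.split("_")
    if len(parts) < 3:
        raise ValueError(f"not a result file name: {filename}")
    count = int(parts[0])  # raises ValueError if the first token is not numeric
    datatype = parts[-1]
    middle = parts[1:-1]  # algorithm tokens, optionally followed by segsize
    segsize = None
    if len(middle) >= 2 and middle[-1].isdigit():
        segsize = int(middle[-1])
        middle = middle[:-1]
    return {
        "count": count,
        "algorithm": "_".join(middle),
        "segsize": segsize,
        "datatype": datatype,
        "instrumented": instrumented,
    }


# Invented file names that follow the documented pattern.
print(parse_result_name("1024_ring_double.csv"))
print(parse_result_name("1024_ring_segmented_4096_int32_instrument.csv"))
```

An algorithm whose name itself ends in a numeric token would be misread as segmented under these assumptions, so adjust the heuristic to the algorithm names registered in your setup.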

🧪 Instrumentation and Custom Collectives

  • Building with -DPICO_INSTRUMENT exposes the PICO_TAG_BEGIN/END macros defined in include/libpico.h.
    • These can be inserted into LibPICO collective implementations to record per-phase timings, which are emitted into _instrument.csv files. Detailed usage and examples are provided in libpico/instrument.md.
    • Instrumentation is supported for CPU collectives; the macros are transparent when GPU paths are enabled.
  • To add new algorithms, follow the step-by-step guide in libpico/adding_algorithms.md. The TUI and CLI automatically surface new options once registered.

🧱 Extending PICO

  • Environments: See config/environment/README.md for the full schema reference and step-by-step guide to adding new cluster profiles. Real-world examples are available under config/environment/.
  • Libraries: Update <env>_libraries.json to expose additional MPI/NCCL builds, compiler flags, GPU capabilities, and metadata strings. The TUI reads these files at runtime.

🗂️ Repository Layout

pico/
├── include/                # Public LibPICO API and instrumentation macros
├── libpico/                # Custom collective implementations
├── pico_core/              # Benchmark driver and MPI/NCCL glue code
├── config/                 # Environment, library, and algorithm JSON descriptors
├── scripts/                # Submission, orchestration, metadata, and shell helpers
├── tui/                    # Textual UI for configuration authoring
├── plot/                   # Plotting package and CLI
├── tracer/                 # Network tracing and allocation analysis tools
├── schedgen/               # Communication schedule generator (SPCL fork)
├── selector/               # Dynamic rule selection helpers for Open MPI
├── tests/                  # Sample exported configurations
└── results/                # Generated data, metadata CSVs, and helper scripts

🪪 Credits and License

PICO is developed by Daniele De Sensi, Saverio Pasqualoni, and Lorenzo Piarulli at the Department of Computer Science, Sapienza University of Rome. The project is licensed under the MIT License.

Schedgen code was originally released by SPCL @ ETH Zurich under the BSD 4-Clause license. The version bundled with PICO includes targeted modifications to support its extended scheduling and tracing workflow.

📬 Contact
