feat(training): refactor training module to new adapter/executor architecture by zzhfz · Pull Request #28 · InfiniTensor/InfiniMetrics

zzhfz · 2026-02-28T03:17:56Z

Description

Refactor the training module so that it conforms to the new InfiniMetrics adapter architecture.

Changes

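Add TrainingAdapter, which inherits from BaseAdapter to provide unified test execution
MegatronImpl adds Megatron-LM training support
InfinitrainImpl is a placeholder implementation (to be developed later)
Update testcase_utils to support training-type testcases
Register the training adapter in the dispatcher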
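
A minimal sketch of how the pieces in this list could fit together. The `BaseAdapter` interface, the `ADAPTERS` registry, and `get_adapter` are hypothetical stand-ins for illustration; the real class and dispatcher signatures in InfiniMetrics may differ.

```python
from abc import ABC, abstractmethod


class BaseAdapter(ABC):
    """Hypothetical stand-in for the project's BaseAdapter."""

    @abstractmethod
    def process(self, test_input: dict) -> dict:
        """Run one testcase and return a result dict."""


class TrainingAdapter(BaseAdapter):
    """Illustrative adapter: reads the framework from the config."""

    def process(self, test_input: dict) -> dict:
        framework = test_input["config"]["framework"]
        return {
            "run_id": test_input["run_id"],
            "framework": framework,
            "result_code": 0,
        }


# Assumed registry keyed by the first component of the testcase name.
ADAPTERS = {"train": TrainingAdapter}


def get_adapter(testcase: str) -> BaseAdapter:
    """Pick an adapter from the testcase prefix, e.g. 'train.megatron.GPT'."""
    prefix = testcase.split(".", 1)[0]
    if prefix not in ADAPTERS:
        raise ValueError(f"No adapter registered for testcase {testcase!r}")
    return ADAPTERS[prefix]()


result = get_adapter("train.megatron.GPT").process(
    {"run_id": "train.megatron.gpt.20240208_101", "config": {"framework": "megatron"}}
)
```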

Test Results

Input JSON

{ 
  "run_id": "train.megatron.gpt.20240208_101",
  "testcase": "train.megatron.GPT",
  "config": {
    "output_dir": "./output",
    "framework": "megatron",
    "model": "gpt",
    "megatron_path": "/home/sunjinge/Megatron-LM",
    "device": {
      "gpu_platform": "nvidia",
      "device_ids": [0, 1]
    },
    "train_dataset": "mock",
    "warmup_iterations": 2,
    "train_args": {
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": 1,
        "sp": 0
      },
      "mbs": 2,
      "gbs": 8,
      "seq_len": 128,
      "lr": 0.00015,
      "train_iters": 10,
      "num_layers": 2,
      "hidden_size": 512,
      "num_attention_heads": 8,
      "vocab_size": 128256,
      "max_position_embeddings": 128,
      "precision": "bf16",
      "optimizer": "adam",
      "weight_decay": 0.01,
      "clip_grad": 1.0,
      "beta1": 0.9,
      "beta2": 0.95,
      "lr_scheduler": "cosine",
      "min_lr": 0.0,
      "eval_interval": 100,
      "eval_iters": 10,
      "save_interval": 1000,
      "extra_args": []
    }
  }
}
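
The batch settings above imply a gradient-accumulation count. As far as I understand Megatron-LM, the global batch size must be divisible by the micro batch size times the data-parallel size, and the data-parallel size is derived from the world size rather than set directly. The sketch below works through that arithmetic. The helper names are made up for illustration, and it assumes one process is launched per entry in `device_ids`, which this PR does not confirm.

```python
def derived_dp(world_size: int, tp: int, pp: int) -> int:
    """Data-parallel size implied by world size and model-parallel sizes."""
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // model_parallel


def num_microbatches(gbs: int, mbs: int, dp: int) -> int:
    """Micro-batches accumulated per optimizer step on each data-parallel rank."""
    per_step = mbs * dp
    if gbs % per_step != 0:
        raise ValueError("gbs must be divisible by mbs * dp")
    return gbs // per_step


# Values from the input JSON: gbs=8, mbs=2, tp=1, pp=1, two device ids.
dp_from_config = 1
dp_from_devices = derived_dp(world_size=2, tp=1, pp=1)

print(num_microbatches(8, 2, dp_from_config))   # 4
print(num_microbatches(8, 2, dp_from_devices))  # 2
```

If both GPUs really take part in training, the configured `dp: 1` and the derived value of 2 disagree, which may be worth checking.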

Run output

(megatron) sunjinge@server:~/InfiniMetrics$ python main.py test_training.json 
2026-02-28 03:13:09,512 - infinimetrics.utils.input_loader - INFO - Loaded 1 input(s) from test_training.json
2026-02-28 03:13:09,512 - infinimetrics.dispatcher - INFO - Processing 1 valid inputs (skipped 0 invalid)
2026-02-28 03:13:09,517 - infinimetrics.dispatcher - INFO - Validation complete: 1 valid, 0 skipped
2026-02-28 03:13:09,517 - infinimetrics.dispatcher - INFO - [1/1] Executing train.megatron.GPT
2026-02-28 03:13:09,517 - infinimetrics.executor - INFO - Executor: Running train.megatron.GPT
.......... ............. .......
2026-02-28 02:47:19,399 - infinimetrics.training.frameworks.megatron_impl - INFO -     evaluate .......................................: (318.84, 318.84)
2026-02-28 02:47:19,399 - infinimetrics.training.frameworks.megatron_impl - INFO - WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode validate_results
2026-02-28 02:47:19,399 - infinimetrics.training.frameworks.megatron_impl - INFO - WARNING:megatron.core.rerun_state_machine:Setting RerunStateMachine mode validate_results
2026-02-28 02:47:19,399 - infinimetrics.training.frameworks.megatron_impl - INFO - ----------------------------------------------------------------------------------------------------------
2026-02-28 02:47:19,399 - infinimetrics.training.frameworks.megatron_impl - INFO -  validation loss at iteration 10 on test set | lm loss value: 1.132849E+01 | lm loss PPL: 8.315699E+04 | 
2026-02-28 02:47:19,399 - infinimetrics.training.frameworks.megatron_impl - INFO - ----------------------------------------------------------------------------------------------------------
2026-02-28 02:47:19,911 - infinimetrics.training.frameworks.megatron_impl - INFO - [rank0]:[W228 02:47:19.614357547 ProcessGroupNCCL.cpp:1250] Warning: WARNING: process group has NOT been destroyed before we destruct ProcessGroupNCCL. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL operations have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4 (function operator())
2026-02-28 02:47:23,065 - infinimetrics.training.frameworks.megatron_impl - INFO - Results saved to output/training
2026-02-28 02:47:23,065 - infinimetrics.training.training_adapter - INFO - Training completed successfully: train.megatron.GPT
2026-02-28 02:47:23,656 - infinimetrics.utils.accelerator_monitor - INFO - nvidia monitoring stopped
2026-02-28 02:47:23,802 - infinimetrics.training.training_adapter - INFO - TrainingAdapter teardown complete
2026-02-28 02:47:23,804 - infinimetrics.executor - INFO - Executor: train.megatron.GPT completed with code=0
2026-02-28 02:47:23,805 - infinimetrics.dispatcher - INFO - Summary saved to summary_output/dispatcher_summary_20260228_024723.json

============================================================
Test Summary
============================================================
Total tests:   1
Successful:    1
Failed:        0
Success rate:  100.0%
============================================================

Output JSON

(megatron) sunjinge@server:~/InfiniMetrics$ cat output/train/train.megatron.gpt.20240208_101_results.json
{
  "run_id": "train.megatron.gpt.20240208_101",
  "time": "2026-02-28 03:13:23",
  "testcase": "train.megatron.GPT",
  "success": 0,
  "environment": {
    "cluster_scale": 1,
    "topology": "2x1 ring mesh",
    "cluster": [
      {
        "machine": {
          "cpu_model": "Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz",
          "memory_gb": 2015,
          "accelerators": [
            {
              "model": "NVIDIA A100-SXM4-80GB",
              "count": 2,
              "memory_gb_per_card": 80,
              "driver": "580.105.08",
              "cuda": "13.0",
              "type": "nvidia"
            }
          ]
        },
        "framework": [
          {
            "name": "unknown",
            "version": "unknown"
          }
        ]
      }
    ]
  },
  "result_code": 0,
  "config": {
    "output_dir": "./output",
    "framework": "megatron",
    "model": "gpt",
    "megatron_path": "/home/sunjinge/Megatron-LM",
    "device": {
      "gpu_platform": "nvidia",
      "device_ids": [
        0,
        1
      ]
    },
    "train_dataset": "mock",
    "warmup_iterations": 2,
    "train_args": {
      "parallel": {
        "dp": 1,
        "tp": 1,
        "pp": 1,
        "sp": 0
      },
      "mbs": 2,
      "gbs": 8,
      "seq_len": 128,
      "lr": 0.00015,
      "train_iters": 10,
      "num_layers": 2,
      "hidden_size": 512,
      "num_attention_heads": 8,
      "vocab_size": 128256,
      "max_position_embeddings": 128,
      "precision": "bf16",
      "optimizer": "adam",
      "weight_decay": 0.01,
      "clip_grad": 1.0,
      "beta1": 0.9,
      "beta2": 0.95,
      "lr_scheduler": "cosine",
      "min_lr": 0.0,
      "eval_interval": 100,
      "eval_iters": 10,
      "save_interval": 1000,
      "extra_args": []
    }
  },
  "metrics": [
    {
      "name": "train.throughput",
      "type": "timeseries",
      "raw_data_url": "output/training/train.megatron.gpt.20240208_101_train_throughput.csv",
      "unit": "tokens/s/gpu"
    },
    {
      "name": "train.loss",
      "type": "timeseries",
      "raw_data_url": "output/training/train.megatron.gpt.20240208_101_train_loss.csv",
      "unit": ""
    },
    {
      "name": "train.ppl",
      "type": "timeseries",
      "raw_data_url": "output/training/train.megatron.gpt.20240208_101_train_ppl.csv",
      "unit": null
    },
    {
      "name": "train.peak_memory_usage",
      "type": "scalar",
      "value": 37.060547,
      "unit": "GB"
    }
  ],
  "resolved": {
    "nodes": 1,
    "gpus_per_node": 2,
    "device_used": 2,
    "accelerator_type": "nvidia"
  }
}

Example loss.csv

(megatron) sunjinge@server:~/InfiniMetrics$ cat output/training/train.megatron.gpt.20240208_101_train_loss.csv 
iteration,loss
1,11.88619
2,11.82143
3,11.74973
4,11.6724
5,11.5742
6,11.50262
7,11.45898
8,11.48349
9,11.63348
10,11.31823
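
The `train.ppl` series is derived from these losses as `exp(loss)`, which is also how the validation line in the log relates its loss (1.132849E+01) to its PPL (8.315699E+04). A small sketch of that conversion, using the CSV above as inline text; `ppl_rows` is an illustrative helper, not a function from the PR.

```python
import csv
import io
import math

LOSS_CSV = """iteration,loss
1,11.88619
2,11.82143
3,11.74973
4,11.6724
5,11.5742
6,11.50262
7,11.45898
8,11.48349
9,11.63348
10,11.31823
"""


def ppl_rows(text: str) -> list:
    """Return (iteration, perplexity) pairs in file order."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        loss = float(row["loss"])
        try:
            ppl = math.exp(loss)
        except OverflowError:
            # math.exp raises OverflowError for very large losses.
            ppl = float("inf")
        rows.append((int(row["iteration"]), ppl))
    return rows


rows = ppl_rows(LOSS_CSV)
for iteration, ppl in rows:
    print(f"{iteration},{ppl:.2f}")
```

Catching `OverflowError` specifically, rather than a bare `Exception`, keeps unrelated bugs such as a non-numeric loss visible.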

baominghelly · 2026-02-28T03:29:32Z

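The design doc specifies train.megatron.SFT and train.megatron.LoRA, but you wrote train.megatron.GPT; please change this. Also, can both of those cases be supported?

I see a framework entry of "unknown" in the JSON. In that case, how about not returning the field at all?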
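
One way the second suggestion could look, as a sketch only. `build_cluster_entry` is a hypothetical helper; the actual code that assembles the `environment.cluster` entries is not shown in this PR excerpt.

```python
def build_cluster_entry(machine, frameworks):
    """Build one cluster entry, omitting 'framework' when nothing is known."""
    entry = {"machine": machine}
    # Keep only frameworks whose name was actually detected.
    known = [fw for fw in frameworks if fw.get("name", "unknown") != "unknown"]
    if known:
        entry["framework"] = known
    return entry


entry = build_cluster_entry(
    {"cpu_model": "Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz"},
    [{"name": "unknown", "version": "unknown"}],
)
print("framework" in entry)  # False
```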

baominghelly · 2026-02-28T06:00:46Z

infinimetrics/training/training_adapter.py

+            from infinimetrics.training.frameworks.infinitrain_impl import (
+                InfinitrainImpl,
+            )
+
+            self.runner = InfinitrainImpl(config, resolved_device_count, self.run_id)
+            logger.info(
+                f"Created InfiniTrain implementation (placeholder) with {resolved_device_count} devices"
+            )


If this is only a placeholder, I think you could simply raise NotImplementedError("InfiniTrain implementation is not ready yet") here and delete infinitrain_impl.
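
A sketch of what that suggestion might look like. `create_runner` and the stub `MegatronImpl` are illustrative assumptions; only the constructor argument order `(config, resolved_device_count, run_id)` is taken from the diff above.

```python
class MegatronImpl:
    """Hypothetical stand-in for the real MegatronImpl."""

    def __init__(self, config, resolved_device_count, run_id):
        self.config = config
        self.resolved_device_count = resolved_device_count
        self.run_id = run_id


def create_runner(config, resolved_device_count, run_id):
    """Select the framework implementation; fail loudly for unfinished ones."""
    framework = str(config.get("framework", "")).lower()
    if framework == "megatron":
        return MegatronImpl(config, resolved_device_count, run_id)
    if framework == "infinitrain":
        # No placeholder module needed: the error states the situation directly.
        raise NotImplementedError("InfiniTrain implementation is not ready yet")
    raise ValueError(f"Unsupported training framework: {framework!r}")


runner = create_runner({"framework": "megatron"}, 2, "train.megatron.gpt.20240208_101")
```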

baominghelly · 2026-02-28T06:06:27Z

infinimetrics/training/training_adapter.py

+        except Exception as e:
+            logger.error(f"Training failed: {e}", exc_info=True)
+            return self._create_error_response(
+                str(e), test_input=test_dict, result_code=1
+            )
+


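Errors raised in the adapter layer are handled uniformly in the executor layer, so don't swallow the exception here. See the handling logic in inference_adapter for reference.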
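
To illustrate the division of responsibility described here: the adapter lets failures propagate, and a single place in the executor turns them into an error result. Both classes below are simplified, hypothetical stand-ins; the real Executor and adapter interfaces may differ.

```python
import logging

logger = logging.getLogger("sketch.executor")


class TrainingAdapter:
    """Illustrative adapter with no blanket try/except."""

    def process(self, test_input):
        # Any exception raised by the run propagates to the caller.
        return self._run_training(test_input)

    def _run_training(self, test_input):
        raise RuntimeError("simulated training failure")


class Executor:
    """Illustrative executor: the one place that converts errors to results."""

    def __init__(self, adapter):
        self.adapter = adapter

    def execute(self, test_input):
        try:
            data = self.adapter.process(test_input)
            return {"result_code": 0, "data": data}
        except Exception as exc:
            logger.error("Test failed: %s", exc, exc_info=True)
            return {"result_code": 1, "error_msg": str(exc)}


outcome = Executor(TrainingAdapter()).execute({"run_id": "demo"})
```

Keeping the conversion in one layer means every adapter reports failures in the same shape and the traceback is logged once.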

baominghelly · 2026-02-28T06:42:12Z

infinimetrics/training/frameworks/megatron_impl.py

+            for it, val in sorted(metrics["losses_by_iter"].items()):
+                f.write(f"{it},{val}\n")
+
+        # Save PPL
+        with open(self.ppl_csv, "w") as f:
+            f.write("iteration,ppl\n")
+            for it, loss in sorted(metrics["losses_by_iter"].items()):
+                try:
+                    ppl = float(math.exp(loss))
+                except Exception:
+                    ppl = float("inf")
+                f.write(f"{it},{ppl}\n")
+
+        # Save throughput
+        with open(self.throughput_csv, "w") as f:
+            f.write("iteration,throughput\n")
+            for it, val in sorted(metrics["throughput_by_iter"].items()):


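We can't use sort here; we need to preserve chronological order rather than order by magnitude.

Please check whether the keys of the dict returned by metrics["throughput_by_iter"] are ints, and whether sorted orders them as step 0/1/2/3. If so, sorted is unnecessary.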
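
A short demonstration of the behavior in question, using three loss values from the CSV above. `sorted(d.items())` orders the pairs by key first, not by value, so the result depends on the key type: int keys come out in step order, while string keys sort lexicographically. Python dicts (3.7+) also preserve insertion order, so if entries are added as iterations are parsed, iterating without `sorted` already yields chronological order.

```python
losses_int_keys = {1: 11.88619, 2: 11.82143, 10: 11.31823}
losses_str_keys = {"1": 11.88619, "2": 11.82143, "10": 11.31823}

# Int keys: sorted() gives numeric step order.
print([k for k, _ in sorted(losses_int_keys.items())])  # [1, 2, 10]

# String keys: sorted() is lexicographic, so "10" lands before "2".
print([k for k, _ in sorted(losses_str_keys.items())])  # ['1', '10', '2']

# Plain iteration keeps insertion order regardless of key type.
print(list(losses_str_keys))  # ['1', '2', '10']
```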

baominghelly · 2026-02-28T06:42:39Z

infinimetrics/training/frameworks/infinitrain_impl.py

+logger = logging.getLogger(__name__)
+
+
+class InfinitrainImpl:


This can be deleted for now.

baominghelly

Please revise the code according to the review comments.

Chamberlain0w0 · 2026-03-02T02:24:41Z

infinimetrics/training/frameworks/megatron_impl.py

+            for it, val in sorted(metrics["losses_by_iter"].items()):
+                f.write(f"{it},{val}\n")
+
+        # Save PPL
+        with open(self.ppl_csv, "w") as f:
+            f.write("iteration,ppl\n")
+            for it, loss in sorted(metrics["losses_by_iter"].items()):
+                try:
+                    ppl = float(math.exp(loss))
+                except Exception:
+                    ppl = float("inf")
+                f.write(f"{it},{ppl}\n")
+
+        # Save throughput
+        with open(self.throughput_csv, "w") as f:
+            f.write("iteration,throughput\n")
+            for it, val in sorted(metrics["throughput_by_iter"].items()):


Please check whether the keys of the dict returned by metrics["throughput_by_iter"] are ints, and whether sorted orders them as step 0/1/2/3. If so, sorted is unnecessary.

Chamberlain0w0 · 2026-03-02T02:24:51Z

infinimetrics/training/frameworks/megatron_impl.py

+        # Save loss
+        with open(self.loss_csv, "w") as f:
+            f.write("iteration,loss\n")
+            for it, val in sorted(metrics["losses_by_iter"].items()):


Chamberlain0w0 · 2026-03-02T02:24:55Z

infinimetrics/training/frameworks/megatron_impl.py

+        # Save PPL
+        with open(self.ppl_csv, "w") as f:
+            f.write("iteration,ppl\n")
+            for it, loss in sorted(metrics["losses_by_iter"].items()):


Chamberlain0w0 and others added 2 commits February 28, 2026 02:59

Merge pull request #23 from InfiniTensor/feat/dashboard-streamlit

3a53af9

feat: add Streamlit dashboard for InfiniMetrics

Merge branch 'master' into feat/training-refactor

b198f7e

zzhfz requested review from Chamberlain0w0 and baominghelly February 28, 2026 03:18

baominghelly reviewed Feb 28, 2026

baominghelly requested changes Feb 28, 2026

Chamberlain0w0 requested changes Mar 2, 2026

refactor(training): improve training module based on review feedback

87243d9

zzhfz wants to merge 3 commits into master from feat/training-refactor

3 participants
