feat(training): refactor training module to new adapter/executor architecture#28
feat: add Streamlit dashboard for InfiniMetrics
Going by what the design document says: I see a `framework` field set to `unknown` in the JSON. In that case, how about simply not returning this field?
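A minimal sketch of the suggestion above, assuming the result is assembled as a plain dict. The function `build_result` and the `run_id` key are hypothetical names used only for illustration; they are not taken from the PR.

```python
from typing import Any, Dict, Optional


def build_result(run_id: str, framework: Optional[str]) -> Dict[str, Any]:
    """Build a result dict, leaving out `framework` when it is not known.

    Hypothetical helper: the real result schema is defined by the design doc.
    """
    result: Dict[str, Any] = {"run_id": run_id}
    # Assumption: None, "" and "unknown" all mean "framework not detected".
    if framework and framework.lower() != "unknown":
        result["framework"] = framework
    return result


print(build_result("run-1", "unknown"))
print(build_result("run-2", "megatron"))
```

Omitting the key means consumers must read it with `result.get("framework")` rather than `result["framework"]`.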
| from infinimetrics.training.frameworks.infinitrain_impl import ( | ||
| InfinitrainImpl, | ||
| ) | ||
|
|
||
| self.runner = InfinitrainImpl(config, resolved_device_count, self.run_id) | ||
| logger.info( | ||
| f"Created InfiniTrain implementation (placeholder) with {resolved_device_count} devices" | ||
| ) |
If this is only a placeholder, I think it would also be fine to simply `raise NotImplementedError("InfiniTrain implementation is not ready yet")` here and delete `infinitrain_impl`.
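A rough sketch of what this could look like, assuming the runner is chosen by a framework name. `create_runner` and the tuple returned for Megatron are hypothetical stand-ins for illustration; only the error message comes from the comment above.

```python
def create_runner(framework: str, config: dict, device_count: int, run_id: str):
    """Hypothetical factory mirroring the runner selection in the adapter."""
    if framework == "megatron":
        # Stand-in for MegatronImpl(config, device_count, run_id).
        return ("megatron", config, device_count, run_id)
    if framework == "infinitrain":
        # Fail loudly instead of constructing a placeholder implementation.
        raise NotImplementedError("InfiniTrain implementation is not ready yet")
    raise ValueError(f"Unsupported training framework: {framework!r}")


try:
    create_runner("infinitrain", {}, 8, "run-1")
except NotImplementedError as exc:
    print(f"Rejected: {exc}")
```

With this shape there is no placeholder module to maintain, and a misconfigured run fails at setup time rather than partway through.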
| except Exception as e: | ||
| logger.error(f"Training failed: {e}", exc_info=True) | ||
| return self._create_error_response( | ||
| str(e), test_input=test_dict, result_code=1 | ||
| ) | ||
|
|
Errors from the adapter layer are handled uniformly in the executor layer, so don't swallow the exception here. See the handling logic in `inference_adapter` for reference.
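The division of responsibility described above can be sketched as follows. This is an illustration under assumptions, not the project's actual code: `TrainingAdapterSketch`, `ExecutorSketch`, and the response keys are hypothetical names, and the real `inference_adapter` may differ.

```python
import logging

logger = logging.getLogger(__name__)


class TrainingAdapterSketch:
    """Hypothetical adapter: does the work and lets failures propagate."""

    def process(self, test_input: dict) -> dict:
        if "config" not in test_input:
            # No try/except here: the caller decides how to report this.
            raise ValueError("test_input is missing 'config'")
        return {"result_code": 0, "test_input": test_input}


class ExecutorSketch:
    """Hypothetical executor: the single place where errors become responses."""

    def __init__(self, adapter: TrainingAdapterSketch) -> None:
        self.adapter = adapter

    def execute(self, test_input: dict) -> dict:
        try:
            return self.adapter.process(test_input)
        except Exception as e:
            logger.error("Test failed: %s", e, exc_info=True)
            return {"result_code": 1, "error_msg": str(e), "test_input": test_input}


executor = ExecutorSketch(TrainingAdapterSketch())
print(executor.execute({"config": {}})["result_code"])
print(executor.execute({})["result_code"])
```

Keeping the `try`/`except` in one layer means every adapter reports failures in the same format, and the traceback is logged once.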
| for it, val in sorted(metrics["losses_by_iter"].items()): | ||
| f.write(f"{it},{val}\n") | ||
|
|
||
| # Save PPL | ||
| with open(self.ppl_csv, "w") as f: | ||
| f.write("iteration,ppl\n") | ||
| for it, loss in sorted(metrics["losses_by_iter"].items()): | ||
| try: | ||
| ppl = float(math.exp(loss)) | ||
| except Exception: | ||
| ppl = float("inf") | ||
| f.write(f"{it},{ppl}\n") | ||
|
|
||
| # Save throughput | ||
| with open(self.throughput_csv, "w") as f: | ||
| f.write("iteration,throughput\n") | ||
| for it, val in sorted(metrics["throughput_by_iter"].items()): |
We can't use sort here. We need it to keep chronological order rather than being ordered by magnitude.
Please confirm whether the keys of the dict returned by `metrics["throughput_by_iter"]` are of type int, and whether `sorted` orders them as step 0/1/2/3. If so, `sorted` is unnecessary.
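To make the question above concrete, here is a small sketch with made-up throughput values. It relies on two facts about Python: `sorted(d.items())` orders by key, not by value, and dicts preserve insertion order (Python 3.7+). Whether the real keys are `int` or `str` is exactly what needs confirming, so both cases are shown.

```python
# Made-up sample data; the real metrics come from the training log parser.
int_keys = {0: 101.5, 1: 99.8, 2: 100.2, 10: 98.7}
str_keys = {"0": 101.5, "1": 99.8, "2": 100.2, "10": 98.7}

# Dicts keep insertion order, so plain iteration follows the order in which
# iterations were recorded.
print(list(int_keys))  # [0, 1, 2, 10]

# With int keys, sorted() gives numeric step order (same as insertion here).
print([k for k, _ in sorted(int_keys.items())])  # [0, 1, 2, 10]

# With str keys, sorted() is lexicographic, which breaks step order.
print([k for k, _ in sorted(str_keys.items())])  # ['0', '1', '10', '2']

# If keys are strings and explicit ordering is wanted, sort numerically.
print([k for k, _ in sorted(str_keys.items(), key=lambda kv: int(kv[0]))])
```

If the parser inserts entries in step order, iterating `metrics["throughput_by_iter"].items()` directly gives the same order without the sort.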
| logger = logging.getLogger(__name__) | ||
|
|
||
|
|
||
| class InfinitrainImpl: |
| # Save loss | ||
| with open(self.loss_csv, "w") as f: | ||
| f.write("iteration,loss\n") | ||
| for it, val in sorted(metrics["losses_by_iter"].items()): |
| # Save PPL | ||
| with open(self.ppl_csv, "w") as f: | ||
| f.write("iteration,ppl\n") | ||
| for it, loss in sorted(metrics["losses_by_iter"].items()): |
Description
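Refactor the training module to conform to InfiniMetrics' new adapter architecture design.
Changes
- `TrainingAdapter`: inherits from `BaseAdapter` to provide unified test execution
- `MegatronImpl`: implements Megatron-LM training support
- `InfinitrainImpl`: placeholder implementation (to be developed later)
Test Results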