#1179 DDP error with the PyTorch 1.11 NPU image

Open
created 5 months ago by WuxinWang · 2 comments
### Problem description

Multi-card parallel (DDP) training on Ascend 910 fails when using the PyTorch 1.11 NPU image.

### Environment (GPU/NPU)

NPU

### Cluster (Qizhi/Zhisuan)

Zhisuan cluster

### Task type (debug/train/inference)

Training

### Task name

wuxin202311291026437

### Log description or problem screenshot

```
Traceback (most recent call last):
  File "/cache/code/nuwa/src/utils/utils.py", line 42, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "/cache/code/nuwa/src/tasks/train_task.py", line 63, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
    self.__setup_profiler()
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/lightning_npu-0.0.0-py3.7.egg/lightning_npu/strategies/npu_parallel.py", line 105, in broadcast
    self.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/lightning_npu-0.0.0-py3.7.egg/lightning_npu/strategies/npu_parallel.py", line 75, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/torch_npu/distributed/distributed_c10d.py", line 1060, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: HCCL error in: /usr1/workspace/FPTA_Daily_open_pytorchv1.11.0-3.0.tr6/CODE/torch_npu/csrc/distributed/ProcessGroupHCCL.cpp:402
EI9999: Inner Error, Please contact support engineer!
EI9999 host nic listen start failed, ip[0x6a0010ac], port[60002], return[11][FUNC:StartListenHostSocket][FILE:network_manager.cc][LINE:452]
TraceBack (most recent call last):
THPModule_npu_shutdown success.
/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:207: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  def resize(img, size, interpolation=Image.BILINEAR):
/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/torchvision/transforms/functional_pil.py:280: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  def perspective(img, perspective_coeffs, interpolation=Image.BICUBIC, fill=None):
/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/torch/utils/tensorboard/__init__.py:4: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if not hasattr(tensorboard, '__version__') or LooseVersion(tensorboard.__version__) < LooseVersion('1.15'):
Error executing job with overrides: ['trainer=npu_parallel.yaml', 'model=vit.yaml', 'paths=forecast_openi.yaml', 'datamodule=h5forecast.yaml']
Traceback (most recent call last):
  File "/cache/code/nuwa/src/train.py", line 27, in main
    metric_dict, _ = train(cfg)
  File "/cache/code/nuwa/src/utils/utils.py", line 45, in wrap
    raise ex
  File "/cache/code/nuwa/src/utils/utils.py", line 42, in wrap
    metric_dict, object_dict = task_func(cfg=cfg)
  File "/cache/code/nuwa/src/tasks/train_task.py", line 63, in train
    trainer.fit(model=model, datamodule=datamodule, ckpt_path=cfg.get("ckpt_path"))
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1172, in _run
    self.__setup_profiler()
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1797, in __setup_profiler
    self.profiler.setup(stage=self.state.fn._setup_fn, local_rank=local_rank, log_dir=self.log_dir)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 2249, in log_dir
    dirpath = self.strategy.broadcast(dirpath)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/lightning_npu-0.0.0-py3.7.egg/lightning_npu/strategies/npu_parallel.py", line 105, in broadcast
    self.broadcast_object_list(obj, src, group=_group.WORLD)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/lightning_npu-0.0.0-py3.7.egg/lightning_npu/strategies/npu_parallel.py", line 75, in broadcast_object_list
    broadcast(object_sizes_tensor, src=src, group=group)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.11/lib/python3.7/site-packages/torch_npu/distributed/distributed_c10d.py", line 1060, in broadcast
    work = default_pg.broadcast([tensor], opts)
RuntimeError: store->get() got error: HCCL_BLOCKING_WAIT
```

### Expected solution or suggestion

Please help locate the cause of this error and provide a complete multi-card parallel (DDP) example for the PyTorch NPU image, or an image that fully supports DDP training with PyTorch on NPU.
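As a reference for the kind of example requested above, here is a minimal sketch of raw DDP on Ascend NPU, assuming `torch_npu` is installed and the launcher sets the usual `RANK`/`WORLD_SIZE`/`LOCAL_RANK`/`MASTER_ADDR`/`MASTER_PORT` variables. The script is illustrative only and has not been verified against this particular image:

```python
# minimal_npu_ddp.py -- hypothetical minimal DDP sketch for Ascend NPU.
# Assumes torch_npu is installed and one process is launched per device,
# with RANK/WORLD_SIZE/LOCAL_RANK set by the launcher.
import os

import torch
import torch.distributed as dist
import torch_npu  # noqa: F401  -- registers the "npu" device and HCCL backend
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    local_rank = int(os.environ.get("LOCAL_RANK", rank))

    # Bind this process to its NPU before creating the process group.
    torch.npu.set_device(local_rank)

    # HCCL is the collective backend for Ascend NPUs (analogous to NCCL).
    dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)

    model = torch.nn.Linear(10, 10).npu()
    ddp_model = DDP(model, device_ids=[local_rank])

    # One dummy forward/backward to exercise gradient all-reduce.
    x = torch.randn(4, 10).npu()
    ddp_model(x).sum().backward()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Note also that the first `EI9999` message fails inside `StartListenHostSocket` on port 60002 with `return[11]`, i.e. the host-side HCCL listen socket could not be opened, which may point at a port conflict or NIC binding problem in the container rather than at the training script itself.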
liuzx commented 5 months ago
Collaborator
This PyTorch 1.11 image does not currently support parallel training.
WuxinWang commented 5 months ago
Poster
> This PyTorch 1.11 image does not currently support parallel training.

Please update the image to support parallel training; according to the official Ascend documentation, it is supported.
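Independent of image support, the `EI9999` failure in `StartListenHostSocket` (port 60002, `return[11]`) can be probed directly. A hypothetical stdlib-only diagnostic, with the port number taken from the log above and a few neighbors in case HCCL probes a small range, would check whether those host ports are even bindable inside the job container:

```python
# check_hccl_ports.py -- hypothetical diagnostic, not part of the platform:
# verifies that the host-side ports HCCL tries to listen on are bindable.
import socket


def port_is_free(port: int, host: str = "0.0.0.0") -> bool:
    """Return True if a TCP listener can be bound on (host, port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind((host, port))
            s.listen(1)
            return True
        except OSError as e:
            print(f"port {port}: {e}")
            return False


if __name__ == "__main__":
    # 60002 is the port from the EI9999 message.
    for port in (60000, 60001, 60002, 60003):
        print(port, "free" if port_is_free(port) else "blocked")
```

If a port reports as blocked, the failure is an environment/network issue in the container rather than a problem in the DDP code path.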