#621 add ddp

Merged
zoulq merged 1 commit from wtcheng/MSAdapter:master into master 9 months ago
Erpim reviewed 10 months ago
msadapter/pytorch/nn/parallel/distributed.py
@@ -0,0 +42,4 @@
logits, grads = self.grad_fn(*inputs, **kwargs)
grads = self.grad_reducer(grads)
for i, param in enumerate(self.network.trainable_params()):
    param.set_grad(grads[i])
Erpim commented 10 months ago
This approach needs further discussion: it makes the usage pattern different depending on whether or not the user wraps the model in DDP.
wtcheng commented 9 months ago
The backward pass stays as it is; this PR does not change it.
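For context, a hedged sketch of the user-facing difference being discussed (illustrative only: the function names, the value_and_grad call, and the training-loop shape are assumptions, not the exact MSAdapter code):

```python
import mindspore as ms

def train_step_plain(net, loss_fn, optimizer, data, label):
    # usual MindSpore-style step: gradients are produced explicitly and
    # handed to the optimizer
    def forward(data, label):
        return loss_fn(net(data), label)
    grad_fn = ms.value_and_grad(forward, None, net.trainable_params())
    loss, grads = grad_fn(data, label)
    optimizer.step(grads)
    return loss

def train_step_ddp(ddp_model, optimizer, data, label):
    # with the wrapper in this PR: forward() runs the grad function,
    # all-reduces the gradients and stores them on each Parameter via
    # set_grad(), so step() takes no gradient argument and reads them
    # back through get_grad()
    out = ddp_model(data, label)
    optimizer.step()
    return out
```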
Erpim commented 10 months ago
FAILED testing/ut/pytorch/optim/test_optim.py::test_adam - ValueError: For 'MultitypeFuncGraph', cannot find fn match given args. Got (sigs, fn): [((mindspore.tensor, mindspore.tensor, mindspore.RowTensor), <function _tensor_apply_decay_with_sparse at 0x7fa103e4e710>), ((mindspore.tensor, mindspore.tensor, mindspore.tensor), <function _tensor_apply_decay at 0x7fa103e4e7a0>)], and (dtype, args): (mindspore.tensor[float32], mindspore.tensor[float32], mindspore.type_none).
FAILED testing/ut/pytorch/optim/test_optim.py::test_adam_state_dict - ValueError: For 'MultitypeFuncGraph', cannot find fn match given args. Got (sigs, fn): [((mindspore.tensor, mindspore.tensor, mindspore.RowTensor), <function _tensor_apply_decay_with_sparse at 0x7fa103e4e710>), ((mindspore.tensor, mindspore.tensor, mindspore.tensor), <function _tensor_apply_decay at 0x7fa103e4e7a0>)], and (dtype, args): (mindspore.tensor[float32], mindspore.tensor[float32], mindspore.type_none).
===== 2 failed, 1797 passed, 38 skipped, 220 warnings in 110.88s (0:01:50) =====
frelam reviewed 9 months ago
msadapter/pytorch/optim/adam.py
@@ -26,1 +26,4 @@
return super()._ms_load_state_dict(state_dict, 'exp_avg', 'exp_avg_sq', 'max_exp_avg_sq', 'state_step')

def step(self, grads=None, closure=None):
    grads = [_.get_grad() for _ in self.parameters]
frelam commented 9 months ago
Here the gradients are fetched from parameters instead of being passed in as an argument. Will that affect performance? And if they are fetched from parameters, what is the usability benefit?
wtcheng marked this conversation as resolved
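A schematic of the trade-off raised in this thread (only step(), get_grad() and self.parameters appear in the diff; the base class and the _apply_updates helper below are hypothetical placeholders):

```python
class Adam(AdamBase):  # AdamBase and _apply_updates are hypothetical placeholders
    def step(self, grads=None, closure=None):
        if grads is None:
            # torch-compatible call site (optimizer.step() with no arguments):
            # fetch the gradients previously stashed on the parameters. This is
            # one extra Python-level get_grad() lookup per parameter per step,
            # which is the performance concern raised above.
            grads = [p.get_grad() for p in self.parameters]
        # passing grads explicitly keeps the usual MindSpore calling convention
        # and avoids the per-parameter lookups
        return self._apply_updates(grads, closure)
```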
frelam reviewed 9 months ago
msadapter/pytorch/nn/parameter.py
@@ -158,2 +159,4 @@
return self.dtype in [mstype.float32, mstype.float16, mstype.float64]

def set_grad(self, grad):
    self._grad = grad
frelam commented 9 months ago
Declaring a _grad attribute to store the gradient should, as I understand it, bring the usage closer to torch. But since MindSpore itself does not support this, it may add the overhead of set_grad and get_grad calls. How large is the usability gain from this part?
wtcheng commented 9 months ago
The backward pass stays as it is; this PR does not change it.
wtcheng marked this conversation as resolved
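A minimal sketch of the storage being discussed (only set_grad and _grad appear in the hunk; get_grad and the None default are assumptions inferred from the optimizer change earlier in this PR):

```python
import mindspore as ms

class Parameter(ms.Parameter):
    def set_grad(self, grad):
        # called by the DDP wrapper after the gradients have been all-reduced
        self._grad = grad

    def get_grad(self):
        # read back by optimizer.step(); None until a backward pass has run
        return getattr(self, "_grad", None)
```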
Erpim commented 9 months ago
=========================== short test summary info ============================
FAILED testing/ut/pytorch/distributed/test_ddp.py::test_ddp_basic - Exception: /bin/sh: mpirun: command not found
FAILED testing/ut/pytorch/nn/test_dist.py::test_dist_basic - Exception: /bin/sh: mpirun: command not found
===== 2 failed, 1874 passed, 38 skipped, 225 warnings in 126.62s (0:02:06) =====
Erpim commented 9 months ago
=========================== short test summary info ============================
ERROR testing/ut/pytorch/distributed/test_ddp.py
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
======================== 4 warnings, 1 error in 13.90s =========================
Erpim commented 9 months ago
************* Module msadapter.pytorch.cuda.__init__
msadapter/pytorch/cuda/__init__.py:1:0: R0401: Cyclic import (msadapter.pytorch.nn -> msadapter.pytorch.nn.parallel -> msadapter.pytorch.nn.parallel.distributed) (cyclic-import)
Erpim commented 9 months ago
************* Module msadapter.pytorch.distributed.distributed_c10d
msadapter/pytorch/distributed/distributed_c10d.py:29:27: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:35:19: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:40:27: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:42:27: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:81:12: W0613: Unused argument 'tensors' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:82:12: W0613: Unused argument 'opts' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:117:12: W0613: Unused argument 'output' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:118:12: W0613: Unused argument 'input' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:119:12: W0613: Unused argument 'opts' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:124:12: W0613: Unused argument 'output_lists' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:125:12: W0613: Unused argument 'input_list' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:126:12: W0613: Unused argument 'opts' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:178:12: W0613: Unused argument 'outputTensor' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:179:12: W0613: Unused argument 'inputTensor' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:218:12: W0613: Unused argument 'tensors' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:219:12: W0613: Unused argument 'dstRank' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:220:12: W0613: Unused argument 'tag' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:225:12: W0613: Unused argument 'tensors' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:226:12: W0613: Unused argument 'srcRank' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:227:12: W0613: Unused argument 'tag' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:230:71: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:230:29: W0613: Unused argument 'tensors' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:230:52: W0613: Unused argument 'tag' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:232:42: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:232:22: W0613: Unused argument 'opts' (unused-argument)
-----------------------------------
Your code has been rated at 9.98/10
Erpim commented 9 months ago
Please also update the SupportedList.
wtcheng commented 9 months ago
> Please also update the SupportedList.

Added.
Erpim reviewed 9 months ago
SupportedList.md
@@ -1229,0 +1239,4 @@
| is_initialized | Supported | |
| is_mpi_available | Supported | |
| is_nccl_available | Supported | |
| is_hccl_available | Torch does not have this function | |
Erpim commented 9 months ago
torch does not have this interface, so why do we expose it?
panshaowu commented 9 months ago
Because MSAdapter has to support both GPU and Ascend. torch only supports GPU, so it only provides is_nccl_available; the corresponding Ascend interface is HCCL, so a new API, is_hccl_available, was added. I suggest rewording "Torch does not have this function" to explain the reason above.
panshaowu commented 9 months ago
Agreement reached.
panshaowu marked this conversation as resolved
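For illustration, how this helper is intended to be used alongside the torch-named ones (the module path and the package-level re-exports are assumed to mirror torch.distributed):

```python
import msadapter.pytorch.distributed as dist

def pick_backend():
    # same helper name as torch on GPU; the Ascend check is MSAdapter-specific
    if dist.is_nccl_available():
        return "nccl"
    if dist.is_hccl_available():
        return "hccl"
    raise RuntimeError("no supported collective communication backend found")
```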
Erpim reviewed 9 months ago
ConstraintList.md
@@ -313,0 +317,4 @@
| MSAdapter APIs | Constraint conditions |
| --------------- |-----------------------------------------------------------------------------------------|
| init_process_group | `timeout`, `rank`, `store`, `group_name`, `pg_options` are not supported; `init_method` is partly supported: initialization can be configured only in environment variable mode |
| new_group | `timeout`, `pg_options` are not supported; `backend` is partly supported (nccl and hccl) |
Erpim commented 9 months ago
pytorch is not aware of hccl, is it?
Erpim commented 9 months ago
The strategy could be described at the top of the whole torch.distributed section, for example that on Ascend the default nccl backend is backed by hccl underneath, or it could go into the general constraints.
panshaowu marked this conversation as resolved
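A hedged initialization sketch under the constraints listed above: init_method is only honoured in environment-variable mode, and the MASTER_ADDR/MASTER_PORT names and values below are placeholders borrowed from the torch convention, not confirmed MSAdapter behaviour.

```python
import os
import msadapter.pytorch.distributed as dist

# rendezvous information is provided via the environment, since only the
# environment-variable flavour of init_method is supported
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")   # placeholder address
os.environ.setdefault("MASTER_PORT", "29500")       # placeholder port

# on Ascend the nccl backend is expected to be mapped to HCCL under the hood,
# as proposed in the review comment above
dist.init_process_group(backend="nccl", init_method="env://")
```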
Erpim reviewed 9 months ago
ConstraintList_en.md
@@ -314,0 +318,4 @@
| MSAdapter APIs | Constraint conditions |
| --------------- |-----------------------------------------------------------------------------------------|
| init_process_group | `timeout`, `rank`, `store`, `group_name`, `pg_options` are not supported, `init_method` is partly supported: initialization can be configured only in environment variable mode. |
| new_group | `timeout`, `pg_options` are not supported, `backend` is partly supported (nccl and hccl) |
Erpim commented 9 months ago
Same as above.
panshaowu marked this conversation as resolved
Erpim commented 9 months ago
/lgtm
zoulq reviewed 9 months ago
SupportedList.md
@@ -1253,1 +1255,4 @@

### <span id="jump10">torch.distributed</span>
<span id="jump10">Unified constraints for distributed:</span>
- On the Ascend backend, due to device differences, NCCL-related interfaces are replaced with the corresponding HCCL interfaces by default.
zoulq commented 9 months ago
The usage examples for these interfaces are not yet complete. Please add a sentence here: "torch.distributed-related interfaces are experimental APIs and may be changed or removed later. For migrating distributed training functionality, please refer to the examples in the user guide." Then link it to https://openi.pcl.ac.cn/OpenI/MSAdapter/src/branch/master/USER_GUIDE.md#user-content-3-3-%E4%BD%BF%E7%94%A8%E5%88%86%E5%B8%83%E5%BC%8F%E8%AE%AD%E7%BB%83%E5%8A%A0%E9%80%9F%E8%AE%AD%E7%BB%83
panshaowu marked this conversation as resolved
zoulq merged commit 720bb9400b into master 9 months ago
The pull request has been merged as 720bb9400b.