#621 add ddp

Merged
zoulq merged 1 commit from wtcheng/MSAdapter:master into master 9 months ago
Erpim reviewed 10 months ago
msadapter/pytorch/nn/parallel/distributed.py
@@ -0,0 +42,4 @@
logits, grads = self.grad_fn(*inputs, **kwargs)
grads = self.grad_reducer(grads)
for i, param in enumerate(self.network.trainable_params()):
    param.set_grad(grads[i])
Erpim commented 10 months ago
This approach needs further discussion: it makes the usage pattern different depending on whether or not the user wraps the model in DDP.
wtcheng commented 9 months ago
The backward pass stays as it is; this PR does not change it.
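For context, a hedged sketch of the user-facing difference being discussed (illustrative only: the function names, the value_and_grad call, and the training-loop shape are assumptions, not the exact MSAdapter code):

```python
import mindspore as ms

def train_step_plain(net, loss_fn, optimizer, data, label):
    # usual MindSpore-style step: gradients are produced explicitly and
    # handed to the optimizer
    def forward(data, label):
        return loss_fn(net(data), label)
    grad_fn = ms.value_and_grad(forward, None, net.trainable_params())
    loss, grads = grad_fn(data, label)
    optimizer.step(grads)
    return loss

def train_step_ddp(ddp_model, optimizer, data, label):
    # with the wrapper in this PR: forward() runs the grad function,
    # all-reduces the gradients and stores them on each Parameter via
    # set_grad(), so step() takes no gradient argument and reads them
    # back through get_grad()
    out = ddp_model(data, label)
    optimizer.step()
    return out
```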
Erpim commented 10 months ago
FAILED testing/ut/pytorch/optim/test_optim.py::test_adam - ValueError: For 'MultitypeFuncGraph', cannot find fn match given args. Got (sigs, fn): [((mindspore.tensor, mindspore.tensor, mindspore.RowTensor), <function _tensor_apply_decay_with_sparse at 0x7fa103e4e710>), ((mindspore.tensor, mindspore.tensor, mindspore.tensor), <function _tensor_apply_decay at 0x7fa103e4e7a0>)], and (dtype, args): (mindspore.tensor[float32], mindspore.tensor[float32], mindspore.type_none).
FAILED testing/ut/pytorch/optim/test_optim.py::test_adam_state_dict - ValueError: For 'MultitypeFuncGraph', cannot find fn match given args. Got (sigs, fn): [((mindspore.tensor, mindspore.tensor, mindspore.RowTensor), <function _tensor_apply_decay_with_sparse at 0x7fa103e4e710>), ((mindspore.tensor, mindspore.tensor, mindspore.tensor), <function _tensor_apply_decay at 0x7fa103e4e7a0>)], and (dtype, args): (mindspore.tensor[float32], mindspore.tensor[float32], mindspore.type_none).
===== 2 failed, 1797 passed, 38 skipped, 220 warnings in 110.88s (0:01:50) =====
frelam reviewed 9 months ago
msadapter/pytorch/optim/adam.py
@@ -26,1 +26,4 @@
return super()._ms_load_state_dict(state_dict, 'exp_avg', 'exp_avg_sq', 'max_exp_avg_sq', 'state_step')

def step(self, grads=None, closure=None):
    grads = [_.get_grad() for _ in self.parameters]
frelam commented 9 months ago
Here the gradients are fetched from parameters instead of being passed in as an argument. Will that affect performance? And if they are fetched from parameters, what is the usability benefit?
wtcheng marked this conversation as resolved
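A schematic of the trade-off raised in this thread (only step(), get_grad() and self.parameters appear in the diff; the base class and the _apply_updates helper below are hypothetical placeholders):

```python
class Adam(AdamBase):  # AdamBase and _apply_updates are hypothetical placeholders
    def step(self, grads=None, closure=None):
        if grads is None:
            # torch-compatible call site (optimizer.step() with no arguments):
            # fetch the gradients previously stashed on the parameters. This is
            # one extra Python-level get_grad() lookup per parameter per step,
            # which is the performance concern raised above.
            grads = [p.get_grad() for p in self.parameters]
        # passing grads explicitly keeps the usual MindSpore calling convention
        # and avoids the per-parameter lookups
        return self._apply_updates(grads, closure)
```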
frelam reviewed 9 months ago
msadapter/pytorch/nn/parameter.py
@@ -158,2 +159,4 @@
return self.dtype in [mstype.float32, mstype.float16, mstype.float64]

def set_grad(self, grad):
    self._grad = grad
frelam commented 9 months ago
Declaring a _grad attribute to store the gradient should, as I understand it, bring the usage closer to torch. But since MindSpore itself does not support this, it may add the overhead of set_grad and get_grad calls. How large is the usability gain from this part?
wtcheng commented 9 months ago
The backward pass stays as it is; this PR does not change it.
wtcheng marked this conversation as resolved
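A minimal sketch of the storage being discussed (only set_grad and _grad appear in the hunk; get_grad and the None default are assumptions inferred from the optimizer change earlier in this PR):

```python
import mindspore as ms

class Parameter(ms.Parameter):
    def set_grad(self, grad):
        # called by the DDP wrapper after the gradients have been all-reduced
        self._grad = grad

    def get_grad(self):
        # read back by optimizer.step(); None until a backward pass has run
        return getattr(self, "_grad", None)
```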
Erpim commented 9 months ago
=========================== short test summary info ============================
FAILED testing/ut/pytorch/distributed/test_ddp.py::test_ddp_basic - Exception: /bin/sh: mpirun: command not found
FAILED testing/ut/pytorch/nn/test_dist.py::test_dist_basic - Exception: /bin/sh: mpirun: command not found
===== 2 failed, 1874 passed, 38 skipped, 225 warnings in 126.62s (0:02:06) =====
Erpim commented 9 months ago
=========================== short test summary info ============================
ERROR testing/ut/pytorch/distributed/test_ddp.py
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
======================== 4 warnings, 1 error in 13.90s =========================
Erpim commented 9 months ago
************* Module msadapter.pytorch.cuda.__init__
msadapter/pytorch/cuda/__init__.py:1:0: R0401: Cyclic import (msadapter.pytorch.nn -> msadapter.pytorch.nn.parallel -> msadapter.pytorch.nn.parallel.distributed) (cyclic-import)
Erpim commented 9 months ago
************* Module msadapter.pytorch.distributed.distributed_c10d
msadapter/pytorch/distributed/distributed_c10d.py:29:27: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:35:19: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:40:27: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:42:27: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:81:12: W0613: Unused argument 'tensors' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:82:12: W0613: Unused argument 'opts' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:117:12: W0613: Unused argument 'output' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:118:12: W0613: Unused argument 'input' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:119:12: W0613: Unused argument 'opts' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:124:12: W0613: Unused argument 'output_lists' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:125:12: W0613: Unused argument 'input_list' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:126:12: W0613: Unused argument 'opts' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:178:12: W0613: Unused argument 'outputTensor' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:179:12: W0613: Unused argument 'inputTensor' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:218:12: W0613: Unused argument 'tensors' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:219:12: W0613: Unused argument 'dstRank' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:220:12: W0613: Unused argument 'tag' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:225:12: W0613: Unused argument 'tensors' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:226:12: W0613: Unused argument 'srcRank' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:227:12: W0613: Unused argument 'tag' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:230:71: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:230:29: W0613: Unused argument 'tensors' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:230:52: W0613: Unused argument 'tag' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:232:42: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:232:22: W0613: Unused argument 'opts' (unused-argument)
-----------------------------------
Your code has been rated at 9.98/10
Erpim commented 9 months ago
Please also update the SupportedList.
wtcheng commented 9 months ago
> Please also update the SupportedList.

Added.
Erpim reviewed 9 months ago
SupportedList.md
@@ -1229,0 +1239,4 @@
| is_initialized | Supported | |
| is_mpi_available | Supported | |
| is_nccl_available | Supported | |
| is_hccl_available | Torch does not have this function | |
Erpim commented 9 months ago
torch does not have this interface, so why do we expose it?
panshaowu commented 9 months ago
Because MSAdapter has to support both GPU and Ascend. torch only supports GPU, so it only provides is_nccl_available; the corresponding Ascend interface is HCCL, so a new API, is_hccl_available, was added. I suggest rewording "Torch does not have this function" to explain the reason above.
panshaowu commented 9 months ago
Agreement reached.
panshaowu marked this conversation as resolved
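For illustration, how this helper is intended to be used alongside the torch-named ones (the module path and the package-level re-exports are assumed to mirror torch.distributed):

```python
import msadapter.pytorch.distributed as dist

def pick_backend():
    # same helper name as torch on GPU; the Ascend check is MSAdapter-specific
    if dist.is_nccl_available():
        return "nccl"
    if dist.is_hccl_available():
        return "hccl"
    raise RuntimeError("no supported collective communication backend found")
```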
Erpim reviewed 9 months ago
ConstraintList.md
@@ -313,0 +317,4 @@
| MSAdapter APIs | Constraint conditions |
| --------------- |-----------------------------------------------------------------------------------------|
| init_process_group | `timeout`, `rank`, `store`, `group_name`, `pg_options` are not supported; `init_method` is partly supported: initialization can be configured only in environment variable mode |
| new_group | `timeout`, `pg_options` are not supported; `backend` is partly supported (nccl and hccl) |
Erpim commented 9 months ago
pytorch is not aware of hccl, is it?
Erpim commented 9 months ago
The strategy could be described at the top of the whole torch.distributed section, for example that on Ascend the default nccl backend is backed by hccl underneath, or it could go into the general constraints.
panshaowu marked this conversation as resolved
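A hedged initialization sketch under the constraints listed above: init_method is only honoured in environment-variable mode, and the MASTER_ADDR/MASTER_PORT names and values below are placeholders borrowed from the torch convention, not confirmed MSAdapter behaviour.

```python
import os
import msadapter.pytorch.distributed as dist

# rendezvous information is provided via the environment, since only the
# environment-variable flavour of init_method is supported
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")   # placeholder address
os.environ.setdefault("MASTER_PORT", "29500")       # placeholder port

# on Ascend the nccl backend is expected to be mapped to HCCL under the hood,
# as proposed in the review comment above
dist.init_process_group(backend="nccl", init_method="env://")
```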
Erpim reviewed 9 months ago
ConstraintList_en.md
@@ -314,0 +318,4 @@
| MSAdapter APIs | Constraint conditions |
| --------------- |-----------------------------------------------------------------------------------------|
| init_process_group | `timeout`, `rank`, `store`, `group_name`, `pg_options` are not supported, `init_method` is partly supported: initialization can be configured only in environment variable mode. |
| new_group | `timeout`, `pg_options` are not supported, `backend` is partly supported (nccl and hccl) |
Erpim commented 9 months ago
Same as above.
panshaowu marked this conversation as resolved
Erpim commented 9 months ago
/lgtm
zoulq reviewed 9 months ago
SupportedList.md
@@ -1253,1 +1255,4 @@

### <span id="jump10">torch.distributed</span>
<span id="jump10">Unified constraints for distributed:</span>
- On the Ascend backend, due to device differences, NCCL-related interfaces are replaced with the corresponding HCCL interfaces by default.
zoulq commented 9 months ago
The usage examples for these interfaces are not yet complete. Please add a sentence here: "torch.distributed-related interfaces are experimental APIs and may be changed or removed later. For migrating distributed training functionality, please refer to the examples in the user guide." Then link it to https://openi.pcl.ac.cn/OpenI/MSAdapter/src/branch/master/USER_GUIDE.md#user-content-3-3-%E4%BD%BF%E7%94%A8%E5%88%86%E5%B8%83%E5%BC%8F%E8%AE%AD%E7%BB%83%E5%8A%A0%E9%80%9F%E8%AE%AD%E7%BB%83
panshaowu marked this conversation as resolved
zoulq merged commit 720bb9400b into master 9 months ago
The pull request has been merged as 720bb9400b.