This approach needs further discussion; it would make the usage pattern differ depending on whether or not the user uses DDP.
The backward pass stays as-is; this PR does not change it.
FAILED testing/ut/pytorch/optim/test_optim.py::test_adam - ValueError: For 'MultitypeFuncGraph', cannot find fn match given args. Got (sigs, fn): [((mindspore.tensor, mindspore.tensor, mindspore.RowTensor), <function _tensor_apply_decay_with_sparse at 0x7fa103e4e710>), ((mindspore.tensor, mindspore.tensor, mindspore.tensor), <function _tensor_apply_decay at 0x7fa103e4e7a0>)], and (dtype, args): (mindspore.tensor[float32], mindspore.tensor[float32], mindspore.type_none).
FAILED testing/ut/pytorch/optim/test_optim.py::test_adam_state_dict - ValueError: For 'MultitypeFuncGraph', cannot find fn match given args. Got (sigs, fn): [((mindspore.tensor, mindspore.tensor, mindspore.RowTensor), <function _tensor_apply_decay_with_sparse at 0x7fa103e4e710>), ((mindspore.tensor, mindspore.tensor, mindspore.tensor), <function _tensor_apply_decay at 0x7fa103e4e7a0>)], and (dtype, args): (mindspore.tensor[float32], mindspore.tensor[float32], mindspore.type_none).
===== 2 failed, 1797 passed, 38 skipped, 220 warnings in 110.88s (0:01:50) =====
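For context: MindSpore's MultitypeFuncGraph dispatches on the runtime types of its arguments, and the (dtype, args) tuple in the error shows a mindspore.type_none where the third tensor was expected, i.e. a None gradient reached the decay function. A reduced, hypothetical reproduction of that dispatch failure (not the actual MSAdapter source):

```python
import mindspore as ms
from mindspore import Tensor
from mindspore.ops import MultitypeFuncGraph

apply_decay = MultitypeFuncGraph("apply_decay")

@apply_decay.register("Tensor", "Tensor", "Tensor")
def _apply_decay(weight_decay, weight, gradient):
    # Dense branch: fold the decay term into the gradient.
    return gradient + weight_decay * weight

w = Tensor([1.0], ms.float32)
g = Tensor([0.1], ms.float32)
wd = Tensor([0.01], ms.float32)

apply_decay(wd, w, g)     # matches the (Tensor, Tensor, Tensor) signature
apply_decay(wd, w, None)  # None maps to mindspore.type_none: no registered
                          # signature matches, raising the ValueError above
```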
Here the value is taken from parameters instead of being passed in as an argument. Could that hurt performance?
If it is taken from parameters, what does that gain in usability?
Declaring a _grad attribute to store gradients would, as I understand it, bring the usage closer to torch (sketched below).
But since MindSpore does not support this natively, it may add set_grad and get_grad overhead.
How large would the usability gain actually be here?
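For illustration, a minimal sketch of the _grad proposal from this thread, assuming a hypothetical wrapper class ParamWithGrad with set_grad/get_grad accessors (names taken from the comments above; this is not the actual MSAdapter implementation):

```python
import mindspore as ms
from mindspore import Parameter, Tensor

class ParamWithGrad:
    """Hypothetical wrapper pairing a MindSpore Parameter with a stored
    gradient to mimic torch's param.grad; not the MSAdapter implementation."""

    def __init__(self, param):
        self.param = param
        self._grad = None  # filled in after each backward pass

    def set_grad(self, grad):
        # Extra write per step: the set_grad overhead discussed above.
        self._grad = grad

    def get_grad(self):
        # Extra read per access: the get_grad overhead discussed above.
        return self._grad

    @property
    def grad(self):
        # torch-style read access: w.grad
        return self.get_grad()

w = ParamWithGrad(Parameter(Tensor([1.0], ms.float32), name="w"))
w.set_grad(Tensor([0.5], ms.float32))
print(w.grad)  # torch-like usage
```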
The backward pass stays as-is; this PR does not change it.
=========================== short test summary info ============================
FAILED testing/ut/pytorch/distributed/test_ddp.py::test_ddp_basic - Exception: /bin/sh: mpirun: command not found
FAILED testing/ut/pytorch/nn/test_dist.py::test_dist_basic - Exception: /bin/sh: mpirun: command not found
===== 2 failed, 1874 passed, 38 skipped, 225 warnings in 126.62s (0:02:06) =====
=========================== short test summary info ============================
ERROR testing/ut/pytorch/distributed/test_ddp.py
!!!!!!!!!!!!!!!!!!!! Interrupted: 1 error during collection !!!!!!!!!!!!!!!!!!!!
======================== 4 warnings, 1 error in 13.90s =========================
************* Module msadapter.pytorch.cuda.__init__
msadapter/pytorch/cuda/__init__.py:1:0: R0401: Cyclic import (msadapter.pytorch.nn -> msadapter.pytorch.nn.parallel -> msadapter.pytorch.nn.parallel.distributed) (cyclic-import)
************* Module msadapter.pytorch.distributed.distributed_c10d
msadapter/pytorch/distributed/distributed_c10d.py:29:27: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:35:19: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:40:27: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:42:27: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:81:12: W0613: Unused argument 'tensors' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:82:12: W0613: Unused argument 'opts' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:117:12: W0613: Unused argument 'output' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:118:12: W0613: Unused argument 'input' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:119:12: W0613: Unused argument 'opts' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:124:12: W0613: Unused argument 'output_lists' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:125:12: W0613: Unused argument 'input_list' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:126:12: W0613: Unused argument 'opts' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:178:12: W0613: Unused argument 'outputTensor' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:179:12: W0613: Unused argument 'inputTensor' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:218:12: W0613: Unused argument 'tensors' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:219:12: W0613: Unused argument 'dstRank' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:220:12: W0613: Unused argument 'tag' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:225:12: W0613: Unused argument 'tensors' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:226:12: W0613: Unused argument 'srcRank' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:227:12: W0613: Unused argument 'tag' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:230:71: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:230:29: W0613: Unused argument 'tensors' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:230:52: W0613: Unused argument 'tag' (unused-argument)
msadapter/pytorch/distributed/distributed_c10d.py:232:42: C0321: More than one statement on a single line (multiple-statements)
msadapter/pytorch/distributed/distributed_c10d.py:232:22: W0613: Unused argument 'opts' (unused-argument)
Your code has been rated at 9.98/10
Please update the supportedlist in sync with this change.
Added.
torch does not have this interface; why are we exposing it?
Because MSAdapter has to support both GPU and Ascend. torch only supports GPU, so it only provides is_nccl_available; the Ascend counterpart is HCCL, so a new API is_hccl_available was added. I suggest rewording "Torch does not have this function" to state the reason above.
Agreement reached.
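For illustration, a minimal sketch of the availability split discussed above, keyed off the device target (a hypothetical check, not the actual MSAdapter implementation):

```python
from mindspore import context

def is_nccl_available():
    # torch exposes this one; on MindSpore, a GPU target implies NCCL.
    return context.get_context("device_target") == "GPU"

def is_hccl_available():
    # New API motivated above: HCCL is the Ascend counterpart of NCCL.
    return context.get_context("device_target") == "Ascend"
```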
Doesn't pytorch have any awareness of hccl at all?
We could describe the policy at the very top of torch.distributed, e.g., that on Ascend the default "nccl" backend runs on HCCL underneath, or put it in the general limitations section (see the sketch after this thread).
Same as above.
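A minimal sketch of that mapping policy, assuming a hypothetical _resolve_backend helper inside torch.distributed (illustrative only, not the actual MSAdapter code):

```python
from mindspore import context

def _resolve_backend(backend):
    # Keep torch-style backend="nccl" in user scripts; on Ascend it
    # transparently maps to HCCL, as proposed in the review above.
    if backend == "nccl" and context.get_context("device_target") == "Ascend":
        return "hccl"
    return backend

# init_process_group(backend="nccl") would then select HCCL on Ascend
# and NCCL on GPU without any change to the user's script.
```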
/lgtm
The usage examples for this interface are not yet complete. Add a sentence here: "The torch.distributed interfaces are experimental APIs that may be modified or removed later. For migrating distributed training functionality, please refer to the example descriptions in the user guide."
Then link to https://openi.pcl.ac.cn/OpenI/MSAdapter/src/branch/master/USER_GUIDE.md#user-content-3-3-%E4%BD%BF%E7%94%A8%E5%88%86%E5%B8%83%E5%BC%8F%E8%AE%AD%E7%BB%83%E5%8A%A0%E9%80%9F%E8%AE%AD%E7%BB%83
720bb9400b