Running the PyTorch version of generate_samples_Pangu.py in the provided docker image fails with an NCCL error

docker and nvidia-docker are both configured and nvidia-smi output looks normal, but the container prints the following error on startup:

Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.
WARNING: Detected NVIDIA NVIDIA A100-SXM4-80GB GPU, which is not yet supported in this version of the container
WARNING: Detected NVIDIA NVIDIA A100-SXM4-80GB GPU, which is not yet supported in this version of the container
WARNING: Detected NVIDIA NVIDIA A100-SXM4-80GB GPU, which is not yet supported in this version of the container
WARNING: Detected NVIDIA NVIDIA A100-SXM4-80GB GPU, which is not yet supported in this version of the container
ERROR: No supported GPU(s) detected to run this container

NOTE: MOFED driver for multi-node communication was not detected.
Multi-node communication performance may be reduced.
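
One way to check whether the warning above reflects a real mismatch is to compare the GPU's compute capability with the CUDA version the container's PyTorch was built against. The sketch below is only an illustration: the table of minimum CUDA versions covers a few common architectures, and the helper names (`is_supported`, `report_local_gpu`) are hypothetical and not part of the PanGu-Alpha repository.

```python
# Minimum CUDA toolkit version able to target each compute capability.
# Only a few architectures are listed; extend the table as needed.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta, e.g. V100
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere, e.g. A100
}


def parse_version(text):
    """Turn a version string such as '10.2' into the tuple (10, 2)."""
    return tuple(int(part) for part in text.split(".")[:2])


def is_supported(capability, cuda_version):
    """Return True/False, or None if the capability is not in the table."""
    required = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if required is None:
        return None
    return parse_version(cuda_version) >= required


def report_local_gpu():
    """Print the check for GPU 0; requires PyTorch and a visible GPU."""
    import torch  # imported lazily so the helpers above work without it

    if not torch.cuda.is_available():
        print("No CUDA device visible to PyTorch")
        return
    capability = torch.cuda.get_device_capability(0)
    print(capability, torch.version.cuda,
          is_supported(capability, torch.version.cuda))
```

Calling `report_local_gpu()` inside the container prints the capability, the CUDA version of the PyTorch build, and the result of the comparison. A `False` result would suggest trying an image built against a newer CUDA release.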

When running inference with generate_samples_Pangu.py, the following error is raised while loading the model parameters:

Traceback (most recent call last):
File "tools/generate_samples_Pangu.py", line 185, in <module>
main()
File "tools/generate_samples_Pangu.py", line 138, in main
_ = load_checkpoint(model, None, None)
File "/workspace/PanGu-Alpha-GPU/panguAlpha_pytorch/megatron/checkpointing.py", line 242, in load_checkpoint
torch.distributed.barrier()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1485, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:514, internal error, NCCL version 2.6.3
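
The "internal error" message on its own says little about the cause. NCCL prints more detail when the `NCCL_DEBUG` environment variable is set to `INFO`. The sketch below builds such an environment for a child process; the helper name `nccl_debug_env` is hypothetical, and the commented launch command is only an example.

```python
import os


def nccl_debug_env(base=None, subsystems="INIT"):
    """Return a copy of the environment with NCCL debug logging enabled.

    base: mapping to start from (defaults to the current environment).
    subsystems: value for NCCL_DEBUG_SUBSYS, e.g. "INIT" or "ALL".
    """
    env = dict(os.environ if base is None else base)
    env["NCCL_DEBUG"] = "INFO"
    env["NCCL_DEBUG_SUBSYS"] = subsystems
    return env


# Example use (not executed here):
# import subprocess
# subprocess.run(
#     ["python", "tools/generate_samples_Pangu.py", "--model-parallel-size", "1"],
#     env=nccl_debug_env(),
# )
```

The extra log lines printed before the failing `barrier()` call are usually the most useful part to include in a report like this one.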


Same problem here. Using the provided docker image yands/pangu-alpha-megatron-lm-nvidia-pytorch:20.03.2, the following command fails:
cd PanGu-Alpha-GPU/panguAlpha_pytorch
python tools/generate_samples_Pangu.py \
    --model-parallel-size 1 \
    --num-layers 31 \
    --hidden-size 2560 \
    --load /dataset/Pangu-alpha_2.6B_mgt/ \
    --num-attention-heads 32 \
    --max-position-embeddings 1024 \
    --tokenizer-type GPT2BPETokenizer \
    --fp16 \
    --batch-size 1 \
    --seq-length 1024 \
    --out-seq-length 50 \
    --temperature 1.0 \
    --vocab-file megatron/tokenizer/bpe_4w_pcl/vocab \
    --num-samples 0 \
    --top_k 2 \
    --finetune

The error stack is as follows:
loading checkpoint ...
global rank 0 is loading checkpoint /dataset/Pangu-alpha_2.6B_mgt/iter_0001000/mp_rank_00/model_optim_rng.pt
could not find arguments in the checkpoint ...
Traceback (most recent call last):
File "tools/generate_samples_Pangu.py", line 185, in <module>
main()
File "tools/generate_samples_Pangu.py", line 138, in main
_ = load_checkpoint(model, None, None)
File "/nfs/users/test/Pangu-alpha/PanGu-Alpha-GPU/panguAlpha_pytorch/megatron/checkpointing.py", line 242, in load_checkpoint
torch.distributed.barrier()
File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1485, in barrier
work = _default_pg.barrier()
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:514, internal error, NCCL version 2.6.3

The driver and CUDA versions reported by nvidia-smi are shown in the attached screenshots.

The model files in Pangu-alpha_2.6B_mgt come from Pangu-alpha_2.6B_fp16_mgt.zip,
downloaded from https://git.openi.org.cn/attachments/72aec03d-6bdb-4652-ac2a-8099db4b0bed . The download completed without errors and the md5 checksum matches.
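
For anyone reproducing this, the md5 comparison mentioned above can be done along the following lines. This is a minimal sketch using only the standard library; the function names are hypothetical, and the expected checksum has to be taken from the download page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's md5 with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

Reading in chunks keeps memory use flat, which matters for a multi-gigabyte checkpoint archive.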

Thanks in advance for any reply.


#15 Running the PyTorch version of generate_samples_Pangu.py in the provided docker image fails with an NCCL error