#15 NCCL error when running the PyTorch version of generate_samples_Pangu.py in the provided docker image

Open
created 1 year ago by Martrix · 1 comment
Martrix commented 1 year ago
docker and nvidia-docker are both configured, and nvidia-smi output looks normal, but the container reports the following error at startup:

![image](/attachments/0b8ad07b-127a-4b93-bef7-511cb802d30e)

    Various files include modifications (c) NVIDIA CORPORATION. All rights reserved.
    NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

    WARNING: Detected NVIDIA NVIDIA A100-SXM4-80GB GPU, which is not yet supported in this version of the container
    WARNING: Detected NVIDIA NVIDIA A100-SXM4-80GB GPU, which is not yet supported in this version of the container
    WARNING: Detected NVIDIA NVIDIA A100-SXM4-80GB GPU, which is not yet supported in this version of the container
    WARNING: Detected NVIDIA NVIDIA A100-SXM4-80GB GPU, which is not yet supported in this version of the container
    ERROR: No supported GPU(s) detected to run this container

    NOTE: MOFED driver for multi-node communication was not detected.
          Multi-node communication performance may be reduced.

When running inference with generate_samples_Pangu.py, the following error is raised while loading the model parameters:

![image](/attachments/6d44ef49-3c50-4720-a28d-92ecb85a2477)

    Traceback (most recent call last):
      File "tools/generate_samples_Pangu.py", line 185, in <module>
        main()
      File "tools/generate_samples_Pangu.py", line 138, in main
        _ = load_checkpoint(model, None, None)
      File "/workspace/PanGu-Alpha-GPU/panguAlpha_pytorch/megatron/checkpointing.py", line 242, in load_checkpoint
        torch.distributed.barrier()
      File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1485, in barrier
        work = _default_pg.barrier()
    RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:514, internal error, NCCL version 2.6.3
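To check a saved copy of the startup output without reading it by eye, something like the following could be used. This is only a minimal sketch: the function name `summarize_startup_log` is hypothetical, and the two message patterns are copied from the log above, so it assumes other container versions word the messages the same way.

```python
import re

# Per-GPU warning printed at container startup (wording taken from the log above).
UNSUPPORTED_GPU = re.compile(
    r"WARNING: Detected NVIDIA (?P<gpu>.+?) GPU, which is not yet supported"
)
# Fatal line printed when none of the detected GPUs is supported.
NO_SUPPORTED_GPU = "ERROR: No supported GPU(s) detected"


def summarize_startup_log(log_text):
    """Return (names of GPUs flagged as unsupported, whether the fatal line appeared)."""
    gpus = [match.group("gpu") for match in UNSUPPORTED_GPU.finditer(log_text)]
    return gpus, NO_SUPPORTED_GPU in log_text
```

If every GPU is flagged and the fatal line is present, as in this report, the container itself is stating that it does not support the card, which is worth ruling out before looking at the checkpoint or the script.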
Martrix changed title from "NCCL error when running the PyTorch version of generate in the provided docker image" to "NCCL error when running the PyTorch version of generate_samples_Pangu.py in the provided docker image" 1 year ago
**Same problem here. Using the provided docker image yands/pangu-alpha-megatron-lm-nvidia-pytorch:20.03.2, the following command fails:**

    cd PanGu-Alpha-GPU/panguAlpha_pytorch
    python tools/generate_samples_Pangu.py \
        --model-parallel-size 1 \
        --num-layers 31 \
        --hidden-size 2560 \
        --load /dataset/Pangu-alpha_2.6B_mgt/ \
        --num-attention-heads 32 \
        --max-position-embeddings 1024 \
        --tokenizer-type GPT2BPETokenizer \
        --fp16 \
        --batch-size 1 \
        --seq-length 1024 \
        --out-seq-length 50 \
        --temperature 1.0 \
        --vocab-file megatron/tokenizer/bpe_4w_pcl/vocab \
        --num-samples 0 \
        --top_k 2 \
        --finetune

**The error stack is as follows:**

    loading checkpoint ...
    global rank 0 is loading checkpoint /dataset/Pangu-alpha_2.6B_mgt/iter_0001000/mp_rank_00/model_optim_rng.pt
    could not find arguments in the checkpoint ...
    Traceback (most recent call last):
      File "tools/generate_samples_Pangu.py", line 185, in <module>
        main()
      File "tools/generate_samples_Pangu.py", line 138, in main
        _ = load_checkpoint(model, None, None)
      File "/nfs/users/test/Pangu-alpha/PanGu-Alpha-GPU/panguAlpha_pytorch/megatron/checkpointing.py", line 242, in load_checkpoint
        torch.distributed.barrier()
      File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1485, in barrier
        work = _default_pg.barrier()
    RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:514, internal error, NCCL version 2.6.3

**nvidia-smi shows the following driver and CUDA versions:**

![image](/attachments/282533d1-4107-458f-97aa-3efefe001976)

**The model files in Pangu-alpha_2.6B_mgt come from Pangu-alpha_2.6B_fp16_mgt.zip**, downloaded from https://git.openi.org.cn/attachments/72aec03d-6bdb-4652-ac2a-8099db4b0bed. The download completed without errors and the md5 checksum matches.

Thanks in advance for any reply.
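The traceback only says "internal error", so one possible next step is to rerun with NCCL's own logging turned on. The sketch below is an assumption about how one might do that from Python: `nccl_debug_env` is a hypothetical helper name, while `NCCL_DEBUG=INFO` is the documented NCCL environment variable for verbose output.

```python
import os


def nccl_debug_env(base_env=None):
    """Return a copy of the environment with NCCL verbose logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # NCCL_DEBUG=INFO makes NCCL print initialization and failure details.
    env["NCCL_DEBUG"] = "INFO"
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run` when launching `tools/generate_samples_Pangu.py` with the same arguments as in the failing command; exporting `NCCL_DEBUG=INFO` in the shell before running the command has the same effect.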