NewBarry03
  • Joined on Apr 19, 2023

NewBarry03 created CPU/GPU type debugging task newbee_2dataset

1 year ago

NewBarry03 created CPU/GPU type debugging task newba202304201264862

1 year ago

NewBarry03 created CPU/GPU type debugging task newbee_2dataset(deleted)

1 year ago

NewBarry03 uploaded dataset PanguAlpha_2.6B_fp16.zip

1 year ago

NewBarry03 created CPU/GPU type debugging task newba202304201264862(deleted)

1 year ago

NewBarry03 commented on issue PCL-Platform.Inte.../PanGu-Alpha-GPU#16

How much GPU memory (in GB) does training roughly require?

> I tried the Pangu-alpha_2.6B_mgt checkpoint (the 2.6B model); it needs roughly 6 GB of memory.
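
For a rough sense of where a figure like that comes from, here is a back-of-the-envelope sketch. It is an estimate under stated assumptions, not a measurement of PanGu-Alpha: it assumes 2.6e9 parameters stored in fp16 (2 bytes each) for loading the checkpoint, and, for training, the commonly cited 16 bytes per parameter for mixed-precision Adam (fp16 weights and gradients, fp32 master weights, two fp32 Adam moments). Activations, CUDA context, and framework overhead are ignored, and the function names are hypothetical.

```python
GIB = 1024 ** 3  # bytes per GiB


def weights_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone (fp16 = 2 bytes per parameter)."""
    return num_params * bytes_per_param / GIB


def mixed_precision_adam_gib(num_params: float) -> float:
    """Rough training-state estimate, ASSUMING mixed-precision Adam:
    2 (fp16 weights) + 2 (fp16 grads) + 4 (fp32 master weights)
    + 4 + 4 (fp32 Adam first and second moments) = 16 bytes per parameter.
    Activations and framework overhead are not included.
    """
    return num_params * 16 / GIB


if __name__ == "__main__":
    n = 2.6e9  # assumed parameter count for the 2.6B checkpoint
    print(f"fp16 weights only:      {weights_gib(n):.1f} GiB")
    print(f"mixed-precision + Adam: {mixed_precision_adam_gib(n):.1f} GiB")
```

Under these assumptions the weights alone come to a bit under 5 GiB, which is in line with the roughly 6 GB reported for loading the checkpoint, while full training state would be several times larger before activations are counted.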

1 year ago

NewBarry03 created CPU/GPU type debugging task newba202304200123947

1 year ago

NewBarry03 commented on issue PCL-Platform.Inte.../PanGu-Alpha-GPU#15

Running the PyTorch version of generate_samples_Pangu.py with the provided Docker image reports an NCCL error

**Same problem here. Using the provided Docker image yands/pangu-alpha-megatron-lm-nvidia-pytorch:20.03.2, the following command fails:**

    cd PanGu-Alpha-GPU/panguAlpha_pytorch
    python tools/generate_samples_Pangu.py \
        --model-parallel-size 1 \
        --num-layers 31 \
        --hidden-size 2560 \
        --load /dataset/Pangu-alpha_2.6B_mgt/ \
        --num-attention-heads 32 \
        --max-position-embeddings 1024 \
        --tokenizer-type GPT2BPETokenizer \
        --fp16 \
        --batch-size 1 \
        --seq-length 1024 \
        --out-seq-length 50 \
        --temperature 1.0 \
        --vocab-file megatron/tokenizer/bpe_4w_pcl/vocab \
        --num-samples 0 \
        --top_k 2 \
        --finetune

**The error stack trace is as follows:**

    loading checkpoint ...
    global rank 0 is loading checkpoint /dataset/Pangu-alpha_2.6B_mgt/iter_0001000/mp_rank_00/model_optim_rng.pt
    could not find arguments in the checkpoint ...
    Traceback (most recent call last):
      File "tools/generate_samples_Pangu.py", line 185, in <module>
        main()
      File "tools/generate_samples_Pangu.py", line 138, in main
        _ = load_checkpoint(model, None, None)
      File "/nfs/users/test/Pangu-alpha/PanGu-Alpha-GPU/panguAlpha_pytorch/megatron/checkpointing.py", line 242, in load_checkpoint
        torch.distributed.barrier()
      File "/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1485, in barrier
        work = _default_pg.barrier()
    RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:514, internal error, NCCL version 2.6.3

**nvidia-smi shows the following driver and CUDA versions:**

![image](/attachments/282533d1-4107-458f-97aa-3efefe001976)

**The model file used, Pangu-alpha_2.6B_mgt, is Pangu-alpha_2.6B_fp16_mgt.zip**, from https://git.openi.org.cn/attachments/72aec03d-6bdb-4652-ac2a-8099db4b0bed. The download is intact and the md5 checksum matches.

Thanks in advance for any reply.
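
One way to get more detail out of a bare "internal error" like this is to rerun the same command with NCCL's own logging turned on. The sketch below is only a diagnostic aid, not a fix: it reuses the exact flags from the command in this comment and sets the `NCCL_DEBUG=INFO` environment variable, which NCCL reads to print its initialization log. The helper names (`build_debug_env`, `run_with_nccl_debug`) are hypothetical, and the working directory is assumed to match the `cd` in the comment.

```python
import os
import subprocess

# The command from the comment, split into an argument list.
COMMAND = [
    "python", "tools/generate_samples_Pangu.py",
    "--model-parallel-size", "1",
    "--num-layers", "31",
    "--hidden-size", "2560",
    "--load", "/dataset/Pangu-alpha_2.6B_mgt/",
    "--num-attention-heads", "32",
    "--max-position-embeddings", "1024",
    "--tokenizer-type", "GPT2BPETokenizer",
    "--fp16",
    "--batch-size", "1",
    "--seq-length", "1024",
    "--out-seq-length", "50",
    "--temperature", "1.0",
    "--vocab-file", "megatron/tokenizer/bpe_4w_pcl/vocab",
    "--num-samples", "0",
    "--top_k", "2",
    "--finetune",
]


def build_debug_env(base_env=None):
    """Return a copy of the environment with NCCL logging enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"  # NCCL prints init and error details to stderr
    return env


def run_with_nccl_debug(workdir="PanGu-Alpha-GPU/panguAlpha_pytorch"):
    """Rerun the failing command with NCCL logging (workdir is an assumption)."""
    return subprocess.run(COMMAND, cwd=workdir, env=build_debug_env())
```

Calling `run_with_nccl_debug()` inside the container should reproduce the failure with the NCCL log lines printed ahead of the traceback, which may help narrow down whether the problem lies in the driver, the container setup, or the NCCL build.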

1 year ago

NewBarry03 created CPU/GPU type debugging task newba202304200123947(deleted)

1 year ago

NewBarry03 created repository NewBarry03/PanGu-NewBee

1 year ago