#1117 Network problem in training task on the NPU OpenI (启智) cluster

Closed
created 8 months ago by JeffDing · 3 comments
On the NPU OpenI (启智) cluster, a network failure occurs while downloading the model to pretrain. The error output is below. What is causing it?

Task URL: https://openi.pcl.ac.cn/JeffDing/WuKong-HuaHua/modelarts/train-job/198246

```
pretrain_url_json: [{'model_url': 's3://open-data/aimodels/c/c/cc64487c-a9f8-4e6a-96fc-87003d393be3/wukong-huahua-inpaint-ms.ckpt', 'model_name': 'wukong-huahua-inpaint-ms.ckpt'}]
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
Successfully Download s3://open-data/aimodels/c/c/cc64487c-a9f8-4e6a-96fc-87003d393be3/wukong-huahua-inpaint-ms.ckpt to /cache/pretrain//wukong-huahua-inpaint-ms.ckpt
WARNING:root:Retry=9, Wait=0.1, Timestamp=1692947303.1140246
WARNING:root:Retry=8, Wait=0.2, Timestamp=1692947303.2637005
WARNING:root:Retry=7, Wait=0.4, Timestamp=1692947303.480923
WARNING:root:Retry=6, Wait=0.8, Timestamp=1692947303.91262
WARNING:root:Retry=5, Wait=1.6, Timestamp=1692947304.7646766
WARNING:root:Retry=4, Wait=3.2, Timestamp=1692947306.3835661
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
WARNING:root:Retry=3, Wait=6.4, Timestamp=1692947309.6906157
WARNING:root:Retry=2, Wait=12.8, Timestamp=1692947316.1115925
WARNING:root:Retry=1, Wait=25.6, Timestamp=1692947328.940594
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
ERROR:root:Failed to call: func=<bound method ObsClient.getObject of <moxing.framework.file.src.obs.client.ObsClient object at 0xffff24b62e10>> args=('open-data', 'attachment/e/b/eb1271c0-7910-4b26-ae97-7a2fe19f713ceb1271c0-7910-4b26-ae97-7a2fe19f713c') kwargs={loadStreamInMemory:False, cache:False, }
ERROR:root: stat:404 errorCode:NoSuchKey errorMessage:The specified key does not exist.
reason:Not Found request-id:0000018A2B862FA383DD6B3CA18CB1AD retry:0
moxing download s3://open-data/attachment/e/b/eb1271c0-7910-4b26-ae97-7a2fe19f713ceb1271c0-7910-4b26-ae97-7a2fe19f713c/ to /cache/data//wukong_dataset failed: [Errno {'status': 404, 'reason': 'Not Found', 'errorCode': 'NoSuchKey', 'errorMessage': 'The specified key does not exist.', 'body': None, 'requestId': '0000018A2B862FA383DD6B3CA18CB1AD', 'hostId': 'g6tYp12PjymovIL3MtjgLUp+flIqFZwAIpOjkGXuROi+jL0EDqOBenx89HzQTLHz', 'header': [('x-reserved', 'amazon, aws and amazon web services are trademarks or registered trademarks of Amazon Technologies, Inc'), ('request-id', '0000018A2B862FA383DD6B3CA18CB1AD'), ('id-2', '32AAAQAAEAABAAAQAAEAABAAAQAAEAABCSrbI2Vql+DEGRKBitCmdRf0iwsvkGj8'), ('content-type', 'application/xml'), ('date', 'Fri, 25 Aug 2023 07:08:48 GMT'), ('content-length', '378')]}] file or directory or bucket not found.
pretrain_url_json: [{'model_url': 's3://open-data/aimodels/c/c/cc64487c-a9f8-4e6a-96fc-87003d393be3/wukong-huahua-inpaint-ms.ckpt', 'model_name': 'wukong-huahua-inpaint-ms.ckpt'}]
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
Successfully Download s3://open-data/aimodels/c/c/cc64487c-a9f8-4e6a-96fc-87003d393be3/wukong-huahua-inpaint-ms.ckpt to /cache/pretrain//wukong-huahua-inpaint-ms.ckpt
random seed: 3407
Traceback (most recent call last):
  File "/home/work/user-job-dir/code/run_train.py", line 314, in <module>
    main(args)
  File "/home/work/user-job-dir/code/run_train.py", line 170, in main
    dataset, rank_id, device_id, device_num = init_env(opts)
  File "/home/work/user-job-dir/code/run_train.py", line 91, in init_env
    sample_num=-1
  File "/cache/user-job-dir/code/ldm/data/dataset.py", line 46, in load_data
    raise ValueError("Data directory does not exist!")
ValueError: Data directory does not exist!
```
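A note on reading the log above: the failure line carries an HTTP status of 404 with the error code NoSuchKey, which means the storage service was reached and answered that the object key does not exist. The sketch below pulls those two fields out of the failure line to separate a missing object from a request that got no HTTP response at all. The helper name `classify_obs_error` is hypothetical and is not part of MoXing or the OBS SDK; the sample line is the one from the log, trimmed to the fields used here.

```python
import re

# Failure line from the log above, trimmed to the fields this sketch reads.
LOG_LINE = (
    "moxing download s3://open-data/attachment/e/b/"
    "eb1271c0-7910-4b26-ae97-7a2fe19f713ceb1271c0-7910-4b26-ae97-7a2fe19f713c/ "
    "to /cache/data//wukong_dataset failed: "
    "[Errno {'status': 404, 'reason': 'Not Found', 'errorCode': 'NoSuchKey', "
    "'errorMessage': 'The specified key does not exist.'}] "
    "file or directory or bucket not found."
)


def classify_obs_error(line: str) -> str:
    """Hypothetical helper: label a failure line by its HTTP status and error code."""
    status = re.search(r"'status':\s*(\d+)", line)
    code = re.search(r"'errorCode':\s*'([^']+)'", line)
    if status is None:
        # No HTTP status in the line: the request may never have reached the service.
        return "no HTTP status found"
    if status.group(1) == "404" and code is not None and code.group(1) == "NoSuchKey":
        # The service answered; the requested key is simply absent.
        return "object missing"
    return f"service answered with HTTP {status.group(1)}"


print(classify_obs_error(LOG_LINE))
```

For this log the helper reports a missing object, which points at the dataset path or the copy method rather than at connectivity.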
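JeffDing changed title from "Network failure when downloading the model to pretrain on the NPU OpenI (启智) cluster" to "Network problem in training task on the NPU OpenI (启智) cluster" 8 months ago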
liuzx commented 8 months ago
Collaborator
After changing /cache/data//wukong_dataset to /cache/data/wukong_dataset, do you still get the same error?
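For context on the double slash: on POSIX filesystems a repeated separator inside a path is treated as a single one, so both spellings normally name the same local directory. The sketch below checks this with the standard library. Whether MoXing handles the two spellings identically during the copy is not established in this thread, so this only covers the local-path side.

```python
import posixpath

doubled = "/cache/data//wukong_dataset"
single = "/cache/data/wukong_dataset"

# normpath collapses the repeated separator in the middle of the path.
normalized = posixpath.normpath(doubled)
print(normalized)
print(normalized == single)
```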
JeffDing commented 8 months ago
Poster
> After changing /cache/data//wukong_dataset to /cache/data/wukong_dataset, do you still get the same error?

Same error: https://openi.pcl.ac.cn/JeffDing/WuKong-HuaHua/modelarts/train-job/198246

```
Successfully Download s3://open-data/attachment/e/b/eb1271c0-7910-4b26-ae97-7a2fe19f713ceb1271c0-7910-4b26-ae97-7a2fe19f713c/ to /cache/data/wukong_dataset
pretrain_url_json: [{'model_url': 's3://open-data/aimodels/c/c/cc64487c-a9f8-4e6a-96fc-87003d393be3/wukong-huahua-inpaint-ms.ckpt', 'model_name': 'wukong-huahua-inpaint-ms.ckpt'}]
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
Successfully Download s3://open-data/aimodels/c/c/cc64487c-a9f8-4e6a-96fc-87003d393be3/wukong-huahua-inpaint-ms.ckpt to /cache/pretrain/wukong-huahua-inpaint-ms.ckpt
WARNING:root:Retry=9, Wait=0.1, Timestamp=1692959627.8634877
WARNING:root:Retry=8, Wait=0.2, Timestamp=1692959627.9812672
WARNING:root:Retry=7, Wait=0.4, Timestamp=1692959628.210841
WARNING:root:Retry=6, Wait=0.8, Timestamp=1692959628.6533546
WARNING:root:Retry=5, Wait=1.6, Timestamp=1692959629.4716794
WARNING:root:Retry=4, Wait=3.2, Timestamp=1692959631.1032703
WARNING:root:Retry=3, Wait=6.4, Timestamp=1692959634.3229666
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
WARNING:root:Retry=2, Wait=12.8, Timestamp=1692959640.757176
WARNING:root:Retry=1, Wait=25.6, Timestamp=1692959653.5985005
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
ERROR:root:Failed to call: func=<bound method ObsClient.getObject of <moxing.framework.file.src.obs.client.ObsClient object at 0xffff42eb4d90>> args=('open-data', 'attachment/e/b/eb1271c0-7910-4b26-ae97-7a2fe19f713ceb1271c0-7910-4b26-ae97-7a2fe19f713c') kwargs={loadStreamInMemory:False, cache:False, }
ERROR:root: stat:404 errorCode:NoSuchKey errorMessage:The specified key does not exist.
reason:Not Found request-id:0000018A2C423EC783DABAF1ECFADE80 retry:0
moxing download s3://open-data/attachment/e/b/eb1271c0-7910-4b26-ae97-7a2fe19f713ceb1271c0-7910-4b26-ae97-7a2fe19f713c/ to /cache/data/wukong_dataset failed: [Errno {'status': 404, 'reason': 'Not Found', 'errorCode': 'NoSuchKey', 'errorMessage': 'The specified key does not exist.', 'body': None, 'requestId': '0000018A2C423EC783DABAF1ECFADE80', 'hostId': 'FFljQQGyws77Wa1S52z7k8E3KL+rz6Hly418JHmdEzaBo2gbz+rJzh2LnF+juapO', 'header': [('x-reserved', 'amazon, aws and amazon web services are trademarks or registered trademarks of Amazon Technologies, Inc'), ('request-id', '0000018A2C423EC783DABAF1ECFADE80'), ('id-2', '32AAAQAAEAABAAAQAAEAABAAAQAAEAABCSH73Jx1l07OIdN3MIRhgZ+x+PRjXbka'), ('content-type', 'application/xml'), ('date', 'Fri, 25 Aug 2023 10:34:13 GMT'), ('content-length', '378')]}] file or directory or bucket not found.
pretrain_url_json: [{'model_url': 's3://open-data/aimodels/c/c/cc64487c-a9f8-4e6a-96fc-87003d393be3/wukong-huahua-inpaint-ms.ckpt', 'model_name': 'wukong-huahua-inpaint-ms.ckpt'}]
INFO:root:Using MoXing-v2.0.0.rc2.4b57a67b-4b57a67b
INFO:root:Using OBS-Python-SDK-3.20.9.1
Successfully Download s3://open-data/aimodels/c/c/cc64487c-a9f8-4e6a-96fc-87003d393be3/wukong-huahua-inpaint-ms.ckpt to /cache/pretrain/wukong-huahua-inpaint-ms.ckpt
random seed: 3407
Filter small images, filter size: 256
Traceback (most recent call last):
  File "/home/work/user-job-dir/code/run_train.py", line 319, in <module>
    main(args)
  File "/home/work/user-job-dir/code/run_train.py", line 170, in main
    dataset, rank_id, device_id, device_num = init_env(opts)
  File "/home/work/user-job-dir/code/run_train.py", line 91, in init_env
    sample_num=-1
  File "/cache/user-job-dir/code/ldm/data/dataset.py", line 51, in load_data
    print(f"The first image path is {all_images[0]}, and the caption is {all_captions[0]}")
IndexError: list index out of range
```
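The IndexError at the end of this traceback comes from indexing all_images[0] when the list is empty, which is consistent with the data directory existing but holding no usable files after the failed copy. A minimal sketch of a pre-flight check that fails with a clearer message is shown below. The function name `find_images` and the extension list are assumptions made for illustration; they are not taken from ldm/data/dataset.py.

```python
import os
from typing import List

# Assumed set of image extensions; the real loader may accept others.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png")


def find_images(data_dir: str) -> List[str]:
    """Hypothetical check to run before building the dataset."""
    if not os.path.isdir(data_dir):
        raise FileNotFoundError(f"Data directory does not exist: {data_dir}")
    # Walk the directory tree and collect every file with an image extension.
    images = sorted(
        os.path.join(root, name)
        for root, _dirs, files in os.walk(data_dir)
        for name in files
        if name.lower().endswith(IMAGE_EXTENSIONS)
    )
    if not images:
        # An empty directory usually means the dataset copy did not complete.
        raise RuntimeError(f"No images found under {data_dir}")
    return images
```

Calling find_images("/cache/data/wukong_dataset") before load_data would surface an empty or missing dataset directly, instead of as an index error several frames later.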
liuzx commented 8 months ago
Collaborator
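The dataset is probably being copied the wrong way. On the OpenI (启智) cluster you need the import used in the example: from openi import openi_multidataset_to_env as DatasetToEnv. On the 智算 (intelligent-computing) cluster, use: from openi import c2net_multidataset_to_env as DatasetToEnv. Give that a try.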
liuzx closed this issue 6 months ago