#777 master: Octopus task monitoring data query API improvements

Merged
yangxzh1 merged 26 commits from openioctopus/octopus:master into master 3 months ago
## Requirements

- [Monitoring API returns GPU utilization and GPU memory utilization for other chips such as Enflame, Ascend, Cambricon, and Iluvatar](https://openi.pcl.ac.cn/OpenI/octopus/issues/775)
  - Response data
  ```
  {
    "success": true,
    "payload": {
      "cpuUsage": [               # average CPU load
        -1
      ],
      "memUsage": [               # memory usage
        -1
      ],
      "gpuUtil": [                # existing field; returns GPU load only
        -1
      ],
      "gpuMemUsage": [            # existing field; returns GPU memory load only
        -1
      ],
      "memUsagePercent": [        # memory utilization
        -1
      ],
      "accCardUtil": [            # new field: accelerator card utilization, covering all accelerator card types
        -1
      ],
      "accCardMemUsage": [        # new field: accelerator card memory utilization, covering all accelerator card types
        -1
      ],
      "networkReceiveBytes": [    # new field: network I/O data received: Bs/1m avg
        -1
      ],
      "networkTransmitBytes": [   # new field: network I/O data sent: Bs/1m avg
        -1
      ],
      "fsUsageBytes": [           # new field: filesystem usage: b
        -1
      ],
      "company": ""               # new field: card vendor; enum: "nvidia", "huawei", "cambricon", "enflame", "iluvatar", "metax-tech"
    },
    "error": null
  }
  ```
- [Add a monitoring API for notebooks](https://openi.pcl.ac.cn/OpenI/octopus/issues/774)
  - Endpoint and request parameters
  ```
  {{openai_addr}}/v1/developmanage/notebookmetric?id=s52efc213ac245a0b067ee9a57a067a5&taskIndex=0&start=1705314518&size=1&step=30
  ```
  - Response data
    - Same structure as the training task endpoint's response
- [Change the monitoring API's cpuUsage value range to 0-100](https://openi.pcl.ac.cn/OpenI/octopus/issues/773)
- Monitoring API returns disk usage and network bandwidth
- Custom resources are now deleted logically (soft delete)
- Updated the NPU Grafana monitoring data expressions

## Test environment

- 192.168.242.41

## Resources

### enflame gcu (Enflame)

- Image: swr.cn-south-1.myhuaweicloud.com/openioctopus/enflame:1.3.20231214
- cd /userhome/openi/gcu; python3 train.py
- Node: 192.168.242.33

### cambricon mlu (Cambricon)

- Image: swr.cn-south-1.myhuaweicloud.com/openioctopus/cambricon-pytorch:v1.0.7
- . /torch/venv3/pytorch/bin/activate; cd /userhome/openi/mlu; python3 train.py
- Node: 192.168.242.78

### huawei npu (Huawei)

- Image: swr.cn-south-1.myhuaweicloud.com/openioctopus/ascend-mindspore:MindSpore2.2.0-cann7.0rc1_py_3.9-euler_2.8.3-D910A
- Working image: 192.168.202.74:5000/openi/c79-base:latest
- cd /userhome/openi/npu; python3 train.py
- Node: 192.168.206.27

### nvidia gpu (NVIDIA)

- Image: swr.cn-south-1.myhuaweicloud.com/openioctopus/detectron-base:v1.0
- cd /userhome/openi/cifar10; python3 main.py
- Node: 192.168.202.72

### iluvatar gpu (Iluvatar)

- Image: swr.cn-south-1.myhuaweicloud.com/openioctopus/corex:3.1.1-bi-py39
- cd /userhome/openi/gpu; python3 train.py
- Node: 192.168.204.151
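
## Client-side sketch

A minimal Python sketch of how a client might build the notebook metric request shown in this pull request and read the accelerator utilization series from the response payload. The function names, the `base_addr` argument (standing in for the `{{openai_addr}}` placeholder), the fallback from `accCardUtil` to the older `gpuUtil` field, and the reading of `-1` as "no sample available" are all assumptions made for illustration; none of them is confirmed by this pull request.

```python
from urllib.parse import urlencode


def notebook_metric_url(base_addr, notebook_id, task_index, start, size, step):
    """Build the notebook metric query URL (hypothetical helper).

    The parameter names mirror the query string in the description:
    id, taskIndex, start, size, step.
    """
    query = urlencode({
        "id": notebook_id,
        "taskIndex": task_index,
        "start": start,
        "size": size,
        "step": step,
    })
    return f"{base_addr}/v1/developmanage/notebookmetric?{query}"


def valid_samples(series):
    """Drop -1 entries.

    Assumption: -1 is a placeholder for a missing sample, as suggested
    by the example response; the PR does not state this explicitly.
    """
    return [value for value in series if value != -1]


def accelerator_util(payload):
    """Return accelerator utilization samples from a response payload.

    Prefers the new vendor-neutral accCardUtil field and falls back to
    the older gpuUtil field (assumed fallback, for servers that do not
    yet return the new field).
    """
    series = payload.get("accCardUtil") or payload.get("gpuUtil") or []
    return valid_samples(series)
```

For example, `accelerator_util({"accCardUtil": [-1, 40, 55], "company": "enflame"})` yields `[40, 55]` under the assumptions above.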
yangxzh1 was assigned by wakinzhang 3 months ago
yangxzh1 was unassigned by wakinzhang 3 months ago
yangxzh1 was assigned by wakinzhang 3 months ago
yangxzh1 changed title from WIP: master: Octopus task monitoring data query API improvements to master: Octopus task monitoring data query API improvements 3 months ago
yangxzh1 merged commit dbb210497f into master 3 months ago
The pull request has been merged as dbb210497f.