#64 Optimizing how task startup summary information is implemented

Closed
created 2 months ago by shamartor · 2 comments

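Problem description

Task summary information is a chronological record of the events that occur while a task starts, such as task scheduling, image pulling, and successful startup. By observing these events, users can better understand the state of a task.

Screenshot: task summary (https://git.openi.org.cn/attachments/17515873-d2ee-471b-9a9a-fa65bde6f5b8)

This feature already exists in earlier versions of Octopus, but in operation we found that fetching a task's summary information is very slow and sometimes fails entirely. Preliminary analysis suggests that the problem becomes more pronounced as the number of tasks in the cluster grows.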

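Proposed solution

After reading the implementation of this part of the code, the preliminary optimization plan has two steps:

  1. In the current implementation, a task-level ClientInformer is instantiated every time a task starts, so the number of connections to the k8s apiserver grows with the number of tasks. Consider reducing the number of Informers: https://git.openi.org.cn/OpenI/octopus/src/branch/master/server/taskset/pkg/pipeline/services/kubernetes/logs_helper.go#L60

  2. Task event records are currently appended to the stateSummary field of the job table in the database. Consider storing this data in a time-series database instead: https://git.openi.org.cn/OpenI/octopus/src/branch/master/server/taskset/pkg/pipeline/models/job/job.go#L39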
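
To make step 1 more concrete, here is a minimal sketch of the general idea: one shared event stream fans out to per-job handlers, rather than each task opening its own watch. It is plain Python and does not use any Kubernetes client. The class name `SharedEventDispatcher`, the `job_id` key, and the event dictionary shape are all hypothetical and are not taken from the Octopus code.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class SharedEventDispatcher:
    """Routes events from ONE shared stream to per-job handlers.

    Hypothetical illustration only: a real service would feed `dispatch`
    from a single cluster-wide watch instead of one watch per task.
    """

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def register(self, job_id: str, handler: Handler) -> None:
        # Adding a job costs a dictionary entry, not a new apiserver connection.
        self._handlers[job_id].append(handler)

    def unregister(self, job_id: str) -> None:
        # Drop all handlers for a finished job; unknown ids are ignored.
        self._handlers.pop(job_id, None)

    def dispatch(self, events: Iterable[Event]) -> None:
        # Events without a registered job_id are silently skipped.
        for event in events:
            for handler in self._handlers.get(event.get("job_id", ""), []):
                handler(event)
```

With this shape, the number of upstream connections stays constant while the handler table grows with the number of tasks.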

shamartor commented 2 months ago
Poster

Related issues:
#27
#12
#11

shamartor added the
optimizing
label 2 months ago
shamartor added this to the v4.0.1 milestone 2 months ago
lijunmao was assigned by shamartor 2 months ago
yangxzh1 referenced this issue from a commit 2 months ago
lijunmao commented 2 months ago
Collaborator

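Adopted solution:

  1. The pipeline service code no longer actively collects task events, and the statesummary field of the job table is no longer used to store event information;
  2. Task events are stored in the time-series database influxdb. An octopus database and an events table are created in it. The default database username and password are both octopus, and both can be configured in values.yaml at deployment time;
  3. The third-party open-source service eventrouter is used. It automatically collects task events and writes them to the events table. The written fields are: tag keys: cluster_name, component, hostname, kind, namespace_name, object_name, pod_id, reason, type, uid;
    field key: message;
  4. The frontend can request task events from openai-server via /trainmanage/trainjobevent and /developmanage/notebook. The parameters are:
    id: the task's jobId
    pageIndex: page index, starting from 1
    pageSize: page size
    taskIndex: subtask index, starting from 1
    replicaIndex: replica index, starting from 1
    The returned result includes:
    totalSize: total number of events for this replica,
    jobEvents: array of events,
    Each array item includes:
    timestamp: time the event occurred,
    name: replica name,
    reason: event reason,
    message: event message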
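
The paging contract in item 4 can be sketched as follows. This is only an illustration of 1-based paging over an already-fetched list of events for one replica. The `JobEvent` class and the `page_of_events` function are hypothetical names, and the actual openai-server handler and its influxdb query are not shown in this issue.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Sequence


@dataclass(frozen=True)
class JobEvent:
    # Field names follow the response items listed above.
    timestamp: str
    name: str
    reason: str
    message: str


def page_of_events(
    events: Sequence[JobEvent], page_index: int, page_size: int
) -> Dict[str, object]:
    """Return one page of events; page_index starts at 1 (assumed contract)."""
    if page_index < 1 or page_size < 1:
        raise ValueError("pageIndex and pageSize must be >= 1")
    start = (page_index - 1) * page_size
    page = events[start:start + page_size]
    return {
        "totalSize": len(events),  # total for the replica, not the page size
        "jobEvents": [asdict(event) for event in page],
    }
```

A page past the end yields an empty `jobEvents` array while `totalSize` still reports the full count, which lets a client stop paging without a separate count request.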
liwei03 referenced this issue from a commit 1 month ago
yangxzh1 referenced this issue from a commit 1 month ago
yangxzh1 added the
测试通过 (tests passed)
label 1 month ago
yangxzh1 closed this issue 1 month ago
yangxzh1 referenced this issue from a commit 2 weeks ago
liwei03 referenced this issue from a commit 1 week ago