
master (#3)

add readme.md
tags/v4.0.0-alpha.162
shamartor 1 month ago
parent
commit
a4c8308743
3 changed files with 131 additions and 0 deletions
  1. logo.png (BIN)
  2. readme.md (+65, -0)
  3. readme_en.md (+66, -0)

logo.png (BIN)
Width: 50  |  Height: 50  |  Size: 2.0 KiB

readme.md (+65, -0)

@@ -0,0 +1,65 @@
# Octopus Platform

<img src="./logo.png" width="100">

---

[EN](./readme_en.md)

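**Octopus** is a one-stop computing platform that unifies multiple computing scenarios. It is designed for the computing and resource-management needs of scenarios such as AI and HPC. It gives compute users functions for managing and using resources such as data, algorithms, images, models, and computing power, so they can build a computing environment and run their workloads in one place. It also gives cluster administrators functions such as cluster resource management and monitoring and computing task management and monitoring, so they can operate and analyze the whole system.

**Octopus** is built on the container orchestration platform [Kubernetes](https://kubernetes.io/zh/docs/concepts/overview/what-is-kubernetes) and makes full use of the agility, light weight, and isolation of containers to meet the needs of diverse computing scenarios.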

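## Features and Scenarios

Octopus has the following features:

- **One-stop development**: provides users with one-stop development functions for AI and HPC computing scenarios, connecting the full computing pipeline through data management, model development, and model training;
- **Easy to manage**: provides platform administrators with a one-stop resource management platform, and greatly reduces their management cost through visual tools for resource configuration, monitoring, and permission control;
- **Easy to deploy**: Octopus supports rapid deployment with [Helm](https://helm.sh), simplifying the complex deployment process;
- **Superior performance**: provides a high-performance distributed computing experience, with optimizations in several areas to keep each environment running smoothly; resource scheduling and distributed computing optimizations further improve model training efficiency;
- **Good compatibility**: the platform supports heterogeneous hardware such as GPU, NPU, and FPGA to meet the deployment needs of different hardware clusters. It supports multiple deep learning frameworks, such as TensorFlow, PyTorch, and PaddlePaddle, and can support new frameworks through custom images.

Octopus is suitable for the following scenarios:

- You are building a large-scale AI computing platform;
- You want to share computing resources;
- You want to complete model training in a unified environment;
- You want to use integrated plug-ins to assist model training and improve efficiency.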

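## Get Started

**Octopus** manages computing resources and optimizes computing tasks for scenarios such as AI and HPC. It uses image and container technology ([Docker](https://docs.docker.com)) to decouple computing hardware from software, which makes it easy to switch between different computing environments.

Octopus users usually have one of two roles:

- **Cluster administrators** are the owners and maintainers of the computing resources. They are responsible for the deployment and availability of the cluster.
- **Cluster users** are the consumers of the cluster's computing resources. Depending on the deployment scenario, cluster users can be machine learning and deep learning researchers, data scientists, laboratory teachers, students, and so on.

Octopus provides end-to-end manuals for both cluster users and administrators.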

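### For cluster administrators

The documents for cluster administrators are:

- ***Cluster Deployment Guide***: covers the preparation and installation of the cluster's dependencies and components, the Octopus deployment guide, and instructions for later system upgrades, to make installation and maintenance easier. For details, see [here](https://octopus.pcl.ac.cn/docs/deployment/environment).

- ***Cluster Management Manual***: describes the operations a cluster administrator can perform after entering the Octopus management system through the management page. The functions covered include platform monitoring, resource management, user management, machine-time management, data management, algorithm management, and development and training management. For details, see [here](https://octopus.pcl.ac.cn/docs/management/intro).

### For cluster users

The main document for cluster users is:

- ***User Manual***: describes the operations a cluster user can perform after entering the Octopus system through its page. The functions covered include data management, algorithm management, image management, and development and training management. For details, see [here](https://octopus.pcl.ac.cn/docs/manual/intro).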

## Documentation

For detailed documentation, see [here](https://octopus.pcl.ac.cn/docs/introduction/intro).

## How to Contribute

For detailed contribution guidelines, see [here](https://octopus.pcl.ac.cn/docs/community/contribution).

## License

[Apache License](https://octopus.pcl.ac.cn/docs/community/LICENSE)

readme_en.md (+66, -0)

@@ -0,0 +1,66 @@
# Octopus Platform

<img src="./logo.png" width="100">

---

[简体中文](./readme.md)

**Octopus** is a one-stop computing platform that unifies multiple computing scenarios. The platform is designed mainly for the computing and resource-management needs of AI, HPC, and similar scenarios. It gives users functions for managing and using data, algorithms, images, models, and computing power, so they can build a computing environment and run computations in one place.
It also gives cluster administrators functions such as cluster resource management and monitoring and computing task management and monitoring, so they can operate and analyze the overall system.

**Octopus** is built on the container orchestration platform [Kubernetes](https://kubernetes.io/zh/docs/concepts/overview/what-is-kubernetes) and makes full use of the agility, light weight, and isolation of containers to meet the needs of diverse computing scenarios.

## Features and Scenarios

Octopus has the following features:

- **One-stop development**: provides users with one-stop development functions for AI and HPC computing scenarios, connecting the full computing pipeline through data management, model development, and model training;
- **Easy to manage**: provides platform administrators with a one-stop resource management platform, and greatly reduces their management cost through visual tools for resource configuration, monitoring, and permission control;
- **Easy to deploy**: Octopus supports rapid deployment with [Helm](https://helm.sh), simplifying the complex deployment process;
- **Superior performance**: provides a high-performance distributed computing experience, with optimizations in several areas to keep each environment running smoothly; resource scheduling and distributed computing optimizations further improve model training efficiency;
- **Good compatibility**: the platform supports heterogeneous hardware such as GPU, NPU, and FPGA to meet the deployment needs of different hardware clusters. It supports multiple deep learning frameworks, such as TensorFlow, PyTorch, and PaddlePaddle, and can support new frameworks through custom images.

Octopus is suitable for the following scenarios:

- You are building a large-scale AI computing platform;
- You want to share computing resources;
- You want to complete model training in a unified environment;
- You want to use integrated plug-ins to assist model training and improve efficiency.

## Get Started

**Octopus** manages computing resources and optimizes computing tasks for scenarios such as AI and HPC. It decouples computing hardware from software through image and container technology ([Docker](https://docs.docker.com)), which makes it easy to switch between different computing environments.

Octopus users usually have one of two roles:

- **Cluster administrators** are the owners and maintainers of the computing resources. They are responsible for the deployment and availability of the cluster.
- **Cluster users** are the consumers of the cluster's computing resources. Depending on the deployment scenario, cluster users can be machine learning and deep learning researchers, data scientists, laboratory teachers, students, and so on.

Octopus provides end-to-end manuals for cluster users and administrators.

### For cluster administrators

The documents for cluster administrators are:

- ***Cluster Deployment Guide***: covers the preparation and installation of the cluster's dependencies and components, the Octopus deployment guide, and instructions for later system upgrades, to make installation and maintenance easier. For details, see [here](https://octopus.pcl.ac.cn/docs/deployment/environment).

- ***Cluster Management Manual***: describes the operations a cluster administrator can perform after entering the Octopus management system through the management page. The functions covered include platform monitoring, resource management, user management, machine-time management, data management, algorithm management, and development and training management. For details, see [here](https://octopus.pcl.ac.cn/docs/management/intro).

### For cluster users

The main document for cluster users is:

- ***User Manual***: describes the operations a cluster user can perform after entering the Octopus system through its page. The functions covered include data management, algorithm management, image management, and development and training management. For details, see [here](https://octopus.pcl.ac.cn/docs/manual/intro).

## Documentation

For detailed documentation, see [here](https://octopus.pcl.ac.cn/docs/introduction/intro).

## How to Contribute

For detailed contribution guidelines, please refer to [here](https://octopus.pcl.ac.cn/docs/community/contribution).

## License

[Apache License](https://octopus.pcl.ac.cn/docs/community/LICENSE)
