Version: 2.0
XuanPolicy is an open-source ensemble of Deep Reinforcement Learning (DRL) algorithm implementations.
We also call it XuanCe: in Chinese, "Xuan" means magic box and "Ce" means policy.
DRL algorithms are sensitive to hyper-parameter tuning, vary in performance with different tricks, and suffer from unstable training processes, so they can sometimes seem elusive and "Xuan".
This project provides a thorough, high-quality, and easy-to-understand implementation of RL algorithms, and we hope it sheds some light on the magic of reinforcement learning.
We expect it to be compatible with multiple deep learning toolboxes (torch, mindspore, and tensorlayer), and hope it can truly become a zoo full of DRL algorithms.
This project is supported by Peng Cheng Laboratory.
Step 1: Create and activate a new conda environment (python=3.7 is suggested):
$ conda create -n xuanpolicy python=3.7
$ conda activate xuanpolicy
Step 2: Install the required Python modules with:
$ pip install -r requirements.txt
Note: Some modules need to be installed manually, depending on your device.
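For example, PyTorch usually needs a device-specific build. After installing, you can quickly check that the backend you intend to use is importable and sees your hardware (a minimal check, assuming the torch backend):

```python
# Minimal sanity check for the PyTorch backend after installation.
import torch

print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```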
You can disable any of the last five tricks as you like by changing the corresponding default parameters in the functions.
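As an illustration only (the function and parameter names below are hypothetical, not XuanPolicy's actual API), disabling a trick usually amounts to overriding a boolean default:

```python
# Hypothetical illustration; the real function and parameter names in
# XuanPolicy may differ. Each trick is guarded by a keyword default.
def build_learner(double_q: bool = True, dueling: bool = True) -> dict:
    return {"double_q": double_q, "dueling": dueling}

# Disable one trick while keeping the others at their defaults.
config = build_learner(double_q=False)
```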
The following command is enough to start training an RL agent:
$ python main.py --method dqn
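Under the hood, an entry point like this typically parses the --method flag and dispatches to the corresponding agent. The sketch below is hypothetical; only the --method flag comes from this README:

```python
# Hypothetical sketch of a main.py-style dispatcher; XuanPolicy's actual
# entry point may be organized differently.
import argparse

def train(method: str) -> None:
    # Stand-in for building and training the selected agent.
    print(f"Training an agent with method={method}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--method", default="dqn", help="algorithm name, e.g. dqn")
    args = parser.parse_args()
    train(args.method)
```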
Our project supports multi-process communication via mpi4py, so you can start training with K sub-processes using the following command:
$ mpiexec -n K python test_agent.py
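Each of the K processes gets its own MPI rank, which can be used to seed environments differently and to aggregate statistics across workers. A minimal mpi4py sketch (the reduced quantity here is only a stand-in):

```python
# Run with: mpiexec -n 4 python mpi_demo.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this worker's index, 0..K-1
size = comm.Get_size()   # K, set by mpiexec -n K

local_return = float(rank)  # stand-in for a locally collected episode return
# Sum the scalar across all workers; the result arrives on rank 0.
total = comm.reduce(local_return, op=MPI.SUM, root=0)
if rank == 0:
    print(f"mean return across {size} workers: {total / size:.2f}")
```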
You can use TensorBoard to visualize the training process. After training, log files are automatically generated in the "./results/" directory, and you should see the training data after running:
$ tensorboard --logdir ./results/
If everything goes well, you should see a display similar to the one below.
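TensorBoard picks up any scalar event files under --logdir, so you can also log your own quantities alongside the library's output. A small sketch using PyTorch's SummaryWriter (the tag name and sub-directory are assumptions, not the library's conventions):

```python
# Write a custom scalar curve that TensorBoard will display under ./results/.
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir="./results/custom")
for step in range(100):
    writer.add_scalar("train/episode_return", 0.5 * step, global_step=step)
writer.close()
```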
To visualize the training scores, training times, and overall performance, initialize the environment as
env = MonitorVecEnv(DummyVecEnv(...))
Then, after training terminates, two extra files, "xxx.npy" and "xxx.gif", will be generated in the "./results/" directory. The "xxx.npy" file records the score and clock time of each episode during training. We have not yet provided a plotter.py to draw these curves.
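Until a plotter ships with the project, a few lines of numpy and matplotlib can draw the score curve. Note that the array layout inside "xxx.npy" is an assumption here (one score and one clock time per episode, as described above):

```python
# Hedged sketch of a plotter for the generated "xxx.npy" file; the exact
# array layout is assumed, not confirmed by the project.
import numpy as np
import matplotlib.pyplot as plt

data = np.load("./results/xxx.npy")              # substitute your run's file name
scores = data[:, 0] if data.ndim == 2 else data  # assume column 0 holds scores

plt.plot(scores)
plt.xlabel("episode")
plt.ylabel("score")
plt.title("Training scores")
plt.savefig("./results/scores.png")
```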
We train our agents on the MuJoCo benchmark (HalfCheetah, ...) for 1M environment steps and compare against other implementations (stable-baselines, stable-baselines3, ...). The performance is shown below. We noticed that the reward scale in our experiments is different, mainly, we believe, because of the MuJoCo version and the number of timesteps per episode. For a fair comparison, we use the same hyper-parameters for all implementations.
| Environments (1M, 4 parallels) | Ours | Stable-baselines (tf) | Stable-baselines3 (torch) |
|---|---|---|---|
| HalfCheetah-v3 | ~3283 | ~1336.76 (std ~133.12) | |
| Hopper-v3 | ~2764.86 (std ~1090.03) | | |
| Walker2d-v3 | ~3094.35 (std ~83.41) | | |
| Ant-v3 | ~2508.44 (std ~106.25) | | |
| Swimmer-v3 | ~43.13 (std ~1.58) | | |
| Humanoid-v3 | ~549.35 (std ~92.78) | | |
| Reacher-v3 | ~360.45 (std ~43.95) | | |
| InvertedPendulum-v3 | | | |
| InvertedDoublePendulum-v3 | | | |