Xuan Policy —— “玄策”

This is a collection of RL algorithm implementations that is still under development.
We call it XuanCe.
“Xuan” means magic box and “Ce” means policy.
RL algorithms are sensitive to hyper-parameter tuning, vary in performance with different tricks, and often suffer from unstable training, so they can seem elusive and “Xuan” (magical).
This project provides thorough, high-quality, and easy-to-understand implementations of RL algorithms, and we hope they shed some light on that magic.
We aim to make it compatible with multiple deep learning toolboxes (TensorFlow, PyTorch, TensorLayer, MindSpore, ...) and hope it can truly become a zoo full of RL algorithms.
This project is supported by Peng Cheng Laboratory.

Currently Supported Agents

Stable

Unfortunately, none of the algorithms has undergone large-scale empirical testing yet.

Beta

  • SARSA - tabular
  • SARSA - linear function approximator
  • Q-Learning - tabular
  • Q-Learning - linear function approximator
  • Vanilla Policy Gradient - VPG (tensorflow)
  • Natural Policy Gradient - NPG (tensorflow)
  • Advantage Actor Critic - A2C (tensorflow)
  • Asynchronous Advantage Actor Critic - A3C (tensorflow)
  • Trust Region Policy Optimization - TRPO (tensorflow)
  • Proximal Policy Optimization - PPO (tensorflow)
  • Deep Q Network - DQN (pytorch)
  • DQN with Double Q-learning - Double DQN (pytorch)
  • DQN with Dueling network - Dueling DQN (pytorch)
  • DQN with Prioritized Experience Replay - PER (pytorch)
  • DQN with Parameter Space Noise for Exploration - NoisyNet (pytorch)
  • Deep Deterministic Policy Gradient - DDPG (pytorch)
  • Twin Delayed Deep Deterministic Policy Gradient - TD3 (pytorch)

Used Tricks

  • Vectorized Environment
  • Multi-processing Training
  • Generalized Advantage Estimation
  • Observation Normalization
  • Reward Normalization
  • Advantage Normalization
  • Gradient Clipping

You can disable any of the last five tricks by changing the default parameters of the corresponding functions.
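To illustrate one of these tricks, the sketch below shows how Generalized Advantage Estimation is commonly computed over a rollout; the function name, array layout, and default gamma/lambda values are illustrative assumptions, not the exact interface of this project.

import numpy as np

def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    # Generic GAE(lambda) over one rollout; names and defaults are illustrative only.
    advantages = np.zeros_like(rewards, dtype=np.float64)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t), masked at episode boundaries
        delta = rewards[t] + gamma * next_value * (1.0 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1.0 - dones[t]) * gae
        advantages[t] = gae
        next_value = values[t]
    returns = advantages + np.asarray(values)
    return advantages, returns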

Basic Usage

Installation

First, create a new conda environment (Python 3.7 is recommended):

conda create -n XuanPolicy python=3.7
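Activate the new environment before installing the dependencies (the name matches the command above):

conda activate XuanPolicy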

Then, install the required Python packages with the following command:

pip install -r requirements.txt

Run a Demo

The following four lines of code are enough to start training an RL agent. The template is shown in test_agent.py.

# Try a toy environment (or a MuJoCo environment); make_env_fn, DummyVecEnv and
# TRPO_TF_Agent come from this project (see test_agent.py for the exact imports).
make_env_fns = [make_env_fn("CartPole-v0", i) for i in range(8)]  # 8 parallel environment builders
envs = DummyVecEnv(make_env_fns)                                  # vectorized environment
# choose an agent you like
agent = TRPO_TF_Agent(envs)
agent.train_agent(init_steps=2000, num_steps=10000)

Our project supports multi-process communication via mpi4py, so you can start training with K sub-processes by running:

mpiexec -n K python test_agent.py
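For example, replacing K with 4 starts training with four sub-processes:

mpiexec -n 4 python test_agent.py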

You can also launch the training in a single process as usual:

python test_agent.py

Customize Usage

  • If you want to train an RL agent in your own environment, write an environment wrapper that implements the core functions reset() and step(action), then register it in the make_env_funcs.py file. The environment template is shown in “./envs/wrappers/xxx_wrappers.py”, and a minimal sketch appears after this list.
  • If you want to train an agent with a novel network architecture, modify the define_network function in the xxx_agent.py file under “agents/xxx/xxx_xx_agent”. (Hint: it is better not to modify the define_optimization() function.)
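A minimal sketch of such a wrapper is shown below. It follows the standard gym interface; the class name MyTaskWrapper and the placeholder observation/action spaces are hypothetical, and the actual template in “./envs/wrappers/” should be preferred.

import numpy as np
import gym

class MyTaskWrapper(gym.Env):
    # Hypothetical wrapper around a custom task; only reset() and step(action) are required.
    def __init__(self):
        # Placeholder spaces; replace them with the shapes of your own task.
        self.observation_space = gym.spaces.Box(-np.inf, np.inf, shape=(4,), dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self._state = np.zeros(4, dtype=np.float32)

    def reset(self):
        # Start a new episode and return the initial observation.
        self._state = np.zeros(4, dtype=np.float32)
        return self._state

    def step(self, action):
        # Apply the action and return (observation, reward, done, info).
        reward, done = 0.0, False
        return self._state, reward, done, {}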

Logger

You can use Tensorboard to visualize the training process. After training, log files are automatically generated in the “./results/” directory, and you should be able to see the training data after running the following command.

tensorboard --logdir ./results/

If everything goes well, you should see a display similar to the one below.

(Tensorboard screenshot)

To visualize the training scores, training time, and overall performance, you need to initialize the environment as

env = MonitorVecEnv(DummyVecEnv(...))

Then, after training terminates, two extra files, “xxx.npy” and “xxx.gif”, are generated in the “./results/” directory. The “xxx.npy” file records the score and wall-clock time of each training episode. We have not yet provided a plotter script to draw curves from these files.
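Until then, a simple script like the sketch below can plot the recorded scores; the file name scores.npy and the assumption that the array holds one score per episode are hypothetical.

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file name; the actual "xxx.npy" name depends on your run.
scores = np.load("./results/scores.npy", allow_pickle=True)

plt.plot(scores)              # episode index on the x-axis, score on the y-axis
plt.xlabel("Episode")
plt.ylabel("Score")
plt.title("Training scores")
plt.show()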

Experiments

MuJoCo

We trained our agents on the MuJoCo benchmark (HalfCheetah, ...) for 1M environment steps and compared them with other implementations (stable-baselines, stable-baselines3, ...). The results are shown below. We noticed that the reward scale in our experiments differs, which we attribute mainly to the MuJoCo version and the number of timesteps per episode. For a fair comparison, we used the same hyperparameters across all implementations.

A2C

| Environments (1M steps, 4 parallel envs) | Ours | Stable-baselines (tf) | Stable-baselines3 (torch) |
| --- | --- | --- | --- |
| HalfCheetah-v3 | | | |
| Hopper-v3 | | | |
| Walker2d-v3 | | | |
| Ant-v3 | | | |
| Swimmer-v3 | | | |
| Humanoid-v3 | | | |

ACER

| Environments (1M steps, 4 parallel envs) | Ours | Stable-baselines (tf) | Stable-baselines3 (torch) |
| --- | --- | --- | --- |
| HalfCheetah-v3 | | | |
| Hopper-v3 | | | |
| Walker2d-v3 | | | |
| Ant-v3 | | | |
| Swimmer-v3 | | | |
| Humanoid-v3 | | | |

ACKTR

| Environments (1M steps, 4 parallel envs) | Ours | Stable-baselines (tf) | Stable-baselines3 (torch) |
| --- | --- | --- | --- |
| HalfCheetah-v3 | | | |
| Hopper-v3 | | | |
| Walker2d-v3 | | | |
| Ant-v3 | | | |
| Swimmer-v3 | | | |
| Humanoid-v3 | | | |

TRPO

| Environments (1M steps, 4 parallel envs) | Ours | Stable-baselines (tf) | Stable-baselines3 (torch) |
| --- | --- | --- | --- |
| HalfCheetah-v3 | | | |
| Hopper-v3 | | | |
| Walker2d-v3 | | | |
| Ant-v3 | | | |
| Swimmer-v3 | | | |
| Humanoid-v3 | | | |

PPO

| Environments (1M steps, 4 parallel envs) | Ours | Stable-baselines (tf) | Stable-baselines3 (torch) |
| --- | --- | --- | --- |
| HalfCheetah-v3 | ~3283 | ~1336.76 (std ~133.12) | |
| Hopper-v3 | ~2764.86 (std ~1090.03) | | |
| Walker2d-v3 | ~3094.35 (std ~83.41) | | |
| Ant-v3 | ~2508.44 (std ~106.25) | | |
| Swimmer-v3 | ~43.13 (std ~1.58) | | |
| Humanoid-v3 | ~549.35 (std ~92.78) | | |
| Reacher-v3 | ~360.45 (std ~43.95) | | |
| InvertedPendulum-v3 | | | |
| InvertedDoublePendulum-v3 | | | |

DDPG

| Environments (1M steps, 4 parallel envs) | Ours | Stable-baselines (tf) | Stable-baselines3 (torch) |
| --- | --- | --- | --- |
| HalfCheetah-v3 | | | |
| Hopper-v3 | | | |
| Walker2d-v3 | | | |
| Ant-v3 | | | |
| Swimmer-v3 | | | |
| Humanoid-v3 | | | |

TD3

| Environments (1M steps, 4 parallel envs) | Ours | Stable-baselines (tf) | Stable-baselines3 (torch) |
| --- | --- | --- | --- |
| HalfCheetah-v3 | | | |
| Hopper-v3 | | | |
| Walker2d-v3 | | | |
| Ant-v3 | | | |
| Swimmer-v3 | | | |
| Humanoid-v3 | | | |

SAC

| Environments (1M steps, 4 parallel envs) | Ours | Stable-baselines (tf) | Stable-baselines3 (torch) |
| --- | --- | --- | --- |
| HalfCheetah-v3 | | | |
| Hopper-v3 | | | |
| Walker2d-v3 | | | |
| Ant-v3 | | | |
| Swimmer-v3 | | | |
| Humanoid-v3 | | | |
