Policy Parameterization for a Continuous Action Space

Cheng Xi Tsou
Published in Geek Culture · Aug 9, 2021

For the past few Policy Gradient and Actor-Critic algorithms I’ve implemented, I’ve used the classic control environment CartPole as my benchmark. The CartPole environment has a simple discrete action space: push the cart left or right. In this post, I will explore another way to parameterize the policy function in order to handle an environment with a continuous action space.

Introduction

One of the advantages of policy-based methods is the ability to deal with large or even continuous action spaces. In Q-learning methods, the agent learns a Q value for each action in a state and builds a policy by picking the action with the highest Q value. While this works for some environments, it requires a discrete action space, since a continuous range contains infinitely many actions to compare. In policy-based methods, the agent learns the policy directly and samples an action from a probability distribution over the action space. For a continuous action space, we can instead learn the parameters of a probability distribution and sample a value from it. First, let’s review how the policy is parameterized for discrete actions:

π(a | s, θ) = exp(h(s, a, θ)) / Σ_b exp(h(s, b, θ))    (Sutton & Barto, 2017)

The function takes in a discrete action a, such as “push left” or “push right”, a state s, and a set of parameters θ. On the right side of the equation is h(s, a, θ), a preference function that assigns higher values to actions the policy prefers. We pass the preferences through an exponential softmax so that each preference is converted into a probability in [0, 1], and the probabilities of all actions in a state sum to 1.

h(s, a, θ) = θᵀ x(s, a)    (Sutton & Barto, 2017)

The preference function here is just a linear function: the weight vector θ multiplied by a feature vector x(s, a). In an implementation, the weights can be learned by a neural network, so the preference function can be more complicated than a simple linear function.
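To make the softmax parameterization concrete, here is a tiny NumPy sketch; the feature vectors and weights are made up purely for illustration:

```python
import numpy as np

# Made-up feature vectors for "push left" / "push right" in some state s,
# and a made-up weight vector theta.
x_left = np.array([1.0, 0.2, -0.5])
x_right = np.array([1.0, -0.3, 0.8])
theta = np.array([0.1, 0.5, 1.2])

# Linear action preferences: h(s, a, θ) = θᵀ x(s, a)
h = np.array([theta @ x_left, theta @ x_right])

# Exponential softmax turns preferences into probabilities that sum to 1
probs = np.exp(h - h.max()) / np.exp(h - h.max()).sum()
print(probs, probs.sum())  # higher preference -> higher probability
```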

Now, let’s see how the policy is parameterized for continuous actions. The policy function is based on the probability density function for a Gaussian distribution:

p(x) = 1 / (σ√(2π)) · exp(−(x − μ)² / (2σ²))    (Sutton & Barto, 2017)

We can see that the parameters of this distribution are μ, the mean, and σ, the standard deviation. Note that p(x) does not give the probability of x but the probability density at x. Unlike the softmax distribution for discrete actions, the Gaussian density gives us a real value, which can exceed 1, rather than a probability. To find the probability of x landing in a range, you integrate the density between those bounds. Using this probability density function, we can define our policy function as:
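To see the difference between density and probability, here is a small SciPy sketch; the numbers are illustrative only:

```python
from scipy.stats import norm

mu, sigma = 0.0, 0.1  # illustrative parameters

# The density at a point is not a probability -- it can exceed 1.
print(norm.pdf(0.0, loc=mu, scale=sigma))   # ≈ 3.99

# The probability of landing in a range is the integral of the density,
# i.e. a difference of CDF values.
print(norm.cdf(0.1, mu, sigma) - norm.cdf(-0.1, mu, sigma))  # ≈ 0.68
```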

π(a | s, θ) = 1 / (σ(s, θ)√(2π)) · exp(−(a − μ(s, θ))² / (2σ(s, θ)²))    (Sutton & Barto, 2017)

Here, we use our parameters θ to approximate μ and σ, which define the probability density. The two can be approximated this way:

μ(s, θ) = θ_μᵀ x_μ(s),    σ(s, θ) = exp(θ_σᵀ x_σ(s))    (Sutton & Barto, 2017)

We approximate the mean with a linear function: a dot product of a transposed weight vector and a feature vector. The standard deviation must be positive, so we exponentiate its linear function. Just like the parameterization for discrete actions, we only need one function approximator: θ_μ and θ_σ are separate weight vectors, but in practice both μ and σ can be output by a single network over the same features.
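As a sketch of this parameterization (NumPy, with made-up features and weights):

```python
import numpy as np

x_s = np.array([0.5, -0.2, 1.0])           # feature vector x(s), made up for illustration
theta_mu = np.array([0.3, 0.1, -0.4])      # weights for the mean
theta_sigma = np.array([0.05, -0.2, 0.1])  # weights for the (log) standard deviation

mu = theta_mu @ x_s                # linear in the features
sigma = np.exp(theta_sigma @ x_s)  # exponentiated so it is always positive

action = np.random.normal(mu, sigma)  # sample a continuous action from N(mu, sigma)
```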

So that’s it! The underlying logic of the algorithm doesn’t change for a continuous action space; only the way the policy function is parameterized does.

Implementation

For the implementation, I will parameterize the policy function using a neural network. Then, I will test my implementation with the vanilla policy gradient method, REINFORCE. Looking at the classic control environments in OpenAI Gym, I chose to tackle the continuous Mountain Car problem (MountainCarContinuous-v0).

Taken from Paperspace Blog

The premise of the MountainCarContinuous environment is simple. At each step, the agent applies a force proportional to an action in [-1, 1], and the goal is to drive the car up the hill to the flag. The agent is rewarded when the car reaches the flag and penalized for the energy it uses. However, achieving the goal is a little more complicated: since the force is limited, the car has to rock back and forth to build up enough momentum to climb the hill. The agent also needs to minimize the energy used in order to get a higher score. The environment is considered solved at a score of 90 or above.
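For reference, here is roughly what the environment’s interface looks like (a quick sketch using the classic Gym API from around the time of writing; newer Gym/Gymnasium versions change the reset/step return signatures):

```python
import gym

env = gym.make('MountainCarContinuous-v0')
print(env.action_space)       # Box(-1.0, 1.0, (1,), float32) -- force in [-1, 1]
print(env.observation_space)  # Box(2,) -- car position and velocity

state = env.reset()
# Push right at half power for one step
next_state, reward, done, info = env.step([0.5])
```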

Here are the hyperparameters I used for my implementation:

  • γ (Discount factor): 0.99
  • α (Learning rate): 0.001
  • NUM_EPISODES (episodes to run): 1000
  • MAX_STEPS (max steps per episode): 500
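In code these are just constants (the names below are my own):

```python
GAMMA = 0.99         # discount factor γ
ALPHA = 1e-3         # learning rate α
NUM_EPISODES = 1000  # number of episodes to run
MAX_STEPS = 500      # maximum steps per episode
```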

Here is the policy network:
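The network itself is embedded as a notebook gist; in essence it is a small fully connected network with two output heads. A minimal PyTorch sketch (the hidden size and layer count are my choices here, not necessarily what the notebook uses):

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """Maps a state to the mean and standard deviation of a Gaussian over actions."""

    def __init__(self, state_dim, action_dim, hidden_size=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden_size),
            nn.ReLU(),
        )
        self.mu_head = nn.Linear(hidden_size, action_dim)         # outputs the mean
        self.log_sigma_head = nn.Linear(hidden_size, action_dim)  # outputs log of the std dev

    def forward(self, state):
        h = self.body(state)
        mu = self.mu_head(h)
        sigma = torch.exp(self.log_sigma_head(h))  # exponentiate so sigma > 0
        return mu, sigma
```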

We have two outputs here, μ and σ. Here is a snippet of code I used to select an action:
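In essence, action selection builds a Normal distribution from the network’s two outputs, samples from it, and keeps the log probability for the update. A sketch, assuming the GaussianPolicy above:

```python
import torch
from torch.distributions import Normal

def select_action(policy, state):
    state = torch.as_tensor(state, dtype=torch.float32)
    mu, sigma = policy(state)
    dist = Normal(mu, sigma)          # Gaussian parameterized by the network's outputs
    action = dist.sample()            # sample a continuous action
    log_prob = dist.log_prob(action)  # saved for the policy gradient update
    return action.numpy(), log_prob
```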

Note that I still calculate the log probability of the sampled action under the distribution: the policy update equation doesn’t change; we are only changing how we parameterize the policy function.
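For completeness, here is a sketch of that unchanged REINFORCE update, computing discounted returns and ascending the log-probability-weighted return (`optimizer` is assumed to be, say, Adam over the policy parameters):

```python
import torch

def update_policy(optimizer, log_probs, rewards, gamma=0.99):
    # Discounted return G_t for every timestep of the episode
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # REINFORCE objective: ascend sum_t log pi(a_t | s_t) * G_t
    loss = -(torch.stack(log_probs).squeeze() * returns).sum()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```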

Here’s the training history of the agent over 250 episodes with REINFORCE:

When I first saw the training history, I was surprised by the result. The agent’s score increased rapidly and then progress flatlined early on, at around 50 episodes. Rerunning the experiment 10 more times produced the same pattern, though occasionally the agent manages to exceed a score of 0 for a single episode. One conclusion you might draw is that the agent uses too much energy to generate the momentum needed to climb the hill, so the reward is canceled out by the energy cost. However, watching the agent play with the learned policy, the car only slightly rocks back and forth at the base of the hill. So what happened?

The problem lies in the environment’s sparse rewards. The MountainCar environment penalizes the agent for energy usage at each step and only rewards it once it reaches the flag. The agent therefore never learns about the reward unless it reaches the flag by random chance. Since it is penalized at every step, its goal naturally becomes minimizing energy usage, resulting in a policy where the car just sits still.

A potential solution could be to shape the rewards ourselves instead. However, after trying shaped rewards based on position, velocity, mechanical energy, and more, I was still getting the same results. The reason lies in the nature of our algorithm.
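For example, a shaped reward based on a rough mechanical-energy proxy might look something like this (an illustrative sketch, not the exact code I used; none of these variants helped):

```python
import numpy as np

def shaped_reward(next_state):
    """Illustrative shaping: reward a rough mechanical-energy proxy so that
    building momentum is no longer invisible to the agent."""
    position, velocity = next_state
    height = np.sin(3 * position) * 0.45 + 0.55  # approximate height of the track
    return height + 0.5 * velocity ** 2          # 'potential' + 'kinetic' term
```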

Our policy-based method improves the policy directly and samples actions from the learned policy. While an off-policy method can use exploration strategies such as epsilon-greedy, the only exploration our agent gets is when a sampled action deviates far from the mean or the standard deviation is large. As the agent keeps updating the policy without ever reaching the goal, the policy converges to a suboptimal solution.

Looks like it’s back to the good ole CartPole environment. I will be using a continuous CartPole implementation by iandanforth, with OpenAI’s solved condition of an average score of 195 over the most recent 100 episodes. Actions are clipped to [-1, 1]. Here’s the training history with REINFORCE:

Our agent achieved an average score of 300.72 over 50 playthroughs and solved CartPole in an average of 384.8 episodes over 5 trials. Compared to a score of 79.6 for CartPole with a discrete action space using REINFORCE, this result is far better; the agent solved the environment in under 1000 episodes. This is to be expected, as continuous actions provide more precise control over the cart than applying a fixed +1 or -1 force.

One thing to note is the instability and frequent catastrophic forgetting. Since we are doing REINFORCE without a baseline, high variance is to be expected. However, we can see the agent “forget” quite frequently starting from around episode 150. I hypothesize that because the agent learns so quickly, it doesn’t have time to converge to a stable policy. Combined with the fact that sampled actions can vary widely, a few outlier actions can cause the policy to collapse.

Here’s a peek at the training history of the action distribution parameters:

The training history of the means is as expected. Since the agent is trying to balance the pole, the policy learns to alternate the mean between negative and positive actions. We can also see that the majority of the negative means are under -1, while the positive means are between 1 and 2, apart from outliers. Since we clip actions to [-1, 1], the positive actions end up being all +1. Negative means occur about 20% more often than positive means, compensating for the larger positive means. I hypothesize that the magnitude of the mean increases over time because the actions are clipped, so the mean can keep climbing without penalty.

The history of the standard deviation was a bit unexpected and helps explain the training score history. The high standard deviation starting from around step 20,000 shows that the policy is less confident in its actions. This is around the time the agent starts to catastrophically forget and the score variance is high. The standard deviation falls back down once the agent solves the environment and converges to a maximum.

Let’s look at the training history of the scores and action distribution parameters with unclipped actions:

Unlike with clipped actions, the magnitude of the mean does not grow over time, since with unclipped actions there is a penalty for exerting extra force. The standard deviation history also makes sense: the learned policy becomes more confident in its actions as it converges toward the solution.

Final Thoughts

Overall, this was a fun and short experiment that did not take too long. I only wanted to try implementing a different kind of policy parameterization, so I modified parts of my existing REINFORCE code. In future posts, I will move on to more complex algorithms built for continuous action spaces, such as DDPG, PPO, and SAC, which will hopefully be able to solve more complex environments.

Code: https://github.com/chengxi600/RLStuff/blob/master/Policy%20Gradients/REINFORCE-Continuous.ipynb

References:

Sutton, R. S., & Barto, A. G. (2017). Reinforcement Learning: An Introduction (2nd ed. draft).