Thank you. My name is Chris Pinn, I'm a PhD researcher at the University of Southampton, and we will talk about safe reinforcement learning. The first half of the presentation will be theoretical: we will define what reinforcement learning is, when it is safe, and so on. In the second half we will talk about practical scenarios, which will hopefully clear things up.

So when do we use reinforcement learning? When we want to solve control problems. You can imagine we have a robotic arm that we want to control with an algorithm, or a space shuttle that should land on the Moon. What these scenarios have in common is that there is an agent that interacts with an environment. The agent chooses actions, and the environment reacts to those actions by returning states and rewards. Mathematically, the environment is defined as a Markov decision process: a set of states, a set of actions, a reward function, and a transition function. The goal is to find an agent that maximizes an optimality criterion, which is a long-term sum of the rewards coming from the environment.

Now, the agent's life cycle has two phases: first the training phase, then the deployment phase. In the training phase the agent explores the environment by taking random actions, and over time it starts exploiting what it has learned about the environment to get further. When we are happy that the agent has been trained, we deploy it, and then the agent only exploits what it has learned. It doesn't explore anymore, it doesn't take random actions; it simply takes the same action whenever it finds itself in the same state. The training is done with respect to the optimality criterion: we want to find an agent that maximizes that long-term sum of rewards coming from the environment.

Open source software is a great enabler of reinforcement learning. We have Gymnasium with a bunch of toy examples, Atari games, physics-based environments and so on. We have physics simulators like MuJoCo, which is used in robotics research but can be used in reinforcement learning as well. We can do traffic control and self-driving vehicles with SUMO and CARLA, or multi-agent scenarios where agents cooperate or compete. This shows the breadth of applications that reinforcement learning can have, and if your scenario matches one of these simulators, they are open source, so you can extend them and use them in your training.

Of course, in many scenarios you will have to create your own environment, and here is a simple example of how to do that. You would import the Gymnasium package, extend the environment class, and among the things you have to define is a step function. The step function accepts the agent's action and returns a reward and a state, so it basically defines how the environment behaves. Similarly, you would have an agent class that accepts the state and returns an action, and that gives you the loop we have seen before, but in code. This example is simplified, but if you look at the Gymnasium documentation you will see something very similar.
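To make that concrete, here is a minimal sketch of the pattern, in the spirit of the Gymnasium documentation. The one-dimensional grid, the reward values and the RandomAgent class are invented for illustration; they are not the exact code shown on the slide.

```python
import gymnasium as gym
from gymnasium import spaces


class LineEnv(gym.Env):
    """Toy environment: the agent walks along a line until it reaches the goal tile."""

    def __init__(self, size=5):
        self.size = size
        self.position = 0
        self.observation_space = spaces.Discrete(size)  # states 0 .. size-1
        self.action_space = spaces.Discrete(2)          # 0 = stay, 1 = move right

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.position = 0
        return self.position, {}  # initial state and an empty info dict

    def step(self, action):
        # The step function defines how the environment behaves: it takes the
        # agent's action and returns the next state and a reward.
        if action == 1:
            self.position = min(self.position + 1, self.size - 1)
        terminated = self.position == self.size - 1
        reward = 100 if terminated else -1              # reach the goal as fast as possible
        return self.position, reward, terminated, False, {}


class RandomAgent:
    """The agent maps a state to an action; during training it explores randomly."""

    def __init__(self, action_space):
        self.action_space = action_space

    def act(self, state):
        return self.action_space.sample()


# The agent-environment loop described above, in code.
env = LineEnv()
agent = RandomAgent(env.action_space)
state, _ = env.reset()
done = False
while not done:
    action = agent.act(state)
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
```

The random agent stands in for the exploration part of training; a learning algorithm would gradually replace the random choices with learned ones, but the shape of the loop stays the same.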
Now, why do we care about safe reinforcement learning? Because an agent maximizing an optimality criterion has to be creative, and this creativity can be very dangerous. Last year we saw a story from the US Air Force where they said that their drone killed its human operator because the operator was stopping it from maximizing its reward signal. Now, that didn't actually happen; it was a hypothetical scenario, but it is a very good example of why we want to make reinforcement learning safe. Safety in this case means controlling the behaviors that the agent can discover over the training phase.

We have two options when we want to include safety in the reinforcement learning process. Either we modify the optimality criterion that the agent is maximizing: if this criterion contains some safety information and the agent maximizes it during the training phase, the agent becomes safer as well. Or we modify the agent's actions: if we do that correctly, we can prohibit certain actions from being taken and again increase the safety of the agent. I'll give you one example of each.

Let's start with modification of the optimality criterion. We begin with a Markov decision process: a set of states, a set of actions, a reward function, and a transition function. We would like to add a function H to it that contains the safety information we want to pass to the agent. In more complicated scenarios this function H is inferred from data, and for how to do that I refer you to the paper; here I'm just going to hand-engineer it because the scenario is simple. On the left you see the new optimality criterion: we now want to find an agent that maximizes the long-term sum of rewards together with the function H. That is one example of changing the optimality criterion to include safety information.

So let's imagine we have a reward function that rewards the agent with 100 points for reaching the goal and gives it minus one point otherwise; this motivates the agent to complete the goal as fast as possible. And let's say we have inferred a function H from data that punishes the agent with minus 100 points for going into the water. In a trajectory where the agent avoids all the water we get a bunch of minus ones and a hundred at the end, and the function H contributes nothing. But in a trajectory where the agent goes through water, it is punished by the function H. This lets the agent distinguish between the two trajectories, and by the time the training phase is over it will naturally avoid the water tiles. This example is obviously very simple, but the approach scales to much bigger scenarios than this.

Some properties of this approach. We get safety only during the deployment phase, not the training phase: the agent has to try all the dangerous things in order to realize that they are bad and avoid them during deployment. We also need a data set of safe behaviors, because we infer the function H from it, and those safe behaviors can be easy or difficult to obtain depending on your scenario. On the other hand, we never have to define what safety actually means, because it is implicitly contained in the data set, and defining safe behavior explicitly can be pretty hard when you think about it. If you can get a data set of safe trajectories and learn from it, that is actually pretty helpful, as we will see with the second approach.
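Here is a minimal sketch of what such a modification could look like in code, using the wrapper pattern that Gymnasium provides. The h_function below is hand-written to penalize water tiles; in the paper H would be inferred from a data set of safe behaviors, so treat this as an illustration of where H plugs in, not as the paper's method.

```python
import gymnasium as gym


class SafetyShapedEnv(gym.Wrapper):
    """Adds a safety term H to the reward, so the agent maximizes reward + H."""

    def __init__(self, env, h_function):
        super().__init__(env)
        self.h = h_function  # H: state -> safety bonus or penalty

    def step(self, action):
        state, reward, terminated, truncated, info = self.env.step(action)
        shaped_reward = reward + self.h(state)  # the modified optimality criterion
        return state, shaped_reward, terminated, truncated, info


# Hand-written stand-in for an H inferred from data:
# minus 100 points for water tiles, zero everywhere else.
WATER_TILES = {2, 3}


def h_function(state):
    return -100 if state in WATER_TILES else 0


# Usage: wrap any base environment, e.g. the LineEnv sketched earlier.
# env = SafetyShapedEnv(LineEnv(), h_function)
```

The training loop itself does not change; the agent simply sees a different reward, which is why the safety only shows up once training has finished.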
In the second approach we modify the agent's actions. In this paper the researchers use formal methods. They start with the environment, a Markov decision process, and turn it into something called a transition system. Similarly, they take a set of safety requirements that the agent should adhere to and turn them into specifications in linear temporal logic. Then they bring these two things together with reactive synthesis to generate a shield, and this shield comes with formal mathematical guarantees that the agent never takes a dangerous action.

What does this mean? Let's take a look at the first trajectory again. Here the shield isn't used and everything is exactly the same as before: we get a bunch of minus ones and a hundred at the end. But in the second trajectory, when the agent is about to go through the water, the shield prohibits that from happening at all. Instead, the shield sends the agent in some other direction, and the agent never experiences going through the dangerous state.

So here we have safety during both training and deployment. Compare this with the modification of the optimality criterion, where we only had safety during the deployment phase, not the training phase. However, it comes at the cost of having to specify the transition system, and that can be pretty difficult. If we don't know the Markov decision process, we can't specify the transition system, and if we don't know the safety specifications, we can't use reactive synthesis to generate a shield at all. So there is a trade-off: this approach is much more manual, but it lets us guarantee safety during both the training and the deployment phase.
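As a rough illustration of where a shield sits in the loop, here is a sketch in which the synthesized shield is reduced to a lookup table of allowed actions per state. In the paper the shield is generated automatically by reactive synthesis from the transition system and the LTL specifications; the hand-written table and the shielded_step helper here are my own simplification.

```python
class Shield:
    """Stand-in for a shield produced by reactive synthesis.

    It knows, for every state, which actions are provably safe. Here the
    table is hand-written; the real shield is synthesized from the
    transition system and the LTL safety specifications.
    """

    def __init__(self, safe_actions):
        self.safe_actions = safe_actions  # dict: state -> set of allowed actions

    def correct(self, state, proposed_action):
        allowed = self.safe_actions[state]  # assumes every state has a safe action
        if proposed_action in allowed:
            return proposed_action
        # Override the agent's choice before it reaches the environment, so
        # dangerous transitions never happen, even during training.
        return next(iter(allowed))


def shielded_step(env, agent, shield, state):
    """One step of the agent-environment loop with the shield in between."""
    proposed = agent.act(state)
    action = shield.correct(state, proposed)
    return env.step(action)
```

In the grid example, the table would simply exclude, from every state next to a water tile, the action that steps into the water.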
And yeah, thank you. Do we have any questions?

Hi, thank you for the presentation. I had a quick question about the shielding aspect, when we're blocking an action because we decided it's unsafe. In this circumstance, when the reinforcement learner is not able to explore in that direction because it's been blocked, and it doesn't understand that part of the state space, what if down the line it ends up in a scenario where it violates the safety property because it reached it from a different trajectory? How would we be able to train it away from that? I'm not sure if I explained that well.

Yeah, that's a good question. The agent can never violate the property if we specify the transition system correctly. If we make a mistake there, or if we don't understand the Markov decision process properly, for example if we don't know its transition function, then obviously we'll get garbage, and that's true of all formal methods: if you don't specify the transition model or the specifications correctly, then you are not guaranteeing anything. So it is completely dependent on us having that knowledge and the capability of building that transition system.

Got time for one more. There you go.

Hi, in your example the environment's reward for falling into water was only minus one, but we're saying that's deadly, that's unsafe, we need to avoid it. Why don't we address this simply by fixing the environment?

That's a good question. We can't modify the environment, because the environment is not under our control; the transitions will happen whether we want them or not. The question is how we train the agent to avoid them. In this example we punish the agent with minus 100 points, and over time it will learn, in its neural network for example, that trajectories involving water are just not worth it, and it will start avoiding them. But we can't do anything about the environment itself; the environment is given and we can't modify it in any way.

All right, y'all give Chris Pinn another round of applause. Thank you.