
Reinforcement Learning Models.



 


Reinforcement Learning (RL) is a type of machine learning where an agent learns to make decisions by interacting with an environment. The agent learns to achieve a goal or maximize some notion of cumulative reward through trial and error. The central idea of RL is to learn a policy, which is a mapping from states of the environment to actions, that maximizes the cumulative reward over time.


How do we estimate the value of a state or a state-action pair?

1. TD (Temporal Difference) Prediction:


 What is TD Prediction?

TD prediction is a technique used in reinforcement learning to estimate the value of a state or state-action pair by bootstrapping from successor states. It's like trying to predict what will happen next while you're in the middle of experiencing something. It combines ideas from dynamic programming and Monte Carlo methods.


Temporal Difference

Think of TD prediction like this: You're trying to predict what's going to happen next while watching a movie. You start with a guess about how much you'll enjoy the movie (value of the current state), then as you watch, you update your guess based on how much you're actually enjoying it (reward) and what you think will happen next (value of the next state).


How Does TD Prediction Work?

Here's how it works:


  • Initialization: Start with an initial estimate of the value function V(s) for each state s.


  • Interaction with the Environment: The agent interacts with the environment by taking actions, observing rewards, and transitioning between states.


  • Update Rule: At each time step t, the agent updates its estimate of the value function based on the observed transition from the current state sₜ to the next state sₜ₊₁ and the immediate reward rₜ₊₁, using the following update rule:


V(sₜ) ← V(sₜ) + α[rₜ₊₁ + γV(sₜ₊₁) − V(sₜ)]


Where:

 - α is the learning rate (step-size parameter), which determines how much we update our estimates based on new information.

 - γ is the discount factor, representing the importance of future rewards.

 - V(sₜ) is the estimated value of the current state sₜ.

 - rₜ₊₁ is the immediate reward obtained after transitioning from state sₜ to state sₜ₊₁.

 - V(sₜ₊₁) is the estimated value of the next state sₜ₊₁.


  • Convergence: With enough iterations, the estimated value function converges to the true value function of the policy being followed.
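
As a rough sketch of how these steps look in code, here is a minimal TD(0) prediction loop in Python. The env object with reset() and step() methods and the fixed policy function are illustrative assumptions, not part of any particular library.

from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=0.99):
    """Estimate V(s) for a fixed policy using TD(0) updates."""
    V = defaultdict(float)  # initial estimate: V(s) = 0 for every state
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)                       # follow the given policy
            next_state, reward, done = env.step(action)  # observe r_{t+1} and s_{t+1}
            # TD(0) update: V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)]
            td_target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V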




 2. SARSA Algorithm:


SARSA stands for State-Action-Reward-State-Action. It's an on-policy reinforcement learning algorithm that estimates the value of a state-action pair under a specific policy.


Learn the optimal policy with the SARSA algorithm

Imagine you're playing a video game where you need to learn which moves are the best. With SARSA, you learn by playing and remembering what you did. So, you take a move, see what happens, and then update your knowledge based on that experience. It's like learning from your own actions while you're playing the game.


Here's how SARSA works:


- Initialization: Initialize state s, choose an action a using an exploration policy (e.g., ε-greedy).

- Interaction with the Environment: Take action a, observe reward r, and transition to the next state s′.

- Policy Evaluation: Choose the next action a′ in s′ according to the current policy (e.g., ε-greedy), then update the action-value function Q(s, a) using the SARSA update rule:


Q(s, a) ← Q(s, a) + α[r + γQ(s′, a′) − Q(s, a)]


Where:

  - α is the learning rate.

  - γ is the discount factor.

  - a′ is the next action chosen according to the current policy (e.g., ε-greedy).


- Policy Improvement: Update the policy based on the updated action-value function (with ε-greedy action selection this happens implicitly), then set s ← s′ and a ← a′ and repeat.


In a grid-world example, SARSA learns the optimal policy by updating the action-values based on the transitions and rewards observed while exploring the grid.
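
Here is a minimal SARSA sketch in Python. The small discrete environment exposed through env.reset() and env.step(), and the action count n_actions, are placeholders for illustration rather than a specific library's API.

import random
from collections import defaultdict

def epsilon_greedy(Q, state, n_actions, epsilon=0.1):
    """Random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(state, a)])

def sarsa(env, n_actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy SARSA: the target uses the action actually chosen in the next state."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(Q, state, n_actions, epsilon)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(Q, next_state, n_actions, epsilon)
            # SARSA update: Q(s,a) <- Q(s,a) + alpha * [r + gamma * Q(s',a') - Q(s,a)]
            target = reward + gamma * Q[(next_state, next_action)] * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q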


 3. Q-learning Algorithm:


Q-learning is an off-policy reinforcement learning algorithm that learns the value of the best action to take in a given state.


Q-learning algorithm

Q-learning is like learning from the experiences of others. You're trying to figure out the best moves in a game by observing what happens when others play. You keep track of which moves lead to the best outcomes and gradually get better at making decisions without actually having to try every possible move yourself.


Here's how Q-learning works:


- Initialization: Initialize the Q-table, which stores the estimated value of each state-action pair.

- Interaction with the Environment: The agent interacts with the environment by taking actions, observing rewards, and transitioning between states.

- Update Rule: At each time step t, the agent updates its estimate of the value of the current state-action pair Q(sₜ, aₜ) using the Q-learning update rule:


Q(sₜ, aₜ) ← Q(sₜ, aₜ) + α[rₜ₊₁ + γ maxₐ Q(sₜ₊₁, a) − Q(sₜ, aₜ)]


Where:

  - α is the learning rate.

  - γ is the discount factor.

  

- Exploration vs. Exploitation: Choose actions either greedily, based on the current estimate of Q, or randomly (e.g., ε-greedy) to balance exploration and exploitation, as in the sketch below.
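
For comparison with SARSA, here is a minimal Q-learning sketch in Python. The only change in the update is that the target uses the greedy maxₐ Q(s′, a) rather than the action actually taken next; env, n_actions, and the hyperparameters are illustrative assumptions.

import random
from collections import defaultdict

def q_learning(env, n_actions, num_episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy Q-learning: the target uses the greedy (max) action, regardless of the action executed."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy for exploration
            if random.random() < epsilon:
                action = random.randrange(n_actions)
            else:
                action = max(range(n_actions), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Q-learning update: Q(s_t,a_t) <- Q(s_t,a_t) + alpha * [r_{t+1} + gamma * max_a Q(s_{t+1},a) - Q(s_t,a_t)]
            best_next = max(Q[(next_state, a)] for a in range(n_actions))
            target = reward + gamma * best_next * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q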


 4. Linear Function Approximation:


In reinforcement learning, when the state or action space is too large to store explicit values for each state-action pair, we often use function approximation techniques. Linear function approximation is one such technique where we approximate the value function (or policy) using a linear combination of features.


Linear function approximation

When you're trying to understand something big, you might break it down into smaller, simpler parts. Linear function approximation does something similar. It takes a big, complex problem and simplifies it using basic features. It's like summarizing a long book with just a few key points.



Here's how it works:


- Feature Representation: First, we define a set of features φ(s) (or φ(s, a) for state-action values) that describe the states or state-action pairs.

  

- Parameter Vector: We then represent the value function or policy as a linear combination of these features:


V(s) = θᵀφ(s)


Where θ is the parameter vector to be learned.


- Gradient Descent: We use techniques like stochastic gradient descent (SGD) to update the parameter vector θ in the direction that minimizes the error between the predicted and actual values.


- Update Rule: The update rule for linear function approximation can be derived using methods like gradient descent or least squares:


θ ← θ + α(Gₜ − V(sₜ))∇V(sₜ)


Where Gₜ is the target value, typically a bootstrapped estimate based on rewards and successor states (e.g., Gₜ = rₜ₊₁ + γV(sₜ₊₁)); for a linear approximator, the gradient ∇V(sₜ) is simply the feature vector φ(sₜ).


Linear function approximation is particularly useful when the state or action space is too large to enumerate explicitly, and it allows for efficient generalization across similar states or actions.
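
To illustrate, here is a sketch of semi-gradient TD(0) prediction with a linear value function V(s) = θᵀφ(s). The feature function phi(s), which is assumed to return a NumPy vector of length n_features, and the env/policy interface are placeholders for the example.

import numpy as np

def semi_gradient_td0(env, policy, phi, n_features, num_episodes=1000, alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) with a linear value function V(s) = theta^T phi(s)."""
    theta = np.zeros(n_features)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            v_s = theta @ phi(state)
            v_next = 0.0 if done else theta @ phi(next_state)
            # Bootstrapped target G_t = r_{t+1} + gamma * V(s_{t+1})
            target = reward + gamma * v_next
            # For a linear approximator the gradient of V(s) w.r.t. theta is just phi(s)
            theta += alpha * (target - v_s) * phi(state)
            state = next_state
    return theta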


 5. Deep Q-Networks (DQN):


Imagine you have a really smart friend who helps you understand a tough game. They use their knowledge and past experiences to guide you. Deep Q-Networks (DQN) work like that friend. They use a super smart computer program (a neural network) to learn the best moves in a game by looking at lots of examples and figuring out patterns. This helps you make better decisions when you play the game.


Deep Q-Networks (DQN) are a class of neural network architectures used in reinforcement learning, particularly for solving problems with high-dimensional state spaces. DQN combines deep learning techniques with Q-learning, enabling agents to learn optimal policies directly from raw sensory inputs, such as images or sensor readings. Let's break down the key components and workings of DQN:


Key Components & Stability of Deep Q-Networks

 1. Neural Network Architecture:


DQN typically consists of a deep neural network that takes the state as input and outputs the Q-values for all possible actions. The neural network can have multiple layers, such as convolutional layers followed by fully connected layers, to handle high-dimensional input spaces efficiently.


 2. Experience Replay:


Experience replay is a crucial component of DQN. Instead of updating the neural network parameters using only the most recent experience, DQN stores experiences (state, action, reward, next state) in a replay buffer. During training, mini-batches of experiences are sampled uniformly from the replay buffer. This approach breaks the correlation between consecutive experiences and stabilizes training.
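
A replay buffer can be as simple as a fixed-size deque of transitions sampled uniformly at random. The sketch below is a generic illustration, not the implementation from any particular DQN codebase.

import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer that stores (state, action, reward, next_state, done) tuples."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences are discarded automatically

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive experiences
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)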


 3. Target Network:


To further stabilize training, DQN uses a separate target network with fixed parameters. The target network is a copy of the primary network that is updated less frequently. During training, the target network is used to compute target Q-values for updating the primary network. This technique helps in mitigating divergence issues that can arise when using the same network for both prediction and target calculation.


 4. Q-Learning with Temporal Difference:


DQN employs Q-learning with temporal difference (TD) to update the Q-values. The Q-learning update rule is used to minimize the difference between the predicted Q-values and the target Q-values. The loss function is typically the mean squared error (MSE) between the predicted Q-values and the target Q-values.


 Workflow of DQN:


Working of Deep Q Networks

1. Initialization: Initialize the primary and target neural networks with random weights.

  

2. Interaction with the Environment: The agent interacts with the environment by taking actions based on the current state. At each time step, the agent selects an action using an exploration policy, such as ε-greedy, and observes the next state and reward.


3. Experience Replay: Store experiences (state, action, reward, next state) in the replay buffer.


4. Sample Mini-Batches: Sample mini-batches of experiences uniformly from the replay buffer.


5. Compute Target Q-Values: Use the target network to compute target Q-values for each experience in the mini-batch.


6. Update Neural Network: Update the parameters of the primary network using backpropagation and stochastic gradient descent to minimize the MSE loss between predicted and target Q-values.


7. Update Target Network: Periodically update the parameters of the target network to match those of the primary network.


8. Repeat: Continue interacting with the environment, sampling experiences, and updating the neural network until convergence.


Through this iterative process, DQN learns an optimal policy for the given reinforcement learning task by approximating the action-value function. The trained DQN can then be used to make decisions in real-world environments based on raw sensory inputs.


To recap, here are the key components and workings of DQN:


- Neural Network Architecture: DQN uses a deep neural network to approximate the action-value function Q(s, a; θ), where θ are the network parameters.


- Experience Replay: DQN uses experience replay, where experiences (state, action, reward, next state) are stored in a replay buffer. During training, mini-batches of experiences are sampled uniformly from the replay buffer to break the correlations between consecutive experiences.


- Target Network: To stabilize training, DQN uses a separate target network with parameters θ′ to compute target values. The target parameters are updated less frequently than the Q-network and help mitigate divergence issues during training.


- Loss Function: DQN minimizes the mean squared error (MSE) between the predicted Q-values and the target Q-values:


L(θ) = E[(r + γ maxₐ′ Q(s′, a′; θ′) − Q(s, a; θ))²]
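
Putting the pieces together, here is a minimal PyTorch sketch of a single DQN training step: it computes targets with a frozen target network and takes one gradient step on the MSE loss. The network sizes, hyperparameters, and batch format (matching the replay buffer sketch above) are illustrative assumptions.

import torch
import torch.nn as nn

def build_q_network(state_dim, n_actions):
    """Small fully connected network mapping a state vector to Q-values for all actions."""
    return nn.Sequential(
        nn.Linear(state_dim, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
    )

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step on the MSE between predicted and target Q-values."""
    states, actions, rewards, next_states, dones = batch
    states = torch.as_tensor(states, dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64).unsqueeze(1)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(next_states, dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Predicted Q-values for the actions actually taken
    q_pred = q_net(states).gather(1, actions).squeeze(1)
    # Target Q-values computed with the frozen target network
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * (1.0 - dones) * q_next

    loss = nn.functional.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodically copy the primary network's weights into the target network:
# target_net.load_state_dict(q_net.state_dict())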



