Q Learning Explained | Reinforcement Learning Using Python | Q Learning in AI | Edureka
- January 6, 2024
- Posted by: MainInstructor
- Category: Artificial Intelligence, Python
![*](https://i0.wp.com/allprowebdesigns.com/wp-content/uploads/2024/01/1704578754_maxresdefault.jpg?resize=840%2C430&ssl=1)
Hello everyone, and welcome to this interesting session on Q-learning. Let's step into the world of reinforcement learning, the beautiful branch of artificial intelligence that lets machines learn on their own, in a way different from traditional machine learning. Throughout our lives we perform a number of actions to pursue our dreams. Some of them bring us good rewards; others do not. Along the way we keep exploring different paths and figure out which action might lead to better rewards. We work hard towards our dreams, utilizing the feedback we get based on our actions to improve our strategies; the rewards help us determine how close we are to achieving our goals, and our mental state keeps continuously representing this closeness. In that description of how we pursue our goals in early life, we framed for ourselves a representative analogy of reinforcement learning.

Now let me summarize the above example, reformulating the main points of interest. Our reality contains an environment in which we perform numerous actions, and sometimes we get good or positive rewards for some of these actions as we work to achieve our goals. During the entire course of life our mental and physical states evolve, and we strengthen the actions that get us as many rewards as possible. The key entities of interest are the environment, the actions, the rewards, and the states, and this whole paradigm of exploring the environment and learning through actions, rewards, and states establishes the foundation of reinforcement learning. So reinforcement learning solves a particular kind of problem where decision-making is sequential and the goal is long-term, such as game playing, robotics, resource management, or logistics.
For a robot, an environment is the place where it has been put to use; remember, the robot itself is the agent — for example, an automobile factory where a robot is used to move materials from one place to another. The tasks we just discussed have a property in common: they involve an environment and expect the agent to learn from that environment. This is where traditional machine learning falls short, and hence the need for reinforcement learning. It is good to have an established overview of the problem that is to be solved using Q-learning, or reinforcement learning in general, as it helps to define the main components of a reinforcement learning solution: the agent, the environment, the actions, the rewards, and the states.

So let's suppose we are to build a few autonomous robots for an automobile-building factory. These robots will help the factory personnel by conveying the necessary parts they need in order to build a car. These different parts are located at nine different positions within the factory warehouse; the car parts include the chassis, wheels, dashboard, engine, and so on. The factory workers have given top priority to the location that contains the body, or chassis, and they have provided priorities for the other locations as well, which we will look into in a moment. The locations within the factory look something like this: we have stations L1, L2, L3, and so on up to L9, and one thing you might notice is that there are little obstacles present in between the locations. L6 is the top-priority location, since it contains the chassis needed to prepare the car bodies. Our task is to enable the robots to find the shortest route from any given location to any other location on their own. The agents in this case are the robots, and the environment is the automobile factory warehouse.
Now let's talk about the states. The state is the location in which a particular robot is present at a particular instant of time; that location denotes its state. Machines understand numbers rather than letters, so let's map the location codes to numbers: we map location L1 to state 0, L2 to state 1, and so on, until L8 is state 7 and L9 is state 8. Next, let's talk about the actions. In our example, an action is the direct move a robot can make from a particular location to another. Consider a robot at location L2: the direct locations to which it can move are L5, L1, and L3, and the figure here may come in handy to visualize this. As you might have already guessed, the set of actions is nothing but the set of all possible states of the robot, and for each location the set of actions a robot can take will be different. For example, the set of actions changes if the robot is in L1 rather than L2: from L1 it can only go to L4 and L2 directly. Now that we are done with the states and the actions, let's talk about the rewards. The states are basically 0, 1, 2, and so on up to 8, and the actions likewise are 0, 1, 2, and so on up to 8. A reward will be given to a robot if a location (that is, a state) is directly reachable from its current location.
Let's take an example: suppose L9 is directly reachable from L8. If a robot goes from L8 to L9 or vice versa, it will be rewarded with 1, and if a location is not directly reachable from a particular location, we give a reward of 0. The reward is just a number here, nothing else; it enables the robot to make sense of its movements, helping it decide which locations are directly reachable and which are not. With this cue, we can construct a reward table that contains all the reward values mapping between all possible states. In the table, the positions marked green have a positive reward, and the table lists all the possible rewards a robot can get by moving between the different states.
Now comes an interesting decision. Remember that the factory administrator prioritized L6 as the topmost location. How do we incorporate this fact into the above table? We do it by associating the topmost-priority location with a much higher reward than the usual ones: let's put 999 in the cell (L6, L6). The table of rewards, with the higher reward for the topmost location, now looks something like this. We have now formally defined all the vital components of the solution we are aiming at for the problem discussed.
Now we will shift gears a bit and study some of the fundamental concepts that prevail in the world of reinforcement learning and Q-learning. First of all, we'll start with the Bellman equation. Consider the following square of rooms, which is analogous to the actual environment from our original problem, but without the barriers. Suppose a robot needs to go to the room marked in green from its current position, A, using the specified direction. How can we enable the robot to do this programmatically? One idea would be to introduce some kind of footprint the robot will be able to follow: a constant value is placed in each of the rooms that come along the robot's way if it follows the direction specified above. In this way, if it starts at location A, it will be able to scan for this constant value and move accordingly. But this will only work if the direction is fixed in advance and the robot always starts at location A. Now consider the robot starting at a different location. The robot now sees footprints in two different directions, and it is therefore unable to decide which way to go in order to reach the destination, the green room. This happens primarily because the robot does not have a way to remember the directions to proceed in. Our job now is to enable the robot with a memory, and this is where the Bellman equation comes into play.
The main purpose of the Bellman equation here is to enable the robot with exactly that memory. The equation goes something like this:

V(s) = max_a [ R(s, a) + γ · V(s′) ]

Here s is a particular state (a room), a is an action (moving between rooms), s′ is the state the robot goes to from s, and γ is the discount factor — we'll get into it in a moment. R(s, a) is the reward function, which takes a state s and an action a and outputs the reward, and V(s) is the value of being in a particular state, which is the footprint. We consider all the possible actions and take the one that yields the maximal value. There is one constraint, however, regarding the value footprint: the room marked in yellow, just below the green room, will always have a value of 1, to denote that it is one of the rooms nearest to the green room. This also ensures that the robot gets a reward when it goes from the yellow room to the green room.
Let's see how to make sense of this equation. Assume a discount factor of 0.9 — remember, γ is the discount factor, so let's take 0.9. For the room just below the yellow one (the room marked with an asterisk), what will V(s), the value of being in that state, be? For this room,

V(s) = max_a [ R(s, a) + γ · V(s′) ] = max_a [ 0 + 0.9 × 1 ] = 0.9

The robot gets no reward for going to a state marked in yellow, hence R(s, a) is 0 here, but the robot knows the value of being in the yellow room, hence V(s′) is 1. Following this for the other states, we get 0.9; if we plug 0.9 back into the equation, we get 0.81, then 0.729, and then we again reach the starting point. So this is how the table looks with the value footprints computed from the Bellman equation. A couple of things to notice here: the max function helps the robot always choose the state that gives it the maximum value of being in that state, and the discount factor γ notifies the robot about how far it is from the destination; it is typically specified by the developer of the algorithm that is installed in the robot. The other states can be given their respective values in a similar way: the boxes adjacent to the green one have a value of 1, and as we move away from it we get 0.9, 0.81, 0.729, and finally 0.6561. The robot can now proceed on its way to the green room utilizing these value footprints even if it is dropped at any arbitrary location. And if the robot lands in the highlighted sky-blue area, it will still find two options to choose from, but eventually either of the paths will be good enough for the robot to take, because of the way the value footprints are now laid out.
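Since the rooms grid itself is only shown on screen, here is a minimal sketch of this computation, assuming for simplicity a straight chain of rooms leading to the goal, with the room next to the green one pinned to a value of 1 as in the video:

```python
# Deterministic Bellman update V(s) = max_a [ R(s, a) + gamma * V(s') ]
# on a toy chain of rooms (illustrative layout).
gamma = 0.9      # discount factor
n_rooms = 5      # rooms in the chain

V = [0.0] * n_rooms
V[0] = 1.0       # the room next to the green room always has value 1

# Sweep until the footprints settle; R(s, a) is 0 along the way.
for _ in range(n_rooms):
    for s in range(1, n_rooms):
        neighbours = [s - 1] + ([s + 1] if s + 1 < n_rooms else [])
        V[s] = max(0 + gamma * V[n] for n in neighbours)

print([round(v, 4) for v in V])  # [1.0, 0.9, 0.81, 0.729, 0.6561]
```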
One thing to note here is that the Bellman equation is one of the key equations in the world of reinforcement learning and Q-learning. Now, if we think realistically, our surroundings do not always work the way we expect: there is always a bit of stochasticity involved, and this applies to robots as well. Sometimes the robot's machinery may get corrupted; sometimes the robot may come across hindrances on its way that it could not have known about beforehand; and sometimes, even if the robot knows it needs to take the right turn, it will not. So how do we introduce this stochasticity into our case? Here comes the Markov decision process. Consider that the robot is currently in the red room and needs to get to the green room.
Let's now consider that the robot has a slight chance of malfunctioning, and might take the left, the right, or the bottom turn instead of the upper turn it needs in order to reach the green room from where it currently is (the red room). The question is: how do we enable the robot to handle this once it is out in the given environment? This is a situation where the decision-making about which turn to take is partly random and partly under the control of the robot — partly random because we are not sure when exactly the robot might malfunction, and partly under the robot's control because it is still making the decision to take a turn on its own, with the help of the program embedded in it. A Markov decision process is a discrete-time stochastic control process. It provides a mathematical framework for modeling decision-making in situations where the outcomes are partly random and partly under the control of the decision maker.
We need to give this concept a mathematical shape, most likely an equation, which can then be taken further — and you might be surprised that we can do this with just a few minor tweaks to the Bellman equation. Have a look at the original equation again:

V(s) = max_a [ R(s, a) + γ · V(s′) ]

What needs to change in it so that we can introduce some amount of randomness? As long as we are not sure when the robot might fail to take the expected turn, we are also not sure which room it might end up in — which is simply the room it moves to from its current room. At this point, according to the equation, we are not sure of s′, the next state (or room), but we do know all the probable turns the robot might take. In order to incorporate each of these probabilities into the above equation, we need to associate a probability with each of the turns, to quantify the chance of the robot taking that turn. If we do so, we get:

V(s) = max_a [ R(s, a) + γ · Σ_{s′} P(s, a, s′) · V(s′) ]

Here P(s, a, s′) is the probability of moving from room s to room s′ with action a, and the summation is the expectation over the randomness that the robot incurs.
Now let's take a look at this example. When we associate probabilities with each of these turns, we essentially mean that there is, say, an 80% chance the robot will take the upper turn. If we put all the required values into our equation, we get:

V(s) = max_a [ R(s, a) + γ · (0.8 · V(room up) + 0.1 · V(room down) + 0.05 · V(room left) + 0.05 · V(room right)) ]

(note that the probabilities of all the possible turns have to add up to 1).
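To make the expectation concrete, here is a minimal hedged sketch of that computation for a single intended action; the neighbouring-room values below are made-up illustrations, not numbers from the video:

```python
gamma = 0.9

# Hypothetical footprints of the four neighbouring rooms
# (illustrative values only).
V_up, V_down, V_left, V_right = 1.0, 0.6561, 0.729, 0.729

# Slip model for the intended "up" action: 80% up, 10% down,
# 5% left, 5% right.
expected_next_value = (0.8 * V_up + 0.1 * V_down
                       + 0.05 * V_left + 0.05 * V_right)

reward = 0  # R(s, a): no reward for this intermediate move
value_of_up = reward + gamma * expected_next_value
print(round(value_of_up, 4))  # 0.8447
```

The full Bellman backup would repeat this for every available action and take the maximum.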
Note that the value footprints will not change just because we are incorporating stochasticity here — but this time, we will not calculate those value footprints by hand; instead, we will let the robot figure them out. Up until this point we have not considered rewarding the robot for the action of going into a particular room: we only reward the robot when it gets to the destination. Ideally there should be a reward for each action the robot takes, to help it better assess the quality of its actions. The rewards need not always be the same, but having some amount of reward for each action is much better than having no rewards at all, and this idea is known as the living penalty. In reality, the reward system can be very complex, and modeling sparse rewards in particular is an active area of research in the domain of reinforcement learning.
So by now we have an equation that gives us the value of going to a particular state while taking the stochasticity of the environment into account, and we have also learned, very briefly, about the idea of the living penalty, which deals with associating each move of the robot with a reward. What we're going to do now is transition to Q-learning. Q-learning poses the idea of assessing the quality of an action that is taken to move to a state, rather than determining the possible value of the state being moved to. Earlier we had terms like 0.8 · V(s₁), 0.1 · V(s₂), 0.05 · V(s₃), and so on. If we incorporate the idea of assessing the quality of the action of moving to a certain state, the picture of the environment, the agent, and the quality of the actions looks like this: instead of 0.8 · V(s₁) we have Q(s₁, a₁), then Q(s₂, a₂), Q(s₃, a₃), and so on.
The robot now has four different states to choose from, and along with that there are four different actions for the current state it is in. So how do we calculate Q(s, a), that is, the quality of a possible action the robot might take? Let's break it down. From the equation

V(s) = max_a [ R(s, a) + γ · Σ_{s′} P(s, a, s′) · V(s′) ]

if we discard the max function, we are left with

R(s, a) + γ · Σ_{s′} P(s, a, s′) · V(s′)

Essentially, in the equation that produces V(s) we are considering all possible actions and all possible states reachable from the current state the robot is in, and then taking the maximum value caused by taking a certain action. Without the max, the expression produces a value footprint for just one possible action — in fact, we can think of it as the quality of that action:

Q(s, a) = R(s, a) + γ · Σ_{s′} P(s, a, s′) · V(s′)

Now that we have an equation to quantify the quality of a particular action, we are going to make a little adjustment to it. We can say that V(s) is the maximum of all the possible values of Q(s, a); let's utilize this fact and replace V(s′) as a function of Q:

Q(s, a) = R(s, a) + γ · Σ_{s′} P(s, a, s′) · max_{a′} Q(s′, a′)

So the equation for V has now turned into an equation for Q — the quality. But why would we do that? This is done to ease our calculations, because now we have only one function, Q, to calculate, which is also at the core of dynamic programming, and R(s, a) is a quantified metric which produces the reward for moving to a certain state. The qualities of the actions are called Q-values, and from now on we will refer to the value footprints as Q-values.
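As a small sketch of this backup for a single state-action pair (all the numbers below are illustrative assumptions, not values from the video):

```python
# One Q-value backup:
# Q(s, a) = R(s, a) + gamma * sum over s' of P(s, a, s') * max over a' of Q(s', a')
gamma = 0.9
R_sa = 0.0                                     # reward for action a in state s

P = [0.8, 0.1, 0.1]                            # P(s, a, s') over three successors
Q_next = [[0.9, 0.5], [0.4, 0.6], [0.2, 0.3]]  # Q-value estimates at each successor

Q_sa = R_sa + gamma * sum(p * max(q) for p, q in zip(P, Q_next))
print(round(Q_sa, 3))  # 0.729
```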
Now on we will refer to the value footprints as the Q values an important piece of the puzzle is the temporal difference now temporal difference is the component that will help the robot calculate the Q values with respect to the changes in the environment over time so consider our robot is currently in
The mark state and it wants to move to the upper state one thing to note that here is that the robot already knows the q-value of making the action that is moving to the upper state and we know that the environment is stochastic in nature and the reward that the robot
Will get after moving to the upper state might be different from an earlier observation so how do we capture this change and the real difference we calculate the new Q s comma a with the same formula and subtract the previously known Q si from it so this will in turn
This in turn gives us the temporal difference:

TD(s, a) = R(s, a) + γ · max_{a′} Q(s′, a′) − Q_{t−1}(s, a)

The equation we just derived gives the temporal difference in the Q-values, which further helps to capture the random changes the environment may impose. The Q-value, Q(s, a), is then updated as follows:

Q_t(s, a) = Q_{t−1}(s, a) + α · TD_t(s, a)

Here α is the learning rate, which controls how quickly the robot adapts to the random changes imposed by the environment; Q_t(s, a) is the current Q-value and Q_{t−1}(s, a) is the previously recorded one. If we replace TD(s, a) with its full-form equation, we get:

Q_t(s, a) = Q_{t−1}(s, a) + α · [ R(s, a) + γ · max_{a′} Q(s′, a′) − Q_{t−1}(s, a) ]

This is the final equation of Q-learning.
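In code, a single update step looks like this (a minimal sketch with illustrative numbers):

```python
alpha, gamma = 0.9, 0.75    # learning rate and discount factor

Q_old = 0.5                 # previously recorded Q(s, a)
reward = 1                  # R(s, a) observed for the move
max_Q_next = 0.8            # max over a' of Q(s', a')

TD = reward + gamma * max_Q_next - Q_old    # temporal difference
Q_new = Q_old + alpha * TD                  # updated Q(s, a)
print(round(Q_new, 2))  # 1.49
```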
Let’s see how we can implement this and obtain the best path for any robot to take now to implement the algorithm we need to understand the warehouse location and how that can be mapped to different states so let’s start by recollecting the sample environment so
As you can see here we have L 1 l 2 L 3 – L line and as you can see here we have certain borders also so first of all let’s map each of the other locations in the warehouse two numbers are the states so that it will ease our calculations right
What I'm going to do is create a new Python 3 file in the Jupyter notebook and name it q-learning-numpy. Let's define the states — but before that, we need to import NumPy, because we're going to use it throughout, and initialize the parameters, gamma and alpha: gamma is 0.75, which is the discount factor, whereas alpha is 0.9, which is the learning rate. Next we define the states and map them to numbers; as I mentioned earlier, L1 is 0, and so on until L9, so the states are defined in numerical form. The next step is to define the actions which, as mentioned above, represent the transitions to the next states — here we have an array of actions from 0 to 8. Then we define the reward table: as you can see, it's the same matrix that we created just now.
Now, if you look at it carefully, the matrix itself doesn't encode any hard barrier limitation as depicted in the image: the transition L4 to L1, for example, is allowed, but its reward will be zero, which discourages that path — in tougher situations we could even put a −1 there so that it gets a negative reward. In the snippet above, we took each of the states and put 1s for the states that are directly reachable from it; if you refer to the reward table we created once again, this construction will be easy to understand. One thing to note, though, is that we have not yet considered the top priority of location L6. We will also need an inverse mapping, from the states back to their original locations, which will be cleaner when we reach the deeper parts of the algorithm: for that we take each distinct state and location pair and convert it back.
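In code, that inverse map is a one-line dictionary comprehension (continuing from the setup above):

```python
# Inverse mapping: numeric state back to its location label
state_to_location = {state: location
                     for location, state in location_to_state.items()}
```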
Now we'll define a function, get_optimal_route — don't worry, the code is big, but I'll explain each and every bit of it. The get_optimal_route function will take two arguments, the starting location in the warehouse and the end location in the warehouse, respectively, and it will return the optimal route for reaching the end location from the starting location in the form of an ordered list containing the location letters. We'll start the function by initializing the Q-values to be all zeros.
But before that, we copy the reward matrix to a new one — this is rewards_new — and we get the ending state corresponding to the ending location. With this information we automatically set the priority of the given ending state to the highest value: we are not hard-coding it any more, but setting it to 999 for whichever ending state is passed in. Then we initialize the Q-values to 0, and the Q-learning process begins. We iterate with i in range(1000) and pick up a state randomly using np.random.randint; for traversing the neighbouring locations in the maze, we iterate through the new reward matrix and get the actions that are greater than 0. After that, we pick an action randomly from the list of playable actions, which leads us to the next state. We compute the temporal difference, TD, which is the reward plus gamma times the Q-value of the best action from the next state (using np.argmax over the Q-values of the next state), minus the Q-value of the current state-action pair. We then update the Q-values using the Bellman equation — the Q-learning update we derived earlier. After that, we initialize the optimal route with the starting location; here we do not know the next location yet, so we initialize it with the value of the starting location as well. We also do not know the exact number of iterations needed to reach the final location, hence a while loop is a good choice for the iteration: we fetch the starting state, fetch the action with the highest Q-value pertaining to the starting state, and get the index of the next state; but we need the corresponding letter, so we use the state_to_location mapping we just defined, and then update the starting location for the next iteration. Finally, we return the route.
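Since only the narration survives here, the following is a reconstruction of the function as described, continuing from the setup above; treat it as a sketch of what was shown on screen rather than the exact original code:

```python
def get_optimal_route(start_location, end_location):
    # Work on a copy so the original reward table stays intact
    rewards_new = np.copy(rewards)

    # Automatically give the ending state the highest priority (999)
    ending_state = location_to_state[end_location]
    rewards_new[ending_state, ending_state] = 999

    # Initialize all Q-values to zero
    Q = np.zeros([9, 9])

    # Q-learning process
    for i in range(1000):
        # Pick a state at random
        current_state = np.random.randint(0, 9)
        # Collect the playable actions (directly reachable states)
        playable_actions = []
        for j in range(9):
            if rewards_new[current_state, j] > 0:
                playable_actions.append(j)
        # Pick one playable action at random; it leads to the next state
        next_state = np.random.choice(playable_actions)
        # Temporal difference
        TD = (rewards_new[current_state, next_state]
              + gamma * Q[next_state, np.argmax(Q[next_state, :])]
              - Q[current_state, next_state])
        # Update the Q-value with the Bellman (Q-learning) equation
        Q[current_state, next_state] += alpha * TD

    # Greedily follow the highest Q-values from start to end
    route = [start_location]
    next_location = start_location
    while next_location != end_location:
        starting_state = location_to_state[start_location]
        next_state = np.argmax(Q[starting_state, :])
        next_location = state_to_location[next_state]
        route.append(next_location)
        start_location = next_location
    return route

print(get_optimal_route('L9', 'L1'))  # e.g. ['L9', 'L8', 'L5', 'L2', 'L1']
```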
So let's take a starting location of L9 and an end location of L1 and see what path we actually get. As you can see, we get L9, L8, L5, L2, and L1, and if you have a look at the image of the warehouse, starting from L9 the route through L8, L5, L2, and L1 is indeed the one that yields the maximum reward for the robot.
So now we have come to the end of this Q-learning session, and I hope you got to know what exactly Q-learning is, with the analogy running all the way from the rooms example. I hope the example and the analogy I took were good enough for you to understand Q-learning: the Bellman equation, how to make the tweaks to the Bellman equation, how to create the reward table and the Q-table, how to update the Q-values using the Bellman equation, and what alpha and gamma do. So guys, if you liked this video, give it a thumbs up, and if you want to see more videos like this and stay technologically updated, do subscribe to our channel — there are a lot of videos regarding deep learning, AI, machine learning, and all the different types of technology; we even have blockchain with us. If you have any queries regarding this session, please feel free to mention them in the comment section below. Till then, thank you and happy learning! I hope you have enjoyed listening to this video; please be kind enough to like it, and you can comment any of your doubts and queries and we will reply to them at the earliest. Do look out for more videos in our playlist, and subscribe to the Edureka channel to learn more. Happy learning!
Got a question on the topic? Share it in the comment section below. Please drop a comment if you need the datasets and code discussed in this video. For the Edureka Python course curriculum, visit our website: http://bit.ly/2OpzQWw