Python Tutorial | Python Course | Intellipaat
- June 7, 2024
- Posted by: MainInstructor
![*](https://i0.wp.com/allprowebdesigns.com/wp-content/uploads/2023/12/1703325760_maxresdefault.jpg?resize=840%2C430&ssl=1)
[Music] Hey guys, welcome to this session by Intellipaat. This is a Python full course tutorial, so in this session we'll be learning about Python. As you know, Python is the most popular language among programming beginners, so this tutorial will be very helpful for you. Also, guys, before moving on with this session, please subscribe to our channel so that you don't miss our upcoming videos.

Now let's take a quick glance at the agenda. We'll start off with the introduction to Python and then move ahead to Python variables and tokens. Once that is done, we'll look into Python data types. After that, we'll move on to the conditional and looping statements in Python. Once we finish that, we'll move on to functions in Python and also file handling in Python. The final topic will be Python arrays. After that, we'll look into the top interview questions asked in interviews for Python developer roles, and finally we'll end with a quiz. Also, guys, if you're looking for end-to-end training on Python, we do provide that, and you can check those details in the description. Now let's begin the session.

So what is Python? Well, ladies and gentlemen, Python is a very nice programming language to work with at the basic level. If you're wondering what it is used for and what exactly it is, let me answer that. Python is an object-oriented language, where all data is treated as individual objects and worked with as such; that is what object-oriented basically means. It is also a high-level programming language: we don't need very complex syntax to work with it, it is easily understandable, and it provides very high readability. What do we mean by high readability? It means that even a non-programmer or a beginner can understand a lot of the code when looking
through a Python program. It's as simple as that. Python is primarily used for web development, application development, and much more. We know how excited you are about web development, data science, artificial intelligence, deep learning, and machine learning, and you might already know that Python is used for all of these applications. There's a reason for that: Python is widely ranked as the world's number one programming language. Java used to have the bragging right of being called the world's number one programming language; then Python came in, and it has pretty much remained in the top spot since.

The next question you might have is: is Python a new language or an old one? Very experienced developers among us, who have been working with Java, C++, or C# all their lives, might wonder the same. Well, it's not exactly new: Python has been in existence since the late 1980s, so it is getting close to 40 years old, ladies and gentlemen. The important thing to understand is that it gained huge popularity at the beginning of the 21st century, and, as I mentioned on the previous slide, it has been rising ever since. Here's the person who created Python: his name is Guido van Rossum, and he's an amazing developer. You can check out more about him after the session, of course.

Now, coming to some quick Python facts that you may or may not know; do let me know if you know any of these. Let's consider Firefox: the Firefox code base is said to include over 230,000 lines of Python that support the working of the browser, although the browser as a whole is written mostly in other languages such as C++ and JavaScript. How amazing is that, ladies and gentlemen? Then, coming to Microsoft:
Microsoft, of course, has its own editor, which they call Visual Studio Code. It is used for Python development worldwide, and it is available for free. Python itself is open source, if you've been wondering: it is used by the community, developed by the community, and implemented very well by the community. Then, coming to Netflix: did you know that Netflix makes a lot of use of Python, especially in the field of data science? That covers everything from automatic image recognition to recommendations for the TV shows and movies we watch on Netflix, and much more. Then, coming to Uber, which I'm pretty sure everyone knows about: they make use of Jupyter notebooks and IPython to share and store data about driver and client interactions, ratings, and much more. As you can see on your screen, these are some beautiful uses of Python.

On that note, let's come to the topic of why one should learn Python. The first point I would give you is how simple it is and how beautiful the readability is. What do I mean? Well, here's a simple Java program that prints the statement hello world. We've all been in our first stages of programming, where the first piece of code we wrote may well have been the hello world program. So here is how you would do it in Java. And how do we do it in Python? Look at how elegant, how beautiful, and how simple it looks, ladies and gentlemen. I'm pretty sure that if you're a beginner and had to start writing code, you would prefer Python for this. There are no semicolons, it is very readable, and even if you do not know a
programming language, you can guess that `print("hello world")` is probably equivalent to producing an output that says hello world. You can at least take a guess and understand it. But as soon as you look at the Java code, you might be taken aback and think, whoa, what's happening? This is one of the most beautiful reasons why I personally prefer Python, and why thousands and thousands of developers prefer Python as well.

Coming to the next point, I believe it's convenience. Of course, this point applies to multiple programming languages out there, but the specialty of Python is that, since it's platform independent, you can run it on Windows, on macOS (Apple's operating system), and on any of your favorite operating systems. In fact, if you want to run Python on a PlayStation, you can do that as well, ladies and gentlemen. That makes it very convenient for developers, for the people who prefer Windows, and of course for those who prefer macOS to everything else. Understanding this point is vital at this stage.

Next is cross-language operation. What do I mean by that? Python can talk and work well with many of the languages that already exist. For example, we have C, Java, C++, JavaScript, languages such as Rust, and .NET; whether you call them languages, platforms, frameworks, or libraries, Python brings all of this together. If you are an expert in JavaScript and you're trying to work with JavaScript from Python, yes, you can do that. Can you run some Java code using
Python? Of course. And can you go on to use .NET alongside Python? Of course. As you can see where this trend is heading, it is the number one language of choice for a reason, and I believe this is one of those reasons. What do you guys think? Head to the comment section and let me know.

On that note, let us quickly walk through the procedure for installing Python on Windows. It's a very simple and straightforward process, and I've broken it down into a few steps, so let's go through it together. Step one says go to python.org and head to the Downloads page, so let's do that side by side. Let me head to the Python website, python.org, and go to Downloads; you can click directly on the link there as well. Since we're on Windows, the site pretty much detects that, so we are already at the page where we hit Download, which forms our second step as well. Let me quickly save it on the desktop. It's about 25.2 megabytes, so it will take a few seconds to download; in the meantime, let's check the next steps on our PowerPoint slide.

Step two, again, is to click Download, which we just did. As of this video being made, Python 3.8.0 is the latest stable version out there. That brings us to step three, which is opening the installer, where we should be presented with the setup screen. So let's wait for the download to complete, and then we'll open the installer.

Hey guys, that took about a minute to download, and we're presented with the installation screen. Just before you hit Install Now, you can choose to customize the installation and install it at a location of your choice. Make sure you tick the checkbox that says Add Python 3.8 to PATH. What it
basically does is set your environment variables: it adds Python to the PATH variable, so as soon as you type `python` in your command line you can start using Python directly. That's what it does, and I would recommend that you do this. Even if you don't, there is no problem; there are other ways to launch Python and use it. But if you are a beginner and you just want to type `python` and start coding, tick the Add Python to PATH option and hit Install Now. This asks me for my admin password, because we have to install it as an administrator, so let me quickly type in my administrator password. I've entered my password, and it should take around two or three minutes to get the entire setup process done.

In the meantime, let's check step four. Step four, again, is to check the last option to add Python to the PATH and install. Step five is the important step: we need to verify that our installation has been successful. For that, our setup needs to complete, so let's wait a minute.

So guys, the setup is now successful. Let me quickly open Command Prompt to verify that Python has been installed correctly. If I type `python` there, the Python console is ready for us. Let's type in a sample line of code that prints hello world, and as soon as I run it, we have the output. As of this moment, Python 3.8.0 is the official stable release of Python, and that's what we're using. We can check another code snippet as well. Let me set `a = 1`, `b = 2`, and `c = 3`. Let us create a new variable `d` and assign it the sum of `a` and `b`. We can print `d`, and `d` is 3. Now let's
create another variable, call it `e`, that multiplies `d` and `c`, and print `e`. I'm doing this just to show you that everything is working fine: 3 multiplied by 3 is 9. To get out of the console, type `quit` followed by parentheses, and you're out of there. So installing Python on Windows is as simple as this, and we just verified it using a simple piece of code.

Python also comes with something called IDLE. IDLE provides an interactive Python shell that you can use, and the shell is ready, so we can type a print statement again, this time with hello learners, and done. So Python 3.8.0 works perfectly in the IDLE shell as well; this is another way to verify the installation.

Coming back to the presentation, step seven is an optional step, but you will get there very soon, whether you're a beginner or an advanced learner: install the libraries that you want. Python supports many libraries, and the most popular ones include TensorFlow, NumPy, SciPy, scikit-learn, pandas, Matplotlib, Keras, PyTorch, LightGBM, and many more. The ones on your screen are oriented toward machine learning and deep learning. So the next step after installing Python is to install the packages and libraries that you want to work with further on.

Now we have Python variables. This is the first thing we are going to learn about when it comes to Python. So what are variables? A variable is simply a container: it refers to a memory location where the value that you want to store is kept. In Python, the data type is identified according to the data we provide, and I'll show you what that means in a moment. A variable name should start with a letter or an underscore and cannot start with a number. There are two ways of assigning a
variable: you can assign a single value, or you can assign multiple values. Before we begin, let me show you how you can execute some Python. Since I'm showing you some really common tasks, I'll be doing it in the shell. The shell is not the way to write code for large applications, but when you're trying out small things and trying to understand what the output of one command or another will be, the shell is a great environment to work in. Here's how I do it: I go to the Start button, type python, and it shows Python 3.7, so I open it, and here it is. Let me make one change so that you can see it properly: I'm going to increase the font size to 28 and click OK. Hopefully it's visible.

Now I can type any code that I want. I will type p-r-i-n-t, `print`. `print` is a function that takes in something I want to print on the console. A function always gets called with parentheses, opening and closing, and inside them, in quotes, we're going to write hello world. Hopefully you're writing it with me. Once you're done, press Enter, and as you can see, it got printed on the screen. Why did we have to put quotation marks here? Why did we have to put the parentheses there? We'll discuss all of that in a future video. But note that you've written your first line of Python code, and it was really easy: you only needed one line of code, and it did something. Some languages need several lines of boilerplate just to print something on the console, but here we were able to do it with just one line. This is one of the major advantages of using Python.

Now we come to data types. As we have already discussed under Python variables, the data type will be
identified according to the data that we provide.

So what exactly is a data type? A data type is basically something that tells us what kind of data we are storing in a variable. Let's say that I want to create a variable named `name` and store a name in it, say John. Now I want to check what data type it is, so I enter `type(name)`, and it reports the class `str`, so it's a string. A string is nothing but some characters strung along: it could be a sentence, a paragraph, basically any text that you need to store. If you want to store one continuous long line of text, you do it using a string.

There are many other kinds of data types as well. Let's say that you want to store a whole number, a number with no fractional part: you use an integer for that. To give an example, let's say `num = 1`. As you can see, 1 has no fractional part; it's a whole number. I press Enter, and if I look at the type of `num`, I get `int`. So integers store numbers with no fractional part. On the other hand, if I store a number with a fractional part, let's say 1.5, and check the type of `num` now, it has changed from integer to float. A float stores numbers with fractional parts, so if you wanted to store the value of pi, you would store it as a float.

Now, you don't have to declare what type you will be using, because Python takes care of it for you. This is what is known as dynamic typing, and it goes hand in hand with duck typing: Python takes a look at the value that the variable points to, understands that it is, say, a floating-point number, and so it can just
assign the float class to it.

Now the main question is: why do we need different kinds of data types? There are many reasons. The first and foremost reason is that types let us distinguish between different kinds of data. We can, for instance, add two numbers, so 1 + 1 gives us 2. If the computer did not know what kind of data it was dealing with and tried to add it, it would have no idea how to perform the operation. Let's say that instead of 1 + 1, I gave it `'b' + 'c'`. What do you think will happen in this scenario? If you want to take a guess, pause the video for a moment, think about what's going to happen, and then resume. What happens is that it concatenates the strings: as you can see, it took b and appended the next string to it. I could even make the strings a little longer, and as you can see, it joined those two strings together. Now, if one of these was not a string, and I were to use a number instead, what do you think would happen? We get an error. Let's say you were taking an exam and were asked to add a plus 15: that would make no sense. The computer likewise has no idea how to process it: it cannot tell whether it should add two numbers, convert the number to a string, or do something else. So data types are what allow the computer to comprehend the kind of data that you're giving it.

There are other reasons as well, another being determining the size of a variable. As a simplified picture, the size of a string is determined by the number of characters it holds: at one byte per character, a string such as abcd needs four bytes for its characters (actual sizes in Python are larger, because every object carries some overhead). A number, on the other hand, might occupy two bytes or four bytes, depending on the interpreter you have in the language
or the environment you have. So types also allow the computer to understand how much memory to allocate for a specific variable. This is why we use data types. Hopefully that's clear to you, so let's move on.

Now we'll take a look at how to assign values to a variable. I've already given you a little taste of how that happens, but let's take a look at it properly. We can assign a single value, or we can assign multiple values. I've already shown you how to assign a single value, but let's do it again. Let's say that I want to store someone's age, and they are 25: I type `age = 25` and press Enter. This assigns a single value to a variable; I'm assigning 25 to `age`. If I were to type `name, age = 25`, this would show me an error, because Python does not know how to unpack this value: there is just one value, and I'm expecting two.

As we've already seen, there are naming rules as well: you can't give a variable a name that begins with a number. If I wanted to name something `_name`, that would work just fine, as you can see. But if I were to start the name with any other symbol, such as a dollar sign, or with a digit, it would cause an error. So a name that starts with anything other than a letter, a to z, lowercase or uppercase, or an underscore is going to cause an error. Always make sure that you're naming your variables correctly. Note also that a variable name cannot start with a number, but it can end with one: if I were to say `name1`, this would work just fine. If I were to say `name$`, this would cause an error, because the dollar sign is not a character that Python allows in names. So make sure that you
are naming it correctly, and always name your variables in a way that helps you understand what exactly you're trying to write. You can start a name with two underscores as well; that works fine, no matter how many underscores you use. So these are the variables that we have declared so far, and we have assigned a single value to each.

Now let's take a look at how to assign multiple values, which is also quite simple. As you can see here, if you want to assign one value to multiple variables, you can do what is done in the first line, which is `a = b = c = 10`. That means: set all the variables on the left-hand side, `a`, `b`, and `c`, to 10. On the other hand, you can assign different values to several variables at once using commas. Let's see how that works. If I type `a = b = c = 15` and print `a`, it's 15; `b` is 15; `c` is 15. Now, if I want to assign 10, 15, and 20 to them instead, I use commas rather than repeated equals signs: I type `a, b, c = 10, 15, 20`, meaning 10 is assigned to `a`, 15 to `b`, and 20 to `c`. I press Enter and check `a`, `b`, and `c`. I have to make sure that I'm using the correct ordering, because order matters here: if I were to write `b, c, a` on the left instead, then `b` would be assigned 10, `c` would be assigned 15, and `a` would be assigned 20.
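
The assignment and naming rules described above can be tried out in one short script. This is only a minimal sketch: the names `years`, `x2`, `y2`, `z2`, and `candidates` are made up for illustration, and `str.isidentifier()` is used here as a convenient way to test a name without triggering a syntax error (it does not check for reserved keywords).

```python
# Single assignment: one name, one value.
age = 25

# Chained assignment: every name on the left refers to the same value.
a = b = c = 15

# Assigning several values at once: names and values are matched by position.
x, y, z = 10, 15, 20

# Reordering the names changes which value each one receives.
y2, z2, x2 = 10, 15, 20

# Unpacking a single integer into two names fails with a TypeError.
try:
    name, years = 25
except TypeError as exc:
    unpack_error = str(exc)

# Check which candidate names follow the naming rules.
candidates = ["_name", "__name", "name1", "1name", "$name", "name$"]
valid = {text: text.isidentifier() for text in candidates}

print(age, a, b, c)
print(x, y, z)
print(x2, y2, z2)
print(unpack_error)
print(valid)
```

Running it shows that only the names starting with a letter or an underscore, and containing no `$`, are reported as valid.
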
So make sure that the order of the names matches the values that you're providing.

Now let's take a look at Python tokens, starting with operators. Operators are special symbols that are used to carry out arithmetic and logical operations. In Python there are several kinds of operators, and we have all of them listed here: arithmetic operators, assignment operators, comparison operators, logical operators, bitwise operators, identity operators, and membership operators. We'll take a look at them one by one.

So what are arithmetic operators? As the name suggests, we use arithmetic operators to perform arithmetic, or mathematical, operations: things like addition, subtraction, multiplication, division, modulus, and exponentiation. Addition adds two numbers, so in `6 + 5`, plus is the operator, and 6 and 5 are the operands. Similarly, we can subtract, multiply, or divide two numbers. The modulus operator gives us the remainder, so `7 % 2` gives 1, because when we divide 7 by 2 we get a remainder of 1. Similarly, exponentiation is used to raise a number to a specific power. For instance, if I type 7, then the multiplication symbol (the asterisk) twice, and then 2, that is `7 ** 2`, it means we are multiplying 7 by itself, giving 49. These are just basic mathematical operators, they are easy to understand, and they are very commonly used in Python.

Then come the assignment operators. Assignment operators, as the name suggests, assign a value to a variable. Let's say I have a variable `x` and I want it to hold the value 10: I can just type `x = 10`, which is what is shown here. Now, there are some extensions to the simple assignment operator, which is the equals sign: you can add a plus right in front of
it, and what that does is add the number on the right-hand side of the operator to the variable on the left-hand side. To give an example, as you can see, we have the `+=` operator: `x += 2` is the same as typing `x = x + 2`. Let's suppose that `x = 10` was executed before this, so `x` holds the value 10. When I run this code, Python takes the number that `x` holds, which is 10, and adds 2 to it, so `x` becomes 12. Following the same logic, we have `-=`. The minus-equals operator does the same thing, but instead of adding, it subtracts the number on the right-hand side, so `x -= 29`, which is the same as `x = x - 29`, subtracts 29 from the number already in `x` and assigns the new value to `x`. Similarly, there's multiply-equals (`*=`), there's divide-equals (`/=`), and there's or-equals (`|=`), which is based on the bitwise OR operator; we'll take a look at bitwise operators in a moment.

Then there are comparison operators. Comparison operators, as the name suggests, are used to compare two values. Let's say that I have two values that have been entered by the user, and I want to compare them; this is where comparison comes into play. Say you have a registration form: the user enters their name and registration number, then they enter their password, and, to make sure they have typed the password they intended, you have a confirm-password field as well. When you're validating the form, you need to check whether the values in the password field and the confirm-password field are equal to each other. This is where the double equals (`==`) comes into play. The reason we have two equals signs and not one is that a single equals sign is the assignment operator: the operand on the left-hand side gets assigned the value on the right-hand side. Here we are comparing the two values,
so the operands on both sides are compared to each other. The opposite of it is the not-equal operator (`!=`), which is used to check whether an operand is not equal to something. If I check whether 1 is not equal to 2, it returns `True`, since 1 is not equal to 2. Similarly, we have less than, greater than, less than or equal to, and greater than or equal to. Their names are quite easy to understand as well: they check whether one value is less than the other, or greater than it.

Then there are logical operators. Logical operators are used to combine conditional statements. If I want to check whether one condition is true and another condition is true as well, I use `and`; it returns `True` only if both statements are true. Similarly, there is `or`, which checks whether at least one of the statements is true. Let me show you an example in Python. Let's say `x` is equal to 15 and `y` is equal to 16. If I check `x == 15`, it returns `True`; if I check `y != 16`, it returns `False`. Now, if I want to check whether both of these conditions are true, I type `x == 15 and y != 16`, press Enter, and it returns `False`, because one of those conditions is false: `y` is equal to 16. If I change the second condition to `y == 16`, it returns `True`, because I'm now checking whether `x` is equal to 15 and `y` is equal to 16.
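
The shell session described above can be reproduced as a short script. This is a minimal sketch using the same values, 15 and 16; the result names (`first`, `second`, and so on) are made up for illustration.

```python
x = 15
y = 16

# Comparison operators produce a bool.
first = x == 15    # x equals 15
second = y != 16   # y is not different from 16, so this is False

# `and` is True only when both conditions are True.
both_mixed = x == 15 and y != 16
both_equal = x == 15 and y == 16

# `or` is True when at least one condition is True.
either = x == 15 or y != 16

print(first, second)
print(both_mixed, both_equal)
print(either)
```

Changing `x` or `y` and rerunning the script is a quick way to see how each combined condition responds.
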
If I change the value of `x` to, let's say, 14 and check this condition again, it returns `False`. The `or` operator is a little different. If I use `or` instead of `and`, Python checks whether `x` is equal to 15, which is clearly false here, and whether `y` is equal to 16; if either one of those statements evaluates to true, it returns `True`. Since `y` is equal to 16, we get `True`. But if I check whether `y` is not equal to 16 instead, both statements are false, so it returns `False`. So basically, if we want to check that all the statements are true, we use the `and` operator; if we want to check that at least one of the statements we have provided is true, we use the `or` operator. The `not` operator simply inverts a statement. If I check `not 10 == 10`, it returns `False`, because 10 is 10; but if I check `not 10 == 11`, it returns `True`. So I'm just checking for the opposite.

Now let's move ahead to bitwise operators. Bitwise operators are a bit difficult to understand, so let me explain using Notepad. Under the hood, inside any computer programming language, every number, or any value, is represented in binary. Binary is just some combination of zeros and ones. When you hear something like "an integer holds 32 bits", it means that it has 32 positions, each of which can be filled with either 0 or 1, and any value that we see on the screen comes from some combination of 0s and 1s in those 32 positions. To give you an example, let's say that I have a four-bit number. A four-bit number means four binary digits: a
number formed by four binary digits. If I write 0000, this will be zero, since all four digits are zero. In case you don't know anything about binary numbers, I can give you a really quick refresher here. Each position has a place value: from left to right, the four positions are worth 8, 4, 2, and 1. We number the positions 3, 2, 1, and 0, starting with position 0 on the right, and each position is worth 2 raised to the power of its position number. If the digit at a position is 1, that place value is counted; if it is 0, it is not.

Let me give an example. Let's say the binary number we had was 0101. The way we work it out is to multiply each place value by the binary digit at that position. The digit at position 0 is 1, at position 1 it is 0, at position 2 it is 1, and at position 3 it is 0. So we end up with this sequence: 2 to the power of 3 is 8, multiplied by 0; plus 2 to the power of 2, which is 4, multiplied by 1; plus 2 to the power of 1, which is 2, multiplied by 0; plus 2 to the power of 0, which is 1, multiplied by 1. This ends up giving us 0 plus 4 plus 0 plus 1, and we get 5.
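
The place-value calculation above can be written out in a few lines of Python. This is only a sketch of the arithmetic: the function name `binary_to_decimal` is made up for illustration, and in practice the built-in `int(text, 2)` does the same conversion.

```python
def binary_to_decimal(bits: str) -> int:
    """Sum digit * 2**position, counting positions from the right."""
    total = 0
    for position, digit in enumerate(reversed(bits)):
        total += int(digit) * 2 ** position
    return total


value = binary_to_decimal("0101")   # 0*8 + 1*4 + 0*2 + 1*1
builtin_value = int("0101", 2)      # the built-in conversion, for comparison
back_to_bits = format(value, "04b") # back to a four-digit binary string

print(value, builtin_value, back_to_bits)
```

The loop walks the digits from right to left so that the loop index matches the position number used in the explanation.
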
So this might seem a bit complicated, but that's how the numbers are calculated under the hood; you can look up a binary refresher if you want to go deeper. Now, because we have only four places to put zeros and ones, we can go from a number that has all zeros to one that has all ones, and everything in between. That gives us numbers ranging from 0 to 15. So if we have four binary digits, we can represent numbers ranging from 0 to 15; for anything above that we've run out of space and can't do anything about it. Any configuration in between, like 1000, also represents some number between 0 and 15.

Now we come to bitwise operators. Instead of operating on the numbers that we see, they operate on their binary forms. Let me give an example. We have the and operator: 0 and 0 returns 0, 0 and 1 returns 0, 1 and 0 returns 0, and 1 and 1 returns 1. It is very much like the and operator that we looked at previously in the logical operators section. It takes a look at all the operands, in this case two because it's a binary operator, so we look at the left operand and the right operand: if both of them are 1, we get 1; if either one or both of them are 0, we get 0. Similarly for or, just as with the logical operator, only one of them needs to be true, where 0 can be considered false and 1 can be considered true. So for the first case, since both of them are 0, we get 0; for the second and the third we get 1; and for the last, where both are 1, we get 1. So this is what the bitwise and and bitwise or operators do. XOR is a bit special: XOR stands for exclusive
or, which means the two bits should not be the same: exactly one of them has to be 1. If both of them are 1 or both of them are 0, the value that is returned is 0. These operations are used in building logic gates, but they are also used a lot in programming, and there are some very nifty tricks you can do with bitwise operators that allow you to create more efficient programs, so do take a look at them. So these are the operators that we have studied so far.

Let's take a look at left shift and right shift. Let's say that I have a number like this, 0001, which represents 1, and I want to shift it by two. I write left shift 2, which means: whatever number we have, shift its bits to the left by two positions. It moves all the bits to the left and pads the right-hand side with zeros, once for every position I asked for, so since I have given 2, it does it twice. As you can see, the 1 was at the zeroth position, it was shifted two times, and now it's in the second position, 0100, so from 1 we have come to 4. So that's what the left shift operator does: it takes the binary form and shifts it to the left, here by two. On the other hand, we also have the right shift operator, which does exactly the same thing in reverse. If I give it this result, 0100, and change the left shift operator to the right shift operator, it gives me back the original value, 0001: it shifts the bits two positions to the right, the two bits that fall off the right end are removed, and two zeros are added at the beginning. So that is what these operators do: they work on the binary form and shift it to the left or to the right. And at the very end, let me show you what the not operator does.
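As a rough sketch of the and, or, XOR and shift operators described so far: the values a and b below are hypothetical, picked only so that every bit combination appears, and the 0b prefix and format(..., "04b") are just a convenient way to write and display four binary digits.

```python
a = 0b0101  # 5
b = 0b0011  # 3

and_result = a & b   # 0001: a bit is 1 only where both inputs have 1
or_result = a | b    # 0111: a bit is 1 where at least one input has 1
xor_result = a ^ b   # 0110: a bit is 1 where the inputs differ

left = 1 << 2        # 0001 shifted left twice becomes 0100
right = left >> 2    # 0100 shifted right twice is 0001 again

print(and_result, or_result, xor_result)   # 1 7 6
print(format(left, "04b"), left)           # 0100 4
print(format(right, "04b"), right)         # 0001 1
```

Shifting left by one position doubles a number and shifting right by one halves it (dropping any remainder), which is one of the small tricks these operators make possible.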
This is something that's often not shown very accurately. The not operator turns all the bits from 1 to 0 and from 0 to 1. So if I give it 0001 and put the not operator right in front of it, it looks at all the binary digits and flips them: the first 0 is converted to 1, the next 0 to 1, the next 0 to 1, and the final 1 is converted to 0. So essentially we're just inverting the bits. Now, when you do use it, make sure you understand how the sign works. Negative numbers are stored in binary with a few nifty things going on in the background, and flipping every bit also changes the sign, which is why in Python the bitwise not of 1, written ~1, is -2 and not 14. So that's how the bitwise operators work. They're easy to understand once you understand the basics of how binary works under the hood, and you can do some pretty powerful and efficient things once you understand how binary digits work.

Let's move on to the identity operators. An identity operator is used to check whether two objects are the same object or not. Let me give an example. If I open Python and use the id function and pass it the number 1, you see this long number; this is what's called the id of the object. Since Python is an object-oriented language, numbers are represented as objects of the integer class, so if I check the type of 1, I get class int. Similarly, if I check the type of 2 I get the same thing, and if I get the id of 2, it's this. As you can see, the id of 2 and the id of 1 are not the same. Similarly, when you create your own classes, you create your own
objects, and they have their own ids as well. The id is used to differentiate between two objects that have almost the same structure but are still two separate objects. So let's say I have 11 twice, and I check the id of this one, it's this, and the id of this one is this. As you can see, both of them are 11 and they do have the same id, because these values are immutable; as you can see, both show the same number, 1-8-1-8-7-5-2 and so on. So the identity operators are used to check whether the ids of two objects are the same or not. If I check with the is and is not operators: 1 is 1, since both of them have the same id, returns True, but if I check 1 is the string 1, it returns False. And if I check 1 is not the string 1, then it's True. So that's how it works, and this also applies to objects: when you're learning object-oriented programming, you can create your own objects and compare them this way.

Then comes the membership operator. This is also quite useful, and Python has it while many other languages don't. It's an operator that allows us to check whether or not a value is in a sequence or a container. To give you an example, let's say that I have a list of numbers, 1, 2, 3 and 4, called nums, so I have four numbers in a list. Now if I want to check whether a number, let's say one entered by the user, is present in this list or not, then instead of running a loop and looking at each number, asking is 1 equal to the number the user entered, is 2 equal to it, is 3, is 4, I can make Python do it easily by typing 4 in nums. Let's say the user entered the number 4: it returns True, because 4 is present. If I check 8 in nums, it
will return False, because 8 is not present in the list. But if I append it, take a look at nums, and now check whether 8 is in nums, it is. So that's what the membership operator does. If I want to check whether something is not present in nums, I can use not in, and it returns True or False accordingly. And if I remove it with nums dot pop, 8 is removed; if we take a look at nums now and check whether 8 is not in nums, it returns True. So that's how that works: membership operators are used to check whether or not something is present inside a container or a sequence. With that we have come to the end of the operators section, and we'll move on to data types in Python.

So, Python tokens. In Python, every logical line of code is broken down into components, and these components are called tokens. There are many types of tokens; the ones we normally discuss are keywords, identifiers, literals and operators. Everything that you write in Python is a token of one kind or another, so let's take a look at them one by one.

Keywords: what are keywords? In Python, keywords are special reserved words. They hold special meaning, so they are words you cannot use to name a variable. They convey a special meaning to the compiler or the interpreter: when it sees such a word, it understands that it has to perform a certain action, such as allocating memory or creating something on the call stack, instead of treating the word as a normal variable. Each keyword has a special meaning and a specific operation, and we'll come across many keywords and see what they mean one by one. We should never use a keyword as a variable name, because that's going to cause a lot of problems later in the code, and we'll understand what that means in a moment as well. So these are some
of the keywords that Python uses: as, and, continue, break, except, del, from, import, pass, lambda. There are many others, and they all hold special meaning. As you get a deeper understanding of what Python has to offer and how you can use it to create your own applications, you'll understand where these keywords are used. The main point is that you should never use them as a variable name. Again, let me show you an example. Let's say I open the Python console. If I type ret equals 7, this works just fine. But return is a special keyword, and we'll take a look at what it means in a moment; if I try to assign 7 to it, I get an invalid syntax error, because return is already a keyword. If I type return1 instead, that works just fine, but if you try to use a keyword itself for this, it is going to cause a problem.

Now that that is done, let's take a look at something else. Those are the keywords, but there are other tokens, called identifiers. An identifier is a name given to something in Python that allows us to identify it later on. It could be the name given to a variable, a function, a class or an object. In case you are not familiar with functions, classes or objects: you are familiar with variables, because that is what we just discussed, and functions, classes and objects will be discussed later on, so don't worry if you don't understand them as of now; you'll get to understand them over the course of this presentation. There are certain rules that we need to follow when naming an identifier. The first is that no special character except the underscore can be used in an identifier, so we can't use a special character in a variable name or any other name. Similarly, keywords cannot be used as identifiers. So, as we already saw, we
can't use return, we can't use for, and there are many other keywords that you cannot use as variable names or as any other identifier. Next, Python is case sensitive, so a variable named Var, with the V capitalized, and a variable named var, with every character in lower case, are two different identifiers. If I type Var with a capital V equals 7, and var in lower case equals 8, and print the values of both, I get different values. These are two different things: when the Python interpreter looks at them, it assigns two different buckets for them. Then, the first character of an identifier can be a letter or an underscore, but not a digit. As we have already seen, we can use any letter as the starting character of a variable name, and we can use an underscore as well, but we cannot use a digit. You can use a digit afterwards: it can be the second character, the last character or anything in between, but it can't be the first one. So let's take a look. If I open the Python console, and remember that a variable is also an identifier, let me show you the casing problem. I set var in lower case to 7, then I name one with a capitalized V, and one with all the letters capitalized. Let me just change the values: now it's 7, it's 71 and it's 72. So if I print var with everything in lower case, it's 7; if I print it with the V capitalized, it's 71.
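The casing example and the naming rules can be sketched as follows. The helper is_valid_name and the sample names passed to it are hypothetical, added only for illustration; str.isidentifier and keyword.iskeyword are the standard-library checks it relies on.

```python
import keyword

# Three different identifiers, because Python is case sensitive.
var = 7
Var = 71
VAR = 72
print(var, Var, VAR)  # 7 71 72


def is_valid_name(name: str) -> bool:
    """True if the text follows the identifier rules and is not a keyword."""
    return name.isidentifier() and not keyword.iskeyword(name)


print(is_valid_name("_total"))   # True: an underscore may come first
print(is_valid_name("v1ar"))     # True: a digit is fine after the first character
print(is_valid_name("1var"))     # False: a digit cannot come first
print(is_valid_name("my-var"))   # False: special characters are not allowed
print(is_valid_name("return"))   # False: keywords are reserved
```

Checking a name this way avoids having to trigger the syntax error in the console just to find out whether a name is allowed.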
If I print it all capitalized, it's 72, so all of them hold different values. And the last thing is that we cannot start a name with a number: if I type 1var equals 7, we get a syntax error, but if I type v1ar, it works just fine.

Now we come to literals. Literals are just raw data that is given to a variable. Literals can be of many types: string literals, numeric literals, boolean literals and special literals, and we'll take a look at them one by one. As you can see, this is some code that's been written for you, and we'll take a look at string literals first. A string literal is formed by enclosing text in quotes; both single quotes and double quotes can be used. This is some code that you can look at and type in your own console, or in any other environment in which you can run Python code. I would suggest the console, because that's where you get the most immediate feedback, but it's your choice. So let me show you what we can do here. As I've already discussed, we need to enclose our text within quotes to tell the Python interpreter that it's a string. Let's try it: path is a variable, and I want to assign a string literal to it, so I type a sample path in double quotes, press enter, and it works. If I use single quotes instead, it doesn't make any difference; it works just the same. Now, the reason we have both single and double quotes is mainly that in C, C++, Java and many other languages, single quotes are used for a single character, not a string. A single character would be something like a; you can think of a single character as a string of length one. Now the way it worked in Java
and C and C++ was that if you attach multiple characters one after the other, let's say a b c d, that is a string, and for that you can't use single quotes; you have to use double quotes. I'm simplifying things a lot, because there's more going on underneath, like the null terminator, but you don't need to worry about that. You just need to understand why there are two kinds of quotes, and that's the thing I'm trying to get across. If a person shifts from C++ or Java to Python, they will expect to use single quotation marks for a single character, and for that to work, Python allows us to use both single and double quotations. There's also one more thing you can do: you can use three single quotes. If I want to write a really, really long string, I can start it with three quotes and press enter. As you can see, the prompt, instead of being three arrows pointing right, has now become three dots. Now you can type anything. Let's say I type this and press enter; it's still allowing me to type, so I type this. Let's say you're typing a paragraph or a note or an email; you can do it this way. Now let's say that this is everything I wanted to type, including the enter characters. To end it, I again type three single quotes and press enter, and I get back the prompt with three arrows pointing right. I type the variable name line, press enter, and as you can see I get everything. This backslash n is the character used to mark a new line. So if I print it using the print function that we have already discussed, print line, as you can see it starts here, creates a new line wherever the backslash n
character comes in, and at the end it again starts a new line and ends there. So the backslash n character is a special character that tells you that there is a new line here. You don't have to worry about it for now; just understand that that's how it is done.

It's not just strings, though; there are many other kinds of literals as well. There are numeric literals, which allow us to store numbers in variables. We can have positive and negative whole numbers with no fractional part, as integers, as we have already discussed. We can have real numbers, which are numbers with a fractional part, using float. And we could have long: long is also used to store whole numbers, but the important thing is that it can store a much larger value than an integer. As you can see, the description says an integer of unlimited size followed by an uppercase L or lowercase l, and I'll show you what that means in a moment. You can also store complex numbers, which are numbers with real and imaginary parts. If you have worked with real and complex numbers in mathematics, you understand what I'm talking about; if you have not, that's totally fine. It's not something that's used very often; it's only used in special cases where you need to work with complex numbers. So let's take a look. We have positive and negative numbers, and we can store both in an integer: if I want to store negative 77, I can store it in number, or positive 77, I can store it in number, and the type of number is still class int. But if I attach a capital L here, that causes an error, because this is a newer version of Python, where the L suffix and the separate long type no longer exist, so never mind; plain integers work fine. Now if I were to do a lot of other things
like floating point numbers, say 1.56, and get the type of it, this gives me class float. Float, as we have already discussed, is used to store floating point numbers, or fractional numbers, or real numbers, whatever you want to call them. Now, there are many differences in how these are stored under the hood. They are stored in a binary format, and we don't need to get into that, but the thing to understand is that Python looks at the data type and then decides how to lay out the binary form of that number. There are many other things you can do as well: you can convert a float to an integer, or divide two numbers and get a float, and so on. Just make sure you choose the option that is easiest and works for you. In Python, the value of an integer is not restricted by the number of bits, so it can expand to the limit of available memory, and no special arrangement is required for storing large numbers. That is mainly because, as I've already told you, Python is a high-level language, not a low-level language, so you don't have to worry about changing the type or the available bit size of a number when the number becomes large. Python does it for you automatically: it checks whether the number is exceeding its current memory bounds and assigns it more memory, so that it can keep growing, until it reaches the point where there is no memory left to store the number and that causes an
error. So no special arrangement is required for storing large numbers; Python again does it for you, so you can carry on working on your application instead of on the nitty-gritty details of how things work under the hood. This is why I told you that you don't need to worry about the binary arrangement: Python handles it for you.

Then come booleans. Booleans evaluate to True or False. True and False are basically a human construct, so let me show you what that means. Let's say I'm building an application and I ask a user to enter a password. He's creating his account and he says the password is abcd. Then it comes to confirm password, and he writes abcd again. Now I want to check whether password and confirm password contain the same value, so I type password equals equals confirm password, press enter, and I get a boolean value of True. This is what tells me how to move forward: if it returns True, I can move ahead and tell the user that everything worked, store it in the database, and do whatever I want with it. On the other hand, let's say the confirm password was abcd1 because the user mistyped it; now when I check whether they are equal, I get False. So this is what boolean values are. You can assign them yourself, so let's say bol equals True. Make sure that the first letter is capitalized; true with a lowercase t will not work, since Python is case sensitive. So I type True, and I can also assign False to it, press enter, and this works as well. So that's how that works.

Now that booleans are understood, let's move ahead and see what else we have. We have special literals. The special literal in Python is called None, which means that the variable is yet
to be initialized. If you're coming from other languages, there's a concept called null, n-u-l-l; Python has the same idea, but instead of null we use None. Let me give an example again. Let's say the user is filling in the form, and the confirm password field has been left blank: instead of writing something in it, he chose to write nothing and submitted. I will have a None value here, which means absence of a value: the user has not provided one. So if I look at confirm password in the console now, nothing is shown, because nothing is there. It contains no value; think of it as a box that contains nothing as of now, and when the user does enter something, that box will contain that value. So that's what None is used for: it denotes the absence of a value. Not an empty string, not any other kind of string, not zero, not anything else, but the absence of a value. And the type of None is NoneType, in case you were wondering.

Now that we have covered operators, let's take a look at the data types in Python. Data types in Python can be divided into two groups: immutable and mutable. Immutable data types are data types whose values cannot change; mutable data types, on the other hand, can have their values changed. For instance, strings, numbers and tuples have some value assigned to them, and when we want to change it, instead of modifying the existing value, Python creates a new value for the variable to hold. On the mutable side we have lists, dictionaries and sets, and lists themselves come in several kinds, such as lists of tuples; we'll take a look at them one by one. So let's start with numbers. Numbers, as we've already discussed, cover the integer, float and complex data types. As you can see, a whole number with no decimal point is an integer; a fractional number, a
number that has one or more digits after the decimal point, is a floating point number; and a complex number is a number with real and imaginary parts. Again, complex numbers are not that prevalent: mostly we use integers and floats. Complex numbers have some use cases, but not a lot.

Then come strings. A string is a sequence of characters enclosed within either single or double quotes, and strings are used to store text. A string literal could look like this, as we've already seen. There are some operations that can be performed on strings, and we'll take a look at them now. If I open my Python prompt, let me define a string: string equals, hopefully you can see this, a b c d e f g h in double quotes. I press enter, and this is a string. Now if I want the length of the string, that is, how many characters are enclosed within the double quotes, I type len of string and I get 8: one, two, three, four, five, six, seven, eight characters, so that's correct. Now let's say that I want only the first four characters. This is what is known as slicing, and it can be done with a string or with any other sequence as well. First, let me show you what happens if I type string with 0 in square brackets: I get a. This is what is known as indexing into a string. Basically I'm saying: of whatever data the string variable holds, I want the item stored at the first position. In a string, positions start from zero, so 0 is the first position, 1 is the second position, and by that logic, since there are eight characters, 7 is the last position. I type 7 and I get h. If I try to go outside these bounds, let's say I type 9, I get index out of range. There is nothing after position 7, and anything beyond it is not accessible, so that's illegal and that's why we get an IndexError, as you can see here.
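A minimal sketch of the length and indexing steps just described, using the same eight-character string; the variable names first, last and message are hypothetical and exist only to hold the results.

```python
string = "abcdefgh"

length = len(string)   # number of characters in the string
first = string[0]      # positions start at 0, so this is the first character
last = string[7]       # eight characters, so 7 is the last valid position

print(length)  # 8
print(first)   # a
print(last)    # h

# Going past the last position raises IndexError instead of returning a value.
try:
    string[9]
except IndexError as error:
    message = str(error)
    print(message)  # string index out of range
```

Catching the error with try and except is only there so the sketch runs to the end; in the console the same access simply prints the IndexError.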
It is called an index error because what we are using here, the number enclosed within square brackets after a variable name, is what is known as an index. Now that we know how to access a single character in a string, let's say I want the first four characters, a b c d. So I start from 0 and I want everything before index 4. Why before index 4? Because it starts from zero, 0, 1, 2 and 3 are the first four characters. So how do I get all of them at once? Say there were 15 characters and I wanted 0 to 14. What I can do is type 0, then a colon, then 4. This means: get me everything from the beginning of the string up to, but not including, index 4. The character at index 4 is e and the one before it is d, so I want everything from a till d. I press enter and I get a substring, which is a b c d. This is what slicing a string looks like, and this is what's known as the slice operator. You don't necessarily have to start with zero: you can go from 3 to 4, which gives you just one character, which is d. Now another thing: if I leave the first number off, what do you think will happen? Pause the video and try to think about it. If I do it now, it starts from zero, so the start defaults to zero. If I make the start the same number as the end, it gives me an empty string, which makes little sense to ask for. And finally, if I leave the end off, I think we can probably guess what's going to happen. I type 5 and a colon and press enter, and what happens is we take everything from index 5 till the end. The character at index 5 is the sixth character; let me just print the string again and number it correctly: this is 0, this is 1, this is 2, this is 3, this is 4,
and this is 5. I want everything from index 5 till the last character, so that is what this means. Finally, I can use negative indexing here as well. If I want everything from the beginning except the last character, I use colon minus 1 as the slice; if I want to remove the last two characters, I use colon minus 2, and this works just fine as well. So that's how slicing works.

Another operation we can perform is replacing something in the string, with string dot replace. The slide shows an example of it, but let me just show you: replace f with g. I press enter, and it returns another string to me, which is a b c d e g g h. Now if I look at the string variable that contained our value, it still has the same data. So when we perform this replace operation, what we're actually saying is: give me another string in which you have replaced all the f's with g's. This is especially useful if you're trying to remove some content from a long piece of text. Let's say you have 17 lines of text from which you want to remove all the vowels for some reason; you could run this for a, e, i, o and u and replace each with nothing, and that will work just fine. So now that we're done with the string data type, let's move ahead.

Now come tuples. Tuples are immutable sequences of Python objects. What does that mean? Let's say that you want to store the names of the days of the week, as strings, in a variable. We have Monday, Tuesday, Wednesday, Thursday, Friday, Saturday and Sunday, seven days in a week, and all of them need to be stored in a single variable. There are many ways to do it. One important thing to notice is that this data is never going to change: at no time in the future are we going to say that we need to add another day to the week. So because it's
not going to change, we are going to use an immutable data type, which means that adding something to it is not allowed. So let's say I want weeks, and I want to store everything in one place. I could certainly create separate variables, say sun containing Sunday, and similarly mon, tue, wed, thu, fri and sat. That's doable, but it runs into a problem: say we had 14 things to store, or 100 things, and we needed to access them. It would be very difficult to remember the names of 100 different variables. It would be better to store them all in just one variable and then access them using the position syntax that we discussed earlier for strings. This is where tuples come into play. So I type weeks equals, and I enclose the values in parentheses, which is not strictly necessary, but I will do it here. Now let me type in Sunday, and you can type along with me, Monday, Tuesday, Wednesday, Thursday; make sure that you separate them with commas, and type a space after each comma to make it more readable, though that's not required; Friday, and finally Saturday. I press enter, type weeks, and this is what we get. If I check the type of this, it's tuple. Now, I said the parentheses are not really necessary: if I typed the values without them, that would also work. The way Python identifies that this is a tuple is by looking at the comma. It looks at this line and says, okay, weeks needs to be assigned some value; it sees Sunday and says, okay, one string, and then a comma, which means other values are also going to be assigned, so this is a tuple. Even if I had not put the opening and closing parentheses around this list of values, it would
work just fine.

Now one thing that I would like to show you: if I go weeks zero, I get Sunday. Weeks six should give me Saturday, right? But if I try to assign some other value to it, let’s say any gibberish value, and press enter, it gives me an error: tuple object does not support item assignment. This is one of the major advantages of using an immutable object: it does not let anything in the data that you have already entered be manipulated. So if you have entered a sequence of something that you know is never going to change, then tuples are the best option. This is why we use tuples.

Now let’s take a look at something else: we have lists, dictionaries, and sets. Lists are very similar to tuples. One major difference is that you can make changes in a list. So if you have dynamic data that changes rapidly, you can use a list for that instead of creating new tuples for every change that occurs. Let me show you an example. Let’s say that instead of weeks I want to store the salaries of people. I can store the salaries of three people, and let’s say they’re stored in thousands, so 10 means ten thousand, then fifteen thousand, twenty thousand. Press enter, and now if I take a look at salaries, I have three salaries. If I take a look at the type of salaries, I get list. It’s not a tuple, it’s a list. That means I can change the salary of a person, because it does happen that a person gets promoted or gets an increment. Let’s say that the person at index 0 is now going to get a salary of 18,000. It does not throw any error, because I am using a list, which can be modified. It’s still a list, and if I print it now, the first value is 18.
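The tuple and list steps above can be sketched as a short script. This is a minimal illustration rather than the exact console session; the variable names weeks and salaries follow the walkthrough, while first_day, last_day, and tuple_changed are helper names assumed here.

```python
# A tuple of day names: the data never changes, so an immutable type fits.
weeks = ("Sunday", "Monday", "Tuesday", "Wednesday",
         "Thursday", "Friday", "Saturday")

first_day = weeks[0]   # position syntax works as it does for strings
last_day = weeks[6]

# Assigning to a position in a tuple raises TypeError.
try:
    weeks[0] = "gibberish"
    tuple_changed = True
except TypeError as err:
    tuple_changed = False
    print("tuple error:", err)

# The same kind of assignment is allowed on a list.
salaries = [10, 15, 20]   # stored in thousands
salaries[0] = 18
print(type(weeks).__name__, type(salaries).__name__, salaries)
```

The try/except is only there so the script keeps running after the error; at the console you would simply see the error message.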
Another cool trick that I can show you: let me first assign 10 back to it. Let’s say that, instead of assigning a new salary, I want to add 8,000 to the first person’s current salary. There are many ways to do it. The first way would be salaries of 0 equals salaries of 0 plus 8. What happens here is that I assign to index 0 the current salary of the person plus 8 more. As you can see, this is what it looks like, and now if I look at it, it’s 18. But as you can see, typing this is quite long, and it’s a bit repetitive. If you’ve already taken a look at the operators slide that we went through, one operator there was plus equals. What will happen now is that it adds 8 to the already existing salary of the person at index 0, which is 18. So 18 plus 8 gives you 26, and that’s what we’ve got. So you can do it this way. You can add data to a list as well. I could go salaries dot append; append allows me to add a new salary for another person. So let’s say that it’s 28.
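Here is a small sketch of the two ways of adding to a salary, followed by append. The values match the ones spoken in the walkthrough; the starting list is assumed to be back at 10, 15, 20.

```python
salaries = [10, 15, 20]          # stored in thousands

# Long form: read the current value, add 8, assign it back.
salaries[0] = salaries[0] + 8    # 10 + 8 gives 18

# Short form: the += operator does the same thing in one step.
salaries[0] += 8                 # 18 + 8 gives 26

# append adds a new item at the end of the list.
salaries.append(28)

print(salaries)
```

Both forms change the list in place; the short form just avoids typing salaries[0] twice.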
I press enter, and now instead of three it contains four salaries. Append is not something that you could do on a tuple. So that’s how you do it.

Another thing is that, instead of storing everything as an integer value, you can store data of any type. Let’s say that I want to store a number, a string, and a floating point value as well. This is also doable, so I can type mixed equals this. This is one of the advantages of using Python: many languages don’t allow this kind of intermixing of data types when storing values in an array or a list. Now if I type mixed, I get this. I can store any other kind of data; I can even store another list inside a list. So if I want to append another list, 155 comma b (I’m just adding some gibberish values to show you this is doable), I can. I can add a tuple to it too. This time I need to put in the parentheses to indicate that I’m adding a tuple: 88, 77, and it’s done. Now if I take a look at mixed, I have this. If I take a look at mixed of four, it’s the tuple, and I can check its type to see if it’s a tuple or not, and it is a tuple. And if I take a look at the type of the item at index 3, it’s a list. So you can nest any kind of thing, but just don’t go too crazy with it. That could cause a problem: you would not be able to comprehend the data that you have if you go overboard with nesting.

Then we have dictionaries. In other languages they are called hash maps or hash tables. A major advantage of using a dictionary over a list is that it allows you to create key value pairs. Again, let me just show you. If I want to associate a person’s name with the person’s salary, instead of using a list, which can only be accessed using a numeric index like 0, 1, 2, 3, I can again write salaries equals and create a dictionary like this: add two curly braces, and inside them add the key, which would be the person’s name, let’s say John, and end
the quotation. The person whose name is John gets 15,000. Let’s say that there’s a woman named Jane who gets 14,000, and another man named Johnny who gets 5,000. Press enter, and now if I print salaries, it displays the key value pairs. If I take a look at the type, it gives me the class dict. So that’s all good, but one thing it allows me to do now is access a value using the person’s name. Let’s say that I want to get the salary of Jane: I press enter and I get 14, which means Jane earns 14,000. Let’s say that instead of Jane I want John: as you can see, John gets 15,000. And if I use a name that doesn’t exist, I’m going to get an error. This is called a KeyError: this key does not exist.

A way to get around this is, instead of indexing directly, to use get. I want the salary of John, but if it’s not available, give me a default of 18. I get 15, because that’s the salary of John. Now let’s say that I use some other key that is non-existent: the person named john asd is not in our dictionary, so if I press enter I get 18, which is the default value. I could set it to anything; I could even set it to zero, so that if I try to access a person whose name is not in our dictionary, I get zero. So that’s one of the major advantages.

Another is that it’s really fast. Even if you have something like 10,000 records, the way hash tables, or dictionaries, work under the hood is that instead of going through all 10,000 names to find the one you want, they do some clever tricks that make lookups really, really fast. This is why, if you are doing something that requires really fast access to some data and you can give each piece of data a unique key, you should definitely use dictionaries instead of lists. With a list, you could do something like this: store a
list of lists in which the first item is the person’s name and the second item is the person’s salary. I could do it here as well: the first item is the person’s name, Jane, and the salary of Jane is, let’s say, 80. Now, if I wanted to, I could define a function to search this, but the important thing to understand is what happens with, let’s say, 10,000 records. Suppose I want to find a person whose name is not John or Jane but someone at the end of this list, and I don’t know where the name is. Let’s say that the person’s name is Andy. We go through the entire list, checking whether the first element of each entry is Andy. If it is Andy, we print the salary; if we never find it, we print zero. For that, we may have to look at all ten thousand elements. With a dictionary, we can just give the name, and the dictionary will do some clever tricks behind the scenes that you don’t need to worry about, and it will find it for you almost instantly.

So prefer a dictionary when you have large sets of data and you constantly need to find things in it. If you just want to store some data that you can access using numerical indexes, and the data keeps changing, then use lists, because in tuples you can’t change the data. So if I have, let’s say, 0, and I want the person with the employee ID of 0 to have a salary or something, then I could press 0.
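To make the comparison concrete, here is a minimal sketch of the same lookup done both ways. The data values, the records list, and the find_salary helper are hypothetical names made up for this illustration; only the dictionary operations (indexing, get with a default, KeyError) come from the walkthrough.

```python
# Hypothetical data: the same salaries stored two ways (in thousands).
records = [["John", 15], ["Jane", 14], ["Johnny", 5], ["Andy", 25]]
salaries = {"John": 15, "Jane": 14, "Johnny": 5, "Andy": 25}


def find_salary(rows, name, default=0):
    """Linear search: look at each [name, salary] row in turn."""
    for row_name, salary in rows:
        if row_name == name:
            return salary
    return default


from_list = find_salary(records, "Andy")   # may scan the whole list
from_dict = salaries.get("Andy", 0)        # direct lookup by key
missing = salaries.get("john asd", 0)      # key absent, default returned
print(from_list, from_dict, missing)

# Indexing a missing key directly raises KeyError instead.
try:
    salaries["john asd"]
except KeyError as err:
    print("KeyError for", err)
```

With four records the difference is invisible; it starts to matter when the list grows to thousands of entries and the lookup runs many times.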
Here it’s not going to work, because I’ve reassigned salaries to a dictionary, so I’m getting a KeyError, but you can understand my point. On the other hand, let’s say that I have a data set which does not change, and I want to access it using numerical indexes: then tuples come into play.

And finally we get to sets. Sets are by far one of the most underused data types, and you need to understand why we use them. A set is an unordered collection of unique elements: it has no duplicates. The set itself can be changed, but the elements you put in it must be immutable (hashable) values. So what does that mean? Let me show you.

By the way, the literal syntax I have been showing you to create dictionaries and so on is not the only way. Let’s say that I want to create my sample dictionary: I can just use the dict function and it will work just fine. Now it’s an empty dictionary, and I can add data to it. You add data by typing the dictionary name with the key in square brackets and then assigning a value; this will work just fine. Similarly, the tuple function will create a tuple and the list function will create a list.

Now I want to create a set, so let’s say that I name it s. I could try it this way, with empty curly braces, but I won’t be able to add anything, because this will be treated as a dictionary. So the best way to do it is to use the set function, and now if I take a look at s, it’s a set. So what is a set? If you have worked with mathematics, with Venn diagrams and things like that, you already have a good idea of what a set is. A set is basically something that contains a lot of values but no duplicate values, and it allows you to check whether or not it contains a value in constant time on average. Let me give an example. I think s dot add one would work. Now if I take a look at s, it has one. If I add two, it contains two as well. Now if I add two again, instead of containing two elements with the value two, it contains
one element with the value of two. Now if I want to check whether or not something exists in a set, let’s say one in s, it will show True. This is the in operator, or the membership operator, that we have already taken a look at. And if I want to check whether five exists or three exists, it’s going to be False.

So why are we doing this? Well, with a set it is extremely fast to perform this operation. Let’s say that there were a million records in this, and I wanted to check whether something exists in it or not. The first intuition that comes to mind is to look at everything inside the collection, all 1 million things, and compare each with the data that we want to find. If it’s found while we’re looking through the one million data points, we return true; if it’s not there, we return false. That is fine, but with one million records it takes a lot of time. And let’s say that you have to do it constantly: you have to check whether or not something exists in those million records 10,000 times, 20,000 times. That’s a lot of computation. Sets do some clever hashing under the hood to check whether or not something exists inside them in roughly constant time. That’s also why a set doesn’t contain any duplicate values.

I can use in with a list as well. If I assign a list of one comma two to l, then instead of one in s I can type one in l, and that will return True as well. The difference is that the list version looks at the elements one by one, one, two, and so on, to find whether something exists in it or not, whereas with a set it’s extremely fast. So if you have some data that you want to keep track of and want to check whether or not another value exists, or you want to deduplicate the data, you can do that
as well. So use a set when you want to check membership really fast. You can perform union, intersection, and all of that, and that’s very easy to do with sets as well.

Now let’s take a look at how to deal with sets and the other data types. I’ll show you some hands-on to further strengthen your knowledge of data types. Let’s take a look at the setup that we’re working with. We’ll be looking at lists, sets, dictionaries, and tuples, the things that we have already discussed. But instead of typing into the Python console as we did earlier (this was the console that we typed code into), we’ll use a file. For typing out single lines and executing them, the console is actually very good, but let me exit out of it. When you want to type in a lot of code, open a code file, and execute it, I would recommend that you use Visual Studio Code. Let me show you what it is. Let me open a new browser window and type in VS Code, press enter, and open this. Once you’ve done that, you can download it for Windows or for whatever operating system you have. After you have downloaded it, you can install it like any other software. After doing that, just right click on the folder that you want to open and choose the Open with Code option. It will open the folder in VS Code. So it’s done, and as you can see there are no files as of now in this folder. The folder’s name is hands-on, in which we’ll be performing the hands-on. Let me create a file called app dot py; py is the Python extension. Press enter.

Just to make it clear, you don’t have to use VS Code if you don’t want to. You can use Notepad for this as well: just go to the folder, create a file named app.py, right click on it, and click Edit. It will open in Notepad, and you can make the changes there as well. There’s virtually no difference. Now what we need to do is, I
open this, and let me just make some changes. I’ll increase the font size so that you can see a little better. It’s loading extensions, and while it’s doing that, open the command prompt. To do that, just go into the address bar of the folder window, delete everything, type cmd, and press enter. The reason we do it this way is that we want to open the command window in this folder, so that the app.py file that we create can be run using the python command. If you have installed Python, then the command is already available for you. Now if I open the editor, as you can see, I can type in a lot of things. Let me increase the font size again. I’m going to type some Python code; hopefully it’s visible to you all. In Visual Studio Code you can write code, and the reason I’m choosing it is that it gives you syntax highlighting, shows you if you’ve written something wrong, autoformats the code for you, and does a lot of things for you right out of the gate.

The first thing we’ll work with is lists. So let’s say that I create a list named nums, and the numbers that I want to insert in it are 1 2 3 4.
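As a rough sketch, this is what the list part of app.py could look like once the steps that follow in the hands-on are typed in. The exact order of edits in the video may differ; removed_last and removed_first are names assumed here to hold what pop returns.

```python
# app.py: run it from the same folder with: python app.py

# declare list and print
nums = [1, 2, 3, 4]
print(nums)

# append adds a value at the end of the list
nums.append(5)

# pop with no argument removes and returns the last element
removed_last = nums.pop()

# pop with an index removes and returns the element at that position
nums.append(5)
removed_first = nums.pop(0)
print(removed_last, removed_first, nums)

# The indented line belongs to the loop and runs once per element
for x in nums:
    print(x)

# Not indented, so it runs once, after the loop has finished
print("done")
```

If print("done") were indented to line up with print(x), it would be printed after every element instead of once at the end.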
So the first task is to print it, so I type print nums and save the file. When you save it, this icon changes from a filled circle to an x. Now that that is done, open the command prompt and type p y t h o n, python, and the name of the file, which is app.py. I press enter and the output is there. Let me increase the font size here as well so that it’s easier for you to see. Let’s say it’s 28; that’s easier to see. So again: python, space, and the name of the file, which is app.py in our case. One thing that you should notice is that app.py is in the folder in which the command prompt is open: Users, my username, Desktop, and on the desktop we have the hands-on folder inside which we created app.py. This is why this is working. Another thing you can do is press this run button, but it works sometimes and sometimes it doesn’t, so I don’t like doing it that way. It gives you this line of output as well, which is fine, but we’re doing the same thing: we are typing python and the name of the file. You can do it either way.

Now that that is done, let me add a comment. A comment is something that you add to your code to explain it. If I type anything in a comment, it will not be executed by the interpreter; it’s just for the reader of the code, so any person who’s reading the code can take a look at the comment and understand it. So I’ll type: declare list and print. If I run this again, it’s not going to make any difference, because it’s a comment.

Now let me show you how you can add something to the list. You get the append method, and in the append method I can pass in a value. It doesn’t need to be a number; it could be a string, or any other object. I’ll pass in a number: let’s say that I want to add 5. I save it, print it again, run it, and I think I’ve run into an error. Yeah, I want
to print nums, sorry. And as you can see, nums now has five in it. I can do a lot of other things as well. Instead of that, I could use the pop method. The pop method removes the last element if I don’t pass in an index. It removes the value and returns it, so if I store the return value I can also take a look at what was removed. Now if I print nums and run this: this is where I added 5, this is where pop removed the element at the end, which was 5, and now when I print it, the list ends at 4 again. A cool thing is that we can also specify the index of the element that we want to remove. If I want to remove something at the beginning of the list, I pass in 0, and this removes one from it. So if I run this again, instead of five we’ll see one removed. One is removed, and now it’s two three four five. So again, you can do a lot of things with it.

Now let me show you something that’s really interesting. If I want to print all of the elements, I use the for loop. We’ll discuss this in a later slide, but just to give an example of where looping is useful: if I have 10,000 elements, instead of accessing them one by one and writing things like print nums zero, and then going on with print nums one, nums two, nums three, to print all four numbers, I can write a loop and tell it that I want to do something with everything inside the list. What I want to do goes in the indented part here. Do you notice that I put a tab after this, or rather Visual Studio Code does it for me? This is what’s called indentation. It tells Python which code I want to execute in a particular block. So if I type print x and run this, as you can see, it’s printed all of them, each on a new line. So this is what the loop looks like. Now, if I had not done it this way, if I had done it this way
without the indentation, then Python would have no idea whether or not I want to include this line in the loop. To give you another example, say I want to print done at the end of the file. If I run this, it’s printing done after each iteration. But if I unindent it, if I push it back to the left, and run it again, now it’s printing done once, afterwards. Python looks at it and goes: okay, everything underneath this for loop that’s indented one level to the right needs to be executed within the loop. This is how Python reads your file and understands which code to run and how, so make sure that you understand indentation.

Now that we have looped over this as well, let me show you something about tuples. Instead of the list one two three four five, if I remove the square brackets, what do you think will happen? Press enter. It prints everything, but it immediately gives me an error, because there’s no way to add something to a tuple. If you remember, we have already discussed that tuples are not something you can add to; they’re immutable. So instead of appending, let’s try removing: I don’t think we can remove things either. I press enter, and pop is also not available. But you can print tuples, and you can loop over them just as we did earlier. As you can see: one two three four, one two three four, done.

Now let’s move on to dictionaries. Let me type d i c t, dictionary, or let’s say I’ll call it sal. I’ll define a dictionary the way I did previously: Jane 15, John 20. I need to write each name as a string. Now that that is done, if I print it (let me just remove this), it’s this. If I want to add something to the dictionary, I can just do it this way: I want to add Andy, and he earns 25k. Now if I print this, we’re getting Andy here as well. As you can see: Jane, John, Andy. So this is working. If I want to delete
something from the dictionary, let’s say that I don’t want Andy in this listing, I use the del keyword, and after it I just type sal of Andy. I run this again and Andy is removed. That’s how you remove an element from a list or a dictionary. The other operations we have already performed.

One thing that we haven’t done is looping over a dictionary, so let me show you: for item in, and here things get a bit more involved. Instead of just sal, I can choose what to loop over. If I just want to loop over the keys, I can type sal dot keys. If I print item, what do you think will be printed? Pause the video if you want to think about it. What will be printed are the keys, so the keys are Jane and John. Remember, a dictionary holds key value pairs: these are the keys, these are the values. Jane earns 15, John earns 20. So we printed the keys. Now, if you don’t want to print the keys, let’s say we want to print the values, I use sal dot values, and now I print only the values. But let’s say that I want both of them: I use items, and I ask for key as well as value. Now I can print key comma value, and as you can see I’m printing Jane space 15, which is the key and the value. So we’re getting the data that we want.

Another thing we can do is print Jane in sal, and we get True. It’s the same membership operation. But if I type in dsf, I get False.

Now let’s take a look at a set. Let’s say s, and create a set. Another thing you can do is use the curly brace symbols, but instead of typing key value pairs you just type in the values. If I print the type of s and run this, it’s a set. If I left the braces empty, it would become a dictionary, because that is the default Python behavior. So this is what it looks like: we have a set. Now if I want to add something to the set, I just use the add method. I want to add one, then two, then three.
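The dictionary and set steps from this part of the hands-on can be collected into one short script. This is a sketch under a few assumptions: sal and s are the names used in the walkthrough, while keys_seen, pairs_seen, unique, and other are made up here to show deduplication, union, and intersection, which were mentioned earlier but not typed out.

```python
# Dictionary: add, delete, loop, and test membership.
sal = {"Jane": 15, "John": 20}
sal["Andy"] = 25           # add a new key value pair
del sal["Andy"]            # remove it again

keys_seen = []
for name in sal.keys():    # keys only
    keys_seen.append(name)

for amount in sal.values():        # values only
    print(amount)

pairs_seen = []
for name, amount in sal.items():   # key and value together
    pairs_seen.append((name, amount))
    print(name, amount)

print("Jane" in sal, "dsf" in sal)   # membership checks the keys

# Set: empty braces make a dict, so use set() for an empty set.
empty_braces = {}
s = set()
s.add(1)
s.add(2)
s.add(2)                   # duplicate value, the set is unchanged
s.add(3)

unique = set([1, 2, 2, 3, 3, 3])   # deduplicate a list
other = {3, 4}
print(1 in s, 4 in s)
print(s | other, s & other)        # union and intersection
```

Looping directly over sal, without calling keys, also yields the keys; the explicit call just makes the intent easier to read.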
And now if I print s, it’s one two three. So this works fine. If I want to loop over it, for x in s, let me just print x, and as you can see we get all the numbers in it. And if I print the same thing that we did earlier, one in s, I get True, and if I print 4 in s, I get False. So this is what our code looks like and how it works. Hopefully that was informative. Now let’s move ahead and see what else we can learn about Python.

Now let’s take a look at comprehensions. Comprehensions are quite an interesting feature in Python. They’re easy to use, they’re expressive, and they’re quite fun to work with, so let’s see what they mean. As of now we know how to create a list. Let’s say that I tell you: hey, I want you to create a list, and in that list I want you to enter the numbers from one to five. Or let’s say one to ten; it doesn’t even have to be one to ten, it could be any arbitrary range. How would you go about doing that? Well, we know we can do it using loops, and that’s how many languages ask you to do it. So I’ll create a list which is empty, and let me show you how to loop up to a certain point. I want to loop for x in range, which is a function that allows us to loop over a particular range. I want to loop from 0 to 10, so it will loop over 0 1 2 3 4 5 6 7 8 up to 9, not 10, because the end is exclusive. Always remember that. Then I do l dot append x, and finally I print, not x, but l. If I clear the screen and run this, as you can see, we get zero one two three four five six seven eight nine. Now this is fine, but a major problem you can see with this is that we have to write a lot of code. Wouldn’t it be nice if we could
compress all of this into a single line of code? After all, we are just instructing the computer to add something to a list. This is possible using comprehensions. I open a pair of square brackets and tell it to add x to the list. Now Python is waiting for me to define what x is, so I go for x in range zero to nine, and that’s it. This is something that many other languages don’t let you do; Python is quite easy to use, so you can do that here. We only get zero to eight, because the end has to be 10, like we did earlier, and now this works.

This is just a list comprehension, but I can perform other kinds of comprehensions. Let’s say that I want the tuple version: for this I use parentheses. And let’s say that I want a set comprehension and a dictionary comprehension. All the containers that we have discussed so far can be used with comprehensions, as we are about to see. Comprehensions are quite easy to understand once you get through the basics. Let me print all of them one by one: we will print the tuple version, then the set, and then the dictionary. This is what the version with parentheses looks like. What will it be for a set? Just changing the brackets to curly braces will work. For a dictionary we need a key value pair, so let’s say that we want the key to be the current number, which is the number from 0 to 9, and the value to be the square of the number, which is the number multiplied by itself. So this is what it looks like: I define the key, type a colon, and then define the value, which is x multiplied by x. Now if I print them, as you can see, we get this: for the parentheses version we get a generator object, then we get the set, and then the dictionary.

Generator objects are quite easy to understand. Let me type in a comma, press enter, and
this is where you get an error right now, so don’t do it this way. But you can pass it to tuple, and now you press enter and you get a tuple out of it. So this is one of the catches with tuples: parentheses alone give you a generator, not a tuple, so you can’t just write it this way. But this is what comprehensions look like. They’re quite easy to create once you understand the pattern, and they let you write code that’s short and easy to understand.

Now that that is done, let’s take a look at conditional statements. A conditional statement is a statement that allows you to change the flow of execution, or the path of execution, depending on whether a provided condition evaluates to a boolean value of true or false. Suppose that we had some way of figuring out whether you’re sick or healthy, and you want to make a decision. First we ask: are you sick? Let’s say that you are sick; then we would advise you to go to the doctor. But let’s say that you are not sick: that means the condition evaluates to false, and then you can go out and play football or do whatever you want. This is where conditional statements come into play, and we encounter them every day. You probably encountered conditional statements today as well: when you wake up, you decide whether or not to sleep for an extra five minutes, whether or not you are late for a meeting, whether to take a cab or walk to the place where you work, whether or not to eat at a certain place. These are the kinds of situations that help us think about conditional statements. We have many ways of writing them. One is the if else statement. The statement looks like this: we have
the if keyword, and after that keyword there’s a condition. If that condition evaluates to true, then statements one, everything before the else statement, gets executed. But if the condition is false, then everything underneath the else statement gets executed. We’ll look into it in more detail when we are performing the hands-on, but right now just understand: if the condition is true, statements one is going to execute, and if the condition is false, we’ll execute the else branch, which is statements two.

There are many ways you can construct an if statement, depending on the logic that you’re trying to build. You can construct a nested if else statement, you can construct if, else if, and else statements, and you can construct many other kinds of if else statements. So you’re not limited to just an if and then an else. You can say: if this is not true, then check if this is true, then check if this is true, and so on, and look at many conditions that way. Or you can nest them one inside the other; that works just fine as well.

So now let’s take a look at some conditional statements hands-on. Again we have the same setup: we have an app dot py file, where py is the extension for the Python scripts that we write. There are a few keywords that we use in conditional statements: if, else, and elif, which is short for else if. Let me define a variable named x that’s equal to 20. Now I check: if x is less than 20, then I print x is less than 20. Else, I print x is not less, or you could say x is greater than 20. Now this is a valid statement, but it’s a little incorrect. I mean, it’s not incorrect in the way the syntax is structured, but in what it says. If I run this, it says x is greater than 20, and we know that’s not true, because x is not greater than 20; it’s actually equal to 20.
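A minimal sketch of the if else example just described. The variable message is an assumption added here so the result can be inspected; the walkthrough prints the text directly.

```python
x = 20

if x < 20:
    message = "x is less than 20"
else:
    # This branch also runs when x equals 20,
    # so the wording below is misleading for that case.
    message = "x is greater than 20"

print(message)
```

The syntax is fine; the problem is that else catches every case where the condition is false, including x being exactly 20.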
This is where the else if statement comes in. We can add multiple conditions and attach code to each that executes when that condition is satisfied. So I type elif x equal to equal to 20, and remember to put two equals signs, because we’re not assigning a value, we’re checking whether a condition is true. Then print x is equal to 20. I run it again and now the output is correct. This is where if, elif, and else come into play. A conditional statement will always start with if, not with elif, and neither will it start with else. Then we check: okay, if this is not true, else if this is true, then do this, and otherwise do this. Make sure that you understand that only one of these branches will get executed. In no case will this branch execute and this branch execute as well. So if both the first and the second condition are true for some value of x and I run this, it still executes only the first branch, which is x is less than 20, even though the second condition is also true. Make sure that you understand that. Again, I could have made the mistake of putting a single equals sign, and that would have caused an issue.

Now let’s take a look at nested if else. Nested basically means an if statement inside another one. I want to check if x is less than 20, and inside that, whether it is less than 15 as well, and I print something out there too: x is less than 15. Then I can type an else with some code: x is greater than 15. And finally the outer else: x is greater than 20. If I run this with a value below 15, I get x is less than 20 and x is less than 15, which is true. Now if I give it a value between 15 and 20, I get x is less than 20 but it’s greater than 15.
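The elif chain and the nested version can be sketched as two small functions. The function names classify and classify_nested are hypothetical, introduced here so that several values of x can be tried in one run; the printed labels are the ones used in the walkthrough.

```python
def classify(x):
    """if / elif / else: exactly one branch runs."""
    if x < 20:
        return "x is less than 20"
    elif x == 20:          # two equals signs: a comparison, not an assignment
        return "x is equal to 20"
    else:
        return "x is greater than 20"


def classify_nested(x):
    """Nested if: the inner check only happens when the outer one is true."""
    lines = []
    if x < 20:
        lines.append("x is less than 20")
        if x < 15:
            lines.append("x is less than 15")
        else:
            lines.append("x is greater than 15")
    else:
        lines.append("x is greater than 20")
    return lines


for value in (10, 17, 20, 21):
    print(value, classify(value), classify_nested(value))
```

Note that the nested version keeps the walkthrough's labels, so it has the same blind spot as before: 15 is reported as greater than 15 and 20 as greater than 20. Adding elif branches for the equal cases would fix that.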
So what’s going on here? Let’s take a look at the path of execution. We assign a value of 17 to x. We check this condition: is x less than 20? It is, so we go down and print "x is less than 20". We go down and check whether x is less than 15; that is false, so we move to the inner else and print "x is greater than 15". We skip over the remaining branches of the outer statement, because one of its conditions has already evaluated to true. But now if I set x equal to 21, what do you think will happen? Just one statement gets printed, because it comes here and checks if x is less than 20; well, x is not less than 20, so it skips over everything inside the if, moves to the else, prints that, and it’s done. So this is where nested if statements and else-if statements come into play. You can nest really deeply, but your code becomes really difficult to read. You can do it like this as well: if x is greater than 15 or something like this, and then again underneath that you can type another if, but it makes your code really difficult to read. So make sure that you understand the trade-offs when you’re using if and else-if statements.

Now let’s take a look at looping statements. A loop is a process in which we have a list of statements that execute repeatedly until a condition is satisfied. So let’s say that I want to print something 50 times. Instead of having to write it one time and then copy and paste it 50 times, I can instruct my Python program to repeat printing something 50 times. The condition being checked is: has this statement been printed 50 times? If it hasn’t, then we keep on looping and keep performing the task that is inside the loop; and once the condition is satisfied, we exit the loop and move on to the next statements. So there are many kinds of
looping constructs. The first is the for loop. A for loop basically allows us to write code in which we iterate over a collection of things. So this is what the flowchart looks like: we check whether there is another item left in the list. If there are no items remaining to perform the task on, then we move on to the rest of the code; else we perform the statements, move on to the next item, perform the statements, move on, and keep doing that over and over again. This is where the looping statement comes into play. We’ll have an example of this when we’re performing the hands-on, so don’t worry if you don’t understand it as of now. Then comes the while statement. The while statement is also used for loops, but the fundamental difference is that a while loop repeats a bunch of statements for as long as a particular condition is true, instead of looping over a particular list of things. So here we’re setting a variable a to 1, then we’re checking whether a is less than 5. 1 is less than 5, so we print 1 and add 2 to a: a is 1, and 1 plus 2 equals 3, so now a equals 3. Is 3 less than 5? Yes, it is, so the condition evaluates to true and we print 3. We add 2 to it: 3 plus 2 equals 5.
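The while loop being traced here is shown on a slide that the transcript does not capture. A minimal sketch of it, assuming it matches the narration (start at 1, loop while a is less than 5, add 2 each time), might look like this. The `printed` list is an addition for illustration, so the values can be inspected afterwards.

```python
a = 1
printed = []  # not part of the narrated example; records what was printed

while a < 5:
    print(a)
    printed.append(a)
    a += 2  # change the value the condition depends on

# The loop body runs for a == 1 and a == 3; it stops once a reaches 5.
print("loop finished, a is", a)
```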
So a now contains the value 5. We check whether 5 is less than 5; that’s obviously not true, it’s equal to 5, so we end the loop, and since there are no other statements left to be executed, we exit the program. Now, when we’re looping, there are certain keywords that help us in the loop: break and continue. So what do they mean? Well, the break keyword allows us to end the loop prematurely. Let’s say that I have a list that contains the letters from a to z. I want to print everything in the list, but if the character in the list is d, then I want to stop the loop. I don’t want it to continue any further; I just want to exit the loop, and anything after that is not going to be printed. This is where the break statement comes in, and it’s coupled with an if statement. We’ll take a look at this in an example when we’re performing the hands-on, but for now just understand that if you have a condition in which you want to exit a loop abruptly or prematurely, before the loop has actually ended, then you can use the break statement. The continue statement, on the other hand, does something quite similar. You can take a look at the break flowchart as well if you are unfamiliar with what it’s doing: basically, we check the break condition; if the break condition is true, then we exit the loop; if the break condition is not true, then we continue with the loop as long as possible. Again, this is a bit abstract, so I’ll explain it in the hands-on when we’re performing it with the code. Then comes looping using continue. Continue does something similar to break, but instead of breaking out of the entire loop, we skip over the current iteration. So let’s say that I am again doing the same thing: a list containing all the letters from a to z, and I want to print it. They can be ordered in any way that you like. What
we’re doing is we’re taking a look at the letters and printing each one on the screen, but if the letter is d, then we don’t want to print it; we don’t want to do anything that’s being done to the other letters. So we look at it, we see, okay, if the current character is d, then we use the continue keyword, and we skip over the rest of the statements in the loop body and just move on to the next character. So that’s what continue does. And now we’ll take a look at the hands-on, in which we’ll be performing tasks using all the keywords that we have discussed so far.

Let’s take a look at some loops. Again we are in our app.py file, and I’ll write some code for you to run to understand loops. We’ll take a look at the for loop and the while loop. The first thing I’d like you to see is printing a sequence of numbers, so let me just show you. This is a classic example of a loop: for x in range. range is a pretty common function in Python; if you’ve never used it, I would suggest getting comfortable with it. I’ll explain the code in a moment. What we’re doing here is instructing Python: hey, I want to loop over something; every time the loop iterates, I want the value to be stored in x; and the range goes from 0 to 10 minus 1, which is 9. Always remember, range is exclusive of the end, so if I type 11, it will go from 0 to 10. Now when I print x, it starts from 0: it looks at 0, prints 0, increments by one, goes to 1, prints 1, then 2, 3, 4, 5, 6, 7, 8, and it goes to 9, prints 9, and then it is out of range, so the loop stops. So let’s see what happens if I run this. As you can see, it’s printing. But let me show you a trick: when you’re printing something, pass end equal to a comma and a space. So now if I print this, instead of each thing being printed on
its own line, it’s printed with a comma and a space before the next element, as you can see; that’s the reason why there’s a comma and then a space at the end here as well. So this is how you loop over a range. But let’s say I don’t want to loop over a range; let’s say that I want to loop over a list of numbers. So nums equals 1, 2, 3, 4, 5, or, if you remember comprehensions, x for x in range 0 to 10. Now we have all the numbers that we had earlier, and this is what it looks like. I run this again, okay, for x in nums, and now it’s doing the same thing. So we have numbers here, and we are looping over each number. Now that that is done, we can take a look at some of the statements that we learned earlier: break and continue. Let’s say that I only want to print up to 4; let’s say that this is a shuffled list, and I want to break the loop and not print 4. So if x == 4, then break; else we print x. I run this: 0, 1, 2, 3; it encounters 4, and since the condition is satisfied, it executes the break statement and gets out of the loop. The program has not ended yet: if I want to run something below the loop, I can type, say, a print of "done", and it prints that. What’s happening here is that as soon as I encounter the number 4, I quit the loop, so everything after 4 is not processed in the loop. If I change break to continue, notice what happens: we have 0, 1, 2, 3; 4 is skipped; then we have 5, 6, 7, 8, 9. So what happens is it looks at the number, it’s 4, and we continue: we say we don’t want to print 4, and we move ahead. This is something that can be helpful. Let’s say that you’re trying to print something out of a list, and you don’t want to print the names of people whose salary is less than, let’s say, 40k. What you can do is, for each x, if x.salary
is less than 40, then continue. This would have the same effect as it did earlier; this is what it means. So this is where we move on to the while loop. Let me just define a number, a, equal to 10, and instead of for, we’ll use a while loop: while a is not equal to 0, print a, and then a minus-equals 1. If I run this, it prints the numbers from 10 to 1 in reverse order. So what do you think is happening here? Basically, I have defined a variable named a, and I’m checking whether or not it’s equal to 0; if it’s not equal to 0, then perform this task. One thing that you need to make sure of is that if your condition checks a value, you change that value inside the loop. If I don’t do this, then the loop will run forever; as you can see, it’s going to keep printing 10. I’ve interrupted it using Ctrl+C; I don’t want it to run forever. With the decrement in place, it looks at the number, it’s 10, prints 10, and decrements by 1, so it becomes 9. 9 is not equal to 0, so print 9, go ahead, remove 1; 8 is not equal to 0, print it; then 7; and this is how we go on towards 0. We print 1, then we decrement by 1, which makes it 0, and we check the condition again: is 0 not equal to 0? That is false, so the statement "a is not equal to 0" evaluates to false.
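The loops from this hands-on can be sketched as below. The exact code typed on screen is not in the transcript, so this is a reconstruction from the narration; the helper names `until_four` and `skip_four` and the `countdown` list are hypothetical additions that make the behaviour easy to check.

```python
# range(10) yields 0 through 9; the end value is excluded.
for x in range(10):
    print(x, end=", ")  # end=", " replaces the default newline
print()

# The same numbers as a list, built with a comprehension.
nums = [x for x in range(0, 10)]


def until_four(values):
    """Collect values until a 4 is seen, then leave the loop with break."""
    seen = []
    for x in values:
        if x == 4:
            break  # exits the loop entirely
        seen.append(x)
    return seen


def skip_four(values):
    """Collect every value except 4, using continue to skip it."""
    seen = []
    for x in values:
        if x == 4:
            continue  # skips the rest of this iteration only
        seen.append(x)
    return seen


print(until_four(nums))
print(skip_four(nums))
print("done")  # code after the loop still runs after a break

# while loop: count down from 10 until a reaches 0.
a = 10
countdown = []
while a != 0:
    countdown.append(a)
    a -= 1  # without this line the condition never changes and the loop never ends
print(countdown)
```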
So what is happening is that we keep printing as long as this statement is true, and when a becomes 0, this statement becomes false and we move out. As long as this evaluates to true, our loop is going to continue; another way of reading it is: while this is true, keep executing whatever we are executing. So that’s a while loop. Now, why use a while loop over a for loop? There can be many scenarios where you want to use for loops. Mainly, I like using for loops when, as you can see if I go back to the earlier code, I have a list of values, something that I can easily access in a loop like this. Doing this with a while loop would be a little difficult; let me show you how. I would need an index, start it at 0, and loop while index is not equal to the length of nums; that is one way to do it, or you can loop while index is less than the length of nums. I print the value of nums at index, and then do index plus-equals 1. If I run this again, it does the same thing, but I had to write a lot more code, and I have to keep track of this index variable. Python’s for loop does all of this behind the scenes for us, so we don’t have to manage these external variables ourselves. There are advantages to this, but if you want to have access to the index variable, then you can use a while loop, or you can use a clever for loop as well. For most scenarios, when you have a list of values that you can iterate over, using a for loop is better, and when you’re working with a specific condition, using a while loop is better. It’s completely up to you; it’s your choice; whatever feels like the right tool, you can use that, but I would certainly recommend using for loops for iterable values.

Right, now let’s take a look at functions in Python. Functions are one of the most fundamental things that you need to understand when you’re learning any programming
language. They allow you to encapsulate code and then reuse it in many other places. We’ll take a look at what that means and how you can accomplish things with functions, but the first thing that you need to understand is that it’s a very important topic, and you need a really good grasp of how functions work before you can move ahead and learn other things in Python. So make sure that you pay attention and understand it thoroughly. A function is a block of code that is organized as a reusable set of instructions used to perform some related set of actions. Basically, it means that we wrap up a piece of code that is entirely used to perform one thing; that wrapped-up piece of code is then given a name, which is the name of the function, and then we use that name instead of copying and pasting the entire code again. We’ll take a look at what that means and how to accomplish this; if you don’t understand it as of now, don’t worry. There are two kinds of functions in Python. The first is a user-defined function, which is something that we create as developers, as programmers; and then there are functions that are built into Python, so you don’t have to create their implementation yourself; Python has already done it for you. User-defined functions, as I’ve already told you, are created by users. This is what the syntax looks like: we use the def keyword, def being shorthand for define; we give the function name; we accept some arguments, which are basically values or things that our function will perform some actions on; then we do some calculations or any other kind of task that we wish to perform; and finally, and this is not a necessary statement, if you want to return something, you can do
that. Returning something basically means that this would be the result of the function. Sometimes your function may not have any result to give back; if that’s the case, then you don’t return anything, you just omit the return statement completely. To give an example, there’s a function right beneath it: def add, where add is the name of the function; we pass in two parameters, a and b; we create a sum, which is equal to a plus b; and we return the sum. To be quite honest, this is a very trivial example, but to understand how functions work we need to move our way up from really simple to complicated examples. There are many built-in functions as well. These functions have already been written by the developers of the Python language for us to use. One of them is abs, which returns the absolute value of a number: if you have a number that is negative, say negative 17.4, and you use the abs function on it, you get 17.4. Absolute basically means it converts negative to positive; in other terms, you get the size of the number regardless of its sign. Then you have all, which returns true if all items in an iterable object are true: if I have a list and everything inside that list is true, then it returns true; if any one of them is false, then it returns false. Then we have any, then ascii, then bin, which gives you the binary version of a number; we have bool; and we have many other functions, min, max, and so on. So you can use them if you wish to. If you are trying to perform some task, then instead of writing your own function, check whether there is already a function available for you. Then there are lambda functions. This is a concept that gets overlooked quite often, but lambda functions are very powerful. Lambda functions are basically anonymous functions that have
no name and contain a single expression; a single line of code is there in a lambda function. These are used to pass some task that we want to perform into a function. Let me give you an example of how lambda functions can drastically improve the code. Let’s say that I want to create a function in the way that I have done so far: a function called multiply. I can write def multiply, x comma y, and then return x multiplied by y. But since, as you can see, it’s a very simple line of code, I can simply do this: r will be the variable that points to the lambda; the lambda will accept x and y as parameters and will return x multiplied by y. Now, to call the lambda, all I have to do is use r, which is the name bound to the lambda, put parentheses after it, and inside the parentheses pass in the two arguments. If I don’t put the parentheses, then it would simply be a variable that holds something; I’m not calling the function, I’m not executing the code that’s inside it. So I pass in 12 and 3, and I get 36.
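The two ways of writing multiply described above can be compared side by side. This is a small sketch based on the narration; the inputs 12 and 3 are an assumption, chosen because they give the 36 mentioned above.

```python
# Regular function definition.
def multiply(x, y):
    return x * y


# Equivalent lambda, bound to the name r as in the walkthrough.
r = lambda x, y: x * y

# Calling requires parentheses with the arguments inside.
print(multiply(12, 3))
print(r(12, 3))

# Without parentheses, r is just a reference to the function object;
# nothing is executed.
print(callable(r))
```

For a named, reusable function, def is usually the more readable choice; lambdas tend to be most useful when a short function is passed directly as an argument.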
Now another thing that lambda does really well is that it allows you to remember some things. Let’s say that I have created a function, def my function, and I pass in a number named n. This number will be remembered by the lambda that I create inside that function. So I return a lambda which accepts another parameter called a, and what it does is take that number and add n to it. This way I can create multiple functions that do the same task but where the value of n is different. This is a very trivial, very mundane example, but you can think how far you can take it. There are many great Python programmers who use lambdas in very clever ways to write code that’s very readable, quite simple, and quite easy to understand once you understand what lambdas are, so take a look at that as well.

Now let’s implement some functions. Again we have the same setup here. Let’s say that I have some task that I want to perform, and it has several steps, and all of these steps need to be performed regardless of where I execute the task. This example will also show you the importance of functions. Let’s say I want to define download file, and it takes a url. Now, I’m not going to actually download the file; I’ll just print the steps. So: establish connection; these are the typical steps you need to perform when you’re trying to download a file from the internet: open the connection, download data, close connection. Now, if I want to download any file, let me also print the url. Another thing I’d like to show you is that you can add, or concatenate, strings this way: if I do this, then it will print "establish connection", a space, and then whatever url is, right at the end of it. Open connection, and let me
print the url, download data, and close connection. All right, so let me call this function, download file, and let’s say that the url is ftp://www.abc.org. Now if I run this, it’s going to perform these tasks. The major advantage of using a function is that if I want to perform the same task in some other place, I can copy and paste the call into different places or files, and it will do the same thing over and over again. So if I run this, as you can see, it’s done everything for me. Let’s say that I change the url in the other calls and run this: it has downloaded each of them. But now let’s say that I want to change something in these steps: after downloading, I want to log it somewhere; let’s say I want to make sure it is recorded in a database that I downloaded this file. So: log to db, and this is the url I downloaded it from. I’ve added two lines, so this is what it looks like now. Let me clear the screen and run this again so that you can see: it has downloaded, and as you can see, this step is now added to all three of the files. This is the advantage of using a function: whenever you want to make a change to some particular step of some well-defined process, you can make the change in one place, and everywhere that process is being performed, the change will be reflected. So if I add something else to the download file procedure, it will be executed by all the functions or all the files that call the download file function that I have just described. So that’s one advantage. Now let’s take a look at returning some data. Let’s say x comma y: I take in two values and I perform some task; I want to return x multiplied by y. Again, this is a very trivial example; let me call it multiply. And to multiply two numbers, you
multiply x and y. Now, x and y are going to be numbers, so let’s say 10 comma 15. If I run this, 150 would be the expected answer; if I run it with 25, 250 would be the expected answer. So we have gotten both of them. But one thing to notice is that it’s a really simple function, so defining it like this takes more time than it needs to and is really unnecessary. One way you can change it is to write it like this; let me just show you the code that was there earlier. I am going to change it to multiply equals a lambda that takes in x and y and returns x multiplied by y. Run this again, and it’s performing the same task and printing the same thing that we had earlier. So there’s no change in functionality, but we’ve changed the implementation. So what is the advantage? Now I can send this function as an argument to any other function. This is where it gets a little difficult to understand. One thing I can do is def create multiply: it takes x, and y will be given to me at the time of execution. So what I can do is return a lambda that takes y as a parameter and returns x multiplied by y. So let me just delete this. Now I create multiply by calling create multiply with 10, and then I just pass in 15 and 25; I run this and I get the same answers. Now, as I’ve already told you, this is a function that you can pass into another function. Let’s create another function named execute: it takes in a function and an argument, and all it does is return the result of calling that function with that argument. So I call execute with multiply as the function and 15 as the other argument. I run this again; it’s doing the same thing, but execute is doing it for me. Now I can make changes to it: I can basically log the calls to the console, "called f with" and the argument. And okay, I’ve gotten an error; yeah, I
need to convert it to a string. Now it works: it called f with 15, and it called f with 25. So I can log this to a database to make sure that it’s being executed correctly, and so on. There are multiple things that you can do, and lambdas are actually very powerful, but this is how you get started with them. Hopefully this was informative, and let’s move on to the next topic.

Now let’s take a look at arrays in Python. In Python, arrays are basically lists. Arrays are used to store multiple values in a single variable. We’ve already taken a look at tuples, lists, dictionaries and sets, and we’ll be taking a look at a lot more in a future video, but arrays are basic programming constructs. In Python you have arrays in the form of lists, so you don’t call them arrays, you call them lists, which sounds more readable and more intuitive. Python does not have built-in support for arrays, but Python lists can be used instead. So instead of creating three variables to store the three strings, as in the code listed below, we can store them in one array and use them from there. There are many operations that you can perform on an array, and we’ll take a look at all of them one by one. First, we can access an element in an array. Accessing an element basically means accessing it using its position in the array: if it’s at the first position, we use index zero; the index is the number inside the square brackets after the array name. Similarly, if it’s the second, then we use 1, because it’s indexed from zero. If we use an index that’s not present, then we get an error, an IndexError. Then there’s modifying an element. If you’re using a tuple, that’s not possible, but in lists it’s completely possible: you can modify the element using the same indexing that we used previously. You can take a look at the length of the array; length basically means how
many elements there are in the array. We can loop over an array and perform tasks the way we want to. We can add an element to an array using the append method, which adds the element at the end of the array. We can remove an element from a given position: we could write cars, which is the name of the array, dot pop 1; this removes the second element in the array. If I don’t give an argument, then it removes the last element automatically: instead of passing in 1, if I had passed nothing, it would just remove bmw from the cars array. Then we have removing specific elements: if I have a specific element that I want to remove, I can use the remove method. There are others, and we’ll take a look at all of them.

Then there are Python classes and objects. Classes and objects are quite interesting: they’re easy to understand, interesting to use, and interesting to build applications out of. They are the fundamental building blocks of object-oriented programming, and we’ll take a look at that in a later module, but here you need to understand what classes and objects are. A class is basically a container or a blueprint that defines what an object should look like, and the object is basically the in-memory representation of a class. Since Python is an object-oriented programming language, almost everything in Python is an object, and it has properties and methods. A class is like a blueprint for creating an object. In the example, we have created a class and said that whenever you create an object of this class, it should have a property named x, and the value of that property should be 5. Now, this is a very high-level overview of classes and objects; we’ll go into much more detail in a later video, but here you need to understand how classes and objects are related: classes are the definition, objects are the concrete instances; objects are what you use in the real world, classes are
what define what your objects’ structure should be.

And then we come to file handling. So what is file handling? File handling is one of the most important aspects of any application that you use. It’s used to permanently store data on a disk, a network location, or anywhere else; we need file handling whenever we have to read from or write to files. Python provides a really easy way for us to read and write files. There are many file operations that you can perform; some of the most important ones are opening a file, reading a file, writing or creating a file, and deleting a file. To open a file, the syntax is quite simple: you use the open function. It takes two parameters: the first parameter is the path to the file, and the second parameter is the mode. The mode is a string that is r, a, w, x, or something like that. If you pass in r as the mode, the file is opened in read mode; that means all I want to do with this file is read from it; I don’t want to make any changes, I just want to read the data. The a mode is append: this means that anything that’s existing in the file is not overwritten; data is just added at the end of it. Then there’s w, which stands for write: this opens the file, creates it if it doesn’t exist, and overwrites any data that’s already there. So that’s the difference between writing and appending: appending adds data to the end of the already existing content in the file; writing overwrites everything. Then there’s x, which just creates the specified file, and it raises an error if a file with the same name already exists. We can read the file using the read method; this reads the entire content of the file and returns it to us. We can also specify how many characters we want to read, and if you want to read it line by line, then we can just loop over the file. We’ll take a look at all of that in the hands-on. Then, to
write to an existing file, we must add a parameter to the open function: a, as I’ve already told you, appends items to the file, and w writes contents to the file. And x is used to create a file if it doesn’t already exist; if it does already exist, then it raises an error. And finally we come to delete. Deleting allows you to remove a file that already exists. You can do it using the os module: you write os.remove and give it the path of the file; if the file exists, it will be removed from the disk.

So now, finally, let’s take a look at the hands-on for file handling; we’ll see how you can perform file handling in Python. As of now we only have an app.py file in our hands-on folder, so let’s first try creating a file. I will create f as the file handle using the open function; I will write in the file name, let’s say it’s sample.txt, and since I want to create it, I will use the x flag; and then I will close it. Closing the file is important, because the changes that have been made so far may still be sitting in an in-memory buffer; when we call f.close, all those changes get written to the file. So make sure that you close the file so that the changes that you made to it get persisted. Let me run this, and now, as you can see, sample.txt has been created. But if I run it again, I run into an error that says file exists, which means that the file that I wanted to create is already there. Now that we have done that, let’s try some other tricks. The thing is, whenever I use the open function, it’s obvious that I want to close the file as well. So instead of having to write f equals this, then write everything, and then close it, I can wrap everything inside a with statement: with open, the file, and the mode for writing to it, as f. Now anything that I
write within this block, indented, is part of it, and after this block the file is closed. So instead of having to manually write f.close, this is done automatically for me. So let me just do this: f.write, and what I want to write is "sample line". Save it, run it, and it’s done. I open sample.txt, and "sample line" is written to it. But let me add a backslash n and then write another line, and see what happens. I run this again and open it: there’s "sample line", then the new line. But if I run the same code again, it doesn’t do what I want: if I do this again, the file has the same thing, and again it has the same thing. Instead of this being added to the end of the file, what’s happening is that the file is opened, everything inside it is removed, and then this is written to it. But what I can do is change the mode to a. Now if I run this once and open sample.txt, the line is there, as you can see. Run it again, open sample.txt, and there’s a new line; run it again, another line; and if I keep running this multiple times, as you can see, each time the line has been added at the end of the file. So that’s what append mode does. Now, if I want to just read from the file: for l in f.readlines, and all I want to do is print l. As you can see, it’s printing the lines that I want to print. Each line still carries its backslash n when it is printed, so this is what the output looks like. So we are able to read from the file, we are able to write to the file, and we are able to do a lot of things. With that in mind, our file handling is done. Obviously, there’s a lot more that you can do with file handling, but this should help
you get started. It's very easy in Python. As I've already told you, use the with keyword to avoid having to write f.close() on every file handle that you've opened. If you don't use the with statement, then make sure that you close the file before ending the script, or your changes may not get written to the file that you're trying to write to. So let's see what else we have to do.

All right guys, so now we'll take a look at object-oriented programming, or OOP, in Python. Let's take a look at the agenda. We'll begin by understanding what object-oriented programming is, then we'll take a look at a real-world example of object-oriented programming. Then we'll move on to the fundamental concepts, like classes and objects, then we'll learn about inheritance, encapsulation and polymorphism. After that we'll learn about modules and the standard library in Python, we'll learn how to install packages that you might need to use, and finally we'll learn about exception handling, which is one of the most important concepts in all of Python: it allows you to write code in a way that manages errors at runtime. So we'll take a look at all of that. Let's move on and take an introduction to object-oriented programming. Object-oriented programming is a programming paradigm where you model a real-world entity, which is called an object. So object-oriented programming is basically about mapping a real-world object into our code. Let us consider an example: let's say that we have an object for a parrot. Now a parrot has certain attributes and certain behaviors. Attributes are basically data about the parrot that we are interested in, such as name, age and color. This data can be selected arbitrarily: only the data that you need about the parrot needs to be in the object. There might be other data about the parrot that you don't need, such as the species of
the parrot and so on; if you don't need that, you don't need to include it. And then there's behavior: singing, dancing, or any other kind of behavior that a parrot might have could be attributed there. A behavior is basically a thing that the object performs, some task or process that the object, here the parrot, carries out. So that was an introduction to object-oriented programming, and now we'll take a look at its basic principles. Some of the basic principles are polymorphism, inheritance, encapsulation, and classes and objects. Let's have a real-world example of object-oriented programming. As human beings, we can be classified into male and female, and these are the common functions between men and women: they can speak, they can hear, they can walk, they can see. Now consider human being as a class. A class is basically a description of what an object should be like. Think of it as the blueprint of a house, which describes what the house should look like: how wide a room should be, what the dimensions of a room should be, how many rooms there are, how they are connected with each other. That's what a class is: a blueprint that objects are created from. Every human being has certain body parts: nose, hands, legs, heart, eyes and so on, and they have some common functions as well: they can walk, listen, speak, see and smell. These common features are called attributes. Attributes are basically data about the person, so the person's name, age and all that could be considered an attribute. On the other hand, they could have certain behaviors as well; we'll take a look at that too. Male and female are classes inherited from human being. A human being is supposed to have certain features; these features are inherited, and the specific things that make them different are then overridden. Let's give an example: let's say that there's a person who's a male and the name
is Victor and the age is 24. Now we can have an object of the class Male, and we can have attributes such as name and age associated with it. So a class is just a logical definition; objects are the physical, in-memory representation of it. Then, you don't need to know the details of how you walk, how you talk, how you listen and so on. These details are hidden and encapsulated. Encapsulated basically means that they are not visible to the person and that they are bound together with the data that they operate on. For walking you need two legs, for hearing you need ears, for seeing you need eyes, and all of that is encapsulated, or grouped, inside a single entity or a single object. On the other hand, take the class of a woman: a woman can be a wife, a mother, a teacher or any other profession that she chooses, and being able to take on so many roles at the same time is what is called polymorphism. So let's take a look at classes and objects; we'll take a look at other concepts such as encapsulation and polymorphism in more detail in later slides. What are objects and classes? Well, a class is like the blueprint of a house, and an object is the actual house that has been built from that blueprint. Or you can think of a class as a definition of what a house should be like, and then the object is a concrete implementation of that blueprint. The object is the basic unit of object-oriented programming; that should be clear from the name. Now an object represents a particular instance of a class. There can be more than one instance of a class: for instance, if you have a class, you can have seven objects of the same class, and each of those instances is called an object. Each instance can hold its own relevant data, so if you have seven houses, they can all hold data about the color of the walls, and they can all be
different. So that's the difference between classes and objects, and we'll discuss more of it when we perform the hands-on and go through the code, but as of now, that's what it is. Objects with similar properties and methods are grouped together to form a class. So that is what classes and objects are. Now, how do you create a class in Python? In Python we use the class keyword, then we give the name of the class, and then underneath that we define the variables and everything else in an indented code block. The attributes are the variables and the behaviors are the functions. Let's move ahead and see what all of these mean. After that, if you want to create an object, you can just call the class like a function and it will return an object to you. There are a lot more intricacies in that, which we'll discuss in the hands-on, but as of now this is enough for this piece. Here object1 is an object of the class named ClassName. How to access class members is one of the most important concepts. You create an object using the class: right after the class name you put parentheses, like you're calling a function, and it returns an object to you. On that object you can use the name of the class member that you want to use. As we know, we had created a variable named variable inside ClassName. So after we create an object, we use the object name, then the dot operator, which gives us access to the members of the object, and then the name of the member. That way we have several objects here; now we print them, and this is what it looks like. Let's move ahead and see what this means. There's another thing called the init method. This is one of the several magic methods that are present in Python, and we'll discuss those as well. The underscore underscore init underscore underscore method is also called a magic
dunder method because it's got two underscores right at the beginning and right at the end. This is a special syntax for special methods that are supposed to perform special tasks. To give you an example, we have the `__init__` method, which is also called a constructor method. This is a method that gets called when the object is first created, so when you create an object out of a class, it gets called automatically. As you can see, we have defined a class and we have defined an `__init__` method that takes in certain parameters. self is the object that is being created, so if you want to make some changes to the current object, you can do it here, such as attaching some properties. We are adding name, branch and year to the current object, we print that a student object is created, and then the object is handed back to the caller. So when we create an object, these things get set on the object, and then that object is returned and assigned to ob1. Then we use the print_details method, into which self is passed, which is a reference to the current object, and we print the name. As you can see, this is what it looks like. So here is a task for you: create two vehicles called car1 and car2.
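
Before the exercise continues, the constructor walkthrough above can be sketched roughly as follows. This is only a minimal sketch: the class name Student, the third attribute (year) and the sample values are assumptions made for illustration, since the slide itself is not visible in the transcript.

```python
class Student:
    def __init__(self, name, branch, year):
        # __init__ runs automatically when Student(...) is called.
        # self is the object being created; we attach data to it here.
        self.name = name
        self.branch = branch
        self.year = year  # assumed attribute, for illustration only
        print("A student object is created")

    def print_details(self):
        # self is passed in automatically and refers to the current object.
        print("Name:", self.name)
        print("Branch:", self.branch)
        print("Year:", self.year)


# Hypothetical sample values.
ob1 = Student("Paul", "CSE", 2019)
ob1.print_details()
```

Note that `__init__` does not return the object itself; Python creates the object, passes it in as self, and then the call Student(...) evaluates to that object.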
Set car1 to be a red convertible worth 70,000 with the name Ferrari, and set car2 to be a blue van named Jeep and worth 15 thousand dollars. How do you do that? Well, we create a class, Vehicle. After that we define the name and kind there by default, and we set the color there. We get the description defined there as well, and then we return the description. Your code will go there: you will create the class, create objects and print them. So how do you create an object, and how do you set these values on the vehicles? That's what you will have to work out. In the hands-on, let's see how to do a simple task like this. To give you an example of how things work, we have an app.py file inside a folder, and I'll be running a command prompt here. I'll show you what classes and objects are like. Let me create a class named Dog, and create an `__init__` method. What I'll do is accept the name and set self.name equal to name. That's it. With def, I'll create two behaviors. A dog can talk, and the sound that a dog makes is woof. And I think I'll have to accept a self parameter. Always remember, when you're defining a method in a class, the first parameter needs to be this self parameter. It is injected by default, so you don't have to pass in the current object or anything like that; this is done for you by the Python interpreter. So let me just correct this, and then I'll create another function called print_name. Again this will need a self parameter, and what I'll do is use the print function and print self.name. Now let me show you a trick. We know how to concatenate strings; let's see how to interpolate them. String interpolation basically means that we create a string and leave some places in it to be filled with data later, and Python fills them in when we provide the data. So let me show you.
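
As a minimal sketch of the string interpolation idea just introduced, the snippet below uses the built-in str.format method with empty curly braces as placeholders. The variable names and the second example are assumptions added for illustration.

```python
name = "Charlie"  # sample data

# Each pair of curly braces is a placeholder that format() fills in.
greeting = "My name is {}".format(name)
print(greeting)

# Several placeholders are filled from left to right.
line = "{} says {}".format(name, "woof")
print(line)
```

The same placeholder syntax works with any value that can be converted to a string, so numbers can be passed to format() without converting them first.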
I say "my name is" and I put in two curly braces: I open a curly brace and I close the curly brace. Now I use the dot format method and I pass in self.name. What will happen is that Python will look at the string, it will go "my name is", then it will take a look at these curly braces and understand that I want to provide some data for the curly braces to be replaced with, and in the dot format method I pass in that data, which is self.name. Now let's call the functions. Let me create a Dog, which is the class that I have just created. This will require a name; I'll call it Charlie. I think I ran it by mistake. So the name of the dog is Charlie, and now I can call the actions. If I call dot talk, it will print woof, and then I type dog dot print_name. Save it, and now to run it I can just type python and then the name of the file, which is app.py. I press enter, and as you can see it prints woof and "my name is Charlie". Now another thing that you need to notice here is that the `__init__` method has a special meaning: it gets called automatically whenever I create an object. Let me print something here just to show that: "object with name {} created". I use the format method again, and in that method I'll pass in the name. That is done, and this dog was created; now I create another dog, let me call this one Bruno. Now Charlie will talk, then Bruno will talk, then Bruno will speak its name, and then Charlie will speak its name. Now that that is done, watch what happens when I run the code. Object with name Charlie gets created, object with name Bruno gets created. As we have created Charlie first and Bruno later on, this function gets called for Charlie, then it gets called for Bruno. We create both the objects, then talk is the method that gets called: first Charlie talks, then Bruno talks, and then we ask them to print their names, so Bruno prints its name and then
Charlie prints its name. So this is what it does. Another thing that I'd like to show you is what happens if I print them. These are objects, they are not strings, so if I print both of them, what do you think will happen? We get the name of the class, which is Dog, and we get the ids. That's not very helpful. What if, whenever I print Charlie or Bruno as objects, I want to print their name? Since we're in a class and we have magic methods, I'll show you one more: it's called `__str__`. I have to return some string here, so all I have to do is return self.name. As you can see, now instead of printing that long string with the class name and all that, it prints the name of the dog, that is, the name attribute of the object. So this is what it looks like. There are many other magic methods, and I encourage you to check them out on the documentation page; I would definitely suggest that you read more about them. The topics that I want you to take away are the constructor method, how to create a class, how to create an object, and how to call methods on those objects. Now these two objects have the same structure, but the data in them is different: the name attribute is different for both of them. This is why, when we call the print_name method, Bruno prints Bruno and Charlie prints Charlie. Make sure that you understand that as well. It's a difficult concept to get your head around when you're starting out with programming, but once you get the hang of it, it's very easy. With that in mind, let's move on to the next topic.

All right, so now let's take a look at inheritance in Python. Inheritance is basically acquiring the properties or behaviors of another class. The class that is being inherited from is called the parent class, and the class
which is inheriting things is called the child class. There are many other terminologies; we will take a look at them one by one. To give you an example, you would have inherited a few qualities from your parents, and they would have inherited certain qualities from your grandparents, and this would have kept going on however long your family chain is. In a family tree, traits such as hair color and eyesight get passed down from generation to generation. So this is what inheritance means. There are many different kinds of inheritance in Python, and for context, these are the five types: single inheritance, multiple inheritance, multi-level inheritance, hierarchical inheritance and hybrid inheritance. Let's take a look at them one by one. Single inheritance basically means that there's one parent class and one child class, and the child class is the class that inherits from the parent class. Here's an example: let's say that we have a class Fruit and then a class Citrus, and the class Citrus inherits from the class Fruit. As you can see, the syntax is quite easy to understand: we create a class using the class keyword, we give the class a name, and right after the name we put the class that we want to inherit from in parentheses. After doing that, inheritance is complete, and then we can perform other tasks as well. Now let's look at multiple inheritance. A class can inherit from multiple classes. This is not possible in many other languages, such as Java, but in languages such as C++, and also in Python, you can inherit from multiple classes. Just be very careful: inheriting from multiple classes can cause a lot of issues if you're not careful, so make sure that you understand that. So we have a class A and we have a class B, and we want to inherit from
both of them. To do that, instead of just passing one class name in the parentheses like we did earlier, we can pass multiple class names separated by commas. That's how you can inherit from multiple classes. Then there's multi-level inheritance. Multi-level inheritance is just single inheritance repeated multiple times across a chain of classes. Let's say that I have a grandparent class, then a parent class, and then a child class. The grandparent class gets inherited by the parent class, and then the parent class gets inherited by the child class. That's how the relationship goes: at each level, one class inherits from one other class. So here is an example: let's say that I have a class A and a class B, class B inherits from class A, and class C inherits from class B. This is multi-level inheritance. Then comes hierarchical inheritance. In this one, a single class is inherited by multiple classes. Just for clarification: your parents have certain qualities that get inherited by you and your siblings, your brothers and your sisters. This is hierarchical inheritance. And your grandparents had certain features that got inherited by your parents and their siblings, so your father, your uncle and everyone else inherited the behaviors, the mannerisms and certain other qualities like hair color or eyesight from your grandfather. Let's take a look at the code first: we have a class A, and this gets inherited by both class B and class C. And finally we have hybrid inheritance. Hybrid inheritance is basically a combination of any two kinds of inheritance that we have seen earlier. In the example, we have hierarchical inheritance between class A and classes B, C and D, and then we have multiple inheritance from classes B and D into class E. So this is how hybrid inheritance works.
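
The kinds of inheritance listed above can be sketched in a few lines. This is a hedged illustration rather than the code from the slides: the method names, the Grandparent/Parent/Child classes and the sample return values are assumptions chosen for the example.

```python
# Single inheritance: one parent class, one child class.
class Fruit:
    def describe(self):
        return "I am a fruit"


class Citrus(Fruit):
    pass


# Multi-level inheritance: each level inherits from the one above it.
class Grandparent:
    def surname(self):
        return "Smith"  # hypothetical value


class Parent(Grandparent):
    pass


class Child(Parent):
    pass


# Hierarchical inheritance: B and C both inherit from A.
class A:
    def from_a(self):
        return "from A"


class B(A):
    def from_b(self):
        return "from B"


class C(A):
    def from_c(self):
        return "from C"


# Multiple inheritance: D names two parents, separated by a comma.
# Combined with the A -> B and A -> C branches, this is a hybrid.
class D(B, C):
    pass


print(Citrus().describe())
print(Child().surname())
d = D()
print(d.from_a(), d.from_b(), d.from_c())
# The method resolution order shows where Python looks up attributes.
print([cls.__name__ for cls in D.__mro__])
```

The last line is worth running when multiple inheritance is involved: the method resolution order is what decides which parent's method wins if two parents define the same name.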
This is the code that explains the relationship that we had in the previous example, or something close to it. Classes B and C are inheriting from class A, which is hierarchical inheritance, and class D inherits from classes B and C, which is multiple inheritance. When you're working with inheritance, there's a function called super. What does that do? Let's say that I have a parent class Vehicle and a child class TwoWheeler. What I do here is use the super function, and after calling the super function I can access any property or method on the superclass. So if I want to call a method from the superclass, I use the super function. The superclass is the class that I am inheriting from, the parent class. So I call that, and then I call the start function on it. Then I print "I have two wheels", and I call the start method there. When we come to inheritance, there are certain concepts like overloading and overriding. Overloading and overriding are separate concepts, so let's take a look at them one by one. Overloading is basically one function name with different parameters. We overload the function so that we get the behavior we want depending on the number of parameters: in the add example, if I pass in three integers it calls the second function, and if I pass in two integers it calls the first function. Python itself keeps only the most recent definition of a name, so there's another way of overloading a function: instead of creating multiple functions with different parameters, you can act on the function parameters based on the data type and the size of the data. That's what's happening here: we're checking whether the argument is an integer or a string. If it's an integer, then the starting result, the result that we're about to return, should be zero, which is an integer, so that when we add numbers to it we don't end up affecting the result. On the other hand, we can assign
an empty string to it if it's an instance of string, and then we perform the concatenation or addition operation again and again, and finally we just return the result. Now we come to overriding. Overriding basically means changing the functionality of a function that we have inherited from a superclass. Let's say that I have a class called A. It has many functions, and one of them is say_hi, which prints "I am in A". I create another class called B, which inherits from class A, but I want to change the implementation of the say_hi function. I do that by defining another function with the same name and the same parameters, and then I print a changed message. Now when I call this say_hi function, class B's version of say_hi will get called. So let's take a look at some examples. Here's the code, and we'll take a look at multiple forms of inheritance, overloading and overriding. The first thing is I'll create a class named Base. Actually, let me give you a more intuitive example: let's say that I have a class named Person. A person has several attributes. I will create an `__init__` function. Now here, firstly, the person is always going to have a name; let's keep it simple and leave it at that. So I will set self.name equal to the name that is passed in, so this object will now have a property named name, which will be set equal to the name that the user passes. And now def say_name, which takes in self, and all I will do is print self.name.
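
The Person class built up in this step can be sketched as follows. It is a minimal sketch of what the narration describes; the sample names used to create the two objects are assumptions.

```python
class Person:
    def __init__(self, name):
        # Store the name that the caller passes in on the new object.
        self.name = name

    def say_name(self):
        # self is supplied by Python when the method is called on an object.
        print(self.name)


# Two objects share the same structure but hold different data.
john = Person("John")
jane = Person("Jane")
john.say_name()
jane.say_name()
```

Every class that later inherits from Person gets say_name for free, provided that Person's constructor has actually run for the object so that the name attribute exists.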
And we are done. Now let's create another class. A person could be a lot of things, so let's say class Engineer, and I will inherit from Person. Now in def `__init__`, self.profession equals engineer, and finally I can create another function, say_profession, that will take in the self parameter and print self.profession. That's basically what we're trying to do. Now I can create another class with a similar structure, and this one will be a doctor. Let me first complete the implementation and come back to the doctor. One thing that I'd like to show you is that this is single inheritance, right? One class is inheriting from one class. Now I create an engineer and I call dot say_name. Obviously I have not passed a name as of now, and this is going to cause some issues; I want you to figure out why. If I run this, as you can see, I've gotten an error: it says Engineer has no attribute name. Can you figure out why that is? If you want to take some time and figure it out, pause the video and think about it, and if you'd like to know why this is happening, I'll tell you. With the engineer, let me pass in a name first; let's say the name of the engineer is John. The reason is that I'm passing the name, but the Person constructor function is not getting called, so the name property is not being set here. All I have to do to solve this is use the super function that we have already described, and using the super function I am going to call the `__init__` method on it; all I have to do is pass in the name. Now that that is done, if I use the function now, it prints the person's name, John. So this is working fine as of now. Now if I use say_profession instead of this, I'm printing the profession of the person, which is this. The problem that I'm getting here is that I'm also printing the return value, which is nothing, so if I do this
then it works. If I remove this print and run it again, it works. So as you can see, this is what it means. Now if I were to do this again and change it from engineer to doctor, I'll show you an example of where inheritance is so useful. I'll pass in the name to Doctor, and to create a doctor I have to change this here as well. Let me name her Dr. Jane, and call say_name and say_profession. So I'll do the same thing, but instead of it being an engineer, I will use doctor. Now if I run this, as you can see, John is an engineer and Jane is a doctor, so this is working. The major advantage of inheritance is that it allows us to remove code duplication. Instead of us having to copy this code into every implementation that requires it, we can just reference it: here we are saying that anything that's in Person, you can use. So now we are using this. If I were to change the implementation, so that instead of just printing the name it prints "my name is", and print this again, as you can see the implementation has changed in both of them. Imagine this is a large system and I have a lot of code: instead of me having to run through all the professions that I have and copying and pasting the change into every file, I can just change it here and the change will be reflected everywhere. So that's the major advantage of inheritance: it allows us to remove code that would otherwise need to be copied multiple times, and we can safely change the implementation and the other functions will work exactly the way they are supposed to. So that was single inheritance, but let's say I wanted multiple inheritance; how would I go about doing that? Let me just remove all this. Let me create a class; this is going to be a really generic example. Class A, and I will create `__init__`, pass in self, and instead of printing, let me just use pass. This keyword is used when I want to tell Python that
there's no implementation here. If I leave it open, then I'll get an error; as you can see, I'm getting an error here: expected an indented block. So I'll just put in pass here because I have nothing to write right now. Then class B, and finally class C, which will inherit from both A and B. Now if I want to, I can add a few other things. I could create a function like say_name, but I think the most important part here is to just print something. I have to pass in self here even though I'm not going to be using self. Now I can copy and paste the same thing here, and copy and paste it here. Okay, now all I have to do is create an object; let's say I call it obj, and I create a C. Now if I call obj dot print_a (again I'm making the same mistake), then print_b, it's going to say from A, from B, from C. So these methods get inherited in this class. This is multiple inheritance. Then there's hybrid inheritance and hierarchical inheritance; I don't need to go into the details of those. Basically, hierarchical would just be B inheriting from A and C also inheriting from A, and that would be it. Hopefully this was useful, and now let's take a look at overriding and overloading. Let's say I have two functions. Let me create a class called Base; it has nothing to inherit from. def add takes two parameters, a and b, and it will return a plus b. Now let's say that I have another function, def add, which takes a, b, c, and it also takes in the self parameter, and it will return a plus b plus c. Yeah, indent it properly. Now let me just create an object, and here I can call, oops, sorry, that needs to be an equals sign, add, and I pass in two numbers, two comma three. Another thing I can do is call it with two comma three comma four. Now if I call the functions, I have two functions with the same name, and it says add is missing one required positional argument, c: a comma b comma c. Okay, I think it
should work; for some reason it's not. We'll take a look at that. Okay, if I were to pass in five. Okay, so let me just move it out of the class so we don't have to deal with the class. Okay, this will work just fine. Oh yes, self also needs to be removed because we are outside the class. Let me just clear the screen. So this works just fine now that I have overridden the function; this is what overriding looks like for a plain function. Now let me show you what overriding looks like with classes. I have a Base class, and let's say I have a Derived class that takes Base as its parent. Here what I will do is define the add function again, with a comma b, but instead of just adding a and b, it will also add 1 to the result. For this I also need to use the self parameter. Now I create the base and the derived objects, and what I can do is print base dot add one comma two, and derived dot add one comma two comma three, or I could just pass one comma two, that's fine. As you can see, it's giving me different results based on the implementation that I have. But let's talk about overloading. I want to create a function that handles multiple things. What I can do, instead of doing it this way, is use star args. What this will do is accept however many arguments I pass: even if I were to pass in three arguments, four arguments or five arguments, it will perform the same task. So now if length of args is equal to two, then return args[0] plus args[1]. If that's not the case, then result equals zero, and for x in args, result plus-equals x, and then we return the result. So I can pass in multiple arguments; just make sure that you're putting a comma between them. And if I run this now, I have a syntax error; I need to put
this, and now it will work. Yeah, so this is what overloading means: I have created multiple implementations based on the number of arguments that have been passed to me. I could create separate functions as well, such as add_two, which adds two numbers, and so on. So that's how overloading and overriding work; hopefully that's clear. Now let's move on to the next one.

The next topic is encapsulation in Python. Encapsulation is basically abstraction plus data hiding. Abstraction is showing essential features to the user and hiding non-essential features. Depending on how your application is set up, let's say that you have an application that manages the payroll of a company. Since it's managing the payroll of a company, the user does not need to know what kind of database it uses, what kind of file system it's using, or what kind of language or drivers it's using in the background; those are all things that you, the developer, have to deal with. So you can abstract those details out of the program and just focus on the UI and all that. That's what abstraction means: just showing the essential features and hiding the non-essential features. While writing an email, you don't know how things are actually happening in the back end; this is one example of abstraction as well. When you're writing an email on Gmail and you send it, you don't know how things are being put into an HTTP request, how they're being transferred between email servers, and how the email reaches the person. Encapsulation deals with the concepts of public and private. Public basically means a function that anybody can access; so far we have created only public functions. Private functions are functions that other people are not supposed to access. You can access them under very strict and rigorous circumstances, but it's generally discouraged. So if you want to access something then you
You can call update software from outside using the syntax shown here. Let's say we have a class named Car. It has an init method and an update software method, which is a private method. The way we know it's private is that its name is preceded by two underscores; whenever a method name starts with two underscores, that means it's a private method. So we create the __update_software method, we create a drive method, and then we call the update software method. I create an object called red_car and call its drive method, since it's public. But if I want to call the private method from outside, this is how I do it: an underscore, the class name, then two underscores and the method name. Again, this is not something you should do; calling private methods yourself is generally discouraged. But if you have to, this is one way to do it.

You can change the value of a private variable using a setter method as well. Let's say I'm creating a car game, and the car has a private variable called max speed. Instead of reaching into the object and exposing the max speed variable, you call a set max speed method and pass in the maximum speed that you want. Assigning to red_car.__max_speed from outside will not change the private variable, because it's private, but a setter method can do that for you.

Now let's take a look at a hands-on to understand how that works. Let's say I have a class called SomeClass, and in it a function called public that prints "public function". Now I can create a private function; always remember, two underscores mean private. It takes no arguments, so I will just print "private function". Now I think I'm getting an error.
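A minimal sketch of the Car example described above. The method and variable names follow the narration; the starting speed of 200, the printed messages and the get_max_speed helper are assumptions added for illustration.

```python
class Car:
    def __init__(self):
        # Two leading underscores trigger name mangling: inside the class
        # this attribute is actually stored as _Car__max_speed.
        self.__max_speed = 200  # assumed starting value
        self.__update_software()

    def __update_software(self):
        print("updating software")

    def drive(self):
        print("driving, max speed is", self.__max_speed)

    def set_max_speed(self, speed):
        # Setter: the supported way to change the private variable.
        self.__max_speed = speed

    def get_max_speed(self):
        # Hypothetical helper, added here so the value can be inspected.
        return self.__max_speed


red_car = Car()
red_car.drive()

# Calling the private method from outside via the mangled name.
# This works, but it is generally discouraged.
red_car._Car__update_software()

# Outside the class there is no mangling, so this creates a new,
# unrelated attribute and leaves the private value untouched.
red_car.__max_speed = 10
print(red_car.get_max_speed())  # 200

red_car.set_max_speed(10)
print(red_car.get_max_speed())  # 10
```

Note that the double underscore is a naming convention backed by name mangling, not an access check, which is why the mangled name still works from outside.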
The error is there mainly because these functions belong to a class, so they need to accept a self parameter. Now if I create an object and call obj.public(), this works; it's just calling the public function. And if I call the private function as obj._SomeClass__private(), it's calling the private function for me. So that's how this works.

Now let me show you how to deal with private variables. Let's say I have self.__private; instead of calling it private, let me call it balance, and set the balance to zero. Let's say I'm creating this for a bank account. I want balance to be private because the bank can have certain rules, depending on how things are set up. It could be that a balance can never be equal to zero, or that whenever you open an account the balance should be a minimum of 10, so we can set that here. Also, when we're setting the balance there could be procedures we need to follow: log the old balance, update to the new balance, and then log that as well. These are the kinds of things we'll handle in the set balance function. It takes self and balance, and it sets self.__balance to the balance that was passed in. All right, I need to put a comma here. Now, instead of SomeClass, I can call the class Account, and I can also create a get balance function: def get_balance(self), which returns self.__balance. Now all I have to do is print obj.get_balance(), then set the balance to 20, and print it again. If I run this, what do you think the output will be? Okay, I'm getting None here. obj is an Account, and the init method is being called.
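Here is a sketch of how the finished Account class might look once the getter returns the value. The minimum-balance rule and the log messages are assumptions based on the rules mentioned in the narration, not the code shown in the video.

```python
class Account:
    MINIMUM_BALANCE = 10  # hypothetical bank rule, as discussed above

    def __init__(self):
        self.__balance = 0
        # Go through the setter so the same rules apply on creation.
        self.set_balance(10)

    def set_balance(self, balance):
        if balance < self.MINIMUM_BALANCE:
            raise ValueError("balance is below the minimum")
        print("old balance:", self.__balance)  # log the old balance
        self.__balance = balance
        print("new balance:", self.__balance)  # log the new balance

    def get_balance(self):
        # The explicit return matters: without it the method returns None.
        return self.__balance


obj = Account()
print(obj.get_balance())  # 10
obj.set_balance(20)
print(obj.get_balance())  # 20
```

Keeping every change inside set_balance means the validation and logging cannot be skipped by code that uses the class normally.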
The balance is being set to 10 there, and then I'm passing in the new balance. I could also do it this way, calling self.set_balance(10) in init. Still getting None. So we have self.__balance, we have self.set_balance, we have an Account, and we are using obj; let's just print it here. Okay, so the method has the balance, but it isn't handing it back. Getting None means the method finished without returning a value; a private attribute can be read and returned inside the class like any other value, it just can't be reached directly by name from outside. As you can see, now I have 10 and 20. Let me just remove this extra print. If I save this and run it again, you can see the current balance is 10, and then it gets changed to 20. So that's how that works. Hopefully it's clear to you, and let's move on to the next topic.

All right, so now let's take a look at polymorphism in Python. What is polymorphism? It's the ability of a class or an object to perform different tasks, or adapt its behavior, based on the environment. To give you a bit more context: if we have functions with the same name that behave in different ways, that's polymorphism. You might recall that this is what we called overriding. As human beings, we behave differently with our friends than with elders and people we respect; a single person behaves differently at different times. Similarly, objects can have the same method perform different tasks depending on the object. Polymorphism with a function is quite simple. As you can see here, we have a function called in_the_pacific, into which we pass a fish object, and the fish will swim. We have two fishes, one a shark and one a clownfish, and both of them swim differently. That's polymorphism. Let's take a look at polymorphism in more detail.
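The fish example from the slide could look roughly like this. The function name in_the_pacific and the two kinds of fish come from the narration; the class names and the returned messages are placeholders.

```python
class Shark:
    def swim(self):
        return "The shark is swimming."


class Clownfish:
    def swim(self):
        return "The clownfish is swimming."


def in_the_pacific(fish):
    # The function does not care which class it receives; it only
    # relies on the object having a swim() method.
    return fish.swim()


print(in_the_pacific(Shark()))
print(in_the_pacific(Clownfish()))
```

The same call produces different results because each object supplies its own swim implementation.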
We'll do that using some code. Again the setup is the same: I have an app.py file. Let me first create a class named Worker. That class will have a method called work that takes a parameter of self, and all it does is print "working". Now I'll create another method called perform_task. This again takes self, and it prints "performing task". I don't want a line break after this, so I'll change the end parameter of print, and then I'll call the self.work method.

Now I'll create two classes. One is the class of a delivery person: class DeliveryMan, which inherits from Worker. Here I'll override the work method; it takes self, and all I have to do is change the print statement to "delivering goods". Similarly, I'll create another class called Lumberjack. It also overrides work, but here the task will be different. Now it's time to perform the tasks. I create an object of DeliveryMan and another object of Lumberjack, and call deliveryman.perform_task() and lumberjack.perform_task(). Even though perform_task does the same thing in both, since we've overridden the work method in both of these classes, our code behaves differently for each: one prints the delivery task and the other prints the lumberjack's task. So you can mix and match these pieces and make the code work the way you want.

Now let's move ahead to the topic of modules. To put it simply, a module is a file that contains Python code.
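A sketch of the Worker hands-on described above. The class and method names follow the narration; the Lumberjack message ("cutting wood") and the exact separator passed to end are assumptions, since the transcript does not show them clearly.

```python
class Worker:
    def work(self):
        print("working")

    def perform_task(self):
        # end="" replaces the default newline so the task name
        # appears on the same line.
        print("performing task: ", end="")
        self.work()


class DeliveryMan(Worker):
    def work(self):  # overrides Worker.work
        print("delivering goods")


class Lumberjack(Worker):
    def work(self):  # overrides Worker.work
        print("cutting wood")  # assumed message


deliveryman = DeliveryMan()
lumberjack = Lumberjack()
deliveryman.perform_task()  # performing task: delivering goods
lumberjack.perform_task()   # performing task: cutting wood
```

perform_task is defined once in Worker, yet self.work() resolves to whichever override belongs to the object it is called on.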
Modules can define functions, classes and variables, and can also include runnable code. There are predefined modules in the Python standard library that we can use for basic purposes, and we can create our own modules. There are two keywords you need to be aware of when working with modules: from and import. The import keyword imports an entire module into the current scope, and the from keyword lets you specify which things you want to import from a module. I'll show you what that means with some code.

Again the setup is the same, but here I will create another file and call it utils.py. In it I define two functions. One is add, which returns x + y, and the other does the same thing but multiplies instead of adding. Now, back in the main file, I write from utils, and my code editor recognizes the module as well, and then import. If I just want to import add from utils, I can do that, and I'll print add(7, 9). If I run this, here is the result. If I also want to import the multiply function, I can add it to the import, and that works fine as well. But if I want everything, instead of writing out from utils import add, multiply, divide, absolute and so on, I can just import utils. Now instead of add I use utils.add and utils.multiply, and it works the same way. So this is how modules work in Python. Let me just close this one out.

Next, let's take a look at the Python standard library. It is a very extensive library that offers a wide range of facilities. It's basically a pre-bundled set of code that Python ships with.
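To try both import styles in a single runnable script, the sketch below writes the module file itself before importing it. The module is named tutorial_utils here (a stand-in for the utils.py of the narration, chosen only to avoid clashing with any existing module), and the temporary directory is likewise an assumption for the demo; in the video the file simply sits next to app.py.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a small module file in a temporary directory.
module_dir = tempfile.mkdtemp()
Path(module_dir, "tutorial_utils.py").write_text(
    "def add(x, y):\n"
    "    return x + y\n"
    "\n"
    "def multiply(x, y):\n"
    "    return x * y\n"
)

# Make the directory importable and refresh the import caches.
sys.path.insert(0, module_dir)
importlib.invalidate_caches()

# Style 1: import specific names from the module.
from tutorial_utils import add, multiply

print(add(7, 9))       # 16
print(multiply(7, 9))  # 63

# Style 2: import the whole module and use qualified names.
import tutorial_utils

print(tutorial_utils.add(7, 9))       # 16
print(tutorial_utils.multiply(7, 9))  # 63
```

The qualified form is a little longer to type, but it makes it obvious at the call site where each function comes from.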
Because of it, you don't have to write everything from scratch. The standard library contains built-in modules that provide access to functionality such as I/O, meaning file or disk I/O, along with data types and functions. Many other modules in the standard library let us perform very common tasks, like building CSV files, working with operating system features, and so on.

So let's take a look at some standard library code. Again the setup is the same. In this one, we will create three files inside a folder named data. I have no folder yet, so let me create a folder called data, and inside it three files: sample.txt, data.txt and demo.txt. Now I want to list all the files inside the data folder. There are many ways to do it, but since Python already ships with an os module that provides this functionality, I can type import os and then print os.listdir. This is a function that takes a path, and the path here is the data folder. If I run this, as you can see, it gives me all the file names. Let me just clear the screen. So there are three files in this folder, and these are their names. You can take this a lot further; there are a lot of things you can do with the os module, and many other modules let you perform other tasks. Similarly, to tie this back to the previous example, I could write from os import listdir and call listdir directly.
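A small sketch of the os.listdir hands-on. In the video the data folder is created by hand in the editor; here the script creates it inside a temporary directory (an assumption made so the example runs anywhere without leaving files behind in your project).

```python
import os
import tempfile

# Build a data folder containing the three files from the narration.
base_dir = tempfile.mkdtemp()
data_dir = os.path.join(base_dir, "data")
os.mkdir(data_dir)
for name in ("sample.txt", "data.txt", "demo.txt"):
    with open(os.path.join(data_dir, name), "w") as f:
        f.write("")

# os.listdir returns the entry names in arbitrary order,
# so sort them for a stable display.
print(sorted(os.listdir(data_dir)))  # ['data.txt', 'demo.txt', 'sample.txt']

# The same function, imported by name instead of through the module.
from os import listdir

print(sorted(listdir(data_dir)))
```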
This is what it looks like, and it works just fine. Let me just remove this.

Now let's look at installing packages. When we install packages we use something called pip. pip is the package manager that comes with Python, and it allows us to use code written by other developers, so that we don't have to solve problems that have already been solved by other people. In this example we are installing the colorama package, which we will use in our hands-on. To install it, I type the command pip install colorama, c-o-l-o-r-a-m-a, and press Enter. If you've installed Python on your system, this will work, though it will take some time. Okay, the colorama package has already been installed on my system, which is why it says "Requirement already satisfied", but in case you don't have it, it will install for you.

Now I have to import some functions. From colorama I will import the init function, and from termcolor I will import colored; termcolor is a separate package, so it needs to be installed too. First I call the init function. Then I print something: I type "hello world" and wrap it inside the colored function so that the output is colored. I pass in the text, and then I pass in the colors as strings: white, and on_red. Now if I go to my terminal and type python app.py, as you can see the background color is red and the foreground color is white, exactly how I wanted it. So I don't have to write the code that changes the color of the output myself.
I don't have to dig into how terminals work or how to accomplish this task; it has already been done for us. We can just use the colorama and termcolor packages. So this is how that works, and hopefully that's clear. You can install any package you want; just make sure you understand what package you're installing. Is it a good package? Has it been used by other users? Is it tested? And so on.

Now we come to exception handling. This is an important concept, so make sure you understand it. An exception is basically a runtime error. It can be caused by a lot of things that are outside your control. Let's say you're building an application that works with a database. It could be that the database is not working for some reason: maybe the table is not there, or maybe the server hosting the database management software has run into some issues, and now you need to do something else. These are runtime issues. You can expect these things to go wrong and write code that should run when they do, such as displaying an error message and doing something else. If you don't, your program will exit immediately. One of the most common runtime exceptions is the file-not-found exception. So exception handling is basically writing code to handle these exceptions when they occur: we expect things to go wrong, and we write code that handles it the way we want. There are certain keywords whose meanings you need to understand before you jump into exception handling: try, except, finally and raise (except plays the role that catch plays in some other languages). We'll take a look at what these mean. You can also create your own exceptions if you want to, and raise them where you think it's appropriate.
What you need to understand for now is that you only need to create your own exceptions when the already existing exceptions don't fit. So let's take a look at try. try allows you to define a code block that could potentially throw an exception. Underneath the try keyword, as you can see, we have indented a call to a function named throw exception, which might throw an exception. If it does throw an exception, we catch it and perform certain tasks, and we catch it using the except block. except allows us to define a code block that says what to do when an exception occurs. Here we are raising a custom exception, expecting things to go wrong, and in the except block we're saying: when a custom exception is thrown by this code, perform this task. Here we're just printing the message carried by the custom exception.

Then we go to the finally block. The finally block runs whether an exception is thrown or not. Let's say you have a database connection and you want to close it. You can think of it like this: even if your code throws an exception, your database connection will get closed no matter what. That's why we use the finally block; it's basically used to free up resources that are no longer going to be used. If you opened a file or a network connection, and your code throws an exception and exits prematurely, that connection remains open. The finally block gives you a place to free up such resources.

And then there's the raise keyword. raise allows us to raise exceptions, including ones we create ourselves, where we want to. In this example we are creating our own exception and then raising it.
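The slide example can be sketched as follows. CustomException, the Connection stand-in and the message text are all hypothetical names chosen for illustration; a real database driver would provide its own connection object.

```python
class CustomException(Exception):
    """A hypothetical custom exception, as on the slide."""


class Connection:
    """Stand-in for a resource such as a database connection."""

    def __init__(self):
        self.is_open = True

    def close(self):
        self.is_open = False


def throw_exception():
    # raise creates and throws the exception in one statement.
    raise CustomException("custom exception raised")


connection = Connection()
try:
    throw_exception()
except CustomException as ex:
    # Runs only when a CustomException was thrown in the try block.
    print(ex)
finally:
    # Runs whether or not an exception was thrown,
    # so the resource is always released.
    connection.close()

print(connection.is_open)  # False
```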
We raise it inside a function that uses the raise keyword and instantiates a new exception.

Now let's take a look at some code. All we'll be doing here is creating an exception, throwing it, and learning how to handle custom exceptions as well. The first thing I'll do is create a function called throw_exception. It accepts a number, and if the number is equal to zero, I use the raise keyword with Exception, passing in the message that I want to display on the screen, which is "argument 0 is not accepted". Otherwise, we print the number. Now if I call throw_exception, pass in zero, and run the code, as you can see this throws an exception, and the exception causes the program to stop abruptly. If I print "done" on the last line and run this again, that last line does not get executed; our program exits at the exception. If I want to handle this, I wrap the call in try and add except Exception as ex, and print ex. Save it, run it again, and now it's working fine: it shows me that argument 0 is not accepted, and then it prints the done statement.

Another thing is that an else block can be used here as well. The code in the else block is called only when no exception is thrown. So I will print "not thrown" there. With zero, the exception is thrown, so else does not run. If I pass in one, this works fine: it prints 1, because we are printing the number in the function and not raising the exception, then it goes ahead and prints "not thrown", and then we print the done statement.

Now let me show you what the finally block does. If I add it and run this, as you can see, it prints its message. And if I pass in 0 and run this again, the finally block still runs.
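Putting the hands-on together, a sketch might look like this. The function name and the messages follow the narration; wrapping the try block in a run helper and the text printed by finally are assumptions made so both cases can be shown side by side.

```python
def throw_exception(num):
    if num == 0:
        raise Exception("argument 0 is not accepted")
    else:
        print(num)


def run(num):
    # Hypothetical helper so the same try block can be run twice.
    try:
        throw_exception(num)
    except Exception as ex:
        print(ex)             # only when an exception was thrown
    else:
        print("not thrown")   # only when no exception was thrown
    finally:
        print("finally")      # always
    print("done")


run(1)
# 1
# not thrown
# finally
# done

run(0)
# argument 0 is not accepted
# finally
# done
```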
It is still printing that. So finally is used whenever we need to make sure that a piece of code runs even when things are not working out. Even if our code would exit abruptly because an exception was not handled, the finally clause would still run.

Now that that is done, let me show you how to create your own custom exception. Let me create a class called MyException that takes Exception as its base class. Now I need an init method. As you can see, I am inheriting from the Exception class, so I use super here, passing in the current class and the self object, and call the init method on it. One issue I can see is that my init needs to accept self. In the call to the base init I don't have to pass in anything. Now I just need to set self.args, which is a tuple, so I put my message there. Save it, and now if I raise MyException, as you can see, I'm getting my message, but it's printed as a tuple, with everything split up. I can change that by typing a comma after the string, which makes it a one-element tuple; now I'm getting "my exception". So this is working. In the except block, I have to specify the kind of exception that I'm expecting, so except MyException as ex, and I print ex. We've done this and it's working fine. We can mix and match this with else and finally, but here it's working fine. So let's move on and see what's next.

Web scraping is a term used to describe the use of a program or an algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, an engineer, or anybody who wants to analyze large data sets, the ability to scrape data from the web is a useful skill to have. Let's say you find some data on the web.
If there is no direct way to download it, web scraping with Python is a skill you can use to extract the data into a useful form that can be imported for further analysis.

So what is web scraping? Web scraping, also known as screen scraping, web data extraction or web harvesting, is a technique used to extract large amounts of data from websites. This data is mostly unstructured, and once extracted you can transform it into structured data and save it to a local file on your computer or to a database, in a tabular or spreadsheet format. The data displayed by most websites can only be viewed using a web browser, and most often sites do not offer a way to save a copy of this data for personal use. The only option you then have is to manually copy and paste the data, which is a very tedious job and can take many hours or sometimes days to complete. Using web scraping, you can automate this process: instead of you manually copying the data from the website, the web scraping program performs the same task in a fraction of the time. So that was what web scraping is; let's move ahead.

Next: are web scraping and web crawling the same thing? There is a subtle difference between them, and the two are interrelated. The terms may look similar, and many people use them interchangeably, but there are differences between them. In simple terms, web crawling is the process of repeatedly finding and fetching hyperlinks, starting from a list of starting URLs. Broadly speaking, web crawling is a process of locating information on the World Wide Web: indexing all the words in a document, adding them to a database, then following all the hyperlinks and indexing and adding that information to the database too. Major search engines like Yahoo, Google and Bing all have such a program, which is also known as a web spider or a bot.
Web scraping, on the other hand, is the process of automatically requesting a web document and collecting information from it. Strictly speaking, to do web scraping you have to do some degree of web crawling to move around the websites. Web crawling is generally what major search engines do to search for any kind of information, while web scraping is targeted at specific websites for specific data, for example stock market data, business leads or supplier product scraping. An important thing to know is that a web scraper can do things a web crawler wouldn't do. For example, it may not obey robots.txt. I'll tell you in detail what exactly that is later in this session; for now, just understand that it is a file that tells you what you can crawl and what you cannot, and that a web scraper does not necessarily obey it. A scraper can also submit forms with data, execute JavaScript, transform the data into the required form and format, and even save the extracted data into a database. These are things a web scraper can do but a web crawler does not. I hope you are clear on the difference.

Moving ahead, let's learn about some of the use cases of web scraping, that is, the business scenarios in which you could use it. Number one is tracking competitive pricing: web scraping helps in extracting competitors' product or service prices so you can stay ahead of cut-throat competition. Next is sentiment analysis, which is analyzing the reaction or response of a customer or consumer; with web scraping this can be tracked by extracting ratings, reviews and feedback on forums and e-commerce websites. Next we have market research: when you are planning a product or service launch, you can use web scraping to study the market in advance, which can help with the launch campaigns. Next is industry scrutinization.
Most of the time, you and your business will want to know who the players in your market are, and here again web scraping can help a great deal. Next is content aggregation: if you want to gather information from multiple documents or web servers for further processing, you can scrape the data, process the unstructured data, convert it into organized, structured data, and then use it as real-time information. Next is monitoring brand value: you can make future decisions more easily and accurately by analyzing scraped and filtered data related to your brand, along with positive and negative keywords. Next we have lead generation: you can use the scraped data to identify whom to target in a company for your outbound sales campaign, to locate possible leads in your target market, or to identify the right contacts within each one. These were some of the business applications of web scraping; if you research further, you will find any number of others.

Now that you have seen various business use cases of web scraping, and you know that what you do in web scraping is fetch data from other websites, a question might arise in your mind: is web scraping or web crawling a legal thing to do? Well, the answer is maybe. Crawling means fetching content from web pages in an automated manner rather than manually opening each page in your browser. The calls a browser makes to the server that hosts the web page are quite similar to the way a bot hits a page to grab its content. So the question arises: why is crawling treated as taboo? It's mostly because it is quite often done against website policies and breaks the ground rules of crawling. So here are some rules of thumb.
You must follow these if you want your bot to behave well. The first one is robots.txt. Be it a commercial or non-commercial, Indian, American or any other website, one of the easiest ways to find out which websites allow scraping and which don't is to check their robots.txt file. You can do that by appending /robots.txt to the end of the URL of the website you wish to scrape. For example, if I want to check the robots.txt file for amazon.com, I'll just write www.amazon.com/robots.txt. As you can see here, it gives me a list of allowed and disallowed paths. You can consider this a consent form that you should abide by if you want to crawl that site; it tells you which URLs you can and cannot crawl. This is really bot specific: even Google's bot cannot crawl a page that is disallowed for it, which is why a site that cares about a page's SEO leaves it open.

Next is public content. Crawl only public content, keeping copyright policies in mind. If you are crawling a site only to reproduce the same content on a new site of yours, then good luck with that; republishing someone else's content wholesale can run into copyright trouble.

Next is terms of use. Check the website's terms of use and make sure all is well between them and you. You should definitely read the terms and conditions of a website you want to scrape, and check whether the data is under a Creative Commons license that lets you use it commercially. Basically, if the terms and conditions state that scraping is not allowed, then you shouldn't scrape that website.

Next is authentication-based websites. Some sites need authentication before you can access their content, and mostly they discourage crawling because they only want real human beings to log in.

Next is crawl delay. robots.txt can also list the delay to be maintained between consecutive crawls.
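Checking these rules can be automated with the standard library's urllib.robotparser module. The sketch below parses a small robots.txt supplied as a string, so it runs without network access; the rules, the my-bot user agent and the example.com URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt; a real one would be fetched from
# the site's /robots.txt URL.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) reports whether the rules allow the URL.
print(parser.can_fetch("my-bot", "https://example.com/products"))        # True
print(parser.can_fetch("my-bot", "https://example.com/private/report"))  # False

# crawl_delay(useragent) returns the Crawl-delay value, or None if unset.
print(parser.crawl_delay("my-bot"))  # 5
```

For a live site, set_url followed by read would download the file instead of parsing a string, and a polite bot would sleep for the reported delay between requests.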
The delay is there to ensure you are not hitting their servers too hard. If you overload them with requests, chances are that your IPs will be blocked.

Let me tell you about an ongoing case, LinkedIn versus hiQ. In August 2017, LinkedIn blocked hiQ from accessing the data available in the public LinkedIn profiles of registered users. What hiQ was doing was extracting information from LinkedIn: all the public information in all the public profiles. So LinkedIn blocked the company from accessing its data. hiQ took LinkedIn to court, and the case is still ongoing. The case hinges on the question of who owns a piece of data and the circumstances under which the information can be viewed as residing in the public domain, accessible by all. The appeals court judges may rule that LinkedIn owns exclusive rights to the data, which would not have been compiled without the entrepreneurial talents of the LinkedIn founders. Conversely, the judges may conclude that since LinkedIn users set their profiles to public, placing them in full view of search engines and general web surfers, they are giving companies like hiQ free rein to view and use the data as they see fit. So it's a knife-edge decision with strong arguments on both sides, and either ruling could have profound implications for how people like you and me interact with data in our daily lives. So this was the story of how hiQ breached the policy of LinkedIn by web scraping their data.

As you know, for this tutorial I'll be scraping using Python. So what are the basic Python libraries that I'll be using for web scraping? The first one is requests, which is used for fetching URLs. It defines functions and classes to help with basic URL actions like digest authentication, redirection, cookies and so on. Next is Beautiful Soup. It is an incredible tool for pulling information out of a web page. You can use it to extract tables, lists and paragraphs, and you can also apply filters to what you extract.
Beautiful Soup works on web pages, but it does not fetch the web page for us; that's why I'm using the combination of requests and Beautiful Soup. Python has several other options for HTML scraping in addition to Beautiful Soup, like mechanize, scrapemark and Scrapy. Don't worry, we'll discuss them in our next tutorial; for this tutorial we'll be focusing on just Beautiful Soup and requests. So let's move on.

While performing web scraping you will have to deal with HTML tags, so it's very important that you have a good understanding of them. If you already know the basics of HTML, good, but for those who don't, I'll try to cover the basics in just two minutes. Here's some basic HTML code. The first line, the doctype declaration <!DOCTYPE html>, indicates that this particular file is written in HTML5, so all HTML5 documents must start with it. Next is the html tag: every HTML document is contained between an opening and a closing html tag. Next we have the opening body tag. This is the visible part of the HTML document; whatever you see on the web page is enclosed in the body tag. Next we have the h1 tag. HTML headings are defined with the h1 to h6 tags, so there are six levels of heading: h1, h2, h3, h4, h5 and h6. For example, h1 is the main heading, h2 is a subheading, and so on. After headings comes the paragraph, for which we have the p tag over here. Then there's the closing body tag and the closing html tag. Let's move ahead.

Next we have the table tag. Don't get confused by the style part: the main tag here is table, and style is just one of the attributes of the table, saying that the width of the table should be 100%. We represent a row with the tr tag, and each row is divided into data cells using the td tag.
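To see these tags from Python's side, here is a small sketch using the standard library's html.parser module (not Beautiful Soup, which the tutorial turns to later). The HTML snippet mirrors the structure described above; the heading and paragraph text are placeholders, and TextByTag is a hypothetical helper class.

```python
from html.parser import HTMLParser

html_doc = """<!DOCTYPE html>
<html>
<body>
<h1>My First Heading</h1>
<p>My first paragraph.</p>
<table style="width:100%">
<tr><td>James</td><td>Smith</td><td>45</td></tr>
</table>
</body>
</html>"""


class TextByTag(HTMLParser):
    """Collects (tag, text) pairs for headings, paragraphs and cells."""

    def __init__(self):
        super().__init__()
        self.current = None  # tag whose content is being read
        self.found = []

    def handle_starttag(self, tag, attrs):
        self.current = tag

    def handle_endtag(self, tag):
        self.current = None

    def handle_data(self, data):
        text = data.strip()
        if text and self.current in ("h1", "p", "td"):
            self.found.append((self.current, text))


parser = TextByTag()
parser.feed(html_doc)
for tag, text in parser.found:
    print(tag, "->", text)
```

This simple version only tracks the most recently opened tag, which is enough for flat markup like this but would need a stack to handle nested elements properly.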
example i mentioned a name over here james james is one piece of data in a row it’s a cell fine we add another cell to the same row smith and another to the same row like 45 so first name last name and age are all added to the same tr now what if i want to add elements to my next row again i’ll define another row with tr then again i’ll write the first name last name and age and then again close the tr tag the row tag these are some of the basics let’s see what we have next next we have the a tag well the a tag is used for hyperlinking text to some website for example this is my entire tag what we have up here is a href equals www.intellipaat.com so this is the tag and what is the text that has to be hyperlinked to this website it’s visit to learn more then we close the tag so this entire thing is known as an element the a href part is the opening tag or the a tag and inside that we have the attribute name href and the attribute value www.intellipaat.com moving ahead we have the enclosed tag content visit to learn more which will be hyperlinked to www.intellipaat.com and finally we have slash a as the closing tag so this was all about the basic html tags which i’ll be using in this session if you want to learn about html tags in more detail you will find multiple online sources freely available you can learn html in detail from those websites first of all let’s begin by understanding what web scraping is so suppose you have a website or a link to a website which contains some information that’s getting updated regularly and you wish to store that information either locally or in a database or anywhere or you wish to access that information and then perform some manipulation on it now before web scraping you should usually check if that website has an open api to provide that data so that you can just request the data using a url however sometimes these websites either don’t have a web api or don’t have 
the data that you want in the form of a web api so in that case what you do is you take the entire html content of their web page as a string parse it and then extract the information that you need from it you can then store it or manipulate it however you wish and that’s what’s known as web scraping now python has a lot of packages for web scraping one of the packages that we can use is called beautiful soup it’s now in its fourth version so i believe beautiful soup 4.4 is the current version and another one is scrapy now scrapy is less for scraping a single page and more for building web spiders or web crawlers it could be considered an entire framework in itself as it allows you to create spiders by inheriting from its classes and then telling it how to parse the pages as you can see in the example right here this could be useful when you’re creating a web crawler and you wish to crawl multiple pages but if you wish to create a web scraper for just getting information from a single page and it’s being executed several times a day say against a stock market website that’s getting updated hourly then you can use beautiful soup in my opinion now scrapy is also very good it’s more powerful than beautiful soup in the sense that it is built particularly for creating web spiders or web crawlers so if you wish to learn scrapy its documentation page is where you need to go there are several tutorials and examples there in this session we’ll be looking at beautiful soup because it’s very easy to get started with and quite easy to use as well so now let us take a look at what we can do with beautiful soup with beautiful soup if we have an html page stored offline we can parse that or we can parse an html page from a website now parsing an html page from a web address needs you to get the string 
content of the html page so for instance as you can see this is the string content of the html page this will be returned as a string then you can parse it with beautiful soup so before we get started with beautiful soup one thing that you need to understand is that you need to install all the dependencies in your project now i’ll be using pipenv that’s p-i-p-e-n-v to create an environment and then install all the packages that i need have any of you worked with pipenv if you have then you can message in the chat box no okay so let me give you a brief introduction to what pipenv is why we use it and how we use it so basically suppose you are creating a project in python using external libraries like beautiful soup there are many others like pandas numpy django django rest framework none of these packages are available inside the python standard library so you need to download them from the internet and then use them now you can download them but it could be the case that these packages have dependencies themselves that is they require certain other packages to be downloaded as well and those packages require certain packages themselves and then those packages might even come in different versions so for instance let’s say your project only works on django 2.1 and there were some changes in django 2.2 which you cannot support as of this moment now let’s say that another person or maybe you yourself wish to work with django 2.3 to create a new website now you would either have to uninstall django 2.1 and then install django 2.3 or you would have to create a new environment and install the django version that you wish to use in that so this is where pipenv comes in what it does is it creates a virtual environment all the packages that you install in this environment can only be used when you enter this environment so think of it as a 
container or a virtualized environment for you to install your python packages in so first of all you need to create an environment and for that you type pipenv shell and then you press enter and it says here creating a virtual environment now this is important because everything that we install in this virtual environment can only be used in this environment we won’t be able to use it outside this environment so it’s creating a virtual environment and getting all the files required for it before pipenv there used to be virtualenv and virtualenv is what is used underneath pipenv so pipenv could be thought of as an abstraction over it now you can see creating Pipfile this file lists all the packages that your project might need so as you can see we have created an environment with pipenv and this is the directory from where you have to run this command because mine is saying not recognizing this command okay so if you wish to run this command first you need to install pipenv so on your pc or your mac or whatever you have type pip install pipenv and press enter mine is already installed so it’s not going to do anything but you can install it is it installing oh it’s stuck at 65 percent yeah it’s doing something it’ll take some time but it’ll install and then you can use the pipenv command okay i have another question over here so till now we have generally used pip install and we have installed so many libraries even scikit-learn and tensorflow so we have used pip install tensorflow so what’s the difference between pip install tensorflow and pipenv okay so the difference between them is that if i wish to use tensorflow version 2.1 for one project and tensorflow version 2.3 for another project with pip i cannot do that 
because with plain pip packages are installed globally by default so for instance let’s say that i’ve been using beautiful soup version 3 for a project and now suddenly i have a new project and i wish to use beautiful soup version 4 so because of the different versions we would need to uninstall beautiful soup version 3 and then install beautiful soup version 4 and in that case the previous project that was running on beautiful soup version 3 would stop working so to avoid this problem we install these packages on a per-project basis so for instance i wish to use beautiful soup version 4 for this project so when i install it using pipenv only this project will be able to use that version so let me show you first let me enter the environment so as you can see it didn’t create another environment because it had already created one and it shows here the label of the environment its unique identifier now i type pipenv install beautifulsoup4 that’s b-e-a-u-t-i-f-u-l-s-o-u-p-4 and then i press enter now it’ll start installing it’s installed and now it’s locking which is checking all the files and it’s done now let me show you just one thing this is the Pipfile so as you can see it’s showing the version of python that i’m using and it’s showing that i’m using beautiful soup and since i didn’t specify any version it’s going to install the latest version which is what this star means so now if i wish to take this project with me to any other computer and use it all i have to do is install pipenv on that computer and run pipenv install it’ll create the same environment that i have right now with the same packages according to this file so beautiful soup at the latest version and python 3.7 and it’ll start working on its own if i didn’t do this then the problem would be that the other pc might not have the same versions of all the 
dependencies that i have so they might have beautiful soup version 3 maybe they do not have it at all maybe they have a different version of python and so on so you can think of it as a way of creating an environment or as a way of creating a container in which your project can be developed so i hope that clears it up yes thank you sure so okay now that we have installed the beautiful soup library another thing that we need to worry about is parsers so a parser you can think of as a package used by the beautiful soup library to look at the html content and then create an html tree out of it that is to parse it and understand the structure of the html page so whenever you go to an html page in chrome or whatever browser what you can do is right click click on inspect and as you can see there is this tree-like structure this is how html usually works it is a tree of nodes nested inside one another in a hierarchy so to understand these relationships and parse all of this we use parsers now beautiful soup supports many parsers i think its documentation page has yeah here it is so it supports html.parser which is built into python’s standard library so you don’t have to install anything if you wish to use it you can also use lxml’s html parser which we’ll be using in this tutorial because it’s very fast and very lenient now when i say lenient it means that if the html content that we’re getting is not properly formatted that is some of the tags have been opened but not closed so it’s not completely valid html there are some bugs or errors in it the lxml parser will try to rectify those mistakes and present the tree as best it can instead of throwing an error and saying that the html is not well formed because the html content that we’re getting may very well be out of our hands because we’ll be getting it from the internet 
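To make the idea of a tree of nested tags and of tolerating broken markup a little more concrete, here is a minimal sketch that uses only `html.parser` from Python's standard library, without Beautiful Soup or lxml. The `OutlineParser` class and the `broken_html` string are made up for illustration and are not part of the session's code. The sketch only shows that the built-in parser reads markup with unclosed tags without raising an error; it does not attempt the kind of repair described above for lxml.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Record every opening tag together with a naive nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs in document order

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # Naive bookkeeping: every closing tag moves one level up.
        self.depth = max(0, self.depth - 1)


# Hypothetical markup: neither the <li> elements nor the <ul> are closed.
broken_html = "<html><body><ul class='simple'><li>one<li>two</body></html>"

parser = OutlineParser()
parser.feed(broken_html)  # no exception, even though the markup is invalid
parser.close()

for depth, tag in parser.outline:
    print("  " * depth + tag)
```

In this sketch the second `li` ends up one level below the first, because the naive depth counter never sees a closing `li`. Working out that the two items are really siblings is the kind of repair a tree-building parser takes on, which is the leniency being described for lxml.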
there’s also html5lib and lxml’s xml parser now the reason why we won’t be using html5lib although it’s very useful is because it could be very slow as you can see here so this could be a problem on the other hand lxml’s xml parser is used to parse xml and right now we’ll be parsing html and lxml’s html parser is very fast and lenient python’s inbuilt html parser is also very good however i’ll suggest that you use lxml whenever you can because first of all it’s quite fast and speed is important when we’re building a project and it’s very lenient so it can rectify some of the mistakes in the html that we get so that’s very useful so the first thing that we need to do is install the lxml parser and i’ll install another package called requests this is a package that is used to send requests to a website or a web server and get the response as a string or however you wish to get it so what i’ll be doing is typing pipenv install and you can install multiple packages in the same command so i wish to install lxml and r-e-q-u-e-s-t-s requests with an s so don’t forget the s because otherwise it might install another package named request now when i press enter so lxml is installed requests is now installing this is also installed and now it’s locking locking basically means checking that the dependencies are fulfilled and then creating hashes out of them and locking them it also creates a Pipfile.lock file as you can see it’s mentioned right here and this is what records the exact versions of all the things that pipenv needs to recreate the dependencies so locking is a process that might take some time because we’ve installed two packages at once so after it finishes locking we can start can you let me know why we use a web scraper okay fine so many websites contain some information that we might need in a project so for instance let’s say you’re creating a project for predicting the stock market tomorrow now for that you need some 
data so you can get the data from an api that is called a web api many stock market websites do provide web apis which you can think of as url endpoints that you can send a request to and get the data back in a format like json or xml however you might come across a situation where the website that you want the data from does not have a web api that you can request it from in that case you usually use a web scraper now a web scraper is basically just a python program that gets all the content from the website parses it and allows you to extract information from that web page so for instance let’s say that i wish to take the title from this website now if it has an api i can request that api to provide the title or what i can do more generally is create a web scraper get the html page that is returned when i hit this website parse it and then just extract this information then i can do whatever i wish with it which is basically save it in a csv file a text file or a database manipulate it or whatever so to continue with the stock market example let’s say there’s a website that updates the stock market information every second so it’s instantaneous and you wish to get that information from that website so you can create a script for that so you can go directly to the website and collect whatever data you want yeah you can write python code that will get the content of that website or web page and then extract the information that you need okay we got it thank you welcome this is sanji so my question is for example the website which you are showing on screen right now if i only want to read your screen content or maybe a website’s content what do we do so for instance you wish to 
just read this content am i right yes yeah so basically what you do just as we’ll be doing in this session is you install the beautiful soup library then you write the code to get this entire html content after that you parse it which is not really difficult because the library allows you to do it very easily and then you traverse down the html tree so for instance this chinese content is inside a ul element with a class of simple ul stands for unordered list you can think of it as a list with bullet points so it’s a list with a class attribute of simple so we just tell the beautiful soup library that okay there is a list on this page with a class of simple i want you to extract all the information from it and it does that and then you can do whatever you want with it now you can do this with any web page and it doesn’t really have to be an unordered list you can do it with images with videos with paragraphs with whatever you wish as i’ll show you in this session sorry i have another question so for example if i’m reading html content from some website and it’s all showing as text or some graphical images how do i come to know that there are some kind of tables or maybe reports attached along with that when i need to extract the data from them sure so when you are thinking of scraping a website the first thing that you need to do is go to that website and look at its html content to figure out how to get the information that you need because they haven’t provided a web api it’s not as simple as taking a url sending a request and getting the response so you have to do a bit of research in that department so you go to the website let’s say for instance this is the first time i’m visiting this documentation website and i decide that i want to extract this information as 
you suggested then i look at the html content of this page like this now for whatever content you wish to extract with the developer tools in chrome or any other browser what you can do is click on this button which says select an element in the page to inspect it and then go to the content that you wish to extract so i wish to extract these chinese links so i simply look okay they are inside an unordered list with a class of simple inside that there is a list item inside that there is an a tag a link and i wish to extract the text of those links now it’s usually a better idea to be very precise about what you wish to extract if you just said that i wish to extract an unordered list that would create a problem because there may be another unordered list on the web page and you would then have to parse a lot of content just to get these three pieces of information so be very precise and look at the content of the website before scraping it that way you can understand how you need to begin scraping it which elements you need to understand and which classes you need to reference in order to get the data that you want does that answer your question yes and will you be showing us the code or maybe a hint of it so that once i choose which content i want to read i can target my python program to read from that link yeah so in this session i’ll be showing you some of the code so that you can extract information from a website so you don’t have to worry about that okay okay so this is what we have right now and let’s begin with our session today okay so for this session because we need to get some information from an html page what i have for you is an html page here it is so this is the html page that we’ll be scraping you can think 
of it as a blog so i’m sure you guys have visited several blogs so this is one of those blogs it contains posts each post has a title that links to the full blog post and this is the summary under the title so when you click on the link it will go to the entire blog post so the first thing i need to do is run a server now it’s perfectly all right if you don’t have a server you can use any other website or anything that you wish to i’m creating a server because it makes it easy for us to send the request get the content from the response parse it and then use it now if i don’t have a server then i’ll have to open this html file in python which isn’t always something you can do because if it’s a website that you don’t own you don’t have the html file and you need to get the content from the website so for that i’ll be using something called lite-server so it’s checking the version of lite-server that i have yeah it’s working and this is the server that i have now if you don’t have lite-server installed what you can do is first install node.js this is node.js you can go there and always remember to download the lts version lts stands for long term support so you can download it and install it like any other package that you find online and then you just write npm install dash g the g stands for global so it will be installed on your pc so that any project can use it and then l-i-t-e dash s-e-r-v-e-r i already have it installed so i won’t be running this command so now what i’ll be doing is i’ll start lite-server in the current directory make sure that you are in the right directory mine is the web scraping directory on the d drive so i do it and because the name of the html file is index.html i get this so this is the content that i’ll get now what i’ll be doing in this session is i’ll be getting this 
content parsing it and then getting the title the summary and the link of each post so the links are article1.html and article2.html and so on and after getting the content i’ll store all of this in a csv file because usually when you are doing web scraping you are getting the information from the web server in html format parsing it and then storing it in some form or other like a csv file an excel sheet a database a small sqlite database or json any way you wish to do it so let’s begin so the first thing that i wish for you to understand is that you need to have installed all the packages that i mentioned these were beautifulsoup4 requests and lxml so after installing these packages i have this html file because i wish to show you how to parse an html file and many websites online don’t allow their web pages to be crawled or scraped though one question here this node.js package do you need to install it in the same directory or can it be installed in the default program files directory the default program files directory you can install it anywhere you wish to there is no problem so sunil has asked where can i get the html file so for this session i’ve created this html file myself if you wish i’ll have it uploaded to git so that you can get it but you don’t necessarily need this html file you can use any html file that you can get the point that i’m trying to get across in this video is how you can use beautiful soup and parsers and all of that to scrape any website so because i’m using my own html file i’ll show you how to use it with this but you can use it with any website provided that they give you permission to scrape it now scraping is you can say a difficult thing because usually websites don’t allow users to scrape them so if you have seen the 
structure of a directory that contains the html pages for a website you would have seen that they have a robots.txt file this file contains a list of the files and folders that the website owner or the owner of the server does not wish crawlers to crawl now if you are creating a web crawler then this might be a concern but since you are creating a web scraper you don’t need to worry about that as much another thing is that when you create a web scraper you send requests to the web server and this puts a certain load on the bandwidth of that web server so whenever you’re scraping a website do make sure that either you have permission to scrape that website or you’re not doing anything so intensive on that website that it could essentially bring it to a halt most of the time it’s fine but at times there could even be legal action because the website could go down and it could be your fault so when you’re scraping a website always make sure that you are not doing something very intensive that ends up hanging the website like creating multiple instances of your scraper and scraping the website with all of them another thing that you need to worry about is that you’re not scraping something that the owner does not want you to scrape so just make sure that you’re not doing that and another thing before scraping do make sure that the information that you’re scraping cannot be gotten from an api because if it can be then it’s actually much easier to just get the information by requesting it from a url instead of scraping an entire web page or website sorry so when i’m scraping a website how do i know whether i have permission to scrape the data from that website or not yeah so you can look at their robots.txt file many websites do provide 
that so if you’re creating a crawler using scrapy or creating a scraper what you can do is look at the robots.txt file this is one of the easiest things to do if you’re looking at the robots.txt file and they have mentioned the name of the html file that you are going to scrape and said that they do not wish for this file to be scraped or crawled then you just don’t do it because that would be against the wishes of the creator or the maintainer of the website so that’s the first way to check and secondly like i said if you are using a web crawler sorry where can i check that so on a website you can request it usually it’s in the root of the directory so if you wish to you can just search for robots.txt right now it’s showing you what a robots.txt file is and how to use it so let’s say that there is a robots.txt file r-o-b-o-t-s dot txt if it’s in the root of the directory then you can just replace localhost 3000 with the name of the website followed by slash robots.txt so you can get it this way our website doesn’t have one because it’s just a local website and only i’ll be using it so you don’t have to worry about that here this is exactly the reason why i’m using a self-created website for demonstration instead of going to a website that i do not have permission to access so this scraping activity can you do it on both the request page and the response page or only on the response with the output data so there’s no request and response page as such because when you send the request you get the response as an html page so this is the data that you will be getting as a string and this is the data that we’ll be parsing so you get it as a response from the website yeah okay so the reason why i was asking is there are situations where you submit form data to the server to get 
response okay so the first thing is form data has nothing to do with scraping scraping can only be done with the information that’s being shown on a website so form data you can submit it as a user but for that you will have to log into the website then fill in the information and click submit now if you wish to automate this process it could be done using other packages like selenium and other automation packages for scraping what we do is we take a look at the website we look at the content that’s already present there then we send a request get the response which is the string representation of the html page and then we parse it and extract the information so forms don’t really fit into this equation if you want to create a script or a program that submits a form automatically for you you can use selenium and other web automation libraries but beautiful soup is not really for that okay so let’s begin scraping the website so the first thing you do is create a new python file i’m going to call it scrape.py py is the python extension and now i’ll be importing so from bs4 which is the name of the package what is this editor you are using i’m using visual studio code vs code okay yeah you can use it it’s very good actually so i suggest that if you are doing something important you use visual studio code we are using spyder yeah spyder and the like are also very good so you can use them as well but they are more suited for data science and data analysis purposes here we are just extracting the information and then storing it in a csv file so i’ll be using visual studio code and i’m also using an extension in visual studio code called python the python extension this is why i’m getting these code completion hints okay so after doing that which ide do you use visual 
studio code visual studio is a full-blown ide visual studio code is like 24 to 30 megabytes it’s very small and it’s quite useful visual studio is more useful when you’re developing an entire application using c sharp or other microsoft technologies okay so if you’re interested in using it you can download it free from microsoft’s website visual studio code is the name thanks okay so this is the class that we’ve imported beautifulsoup from the package bs4 now we’ll create a new instance of the beautifulsoup class you can name it anything you want i’ll be naming it soup now as you can see it has two arguments that i need to pass in the first is markup which is basically the string representation of our html page so for that i’ll use the requests library that we have installed so requests dot get and inside the get i’ll paste in the web address so it’s http colon slash slash localhost colon 3000 and when i get it i want the text from it dot text basically means just give me the string representation of the entire html page and now i need to pass the parser’s name so i need to pass lxml if you are using any other parser you need to look up the parameter that you need to pass on the beautiful soup documentation page i think it’s somewhere down here it’ll show you how to use any other parser so i’m using lxml so this is how you use it if you wish to use the built-in html parser inside python you need to pass this parameter for xml you need to pass this and for html5lib you need to pass this that is after installing html5lib if you wish to use the built-in html parser you don’t need to install anything however it could be slower than lxml so after doing that let me just print soup to see what this gives us so let’s run this py and the name of the file is scrape.py now if you are 
using linux or macintosh you need to type python entirely but on windows after i think python version 3.4 or 3.6 you can type just py and it will run so i can do this and this is the entire string representation of the page now one thing that you need to notice is that it’s not formatted very nicely so it’s very hard to read you don’t know where the body starts where the body ends and what all the tags inside the body tag are so for that there’s a method called prettify so after using prettify i run scrape again and as you can see it’s formatted nicely now we can see where each tag starts and where each tag ends it’s indented nicely now after this since we don’t need this print anymore what you do is you go to the website and look at the structure of the html page so as to figure out how to extract the information that you wish to extract so what i always do when i wish to get information from multiple html tags so here we have multiple posts and i need to extract information from all the posts is firstly just get the information from one post so extract the summary the title and the link and then you can just extend that from one to multiple so first i’ll show you how to extract information from a single blog post or article so if i wish to get the title the summary and the link of the first blog post or the latest blog post if it’s ordered newest first then what i can do is type post or you can also type article a-r-t-i-c-l-e is equals to soup dot find now we’ll be using find and not find all find and find all are a little different find returns only the first match and find all will return all the instances of this so i 
want a div. How do I know I want a div? When we look at the HTML content, I want this entire post, and it's a div with a class of post. So what I pass in is: I want the information inside a div tag, the first div tag that you find that has the class. For that we use a keyword argument named class_ (class underscore). The reason we don't use class is that class is already a keyword in Python. Then we give the name of the class, which I think is post; let me check. Post, yeah. And then print article. If I run this again, you can see we've extracted just the latest post. It's the latest post because we're getting the first blog post title, and it appears at the top of the page.

Now let's see what information we need to extract. The first thing we want is the title of the blog post, the second is the summary, and the third is the link. The link is basically this; right now I don't have an HTML page for article one and article two, because they won't be necessary here. Okay, let's look at the information we wish to extract. The title is inside an h3 with the class title, and inside that there's an a tag whose text is the title we want. So what we do is: title = article, dot, and now you can access the tags as properties on an object. What we want is the article's h3, and the a inside it, which is the link; that is, we want to extract information from the post's h3's link. We can do this using article, which is the entire post, then h3, which is the heading, and inside that we have a link with the text we wish to extract, so we go .a. We wish to extract the text, not all the information about the tag, so we type .text. Let me just print it to see if it's working correctly. Whenever you're working with data like this, you should always make sure you're getting the correct data. So: blog
post one title. So we're getting exactly the information we wanted. Now what we need is the summary. We can get the summary the same way; just look at the HTML content. The summary we want is inside a p tag inside the post; p stands for paragraph. So it's very similar: just go article.p, and don't forget to use the text attribute. This is the text that we have, blog post one summary, and I think this is what we'll be getting for the summary. Let's try it again: scrape.py. Blog post one summary. So we're getting the information that we need.

Now we need the link. The link is a little different; let me show you. We get the link in very much the same way that we got the title, but here we don't want the text; we want the value of the href attribute. href stands for hypertext reference. So it should give us this: /article_1.html. To get that we don't use the text property; we use the href attribute. For that we go article.h3.a, and now we want the href attribute, so we access it like a key in a dictionary. Now let's print that and see if I'm getting it. /article_1.html; so we're getting the link as well.

Now what we want is this information, the title, summary and link, for all the posts. Remember, like I said, we first get the information for one post, and then we write the Python code that gets this information for all the articles we need. So for that, what we can do is we can use...

Just a quick info, guys: Intellipaat does provide end-to-end certification training on Python, so if you are interested you can check out the course details given in the description below. Now let's get back to the session.

Now, for example, when I'm reading that link, I found that within that link there are some values, or maybe a data set is there, or a
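The single-post extraction walked through above can be sketched as follows. The HTML string is a made-up approximation of the demo page (a div with class post, containing an h3 link and a p), so the tag layout is an assumption; the calls themselves (find, attribute-style tag access, .text, and dictionary-style access to href) are the ones named in the session.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the demo page described in the session.
SAMPLE_HTML = """
<div class="post">
  <h3 class="title"><a href="/article_1.html">Blog post 1 title</a></h3>
  <p>Blog post 1 summary</p>
</div>
<div class="post">
  <h3 class="title"><a href="/article_2.html">Blog post 2 title</a></h3>
  <p>Blog post 2 summary</p>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns only the first matching tag; class_ avoids the Python keyword.
article = soup.find("div", class_="post")

title = article.h3.a.text        # text inside the link in the heading
summary = article.p.text         # text of the first paragraph in the post
link = article.h3.a["href"]      # attribute values are read like dict keys

print(title, summary, link)
```

If a tag might be missing, article.h3 would be None and the chained access would raise AttributeError, so real pages usually need a check before chaining.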
yeah, so how do I read that data?

So inside the link there is data; for instance, when you click the link it opens another page and the data is on that page. Is that what you mean?

Yes. For example, this will give me some kind of rows and columns, some kind of data at that website, and I want to get the data from there.

So you wish to get the data. Are you asking: if I have a table of data on a website, how do I extract information from that table? If I'm understanding this correctly.

Yes, yes.

Yeah, so it's basically the same process, exactly the same. Let me show you. Okay, index.html. Tables are created using, I think, the table tag; then you create a table row containing table data, and the content of the table data could be anything, like this sample data. Now, usually when you're using a web scraper you look at the website, and most probably they have some classes on those table data cells, table rows and tables so that they can apply some CSS styling to them, and we can use those. Let's say the name of the class is data. Similarly I create another row, and the class is data: sample data. Let's see if it's showing up on the website. So this is the table that you were talking about.

Now, if I wish to get the data, let me just show you here. I can just say soup.find, a tag that is td, table data, with the class name data. So this is the data that we'll get, and I want the text attribute of it. Let me see. So we're getting the data for a single row. Now let's say that there are three rows, and I wish to get the data from all three. This is what we were going to do in the demo, which we'll get to, but for now let's address the query that you raised. So: for data in soup.find, and instead of find we use find_all, so that we get all the tags that are td with a class
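A minimal sketch of the table example, assuming a three-row table whose cells carry the class data as in the answer above. The cell contents are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical three-row table; only the class name "data" comes from the session.
TABLE_HTML = """
<table>
  <tr><td class="data">Sample data 1</td></tr>
  <tr><td class="data">Sample data 2</td></tr>
  <tr><td class="data">Sample data 3</td></tr>
</table>
"""

soup = BeautifulSoup(TABLE_HTML, "html.parser")

# find() gives the first matching cell only.
first_cell = soup.find("td", class_="data").text

# find_all() gives every matching cell, so it can be looped over.
all_cells = []
for data in soup.find_all("td", class_="data"):
    all_cells.append(data.text)

print(first_cell)
print(all_cells)
```

When the cells have no class, soup.find_all("td") with no class_ argument returns every table cell on the page instead.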
of data. So we get all three of these, and then I just print data.text. As you can see, I've extracted the information from the table. You can do this with any website. The first thing you need to do is take a look at the HTML content to figure out how you're going to tackle the problem. Now, it could be that they don't have this class; in that case you can just take the information from all the table data elements, there are only three, and use that. But on a large website they usually do have a class attribute, so you can use that. You got it?

Got it, thanks. And now, for example, this is where I simply write a for loop and read all the table content; I want to save it in something like a CSV file.

Yes, I'll be showing you that. Okay, so what we're going to do is get information from all the posts. For that we first need find_all. Here we're using soup.find; let me just save it here. Okay. So: for article in soup.find_all, and we find all the div items with the class post. This is what we have; indent it properly, yes. Now let me print to show you it's working correctly. We'll print the title, the summary and the link. And as you can see, we're getting the information we need from all the posts. There are six blog posts; we're getting the title, then the summary, and then the link, so it's working correctly.

The final thing we need to do is store it in a CSV file. Thankfully for us, Python has a built-in library for dealing with CSV files. As you can see, there's no CSV file here right now, so we'll have to create one as well. We can first import csv; this is a built-in Python package, so you don't need to install it. Now what you do is create a CSV file. So I
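Extending the one-post extraction to every post, as just described, might look like the sketch below. The markup is an invented three-post stand-in for the six-post demo page, and collecting the results into a rows list is an addition for illustration; the session simply prints them.

```python
from bs4 import BeautifulSoup

# Hypothetical page with three posts, each following the same repeating structure.
SAMPLE_HTML = "".join(
    '<div class="post"><h3 class="title">'
    '<a href="/article_{n}.html">Blog post {n} title</a></h3>'
    "<p>Blog post {n} summary</p></div>".format(n=n)
    for n in (1, 2, 3)
)

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
# The loop body is the same code that worked for a single post.
for article in soup.find_all("div", class_="post"):
    title = article.h3.a.text
    summary = article.p.text
    link = article.h3.a["href"]
    rows.append((title, summary, link))
    print(title, summary, link)
```

Because the body depends only on the repeating structure, the same loop works unchanged whether the page holds three posts or many more.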
think you do it with csv_file = open; we'll be storing the data in a file called data.csv. You can name it anything you wish; I want to call it data.csv. The second thing we pass is the mode. We wish to write to this CSV file, so use the w flag; if you want to read, you either don't pass anything or you pass the r flag. We'll pass the w flag, which means you wish to write the data. After opening the file we need to pass it to csv.writer, so we type csv_file there. We can name the result writer; let me write that down. So this is what we're doing. You could, if you wish, open the file directly inside the call to csv.writer; however, there is a slight benefit to creating a variable that stores the file handle and then passing the handle to the CSV writer. I'll show you the benefit in a minute.

When we write to a CSV file, we write in terms of rows. The first thing we need to write is the headers. You can think of them as titles: at the top of the CSV file we'll have three column names, which tell us what everything in that column stores. So let's write them. writerow is the method we'll be using, and it takes an iterable, so you can pass it anything that we can loop over. I'll pass it a list of items: we'll be storing the title, the summary and the link. So this is done. Now all I need to do is create a row and then write it; row is equal to the title, the summary and the link. After writing the rows, we need to close the file. It's always important to close an open resource, because leaving it open could result in a resource leak. This is why I said there's a certain benefit in storing the file, or the file pointer or
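The CSV steps above (open in write mode, wrap the handle in csv.writer, write a header row, write data rows, close) can be sketched with the standard library alone. The rows and the temporary output path are assumptions made so the example runs anywhere; the session writes to data.csv next to the script.

```python
import csv
import os
import tempfile

# Hypothetical scraped rows: (title, summary, link).
rows = [
    ("Blog post 1 title", "Blog post 1 summary", "/article_1.html"),
    ("Blog post 2 title", "Blog post 2 summary", "/article_2.html"),
]

# A temporary folder keeps the example from overwriting a real data.csv.
csv_path = os.path.join(tempfile.mkdtemp(), "data.csv")

# Keeping the handle in a variable is what allows close() to be called later.
# newline="" is what the csv documentation recommends when opening the file.
csv_file = open(csv_path, "w", newline="")
writer = csv.writer(csv_file)

writer.writerow(["title", "summary", "link"])   # header row
for row in rows:
    writer.writerow(row)                        # any iterable works as a row

csv_file.close()                                # flushes the data to disk

# Read the file back to check what was written.
with open(csv_path, newline="") as f:
    written = list(csv.reader(f))
print(written)
```

A with block, as used for the read-back at the end, closes the file automatically and is a common alternative to calling close() by hand.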
file handle, whatever you call it, in a variable and then using it later. So we'll close the file, and that way the information we wish to write to the file gets stored. Now let me run this again; I hope I haven't made any mistakes so far. And it works, and this is the CSV file. Let's check that all the data is written into it, and it is.

Now, one thing you'll notice is that there's an empty line between all of these titles. Let's just look at the file. Yeah, so as you can see, when we write to the CSV file it's leaving an empty line after each row. To fix that, when we open the file we can instruct Python not to create the extra new line. So here you go: newline set to an empty string means that for a new line you don't add anything. Because we're writing using the writer, the writer creates a new line and the file object also creates a new line, and that's why we get the blank lines. After doing that we go py scrape.py. So it works, and now if you look at the file, yeah, it's working correctly: this is the blog post title, this is the summary, and these are the links.

Now, another thing you'll notice is that the links are not properly formatted. We're only getting the end of the link; we're not getting the entire address, like localhost:3000/ followed by the link, that we could just click to go to the website. To fix that we'll write some code. The reason we're getting this is that links are usually not written as absolute links; they're written as relative links. A relative link is resolved by taking the address of the current site and appending the link we have. So we have /article_1.html and we're appending it to localhost:3000, but in the HTML tag we're only given /article_1.html. So if you wish to get the entire link, which is this, we again write some code. This is basically just normal string concatenation. This is the link; now this is where we wish to
insert the link, so we use the str.format method that is built into Python. And always remember to close the CSV file if you have it open in another program, because if you leave it open you might get a permission error, as you can see here: PermissionError. Because the file is open, we are not allowed to edit it. So, no, I don't wish to save it. After I've closed it I run this again, and it works perfectly. Now if I open it, as you can see, we're getting the link as we want it. So if I wish, I can just click on it. Unable to open. Okay, so you can't open the link from Excel, but I think you can open it from Visual Studio Code; let me see if I can. Yeah, so here's the entire link. Now, another problem is that it's adding two slashes; I made a mistake in the code, so let me rectify that. And now let's scrape it again: don't save, py scrape.py, and there it is. The links are formatted correctly, so it's working as we expect. I hope that covers it. Is there any question?

In the beginning of this session, before creating the scrape file, you showed us how to install lite-server. What is the use of that if I'm writing a Python program and running it on a different machine?

Yes, lite-server. Okay, so this is the HTML page that I have. I created it; it's called index.html. If I wish to open it, I can open it like this, which just opens the local file in my browser. But I wanted to show you how to get information from a website, and for that you need to send a request to a web server. This is why I'm using lite-server. lite-server is nothing but a dummy offline server; it's a server that you're running on your own PC. This is why the web address is localhost and the port is 3000. If I didn't want to do that, and wanted to use the HTML page I've created directly, then instead of using the requests package I can just open it, give it the file
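A small sketch of turning the relative link into an absolute one with str.format, including the double-slash slip mentioned above. The exact template the presenter typed is not shown in the transcript, so the naive version here is only a guess at how two slashes could arise. urljoin from the standard library is shown as an alternative that was not used in the session.

```python
from urllib.parse import urljoin

BASE_URL = "http://localhost:3000"   # address of the demo server
relative_link = "/article_1.html"    # value read from the href attribute

# Assumed mistake: the template ends with "/" and the href starts with "/".
naive = "http://localhost:3000/{}".format(relative_link)

# Fix: leave the trailing slash out of the template.
fixed = "http://localhost:3000{}".format(relative_link)

# Alternative (not from the session): urljoin resolves the relative link itself.
joined = urljoin(BASE_URL, relative_link)

print(naive)
print(fixed)
print(joined)
```

urljoin also copes with hrefs that are already absolute or that do not start with a slash, which plain string formatting does not.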
name, index.html, and I think this is how it works; I'm not exactly sure. Let's see. Oh yeah, data.csv is open; let's close it, don't save. And it works. So the reason I used lite-server is that I wanted to show you how to get information from a website, but since I created the web page on my local machine, I needed to serve it using a server, a local server if you will. That's exactly why I use lite-server. If you don't wish to use lite-server and want to experiment on web pages of your own that you've created and have with you, you can use the code I just wrote: instead of using the requests library, just open the file inside the BeautifulSoup call and type in the name index.html, or whatever your file is called, provided that both the script and the HTML file are in the same folder. Then you can parse it as normal. I'm using the requests library so that I can show you how to get the information from a server.

Please paste the code in the chat box.

Do you wish for me to paste the entire code? I can paste the entire code here. So this is the entire code. I've pasted it, and let me also paste the other one, yeah, this one. Okay, so I've pasted the version with the requests library and also the one with the local HTML page; I think this should help you. You can use the requests library as well as the plain page. You can use lite-server if you wish, and if you're using lite-server you use the requests library; if you're not using lite-server and you're working with a local HTML page, then you open it inside the BeautifulSoup call. Lastly, if you're scraping any other website, do make sure that they have not asked you not to do that, because the extra traffic could make it difficult for them to maintain the server, and it could also have legal repercussions. So is there any other question?

The quick question is: for example, I'm reading a file, or maybe a
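Parsing a local HTML file instead of requesting it from a server, as described above, could look like this. The example first writes a tiny index.html into a temporary folder (an assumption, so that it runs without the presenter's file) and then passes the open file object to BeautifulSoup, which accepts a file handle as well as a string.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Create a small hypothetical index.html so the example is self-contained.
html_path = os.path.join(tempfile.mkdtemp(), "index.html")
with open(html_path, "w", encoding="utf-8") as f:
    f.write(
        '<html><body><div class="post"><h3>'
        '<a href="/article_1.html">Blog post 1 title</a>'
        "</h3><p>Blog post 1 summary</p></div></body></html>"
    )

# No web server and no requests call: just open the file and hand it over.
with open(html_path, encoding="utf-8") as html_file:
    soup = BeautifulSoup(html_file, "html.parser")

title = soup.find("div", class_="post").h3.a.text
print(title)
```

From this point on the soup object behaves exactly as it does when the markup comes from requests.get, so the extraction code needs no changes.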
website called discoverychannel.com. Do I need to install lite-server?

No, you do not need to install lite-server. lite-server is just for serving the web page locally. If you're getting information from a website that's online, like discoverychannel.com as you said, you don't need lite-server at all. What you need to do is just type in the address of the website here. In our case, because I'm using lite-server, I have localhost:3000. Since you said you have discoverychannel.com, I type discoverychannel.com, and then you edit the code according to the structure of the HTML page you get and the information you need. After that you just run the code and it will work without any need for lite-server. The only reason I use lite-server is that I needed to show you how to use the requests library to get information from a website. And I think that's it. Any other question?

Okay, I have a question over here. For production use, let's suppose I want to scrape the website of my competitor, and they have multiple pages, like Amazon. I need to go to every page of my competitor and get the price, so that I can set my price by comparing with theirs, or do some data science. For that, do I need to give all the URLs one by one? How do I do it recursively across multiple pages?

Yeah, sure. There are multiple ways of doing it. As far as I'm aware Amazon provides an API, I think; I'm not sure, so you can look it up. But even if it doesn't, there are multiple ways of doing it. One way is to scrape the entire website for links, store all the links in a CSV file, and then use a Python script to take the links from that CSV file and scrape those pages. Another way, instead of storing the links inside a
CSV file, is to do it recursively. For instance, let's say you find a link on the website; you go to that link and scrape that page. So instead of just scraping the page you have right now, you take the first link you see, open a request to that link, and start scraping it. But you do need to be aware that when you're scraping a website you need to look at its structure and then use it. For the website we have here, we know the structure: we know that the information we need is inside a div tag with the class post, inside an h3, and inside that there's an a tag whose text is the information we need. If you're just recursively going through arbitrary links, you may not know the structure of the page, and your code will throw errors. To avoid that, make sure you know the structure of the page. You can do it recursively, or you can store the links inside a CSV file and use them that way, or, I think the better way of doing it, you can use Scrapy, because it allows you to create a web crawler, which does exactly what you're asking. Scrapy is a much more full-fledged framework, so if you wish to use it you should go through the tutorial that I'm showing right now. It shows you how to create a Scrapy project, which is here, how to create a spider and how to run your spider. The spider is exactly what you're asking about: it scrapes an entire website.

Also make sure that amazon.com, or any other website you're scraping, doesn't prohibit you from scraping it. Look at their terms; I think they have a user agreement or something that will contain that information, and there's also the robots.txt file, so you can check either one. So I'd suggest you use Scrapy for creating crawlers and going through multiple links, or
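The idea of following links from page to page can be sketched with Beautiful Soup alone, under some loud assumptions: the "site" below is an in-memory dictionary rather than a real server, fetch is a hypothetical helper standing in for requests.get(url).text, and the crawl uses a queue instead of literal recursion so that it cannot loop forever or overflow the call stack. It is not a substitute for Scrapy, which the answer above recommends for real crawlers.

```python
from collections import deque
from urllib.parse import urljoin, urlparse
from bs4 import BeautifulSoup

# Hypothetical three-page site, keyed by absolute URL.
PAGES = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="http://other.test/x">ext</a>',
    "http://example.test/a.html": '<a href="/b.html">B</a> <a href="/">home</a>',
    "http://example.test/b.html": "<p>no links here</p>",
}


def fetch(url):
    """Stand-in for requests.get(url).text; returns '' for unknown pages."""
    return PAGES.get(url, "")


def crawl(start_url, fetch_page, max_pages=50):
    """Visit pages on the start URL's domain, following links breadth-first."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        soup = BeautifulSoup(fetch_page(url), "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])          # make the link absolute
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


visited = crawl("http://example.test/", fetch)
print(visited)
```

The seen set is what prevents the crawler from revisiting the home page, and the domain check is what keeps it from wandering off to other sites.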
you can use the technique I just told you about: getting all the links, storing them in a CSV file, and then scraping the links one by one. Either way you'll get the information, but just make sure that crawling the website is not prohibited and that the structure of the pages is known.

Okay, so you're saying the structure needs to be known beforehand, right? Is there any intelligent algorithm, like the ones we've learned in data science, that can help predict the structure, so that I don't have to know it? If I want to scrape the whole Amazon website, it's very difficult for me to go through every web page, remember where they have which tag, and then write very lengthy code. Do we have any algorithm for that?

Yeah, so not that I'm aware of. But one thing I think is helpful here is that if you're trying to scrape Amazon, you're just scraping it for the prices of the items they're selling. Now, when you look at amazon.com, let me show you Amazon. Let's say I wish to look at all the shoes; here are the shoes under 4.99. I get this page, and now if I wish to get the price of all the shoes, I can take what I have here and inspect it. The price is inside a span with a price class. So I can scrape this page, or if I have the links for all the shoes I can just open them, and the price will always be inside a span with the id priceblock_ourprice. So what you can do is get the links of all the pages, look at the id or the class of the element you want the information from, and use it that way. So instead of getting a div with the class post, what we want is a span with either the classes a-size-medium a-color-price priceBlockBuyingPriceString, or
with an id of priceblock_ourprice, and that way you can get the information from all the pages. As for the algorithm, I'm not aware of any intelligent algorithm that can just infer the structure of the page and extract the information you want. You could probably use a regular expression or something like that, but it would be very difficult, and regular expressions can be computationally expensive. So either you create a web crawler, or you use the already known structure of the website to get the information.

So for a web crawler we also need to give the structure, right? We need a structure before it connects?

What crawlers do is recursively look at the other links on the pages you give them as well. This is why web crawlers are used to gather all the information they can get from a website. For instance, there are meta tags, a title tag, a body tag, and if the crawler finds those tags it will store the information about them inside its database, and then recursively look at all the links that website has, to get a better idea of what the website is all about. That is what a crawler or spider is. A scraper is basically what we just created: it just looks at a single page, gets the information from that page, and that's it. You can create a spider with the Beautiful Soup library, but it would be more difficult, so I'd suggest you use Scrapy for creating a crawler. But if it's just one web page that you need to get information from, then you can use Beautiful Soup.

Okay, one last question. Can we pass the username and password as well with requests, if the site is...

Yeah, so this is, I think, a question similar to the forms question we got earlier. If you need to pass in a username and password, then I would suggest you use Selenium. Selenium is a web automation tool, so it will open the
browser, enter the username and password that you tell it to enter, click the button, and log in. Now, usually the information you wish to get is obtained either through an API or from a public web page. If the information you need is behind a login screen, then I think it's safe to assume that the creator of the website doesn't wish for you to scrape it. If it were public information, like what we just used, blog posts and so on, you could easily crawl that. But if it's not public, if it's information private to a specific user, even if that user is you, then I don't think they allow crawling it. So getting information by putting in a username and password is not really something you do with a scraper. You just look at a public web page that's available without providing any credentials, and you get that.

So you mean to say that if I need to scrape Facebook with my login user ID and password, that is also not allowed?

Yes, that's actually not allowed; you can't do that. You can maybe write a script that uses Selenium to open the browser on your behalf and enter the username and password, and then you can look at the page inside the Selenium instance, but web scraping is not really about that, so I don't think you can do that.

A small question. For example, say I'm crawling a particular website, for example Amazon, if it allows me to get the data. There might be multiple pages or multiple links with the product name Puma, and I want to know, for today's deals within a price range of, say, 2,000 to 5,000, what categories of shoes are available with the brand name Puma. How do I write the code for that, or what are the steps to get the data?

Sure. Okay, so,
as far as I can understand, you're saying that you need to filter the data and then get the data for that filter. So for instance you're saying Puma shoes, correct?

It's a product name, Puma, and I want to know all the categories or names of shoes with the Puma brand that are available between 2,000 and 5,000 today, and I want to extract the data for that.

So one thing you can do is first go to the Amazon website yourself and search for Puma shoes, like I've done; I hope you can see it on the screen. After that, you're looking for something within the price range you want, so you can filter accordingly. Let's say I want a shoe in size 4. Okay, now this URL is what you need. If I type this URL here, I get exactly the same results I got previously. So you can scrape it by first getting this URL and then looking at the structure. On this page you will only get the shoes labelled as Puma shoes that match the filter; instead of the price range, I've chosen the size as the filtering factor, so size 4 and the name Puma. From there you can begin scraping by looking at the structure of the HTML page. So I want to get information about all the... yeah, I can just look at this, and this is the class. You can look at the structure of the page and extract the information about the price, or the current deal that's going on, or whatever. For the price we have a class beginning a-price-. So what you need to do is first go to the page, enter the filtering information, and copy the website's address. That address returns only the products that match the filtering criteria you provided, and then you can scrape that address. That way you can do it.

Got it, thank you.

Yeah, sure. Any other questions?

For the entire website, for
example, I mean all the pages are organized under a website with different navigation paths, right? So at the website level, if you want to capture all the details, do we have an API?

Maybe. This is also crawling recursively. Let's say you go to Amazon again. Usually all the links you need are on the front page of, say, Amazon or any other site. For instance, this is the navigation that you wish to get the information from. So again, like I said, you can look at the HTML content, at all the HTML tags with the menu class, and inside them we have all the information you need. So you can do it that way. Now, if you're asking whether there's some other way to know the structure of the entire website, for instance how many web pages there are, what the folders are, and which HTML page is in which folder, so that we can scrape all of them, this is what web crawlers are for. You can use crawlers as well. Also, some websites do provide information about the structure of the website, sometimes in the about section and sometimes in a file that they publish, so you can do it that way too. However, if you wish to scrape the entire website and not just information from one web page, I'd suggest you use Scrapy to create a spider and crawl the entire website through all the links you can get. Or, if you're using Beautiful Soup, collect all the links that belong to this website's domain, so when scraping Amazon, all the links that start with amazon.com, and then scrape all those web pages. That way you can do it as well. I don't think there's an API that lets you just type the name of a website and get its whole file structure. There may be one; I haven't used one so far.

Okay. Any other question? The question here is: do we need to
extract the data one by one for each item? So yes, you do need to extract the information like we've done here. The best way to do it, as far as I can tell, is to extract the information about one item first and then do it for all the items present on the page. Because the items have a similar structure, we can use the same code, and that way we get the information. For instance, we needed information about all the posts, so we loop over all the posts, get the information about them one by one, and store it in the CSV file.

That reminds me to ask one more question. Whenever we go through a particular structure of a website, do I need to create that many variables to read the data and export it to my CSV file in a similar manner? Is that correct?

Okay, so yes, you do. Whenever you're scraping a website, you're usually scraping it to get information about a particular thing, and a single page isn't, as far as I can tell, extremely huge. For instance, even on an Amazon web page there are only going to be 10 items, 15, 20, 100. Say you need information about the price. All you need to do is go soup.find_all on a span with the class a-price. This gets you the price elements of all the items on that page, and then for each item you take item.text. That way you only need to create one variable to get the information about all the elements on the page. As for your question about whether we need to store information in all those variables: yes, you need to store everything you need in a variable. If you don't wish to store all the information, you can just write it directly, but even for that you need to create a row. So I would
suggest that when you're scraping a website, you try to scrape a small portion of it, or a portion that repeats itself. For instance, on our website there are six posts; we scrape just one post, and this post structure repeats itself. If there were, say, a million posts, I wouldn't have to change my code in any way, shape or form; it would work just fine, and the CSV file would just be a bit longer. So I would suggest that you look at the repeating element, get the information you need from one element, and then run that inside a for loop or a while loop or whatever loop you choose, and write the rows inside it.

One other question we have is: if we have Wikipedia, there is no data in there, only articles, so how do we do that?

Okay, so let's go to a Wikipedia article. What you're asking is: if we have an entire block of text, how do we extract information from that? Wikipedia does actually have an HTML structure. If you look into it, let's say I wish to extract just the history portion, or just the headlines of all the sections of this article. I think all of them have mw-headline; yeah, this one also has mw-headline as the class. So you just get the text of all the mw-headline elements. What I'm saying here is that even if the page looks like one block of text without any HTML structure, there will be some. Now, if you wish to extract just this paragraph from the entire block of text, there are several ways. You can use a regular expression, which is used to extract text matching a pattern from a block of text. Or you can take this text and split it on \n, which gives you a list of all the paragraphs, and then just take the last item of the list, provided that the content's structure has not
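The three techniques mentioned for article-style pages (collecting section headlines by class, matching a pattern with a regular expression, and splitting text on newlines) can be sketched on an invented snippet. The mw-headline class is the one named in the session; the live Wikipedia markup may differ today, and the sample text and the four-digit-year pattern are assumptions chosen only for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical article fragment using the class name mentioned in the session.
ARTICLE_HTML = """
<h2><span class="mw-headline">History</span></h2>
<p>First paragraph of the history section.</p>
<p>Python 3.0 was released in 2008.</p>
<h2><span class="mw-headline">Features</span></h2>
<p>Last paragraph of the article.</p>
"""

soup = BeautifulSoup(ARTICLE_HTML, "html.parser")

# 1. Section headlines: every span carrying the headline class.
headlines = [span.text for span in soup.find_all("span", class_="mw-headline")]

# 2. Plain text of the paragraphs, one paragraph per line.
block_text = "\n".join(p.text for p in soup.find_all("p"))

# 3. Regular expression: pull out anything that looks like a four-digit year.
years = re.findall(r"\b\d{4}\b", block_text)

# 4. Split on "\n" and take the last item to get the final paragraph.
last_paragraph = block_text.split("\n")[-1]

print(headlines, years, last_paragraph)
```

Taking the last item only works while the page keeps the same layout, which is the caveat raised in the answer itself.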
changed and the last paragraph is what you want and you can do it that way as well so if you’re using it on wikipedia or anything like that you can just use that can we do vice versa for example if i’m developing my own website and i have like 10 000 records to upload on that website so instead of reading from a website can i simply write to that website so that everything gets uploaded in one go and i can have all the data available on my website okay so as far as i can understand you are saying that you have a website where you need to enter some information inside a form let’s say that you have a website and a csv file the csv file contains all the information about employees and there is a form on the website that you need to insert all the information into can we use web scraping for that is that the question correct okay so again like i said web scraping is not really for that for that you need to use something called selenium so this is browser automation there’s also a selenium package for python so if you’re quite good with python or like using python you can just use selenium for that now you can just tell it to do what you want it to do so for instance you give it the url of the website tell it okay there’s a form inside the website i wish for you to get the record from the csv file enter the name of the employee and whatever else it is according to the information that you get from the csv file and then press the submit button and then wait for 10 minutes then do it again then do it again then do it again now you can do it this way selenium is built exactly for that as you can see we have one for python as well but web scraping is not about that web scraping is about getting the information from a web page a public webpage not about sending information to a webpage so we work with responses not with requests this is why we’re using the requests 
library to get the response from this website convert it to text and then beautiful soup can use it so if you plan to use it to submit information to a form i’ll suggest that you use selenium i think we do have courses on selenium as well so if you’re interested in that then maybe you should refer to those but web scraping is not really about that i just saw one question how to deploy this thing in production means you have written this code locally and i want to deploy it maybe automate it maybe running once each day or something so how to deploy this thing in production so okay so the first thing that you need to understand is that this is not production ready code okay this works fine but if you are scraping a big website i suggest that you use object oriented programming that way you can create a class that will have methods that will allow you to scrape websites now after you have made the code production ready and you have shown that it’s going to work fine then what you can do is either deploy it on a linux virtual machine online using azure amazon web services or any other provider or run it on your own pc if you want to run it on your pc then you need to keep your pc running for the entire day or forever since it’s a production server but if you wish to run it on any other machine you can use linux you can use windows or any other server operating system and then take this file of yours which is scrape dot py or whatever you call it depends on you and put it inside that virtual machine and inside that virtual machine there are python extensions that allow you to schedule the execution of the script or also i think in linux there’s something called cron jobs c-r-o-n-j-o-b-s so cron jobs are built exactly for that they schedule a command or script on your server to run automatically at a specified time so for instance let’s say that 
you have a web scraper you wish to run it every hour and put the information inside a csv file so you can just take this scrape dot py file or any of the files that you have and put it inside a linux server that is maintained by azure or amazon web services or any other service provider that doesn’t really matter just make sure that it’s running all day and then schedule a cron job now scheduling the cron job is out of the scope for this session but you can look it up it’s fairly easy just opening the crontab and adding an entry for the file so just schedule a cron job to run this file every one hour or however often you wish to choose and you can probably do it that way so we can’t install this on a windows machine yes we can install it on windows i think there’s something else many cloud providers like azure provide you with facilities to run scripts regularly so maybe you can use that but for windows i’m not really aware if there’s a command that allows you to schedule the running of a command or maybe you guys might know yeah okay fine got it thank you thanks a lot also i think you can write a python script which is running in the background continuously and then every hour it’s running that script so you can run it that way as well but i don’t think there’s a command inside the windows operating system like there is in linux which allows you to run a job or a script at a specified interval repeatedly can we use this code in model building just like i am going to create a machine learning model right yeah so yes this code can be put in any of the models if you wish to but i think the way it works is that you first get the information from the website you can store it like for instance we got it from the website and we’ve stored it inside the csv file and then we use that file that 
information that you’ve gotten from the web scraping for training your models or if you wish to you can just get the entire information and not store it and just use it directly but i would suggest that you store it inside a csv file because that way you will know what information you used to train your model so you can if you wish to just get the information from the website and start using it to train your model without storing it inside a csv file but it would be better in my opinion to store the information inside a csv file and then use that csv file to train them so let’s start off the linear regression recap with this example so this is lauren she’s looking for a property to buy but she’s confused how to start so she goes to one of her friends josh and asks him if he could help her to find a property with a bigger garden area for x bucks our guy josh on the right agrees to help her to find a property but he himself doesn’t know how so what does he do he goes to another friend of his explains the whole situation to him and asks him if he can do anything about it he immediately says yes and starts doing some calculations then he says to josh that spending x bucks can get her a property of area y now josh is confused and asks him how did he find out the friend goes like simple linear regression now let’s see how exactly he used simple linear regression to solve this issue so here are our dependent variable and independent variable property size being the dependent variable and money being the independent variable what this guy wants to do is find out a relation between the property size and money so it’s like if you want a house of bigger area or bigger property size then you have to spend more money so both of them are directly proportional right so in this case we’ll get a positive linear regression line right that is spending more money will get her a bigger property okay in other case but again our 
next scenario is she wants a bigger garden area right so imagine the scenario keeping the property area constant the house area is inversely related to the garden area it’s like if you want to increase the garden area you have to reduce the house area so take it like this suppose you have already fixed the property and you want to construct a house with a garden in it but now your demand is you want to have a bigger garden area so if the size of the property is fixed and if you increase the size of the garden then obviously you have to reduce the size of your house right so in this case if you try to plot a regression line you’d get a negative regression line right bigger garden area means smaller house right now let’s take an example to see how exactly he predicted the property area so in order to find the regression line what he did he took the historical data of property areas sold at particular prices okay and he plotted it on the graph so these were the plotted points of property area sold in the past for a particular price okay now what he did he drew a regression line now to find out what property area she can buy with x bucks he plots x on the independent variable scale and projects it to the regression line and then against that point there he has the area y this is the area y so this is how he predicted lauren can buy y area of property for x bucks all right now let’s see what he can and what he cannot say from this so he can say that if lauren spends x amount of money she can buy a property area of y but what he cannot say is will the property have a good neighborhood or not or will the location be a noiseless suburb or a bustling city right so these are the questions which even he cannot answer using this graph so questions like will the property have a good neighborhood or will it rain tomorrow or not or is this mail spam or not all these kinds of problems fall under a particular category known as classification problems in machine 
learning now with the linear regression algorithm we could not answer these problems so that is where logistic regression comes into the picture now let us see where this logistic regression algorithm that we just talked about lies in the machine learning algorithm tree so in machine learning we use two traditional learning techniques to build a predictive model supervised learning and unsupervised learning again look at supervised learning there are two categories regression and classification right in regression we have linear regression and in classification we have logistic regression and svm so today’s topic of discussion is logistic regression which comes under the category of classification okay so now that we have got a little bit of an idea about logistic regression let’s go a little bit deeper and discuss what exactly logistic regression is and why we use it so what is logistic regression well logistic regression is a statistical classification model that deals with a categorical dependent variable again you must be wondering what are these categorical dependent variables well these are discrete variables that have two or more categories without having any kind of natural order for example temperature area or gender okay so you can say that logistic regression is generally used where the dependent variable is binary that is only two outcomes are possible either yes no true false one zero etc right and also remember the fact that you can use both continuous and discrete input data with logistic regression so before moving ahead let’s look at this graph over here we have two different variables hours of studying and probability of passing the exam can you figure out which one of 
them is dependent and which one of them is independent so if you have guessed that hours of studying is your independent variable and probability of passing the exam is the dependent variable then i’d say that you are 100 percent correct so now that you know what exactly logistic regression is let’s move ahead and see why we use logistic regression well logistic regression can be used as a tool for applied statistics and discrete data analysis why because it gives the output in the form of probabilities which help us to easily classify the given data okay so this is why we are using logistic regression so now that we have successfully established the basics of logistic regression by understanding the what and why of it let’s go ahead and see how logistic regression can be applied for classifying data with the help of an example so here we are using an example of a spam email classifier we need to build a predictive model that would classify whether a mail is spam or not so let’s look at the approach that we are going to take while building this model first we’ll try to understand the variables on the basis of which we are classifying the mail next we will plot the labeled data once we are done with plotting the labeled data we’ll draw the regression curve and finally we’ll try to find out the best fitted curve using the maximum likelihood estimator all right so let’s get started so step one is defining the variables so let’s start off by understanding what is the independent variable in our case so in our case the independent variable is the count of spam words well here are some examples of commonly used spam words words like buy get paid guaranteed winner unlimited etc okay so these are the kind of words which when found in a mail lead to the mail being treated as spam if the number of these kinds of words is higher in a mail then that mail is more likely to be a spam mail okay just for a better representation let me put them in a bag of spam words let’s put the bag so there’s a bag of spam words let’s 
put all the words in it one by one buy get paid guaranteed winner and unlimited okay now what about our dependent variable well our dependent variable is going to be the probability of a mail being spam if the probability is one that means the mail is spam if it’s zero that means it’s not a spam mail well in general a mail with fewer words from the list of spam words will not be treated as spam while a mail with five or more spam words would be treated as a spam mail but there can be cases where you might find mails with fewer spam words being spam also you might find cases where a mail with more spam words is not a spam mail so here our aim is to build a predictive model to classify the mail with minimum error okay let’s see what is our next step so our next step is plotting the labeled data let’s say this is the set of data that we’ll be using to build the model this is a very small data set but just remember that whenever you are using logistic regression make sure that you are using a large amount of data logistic regression works pretty well with a large amount of data it doesn’t work that well with a small amount of data okay here just for the purpose of understanding we are using a small data set all right so we have two variables the number of spam words and the probability of the mail being spam against each mail okay next as step three we’ll draw the regression line so next what we’ll do we’ll plot our data set on the x-axis and y-axis with the independent variable on the x-axis and the dependent variable on the y-axis so the number of spam words in a mail is the independent variable right and the probability of that particular mail being spam or not spam is the dependent variable it depends on the number of spam words in a mail right so let’s plot these points one by one so first we have one word and the probability of this mail being spam is zero so it will be plotted up here so next we have five spam words in a mail and the probability of that mail being spam is 
one so it will be plotted somewhere here next is three spam words and it is a spam mail so it will be plotted again here two words it’s not a spam mail so i’ll be plotting it here seven words again a spam mail up here four words not a spam mail here nine words it’s spam eight spam words okay it’s not spam so once we are done with the plotting this is how our plotted data would look now let’s say that we have a new mail and now we want to figure out whether it’s spam or not so before moving ahead let me just tell you in a real world scenario to perform logistic regression you need a large data set and also you might find many cases where a spam mail contains only two spam words or also it might be possible that you get a mail where you are having more than five spam words and even in that case your mail is not spam okay so here we are building a predictive model with the primary aim of reducing the error okay now let’s say we have a new mail up here now we need to figure out whether this mail is spam or not but how do we do that well first of all we need to plot a regression curve which would fit the best and that curve would be our logistic regression curve but now the question comes how to find out which is the best regression curve okay well this will involve a few steps the first step is to convert the y-axis from the scale of probability confined between 0 and 1 to a scale of log odds then we draw an arbitrary regression line through the data that we already have then with the help of the sigmoid function we’ll convert the log odds to the probability of each mail being spam we’ll plot each mail on the basis of its new probability value and this will form our regression curve then finally from this plot we’ll find out the individual likelihood of each mail and at last we’ll find the log of the likelihood that would be our log 
likelihood of the regression curve now the question comes what do these terms such as log odds or log likelihood mean so before moving ahead let’s discuss that so what does log of odds mean let us explore this with the help of an example so before we proceed any further let me just clarify one thing probability and odds are not the same thing let me explain this with an example suppose this guy goes fishing five times a week so out of five times he catches a fish two times and he fails to catch one three times okay so now in this case what are the probability and odds of getting a fish for dinner so let’s first calculate probability so probability is chances for divided by total chances so what is the probability of catching a fish that is how many times he caught a fish that is two divided by the total chances he had to catch the fish so that was five right so here the probability of getting a fish for dinner is two by five okay next comes the odds chances for divided by chances against that is the ratio of how many times he caught the fish divided by how many times he failed to catch a fish okay so he caught the fish two times and he failed in catching the fish three times so the odds of getting a fish for dinner are two by three okay so now that we know odds let’s see whether log of odds and log odds ratio are the same let’s find out for your information log odds is also called the logit function okay so in our previous example where the fisherman was catching a fish let’s add another factor to his fishing let’s add weather as a factor so then we can recreate the entire scenario as he was successful three times on a rainy day but on a sunny day he was successful two times now the odds of catching fish on a sunny day is how much it’s two by three right and the odds of catching on a rainy day is three by two right as it’s already mentioned that on a rainy day he catches three times so he’s successful for three 
times and he failed for two times so let’s see he was successful three times on a rainy day and two times on a sunny day so the odds of catching a fish on a sunny day is two by three that is he’s successful two times on a sunny day and he failed three times so similarly the odds of catching fish on a rainy day is three by two and now the log of odds of catching a fish on a sunny day is just the log value of two by three and similarly the log of odds for a rainy day is log of three by two next is the log odds ratio so the log odds ratio is nothing but the log of the ratio of the odds on a sunny day to the odds on a rainy day that is log of 2 by 3 divided by 3 by 2 which is nothing but log of 0.44 so here we can say that log odds and log odds ratio are different now let us go back to this step so now that we have an understanding of log odds we are ready to perform this step okay so let’s see how we convert the zero to one axis to a minus infinity to plus infinity axis so here we’ll be converting the probability scale to a scale of log odds okay so for the log odds we have a formula as log of probability of spam divided by 1 minus probability of spam so here the probability of a mail being spam is 1 okay so we get the value of log odds as log of 1 divided by 1 minus 1 that is log of 1 by 0 which tends to positive infinity how did we get that so log of 1 by 0 is nothing but log of 1 minus log of 0 and log of 0 up here is minus infinity so minus of minus infinity is what plus infinity how is log of 0 up here minus infinity let’s see so in general we have log 0 with base b equals c so if you convert it into exponential form we get 0 equal b to the power c right so if the value of b is less than 1 the value of c has to be extremely large or closer to plus infinity for this equation to be true okay and c has to be closer to minus infinity in the 
case where b is greater than 1 or our base is greater than 1 okay so coming back up here log of 1 by 0 is nothing but log of 1 minus log of 0 and we got the result as positive infinity as log of 0 up here is minus infinity and minus of minus infinity is plus infinity how did we get that let’s see so log 0 with base b equals c if we convert this into exponential form we get 0 equal b to the power c right so for this equation to be true if the value of b or the base is less than one then the value of c has to tend to positive infinity okay for example 0.1 to the power 1000 would be smaller than 0.1 to the power 100 right so the larger the value of c up here the closer the number will get to 0 right and in the next case if the value of b is greater than 1 we have to make the value of c tend to minus infinity why for example we have 10 to the power minus 1 okay or 10 to the power minus 10 which is smaller 10 to the power minus 10 right so it is much smaller or closer to zero so that’s why we have to keep the value of c as low as we can so here if the value of b is greater than 1 the value of c would have to be close to minus infinity in order to make the equation true so in this case log of 1 by 0 by default we have the base as 10 okay so that’s why we took the value of c as minus infinity so minus of minus infinity is plus infinity so that’s why we got plus infinity up here i hope it is clear to you how we got the value of log odds as plus infinity so we’ll plot this up as plus infinity log odds now next let’s find the log odds of a non-spam mail so here the probability of the mail being spam is 0 so the formula gives log of 0 divided by 1 minus 0 so we have log of 0 by 1 which is log 0 minus log 1 which tends to minus infinity okay similar concept now we have our data up here so first 
we’ll assume one regression line then we’ll project our data onto the regression line okay now let’s just go back to the step where with the help of the sigmoid function we’ll convert the log odds to the probability of a mail being spam okay but what does this sigmoid function mean so the sigmoid function is the standard logistic function the logistic function is defined as l multiplied by e to the power k times x minus x naught upon 1 plus e to the power k times x minus x naught so here l is the curve’s maximum value k is the steepness of the curve and x naught is the x value of the sigmoid’s midpoint okay so here the sigmoid function is e to the power x divided by one plus e to the power x that is k equal one x naught equal zero and l equal one so this mathematical sigmoid function forms an s shaped curve which is confined between 0 and 1 let’s try to understand how the logistic function works with the help of an example so let’s say we have a set of unspecified data and on this data we need to apply a sigmoid function so plot this data and find out the respective y for each point okay so what a sigmoid function is doing can be visualized from this graph right so you are giving some values on your x-axis and using the sigmoid function you can predict the probability on the y axis right so this is the reason why the sigmoid function is very useful while solving a classification problem it takes any real valued number and maps it onto a value between zero and one okay well now that we have an idea of how the sigmoid function works let us move ahead with our spam email classifier so we are ready to perform this step that is converting the log odds graph into a sigmoid function graph now we have to find out the best mle okay our best maximum likelihood estimator 
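The logistic and sigmoid functions described above can be sketched in a few lines of Python. This is a minimal illustration, not the notebook code from the session: the function names `logistic` and `sigmoid` and the sample inputs are assumptions chosen for the example, and the general form is written as L / (1 + e^(-k(x - x0))), which is algebraically the same as the spoken form with e^(k(x - x0)) in both numerator and denominator.

```python
import math

def logistic(x, L=1.0, k=1.0, x0=0.0):
    """General logistic curve: L / (1 + e^(-k * (x - x0))).

    L is the curve's maximum value, k its steepness and x0 the midpoint.
    Fine for moderate inputs; very large negative x would overflow math.exp.
    """
    return L / (1.0 + math.exp(-k * (x - x0)))

def sigmoid(x):
    """Standard sigmoid: the logistic curve with L = 1, k = 1, x0 = 0."""
    return logistic(x)

# Illustrative inputs only: any real number is squeezed into (0, 1),
# and an input of 0 (log odds of 0) maps to a probability of 0.5.
for value in (-6, -2, 0, 2, 6):
    print(value, round(sigmoid(value), 3))
```

Feeding a log odds value into `sigmoid` gives the corresponding probability, which is the conversion the spam classifier walkthrough relies on.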
we are going to plug in the log odds value of each mail to get the probability of each mail being spam so we have the formula up here probability equal e to the power log odds divided by 1 plus e to the power log odds so one by one we’ll place the log odds of each mail into this formula and calculate the probability for each mail okay for example i have this mail which after projecting onto the regression line gives us a log odds value of minus 3.2 okay so what we’ll do up here we’ll calculate the probability using the value of minus 3.2 we’ll place the value minus 3.2 in our formula e to the power minus 3.2 divided by 1 plus e to the power minus 3.2 so from here we’ll get the probability as 0.03 so plot it accordingly onto the new graph of probability of mail being spam versus spam word count so on the basis of this probability the mail would lie somewhere near zero so according to the prediction this mail is not a spam mail again for another mail which is projected onto the regression line this gives us a log odds value of 5.6 so when we put the value of 5.6 in the formula we get the probability as 0.99 so the probability of this mail being spam is 0.99 so again we put this into this graph similarly one by one we’ll calculate for each one of them for this mail after projecting onto the regression line we get the log odds value as minus 4.5 so with minus 4.5 put into the formula we get the probability as 0.01 so the prediction for this mail is not a spam mail which is the same as the actual label right so again we plot this mail onto the graph all right so similarly you can repeat this step for the rest of the mails as well and finally we get the s curve up here so there’s our regression curve but you must be wondering is this the best fitted curve or how do we find out whether it’s the best or not well this is when the concept of maximum likelihood comes into the picture so now that we have the regression curve let’s find out the likelihood of this curve so first find 
out the individual likelihood of each mail again you must be thinking how do we get the likelihood value well the likelihood of each mail here is nothing but the probability value of each mail being spam so the likelihood of the first mail being spam is 0.01 the likelihood of the second mail being spam is again 0.01 similarly the third 0.03 the fourth 0.05 and so on till the 8th mail being 0.99 okay so once you get the individual likelihood of each mail multiply them to find out the likelihood of the entire curve okay then calculate the log of the likelihood for calculating the log of the likelihood you can just take the log of the previous multiplied result or equivalently add all the individual logs because log of a multiplied by b equals log of a plus log of b okay so we got the value of the log likelihood of this curve as minus 0.084 now let us rotate this line to find out the best fitted regression line again we calculate the individual log likelihood of each mail for this one let’s say we got the log likelihoods which are shown on your screen so the final value we got is minus 0.207 so now if we compare the log likelihood values for these two regression lines we’ll see that line a has a bigger value of log likelihood than line b right the log likelihood value for line a is minus 0.084 whereas for line b it is minus 0.207 and minus 0.084 is bigger than minus 0.207 right so therefore we can say that line a has a better likelihood value than line b now again we’ll rotate the line we’ll keep on rotating the line until we get the maximum value of log likelihood and then finally we’ll choose the line which is having the maximum log likelihood and that line would be the best fitted regression line so i hope the concept of logistic regression is clear to you guys 
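The comparison of two candidate lines by log likelihood can be sketched as follows. This is a hedged illustration rather than the session's own calculation: it uses the standard Bernoulli log likelihood, in which a spam mail contributes log(p) and a non-spam mail contributes log(1 - p), and the labels, probabilities and the helper name `log_likelihood` are all hypothetical values made up for the example. Natural logs are used here; a base 10 log would change the numbers but not which line wins.

```python
import math

def log_likelihood(labels, probs):
    """Sum of per-mail log likelihoods (standard Bernoulli form).

    labels: 1 for spam, 0 for not spam.
    probs:  predicted probability of spam for each mail.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        # A spam mail contributes log(p), a non-spam mail log(1 - p).
        total += math.log(p if y == 1 else 1.0 - p)
    return total

# Hypothetical data: four mails and the probabilities from two candidate lines.
labels = [0, 0, 1, 1]
probs_line_a = [0.1, 0.2, 0.8, 0.9]   # confident and correct
probs_line_b = [0.4, 0.5, 0.5, 0.6]   # hesitant

ll_a = log_likelihood(labels, probs_line_a)
ll_b = log_likelihood(labels, probs_line_b)

# The line with the larger (less negative) log likelihood fits better.
best = "a" if ll_a > ll_b else "b"
print(round(ll_a, 3), round(ll_b, 3), best)
```

Adding the individual logs gives the same result as taking the log of the multiplied likelihoods, which is why the sum is used in practice: it avoids multiplying many small numbers together.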
now this was all about the theoretical and mathematical concept of logistic regression so the confusion matrix shows the ways in which your classification model is confused when it makes predictions it is basically a summary of prediction results on a classification problem the main idea of a confusion matrix is to summarize the counts of correct and incorrect predictions so the image shown on your screen represents a confusion matrix let’s see what exactly it means but before that let me just tell you how to create a confusion matrix this would make things more clear for you so let’s see how so for creating a confusion matrix you’d need a test data set or a validation data set with expected outcome values then make a prediction for each row in your test data set then from the expected outcomes and predictions count the number of correct predictions for each class and the number of incorrect predictions for each class organized by the class that was predicted okay let’s see what exactly that means so here’s an example we have some expected output and a predicted output for that so from here you can see that all the red colored results are the incorrectly predicted values and the green ones are the correct ones so in total we have seven correct predictions out of ten okay so from here you can say that the accuracy of your model is 70 percent now here men classified as men are three one two and three and women classified as women are four one two three and four and now men classified as women one and two okay so two and women classified as men one now if you create a confusion matrix out of it you’ll get something like this men classified as men three men classified as women two women classified as men one and women classified as women four so from here you can say that total actual men three plus two is five total actual women one plus four again five and total correct values men classified as men and women classified as women that is 
three plus four it’s seven so from here you can see that there are more errors while predicting men as women rather than predicting women as men okay so this was about how you can calculate a confusion matrix now let’s come back and see how to interpret a given confusion matrix this is a sample confusion matrix so here we have created a confusion matrix for a fire alarm so this row represents an actual fire and this one represents no actual fire here the fire is predicted as positive and here the fire is predicted as negative so if the alarm goes on in case of fire it’s a true positive event if the alarm goes on and there is no fire it’s a false positive event if there is no alarm in case of fire it’s a false negative event and if there is no alarm and there is no fire then that means it’s a true negative event okay so let me just explain this example this should make things more clear to you so actual fire and predicted fire so total true positive events we have 40 and total false negative events we have 10. 
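The four cells of the fire alarm matrix can be written out as a small Python sketch. The counts are the ones quoted for this example in the session (40 true positives, 10 false negatives, 5 false positives, 95 true negatives); the function name `classify_event` and the variable names are assumptions made for illustration, and the accuracy figure is simple arithmetic on those counts rather than a number stated in the session.

```python
def classify_event(alarm_rang, fire_happened):
    """Name the confusion-matrix cell for one (prediction, reality) pair."""
    if alarm_rang and fire_happened:
        return "true positive"
    if alarm_rang and not fire_happened:
        return "false positive"
    if not alarm_rang and fire_happened:
        return "false negative"
    return "true negative"

# Counts from the fire alarm example.
tp, fn = 40, 10   # there was a fire: alarm rang / alarm stayed silent
fp, tn = 5, 95    # there was no fire: alarm rang / alarm stayed silent

actual_fire = tp + fn          # row total: events with a real fire
actual_no_fire = fp + tn       # row total: events without a fire
predicted_fire = tp + fp       # column total: alarm rang
predicted_no_fire = fn + tn    # column total: alarm stayed silent
n = tp + fn + fp + tn          # all events
accuracy = (tp + tn) / n       # share of events the alarm got right

print(actual_fire, actual_no_fire, predicted_fire, predicted_no_fire, n, accuracy)
```

Reading the matrix by rows gives what actually happened, and reading it by columns gives what the alarm predicted; both sets of totals add up to the same number of events.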
So from here you can say that the total number of actual positive events, the times the alarm rang, was 40 plus 10, that is 50. Here we have false positive events as 5 and true negative events as 95, so the total number of times the alarm did not ring was 5 plus 95, that is 100. And this one is whether fire was predicted or not: true positives plus false positives, that is 40 plus 5, is how many times the machine positively predicted the fire, so 45; and how many times the machine predicted no fire is 10 plus 95, that is 105. The total number of events, 50 plus 100 or 45 plus 105, is 150, so we have mentioned n equals 150 up here. Okay, so this is how you interpret a confusion matrix. So let's move ahead. Now let me show you in my Jupyter notebook how you can create a confusion matrix. Let me just open my Jupyter notebook. So there's my Jupyter notebook, and what we are going to do is create a confusion matrix in Python. The very first thing that I'll be doing up here is importing the required libraries, so from sklearn.metrics I'll be importing confusion_matrix. Next, let's create some expected values: let's say expected equals, and let's add some values in it like 1, 1, 0, 1, 0, 0, 1, 0, 0 and 0.
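Here is a minimal sketch of what this demo computes, written in plain Python so the counting stays visible. The `expected` list is the one from the session; the `predicted` list is an assumption, chosen so that the counts come out as the 4, 2, 1, 3 result reported in the session; `confusion_counts` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

expected = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]   # values given in the session
predicted = [1, 0, 0, 1, 0, 0, 1, 0, 1, 1]  # assumed values for illustration


def confusion_counts(actual, guessed, labels=(0, 1)):
    """Rows are actual classes, columns are predicted classes."""
    pairs = Counter(zip(actual, guessed))
    return [[pairs[(a, g)] for g in labels] for a in labels]


matrix = confusion_counts(expected, predicted)
print(matrix)  # [[4, 2], [1, 3]]

# Correct predictions sit on the main diagonal.
correct = matrix[0][0] + matrix[1][1]
accuracy = correct / len(expected)
print(correct, accuracy)  # 7 0.7
```

scikit-learn's `confusion_matrix(expected, predicted)` uses the same layout, with actual classes as rows and predicted classes as columns.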
Now this is my expected value. Now let's create some predicted values for that, so predicted equals: first it's predicting correct, next let's say 0, 0, 1, 0, 0, 1, 0, 1, so this is the predicted value. Now let's calculate the confusion matrix: let's say results equals confusion_matrix, and inside this we'll pass our expected and predicted values, expected comma predicted, and print the result. That's it, let's execute it. So here you got the result as 4, 2 and 1, 3. What does it mean? First we have 4: 0 predicted as 0 is four times, one, two, three and four. 0 predicted as 1 is two times, one and two. Next is 1 predicted as 0, which is just once, here. And next is 1 predicted as 1, that is three times: one, two and three. So this is a confusion matrix, and what can we say from here? The total number of correct predictions made by the machine is four plus three, that is 0 classified as 0 and 1 classified as 1, and the total number of incorrect predictions is two plus one, that is three. So we have seven correct predictions and three incorrect predictions. The total number of times the actual value was 0 is four plus two, that is six, and the total number of times the actual value was 1 is one plus three, that is four. So from here we can say that our machine predicts the result seven times correctly and three times wrongly, so the accuracy of our machine is seventy percent. So this was all about how you can create a confusion matrix in Python. Now for a demo of logistic regression with the help of the scikit-learn package: we're going to build this logistic regression algorithm on top of the heart disease data set. So let's quickly go to Jupyter notebook and start with our demo. Right, so this is Jupyter notebook, guys, and our first task would be to load up the heart disease data set, and for that purpose we would have to import the pandas package, so I'll just type in 
import pandas as pd, and I'll use the read_csv method from the pandas package, so I'll type in pd.read_csv and I'll pass in the name of the data set, which is heart.csv, and I'll store this in this dataset object. Now let me have a glance at the first few records of this data set. So this is a data set which comprises all of these columns, and we're going to build the logistic regression algorithm on top of this column over here, which is target. So target would be our dependent variable and the rest of the columns would be the independent variables. This target column has one and zero values: the value one means that the patient has heart disease and zero means that the patient does not have heart disease. Now let me also have a glance at the shape of this data set, so I'll just type in print dataset.shape, and this gives me a value of 303 and 13. So 303 means that there are 303 records in this data set, and there are 13 columns. Now let me have a glance at the value counts of this target column. This value_counts basically tells me the frequency of the two values which I have in this column, 1 and 0. So there are 165 records where the value is 1 and there are 138 records where the value is 0.
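The inspection steps above (head, shape, value counts) can be sketched on a tiny stand-in frame. This is an illustration under assumptions: the rows and the `age` and `chol` column names are made up, and only the `target` column name comes from the session. In the actual notebook the frame would come from `pd.read_csv` on the heart disease file instead.

```python
import pandas as pd

# Toy stand-in for the heart disease data; these rows are invented for illustration.
dataset = pd.DataFrame({
    "age": [63, 37, 41, 56, 57],        # assumed feature column
    "chol": [233, 250, 204, 236, 354],  # assumed feature column
    "target": [1, 1, 0, 1, 0],          # 1 = has heart disease, 0 = does not
})

print(dataset.head())   # first few records
print(dataset.shape)    # (number of rows, number of columns)

# Frequency of each value in the target column.
counts = dataset["target"].value_counts()
print(counts)
```

On the real data the same `value_counts` call is what produces the 165 and 138 figures quoted in the session.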
So this basically means that in this data set there are 165 patients who actually have heart disease and 138 patients who do not have heart disease. Now I'll go ahead and actually visualize this, so I'll load up the matplotlib and seaborn packages, and I will pass in this target column on to the x-axis, and the data is our dataset, which is this heart disease data set. What I'm doing is basically building a histogram, and I'll plot this up over here. Right, so this is the bar for the value of zero and this is the bar for the value of one, and this tells us the same thing: 165 is the number of patients who actually have heart disease, and this bar is for all of those patients who do not have heart disease. All right, now let me go ahead and divide the data set into features and label sets. I'm storing all of the features into this x object, so all of these 12 columns would be my features, and this target column would be my label, or my dependent variable. And this is how I'm going to divide the data set: I'm going to extract all of the columns except the last column and store them in this x object, and similarly I'll take only the last column and store it in this y object. Now let me have a glance at the independent variables and the target variable: x.head gives me all of the independent variables and y.head gives me the target. So now that we have our independent variables and the dependent variable, let me go ahead and divide this data set into training and testing sets, and for that purpose I'd have to load up the train_test_split method from sklearn.model_selection. Over here I'm setting the test size to be equal to 0.2, so this means that 20 percent of the records are in the test set and the remaining 80 percent of the records are in the training set. Right, I'll click on run. Now we have divided the data set into training and testing sets, and finally it's time to build the 
model, and for that purpose I'll be importing LogisticRegression from sklearn.linear_model. I'm going to create an instance of this, so I'll just use LogisticRegression and I'll name that instance log_model, and I'm going to fit this model on top of the train set, so I'm passing X_train and y_train as the parameters. Right, I'll click on run. So we have successfully built the model on top of the train set. Now we're going to go ahead and predict the values on top of the test set, so I'll type in log_model.predict and I'll pass in X_test as the parameter, and I'll store the result in y_pred. So we have also predicted the values; now it's time to calculate the accuracy. I will type log_model.score and I'll pass in X_test and y_test, since I want to calculate the accuracy for the prediction on top of the test set. Right, so the accuracy comes out to be 73 percent, which is actually not that bad. Let me also build the confusion matrix. The confusion matrix gives me a table of values which comprises the correctly predicted values and the misclassified values. So I'd have to import confusion_matrix from sklearn.metrics, and again I'll just pass in y_test and y_pred as the parameters inside this function, and I'll print this out. Right, so this left diagonal which you see actually represents all of those values which have been correctly classified, and this right diagonal represents all of those values which have been misclassified. If you want to get the accuracy, all you have to do is add up 20 plus 25 and divide it by the sum of all of the values, and you'll get the same accuracy. So let me add a new cell over here and calculate the accuracy from this confusion matrix. I have to divide this left diagonal by all of the values, so that would be 20 plus 25 divided by 20 plus 25 plus 10 plus 6, and this gives me a value of 73.77 percent, which is the same as I got over here. So the accuracy is 73 percent. Right, so we 
have built the confusion matrix. Now I'll also go ahead and build the ROC curve. The ROC curve shows the trade-off between the true positive rate and the false positive rate. So let me go ahead and plot this, and this is what we get over here: on the y-axis we have the true positive rate and on the x-axis we have the false positive rate. You can understand this plot this way: the closer this curve is to the top left corner over here, the better the model; that is, this curve needs to cover a greater area. And this red line that you see basically represents a classifier which would give you around 50 percent accuracy, and the farther your model's curve is from this red line, the better your model is; that is, this blue line has to be towards the top left corner. Right, and this is how we can implement a logistic regression model with the help of sklearn. Consider yourself to be a data scientist at a prestigious telecom company, and the name of that company is Neo. They are facing a major problem, which is that their customers are churning out to other competitors. Now you, as a data scientist at that company, have to make sure to stop this churning out and also to find out the reasons why customers are churning out to other companies. So this is the problem statement, and what you're basically going to do is a bit of data manipulation and data visualization, and then you will go ahead and build the ML algorithms on top of this data set. You'll start off with the linear regression algorithm and then you'll build the logistic regression, decision tree and random forest algorithms. All right, and I will be working with this customer churn data set. So this is our data set, which comprises all of these columns: we've got customer ID, gender, senior citizen, partner, dependents and so on. So this is our first task, data manipulation, and to do this let's go ahead and import all of our libraries, so I'll just type in 
import pandas as pd, and then I'll also load the numpy library, so I'll type in import numpy as np. I'd also need the matplotlib library, so I'll type in from matplotlib import pyplot as plt. So these are all of my required libraries; I'll just wait till these libraries are loaded. Right, this is done. I'll also go ahead and load up my customer churn data frame and store it in an object named customer_churn. I'll just use pd.read_csv, and inside this I will give in the name of the file, which is customerchurn.csv. Right, so I have successfully loaded the file and stored it in a new object, and I've named that new object customer_churn. Now I'll go ahead and have a glance at the head of this, so customer_churn.head, and this is the data set on which we'll be implementing all of the operations. So this is just the unique customer ID. This column, gender, tells us whether the customer is male or female. The senior citizen column tells us whether the customer is a senior citizen or not: if it's zero then the customer is not a senior citizen, if it's one then the customer is a senior citizen. Then this tells us if the customer has a partner or not, and this tells us if the customer has dependents or not. This is the tenure of the customer in months, so one month, 34 months, two months and so on. This tells us if the customer has phone service or not, whether the customer has multiple lines, and this is the type of internet service used, and then whether the customer has online security, device protection, tech support, streaming TV, streaming movies and so on. Then this column is for contract, so this just tells you the contract type of the customer, which could be month-to-month, one year or two year. Then there's the type of billing, whether it is paperless or not. After this we have the type of payment method, so the type of payment method 
could be one of these: electronic check, mailed check, bank transfer and so on. And then these are the monthly charges and the total charges incurred by the customer. Right, so now let's start off with our data manipulation tasks. This is a very simple task: we just have to extract some individual columns from the entire data frame. We'll have to extract the fifth column and store it in customer_5, so let me do that. I'll type in the name of the data frame first, which is customer_churn, and I would use .iloc. I would need all of the rows and I would need the fifth column, and since the indexing starts from zero, so zero, one, two, three and four, index 4 would be my fifth column over here. I will store this in, let's say, c_5, and I'll have a glance at the head of this, c_5.head. Right, so we have successfully extracted the fifth column from this entire data frame and this is the head of it. Similarly, we'd have to extract the 15th column and store it in customer_15. If it's the 15th column then the index number would be 14, because again the indexing starts from zero. I'll store this in c_15, and I'll also make this to be c_15.
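The two extractions above can be sketched like this. It is a minimal illustration, assuming a toy frame: the column names follow the order described in the session (with `OnlineBackup` and the exact spellings being assumptions), and the cell values are placeholders rather than real customer data.

```python
import pandas as pd

# Assumed column order for the first 15 columns of the churn data.
columns = [
    "customerID", "gender", "SeniorCitizen", "Partner", "Dependents",
    "tenure", "PhoneService", "MultipleLines", "InternetService",
    "OnlineSecurity", "OnlineBackup", "DeviceProtection", "TechSupport",
    "StreamingTV", "StreamingMovies",
]

# Placeholder rows so the example is self-contained.
rows = [[f"{name}_{i}" for name in columns] for i in range(3)]
customer_churn = pd.DataFrame(rows, columns=columns)

# .iloc is position based: ":" selects all rows, the integer selects one column.
c_5 = customer_churn.iloc[:, 4]     # fifth column, index 4
c_15 = customer_churn.iloc[:, 14]   # fifteenth column, index 14

print(c_5.head())
print(c_15.head())
```

Because `.iloc` counts from zero, the n-th column always sits at index n minus 1, which is why the 15th column is picked with 14.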
I'll click on run. Right, so this is the streaming movies column, and I get the same thing over here: this is column number 15, and I've extracted only this particular column from the entire data frame. All right, this was a basic extraction of columns from the entire data frame. After that we'd have to do data extraction on the basis of a condition. We'll have to extract all the male senior citizens whose payment method is electronic check. So there are three conditions over here: the first condition is that the gender of the customer needs to be male, the second condition is that the value of senior citizen needs to be equal to one, and the third condition is that the payment method needs to be equal to electronic check. So I'll give in all of these three conditions over here. I'll start off by giving the first condition: customer_churn, I'll type in gender, and the gender needs to be equal to male. Right, I'll cut this and paste this inside of this. I'll use the and operator and then give in the second condition, which would go something like this: customer_churn, and after this we'd have to set the senior citizen value, so senior citizen needs to be equal to one. And then I'll go ahead and give in the third condition, which is for the payment method column, so the payment method needs to be equal to electronic check. All right, so I have all of my three conditions over here. I'll just paste the cell below. Now what I'll do is copy all of these three conditions and paste them inside this, and I will store the result in a new object and name that object c_random. Right, now I will print the head of this, c_random.head. If you have a glance at this gender column, you'll notice that all of the values are male. Similarly, if you have a glance at the senior citizen column, all of these values are one. And if I go to the payment method column, you'll notice that all of these values 
are electronic check. So I've given three conditions over here and all of these conditions have been satisfied. Next up we have to extract all of those customers whose tenure is greater than 70 or whose monthly charges are greater than 100. Okay, again we'll do the same thing: customer_churn, tenure needs to be greater than 70. I'll put this inside this. Now what you need to notice over here is that we are using the or operator, so it's either the first condition or the second condition; if one of these conditions is true then we'll get that particular record. So again customer_churn, and this time it's the monthly charges, so the monthly charges have to be greater than 100. These are the two conditions: either the tenure of the customer needs to be greater than 70 months or the monthly charges of the customer need to be greater than 100. Again I'll insert a cell below, and all I'll do is cut this and paste it inside this, and I will store this again in c_random. Now let me again print the head, c_random.head. Right, now I'll head on to the tenure column, and you see that none of the values of the tenure over here is greater than 70, but if I go to the monthly charges column then you'll notice that the monthly charges are greater than a hundred. So it's either-or: one of these conditions has to be true, and in this case we see that the second condition is true over here. So if either of the conditions is true, we'll get that entire record. After this we'd have to extract all those customers whose contract is of two years, payment method is mailed check and value of churn is yes. All right, I'll again copy this and paste this over here; let me delete this from this. Right, so we have these conditions over here: we'll have to extract those customers whose contract is of two years, payment method is mailed check and churn is equal to yes. So let me put in all of these conditions over here. First is contract, and this needs to be equal to two 
years, so let me just check how it is written: the contract value is two year, right, two, space, year. After that it's the second condition, and over here the payment method needs to be equal to mailed check; again I'll put in the double equal to operator and give in the value, which is mailed check. And then I'll give in the final condition over here, so customer_churn, and this time the churn needs to be equal to yes. Right, so I've given all of these three conditions and I've separated them using the and operator. Again I will store this in c_random; let me print this out, c_random. So we see that there are just three records, so there are just three customers who satisfy all of these three conditions. Let me have a look at the contract: it is of two years for all of these three customers. Next is the payment method, and again the payment method is mailed check, and for churn all of these values are yes. So out of all of those 7000-odd rows there are only three rows which satisfy these three conditions. Next, this is a question on random sampling: we just have to extract 333 random records from the entire data frame, and to do this we'll be using the sample function. So I'll type in customer_churn and I'll use the sample method over here, and inside this I'll give the number of records I want to sample, so I want 333 records. I'll store this in c_333; let me print out the head of it. Right, so now, every time I run this I'll get a different sample of 333 records. I'll run this, I'll run this again. If you have a glance over here, all of these values should be changing: the customer IDs would be changing, the indexes would be changing. So keep a glance at these row IDs, the indexes, over here. Again, if you have a glance at this, the row IDs are changing. So this is random sampling: I'm randomly sampling 333 records from this entire data frame over here, and I'm 
doing that with the help of the sample method. Right, and then this is the final operation when it comes to data manipulation: I'd have to get the count of the different levels from the churn column. If I want to get the count of the different levels present in a categorical column, I have the value_counts method. So first I'll give in the name of the data frame, which is customer_churn, after that I'll give in the name of the column, which is churn, and then I'll just type in value_counts. Right, so this is it; let me just wait till I get the result. So we see that for no, the number of customers who will not be churning out, the count is 5174, and the number of customers who will be churning out is 1869. You can do the same thing for other categorical columns as well. Let's say I want to get the counts of the different levels for the contract column: I'll just change the name of the column over here and put it to be equal to contract. Right, so there are 3875 customers whose contract type is month-to-month, there are 1695 customers whose contract is of two years and there are 1473 customers whose contract is of one year. So these were some basic data manipulation operations; after that we'll head on to data visualization. All right, so here we'll have to create a simple bar plot for the internet service column, and we have to set the x-axis label to categories of internet service, the y-axis label to count of categories, the title of the plot to distribution of internet service, and the color of the bars needs to be orange. So I'll just type in plt.bar over here. Now I'll insert another cell over here. This basically takes in two parameters: the first parameter is the names of all of the bars and the second parameter is the values for those bars. The names of the bars would be coming from the internet service column, so internet service, and I actually want the value counts of 
this, so value_counts, and from this I don't want the values, I just need the keys, and again I'll have to convert them into a list, so I'll use tolist over here. I'll click on run; let's see what we get. Right, so these are the three levels present in the internet service column. So from this internet service column what I've done is use the value_counts method. This value_counts result has two things, keys and values. Now I don't want the values as such, I just want the keys, and I'll take these keys and convert them into a list. So this is the list of the names present in this internet service column. Now I'll cut this and paste it over here, and this would be my first parameter. My second parameter would be all of the values, and if I want all of the values I'll just remove the keys method from there. Right, so this over here would give me all of the values with respect to this internet service column: the number of customers whose internet service is fiber optic is 3096, the number of customers whose internet service is DSL is 2421 and the number of customers who don't avail the internet service is 1526. Right, now again I'll cut this and paste it over here. Now let me print this out. So this is my basic bar plot over here: on the x-axis I have the names representing these bars. This bar is for all of the customers whose internet service is fiber optic, this is for those customers whose internet service is DSL and this is for those customers who are not availing the internet service, and these are their counts over here on the y-axis. Now I have to do some other things over here. I have to change the color of the bars, which was supposed to be set to orange, and for this I have the color parameter, and I'll just set it to be equal to orange. I'll run this again. Right, so we have successfully changed the color of these bars over here. Now I'd have to 
set the x-axis label and the y-axis label, so I'll just type in plt.xlabel, and the x label needs to be equal to categories of internet service; let me type that down, categories of internet service. After that I need to put in the label for the y-axis, so this would be plt.ylabel, and I'll just put it to be equal to count. And then finally I'd have to give in the title, so plt.title, and the title needs to be equal to distribution of internet service; let me type that out, distribution of internet service. I'll hit run. Right, so this is our final bar plot, which gives us the distribution of internet service; the x-axis label is categories of internet service and the y-axis label is count. So this is how we can create a simple bar plot. These are the basic steps before you go ahead and build all of your machine learning algorithms. The data pre-processing part is always the main part of your data science life cycle. This is where you properly comprehend your data set, this is where you understand the structure of the data set, and you understand the correlation between all of the columns, the correlation between the dependent variable and the independent variables. So manipulating the data set and visualizing its structure is where you understand all the patterns in the data set and get insights from it. Right, so next up we have to build a histogram for the tenure column. Again it'll be a similar operation: plt.hist, and since I have to build a histogram for the tenure column it will be customer_churn, then tenure. I'd have to set the number of bins to be equal to 30 and set the color to be equal to green, so I'll just type in color equals green over here. So this is our histogram, and this gives us the distribution of the tenure of the customers. If you look at it closely, this is basically 
the count over here. Right, so there are around 800-odd customers whose tenure is not even one month, so they are churning out before they even complete one month. And again there's a huge peak over here: there are more than 600 customers whose tenure is 70 months or more. The rest is pretty much the same. So this is the normal range of the count, between 200 and 400, and the average tenure of the customer, you can say, would be somewhere between 20 months and 50 or 60 months. And this is where you have the peaks: you have a peak at the start and a peak at the end. Again, let's go ahead and add the title, so it'll be just plt.title over here, and the title of the plot needs to be equal to distribution of tenure. I'll do that. All right, so I have created this plot and this is the title of the plot, which is distribution of tenure. So we've made a bar plot and we've made a histogram. Now you guys also need to understand the difference between a bar plot and a histogram. A bar plot is normally used for categorical columns, so whenever you want to understand the distribution of a categorical column, that is when you will go with a bar plot, and when you want to understand the distribution of a continuous numerical column, that is when you'll go with a histogram. Right, so next up we'd have to create a scatter plot between monthly charges and tenure, where tenure is on the x-axis and monthly charges is on the y-axis. So plt.scatter; the x-axis would be tenure, so customer_churn tenure, and then I'll set in the column for the y-axis, which will be customer_churn and then monthly charges. Let me just run this. Right, so this is what we get over here. Now let me also set in the labels for this, so plt.xlabel, and this would be equal to tenure; let me type in tenure over here. Similarly I'll also set the y label over here, so this will be plt.ylabel and this 
would be equal to monthly charges. Right, so now we also get the corresponding x-axis and y-axis labels. After this I'll also go ahead and set the title, so it'll be plt.title, and this would be monthly charges versus tenure. Right, so this is our final scatter plot, where we have the x-axis and y-axis labels, and this is the title, which is monthly charges versus tenure. And finally we also have to build a box plot between the tenure column and the contract column, where tenure needs to be on the y-axis and contract needs to be on the x-axis. For this I'll just type in customer_churn.boxplot, and now I'll set this to be equal to customer_churn contract, and after this I have the column to be equal to customer_churn and then tenure. Let's see what the error is over here: columns not found. So let me actually remove this from over here and let's see what happens. All right, so now we get the result. We had actually given the name of the data frame initially itself, since it was customer_churn.boxplot, and now all you have to do is assign contract on the x-axis. So when I set by equals contract, what is happening is that I'll have one box plot each for the different levels of the contract column: I have one box plot for the month-to-month level, another box plot for the one year level and another box plot for the two year level. And over here the y-axis is being determined by the tenure column, so this over here, zero to seventy, is the tenure of the customer. What we understand from this box plot is that if the contract of the customer is of two years, then most probably the median tenure of the customer is very high, around 65 months. Similarly, if the contract of the customer is of one year, then the median tenure of the customer would be around 45 months, and then if the contract of 
the customer is month-to-month, then the median tenure of the customer would be around 15-odd months. All right, so these were all of the examples of visualization; now it's finally time to head on to machine learning. So this was your data pre-processing part, where you understood the structure of the data, you learned how to extract individual columns, and after that you learned how to visualize the data and get some interesting insights from its structure. We'll start off with our first machine learning algorithm, which would be linear regression. With linear regression, as you already know, the dependent variable would be a numerical column, and you're basically trying to understand how one variable changes with respect to another variable. Over here we have to build a simple linear model where our dependent variable is monthly charges and the independent variable is tenure; in other words, we are trying to understand how monthly charges vary with respect to tenure. So monthly charges is the dependent variable and tenure is the independent variable, and these are all of the sub-tasks when it comes to this linear model: we'll start off by dividing the data set in a 70:30 split, then we'll build the model on the train set and predict the values on the test set, and after that we'd have to find out the root mean square error and print it out. So let me go ahead and import the linear regression model from sklearn. I'll type in from sklearn import linear_model; after this I'll type in from sklearn.linear_model import LinearRegression. Right, so these are my two basic imports: I needed linear_model and I needed LinearRegression. Now I would also require train_test_split, so I'll type in from sklearn.model_selection import train_test_split. This train_test_split method would help me to divide my data set into training and testing 
sets. So now it's time to divide my data into training and testing sets. Before that I'd have to get my target and the features, or in other words I have to separate my dependent variable and the independent variable. Monthly charges is the dependent variable, so what I'll do is y equals, and I'll extract only the monthly charges column and store it in a new variable named y. Similarly I'll extract only the tenure column. Right, so I'm extracting the monthly charges column and storing it in a new object named y; similarly I'm extracting only the tenure column and storing that column in x. What seems to be the problem over here? Monthly charges: let me make the C capital over here. Right, now let me print out the head of these two, y.head and x.head. So these are the values from the monthly charges column and these are the values from the tenure column. Now let me go ahead and divide these two into training and testing sets, so I'll use train_test_split. I'll pass in x as the first parameter, so all the features, which are stored in x, go as the first parameter. After that I'd have to give in the target labels, which are my monthly charges, stored in y. And then finally I'd have to give in the test size, so let me check what the test size was. The test size was supposed to be 0.70, and then I'll also set a random state, so if I want to get these same values again I can just set the random state to be equal to the same value which I'm giving over here; this is a small s over here. Right, so this test size 0.70 basically means that 30 percent of the records would go into the testing set. Oh, this has to be 0.30, sorry for that. Yeah, so 30 percent of the records would go into the test set and the remaining seventy percent of the records would go into the training set. Now I'll be getting four results over here, and those four 
results are x_train, y_train, x_test and y_test. These are the names which we conventionally use, and I'll explain what these are exactly. Actually, x_test would come first here, so the order is x_train, x_test, y_train and y_test. Over here x_train represents all of the feature values which are present in the training set, x_test represents all of the feature values which are present in the test set, y_train represents all of the dependent values which are present in the train set, and y_test represents all of the dependent values which are present in the test set. Whenever we are building a model, we'll build that model on top of the train set, so on top of x_train and y_train. Let me also show you the shape of all of these, so x_train.shape, and let me do the same for the rest: I'll make this y_train, I'll make this x_test and I'll make this y_test. So the training set has these many records and the testing set has these many records over here. These are the feature values, that is, the values of your independent variable, in the training set and in the test set, and these are the target values, that is, the values of your dependent variable, in the training set and in the test set. All right, now that we have x_train, y_train, x_test and y_test, let's go ahead and build the model on top of the training data. Now, normally your training data would be the bigger part, so let's say your splits are 70:30, 65:35 or 80:20, because the more data you give for training, the better. But then again, you can't give your entire data to the training set. So the purpose of training
your model is to make sure that your model learns the underlying patterns of the data, and once the learning is done you'd also have to test how well the learning has gone, and for that you need a separate sample, which is the test set. Consider a simple case: let's say you're giving an exam, and your syllabus comprises 100 exercises, so you'd have to learn all of those 100 exercises, but the test will have only 10 exercises out of those 100. So the training needs to be done, but then again the test data needs to be completely new, not seen by the model during the training phase. That is why the training set and the test set have to be completely different. This division into training and testing sets is done so that overfitting doesn't go unnoticed: when overfitting happens, the model will perform well on this particular data set, but when a new data set comes in it'll fail miserably. So this is the reason why we divide the data into training and testing sets. Now let me go ahead and create an instance of the linear regression model, and I'll name it regressor. So I've created an instance of LinearRegression over here, and I'll go ahead and fit the model on top of the training set, so it'll be x_train and y_train. I've fit the model on top of the training set; now it's time to predict the values. It'll be regressor.predict, and I'll be predicting the values on top of x_test, and I will store this in, let's say, y_pred. Now I've fit the model on the training set and I've also predicted the values on the test set. Next I'd have to know how well the prediction has been done, and for this, when it comes to linear regression, we have something known as the root mean square error. So the lower the value of the root mean square error, the
better your model. Again, we have an inbuilt method to calculate this, so from sklearn.metrics I'd be importing mean_squared_error. Now, this is the mean squared error, and what we actually want is the root mean squared error, so I would need np.sqrt, and let me apply that to the mean squared error. mean_squared_error takes in two parameters: the first parameter comprises the actual values, which are present in y_test, and the second parameter is the predicted values, which are present in y_pred. So I will cut this and put it in over here. So we get a root mean squared error value of 29.39. Now let's say we build some other model with some other independent variables: we'll go ahead and build the model, predict the values, and after predicting the values we'll also calculate the root mean square error. This root mean square error is actually relative. Let's say there is some other model, model 2, and the root mean square error of model 2 is, let's say, 39; then this model would be better than the second model. Similarly, if there's a model 3 whose root mean square error is 19, then model 3 would be better than this model which you've built over here. So here we are predicting the monthly charges values, yes, that is exactly right. Let me actually show you that: I'll put in y_pred and have a glance at the first five values. So these are the monthly charges predicted. Let me also show you the first five values of y_test. So these are the actual values and these are the predicted values over here. Now, this is not exactly dependent on the churn of the customer. What we're doing is building an entire data science life cycle process, and over here we are trying to understand the relationship between the tenure of the customer and the monthly charges of the customer. So here what we're basically trying to understand is
let's say the tenure of the customer is around 10 months: what would be his monthly charges? Again, if the tenure of the customer is 30 months, what would be his monthly charges? Similarly, if the tenure of the customer is 70 months, what would be his monthly charges? So this is what we are trying to understand over here.

So this time we have to build a logistic regression model where the dependent variable is churn and the independent variable is monthly charges. Again I'll do the same thing, so let me actually copy these two. This is what we did over here already, right? This is where we went ahead and predicted the values, didn't we? I divided my data set into training and testing sets, and over here I am fitting the model only on the train set. So training is happening on the training set, or the model is learning from the train set, and we are predicting the values on the test set. This linear regression model has not yet seen any of the records present in the test set; it has only learnt from the values which are present in the training set. So we'll go ahead and build the logistic regression model now. customer_churn... let me also put in y, and I'll do the same thing. In the logistic regression model our independent variable is monthly charges, so let me put MonthlyCharges over here, and the dependent variable is Churn. So I have got my features and my target over here: the feature is my monthly charges and my target is present in the Churn column, and I'd want to understand whether the customer would churn or not on the basis of the monthly charges of the customer. So I've done this, and the rest of the process would be the same. I'll go ahead and divide this data frame into train and test. It's a 65:35 ratio this time, so I'll set the test size to 0.35. So this time 35 percent of the records would be present in the test set and the rest, 65 percent of the records, would be
present in the train set, and I am storing all of those into x_train, x_test, y_train and y_test. After this I'd have to import the logistic regression model, so from sklearn.linear_model I'd import LogisticRegression, and I'd have to create an instance of it, which I'll name log_model. All right, so I have created an instance of this, and I'll go ahead and fit the model on top of the training set. The entire process when it comes to implementing this is pretty much the same. Python makes it extremely easy, the sklearn library makes it extremely easy: all you have to do is take in the data, find out your independent variables and your dependent variable, then divide the target and features into training and testing sets, build the model on top of the train set, predict the values on top of the test set, and then find out the metrics. For classification it's the confusion matrix and your normal accuracy score, and for linear regression it could be root mean square error, mean square error or mean absolute error. So log_model.fit, and I will fit this model on top of the training set again, so x_train and y_train. So I have fit the model. Yes, we can also have multiple independent variables. If we have a single independent variable, then that's a simple model: when it comes to simple linear regression we have a single independent variable, and when it comes to multiple linear regression you'll have multiple independent variables. This basically means that we are trying to understand how our dependent variable changes with respect to multiple independent variables. Let's again consider this equation, since the linear model is based on it. Let me actually open Notepad over here. So y equals, let's say, x1 plus x2 plus x3 plus x4. What happens in a simple linear regression is you just have one independent
variable, which would be x1. What happens in multiple linear regression is you have multiple independent variables, which are x1, x2, x3, x4 and so on, and you're trying to understand how y varies with respect to all of these independent variables. That's pretty much it. So let's proceed with this. We have built the model on top of the training set; now let's go ahead and predict the values on top of the test set, so it'll be log_model.predict. So, do we change the value of x here? No, we are not changing the value of x; we have multiple x values over here. So this becomes y = m1 x1 + m2 x2 + m3 x3 + m4 x4 and so on: you have multiple independent variables and you are trying to understand how y varies with those multiple independent variables. So log_model.predict, and I want to predict my values on top of the test set, and I'll store the result again in y_pred. So we have predicted the values now. When it comes to a classification model we can use the confusion matrix, so let me import it: from sklearn.metrics I will import confusion_matrix and also accuracy_score. Now let me find out both of these, and I'll pass in the actual values and the predicted values: the actual values are present in y_test and the predicted values are present in y_pred. Similarly for accuracy_score I'll do the same thing, y_test and then y_pred. So this is our confusion matrix and this is the accuracy. To get the accuracy, what you'll basically do is divide the left diagonal by the sum of all of the values. The left diagonal represents all of your correctly predicted classifications: this part over here is all of your true positives and this is all of your true negatives. When you divide this by the entire sample space, that is when you will get the accuracy. So let me do that: that'll be 1815 divided by 1815
plus 651, and I get the same accuracy, which is 73.6 percent. So I get an accuracy of about 73 percent for the model which I've built.

Now I have to build a multiple logistic regression model, where the dependent variable is the same and the independent variables are tenure and monthly charges, so now we have two independent variables. I'll make the changes here itself. So, coming to the question which you were asking, what if we had multiple independent variables: over here we have two independent variables, monthly charges and tenure, and I am trying to understand whether the customer would churn or not on the basis of these two columns. So x now would comprise these two features and y is the same, and the ratio is 80:20, so I'll change the test size to 0.20 over here. I'll go ahead and fit the model, that's the same; again I'll predict the values, that's the same. So I've built the model and predicted the values, and the only difference over here is that I have two independent variables this time instead of a single independent variable. After predicting the values, I'll import confusion_matrix and accuracy_score. Now over here again I'm calculating the confusion matrix, and this time I get an accuracy of 77.50 percent. So this time it'll be 935 plus 157, divided by 935 plus 157 plus 106 plus 211, and we get an accuracy of 77.50 percent. This left diagonal holds all of the values which have been correctly classified, and the right diagonal represents all of those values which have been misclassified.

So this was logistic regression, and then we've got two more machine learning algorithms left, which are decision tree and random forest, so let's build these two. For the decision tree, the dependent variable is the same, which is churn, and the independent variable is tenure. So let me manually do this: x comes from customer_churn, and the independent
variable is tenure. Let me also extract the dependent variable, which would be Churn. You have to make sure that you get the spellings correct, and you also have to take care of the small and capital letters over here. Now I will also import the decision tree classifier, so from sklearn.tree I'll be importing DecisionTreeClassifier. Now I'll go ahead and divide this data frame into train and test. This will be the same process, so let me copy this and paste it over here. The split is 80:20, which is what we are doing over here. So this is our feature, these are our target labels, and we have divided the data set into training and testing sets. Now let me go ahead and create an instance of this decision tree classifier, and I'll name this instance, let's say, my_tree. I'll go ahead and fit the model on top of the training set, so my_tree.fit, and this takes two parameters, which are x_train and y_train. I've fit the model; now it's time to predict the values, so my_tree.predict, and I'll be predicting on top of x_test. Now I'll import the metrics, so let me calculate the confusion matrix and also the accuracy score. confusion_matrix... I'll actually have to store the predictions in an object first, so again I'll be storing this into y_pred. Let me run the cell again. This again takes in two parameters: the first parameter is all of the actual values, which are present in y_test, and the second is all of the predicted values, which are present in y_pred. So this is our confusion matrix. Now let me calculate the accuracy, which would be 965 plus 87, divided by 965 plus 87 plus 281 plus 76. So for this decision tree model we get an accuracy of about 74 percent. So the process is entirely the same, guys. At the risk of being redundant, what we are basically doing is finding out our features and target variable, after that we are dividing the features and target into training and testing splits, and then we'll go ahead and
build the model on top of the train data, and then we'll predict the values on top of the test set. Once the prediction is done we'll calculate the accuracy to find out how well our model has learned. In the entire data science life cycle, our most important part is the data processing part. As for the model building, it depends on your problem statement and what exactly you're trying to find out. Let's say your dependent variable is a continuous numerical variable: then you'll go with linear regression. When it comes to logistic regression, in its basic form it is a binary classifier, so if you have just two labels in your dependent variable, like this case over here, this is where you will go with logistic regression. But then again, if you have multiple categories, or if it's a multi-class classification problem, then you'll most probably go with decision tree or random forest. And when you compare decision tree and random forest, random forest usually does better than a single decision tree, because random forest is an ensemble model. Through your course you would have learned that a random forest is nothing but an ensemble of decision trees, and the prediction given by a random forest is generally better than that of a single decision tree. So as a rule of thumb, whatever accuracy a decision tree gives, you can expect a random forest to match it or beat it. Just to verify that, let me actually go ahead and build a random forest model on this same data over here, so we'll take the same x and y values. No, again, the problem statement was different: linear regression we used to
understand how monthly charges vary with tenure, but the other algorithms which we've built were to understand how churn varies with other factors. Over there monthly charges was a numerical column, but churn is a categorical column, so we can compare logistic regression and decision tree. If we compare logistic regression and decision tree, logistic regression gives us an accuracy of 77 percent. Yes, logistic regression has given us the best accuracy till now. So now finally we'll also go ahead and build an ensemble model, which is random forest, and let's actually compare the accuracy given by the decision tree and the random forest. From sklearn.ensemble I'll import RandomForestClassifier and create an instance of it; maybe I'll name this rf. I'll go ahead and fit the model on top of x_train and y_train. I've fit the model; now it's finally time to predict the values, so rf.predict, and I am predicting the values on top of x_test. Let me build the confusion matrix over here. All right, now let me check the accuracy, so I'll just type in accuracy_score and pass in y_test and y_pred. So you get an accuracy of 74.66 percent here, and it was 74.66 percent there as well, so there's not much of a difference when you compare decision tree and random forest over here; we've got the same values. But in general, if you're building a random forest model, you'll usually get either a better accuracy or about the same accuracy. So this is your normal entire data science life cycle: you'll start off with the data pre-processing and data exploration part, where you will understand the structure of the data set, visualize the data set and understand whatever is happening underneath it. After that, once you understand and comprehend your data properly, that is when you'll go ahead and build your
model. Again, when it comes to building your model you will follow more or less the same procedure: you'll find out your independent variables and the dependent variable, then you'll divide those two into your training and testing sets, and you'll build the model on top of the training set.

Now let's actually start off by having a glance at the job trends of different programming languages. Here we are comparing Python, R, Angular and C, and it's very obvious that Python is the most preferred language across various industries. You see this blue line over here: this blue line is for the Python programming language, and if you closely observe it you will see that the popularity of Python has been steadily increasing over the years. So now that we know how popular Python is, let's head on to our interview questions.

This is our first question: what are keywords in Python? You can consider these keywords to be special reserved words which exist for a specific purpose. You cannot use a keyword as a variable name or as any other identifier. These are some of the keywords which exist in Python, such as True, False, not, continue and so on, and in total there are 35 keywords in Python 3.7 (earlier Python 3 releases had 33). You also need to keep in mind that these keywords are case sensitive; that is, if you look at the keyword True over here, you see that the T needs to be capital.

Our next question is: what are literals in Python? And then we'd have to explain the different types of literals. Literals are the constants used in Python, or in other words, the fixed data values which are stored in a variable, and there are four types of literals in Python: string literals, numeric literals, boolean literals and special literals. Let's look at string literals. You can create string literals by just enclosing the text within quotes. Here we have created two string literals, John and James, and we see that John is enclosed in double quotes and James is enclosed
within single quotes. So this is how you can create string literals. Next up we have numeric literals, which comprise all of your digits. If your numeric literal doesn't contain a decimal point, then it is of the integer type. In Python 2 a very long integer would be of the long type, but Python 3 has just the one int type, which can be arbitrarily large. If your literal contains a decimal point then it would be a float, and finally we have complex numbers, which consist of a real part and an imaginary part. Going ahead, we have boolean literals, which comprise just the True and False values; they are generally used when we are dealing with some condition whose outcome is either true or false. Now we'll head on to the special literal. Python has this special literal called None, and it is used to specify a field that has not been created. Here in this example I have assigned None to this variable, so the variable would basically be empty.

So this is our third question: what is a dictionary in Python? And we also have to create a dictionary where the key is fruit name and there are four fruit names as values. A dictionary is a collection of elements, traditionally described as unordered (though since Python 3.7 it preserves insertion order), and these elements are stored as key-value pairs. For example, here we have a dictionary named my_dictionary with three key-value pairs: our first key is 1 and the value for this key is John, the second key is 2 and the value for this key is Bob, and then we have the final key, which is 3, and the value for this is Alice. Now let's head to Jupyter Notebook and create our own dictionary where the key is fruit name and the values would be four fruit names. I'll name the dictionary my_dictionary, and we can create a dictionary with the help of these curly braces over here. I'll give in the key name, which would be fruit name, and after this I'd have to give in the values of the four fruit names. Let's say my first fruit is apple, then the second fruit is mango, the third fruit would be orange and the
fourth fruit would be guava. Now all I have to do is hit run, and then let me print this out, so I'll just type in my_dictionary over here. So we have created a dictionary named my_dictionary where the key is fruit name and the values are apple, mango, orange and guava. If you want to extract the keys and the values individually, this is how we can do it: I'll type in the name of the dictionary, which is my_dictionary, put in a dot, and then just type in keys. I'll click on run, and we see that the key for this dictionary is fruit name. Similarly, if I want all of the values, I just have to type in values over here. Now let me click on run: for this dictionary the values are apple, mango, orange and guava.

Our next question: what are classes and objects in Python? Simply put, you can consider a class to be a blueprint and objects to be real-world entities which are defined and created from classes. For example, over here you can see the blueprint of a house. This blueprint can be used for the rapid creation of an unlimited number of copies, and these copies made from the blueprint are nothing but your objects. Here we have created three houses, or in other words three objects, from the original blueprint, which is our class. So I'm repeating it: our class is nothing but a blueprint, and from this blueprint we'll be creating a set of objects, which are our real-world entities. Over here house number one, house number two and house number three are real-world entities, which are nothing but objects created from this class. Next we have to create a simple class named Human which would give out the name and age of the person, so let's do this. To create a class I'd have to use the keyword class, and then I will give in the name of the class, which would be Human. Inside this I will create two variables: the first variable would be name, and initially I'll just assign None to it, and my second variable would be age, and again I'll assign None
to it. So we have created our variables. Now we'd have to get the value of name and the value of age from the user, and for this I'd have to create some definitions, or methods. To create a method I'll use def, and after this I will give in the name of the method, which would be get_name, and inside this I will pass in self. Now over here I'll just use the print function and type in "Enter your name", and after this I will get the name and store it in self.name. So self.name basically means that whatever value the user enters would be stored in this name variable of the object. And I'll just type in input over here; this input function is used so that we can get a value from the keyboard. So this is how we can get the name of the person. Similarly, let's go ahead and get the age of the person. I'll type in def and name this method get_age, and again I'll pass in self. This time I'll use print with "Enter your age", and then I'll store the value with self.age equals input, again. So we have created two get methods, with which we'll be getting the name and the age of the user. After this we'd have to print out the name and the age, so it'll be def, and I'll name the method put_name; again this will take self, and I'll just print out the user's name, so "Your name is" and self.name. Similarly I'll create another method, which would be put_age, and I'll pass in self. For this it would be print of "Your age is"; let me again put in double quotes over here. So "Your age is", and it would be self.age. So we have created our class, where we have two variables, name and age, and I've created four methods: with two of those methods I'll be getting the name and age of the person, and the other two methods would help me to print the name and age of the person. I'll click on run. So we have created this class. Now, once we create our class, which is basically our blueprint, we'd have to create the
objects from this class. Let's say I create an object and name that object person1. This would be our first instance, and I just have to call Human, so I'm creating a person1 instance of the Human class. Now from this person1 object I will invoke the get_name and get_age methods. So person1.get_name, let me click on run, and I'd have to enter the name of the person; let's say the name of the person is Sam. Now it's time to get the age of the person, so I'll type in person1.get_age, click on run, and let's say the age of the person is 28. So now I have successfully fed in the name of the person and the age of the person. Now it's time to print out both of those, so it'll be person1.put_age and person1.put_name. So "Your age is 28" and "Your name is Sam". So we have successfully created our class, which prints out the name and the age of the person.

Next question: what do you understand by the init method in Python? And after that we have to give an example of it. You can consider the init method to be a sort of constructor in Python: it is a special method in a Python class which is used to initialize the variables. Now that we've understood what the init method is and what it is used for, let's go ahead and work with it. Here what I'm going to do is create a Student class with the init method in it. So let me do that: class Student, and over here I'll create the init method, and these are the parameters of this init method: self, and after this it will take in the name of the person, after name it will take in the age of the person, and after this it will take in the branch of the person. Now what I have to do is just store these values inside the object's variables, so that'll be self.name equals name, then our next variable is age, so self.age equals age, and then we have the final variable, which is branch, so it would be self.branch
equals branch. So we have created our init method. Now, wait, we actually have to put in def over here, because this is a method. So we have created our init method, which is basically a constructor, and going ahead we'll create another method which would help us to print all of these values: def print_student, and it'll take in self, and all I do over here is print all of these values. So print of "Name" would be self.name; after this we have age, so let me change this to "Age" over here, and this would be self.age; after this we'll have to print the branch, so let me put in "Branch" over here, and this would be self.branch. So we have created the Student class, which has the constructor. I'll click on run. Now let me create an object of the Student class; I'll name the object student1 and create an instance. Since we have this constructor over here, I can go ahead and initialize this student object here itself. I'll give in the name of the student, so let's say the name of the student is Bob; after this, the age of the student is 12; and then which branch is he studying? Let's say this guy is studying engineering, so let me put in engineering over here. Run. Now student1, dot, and I will call the print_student method over here. So we have successfully created this instance, student1.

So what do you understand by inheritance in Python? And then we'd have to give an example of it. Inheritance refers to one class acquiring the properties of another class. For example, let's say you have inherited your features or properties from your parents. If you look at this family tree over here, you can understand that traits such as hair color and poor eyesight are passed from one generation to the next. Over here we have generation one, generation two and the grandparents. So your parents are inheriting their traits or features from your grandparents, and you are
inheriting your traits from your parents. Now let's go ahead to Jupyter Notebook and work with an example of inheritance. Here we have a base class named Fruit, and this base class is being inherited by another class, Citrus. This is our base class Fruit, which has a constructor, and this constructor just prints out "I am a fruit". After this we are creating another class named Citrus, and this class inherits from the Fruit class. If a class has to inherit from another class in Python, we'll just give the name of the base class in parentheses after the name of our new class. Again, inside the Citrus class we have created another constructor, and inside the constructor of the Citrus class I am using the super method. With the help of super I can invoke the variables and functions of my superclass. Over here I want the init method of my superclass, so I'll use super and invoke the init method from the Fruit class. Apart from this init method from the Fruit class I would also print something else, so over here I am also printing "I'm citrus". And I'll create an instance of this class Citrus and store it as lemon. The result is "I'm a fruit", "I'm citrus": the value "I'm a fruit" is coming from the superclass and the value "I'm citrus" is coming from our new class, the child class. So this is how we can do single-level inheritance in Python.

Next: what is NumPy, and how can you create a basic 1D and 2D NumPy array? Well, NumPy is the most widely used Python library for linear algebra, and it is used for performing mathematical and logical operations on arrays. To import the NumPy library in Python you just have to use the command import numpy. So again let's head to Jupyter Notebook and create a 1D NumPy array and a 2D NumPy array. I'll start off by typing import numpy as np. After this I'll create my 1D NumPy array, so a equals np.array. Now I will
give in a list of values so this is my 1d array a and then let me just print it out so this is my 1d array which comprises the values 1 2 and 3. now i’ll go ahead and create a 2d array with the name b so again the syntax would be the same np dot array but this time it’ll comprise a list of lists so the first list would comprise one two and three and the second list would comprise four five and six let me print this out right so this is our 2d array which comprises one two three four five and six
so this time we’d have to initialize a five cross five numpy array comprising all zeros so a five cross five numpy array means there need to be five rows and five columns and all of the values need to be zero and to initialize an array with all zeros we can just use the np dot zeros function from the numpy library let me just type in import numpy as np and inside a i’ll just store np dot zeros and i’d have to give in the dimensions of this array so the dimensions of this array are 5 cross 5 let me just print out a so we have successfully created our five cross five numpy array where all of the values are just zeros
now let’s say we have two numpy arrays like this so this is our first numpy array and this is our second numpy array the first numpy array comprises one two and three and the second numpy array comprises four five and six now i’d have to add the individual elements so four plus one needs to become five five plus two needs to become seven and six plus three needs to become 9.
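The array-creation steps narrated so far can be sketched as follows. This is a minimal sketch, assuming NumPy is installed; `a` and `b` follow the narration, while the name `zeros` is chosen here only to keep the three arrays apart.

```python
import numpy as np

# 1D array built from a flat list
a = np.array([1, 2, 3])

# 2D array built from a list of lists (2 rows, 3 columns)
b = np.array([[1, 2, 3], [4, 5, 6]])

# 5 x 5 array filled with zeros; the shape is passed as a tuple
zeros = np.zeros((5, 5))

print(a)
print(b)
print(zeros)
```

Note that `np.zeros` takes the shape as a single tuple argument, so the 5 cross 5 case is written `np.zeros((5, 5))`.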
again our first step would be very simple i’d have to load the numpy library import numpy as np i will go ahead and create my first numpy array a equals np dot array and the values are one two and three similarly i’ll create my second numpy array which is b i’ll change this variable to be equal to b and the values are four five and six now to add the individual elements i’d have to use the np dot sum function now inside this i’d have to give in my first parameter so my first parameter would be the numpy arrays which i’d have to add so i want to add a and b so i’ll pass them in together as a comma b and i will set the axis to be equal to 0. so when i set the axis value to be equal to 0 this would add the elements position by position so this will do four plus one five plus two and six plus three right and this is what we have four plus one is five five plus two is seven and six plus three is nine now let me actually change the axis value to be one and let’s see what we get right so when i change the axis value to be 1 then the addition happens along each row so when you do 3 plus 2 plus 1 you get 6 and when you do 6 plus 4 plus 5 you get 15.
so now we have to get the n largest values from a numpy array so this is our numpy array over here and this comprises one two three four five six seven elements 12 43 2 100 54 5 and 68 and i’d have to get the first two largest values which over here are 100 and 68.
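The addition walked through above can be sketched like this. It is a minimal sketch, assuming NumPy is installed; the names `col_sums`, `row_sums` and `elementwise` are illustrative and do not come from the session.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# The tuple (a, b) is treated as a 2 x 3 array; axis=0 sums down each column
col_sums = np.sum((a, b), axis=0)

# axis=1 sums within each row, so each input array is totalled on its own
row_sums = np.sum((a, b), axis=1)

# For comparison, the + operator also adds element by element
elementwise = a + b

print(col_sums)     # [5 7 9]
print(row_sums)     # [ 6 15]
print(elementwise)  # [5 7 9]
```

For plain element-wise addition of two equal-length arrays, `a + b` is the more common spelling; `np.sum` with an axis is mainly useful when the number of stacked arrays varies.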
now let’s go to jupyter notebook and let’s see how we can do this right so again we’ll start by importing the numpy library import numpy as np and i’ll create this array x is equal to np dot array and i’ll give in all of these values now to get the indices of values which are arranged in ascending order we can use the np dot argsort function now what i’ll do is i’ll actually insert a cell below and i’ll go ahead i’ll do control x i’ll do control v so we have successfully created our numpy array let me just print x over here so this is our numpy array now what i’ll do is i’ll just type np dot argsort of x and i will print it over here and let’s see what we get so what we get over here are index values so first we have 2 right 0 1 2 this is our lowest value and then we have 5 so the element at index number 5 0 1 2 3 4 5 right so 2 then 5 and then we have index 0 which is 12 and then index 1 which is 43 right so we basically have the indices of the values arranged in ascending order so 2 5 12 43 54 68 and 100 so this is how it goes now if i want the indices of the two highest values then i’ll just put in minus two colon in square brackets over here and i have six and three right zero one two three four five six so 68 is the second highest value and then we have three zero one two three right so 100 is the highest value now what i want to do is i want to arrange this in descending order so to arrange this in descending order again i’ll put in square brackets over here i’ll put a double colon and i’ll put in -1 over here and i have sorted the indices in descending order now if i take all of this i’ll cut this and i will paste this inside the square brackets of x i’ll click on run and this is how i get the two highest values from this numpy array so 100 and 68 are the two highest values from this numpy array
so now we have to give some examples for creating a data frame from a list and a dictionary so this is a very common and easy question which is asked in most python interviews right so first we’d have to go ahead and create a
list and we’d have to convert that list into a data frame similarly we’d have to create a simple dictionary and convert that dictionary into a data frame so i’ll type in import pandas as pd so i’d have to start off by importing the pandas library now i will go ahead and create a list so i’ll name this list l1 so l1 equals 1 comma 2 comma 3 comma 4 comma 5 so we have created our list now to convert this list into a data frame all i have to do is use the pd dot data frame function so here we need to keep in mind that the d and the f are capital right pd dot data frame and i will pass in l1 inside this and i will store this in let’s say data 1 and i’ll just print out data 1 over here so we have successfully created a data frame from a list where the list values are one two three four and five now similarly we will create a dictionary and create a data frame out of the dictionary so i will name this dictionary dt1 and the first key is fruit name and the values are apple mango and let’s say orange after this we’d have to create our second key value pair so our second key value pair would be count and the values would be let’s say 12 24 and 36 right so we have created this dictionary now again to convert this dictionary into a data frame we’d have to use pd dot data frame and i will pass in dt1 inside this right so we have created this data frame where our first column is fruit name and the second column is count and fruit name comprises apple mango and orange and the count of the fruits is 12 24 and 36
so now we have this iris data set which comprises all of these columns so we have sepal length sepal width petal length petal width and species now out of this we’d have to extract some specified rows based on a condition so you’d have to extract only those records where the sepal length value is greater than 5 and the sepal width value is greater than 3.
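The two data frame constructions described above can be sketched as below. This is a minimal sketch, assuming pandas is installed; `l1`, `data1` and `dt1` follow the narration, while `data2` and the exact key spelling `fruit_name` are assumptions made for illustration.

```python
import pandas as pd

# Data frame from a plain list: one unnamed column (labelled 0), five rows
l1 = [1, 2, 3, 4, 5]
data1 = pd.DataFrame(l1)

# Data frame from a dictionary: each key becomes a column name,
# each list of values becomes that column's contents
dt1 = {
    "fruit_name": ["apple", "mango", "orange"],  # key spelling is assumed
    "count": [12, 24, 36],
}
data2 = pd.DataFrame(dt1)

print(data1)
print(data2)
```

All lists in the dictionary need the same length, otherwise `pd.DataFrame` raises an error.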
so i’ll start off by loading the pandas library import pandas as pd and after this to load a csv file i’d have to use the pd dot read underscore csv function and i’ll give in the name of the file so the name of the file would be iris.csv and i’ll just store this in a new object and name that object iris now let me have a glance at the head of this data set iris dot head so this will give me the first few records right sepal length sepal width petal length petal width and species now let’s see how we can extract only those records where sepal length is greater than five and sepal width is greater than three right so i’ll start off by giving the name of the data frame and i’ll put in these square brackets over here and i’ll give in the first condition so the first condition is again iris and from this i’d have to select the sepal length column and keep only those rows where it is greater than five so sepal dot length this value needs to be greater than five so i’ve given my first condition after this i’ll go ahead and give my second condition and this time i’d have to extract only those records where sepal dot width let me type this out and this needs to be greater than three now let me also put this inside these square brackets over here let me cut this and let me paste it over here i’ll click on run right so this is our list of records where sepal length is greater than 5 and sepal width is greater than 3. if i scroll down you’ll see only those records where sepal length is greater than 5 and sepal width is greater than 3. so what we do is we give in the first condition where sepal length is greater than 5 after that we’ll use the and operator which is the ampersand symbol and then we give the second condition which is sepal width is greater than 3.
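The two-condition filter described above can be sketched as follows. This is a minimal sketch, assuming pandas is installed; the five rows are hypothetical stand-ins for the contents of iris.csv, and the column names `Sepal.Length` and `Sepal.Width` are assumed from the way they are read out in the session.

```python
import pandas as pd

# Hypothetical stand-in rows; the session loads the real data from iris.csv
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9, 5.4, 6.3, 5.8],
    "Sepal.Width": [3.5, 3.0, 3.9, 2.5, 3.1],
})

# Each comparison yields a boolean Series; & combines them row by row.
# The parentheses are required because & binds more tightly than >.
subset = iris[(iris["Sepal.Length"] > 5) & (iris["Sepal.Width"] > 3)]

print(subset)
```

Python's `and` keyword does not work here, since it cannot combine two Series element by element; `&` is the element-wise form.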
right so after this we again have this iris data frame over here and we’d have to introduce nan values or null values in the first 10 rows of the sepal width and petal width columns so if you see this original data frame over here you see that these two columns comprise some values but then again we’d have to fill the first 10 rows of these two columns with nan values so i’ll start off by loading the required packages which would be pandas and numpy so import pandas as pd and import numpy as np after this i’ll load the iris data set so iris equal to pd dot read underscore csv and i will pass in the name of the data set which is iris dot csv let me again have a glance at the head of this data set so iris.head all right so now to introduce nan values i can use the np.nan constant so now what i’ll do is i’ll actually create a duplicate copy of this object so iris one and i will store this data frame into iris one now i’ll type in iris dot iloc and i would want to make changes in the first 10 rows and the second column right so the first ten rows and the second column so the index of the second column would be one and what i’ll do is i will assign nan to all of these values so np dot nan let me put an equal to over here now let me have a glance at the head of this modified data set so iris one dot head right so we see that this is our original data frame and with the help of np dot nan i have introduced nan values but actually starting from index one so let me actually change the start to be zero over here i’ll click on run so now i have nan values for the first 10 records similarly i’ll also go ahead and introduce nan values in the petal length column so over here i just have to change the index of the column which would be two let me have a glance at the head iris.head
right so now we’d have to get the number of nan values present in each column of this nhanes data frame so this is our data frame over here which comprises these columns so we have age bmi hyp and chl
and we see that these three columns over here comprise nan values and we’d have to find the count of the nan values present in each of these columns so i’ll start off by loading the pandas library i’ll type in import pandas as pd let me just wait till the package is loaded right so now that we’ve loaded the package i’d have to load the csv file which is nhanes.csv and for this i’ll be using the pd dot read underscore csv function and i’ll give in the name of the file which is nhanes dot csv and i’ll store this in a new object and name that object nhanes now if i want to get the count of nan values this is what i’d have to do nhanes dot so i have this isna function and after this i will just print it out let’s see what we get right so we just get a bunch of true or false labels so wherever there is a nan value present we have a true label so over here in the bmi column the first record this is a nan value this again is a nan value so wherever you see true values it basically represents all those nan values and if i want the count of these nan values in each column i just have to type in dot sum after this and now i click on run right so you see that the age column has no nan values the bmi column has nine nan values the hyp column has eight nan values and the chl column has 10 nan values
now we’d have to open and read a file in python so let’s see how we can do that so i actually have this file with the name sparta and it is present in my d drive so let me actually copy the path over here right so this is just the path which i have to copy so to open a file in python i have to use the open function so i’ll just type in f equals open and this takes in two parameters the first parameter is just the path and after the path i’ll give in the name of the file which would be sparta dot txt and the second parameter is the mode in which i want to open this file so i would open this file in the read mode so again i’ll just give in double quotes and i’ll type in r so r basically means that i am
opening this file in the read mode and i’m storing this in this object f now let me go ahead and read this so f dot read right so this is the sentence which was stored in this file this is sparta and we have successfully read the sentence so let’s head to the next question
so what is a lambda function and we’d have to create a simple lambda function to add 10 to a given number well a lambda function is an anonymous function and it can take any number of arguments but it should have only one expression and this is the syntax of a lambda function so you’ll type in lambda and then you’ll give in all of your arguments after that you’ll put in a colon and then you’ll give the expression so let’s go ahead to jupyter notebook and create a simple lambda function to add 10 to a given number right so let me just type in lambda and i’ll give the name of the argument to be a i’ll give a colon and all i have to do is add 10 to whatever value is sent into this and i am naming the function let’s say x so this is how we can create a simple lambda function now i’ll call the function and pass in a number so let’s say i’ll pass in 8 now this is returning 18. so all i’m doing is adding 10 to the number which i’m passing into this now again let’s say if i pass 5 into this i’ll get 15.
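The lambda described above can be sketched in a few lines. The names `x` and `a` follow the narration; the second lambda, `total`, is a hypothetical addition to illustrate the point that a lambda can take several arguments but only one expression.

```python
# Anonymous function bound to the name x: takes a, returns a + 10
x = lambda a: a + 10

print(x(8))  # 18
print(x(5))  # 15

# A lambda may take any number of arguments, but still only one expression
total = lambda a, b, c: a + b + c

print(total(1, 2, 3))  # 6
```

When a function needs a name and more than one statement, an ordinary `def` is usually the clearer choice; lambdas fit best as short throwaway arguments to functions such as `map`.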
similarly let’s say if i pass 100 i’ll get 110 so we have successfully created a lambda function which takes in a parameter and adds 10 to the given parameter
so now we’d have to create a simple line plot like this where the x and y axis values range from 0 to 10 and the title of the plot is y versus x the x label is x axis and the y label is y axis so a simple line plot and we can create this line plot with the help of the matplotlib package so let’s quickly head on to jupyter notebook i’ll start off by loading the required packages so i would need the numpy package so i’ll type in import numpy as np and i would also need the matplotlib package so i’ll type in from matplotlib import pyplot as plt right so now that we’ve loaded the required packages let’s go ahead and create the data so our x axis and y axis values range from 0 to 10 so i will name this x and i’ll get the values with the help of np dot arange and since the values go from 0 to 10 i’ll give 0 comma 10 and the step is one so let me just print out x over here and let’s see what we get right so these are all of the values which i have over here so zero one two three four five six seven eight nine because the stop value 10 itself is not included right now similarly let me also go ahead and create the y values it’ll be the same thing over here it’s just that instead of storing the values in x i’ll be storing them in y so i have my x and y values ready all i have to do is use these data points and create the line plot so plt dot plot and i’ll pass in x comma y and this is my plot over here now i’ll also go ahead and add the labels for the x-axis and y-axis and i’ll also give in the title plt dot xlabel so the label would be x-axis similarly plt dot ylabel and the label would be y axis for this after this i’d have to give in the title so it would be plt dot title and x versus y right so we have created our line plot where the label of the x axis is x axis the label of the y axis is y axis and the title of the plot is x versus y so it’s as simple as that guys so this is how you can
create a simple line plot with the help of the matplotlib package
so now we’d have to create a simple bar plot and we have these fruits over here represented on the x-axis so we’ve got apple banana and orange and we’ve got the cost of the fruits on the y axis so let me start off by loading the required library from matplotlib i’ll be importing pyplot as plt after this i just have to create my data so i’ll be creating a simple dictionary over here and i’ll name this dictionary data i’ll put in curly braces over here and it consists of three fruits which are apple banana and orange so apple and the cost of apple would be let’s say 50 bucks after that we have banana and the cost of banana is 20 after that we have orange and the cost of orange is 30 right so now i’ll separate the keys and values from this so let me get the keys first data dot keys and i’ll convert this to a list so i’ll paste this inside list and let’s say i’ll store this in an object named names now let me get all of the values similarly values and i’ll be getting data dot values so i have the names i have the values now all i have to do is make a bar plot and to make a bar plot i’ll be using plt dot bar and i’d have to pass in the names as well as the values all right so we have successfully created a bar plot and on the x axis we have the names and on the y axis we have the cost
next question so what do you understand by a module in python so when we write everything in a single file it becomes difficult to track and not just this let’s say if you want to make a change at a certain place in the project then it could affect the entire project and may prove to be disastrous and this is where we’ll be using the concept of modules so instead of writing one big piece of software in one file you would break it down into parts so a module basically helps us to organize our python code now let’s say you want to write a program to create a calculator now instead of writing all of the features in the same file you
can create separate modules for addition subtraction multiplication and division now if you want to perform addition you can invoke the addition module similarly if you want to perform multiplication you can invoke the multiplication module so for every single purpose you can have a separate module so that your work becomes easier
so now we have to randomize the items of a list in place in python so let’s say we have a list which comprises these elements so the first element is mary and then we have had a little lamb now i’d have to randomly shuffle all of these elements inside the list and to randomize the items of a list we can use the shuffle function and the shuffle function is part of the random library so i’ll type in from random import shuffle so i have loaded the function now let me go ahead and create the list mary had a little lamb and i’ll store it in x right now i will just pass this inside the shuffle function now let me print x right so this is in place shuffling of the list right so initially the sequence of the elements inside this list was mary had a little lamb and after passing this list into the shuffle function the order of the elements changed and the sequence now is a lamb mary had little
so now we’d have to write a program to get the length of the string ophthalmology without using the len function and to get the length of the string we can just use a for loop so what we’ll do is we’ll start a for loop and it will iterate through all of the characters in this string and we’ll get the count of the number of characters present in the string so let me name the string a so it is o p h t h a l m o l o g y so let’s hope this is actually the spelling of ophthalmology right now let me initialize a counter and set the value of count to be equal to zero after this i’ll start the for loop so for i in a what would basically happen is count equals count plus one and finally i’ll print out the value of count so what is
happening over here is initially count’s value is zero and the variable i will loop through all of the characters of the string which is present in a so let’s say the loop starts over here initially i takes the first character o and the count increments by one again it’ll head to the next character and the count again would increment by one it’ll head to the next character and the count would increment by one so till the end of the string it will keep counting the number of characters present right and we get the result so the number of characters present in the string is 13.
so now we have to replace all the odd numbers in this numpy array with minus one right so this is our numpy array which comprises the numbers from zero to nine and we’d have to replace all of the odd numbers so one would become minus one three would become minus one five would become minus one so wherever odd numbers are present all of those odd numbers would become -1 let me load the numpy library import numpy as np after this what i have to do is create my numpy array so arr equals np dot arange and it will go from 0 to 10 let me just print out this numpy array over here this is my numpy array now let me go ahead and replace all of the odd values with -1 so what i’ll do is i will basically divide each element by 2 and see what the remainder is so arr percent 2 and check if it is equal to 1 so what i’ll do is i will divide 0 by 2 and i’ll check the remainder similarly i’ll divide 1 by 2 and i’ll check the remainder i’ll divide 2 by 2 and i’ll check the remainder so wherever the remainder is equal to 1 those elements i’ll be changing to minus 1 right so first i’ll divide 0 by 2 and i’ll check what the remainder is and since the remainder is not equal to 1 nothing will happen after this i’ll divide 1 by 2 and i’ll check the remainder and since the remainder is equal to 1 i’ll replace this with minus 1.
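The remainder check being described here can be sketched as a boolean mask. This is a minimal sketch, assuming NumPy is installed; `arr` follows the narration, while `original` is a hypothetical extra copy kept only so both versions can be printed.

```python
import numpy as np

arr = np.arange(0, 10)   # 0 through 9; the stop value 10 is excluded
original = arr.copy()    # untouched copy for comparison

# arr % 2 == 1 is True at every odd element; assigning through the mask
# overwrites only those positions and leaves the even elements alone
arr[arr % 2 == 1] = -1

print(original)  # [0 1 2 3 4 5 6 7 8 9]
print(arr)       # [ 0 -1  2 -1  4 -1  6 -1  8 -1]
```

NumPy evaluates the remainder for the whole array at once, so no explicit loop over the elements is needed.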
similarly over here if i divide 3 by 2 the remainder which i’ll be getting is 1 and again this value would be replaced with -1 right so the changes have been done now let me print this this was my original array and after performing this step over here all of the odd values have been replaced with minus one one has been replaced with minus one three with minus one five with minus one seven with minus one and nine with minus one
now we’d have to perform an operation so that we get the common items between two numpy arrays this is our first numpy array this is our second numpy array now let’s actually check the common items so if we look closely at these two arrays we see that two and four are the only two common items present among these two arrays and i’d want to extract these two right so we have created our arrays over here this is the first numpy array and i’m storing it in a this is my second numpy array and i’m storing it in b now to get the common elements i have the np dot intersect1d function so np dot intersect1d and i just have to pass in the two numpy arrays inside this as the parameters so a comma b i’ll pass in these two i’ll click on run right so i’ve got the common elements present in these two arrays
right so we have a pandas series over here and we’d have to convert each of these elements into title case so mary had a little lamb so we see that all of these are in lower case now i’d have to convert all of these elements into title case so let me start off by loading the pandas library import pandas as pd now i’ll create the series pd dot series and i’ll pass in the values which are basically mary had a little lamb right so this is done and i’ll store this in let’s say ser so i have created my series i’ll just print it out now right so this is my pandas series now i’d have to convert all of these elements into title case that is m needs to be capital h needs to be capital a l and l need to be capital right and as i’ve already told you we’ll be using the
map method which helps us to replace or substitute all of the values inside a pandas series so ser dot map and now inside this i’ll use a lambda function to convert all of these elements into title case so let me type in lambda over here x colon and i just have to convert this into title case so let me just put in x dot title over here now let me click on run all right so we have successfully converted all of these elements into title case now let me store it back to ser now let me print ser over here right so all of these elements have been converted into title case
so now we have the same pandas series so mary had a little lamb now i’d have to calculate the number of characters in each word of the series so this is one two three and four so there are four characters present in the word mary there are three characters present in the word had this is a single character right so we’d have to find out the total number of characters present in each word or each of these elements inside this pandas series import pandas as pd let me again create the series ser equals pd dot series and i’ll give in all of the elements mary had a little lamb so now to get the length of each of these words present in this pandas series i’ll have to use the map method so ser dot map and inside this i’ll again create a new lambda function lambda a colon and since i have to get the length i’ll use the len function and i have to get the length of each of these elements so len of a right so we see that the word mary has four characters in had there are three characters a is a single character and in little there are six characters and in the word lamb there are four characters
time for the next question so again we have this iris data frame and we have to change the column name sepal length to s underscore length let me load the package import pandas as pd and i’ll also load the file so pd dot read underscore csv and the name of the file is iris.csv and i’ll store it into iris let me have
a glance at the head of this data frame so it’ll be iris dot head so we have all of these columns over here and i’d have to change the name of the sepal dot length column to s underscore length so to rename the columns of a data frame i have the rename method in pandas so first i’d have to give in the name of the data frame which is iris and then i will invoke the rename method now i’d have to give in all of the column names which i’d want to change so i’ll type in columns over here i’ll create a dictionary and my key would be the name of the column which i’d want to change so the key is sepal dot length and the value is the new name so it’ll be s underscore length so this is how it goes so iris dot rename and the original column name is sepal.length and i’d have to change it to s underscore length and i’ll store it in a new object and name that object iris one now let me just print out the head of this so iris one dot head
so now we have to build a linear regression model on top of this boston data frame where the independent variable is this rm column over here and the dependent variable is this medv column over here and the train and test split needs to be 80 20.
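The column rename described above can be sketched as follows. This is a minimal sketch, assuming pandas is installed; the two-row frame is a hypothetical stand-in for iris.csv, and the exact capitalisation of `Sepal.Length` and `S_Length` is assumed.

```python
import pandas as pd

# Hypothetical stand-in for the data loaded from iris.csv
iris = pd.DataFrame({
    "Sepal.Length": [5.1, 4.9],
    "Sepal.Width": [3.5, 3.0],
})

# columns= takes a dictionary mapping old name -> new name;
# rename returns a new data frame and leaves iris itself unchanged
iris1 = iris.rename(columns={"Sepal.Length": "S_Length"})

print(iris1.head())
```

Columns that are not listed in the dictionary keep their original names.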
so this question is basically related to machine learning with python where we are implementing a linear regression model on top of this data set to understand how this medv column varies with respect to this rm column or in other words the medv column is our dependent variable and the rm column is our independent variable and we are trying to understand how medv changes with respect to rm so let me start off by loading the pandas library import pandas as pd right now we’ll go ahead and also load up our boston dataset so pd dot read underscore csv helps me to load the data set and the name of the data set is boston.csv and i’m loading this file into this object boston and then i’ll have a glance at the head of this so these are all of the columns which are present in this data frame so i’ve got crim zn indus chas nox rm and so on and medv is my target and my feature is rm or in other words medv is my dependent variable and rm is my independent variable now i’ll separate the feature and the target all right so pd.dataframe and so from this entire boston data frame i am selecting only the rm column and i am storing it into the x object similarly from the entire boston data frame i am selecting only this medv column and i am storing it into this y object so x would have the feature values and y would have the target values so now i’ve extracted the feature and the target now it’s time to divide this data set into training and testing sets so the train test split needs to be 80 20 or in other words 80 percent of all of the records would be present in the training set and 20 percent of all of the records would be present in the testing set and to do this i need train test split from sklearn dot model underscore selection so i’ll type from sklearn dot model underscore selection import train underscore test underscore split and this function takes in these parameters so first i’d have to pass in the features and then i’d have to pass in the target so x comma y and then i’d have to give in the test size so the test size is
0.20 so this again means that the test set would comprise 20 percent of the records and the train set would comprise 80 percent of the records and the values would be stored in x train x test y train and y test so x train would comprise all of the training records for the features and x test is basically the test set for the features y train is the train set for the target and y test is the test set for the target so we have our training and testing sets ready let me click on run over here all right now it’s finally time to build the model on top of the train set so from sklearn dot linear underscore model i will import linear regression and i’ll create an instance of this so i’ll name that instance regressor and i will fit this model on top of x train and y train or in other words i am fitting the model on top of the train set right so now that we’ve fit the model it’s finally time to predict the values on top of the test set and to predict the values i’ll use regressor dot predict and the parameter which i’m passing inside this is x underscore test and i’ll store this in y underscore pred now once i’ve predicted the values i have to find out the root mean squared error so i’ll import metrics from sklearn and this is how i’ll get the root mean squared error so metrics dot mean underscore squared underscore error and this takes in two parameters y test and y pred so y test comprises all of the actual values and y pred comprises all of the predicted values and i will pass these two inside mean squared error now when i do this i’ll get only the mean squared error but i want the root mean squared error so i’ll use np dot sqrt so i would also have to import the numpy library over here i’ll type in import numpy as np now i’ll click on run
so now we’ve arrived at our final interview question and we have to build a decision tree classifier on top of this iris data frame where the dependent variable is the species column and the independent variables are the
rest of the columns so sepal length sepal width petal length and petal width would be the independent variables and your species column is your dependent variable and the train test split is 70 30. so i’ll start by loading the requisite packages which are numpy and pandas so import numpy as np and import pandas as pd then after this i’ll load my data set so pd dot read underscore csv i’ll pass in the file name iris.csv and i’ll store this in the iris object and we’ll have a glance at the head of this data set so these are all the columns present in this data frame we’ve got sepal length sepal width petal length petal width and the species column now out of this the species column is our dependent variable and these four columns the numerical columns are our independent variables so now i’ll go ahead and separate the features and the target variable so these are all of my features these first four columns so i’ll just extract sepal length sepal width petal length and petal width from this iris data frame and i’ll store that into this x object similarly i’ll extract only the species column from the entire iris data frame and i’ll store it into the y object so i have my features and the target ready now it’s time to divide this entire data frame into train and test sets so i’ll have to import train test split from sklearn dot model underscore selection and this takes in these parameters the first parameter is the features the second parameter is the object which comprises the target label and the test size is 0.30 so 30 percent of the records would go into the test set and the remaining 70 percent of the records would go into the training set and i am storing the results into x train x test y train and y test so x train is the training set for the features x test is the test set for the features y train is the training set for the target and y test is the test set for the target now we’ll import decision tree classifier from sklearn.tree so i will go ahead and create an instance of this decision tree classifier and
i’ll name that instance classifier and i will fit this classifier on top of the training set so classifier.fit and i’ll pass in x train and y train inside this so we have successfully fit the model on top of the train set now we’ll go ahead and predict the values on top of the test set so classifier.predict and i’ll store the result into y pred after this we’ll calculate some metrics so we’ll get the confusion matrix and we’ll also get the accuracy score right so this is my confusion matrix and this is the accuracy so the left diagonal which you see in this confusion matrix all of these are the values which have been predicted or classified correctly so the first row represents the species setosa the second row the species versicolor and the third row the species virginica right so you see that all of the species of setosa have been classified correctly when it comes to versicolor 16 of them have been classified correctly and two of them have been classified incorrectly and when it comes to virginica 13 of them have been classified correctly and two of them have been classified incorrectly so to get the accuracy we’ll have to divide 12 plus 16 plus 13 by the sum of all of the values so let me just do that 12 plus 16 plus 13 divided by 12 plus 16 plus 13 plus 2 plus 2.
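The arithmetic just described can be checked with a short sketch. It uses plain Python rather than scikit-learn, and the matrix is rebuilt from the counts read out in the walkthrough. The transcript gives only the counts, so the positions of the two off-diagonal 2s and the name accuracy_from_confusion are assumptions made for illustration.

```python
# Confusion matrix rebuilt from the counts in the walkthrough.
# Rows are the actual species (setosa, versicolor, virginica),
# columns are the predicted species.
# Assumption: the transcript does not say which species the misclassified
# flowers were assigned to, so the off-diagonal positions are illustrative.
confusion = [
    [12, 0, 0],
    [0, 16, 2],
    [0, 2, 13],
]


def accuracy_from_confusion(matrix):
    """Return correct predictions (the diagonal) divided by all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


accuracy = accuracy_from_confusion(confusion)
print(f"{accuracy:.4f}")
```

The diagonal sums to 41 and the whole matrix to 45, which is 30 percent of the 150 iris records, so the test set size agrees with the 70 30 split.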
so the accuracy is the same and it comes out to be 91.11 percent
we’ll start by understanding what exactly sentiment analysis is then we’ll understand the need for it following which we’ll look at some very interesting applications of sentiment analysis and finally we’ll do twitter sentiment analysis so all of the big companies out there they try to understand the sentiments of their customers they try to analyze what the customers are talking about how they are saying it and what exactly they mean by it so this is where sentiment analysis comes in so sentiment analysis is basically that particular domain where you try to understand human emotions with software and if these human emotions are in written form we can go ahead and classify these sentiments to be positive negative or neutral and sentiment analysis is also known as opinion mining because what we are basically doing is trying to figure out the opinion or the attitude of the customer with respect to a particular product so this is basically sentiment analysis now we’ll go ahead and understand the need for sentiment analysis so today a customer plays a very big role in the market he’s responsible for making or breaking your business and if a company is able to tap into all of the sentiments of the customer then it could be very beneficial for the company so this is where a company can take the help of sentiment analysis and obtain useful information which can be used to determine market strategy improve business kpis generate leads and so on so this is basically the need for sentiment analysis so now we’ll go ahead and look at some interesting applications of sentiment analysis so sentiment analysis can be used for review classification now there are a lot of customers who post a lot of reviews but then again how do we know the sentiment behind these reviews this is where we can take the help of sentiment analysis and classify these reviews to be positive negative or neutral another application of sentiment analysis
is product review mining so let’s say there’s a company which has a product a now the company wants to know which features are liked by the customer and which features are disliked by the customer so again the company can take the help of sentiment analysis and figure out which features are liked by the customer and which are disliked and depending on that it can go ahead and improve this particular product so sentiment analysis can also be used during election times let’s say there are two candidates candidate a and candidate b and with the help of sentiment analysis we can understand which candidate is more popular in a particular area and in the past decade or so there has been a huge increase in online activity across the globe so every single second people make millions of posts and this is where social media plays a pivotal role now social media is not just any other platform people go on to social media and express their views they talk about their likes and they talk about their dislikes and if a company is able to tap into all of these sentiments it can be very useful to it and there are a lot of social media sites such as twitter facebook and linkedin so today we’re going to do a bit of twitter sentiment analysis right so to do sentiment analysis with twitter we actually need a developer account so we’ll go to developer.twitter.com and over here this is my profile which is already there i’ll click on apps and then go ahead and create a new application so we have this option to create an app and if you guys want to create a new app you’d have to click on this fill up a few details and you’ll have your new app ready but then again i have already created an app with this name over here sentiment 123 demo now let me have a glance at the details right so this is my app and it contains these keys and tokens so in our code we’ll have to use these consumer api keys and access token and
access token secret so you’d have to note down these values and then put them inside your code all right now let’s go ahead and do some sentiment analysis so first we need this tweepy package which would basically act as the api with twitter and then with the help of textblob we can understand the sentiment of different tweets so we’ll import these two first is tweepy next is textblob and then we’ll go ahead and give the values of these four consumer key consumer key secret access token and access token secret so you have the same values this is the api key api secret key access token and access token secret and i’ve given the same values over here after that using tweepy.OAuthHandler i would have to first give in the consumer key and the consumer key secret and i’ll store that in this object auth now again this object contains set_access_token so now i’ll give in the values of access token and access token secret now finally i will send this object inside tweepy.API so this is how we are basically establishing the connection with twitter so we have given all of these four values and we are authenticating this api and then finally we can go ahead and give in the keyword for which we’d have to find the sentiment so let’s say i want to find the sentiment for this word avengers so using this object api i’ll just type in api.search and it contains the keyword avengers and i’ll store this in the object public tweets now i’ll start a for loop so i will go through all of the tweets which are stored in public tweets and i’ll start off by printing the text of the tweet and then i’ll pass tweet.text as a parameter to this TextBlob function and i’ll store the result in this analysis object now this analysis object helps me to find out the polarity and the subjectivity of these tweets it has these two things one is polarity next is subjectivity so first what we’ll do is we’ll just check analysis.sentiment of zero so basically this is the first element
inside this list so if the value inside this is greater than zero so we are basically checking if polarity is greater than zero then we will just print that the sentiment of this tweet is positive on the other hand if the polarity of the sentiment is less than zero then we’ll print negative and if it is not greater than zero and not less than zero then it basically means that it is equal to zero so we’ll just print that the tweet is neutral now let me go ahead and run this so these are all of the tweets which contain the keyword avengers so this is our first tweet we see that the polarity is 0.0 subjectivity 0.0 and it is neutral next again this is the tweet again polarity 0.0 subjectivity 0.0 and then again it is neutral you see that for most of these polarity and subjectivity are zero and it is neutral over here and for this case over here polarity is 0.6 if the value of polarity is closer to plus 1 then it would mean that it is highly positive similarly if the value of polarity is closer to -1 then it would mean that the sentiment of the tweet is very negative and then what we have over here is subjectivity so subjectivity basically tells us how subjective or how personal the tweet is so we see this tweet that i just needed to share with everyone how herds is a truly amazing company and for this polarity is 0.6 and subjectivity is 0.9 and then again for this statement over here we see that polarity is 0.375 subjectivity is 4.7 and we see that it is positive and then again for this statement over here polarity 0.0 subjectivity 0.0 and it is neutral right guys so this is how we can do sentiment analysis in python
so nlp is a subfield of artificial intelligence where we teach computers how to understand and interpret human spoken languages or in other words natural language so what in the world is text mining well simply put extracting information from unstructured textual data is known as text mining now let’s say there’s a 500 page thesis on some financial topic
and you come across a sentence which ends with the number 30. now you have absolutely no idea what that number means it could be the number of days in a month the amount of dollars a stock increased over the past week or the number of items sold in a day so this number 30 could mean anything and without proper context you wouldn’t get the right information and this is where text mining comes in so with the help of text mining we can give proper context to the words before and after the number 30 and this would help us to find out more information now let’s look at the need for text mining so today with the advent of social media companies now have access to massive behavioral data of their customers and most of this data is present in the form of unstructured textual data but beneath this textual data lies an enriching source of information which can help companies to boost their business and this is the reason why text mining is being used across various industries now text mining can be divided into four practice areas so we’ve got information extraction which basically deals with identification and extraction of relevant facts and relationships from unstructured data then we have document classification and clustering which aims at grouping and categorizing terms paragraphs or documents using classification and clustering methods after that we have information retrieval which aims at storage and retrieval of text documents and then we have nlp which forms the major part of text mining so here we use different computational tasks to analyze and understand the underlying structure of the text data right now we’ll dive into nlp so when you hear the term natural language processing the first question to pop into your head would be what is a natural language so any language that is used for communication by humans is known as a natural language so english french chinese all of these are natural languages now since we are humans we’re able to understand these languages but
what about a machine how is a machine able to comprehend all these human spoken languages well this is where nlp comes in so natural language processing is a sub domain of artificial intelligence where we understand and interpret human languages using a computer now let’s see how nlp is used in text mining so just think about how much text you see each day email sms web pages the list is just endless so with the help of nlp we can go through all of this text data and use it for different purposes such as character recognition sentiment analysis and spell checking right now we’ll go ahead and install anaconda into our systems right guys so just a quick info if you are interested in doing an end-to-end certification in nlp then intellipaat provides the natural language processing training course so you can check out the course details in the description below so let’s continue with the class so anaconda is basically your free and open source distribution for the python and r programming languages and it is specifically designed for data science and machine learning applications right so these are the four steps to work with anaconda so first we’d have to download the anaconda installer and then we’ll install anaconda then we’d have to open anaconda prompt and update anaconda right now let me just guide you how we can do this so first we’d have to go to this website anaconda.com distribution download section and you’d have to download the version for whichever operating system you’re using so if you’re using a windows system then you’ll download anaconda for windows if you’re using mac then you’ll download it for mac and if you’re using linux then you’ll download it for linux now you’ve got two versions of python so python 3.7 and python 2.7 and we’ll be downloading the latest version of python which is python 3.7 so once we download this we have the setup wizard over here so we just have to click on next and it’ll take care of the rest now once the installation is done we’d have
to open up the command prompt so this is what we get when we open the anaconda prompt now inside this we have to type in conda update anaconda and we’ll have the latest version of anaconda in our system all right now let me just go ahead and show how to do it so i’ll just type in download anaconda i’ll click on this link over here anaconda python r distribution [Music] and over here i’ll click on this download so i’ve got different versions over here for windows mac and linux and since i’m using a windows system i’ll download anaconda for the windows system so this is the latest version of anaconda python 3.7 i just have to click on download right so once the download is done i’ll get a setup wizard and i’d have to set it up and i’ll go and click on anaconda prompt so this is my anaconda prompt over here so now that we’ve successfully installed anaconda into our systems we’ll go ahead and work with the os module right so the os module provides functionalities to perform operating system dependent operations so these operations include creating directories creating renaming and removing files and many other tasks and these are the different methods we’ve got so we’ve got os.rename os.open os.listdir os.makedirs os.getcwd and os.fdopen now to work with the os library we’d have to launch the jupyter notebook so we can either launch the jupyter notebook through anaconda navigator or we can launch it through the anaconda prompt so anaconda navigator basically provides you a gui so you can either use this or the anaconda prompt now when you open the anaconda navigator you have jupyter notebook inside this and all you have to do is click on launch now once you do that you have your jupyter notebook right in front of you so over here you would have to click on new and open up a new python notebook now once you do that you have a new page and over there you’ll be performing different operations so your first task
would be to import the os library so you’ll just type in import os now if you want to have a glance at the current working directory all you have to do is type in os.getcwd and it’ll give you the current working directory and if you want to create a new folder then you’ll type in os.makedirs and then you will give in the path and after that you will give in the name of the directory which you want to create so let’s perform this so again i’ll just type in anaconda prompt over here now inside this i’ll type in jupyter notebook right so now we see that we’ve launched the jupyter notebook over here now i’ll click on new and i’ll open up a new python 3 notebook so our first task was to import the os module so i’ll type import os and i’ve successfully imported this module so as i stated in the ppt the os module is basically used to work with different operating system operations so you can create different directories you can open up new files you can rename files and so on so our first task over here is to have a glance at our current working directory so i’ll type in os.getcwd and i’ll put in the parentheses over here right so this is my current working directory c users intellipaat now i want to make a new directory in my c drive so this would be the command for that so i’ll type in os.makedirs and over here i’d have to give in the path so i’ll type in c colon double slash and i’ll name the folder as let’s say new folder let me click on run right so now i’ve successfully created this folder now let me actually go to the c drive and see if this folder exists or not so this is my c drive over here and i see that we have this folder with the name new folder now we’ll go ahead and perform some more operations so let’s say i want to open a file which is present in my system so to do that i’ve got this command os.open and inside this i will give in the path of the file and then i’ll give the name of the file after that i also have to give the
extension of the file right so this is a text file so i’ll type in file.txt and over here this is basically the flag so over here i’d basically be setting what sort of operation i want to do so os.O_RDWR it stands for read and write so i’ve got different operations so i can open this file in either read only format or write only format or read write format now after that we set in another parameter over here which is os.O_CREAT so this means that if a file with the name file.txt doesn’t exist then it’ll automatically create this new file and then it’ll open it up and then we’ve got another operation so if i want to have a glance at all of the contents which are present in a folder then for that the command would be os.listdir and i will give in the path so over here i want to have a glance at all of the contents which are present in the d drive so i’ll just give in d over here so these are all the contents which are present in the d drive right so now i’d want to open this sparta.txt file using the os module so let me do that so the command was os.open and inside this as we saw first we’d have to give in the path so let me copy the path over here i’ll click on properties i’ll copy all of this and i’ll paste it over here you’d have to keep in mind that you’d have to give in a double slash over here so in the path to separate all of these you’d have to give in a double slash again you’ll give in a double slash over here and then you will give in the name of the file so the name of the file is sparta.txt after this i’ll give in the flag over here which would be os.O_RDWR for read and write and then i’ll give in the pipe operator so over here i’ll just state that if the file doesn’t exist then i’ll tell the operating system to create it so os.O_CREAT i’ll click on run right so we have successfully opened this file now i’d want to have a glance at all of the contents which are present in the c drive so i’ll type in os.listdir
and inside this i’d have to give in the path so i’d want to have a glance at all of the contents which are present in the c drive right so these are all of the folders or all of the contents which are present inside the c drive then i have a couple more commands over here so this is the command to change the working directory so over here i am changing my working directory from d drive to my folder so i just have to give in the path for this and this is how i can change the current working directory and then we can also change the name of a file so let’s say if the name of the file was old file.txt then to rename it i’ll just give in new file.txt over here so let me perform both of these operations over here so i’ll type in os.chdir and inside this i’ll set the working directory to just the c drive c colon double slash right i have changed the working directory now let me also check the current working directory now right so now i’ve changed the working directory to the c drive now let me go ahead and rename a file over here so os dot i’ll type in rename so over here this takes in two parameters first is the path where the file is located along with the current name of the file so i’d just rename this sparta file over here right so as we saw earlier the name of the file was sparta.txt let me also remove this over here let me just put in a single quotation mark now after this i’d have to give the new name so i’ll copy this all again and i’ll paste it over here now i’ll change this to sparta new i’ll click on run right so now we see that we’ve got a new file over here so initially it was sparta.txt now we’ve changed it to sparta new right so that was the os module so with the os module we saw how we can create new directories and how we can change the names of some files now we’ll go ahead and do some proper file handling in python so the io module is the default module in python that is used to perform file handling operations and it contains functions
that return a file object called a handle and this handle is then used to read from a file or write to a file so this is the command to create a file object so we’ll be using this function from the io module so open and this has two parameters so this is the first parameter file and over here we’ll give in the path of the file or just the name of the file after that we’ve got the mode so over here this mode basically specifies the way we want to open a file so we’ve got these different modes over here r r plus w and w plus so r basically stands for read-only mode right so if we open a file in read-only mode then we can’t edit it so it’s only read-only mode and it starts from the beginning of the file after that we’ve got r plus which stands for read and write mode so over here the file pointer again starts at the beginning of the file and with this we can read the file as well as write something in the file then we’ve got w which stands for write only mode and even in this the pointer starts at the beginning of the file and it overwrites the existing file then we’ve got w plus which allows us to write as well as read and just like w this clears the existing contents so the pointer starts at the beginning of an empty file and if we want to start writing from the end of the file we’d use the append modes a and a plus right so these are the different modes in which we can open a file now these are the different methods from the io module with which we can read from a file so now let’s say we’ve opened up a file so open and then i’ll give in the path and i’m opening this in read-only mode now i’ll store the file in this f1 instance over here now if i want to read the entire file i’ll just type in f1.read over here so this would give me the entire file over here now instead of reading the entire file let’s say if i want to read only the first line over there then what i’ll do is i’ll type in readline so i’m storing this file object in this f1 instance over here after that i’ll type in f1.readline and this would just give me
the first line over here so let me perform these two operations right so i’ll type in import io over here let me load this up now i’d have to open the file so open i’ll take the path so i want to open up the same file which is sparta.txt right now i want to open this in read-only mode so i’ll type in r over here right so i’ve also given the mode now let me click on run right so i’ve created an instance of this file now let me go ahead and read the contents of this file so i’ll type in f1.read so these are the three lines which are present in this file this is sparta this is 300 and this is intellipaat so let me just verify this i’ll open up this file right so these are three lines this is sparta this is 300 and this is intellipaat now as you see all of these three lines are separated by \n over here \n denotes a new line over here now instead of reading all of these three lines i want to read only the first line from this file so the command for this would be f1.readline right so now let me click on run right so i’ve just read one line from this entire file now we’ll see how to read a file line by line using a for loop so again first we’d have to open up the file and save this file in an instance so over here i’m saving this in f1 after that i’ll start the for loop so over here i’ll create a new variable so for line in f1 and this line stores every line which is present in this file over here so it’ll store the first line and then we are printing the first line now once that is done we’ll go back and then print the second line and it’ll go back and then print the third line right so this is what happens now after we print all of the lines i’m just typing f1.close so when i type f1.close it’ll just clear all of the buffers and close the file so i’ll just type in for line in f1 and over here i want to print line right so let me start by using the for loop so i’ll type in for line in f1 and i’d have to print each of the lines so i’ll type print
line and i’d have to close the buffer so f1.close let me close this now let me click on run right so let me start off by using the for loop over here so i’ll type in for line in f1 so this is f1 and not s1 i’d have to print each of the lines so i’ll type in print line and then i’d have to close the file so i’ll type in f1.close right so let me run this right so i’ve extracted each of these lines with the help of the for loop so we got this is sparta this is 300 and this is intellipaat now let’s see how we can write something to a file so to write something to a file we can use the write method so again we’d have to start off by opening the file but this time we’ll be opening this in write mode so over here i’ll set mode is equal to w and while writing i also have to set the encoding type so the encoding of this file is utf-8 and what i’ll do is i’ll add three new lines over here so i’ll type in this is the first line this is the second line and this is the third line and these three would be written to this current file now instead of adding one single line let’s say if i wanted to add multiple lines over here so for this what i’d have to do is i’ll create a list and inside this list i’ll separate these three items with the help of \n so as we already know \n denotes a new line right now after that we’ll open the file and over here we’ll open this in the mode a plus so a plus stands for the append mode and we’ll write this entire list into the file over here now let’s perform these operations right so what i’ll do is i’ll copy the same command over here and paste it over here so the file would be the same and over here i’ll set the mode to be equal to w so w stands for write and after this i’d have to set the encoding so encoding would be equal to utf-8 so i’ll type in utf-8 over here and i’ll click on run right so i’ve successfully opened this file in write mode now i’d have to go ahead and write something into this so i’ll type in f1 dot
write over here and let me write something inside this so let me add the first line over here so the first line would be this is one i’ll click on run then i’ll type let’s say f1.write and i’ll type this is two i’ll click on run again then i’ll type f1.write and i’ll add the line this is three now let me go ahead and open the file so this time i’ll open the file in read-only mode so i’ll set this mode to be equal to r and i’ll remove the encoding from over here i’ll click on run right now i’d have to type in f1.read right so this is what i have this is one this is two and this is three right so i’ve successfully added these three elements into my file now as you see over here my file initially had these three lines over here this is sparta this is 300 and this is intellipaat now when we used the write mode and we entered these three lines this is one this is two and this is three what it did was it replaced these three lines with this and this happened because write mode clears the file and the pointer starts at the beginning of the file so all that which was there initially would be replaced so let me go to the file now and see how it looks right so i have a single line now this is one this is two and this is three because the earlier content was replaced now i’ll go ahead and actually create a list over here so i’ll name this as l1 and i’ll add some values over here so let’s say the first value is a i’ll add a new line after this then the second value is b again i’ll add a new line after this and then i’ll add a new value which is c after that i’ll add a new line now i’ll open this file in append mode so this would be the same the mode would be a plus and i’d have to give the encoding again so encoding would be utf-8 right so this time i’ve opened the file in append mode now what i’ll do is i will go ahead and add this so f1.writelines and inside this i’ll just put in l1 now i’ll go ahead and open the file let’s see what do
we get so i’m opening the file in read mode now now let me read what we have in this file so now i have successfully added these three characters over here a b and c so now let me open the file right so this was the first line after that i added a b and c all right guys now let’s see how we can work with word document files so we have this library with the name python-docx and we can use this for creating and updating microsoft word documents so to install this library we again have to go to anaconda prompt and inside anaconda prompt we’ll give in this command conda install -c conda-forge python-docx right so with the help of this command we’ll be installing the python-docx library so once the installation is done we’ll go to either anaconda prompt or anaconda navigator and then launch the jupyter notebook so we’ll open a new notebook over here a new python notebook and we’ll type in import docx so here you have to keep in mind that when importing the library you’d have to type docx and not python-docx so you’ll type in import docx so once you import the library now let’s see how we can create and update a word document so starting off we’d have to create an instance of our word document so to create an instance we’ll invoke the Document method so we’ll type in docx.Document and we’ll create an instance of the word document so i’m naming this instance as document now once we have this instance we can add all of the content into our word document so i’ll type in document.add_heading and with the help of this i can add a heading to my document right so over here i’m adding the heading this is a zero level heading so here zero indicates the level of the heading right so once we add the heading now we are going ahead and adding a new paragraph so here i am typing document.add_paragraph and i’m just writing this is a paragraph and after that if i want to add a page break we can use this method document.add_page_break so we created an instance and with the help of
that instance we added a heading and we also added a paragraph now it’s time to save the document so to save the document we’ll type in document.save and inside this we’ll give in the name by which we want to save the document so i’m saving the document as document.docx right so this is how we can create a document now let’s see how we can read a word document so to read a document we’d have to start off by importing this Document method from docx and then we’ll use this Document method and inside this we’ll pass in the name of the document so we’ll pass in document.docx and this is the file which we had just created earlier and we’ll save this in a new instance and name that doc so once we have this instance over here we’ll use the for loop and with the help of this for loop we’ll print all of the content inside this document so i’m typing in for para in doc.paragraphs print para.text right so we’ll take in each paragraph over here and it will print each paragraph with the help of this for loop right now let’s go to jupyter notebook and actually perform this so i’ll start by importing the library which is docx so i’ll type in import docx now i’ll go ahead and create an instance of this so docx.Document and i’ll name the instance as doc so now that i have the instance i can add the content into this so i’ll type in doc.add_heading now let me add the heading so i’ll just say heading 1 over here and then i’ll add a couple of paragraphs so this time doc.add_paragraph and inside this i’ll type in p1 and then i’ll add another paragraph so i’ll type in doc.add_paragraph again and i’ll add p2 inside this after this i’ll add a page break so i’ll type in doc.add_page_break and i’ll click on run now after the page break i’ll give another heading so doc.add_heading and i’ll give the heading as heading 2.
so all of this is done now it’s time to save this document so i’d have to type in doc.save and i’ll give the name of the file so i will save this as let’s say mydoc.docx all right so now i have successfully saved this file now it’s time to read this document which i’ve just created so i’d have to import Document so i’ll type in from docx import Document all right so now that i’ve imported this i’ll just give the name of the doc which i’ve just created so mydoc.docx and i will store this in a new document again and i will name this as d1 right so now i have created a new instance of the document which i had created earlier now i’ll use the for loop so for para in d1.paragraphs and i’ll just print this out so print para.text so now we have the content which is present in the document over here heading one para one para two and then we added a page break after that we have the heading two right so this is how we can work with the docx module now let’s head on to the natural language toolkit so nltk is one of the most widely used packages for the purpose of natural language processing so nltk provides multiple packages to perform different operations such as tokenization pos tagging named entity recognition and sentiment analysis right so nltk provides very easy to work with modules and with these modules all of your work becomes much more organized and much easier now again let’s see how we can install the nltk package so again we’ll have to go to anaconda prompt and we’ll give this command inside anaconda prompt over here so we’d have to type in conda install -c anaconda nltk and this is how we can install the nltk package right now once we install the nltk library we have something known as the nltk corpora now you must be wondering what exactly is a corpora so before understanding corpora let’s understand what is a corpus so a corpus is a huge collection of written texts and the compilation of these corpuses
is what is called corpora now nltk provides this huge corpora for multiple purposes so it has got all of these different corpuses pertaining to different domains and we can use it to perform various nlp tasks right now let’s see how we can download this nltk corpora so once we install nltk we’ll have to go to jupyter notebook and then type in import nltk and then after that we just have to type in nltk.download() so this would give you a downloader window now all you have to do is click on download and this will install all of the corpuses which are present in nltk so this is how you can import a corpus from nltk so nltk has this corpus with the title names and we are importing this corpus so i’ll just type in from nltk.corpus import names and this just has a list of all of the different names so to have a glance at all of those names i’ll type in names.words() right now let me go to jupyter notebook and just show you how we can do this right now let me start off by importing from nltk so from nltk.corpus what i want is names so this is basically the name of the corpus which we are importing right now i’ve imported this corpus now i want to have a glance at the list of the names which are present so i’ll just type in names.words() and i’ll click on run so this is the list of the names which is present inside this corpus so what will we be doing in this session well the first thing that we’ll be doing in this session is that we’ll be taking a look at the global data for covid-19 that’s available it’s an open source data so you can use it whenever you want and however you want to and we’ll be building a model that can tell us the number of confirmed cases and fatalities that are expected to occur the next day and that will be done by analyzing the trend of the data that we have right now so our model will do all of that for us uh just a quick reminder for those of you who are not comfortable with python you can
stick with me till the end of this session and i’ll guide you through some resources that can help you out with that also yeah another thing that you need to understand is that we aren’t going to be diving deep into the mathematics of how random forest works i’ll show you the general overview but it doesn’t really matter whether you are a very math heavy or technical person the point of this session is to show you that even if you are not very educated in math or even if you are not very comfortable with math you can still get around with using machine learning algorithms using data and making sense out of the data that you have so with that in mind let’s take a look at what we have next so let’s take a look at the data that we’ll be using now the data as i’ve already told you is open source so you can use it whenever you want to but basically we’ll be using the global covid-19 data this data contains information about countries it contains information about regions it contains information about the population and it gives you a column named target target basically tells us that the data we have is of either the confirmed cases or the number of fatalities so as you can see in the screenshot that i have right below me we have cases for afghanistan and we have certain columns that are having the values of nan nan stands for not a number which here means the value is not available so the problem is that when you have a data set there are many times when you don’t get data that’s completely accurate there are going to be some missing values there are going to be some values that are out of order and not in the correct format and there could be a lot of other things so that would be the problem so if you want to get this data set you can get it from kaggle there are competitions going on on covid-19 this data is available on kaggle right
now you can take a look at the covid-19 competition and you’ll get the data you’ll get the problem statements and everything but just the data is enough for you to get started so what is random forest random forest is the algorithm that we’ll be using in this session we’ll take a look at why we are using random forest in a bit but first let’s understand what is random forest so random forest is a supervised learning algorithm so for those of you who are unaware supervised learning is a type of machine learning in which we have a lot of data and we have the thing that we want to predict as well as the things that we want to use to make the prediction so to give you an example if you want to make a prediction about a person’s estimated salary the salary would be the thing that you want to predict and the other things such as the person’s age qualification uh the college that they graduated from the cgpa or gpa that they had when they graduated all of these things when taken into consideration are the things that we will be feeding our model to get the estimated salary of a person so these are the kind of things that supervised learning helps us do it helps us firstly train our model on a labeled set of data and then give it some new data and it will predict the value that we want it to predict like the person’s salary or whatever we want to predict so that’s why we do it now random forest as the name suggests is an algorithm that’s also called an ensemble method basically instead of using one very big complicated algorithm it uses decision trees but it uses them in a very uh smart way so that we can train multiple decision trees on the data that we have and then we use those trained decision trees together to make the decision that we want how does it make the decision let’s take a look at that so let’s firstly understand why we use random forest so random forest is used to solve a
wide variety of challenges random forest is used to solve a really large set of problems and it’s mainly because it provides many benefits such as less overfitting so our model will not be able to memorize the entire data so overfitting basically means when you try to train a model and it essentially just memorizes all the mappings so when it gets the training set and some of the values match then what it will do is that it will compare the values and then just give the output which looks the most similar to it instead of making a calculated intelligent decision it will basically be remembering stuff mugging up stuff and then giving you the answer which is not the way that it’s supposed to be and when we overfit our models then we get really good accuracy like 100 percent accuracy many times on the training data but then the problem is in the real world it doesn’t perform really well because in the real world it’s not able to identify the pattern that can map the data that we have to the output that we want so random forest allows us to reduce that problem because it uses multiple decision trees and it gives us high accuracy now that depends on the data that you have if you have a really large set of data then high accuracy is something that you can be quite certain of getting when you’re using random forest so there are many things that you need to take into account when that happens especially the quality and the quantity of data but if everything is in the right format and it’s in the right shape then you can expect a higher accuracy from random forest and then it comes to one of the most important aspects which is that it deals with missing data quite easily so as i’ve already told you that if you have columns that have a lot of missing values then it’s probably better not to use them but if you have columns that have some missing values let’s say that there are 10 rows of data
that you have and one of the columns doesn’t have data for two of the rows so essentially you are supposed to have 10 values but you only have 8 decision trees and random forests could fill up these data points by just figuring out what exactly is the data that we have figuring out a trend in the data and then filling those uh missing values using that now again it’s important to understand that we don’t have to go overboard with this sometimes what happens is we are missing around 50 percent of the data that time it’s not really a good idea to use a random forest or decision tree to predict the values that those columns should be containing because you don’t have that much data available to train your model to begin with so it’s better to drop the data or to insert the data manually like filling it all with zero or filling it all with the mean value or doing some complicated things by separating the data points into groups and then filling the means of those groups in those columns there are many ways of doing this but if you have some missing data it’s quite easy to uh use random forest to get the data inside those columns now how does random forest work again we will not be going into the mathematical intricacies of the entire working of random forest but this should give you a good understanding of how it works so basically in random forest we train multiple decision trees and what are decision trees we’ll take a look at that in a moment but think of decision trees as flowcharts that are derived from the data that we have so we make multiple decisions multiple true or false decisions and if we make enough decisions and we come to the end of the tree then what we realize is that that is the answer so we use that after training those multiple decision trees what we do is we make predictions and we use all the decision trees to make the prediction so let’s say that i have trained a
decision uh tree i have trained a random forest to predict the stock price of a particular company in the coming few days now what i will do is let’s say that there are three decision trees that are trained inside a random forest and then i will ask all the three decision trees to make the prediction for the next couple of days let’s say the next seven days then when all three of those trees have made the decision and they have given us an output the majority decision is what is considered to be the final decision of a random forest when we are classifying and when we are predicting a number like a stock price the forest takes the average of what the trees have predicted so for a classification i have to take into account how many decision trees have predicted a particular value and if the majority of them have predicted a particular value then that value is the one that i use so to give you an example let’s take a look at this so uh suppose that i want to create a model that predicts whether a person has a particular disease or not so as you can see in the image down below we have a random forest it contains three decision trees and what it does is that two of those decision trees give you the output of true that means the person does have uh the said disease and one decision tree gives you the value of false that means that since the majority of the trees are pointing towards true the decision taken by the random forest is going to be true so the person does have the problem that we are talking about so let’s take a look at what decision trees do so in each decision tree what we do is we try to split the data on the basis of a particular feature so let’s say that i have a data set in which i have the person’s name i have the person’s age and i have the person’s qualifications so as you can see the data will be quite varied depending on the person’s name age and qualification then what we will do is we will try and find out which of the
particular columns namely name age and qualification allow us to best separate the data into separate parts so that way we can figure out what we’re trying to do and that again takes a lot of mathematical computations but you don’t have to worry about that once we figure out the feature that we want to use and we figure out what is the value on which to separate then we do this again because we have three features and we have multiple ways of splitting the data so we keep doing that until we reach some sort of a conclusion and when we do reach a conclusion then we make the predictions based on the rules that we have come up with on how to separate the data let’s take a look at an example a very popular example when it comes to any sort of machine learning or data science algorithm is the iris dataset it basically contains some information about flowers such as the sepal length petal length sepal width and the petal width and we try to figure out whether a particular sample is an iris setosa virginica or versicolor type of flower these are particular groupings of flowers or particular types of flowers so let’s say the decision tree comes up with this uh this kind of um formula so right at the beginning it will take a look at the petal length and it checks whether or not the petal length is less than 2.45 if it is then it’s setosa so as you can see all of the petal lengths in the data that we have here are less than 2.45 so they are all iris setosa but let’s say that it’s not less than 2.45 but it is less than 4.95 so if the petal length is less than 4.95 centimeters then it’s versicolor and if that’s not true as well then it’s finally virginica so essentially any data point that has petal length less than 2.45 it’s setosa any data point that has the petal length between 2.45 and 4.95 it’s versicolor and any data point
that has a petal length above 4.95 it’s virginica so this is how decision trees come into play and we use multiple decision trees and train them again and again so that’s why we don’t run into the problem of overfitting as much because we’ve trained them again and again using different parameters and we’ve gotten decision trees that work differently on different data so the majority decision brings in that so with that in mind we come up to the hands-on for the global prediction so let’s take a look at the hands-on for that we have a jupyter notebook and this notebook is taken from kaggle so we’ll be explaining to you how these things work and how the person who has come up with the code is doing it so what we’re doing here right at the beginning is we’re importing everything that we need to now it’s not necessarily the case that you need to import everything right at the top but it’s really a good idea to import everything right at the beginning because then the person who’s taking a look at your entire notebook can understand what are the libraries that you have used whether they have those libraries or not if not then they can install them and then run your code so these are the libraries that they have used and to be quite honest they are not really esoteric libraries many data scientists use these libraries quite frequently matplotlib pandas numpy seaborn so these are for data manipulation data extraction and visualization plotly is an advanced visualization library and they have used the datetime library to convert the dates into multiple formats which we’ll be taking a look at in a moment now i’ve already run all the cells because the data is quite huge and it would take a lot of time for me to run the entire data set through the entire process and i don’t want to waste your time so this is how it’s happening then we have imported multiple things from the uh sklearn library we have
imported the random forest regressor the reason why we’re importing the random forest regressor is because it’s a regression problem we’re trying to predict a continuous value so the number of people who have been found as confirmed cases could be 14 could be 45 could be 60 could be 61 so they are not something that you can predict as true or false or one group or the other what’s happening here is we’re trying to predict a continuous numerical value and that’s why we’re using random forest regressor then we have the standard scaler and label encoder to rescale our data to a particular standard scale and to convert categorical data into numerical data finally we have pipeline and train_test_split we’ll understand what that means and we’re telling matplotlib that whatever visualizations it produces need to be displayed inside a jupyter notebook so with that in mind we have data sets such as train.csv and test.csv again this has been taken from kaggle so they have provided the data set it has train.csv and test.csv which is the training data and the testing data so with that we firstly take a look at the training data so as you can see we have the id which is not really useful in our scenario but it could be that it has been taken from a database that contains the entire ids and it has been converted into a csv file as you can see most of the top values of county are not available and province state is not available this gives you an insight into what a data scientist would have to do on a day-to-day basis which is essentially just take a look at the data and if the data that you have contains a lot of missing values then what do you do and we’ll take a look at how to deal with this we have country region we have the population weight date confirmed cases and the target value so we’re
taking a look at the target which specifies whether the target value is the confirmed cases or the fatalities so on 23rd of january the confirmed cases and fatalities in afghanistan were zero so there were no corona cases found and similarly we have the test data and in the test data what we have is the entire data set but we are missing the target value which is something that we have to predict so that’s what we’re supposed to do here now we take a look at the shape a shape basically means the rows and the columns so as you can see this is the number of the rows that we have which if you see is quite large so we have nine lakh seven thousand three hundred and six rows which is a really large data set this is why i didn’t want to import the data again and run the entire jupyter notebook again for you and for the testing we have three lakh eleven thousand six hundred and seventy and there are nine columns in the training data and eight columns in the testing data because we have to predict the ninth column now we take a look at the number of missing values in the training data and the testing data and as you can see 83 840 which is uh quite a lot if you take a look at the data that we have so we will be taking a look at that and we have province state which is 4048 994 so we have quite a lot of missing data in county and province state in both testing and training so with that in mind and by the way this is all being done by pandas for you so pandas is the library that we use to read the csv file then we take a look at the head of the file and it gives you this nicely formatted table it is displayed well in your notebook and then what we do is we take a look at how many null values are there and the sum of the number of null values in each column so this is what we have now what we do is we take the id and the forecast id the id will be taken from the training data set and the forecast id will
be taken from the testing data set and now this is just to show you how to get the data out of a particular data frame and now finally what we do is we drop the columns that either have null values or are of no use to us so what we do is we remove the county because it has a lot of null values we remove the province state because it has null values and we remove id because it’s not going to help us in any way shape or form another thing is that someone has asked that they want me to recommend some free courses to you uh i will be recommending free courses so if you want to learn more about that then stick with me till the end of the presentation and i’ll be showing you some resources that you can learn from free and paid as well so with that in mind let’s continue so we will drop the county province state and forecast id because county and province state have null values and forecast id is something that we don’t really need and if we put data that is of no use to us in our training set then it could negatively impact the performance of our model so we drop them so we are telling it the columns that we want to drop and then the returned data frame is the one with the columns dropped so we assign it again to our data frames and we get the dropped columns now we try to visualize the data so as you can see that the target values are there we have the weights we have the population as you can see it’s quite widespread so certain things such as the population of a country could be leading up to something like this so 1.5 billion or something like this so the most populated countries have this much population similarly you can take a look at all of the other things that we have and when we cross these things we get this so what we’re doing basically here is that we’re taking a look at the population and the target value we’re taking a look at the weight and the target value and then the target
value with the target value basically shows you that we have zero orbitals now we take a look at the pair plot or the power plot which basically takes a look at how many confirmed cases and how many fatalities we have in the training data set so as you can see here this is the number of that so we have a lot more confirmed cases than we have fatalities so it has not caused a lot of deaths for the number of confirmed cases as you can see the number of confirmed cases is quite high and with that now we take a look at the target and the population so confirmed cases and fatalities they have quite a good amount of confirmed cases and fatalities according to the population of a country and we are using the training data again so now this is where plotly comes into play what it does is that it creates a pie chart or a donut chart for you and it shows you the target value the confirmed cases and all of that so as you can see the majority of cases are in the us which is 53.7 percent of our entire dataset it could be that our data set is not completely correct so it could be that there are some values missing maybe 46 but the majority of the data that we have belongs to the citizens in the us as you can see it’s right there and again uh you don’t need to know any math to do this you just need to understand how to write code for this we create a figure we create a pie chart we update the traces and once you understand plotly you understand how you use these things basically we just tell it that we want the text inside the slices of our pie chart and with that we give the uniform text size and if the text is too large to fit then we hide it so on and so on and then what we do is we group the data by the country region and then what happens is we take a look at the
target value so after grouping it we sum it and then we take a look at the target values as you can see in afghanistan there are 16 000 15 cases and you can take a look at any number of cases here as you want to so as you can see in the us there are a lot more people than there are in afghanistan and with that as you know we have nine lakh cases uh in our data set and out of those five point seven lakhs are from the us so with that we get the largest five values of target values so the top five most affected countries are the ones that we have here and we have the us brazil we have russia united kingdom and spain so these are the top five countries that have been most affected with covid-19 and this is what our data indicates and so as someone has asked what if i have to find the data of a particular country so if you want to find the data of a particular country then as you can see we have it here we’re printing the target value of all the countries if you want to find it for a particular country then you have to type it like this here let’s say that i want to find it for india uh just make sure that you have written the word india correctly here with the i capitalized from the given data that we have it seems that the first letter of every country is capitalized so again this is something that you will learn once you learn about matplotlib and pandas and all of these libraries i won’t be running this again because for this to run i would have to re-execute the entire python notebook and that would take a lot of time so i just showed you how to do it and you can do it on your own with that in mind we have this and now if you want to visualize this again just three lines of code is required here so again no mathematics was required as of now what we’ve done is we’ve created a cat plot in which we have these five values we have the target value as the x-axis and the population is the
y-axis and this is what we have now with that we get the top five most populated countries in the world and we have china india us indonesia and brazil and these are the top five so we have the nlargest function that’s doing this for us and now what we do is we come up to some interesting visualizations so this visualization is called a tree map it allows us to basically create a tree or a jigsaw like structure in which the larger the value of a particular country or a particular data point the more area it gets in the entire plot so as you can see if i hover over us i get this uh interesting hover menu in which we get all the values that we want the target value the country population and many other things you go to brazil you go to russia you go to india you will get a lot more so as you can see india is quite densely populated so it’s getting all of this then there’s united kingdom iran turkey and as you go down you can take a look at other countries as well but those countries don’t have that much of population so the squares are quite small now that is done but as you can see the uh date and time was not in the correct format so if i were to show you the date that we had earlier take a look at the data that we had we had data like this which is 2020-04-27 now a problem with this is that it cannot be converted into a numerical form directly because there are dashes in it if there’s any character in it that’s not a number then that’s a problem so to convert it into completely numerical what we do is we use the datetime library that we have already imported but we need to use it correctly so here’s what we do right beneath the tree map we get the date and we convert it into the datetime format and then we convert it into years months and days so as you can see 20200424 which is uh 24th april 2020 so this is how they convert it and the reason why we convert it this way so that it could be numerical
is mainly because if you don’t do it this way then you get values that are string based or character based and machine learning models don’t work on data like that similarly as you can see we have confirmed cases we have names of the countries and these are the things that we need to convert from categorical to numerical categorical is a thing that’s not numerical for instance us brazil the computers and the algorithms will have no understanding of what these things are these could be anything as far as they’re concerned similarly confirmed cases and fatalities they have no idea what that is so with that we create some interesting plots these are called heat maps heat maps will show you the date and the number of cases that they got on those dates and we have gotten just a few dates so that we can take a look at that and as you can see the majority of these are quite uh empty but china got these cases quite early on so that’s why we’re getting this here and then either they didn’t have the data or they had no cases so that’s why we’re getting these values and we can explore more about this data later on we can explore more into the dates and what are the missing things and all that similarly we have it here we have it here and this uh allows us to understand just how many cases were coming each day so the brighter red the cell the more cases on that day so if it’s light like green or yellow then it’s not that many as you can see on the legend on the right hand side here then we take a look at the uh nlargest values here as well and we get the same thing we get target values we get fatalities and as you can see china is right here and we get the values and then we create again the heat map but this time we create it for a larger date range and this is what we get so as you can see these are the countries and again as you can see bangladesh had data till this date and
then later on it didn’t similarly for others we don’t have the exact same thing so now comes the time to understand what are the columns that don’t have numerical values basically what we do is we say in the training data set get the column names that are of type object object is anything that is not numerical so we have country region and target and similarly in the test one we have the same thing so now it’s time for us to convert our categorical data to numerical data so we get label encoder and what it does is that it assigns a unique number to each distinct value that we have so for instance if we had a column in which we had male and female as the only values that it contained let’s say that it’s an employee one so it will assign the value of zero to female and one to male since it goes in alphabetical order and then whenever it sees female in the data it converts it into zero and if it is male it converts it to one similarly we keep going on and on so we do that here and then we get the data that we want we get the first column which is the country region and then we convert it from string into the numerical data that we want we do the same thing with the training data and we do the same thing again but for the other column which is the target column and now as you can see this is what our data looks like afghanistan i think is given the value zero here and confirmed cases is given zero and fatalities is given one so now everything as you can see is numerical that’s good now what we do is we separate our data into x and y data so x and y basically is something that we take a look at when we’re trying to understand what is the data that we have x are the values that we want to use to make the prediction y is the thing that we want to predict so with that in mind we take the data and then we split it into training and testing so we already have some testing data but what we’re doing
here is creating a little more testing data, so that we can check on a small piece of data first. So we're splitting into training and testing data, with 20 percent of the data set for testing and 80 percent for training. Since we have nine lakh rows, it's quite easy for us to hold out ten or twenty percent; if we had smaller data, like one lakh or fifty thousand rows, we would do something like a 70/30 or 75/25 split. Now comes the interesting part: we create a machine learning pipeline, or data science pipeline, which describes the process. We use the StandardScaler, which rescales each feature to zero mean and unit variance. It does not remove outliers; it puts all the features on a common scale, so that it's easier to crunch through the entire data set and calculate the things that we want to calculate. And we'll be using the random forest regressor here. So the pipeline converts the data into a standardized format and then uses the regressor to build a random forest regression model. We fit the pipeline on the training data, that is the values we use to make the prediction and the values that are going to be predicted, and then we make the prediction on the testing data that we had already set aside. Now we take a look at the score, and we get 95 percent, which is quite good, to be honest (for a regressor, a score like this is typically the R-squared value rather than a classification accuracy). On any large data set, if you get around 80 percent or above, that is quite a good thing, because for a machine to be able to learn all of these patterns and then predict with a score of about 80 is quite good. Ours is 95.4, which is quite good. And now for the test data that we had already gotten from Kaggle: we made the predictions, we converted them into a data frame using pd.DataFrame, and here we have it; we add the forecast id and we get the target values.
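As a rough sketch of what the standard scaling step does, here is a plain-Python version of the same rescaling. This is not the scikit-learn implementation; the helper name `standardize` and the sample values are made up for illustration. Each value has the column mean subtracted and is divided by the column standard deviation, so an extreme value is rescaled along with the rest, not removed.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical column with one extreme value (not from the session's data).
cases = [10, 12, 11, 13, 500]
scaled = standardize(cases)

# The extreme value is still present after scaling; it is just expressed
# in units of standard deviations from the mean.
for raw, z in zip(cases, scaled):
    print(raw, round(z, 3))
```

Tree-based models such as a random forest are generally not sensitive to feature scale, so in a pipeline like this the scaling step is mostly a harmless convention.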
Here we have it, and this is what our data frame looks like. As you can see, for the first value we get 106.6 and for the second value we get 5.2, which means that for the first day we'll be getting about 106 confirmed cases and five fatalities. These values are floating-point numbers, in decimal format, and that's because we're trying to predict continuous values, so you could get decimals, but we can just round them: 107 and 5, 107 and 1, and 155. So these are the values that we have gotten. We have now done what the training data allowed us to do: we created a model, and later on, if we get some new data, we will use that data to make the prediction. We just have to pass it to our model here; instead of test, we pass in one value, and the model will give us the prediction, where the first value is the confirmed cases and the second value is the fatalities. So we'll be making the predictions like that. Now again, as you've seen, we didn't have to use a lot of mathematics in this case, mainly because for solving normal problems you don't need a lot of complicated mathematics. With that, we have come to the end of the hands-on.

So, fraud by definition is an act of deception that is used illegally to deprive a person, an organization or an entity of money, property or legal rights. If you talk about financial frauds, which are the most common type of fraud, a typical organization loses around five percent of its yearly revenue to fraudulent cases. According to an RTI report, there were around 2,480 cases of fraud in 18 public sector banks, which involved around 32,000 crores, and according to an RBI report, in 2017-18 a total of 911 credit card fraud cases were reported, which amounts to around 65 crore rupees that were illegally transferred from different bank accounts. So now let's
discuss some of the types of fraud that we can encounter in our daily lives. Among the most common types nowadays are online frauds, where a person's account may be hacked and all their money transferred illegally to different accounts, and credit card frauds, where stolen credit cards and credit card numbers are used to make illegal transactions. Then we have thefts and theft of inventory, when someone breaks into a bank or an ATM, and then we have other types of fraud where someone can steal your idea or your professional work. So now we know what frauds are, what the different types of fraud are, how they can affect different organizations, and how important it is to detect them and save ourselves from heavy losses. Now we'll discuss the rule-based, or traditional, approaches that are used for detecting fraud. In the rule-based approaches, the algorithms are written by fraud analysts and are based on strict rules. If any changes have to be made to detect a new type of fraud, they are all done manually, either by changing the existing algorithms or by creating new ones. In this approach, whenever the number of customers and the amount of data increase, the human effort also increases. So in conclusion we can say that the rule-based approaches are time-consuming and costly. Now let's discuss some of the drawbacks of the rule-based approaches. One drawback is that a rule-based approach is more likely to produce false positives. This is a type of error where the outcome for a transaction depends entirely on the rules and guidelines that we have written into the algorithm: if you have a fixed rule and a fixed threshold, and a transaction is rejected where it should not be rejected, you end up with a high rate of false positives. These false
positive conditions actually result in losing a lot of genuine customers. Rule-based approaches also cannot recognize hidden patterns, since all of the rules are written by the fraud analysts, and they cannot predict new types of fraud, because they always abide by the rules that the fraud analysts have written. That leads to a system that cannot respond to new situations which it was not explicitly programmed for. That brings us to our next topic: how we can use data science to solve these problems and use data science for fraud detection. The modern alternative is to leverage the vast amounts of data collected from online transactions and model it in a way that allows us to flag or predict fraud in future transactions. For this, data science techniques such as machine learning and deep neural networks are the obvious solutions. So here I'm going to show you an example of how machine learning techniques can be used to identify fraud in financial transactions. Machine learning is a broad field; it has a large collection of algorithms and techniques that are used for classification, regression, clustering or anomaly detection. The two main classes are supervised and unsupervised learning. Supervised learning is used to predict either the values of a response variable, which is called regression, or the labels of a set of predefined categories, which is called classification. In supervised learning the algorithms learn how to predict unknown samples based on samples with known response variables and labels. In fraud detection we are technically dealing with a classification task: for each sample, that is each transaction, the predefined label tells us whether it is a fraud transaction or not. However, there are two main problems when using supervised learning for fraud detection. The first problem is data labeling. In many cases fraud is actually
difficult to identify. Some cases will be obvious; those will be easy to recognize with rule-based techniques, and they usually won't require complex models. Where it becomes interesting is the uncertain cases, which are hard to recognize because we don't usually know what to look for. This is where we use the power of machine learning. But because fraud is hard to detect, our training data sets, which are collected from past transactions, are probably not classified correctly in many of these cases. This means that some of the predefined labels will be wrong and some of the transactions will be wrongly labeled, and in that case our supervised machine learning algorithms won't be able to learn how to find these types of fraud in future transactions. The next problem that we face is unbalanced data. An important characteristic of fraud data is that it is highly unbalanced, which means that one class is much more frequent than the other. In a fraud detection example, less than one percent of all the transactions will be fraudulent, and most supervised machine learning algorithms are very sensitive to unbalanced data. So we have to use different techniques to balance the data before we actually use the supervised machine learning algorithms to predict future fraud cases. The next class of methods that we can use is unsupervised learning. It does not require predefined labels or response variables; it is used to identify clusters, outliers or anomalies in the data. In our fraud data set we don't trust the predefined labels to be 100 percent correct, so there will be a lot of incorrect labels as well, but we can assume that the fraudulent transactions will be sufficiently different from the vast majority of regular transactions, so that our unsupervised learning
algorithms will flag them as anomalies or outliers. In unsupervised learning we can use dimensionality reduction; one type of dimensionality reduction is called principal component analysis, or PCA. In preparation for our machine learning analysis, dimensionality reduction techniques are very powerful tools, and they are used to identify hidden patterns in the data. Whenever we have high-dimensional data, it is better to use PCA, so that we can reduce the number of features for machine learning while preserving the most important patterns in the data. Similar approaches can be used as well: clustering algorithms such as k-means clustering can find patterns in the data using a chosen number of clusters. In this video we are going to use a supervised machine learning model on our data set, so that we can predict future fraud transactions. So now let's discuss some of the challenges that we'll face while building a machine learning model on our credit card data set. Various challenges arise when building models for fraud detection. The first challenge is the unbalanced data: there are plenty of legitimate cases but only a very few fraudulent cases. For example, in credit card transactions typically less than 0.5 percent of the transactions are fraudulent, which can make it difficult for an analytical technique to build an accurate model. Second, depending on the exact application, operational efficiency may be a key requirement: the fraud detection system might only have a limited amount of time available to reach a decision. In a credit card fraud detection setting, the decision time to let a transaction pass or not is typically less than 8 seconds. The third challenge is incorrect flagging: if a transaction is flagged as fraudulent for a good customer, you risk losing this customer because of the harassment caused by flagging the transaction as fraud. So now let's discuss
some of the techniques that we can use to solve the first challenge, the unbalanced data. In a typical data set, most of the transactions will be legitimate and very few will be fraudulent. If you train your classifier on this, it will tend to favor the majority class, that is the legitimate transactions, which results in a large classification error on the fraud cases, since it has not learned much from the little fraud data available to it. Classifiers actually learn better when they are trained on a balanced distribution rather than on unbalanced data. So now we'll discuss sampling methods to solve this unbalanced data problem. We always perform the sampling on the training set, not on the test set. We divide our data into two sets, the training and test sets. We build our model on the training set, and before building the model we balance the training set, so our model learns on balanced data, and then we use the original test set, which is still unbalanced, to predict whether a transaction is fraudulent or legitimate. The first technique is called random over-sampling. In this technique we over-sample the minority class, which is our fraud cases: we copy the fraud cases that are already present, multiple times, until we reach the number of cases that we want in our data set. The other technique that we can use is random under-sampling. In this technique we under-sample the majority class, which is our legitimate cases: we remove some of the legitimate transactions from our data set, and we keep down-sampling until we have a data set that has an equal, or almost equal, distribution of both the
fraudulent and the legitimate cases. Or we can do both, so we can up-sample the fraud cases and down-sample the legitimate ones. Either way, the aim is the same: classifiers learn better from a balanced distribution, so if we train our model on a balanced distribution it is more likely that we'll have a lower classification error for our fraud cases.
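Random over-sampling and random under-sampling can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names `random_oversample` and `random_undersample`, the toy rows and the target size of 100 are all made up, and a real project would more likely use a sampling library. As described above, this balancing would be applied to the training set only.

```python
import random


def random_oversample(minority, target_size, rng):
    """Duplicate randomly chosen minority rows until target_size is reached."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra


def random_undersample(majority, target_size, rng):
    """Keep a random subset of the majority rows, drawn without replacement."""
    return rng.sample(majority, target_size)


rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Toy training rows: 1000 legitimate transactions and only 5 frauds.
legit = [("legit", i) for i in range(1000)]
fraud = [("fraud", i) for i in range(5)]

fraud_up = random_oversample(fraud, 100, rng)     # copies of existing frauds
legit_down = random_undersample(legit, 100, rng)  # discards 900 legit rows
balanced_train = fraud_up + legit_down

print(len(fraud_up), len(legit_down), len(balanced_train))
```

Note that the over-sampled fraud rows are exact copies of the original five, and the under-sampled set has thrown away most of the legitimate rows.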
you'll find that most data sets have a large number of legitimate cases and very few fraudulent cases, since in credit card transactions less than 0.5 percent of the transactions are fraudulent. A classifier or machine learning model trained on this type of data will tend to favor the majority class, that is the legitimate cases, which means that it will show a large classification error on the fraud cases and will not be able to identify which case is a fraud. Classifiers learn better from a balanced distribution, where the two classes, legitimate and fraud, are roughly equal, than from unbalanced data. So, to summarize the sampling methods used to balance an unbalanced data set: the first technique is random over-sampling, where we increase the number of fraud transactions in the data set by creating duplicates of the fraud cases that are already present. The next technique is random under-sampling, where we decrease the number of legitimate cases to get a balanced distribution. We can also perform both random over-sampling and random under-sampling, increasing the number of fraud cases and decreasing the number of legitimate cases, and this is how our data set will look after performing both the over-sampling and the under-sampling. But the problem with over sampling is
that it is done by creating duplicates of the fraud cases that are already present in our data set, which means that we will be training our model on a lot of duplicate values that don't add any new information about the variance in the data. And the problem with under-sampling is that we end up throwing away a lot of useful data and information, which is not preferred in general. That brings us to our next sampling technique, which is called the synthetic minority over-sampling technique, aka SMOTE. In this technique we over-sample the fraud cases by creating synthetic fraud cases. If you look at the scatter plot on the right-hand side, the green dots represent the legitimate cases and the red dots represent the fraud cases. In SMOTE we find the k nearest neighbors of a fraud case, let's say x here, among the other fraud cases, and let's take the value of k as three, so we find the three nearest neighbors of x. After that we randomly choose one of x's nearest neighbors, let's say y here, and then we add a synthetic sample using the following method: we take the coordinates of both points and compute the coordinates of a synthetic point as synthetic point = x + 0.3 × (y − x), applied to each coordinate. So the synthetic point lies on the line that joins x and y. Here the number 0.3 is a random number between 0 and 1, so we can use different numbers between 0 and 1 to get different synthetic samples on the line that joins x and y. We can also repeat this step multiple times, for each neighbor of x, to get the desired number of fraud cases in our data set, and hence we get a balanced data set after all.

So now let's go to RStudio to implement a credit card fraud detection model using SMOTE. Let's build a fraud detection model in R. First of all we import the data set that we are going to work on. I've stored my data set in this particular folder, so I load this data set using the read.csv
function and store it in the credit_card data frame. So on this line your data set will be imported, and after importing it we'll look at what our data set looks like. After your data set is imported, you can click on the data set name in the variable explorer, and once you click on the name you will find that our data set contains 284,807 entries, that is rows, and 31 columns. This data set is about different transactions made by different credit card users. The first column here, as you can see, is the time, so this is a time stamp, and then we have columns from v1 to v28. These columns are reduced versions of the original features, obtained using dimensionality reduction: the original feature columns are not included in this data set, and what we have instead are features extracted from them using PCA, so that the confidentiality of the users is maintained. After the 28th of these columns we have the amount column, which represents the amount of the transaction that was made using this particular credit card, and the last column here, as you can see, is the class column, which contains two values, zero or one. Zero represents that the transaction was a legitimate transaction and one represents that the transaction was a fraud transaction. After that, let's look at the structure of the data set and see the different types of variables that we have. If you use the str function and write the name of your data set, that is credit_card, you'll see we have 31 variables, that is columns, and most of the variables are numerical columns. The last column here, as you can see, is class, which is shown as an integer column, but in reality it is not an integer column, it is a categorical column. It has two categories, 0 and 1, and 0 here represents the legitimate transactions
and one represents the fraud transactions. So first of all we will convert our class column to a factor column with two levels, zero and one, so that we can work with our data set later on. We convert it using the factor function: I pass the name of the variable, that is credit_card$class, and then the two levels, zero and one, zero for legitimate cases and one for fraud cases, and we store the result in the same variable, credit_card$class. Once you run this, if you look at the structure of the data set again, you will see that our class column is now a factor column with two levels, zero and one; zero represents the legitimate transactions and one represents the fraud transactions. Now let's look at the summary of the data set to display the summary statistics of the different columns. Here you can see the last column is the class column: 284,315 entries are legitimate cases, that is zero, and only 492 entries are fraud cases, that is one. Most of the other columns are numeric columns with a mean of zero, and the first column is our time stamp. Before moving ahead, let's check whether there are any missing values in our data set. We use the is.na function to check for missing values, and then the sum function counts the number of missing values, if there are any. Once we run this, we see that we don't have any missing values in the data set, so it is okay for us to continue and work with it. Now we turn to the class column. First of all let's see how many fraudulent and legitimate transactions there actually are in our data set. To get the distribution of fraud and legitimate transactions in the data set we use the table function, and inside the table function we'll write the variable that we want
the distribution for. The labels of our transactions are stored in the class variable of our credit_card data set, so we use the table function on it to get the distribution of the different transactions. After running this command you can clearly see that zero represents the legitimate transactions and that most of the transactions in this data set are legitimate: there are 284,315 legitimate transactions and only 492 fraud transactions. If you want the percentages, you can use the prop.table function, and inside the prop.table function you write the table function: table gets you the counts and prop.table turns them into proportions. Here you can see that more than 99.82 percent of the transactions in this data set are legitimate and only about 0.17 percent (a proportion of roughly 0.0017) are fraud transactions, so clearly this is an imbalanced data set. Now let's also plot a pie chart to show the distribution of legitimate and fraud cases in our data set. We first set the labels of our pie chart. The labels are legit and fraud, so we create a vector that contains those two strings and store it in labels. After that we concatenate the percentages of the legitimate and fraudulent cases with their respective labels, using the paste function. The first vector that we pass to paste is labels, which contains the two values legit and fraud, and then we pass the result of the round function, which holds the percentages: we take the prop.table that we have just seen above, multiply it by 100 to get percentages, and then round the percentages to two digits, so we'll only
get two digits after the decimal point. So the round function gives us the percentages to two decimal places, the labels vector holds the two strings legit and fraud, and we paste them together and store the resulting labels. If you run this command and look at labels now, they look like this: the legit cases are 99.83 percent and the fraud cases are 0.17 percent, rounded to two decimals by the round function. So clearly we can see there is an imbalance in our data set. Now let's plot the pie chart to display the distribution of our legitimate and fraud cases. We use the pie function from base R, and we want a pie chart of the different categories of the class variable in our credit_card data set, so we use the table function to get the counts of the categories. Our labels will be equal to these labels, and the colors will be orange and red: orange represents the legitimate cases and red represents the fraud cases. And then this is the title of our pie plot, so the title will be pie chart of credit card transactions. Let's run this command to get the plot. So this is our pie chart for our credit card transactions, and here you can see that most of it is orange: around 99.83 percent of the transactions are legitimate and only 0.17 percent are fraud. Now, if you build a model on this particular data, you'll get a high accuracy almost by default. To see that, we'll first of all make a model that is no model: even if you don't have any machine learning model at all, and you simply predict that every single transaction in this data set is a legitimate transaction, you can see what kind of accuracy you get. So first of all we'll predict that all the
transactions in this data set are legitimate transactions. We store our predictions in the predictions variable, and we repeat the integer 0 as many times as there are rows in our credit_card data set. The data set has 284,807 rows, so we create a vector that contains zero 284,807 times. After creating the vector, we convert it to a factor variable with two levels, zero and one, where zero represents the legitimate transactions and one represents the fraud transactions. So, without using any model, we are predicting that all the transactions are legitimate. Once you run this command, the vector is converted to a factor with two levels. After that we'll use the confusion matrix to get the accuracy of these predictions that we have made without using any model. For this we use the caret package. If you have not installed this package, you can use the install.packages command with the name of the package to install it, and after installing it you can load the package using the library function. Once you run this command your package will be loaded, and then you can use the confusionMatrix function from the caret package to get the confusion matrix and the accuracy. If you want to know more about this function, you can press F1 here; it will take you to the documentation on the right-hand side. In this function we first pass our predictions, which we have stored in the predictions variable, as data, and then we pass the reference, which contains the actual classes, that is the class column of the credit_card data set.
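The no-model baseline described here can also be checked in plain Python, using the class counts reported for this data set (284,315 legitimate and 492 fraud). This is a sketch, not the caret `confusionMatrix` call used in the session; the helper `confusion_counts` is a made-up name for illustration.

```python
# Class counts as reported for the credit card data set in this session.
legit_count = 284315
fraud_count = 492

# Actual labels: 0 = legitimate, 1 = fraud.
actual = [0] * legit_count + [1] * fraud_count
# "No model": predict legitimate (0) for every transaction.
predicted = [0] * len(actual)


def confusion_counts(actual, predicted):
    """Count (actual, predicted) label pairs for a two-class problem."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts


counts = confusion_counts(actual, predicted)
accuracy = (counts[(0, 0)] + counts[(1, 1)]) / len(actual)
frauds_caught = counts[(1, 1)] / fraud_count

print(f"accuracy: {accuracy:.4f}")            # accuracy: 0.9983
print(f"frauds caught: {frauds_caught:.4f}")  # frauds caught: 0.0000
```

The accuracy looks excellent while not a single fraud is caught, which is why accuracy alone is a poor measure on data this unbalanced.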
Once we run this command we get the confusion matrix for this model that we have built without using any machine learning. You can clearly see that we have correctly classified all the transactions that were legitimate, but we have not correctly classified any of the transactions that were fraudulent. Still, we get an accuracy of 99.83 percent, because we have flagged every transaction as legitimate and have not flagged any of the fraud transactions as fraud. Here class zero, the legitimate class, is treated as the positive class, so these are our true positives and this is our true negatives, which are the correctly classified values. Our true positives are all of the legitimate transactions and our true negatives are zero, which means we have wrongly classified all the samples that were fraudulent. Out of all the fraud transactions, we have flagged every one as a legitimate transaction, and those are the false positives. So you can clearly see that if you only check the accuracy of the model, you will be misled, because accuracy does not tell us anything useful here. Our goal is to maximize the true negatives, so that we classify most of the fraud transactions as fraud. So now let's move ahead and build a model. Before moving ahead, we'll take a small subset of our data set so that the computation is faster; later on, once you have built a model on the smaller subset, you can build the model on the whole data set, which will take some time. We'll use the dplyr package to get a small fraction of our whole data set, that is credit_card. So I'll load the dplyr package first of all, and after loading it i'll use the
sample fraction function, sample_frac, to get a random fraction of our data set, so it will contain both zeros and ones. I pass my credit_card data set using the forward pipe operator, which hands the data to the sample_frac function, and it extracts a fraction of 0.1, that is 10 percent of the rows in our data set. Right now our data set contains 284,807 rows, so after extracting 10 percent it will have about 28,481 rows. We store the result in our credit_card data frame again. We have also used the set.seed function, so that if we run this code again we get the same sample again; if you don't use the set.seed function and you run this code again and again, your sample will be different every time. So we run these two lines together and we get a sample that is 10 percent of all the rows in our original data set. Now let's look at this subset of our original data set: we have 28,481 rows and all the columns, so this is ten percent of the data set. We'll build a model on this data, and later on we can build the model on the whole data set. Now let's see the distribution of classes in this subset: we have 28,437 legitimate transactions and only 44 fraud transactions. Next, let's plot a scatter plot of two columns, v1 and v2, to see how the classes are distributed across these two variables. We'll use the ggplot function for the scatter plot, so first I load the ggplot2 library, and after loading it I use the ggplot function. The ggplot function is based on the grammar of graphics, so
it has different layers that stack on top of each other to build the final plot. Our first layer is the data layer, where we pass our data frame, that is the credit card data frame that we have just extracted from the whole data frame. Our second layer is the aesthetics layer, where we have the variables that we want to plot: on the x-axis we have V1 and on the y-axis we have V2. We want the color of the points to follow the categories of the Class column, so we’ll have two categories, zero and one, and a different color for each. Then, to plot a scatter plot, we’ll add the third layer using the plus sign here. Our third layer is the geometry layer, where we pass the function for the kind of plot we want, so for a scatter plot we pass geom_point. This will get you the basic plot. If you zoom in on this plot, you can see that this color represents our legitimate cases and this blue color here represents our fraud cases. If you want to change the background of the plot, you can use a theme function, theme_bw, that is black and white, which makes the background of your plot black and white. And if you want to change the color of the points, you can use the scale_color_manual function, and inside that function we declare our values, so our colors are dodgerblue2 and red. Now this color represents zero, that is the legitimate cases, and that one represents the fraud cases. So now let’s run the
whole command, and we’ll get the final plot with a black and white background and these colors for the points. If you zoom in on this plot, you will find that now dodgerblue2 represents our legitimate cases and red represents our fraud cases. Here also you can clearly see that we have a comparatively huge number of legitimate cases and only a few fraud cases. If you train your model on this data, the model will not be able to learn a lot, because the number of fraud cases is very small. So you have to use different techniques to balance the data set before we actually build a model on it. Before balancing the data set, we first create training and test sets. We only use balancing on the training set to train the model, and we don’t balance the test set, because that is the original test set that we’ll predict values for. So first we’ll create training and test sets, then we’ll balance the training set and train our model, and then we’ll evaluate the performance of the model using the test set. Now let’s build our training and test sets for the fraud detection model. We’ll be using the caTools library. If you have not installed this library, you can use install.packages with the name of the library to install it, and after installing it you can use the library function to load it. After loading the library, we can use the sample.split function to split our data set into two sets, that is training and test sets. If you want to know more about this function, you can press F1 here and it will take you to the documentation on the right-hand side. To split the data set into training and test sets, we pass the column that we want to split according to. Since we are going to predict whether a transaction is a fake transaction or a legitimate transaction, that column is the Class
column, which holds zeros and ones. So we’ll use the Class column to split our data set into training and test sets. I’ll pass the Class column from the credit card data set, and then I’ll mention the split ratio, which represents how many rows I want in the training set and how many in the test set. If you write SplitRatio equals 0.8, 80 percent of the rows in your data set will be in the train set and 20 percent will be in the test set. We’ll run this line and store the result in data_sample. If you look at data_sample now, it contains true and false values: 80 percent of the values are true and 20 percent are false, and the number of values equals the number of rows in our data set. Now we’ll use this object to split our data set into two sets using the subset function. We’ll use the subset function on our credit card data set, and wherever the data_sample value is true we’ll store those rows in train_data, so 80 percent of the rows from the credit card data set will be in the train data. Wherever the data_sample value is false, we’ll store those rows in test_data, so 20 percent of the rows will be in the test data. Now let’s run all these lines together. We have successfully created our train data and test data. Now let’s click on the train data in the variable explorer to see how many rows there are. We have around 22,785 rows in our training data, that is 80 percent of the rows of the data set, and if you look at the test data, we have around 5,696 rows and all the columns. So we’ll train our model on the train data after balancing it, and then we’ll test the performance of the model using a confusion matrix on the test
data. If you want to check the dimensions of your training and test data, you can use the dim function, which displays the dimensions. Our training data contains 22,785 rows and our test data contains 5,696 rows, and all the columns, that is 31 columns. Before making a model, we have to balance our data set first, and we’ll use different techniques. The first technique is called random oversampling, which means that we increase the number of fraud cases in our data set, so we oversample the minority class. Right now, if you look at the distribution of the Class variable, you can see around 22,750 transactions are legitimate and only 35 are fraud. To increase the number of fraud transactions, we’ll use this method. First we store the number of legitimate transactions in our data set in the n_legit variable. Then we store the fraction that we want for the legitimate cases. If we want our fraud cases to be 50 percent of the whole data set after random oversampling, the legitimate cases must also be 50 percent, so we’ll write 0.50 and store it in new_frac_legit. So in our final data set, after performing random oversampling, 50 percent of the rows will be legitimate transactions and 50 percent will be fraud transactions. Now we calculate how many rows there should be in our data set to accommodate legitimate and fraud transactions at 50 percent each. We divide the number of legitimate cases by the fraction, and that tells us how many rows we need in our data set to balance legitimate and fraud transactions 50-50. Once you run this, we have stored the number
of rows in new_n_total. If you click on new_n_total, you’ll find that we need 45,500 rows in our data set, and out of these rows fifty percent will be fraud cases and fifty percent will be legitimate cases. To perform random oversampling we’ll use a package called ROSE. If you have not installed this package, you can install it using the install.packages function with the name of the package, and after you have installed it you can use the library function to load the package so that you can use the functions it includes. From this package we are going to use the ovun.sample function, which will help us perform random oversampling. If you want to know more about this function, you can press F1 here and it will take you to its documentation. In this documentation you can see that the first argument is the formula argument. In the formula we mention the Class variable that we want to oversample, which is the variable we will predict later on, and then we use a dot, which means all the other variables in our data set are independent variables. The tilde sign separates the dependent and independent variables. The second argument is the data argument, which is the data that we want to oversample, so our data comes from the train data that we have just created above, our training set. The third argument you can see here is the method argument. If you go to the documentation, you can see that we have three methods, that is over, under and both.
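The row-count arithmetic described above can be checked with a few lines of Python. This is only a sketch: the session itself does this in R, the counts are the ones quoted in the transcript, and the variable names are a rendering of the spoken names plus two hypothetical helpers (fraud_needed and extra_fraud_rows).

```python
# Sketch of the row-count arithmetic for random oversampling.
# Counts are taken from the transcript; names mirror the spoken R variables.
n_legit = 22750          # legitimate rows in the training set
n_fraud = 35             # fraud rows in the training set
new_frac_legit = 0.50    # desired share of legitimate rows after oversampling

# Total rows needed so that legitimate rows make up the desired share.
new_n_total = round(n_legit / new_frac_legit)

# Hypothetical helpers: fraud rows required, and how many must be resampled copies.
fraud_needed = new_n_total - n_legit
extra_fraud_rows = fraud_needed - n_fraud

print(new_n_total, fraud_needed, extra_fraud_rows)
```

So reaching a 50-50 split by oversampling alone means nearly all of the fraud rows in the new data set are repeats of the original 35.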
So firstly we’ll perform the oversampling of the minority samples, so we’ll use method equals over. Then we have the N argument, which is the number of rows in the new data set after the oversampling is performed, so that equals new_n_total, which is 45,500 rows. We also use the seed argument, so that if we run this code again we get the same oversampled rows again. Let’s run this function now to get our oversampling result. After running it, if you click on the variable that has been created, you can see that it has different attributes, and the data is stored in the data attribute. To reach it, you take oversampling_result and then the data attribute, and we’ll store that data set in oversampled_credit. If you now look at the distribution of the classes after random oversampling, you will find that we have 22,750 legitimate cases and 22,750 fraud cases, so the fraud and legitimate cases are now equal in number. Before moving ahead, let’s plot this distribution using the same V1 and V2 columns of our oversampling result. If you click on oversampled_credit, this is our oversampled data, and we have 45,500 rows. So let’s plot a scatter plot between V1 and V2 and see the distribution of classes between these two variables. We’ll use the ggplot function again. Now our data comes from oversampled_credit, then we have V1 on the x-axis and V2 on the y-axis, and the color of the points follows the categories of the Class column, that is 0 and 1. Then, to plot a scatter plot, we’ll
use the geom_point function, the theme of our plot will be black and white using the theme_bw function, and the colors of the points will be dodgerblue2 and red. If you want to change the colors, you can use the scale_color_manual function. Once you run this command, you will get a scatter plot between V1 and V2 that represents the oversampled data set that we have just created using the ROSE package. If you zoom in here, you will find that the dodgerblue2 color is class 0, the legitimate cases, and red is our fraud cases. As you can see, the red points still do not look like a lot of points compared to the blue points. The reason is that random oversampling creates duplicate points, so most of these points are overlapping each other: random oversampling just duplicates points that are already present in the data set. To see whether points are overlapping each other, we’ll add some jitter to them. To add jitter, we use the position argument of the geom_point function. We set the position argument to position_jitter, and inside the position_jitter function we can mention the width of the jitter that we want, so we write 0.1 here. Now if you run this, you can see that once you add some jitter there are a lot of points stacked on top of each other, because we have just increased the number of duplicate values using random oversampling.
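The duplication effect described above can be sketched in plain Python rather than R. This is a simplified stand-in, not the ROSE implementation: the 35 fraud rows are hypothetical (V1, V2) pairs with made-up values, the seed value is arbitrary, and every fraud row in the result is drawn with replacement from those 35.

```python
import random

random.seed(1)  # fixed seed so the sketch is repeatable; the value is arbitrary

# Hypothetical stand-in for the 35 fraud rows: (V1, V2) pairs with made-up values.
fraud_rows = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(35)]

# Random oversampling: draw with replacement until there are 22,750 fraud rows.
oversampled = random.choices(fraud_rows, k=22750)

# However many rows we draw, the number of distinct points cannot exceed 35.
distinct_points = len(set(oversampled))
print(len(oversampled), distinct_points)
```

The row count grows to 22,750, but a scatter plot can only ever show at most 35 distinct fraud positions, which is why the red points look sparse until jitter is added.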
Now let’s perform random undersampling. In random oversampling we train our model on duplicate values, which is not a good condition, so let’s see what random undersampling is. In random undersampling we reduce the number of legitimate cases, and we keep the legitimate cases equal to the number of fraud cases. First of all, let’s see the distribution of our training set: we have around 22,750 legitimate cases and only 35 fraud cases. To set the number of legitimate cases equal to the number of fraud cases, we’ll use this method. First we store the number of fraud cases in n_fraud, and then we store the fraction that we want for our fraud cases. We want the fraud cases to be 0.50, that is fifty percent of the whole data, so we’ll store that in new_frac_fraud. Then we calculate the number of rows that will be in the data set so that fraud cases and legitimate cases are equal in number. We divide the number of fraud cases by the fraction that we want our fraud cases to be, that is 0.5, and we get the total number of rows that we need in our data set to satisfy this fraction. Once you run this command and look at new_n_total, you’ll see that our data set will contain 70 rows, and out of the 70 rows 35 will be legitimate cases and 35 will be fraud cases. Now let’s use ovun.sample to create that data. We’ll use the same ovun.sample, and instead of mentioning the method as over we’ll write the method as under. The rest of the arguments are the same, so you just have to change the method. Once you run this command, your undersampling result will be calculated, and to get to the data set that is stored in the under
sampling result, we take the data attribute of the undersampling result and store it in undersampled_credit. Once you run this line you will have your data set, undersampled_credit, and if you look at it right now, we have 70 rows, and out of these 70 rows 35 are legitimate cases and 35 are fraud cases. Now let’s plot a scatter plot between the same V1 and V2 to see how the distribution of the classes looks between these two variables. If you plot it now, you will see that we have an equal number of legitimate cases and fraud cases. So in random undersampling we decrease the number of legitimate cases to a certain ratio so that we have an equal distribution of legitimate and fraud cases, but the problem here is that we end up losing a lot of data, which is not preferred in the general case. So now we’ll perform both random oversampling and undersampling to see how our data looks. Now the number of rows that we want in our new data set will be equal to the number of rows in our training set, so our new data set will also contain 22,785 rows. We want the legitimate and fraud cases in equal fractions, so out of the 22,785 rows, fifty percent will be fraud cases and fifty percent will be legitimate cases. We’re not going to increase or decrease the total; we’re just going to resample our data set in such a way that 50 percent of the rows are fraud cases and 50 percent are legitimate cases, using random oversampling to increase the number of fraud cases and random undersampling to decrease the number of legitimate cases. So after storing our
fraction in the fraction_fraud_new variable, we’ll use the same ovun.sample function, and now we have used the method as both. If you want to perform both random oversampling and undersampling, you use method both. The N argument here is the number of rows that we want in our new data set, and p is the probability of sampling from the fraud class, that is the fraction of fraud cases we want. When the method is both, p is 0.5 by default, so about 50 percent of the rows will be legitimate cases and 50 percent will be fraud cases. We have again used the seed argument so that we get the same sample if we run this code again. Now if you run this code and get to the data set using the data attribute of the sampling result, you store that data set in sampled_credit. Let’s look at the data set. We have 22,785 rows, and out of those you can see 11,430 rows are legitimate cases and 11,355 rows are fraud cases. If you want the percentages, around 50.2 percent of the data are legitimate transactions and 49.8 percent are fraud transactions. Now let’s plot a scatter plot between V1 and V2 to see how our class distribution looks. I’ve already added some jitter so that we can see the overlapping of the points, because we have performed both random oversampling and undersampling. We have reduced the number of legitimate cases and increased the number of fraud cases, so now they are in roughly equal ratio, but the fraud cases overlap each other because they are duplicates, while the legitimate cases are scattered over the whole plot. Now let’s move ahead and use the third method, which is called SMOTE, the synthetic minority oversampling technique, which adds synthetic samples to our data set without actually creating
duplicates of the fraud cases. To implement SMOTE we use a library called smotefamily. If you have not installed this library, you can use install.packages with the name of the library to install it first, and then you can use the library function to load it and use the functions it contains. After loading the library, if you look at the class distribution, we have the same distribution: 22,750 legitimate cases and 35 fraud cases in our train data. Now we want to increase the number of fraud cases by adding synthetic samples of the fraud class, so that we have a balanced data set that we can use to train our model. We’ll start by setting the number of fraud and legitimate cases and the desired percentage of legitimate cases in our data set. Our legitimate cases are represented by n0 here, so we have around 22,750 legitimate cases, and n1 is the fraud cases, so we have 35 fraud transactions. Then r0 represents the ratio of legitimate cases that we want in our data set after SMOTE. We want sixty percent of our rows to be legitimate cases and forty percent to be fraud cases, so we’ll add synthetic samples in such a way that afterwards our data set has sixty percent legitimate cases and forty percent fraud cases. Next we need the number of times we have to perform SMOTE to get the desired ratio of legitimate and fraud cases. We’ll use the SMOTE function from smotefamily, and if you want to know more about this function you can press F1 here. You can see we have different arguments: X, which is the data that we want to apply SMOTE to, then target, the vector of the target class attribute corresponding to our data set X. So our X will be our independent variables and our target will be our
dependent variable that we are going to predict later on. K is the number of nearest neighbors we want during the sampling, and then we have dup_size, which is the number of times we want to run SMOTE over the minority cases to add synthetic samples. To get dup_size, which tells us how many times we have to run SMOTE to get the desired percentage of legitimate and fraud cases, we’ll use this formula. It contains r0, n0 and n1, and it gives us the number of times we have to run SMOTE so that our final data frame contains 60 percent legitimate cases and 40 percent fraud cases. Before running it we have to run these three lines together, and after that, if you run this and look at the number of times, we get 432. So after running SMOTE 432 times, 60 percent of the points will be flagged as legitimate transactions and 40 percent as fraud transactions. Now we’ll use the SMOTE function, and X will be our training data. If you look at your train data right now, the first column is the Time column. We are not going to use this column, because we are only going to use the features V1, V2, up to V28, and Amount, so we will remove it from our data. We will also remove the last column, the 31st column, which is our target column, Class. So we remove these two columns, and our training data will run from V1 to Amount. After declaring our training data, that is our X, we declare the target. Our target is the Class variable that we are going to predict, either 0 or 1. The number of neighbors that we want is 5, and then dup_size is
equal to the number of times we just calculated. Once we run this code, we get our output as smote_output. If you look at smote_output right now, it has different attributes, and our data is stored in the data attribute. So we take a data frame from this attribute and store the data set that we have created using the SMOTE method in credit_smote. Once we run this line and click on credit_smote, we get a data set that has around 37,905 entries and 30 columns. We removed the first column, and our last column is the class column, where around 60 percent of the values are 0, the legitimate cases, and 40 percent are 1, the fraud cases. Now let’s change the name of this column to Class with an uppercase C, so that it matches all of the data frames that we have used till now. We’ll change the name of our last column, the 30th column, to Class with an uppercase C using the colnames function. After changing the name, if you want the percentage of legitimate and fraud cases in the data set, you can use the prop.table function and pass it a table call, which gets the distribution of the categories in the Class variable. Now you can clearly see that we have around 60 percent legitimate cases, and 40 percent fraud cases, represented by one. Now let’s plot a scatter plot between V1 and V2 and compare the original data set with the data set that we have created using SMOTE. This will be our original data set, where the data comes from the training data, that is without SMOTE, and the rest of the code is the same as above. So this is our original scatter plot between V1 and V2 for the training data. Now let’s plot a
scatter plot for V1 and V2 in the credit_smote data, after we have used the SMOTE method on our training data. You just have to change the data that you are passing to the plot function to credit_smote. Once you run this, you will see that we have added a lot of synthetic points using the SMOTE method, so that fraud cases are now around 40 percent of the data. If you look at this data, the blue points represent class 0, the legitimate cases, and the red points represent fraud cases, and you can see that we have added synthetic points and not duplicate points. These are all synthetic points that the SMOTE algorithm has added. After creating this data, we’ll train our model on it and then evaluate the performance of the model on the original test data. So now the data set is in a 60 to 40 ratio: 60 percent of the cases are legitimate and 40 percent are fraud. Now let’s build a decision tree on top of this data so that we can predict whether a transaction is fraudulent or legitimate. We’ll be using the rpart package to build classification and regression trees, and we’ll use the rpart.plot package to plot the trees. If you have not installed these two packages, you can install them using the install.packages command. Once you have installed these packages, you can load them using the library function, and then we’ll use the rpart function to build our classification tree. If you press F1 here, you will get to the documentation of this function. The first argument here is the formula argument, where we have to mention the independent and
dependent variables. We are predicting the Class variable, so Class is the dependent variable, and we will use all of the variables except Class as independent variables. If you want to include all the variables, you can use the dot instead of writing every single name, and we use the tilde sign to separate the dependent and independent variables. Then we pass the data that we want to train our model on, which is the credit_smote data, our balanced data. Once we run this, we create a tree that will predict the class of any sample based on all of the variables, and we store the tree in the CART_model variable. After building a decision tree on this data, let’s plot it using the rpart.plot function. We just have to pass the model that we have built, that is CART_model, and then we have used extra equals 0: if you don’t want any extra information on the leaves, you write extra equals zero. Type equals five changes how the plot is displayed, the shapes and labels that are used, and tweak equals 1.2 enlarges the text to 120 percent. So this is the decision tree that our model has built. It uses only one column in our data set, V14, to classify the samples. If a sample’s V14 value is greater than minus 2.6, that transaction will be classified as a legitimate transaction, and if it is less than minus 2.6, that transaction will be classified as a fraud transaction. Now, after building the decision tree, let’s predict values on the test set to see how many samples are correctly and incorrectly classified. We’ll use a predict
function to predict the values on the test set. The first argument that the predict function takes is the model that we have built, that is CART_model, and if you want to know more about this function you can press F1 here and it will take you to the documentation. So we’ll use CART_model, the machine learning model that we have built above, then the data that we want to predict on, which is our test data, and then we say that we want to predict classes, so it will predict whether a row is class zero, a legitimate transaction, or class one, a fraud transaction. It will predict on the test data, and then we will compare the predictions with the actual values of the test data that we already have, so that we can see the distribution of correctly and incorrectly classified values. We’ll store all the predictions in the predicted_val variable. If you look at predicted_val, you can see the first value is labeled four. If you go to your test data, you can also see that the first row of the test data is the fourth row of our old data set. For this row the class value is zero and our model has also predicted zero, and for the second row the class value is zero and our model has also predicted zero. To know how many predictions are correct and incorrect according to our model and the actual values, we’ll build a confusion matrix. Let’s build it using the caret package. I’ll load the caret package and then use the confusionMatrix function, where I’ll pass the predicted values first and then the actual values that are present in the Class variable of our test data. Once you run this, you will get the confusion matrix for this model. So you
can see here that our true positives are 5,625 and our true negatives are 7. So out of the nine fraud samples that we have in our test data, seven fraudulent cases were correctly classified by the model that we built on the credit_smote data, that is the data we created using the SMOTE method. So we are able to detect seven fraud cases out of the nine that are present in our test data. Accuracy is not very meaningful here because the classes are so imbalanced, but if you still want to see it, our accuracy is around 98.98 percent. Now let’s build the same model without using the SMOTE data, on the original train data, and see how many samples are correctly and incorrectly classified. So this is our decision tree without using the SMOTE data. Now our data is the train data, and I’ve excluded the first column, that is the Time column, so we will only be using the variables V1 to V28 and the Amount column, not the Time column. The rest of the code is the same. This is our formula, with the dependent variable Class and, as independent variables, the rest of the variables in our data. We’ll use the rpart function to create a classification tree, store the model in CART_model, and after that I will plot it. This is what our model looks like: a decision tree that our model will use to classify samples as legitimate cases or fraud cases. Now let’s predict on the test set and see the misclassification rate. We’ll use the same predict function, where we mention the first argument as our model, that is CART_model, and then we are
predicting on the test data. Here too I have excluded the first column. To exclude the first column of the test data you write -1 in the column position, so it includes all the rows but not the first column, which is the Time column of the test data. We exclude this column because we are not going to use it for prediction. After that we add type equals class, so that it predicts a class for each sample, either zero or one. So now we have predicted the classes of the test data using the model that was built on the original train data, without SMOTE. Now if you look at the confusion matrix, you can see that our true negatives count is six, so out of the nine fraud samples we are only able to correctly classify six when we use the train data without SMOTE. Now let's compare these two models on the whole data. Instead of predicting on the test data, we'll predict on the whole data set, and that will show us how many samples are correctly classified by the model built on the SMOTE data and by the model built on the original, unbalanced data. First let's take the model that was built on the credit SMOTE data. Now let's predict on the whole credit card data set, which contains all the values. We'll again exclude the first column, predict the classes, and store the predictions in predictval. After storing the predictions, let's look at the confusion matrix. In this confusion matrix, out of the 44 fraud samples in our data set, 40 fraud cases were correctly classified by the model built on the SMOTE data. You can see this in the true negatives cell for class one. These are the transactions that were actually
fake transactions, class one, and our model also predicted that these transactions are fake. So out of 44, 40 fraud transactions were correctly classified. Now let's do the same with the model that was built on the train data, that is, the original data without the balancing step, without SMOTE. After creating this model on the original, unbalanced data set, let's predict on the whole data set. Now if you build the confusion matrix, you will see that we have only 35. So using SMOTE we can detect five more fraud samples than we can using the unbalanced data set. That is why using SMOTE is preferable: it balances the data set first, before the model is built and used to predict. You can easily see that we are able to classify more fraud cases using SMOTE, 40, than using the original unbalanced data, 35. Okay guys, a quick info: if you're looking for an end-to-end Python certification training, we at Intellipaat provide that, and you can check those details in the description. Okay guys, we've come to the end of this session. I hope this session was helpful and informative for you. If you have any queries regarding this, please leave a comment below and we'd love to help you out. Thank you.
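The comparison at the end of the session comes down to how many of the 44 fraud cases each model catches, which is why accuracy alone says little on unbalanced data. Here is a minimal Python sketch of that arithmetic. It is not the R code used in the session: the helper names `labels_from_counts`, `confusion_counts` and `fraud_recall` are hypothetical, and only the fraud rows are modelled because the session does not quote the legitimate-case counts for the whole data set.

```python
def labels_from_counts(detected, missed):
    """Build actual/predicted label lists for the fraud rows only (1 = fraud, 0 = legitimate)."""
    actual = [1] * (detected + missed)
    predicted = [1] * detected + [0] * missed
    return actual, predicted


def confusion_counts(actual, predicted, fraud_label=1):
    """Count the four confusion-matrix cells, treating fraud_label as the class of interest."""
    counts = {"detected": 0, "missed": 0, "false_alarm": 0, "legit_ok": 0}
    for a, p in zip(actual, predicted):
        if a == fraud_label and p == fraud_label:
            counts["detected"] += 1      # fraud predicted as fraud
        elif a == fraud_label:
            counts["missed"] += 1        # fraud predicted as legitimate
        elif p == fraud_label:
            counts["false_alarm"] += 1   # legitimate predicted as fraud
        else:
            counts["legit_ok"] += 1      # legitimate predicted as legitimate
    return counts


def fraud_recall(counts):
    """Share of actual fraud cases that the model detected."""
    total_fraud = counts["detected"] + counts["missed"]
    return counts["detected"] / total_fraud if total_fraud else 0.0


# Counts quoted in the session: 44 fraud cases in the whole data set,
# 40 detected by the model built with SMOTE, 35 by the model built without it.
with_smote = confusion_counts(*labels_from_counts(detected=40, missed=4))
without_smote = confusion_counts(*labels_from_counts(detected=35, missed=9))

print(f"with SMOTE:    {fraud_recall(with_smote):.1%} of fraud cases detected")
print(f"without SMOTE: {fraud_recall(without_smote):.1%} of fraud cases detected")
```

With the counts from the session this works out to roughly 90.9% of fraud cases detected with SMOTE against 79.5% without it, the same five-case gap described above.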
Guys, what else do you want to learn from Intellipaat? Comment down below and let us know so we can create more such tutorials for you.
Is this the full Python course?
L.😂😂😅
Hii❤
Jai Hind Sir
Sir, can I get the notes for this tutorial? It was really helpful!
x==15 or y!=16 this returns true
👍👍👍
You posted 3 tutorials of the same Python course. Which of these 3 tutorials do you recommend for a beginner?
Very good teaching ❣️
The Course was very amazing ❤❤
very great, i havent gone through it but im expecting it to be good💜
Please share the PowerPoint document for this video. It would be very useful for notes.
thankyou anirudh bhaiya for helping me out ..i am in class 11th and done with my syllabus of python and now learning of 12th
Thanks bro
I am a beginner in programming, with no background in this field. Am I eligible to join the course?
Sir …which version book is best in python
Come on man, seriously, so much of my time has been wasted. You should have provided us with the data so that we could learn and practice side by side.
can i get html code for scraping
Does this contain info about Python turtle?
Can this video cover whole syllabus for class 12 cbse python ?
will this be useful for +1 class students
Graphics design
1:32:13 : what command have to press ?
great
Awesome explanation about python and it is helpful
Can I add images inside a function ?
Please make video on JavaScript
This Channel Is Best Ever Channel On Youtube👍👍👍👍❤️❤️❤️❤️
You know everytime i think of learning something I come back to this channel. You guys are pure legends! God bless you all! 🙏🏻