Complete Python Pandas Data Science Tutorial! (Reading CSV/Excel files, Sorting, Filtering, Groupby)
- January 2, 2024
- Posted by: MainInstructor
- Category: Data Science Go Python
Video Title: Complete Python Pandas Data Science Tutorial! (Reading CSV/Excel files, Sorting, Filtering, Groupby)
What’s up guys and welcome back to another video in this video we are not talking about the fluffy animal or the song by designer but instead we’re gonna dive into the pandas library of Python which I find to be one of the most useful libraries when you’re doing
Anything data science related in Python so this video will be a good standalone video if you’ve never done anything with pandas kind of going from zero to like fairly comfortable in one sitting it also is a good video if you have you know some Python pandas experience but
You’re looking to try to figure out how to do something specific if you’re in that second case look down in the comments and I’ll pin a timeline of what we’re doing this video so you can find exactly what you’re looking for quickly so one question you might have as you
Watch this video is like wipe andis a lot of the stuff you’ll see me doing you probably could replicate in Excel but there are specific benefits to using pandas and the first benefit is you have a lot more flexibility with pandas and using Python in general like what you
Can do in Excel I think is very limited compared to what you can do using the whole Python programming language so like the flexibility that Python offers is reason one to use pandas and then the second reason this is also very important reason is pandas allows you to
Work with a lot larger data sets then Excel does Excel really kind of struggles once you start loading in really large files so the second reason of why pandas is you can work with big big data if you’re finding this video useful don’t forget to click that subscribe button because I’ll be making
A lot more tutorials building on this type of stuff in the future ok to begin this video you should already have Python 3 installed and then you need to open up a terminal window and type in pip install pandas if you don’t already have the library as you can see I
Already have it once you have the library we can actually begin loading in data super quickly so I just want to dive into the data right away and we’ll use that data to kind of learn everything we need to know regarding this library so I have a link in the description
Chintu my github page where I have a CSV of data that we’re going to be using for this video so go to my github page and then this data is going to be on Pokemon I found this data on kaggle it’s like a good open source machine learning
Website that you can kind of like do all sorts of challenges and I thought it was perfect for an introductory video on pandas so you don’t have to be a huge fan of Pokemon but it’s a great data set to get started so click on the CSV
Version and that’s kind of the most important one as you can see you can kind of get a feel for what’s in this data so we have all the different Pokemon and then all of their kind of stats in we’ll be doing all sorts of manipulations and doing all sorts of
Analysis on this data throughout the video but I want you to click on raw and then once you have the raw file you can just save as I called my Pokemon data and I save it as a CSV CSV is important for loading it in properly but you can
Name it whatever so hooking my data to the CSV and one thing to note is wherever you’re writing your code you should save this data in the same exact directory just so it’s easy to load in these files okay once you have the data saved locally open up your favorite text
Editor for the purposes of this video I’m going to be using Jupiter notebooks because I like using that for data science related stuff but you can use sublime text pycharm whatever you like to write your code in and I’m going to just clear all this that I have on the
Screen right now okay so the first thing we’re gonna do is load data into PI pandas so we have this CSV and you can open up the CSV and look but exactly what you saw on this page so this is what we’re gonna load into the pandas
Library and we’re gonna load it in and what is called a data frame so it’s super important you know everything about a data frame but that’s kind of what the object type is that panis allows you to manipulate everything with okay so the first thing
We need to do is type in import pandas to get the library and usually what you’ll see is it’s kind of annoying to have to reference pandas every time you type anything in that uses it so we usually import it as pandas as PD so
Just do that and then to quickly get our data loaded in we’re going to say pokey meaning Pokemon or maybe I can just call this like DF for data frame equals PD and then there’s this really useful function called read CSV and then you
Have to pass in the path to that CSV and if you wrote the put the CSV in the same file or in the same location that you’re writing your code you can just do the name of the file dot CSV if I run this it loaded it in and you can’t see that
It loaded it in but if I went ahead and did print DF you can see that all that data is there in that DF variable and if you don’t want to load in all of the data you can use the there’s these two useful functions to look at just the top
Of the data and just the bottom so I could do DF head and then I could specify a number of rows so I’m going to just say three for now think the default if you didn’t put that three in there is five so you see I just now can see the
Top three rows and it’s a little bit easier to read my data using that and I also could do if I wanted to see the bottom three rows I could do tail three and get as you can see the index is changed to the bottom got those bottom
Rows okay I’m going to just comment this out real quick I also want to show you that if you don’t have your your data in a CSV format that’s fine we can also very easily load in Excel files or tab separated files so on that github page I
Also just for the sake of practice included this same exact file in txt format which is a tab separated format as you can see in the Excel format so if you want to try this or you have a set of data that you’re trying to manipulate you can also do
I’ll load those files in so I can do PD dot read Excel that’s another built-in function to pandas and my excel file was like Pokemon data xlsx I believe just check yeah I don’t know yeah I think that’s the extension we’ll get an error if not I’ll comment
This line out too and I can do a print of DF xlsx ahead three as you can see that same data is read in from that excel file and then the last thing we can try to do I’ll move this just so it’s a little bit cleaner just
Down here comment it out real quick I can also load in that tab separated file so this one’s a little bit different I can do PD read CSV Pokemon data dot txt and watch what happens when I run this it’s probably not giving me an error let’s see oh it didn’t print
Yeah let’s see what happens I think it’s gonna yeah it loaded it all in is like this one single column so the difference with this tab separated file and just to remind you what this looks like just instead of having commas that’s separating the different columns its
Tabs we need to in our read CSV function specify a delimiter it’s actually separating them in this case it’s a tab which is specified by /t I believe I don’t remember the differences between forward slash and back slash and yeah look at that we have the columns in the
Way they were looking when we were just doing the CSV also note for this TSV the tab separated file you could change this to anything that was actually separating your column so if like let’s say for whatever reason you had three exes separating your columns you would set
Delimiter equals xxx all right let’s move on to the next thing and that’s going to be actually reading our data easily within the pandas framework so the first thing is reading the headers so in our data we have several headers and we can figure out what those are by
Doing a print D F dot columns so if we want the headers we just do DF columns as you can see there’s the Pokemon number or the Pokedex number I think it’s been a little while since I’ve refreshed my Pokemon scale is the name of the Pokemon but typed the two types
And all of the stats information is whether or not they’re legendary so these are all the columns we can work with it’s all just not print that for now also this a Jupiter notebook I’ll save this on the github page once I’m finished with the video so you can also
Look at this if you use Jupiter notebooks follow along with this if you just download it from the github page or clone it alright so now that we know our columns let’s read a specific column so to do that we have our data frame still
That we loaded up here and I can do DF dot let’s say I wanted to get the name of the Pokemon so if I just did print do you name and ran that as you can see I get all the Pokemon and it does actually abbreviate it just so I’m not printing
Out like 800 different things so that gives me that and I could also specify that I only wanted like 0-5 probably by doing this yes now I just get the fit the top five names one thing that’s interesting you could also do DF name like this this doesn’t really work for
To word names but you could also get the names like that I usually just do it in the using the brackets and if you want to get multiple columns at the same time you can change this just single word to a list of column names some in
Turning a list here and then separating it by commas so name say type 1 and that’s a like H pace we’re getting 3 different columns and there aren’t even all in order so it’s kind of nice so that you can get so if you want to look
At specific things and not be cluttered with so much extra stuff you can do that moving on to printing each row that’s just comp this up real quick probably the easiest way to print out each row so I’m going to just show you remind you what’s in our actual data set
Again so let’s print out the first four rows and let’s say we wanted to print out this first row index of one row so I guess this is actually this zeroth row so that it has Ivysaur grass poison etc and it if we want to adjust that row we
Can use this eye luke function on the data frame which stands for integer location so if I passed an eye look of 1 that will give me everything that was in that first row could also use this to get myself multiple rows I could do 1 to
4 and that would get me all of these rows so another way to get rows wherever you want in the data frame and the same ìlook function can be used to grab a specific location so let’s say I wanted to I’m gonna just change this to 0 real
Quick I wanted to get the venusaur name here so if we did the indexing of that it’s on the second row and it’s the 0 first column if we’re counting with numbers so if I wanted to just get that specific position we could do print di
Luke and then the second row and then do comma the actual position 1 so we want the first position the first column as you see that gives us Venusaur building up on this one thing I often myself trying to do is iterate through each row in my dataset as I’m reading
Yet and so to do that what I would recommend you do there’s probably other ways I’m sure there are is do for index comma row in D F dot it er Rose it her iterate through rows probably the easiest way to just go row by row and just access any sort of data
You might want so I could do print index comma row and run that as you can see it didn’t format it nicely for me but it’s going first row then getting the data for the second row etc and one thing that’s pretty nice about this is I could
If I just wanted the name in the index free throw I could iterate through and just get that information I don’t know I find this pretty useful for all sorts of different tasks that I’m doing well working with my data and then one additional function I want to get to
Right now I’m gonna go into this and more depth a little bit but in addition to having the I look we also have DF cloak and this is used for I guess finding specific data in our data set that isn’t just integer based isn’t just like the specific rows it’s based on
More textual information numerical information so one thing that’s really cool you can do with this is I can do DF cloak and then I can access only the rows that have DF name or let’s say type one equal to let’s say fire so this should give us like chars are
Immediately oh gosh Charmander Charizard the middle one so let’s run it and hopefully this works oh yeah afterwards they’re pretty nice when they don’t print it Wow I should have known this yeah so as you can see it’s only giving me the type one that’s equal to fire and I
Could do the same thing if I only wanted to look at the grass pokemon as you can see now we get Bulbasaur Ivysaur venusaur etc you can just keep doing this and you can use multiple conditions so this is super super powerful to do all sorts of conditional statements and
I’m going to get into this the more fancy advanced stuff with regarding this later on in the video while we’re on this topic another useful thing we can do with our data frame is we can use this stop described method which gives us like all the high level
Like mean standard deviation type stats from that so as you can see some of these categories it’s not super super useful like pokedex number it doesn’t we don’t really care about the mean but for like HP attack defense special attack etc it’s pretty cool little method to
Use because you have all these metrics you can quickly just look at your data another useful thing we can do is I just print the data frame again we can do some sorting of the values so let’s say instead of going from first pokedex downwards we could do sort of by let’s
Say alphabetical name so i could do sort values and then i have to pass in the column i want to sort so if I sorted values by name now I have it alphabetical if I wanted to make it the other way I could do extending so that’s
And that equal to false so now it’s gonna be descending as you can see he also can combine multiple columns in this so let’s say we had sorting by type one and then we wanted to have our second sort parameter P by H P so this
Would give us all like I guess probably the bug pokemon because that would be the first alphabetical one and then it would give us the lowest or highest HP from that let’s see what happens yeah as you can see bug and this is the lowest so what we could do is also pass
It in descending and this time because we have two columns when you specify true or false for both it might you might be able to do this yeah we can do this but if you want to separate if one is extending one’s descending we can do
See now I got it descending but we got the farthest down type one so I can do something like this so we want the first one to be ascending and the second one to be descending so now type one will be going a through Z and each P will go
From high to low as you can see so sorting the values is very useful as well okay now that we know how to read our data that’s like we start making some changes to it so let’s look at our data again okay so let you get this data
One change that I think would be cool to make is we have all these stat categories I think would be pretty cool to add all these stats together and create like a total category which kind of potentially could help us rank which Pokemon are the best so let’s go ahead
And do that and one thing that’s cool and I guess true about most things programming is there’s multiple ways to do this so we’re adding a column that is the total of those stats so one way we could do it is we just go ahead and
Access our data frame and then just call this new column total and we can just reference it it like this right now and we will say that that equals this is probably the easiest way to read but not the I guess fastest way to do it but you
Could do DF of HP plus DF of attack I’ll just probably speed this up when I’m actually editing this okay so now we had to find this one was you have the dataframe total is gonna equal all the other columns run that I guess we don’t see anything but if I go
Ahead and do data frame dot head five as we can see over here on the right side we have this new column name total and I would say I recommend always when you do something like this just making sure that you did it like actually is the
Total that you’re trying to get 49 plus 65 plus 65 plus 45 because you could easily see that this total is a valid number but if you don’t actually double-check that it’s the right number you kind of run into a dangerous territory and as we can see perfect 318
Is exactly what we are looking for so that’s one way to do it another way we could go about doing this and actually because I have I’m using a Jupiter notebook actually what we might want to do first is drop some columns so one thing about Jupiter notebooks is like if
I run this again even though I’ve commented it out lost myself it still has that data frame in memory so it even though this is commented out it still has that data frame in memory so it doesn’t remove the total after I even comp this out it just stays in memory
But so one thing we might want to do is drop a specific column so if I wanted to go ahead and drop the total column and just show how to do it in another way I could do data frame drop and then I can specify the columns and I’m gonna
Specify total I’m gonna run that yeah why did you not disappear and so the reason this did not disappear is because it actually directly modified or doesn’t directly remove that column I believe you have to just reset it to dataframe so I can go ahead and do this and we should see this
Total column here the right side which my face is blocking will see this disappear run that yay so that was dropping a column so now if I wanted to go ahead and do the add a column in a different way maybe a little bit more succinct of my way I can go
Ahead and do DF total that stays the same and then what I’m going to do this time is I’m going to use that I Lok function that we learned so integer location I want all the rows so the first input is going to be the colon which just means all rows everything and
Then the columns I actually want to add together will be HP through speed so that will be this is 0 1 2 3 4 so this will be the fourth column to the 5th 6th 7th 8th 9th to the ninth ninth column and I’ll run and then I can there’s a
Dot sum function you can use and you want to specify if you’re adding horizontally you want to specify axis equals 1 if you said actually su equals 0 that would be adding vertically ok and we have our totals again and one thing you might have noticed I don’t
Know if you caught this but because I have this 318 down here I realized that this 273 is actually wrong so that’s why it’s good to make that check the error I made was that it shouldn’t end at 9 if we want to include this speed it
Actually has to go to the next one because the end parameter and like lists and don’t be everything we the end parameter enlists is exclusive so 10th means the tenth column is the first one we don’t include so every run that now you see that the totals are actually
Correct as we did the math down here the last change we’ll make before we resave this as a another CSV the last change we’ll make before we receive this as an updated CSV will be let’s say we didn’t want this total column all the way over here on the right side it makes
A little bit more sense I would say to be either to the left of HP sorry I don’t know what tells us or the right of speed so we can do this in a few different ways and the way I’m gonna choose is probably not the most
Efficient but it it makes sense given what we’ve done already so remember that we could grab specific columns like this so if I wanted total I wanted HP and like defense let’s say and note I can order these however I want so if I wanted to reorder my columns and then
Save it afterwards I could just do DF equals whatever order I choose and because it’s a little bit annoying to type out all these things I’m going to get the columns as a list to do that I will do calls equal DF and you don’t
Have to know why why I’m typing what I’m typing works exactly I’m looking at the documentation as I do this and I recommend you guys do the same always look at the documentation there’s great stuff here I can’t get everything out here in this single video doing the best
I can but definitely check the documentation out I’ll link to that in the description ok so I’m getting the columns and instead of ordering it like this I’m going to do ranges so if I want these first four columns and then total and then the last the remaining columns
I could do something like this calls of 0 to 4 they’ll get me the first four in the same order I want it plus calls of negative one that’s just reverse indexing getting the total here I might be blocking that again the here and then finally the remaining
Stuff we would need to add to that would be four to five six seven eight nine ten eleven twelve and we include twelve because that would be in our the first one we actually don’t include in the final data frame so let’s see what happens when we do that we want to see
This total go over here know what happened okay so why do we get this error can only concatenate lists not string to list so that’s telling me something probably in here is messed up and what I’m seeing is that because this is a single column it’s not gonna be
It’s just gonna be a string so I have to actually share out of that in / in brackets to make it a list and then I can go ahead and run this again and we wanted to see the total switched over to the left side and there we go it is
There cool and one comment I want to make as I said before this type of change doesn’t actually really modify our data at all it’s just kind of a visual thing so I didn’t really care too too much about how I went about and did
It but one thing I really want to note here is be careful when you’re hard coding numbers in like this if your data is changing and you have these hard to criticism uh kind of like just using actual names so even calculating the total like this
Is a bit dangerous so maybe instead of using four to ten one thing you could potentially do is get the index of your start so that would when we were doing this it was the index of HTP and then go to the index of speed that would be one
Way to do it’s a little bit safer I would say all right now that we’re done with all of this let’s move on to saving our new CSV so just a reminder of what we have in our data frame we follow this information and in this previous of
Cells is where I actually defined the data frame just as a reminder this date frames not coming out of midair so I have this data frame and now I want to save this updated and let’s start by saving it to a CSV so just like
We had the dot read CSV we also have a built-in function in panda is called to CSV so I could just call this something like modified TA or modified CSV and now it will take whatever is in this data frame and output it to nice comma separated values format so because I got
To this next cell we know it did that I can check my directory and as you can see there’s this modified CSV and I’ll just open that up real quick just so you can see it all the information is there load okay so we see we have all the
Stuff we wanted and this total column there which is cool the one thing that is annoying about the current state of this stop texting me I’m making a video who has the nerve okay sorry so the one thing that might be annoying is that you have all these indexes over
Here and I don’t really care to have those so the quick fix to not save all these indexes with your data you can if you want to but you can go ahead and pass in the variable index equals false run that again and then I reopen my
Modified CSV you will see that that stuff is all gone so yeah now we just have the Pokedex sr for this column to the left which is perfect you can also go ahead and there’s also a built-in to excel function so I could if I wanted to
Save this as a excel even though right now we’re just working with the data frame it’s easy to output it to that format so to excel we’ll call this modified dot X there xlsx and we can also make the index false here run and so that well now we have these
Two modified this is the actual excel file I could load that but for the sake of time I’m not going to and then finally the last way we load it in three formats I might as well save three formats so the last one is what if we wanted to save that tab-separated file
So we can do to CSV again modified I’m going to call this modified txt and index equals false and then the one thing on this is there’s no delimiter parameter for when we’re doing to CSV which is kind of annoying but there is a separator parameter you can pass in and
Look at the documentation if you need to remember this I’m looking at the documentation as I speak and so I can specify that I want to separate it with tabs instead of commas is that that’s gonna happen by default so run that and I will actually open this one up just so
You can see modified here and if I drag that and you can see that all the data is there no indexes on the left and it’s all separated by tabs so that looks pretty good alright now that we’ve done all of that let’s move into some more
Advanced Panda stuff and we’ll start out with some more advanced filtering of our data so just a reminder this is our data frame so as a first example I showed before was that we could specify a specific type for example that we want it’s okay the DF cloak and then we said
DF of type 1 equals or equals equals let’s say graphs we’re only going to get the rows that actually have grass as their type on so as you can see all these type ones are grass in addition and we can do just more so than just one location condition we can pass in
Multiple so I can do something like DF type 1 equals grass and let’s say we wanted DF of type two two equal poison so I can type it in like this run it oh no we got an error so the thing you got to do here is we have to separate our conditions
With parentheses for whatever reason not quite sure why that is so here I have two conditions separate them by parentheses now as you can see we only have grass and poison now and one thing to note is usually we’re typing out and like this but inside of our pandas
Dataframe when we’re filtering we just do the actual and sign let’s say if we wanted type 1 equals grass or type 2 equals poison then we could do the or sign like this it’s a little bit different than you’re normally used to just the convention of the Python pandas
Library and just look this up if you forget so I run that now we should have one is poison either type one is grass or tuck two is poison and as you can see this is a bug type this is poison so I was able to separate those two
Conditions by N or instead of an and and we don’t have to just use text conditions I could also add in let’s say we wanted type 1 is equal to grass type 2 is equal to poison and let’s say we wanted the HP to be a fairly high value
So just looking at these feel like 70 is a good cutoff value so HP has to be greater than 70 I can also specify conditions like this and around that now you see we only have five rows that it actually filters out and you could go
Ahead if you wanted to there’s a couple different things you can do with this so first if you let’s say you wanted to make a new data frame that was just the filter data I could just do something like new DF equals this and now if I print out
Nudee f we get just those five rows but I could go ahead and just print out D F and we still have everything also worth mentioning real quick I can easily save this new data frame as a new CSV kind of to checkpoint my work and maybe if I
Wanted to do this on many different filters kind of have this more specific CSV files that I could dive in and look at in more depth it’s like I call this something like filtered dot CSV if I ran this you’d see in here that I have this
Filtered and it contains the data that I just grabbed out one thing to note when you are filtering your data and you shrink down the data size is when you print out that data frame so I’ll comment this out okay I can just print
Out new D F as you can see one thing that’s weird is this is the index here so it goes to 350 77 652 even though we’ve filtered out our data the old index stayed there and that get annoying if you’re trying to do some additional processing with this new data frame so
If you want to reset your index you can go new D F dot reset index and you can start off by just setting new D F equal to new DF is don’t reset index now if I print out new DF you see that we have 0
1 2 3 4 and by default it saves that old index there as a new column if you don’t want that to happen we can modify it further we can do we can do drop equals true so this will get rid of the old in these indices as you can see now we
Don’t have that then the last thing is if you don’t want to have to reset it to a new data frame you can actually do this in place as well which just probably conserves a little bit of memory and if I run this I don’t even
Set it to a new variable it just will change the value of within the given new DF and as you can see we got the new indexes for our filtered out data so that’s something useful too to be aware of because if you’re running through your new data frame like row by
Row and you’re trying to get a specific spot even though it’s like the fourth row that you see it might be you might need to index like you know the semi first position and that would get really annoying so resetting indexes is helpful in this case in addition to I guess
Equals conditions greater than less than etc not equals we also have other types of conditions we can use basically anything you can think of so one thing that I see that is kind of annoying me with this data is if you look in here maybe this is because I’m like a little
Bit outdated on my Pokemon knowledge but I’ve seen these like in mega versions of Pokemon and I’m not quite sure what that really means so let’s say I wanted to filter out all the names that contained mega and it’s tough to do with equal science you know because contain is not
Quite equal to because we want to allow a lot of different things there so I could not allow the name to include mega by doing the following so I’m going to delete the stuff that’s inside of here maybe I’ll just comment it out so you
Can still see it but I’m going to do DF cloak and then I’m gonna pass in da name then I need to get the string parameter of the name this is something you should just kind of I guess remember about the contains function string and then dot
Contains mega so if I run this you’ll see that all of these ones are just the columns that include the word mega and then if we want to get the reverse of this this is another good symbol to remember because it’s not quite what you would think it would be but within the
Alok function if we want to do not instead of maybe think you know to be the explanation point it’s actually this squiggly line so if I run this now we drop all those ones that had the mega so as you can see there’s no Megas anymore in our data so
That’s pretty useful and taking this even a step farther this contains function I find to be very very powerful because in addition to just doing exact words we can also pass in reg X expressions and do all sorts of like complicated filtering with this so let’s say that’s the first example let’s say
We wanted to see if the string wanted a simple way to get if the type one was either grass or fire so to do that first have to just import regular expressions and I would recommend looking into regular expressions if you don’t know what they are super super powerful and
Filtering data based on certain textual patterns so I can do reg x equals true and right now I’m trying to find if type 1 is equal to fire or let’s say grass so in the writer reg X expression this means or so I want it to either match
Fire or grass run that shoot it did not give me anything and the reason it didn’t give me anything is because the capitalisation was off so this is gonna be another good point so see that did work type 1 grass type 1 fire etc but a probably nicer way to do this because
You might have all sorts of funky capitalization is I could go ahead and change it back to this way but there’s a flag that you can use so I can say Flags equals re dot I and that’s going to be ignore case so I run that again as you
Can see grass and fire is grabbed even though I specified it without the capital letters one more example let’s say I wanted to get all Pokemon names that contains started with API so probably the first example you might think of as Pikachu but he also would have like Pidgeotto and probably a bunch
Of new ones that I don’t know so if I wanted to just get data in the name category that started with P I I could use red x’s to do the following I could do P I and then specify that I need it to start with P I but the next set of
Letters can be a through Z and let’s say like this star means one or more and yeah this is all just Rex information if it seems super super foreign to you look into Ray guesses and if I do this we didn’t get anything what happened that’s
Because I said type 1 so if I actually change this to name run it as you can see oh we got Caterpie so I did something messed up with my reg ex but as you can see there’s all these PA names in it and if I wanted to eliminate
This from happening the PA letter to be in the middle I can specify a start of line with this carrot run that now we’ve got only our names that begin with a P I and you might find this you know there’s many different use cases where you might
Find something like this useful to do to filter out your data in a kind of complex manner building off the filtering we did in the last examples we can actually change our data frame based on the conditions that we filter out by so let’s imagine I wanted to I didn’t
Like the name fire for type 1 I thought that you know bitter and if it was name like flame flamie flamer if you have our fire type you’re actually a flamer so let’s make that change and I know this is going against Pokemon tradition but
Just to show you DFL oak and we want to have DF of type 1 equal equal fire and if that is the case well I can do if I specify with a comma I can specify a parameter so I’m going to say type 1 so this is the column I
Want and I can do equals like flamer it looks like something is off why does it look like something is off that’s because I have an extra bracket there now it should be good run that don’t see anything but if I do DF oh shoot you can
See that now type 1 is flamer as opposed to fire if I wanted to change it back I could go fire and this is flavor now I have fire again we also can do like specify this to be some different different calm it doesn’t have to be the
Same column we’re editing so maybe you decided that legendary pokémon are all Pokemon that are of type fire and you can make this true in that case and as you can see now all the fire pokemon are legendary which obviously isn’t true but it’s kind of cool that we can use one
Condition to set the parameter of another column and I’m kind of screwed up this data frame in general now because I did that so what I could do is use kind of my check point that was the modified CSC so I’m gonna just say CSV equals DAR PD dot read CSV modified dot
CSV now I’m just kind of loading my check point that I had a while back yeah so now I fixed up the false the legendary by just reloading my data frame you can also change multiple parameters at the time so I’m gonna just do this as a demonstration but imagine
We wanted to say like something like if the total is greater than 500 it’s pretty damn good Pokemon I was gonna say these changes that I’m gonna make her here don’t really matter but just to show you that certain conditions can be modified multiple conditions can be modified at a
Single time so if I want to modify multiple columns at a single time I can pass in a list and if I set like this to the test value don’t worry about this that’s what I’m saying test value so if the total is greater than 500 these two
Columns should instead of having their normal values should have test value let’s see if that’s true oh and I just need to print out data frame as you can see this total is greater than 500 this is greater than 500 this is greater than 500 all of them
Modified as we wanted and another neat thing to know is that you can modify them individually as well so this could be tests that are like will say that this is test one and this is test two so now we are specifying what generation becomes and what legendary becomes if
This specific condition is met as you can see that updated appropriately comment all this out real quick and just reload the data frame as it was initially okay so that’s just I’m just resetting the changes cause I don’t want to stay but showing you that you can do these
Things then they become super super useful all right we’re gonna end this video by doing something that I find very useful and that’s using the group by function to help you do some aggregate statistics so let’s start by loading in that check pointed CSV we
Kind of created so I modified dot CSV is what I’m gonna load it in and just reminder this is what it’s looking like all right so with this group by function we can start doing some really kind of cool analysis on things so for example
One thing I could do is if I wanted to see like the average HP and attack of all the Pokemon grouped by which type they are so like maybe trying to figure out like Oh our specific types like have better skills so maybe a rock pokémon would have like hi defense I think
There’s a rock pokémon steel Pokemon would have hi defense you know maybe a poison Pokemon would have high attack so we can start seeing that as a kind of holistic measure by using this group by function so I can go D F dot group I and
Let’s say I wanted to group by type one and we’re gonna look for the averages of all the type one Pokemon so I can run that and here we get all the stats broken down by their mean sorted by what type one is so if I look at a bug it has
Like an average attack of 70 and make this even more useful I can do that use the sort function we learned so sort values and we’ll sort on let’s say defense and I’m gonna make this ascending the equals false so I can see the highest defense type one and as I
Mentioned before this is we did kind of see what I was expecting to see that if we took the average of all steel pokemons they have the highest defense which is kind of cool it’s cool to see that we could also instead of doing defense we could look at what is of all
Pokemon that are in the have a type 1 that’s the same who has the highest attack and I always thinking maybe would be poison but I’m not positive okay so we got this does make sense we got dragon with the highest average attack and a lot of the dragon pokémon
Are legendary so there’s no surprise there that they’re super powerful and like holy crap like a dragon would be very scary fighting obviously that makes sense that they should have good attack let’s see who has the best HP this might the dragon also is the best HP but it’s kind
Of cool that we can group by type 1 and see all these like useful stats about them just on that group it’s a useful little analysis tool and I could additionally do something like dot some so I can use the three that come off to like come to me in my mind immediately
There might be others that you can aggregate statistics you can do with group I but some or mean some and count so if I did this and some do everything here I have like all of the HP’s added up and you know you got to be thinking about why you’re doing something when
You’re doing anything data science related in this case it doesn’t really make sense for me to sum up these properties because like I could sum up and see that one’s like way higher than another but because you don’t know how many type ones are bug or type ones are are dark
Etc this aggregate sum doesn’t make sense in this context but you can do it then you also then you also have count so if I run this we have all the counts of the different out of pokémon that are type 1 so 69 are bog 31 are
Dark 32 or dragon type 1 and if you want to clean this up a little bit like you have a lot of the same values everywhere basically it’s any time it’s a non zero non false number or I think just yeah any time the row is filled in so like
The reason this is 52 right here is because the type 2 is just blank so it didn’t count those counts so if you wanted to just have like a clean this is the count you could add DF equals or DF of count equals 1 so basically what I’m
Doing is I’m filling in a column added to the data frame and I’ll show you this that’s just a one for every route as you can see on the right side where my face is normally blocking there’s this one here so now what I could do is I could
Do that same group by count right oh shoot it’s calming it up and I get all of this back but if I want to just make my life easier I can do just get the count column and now I have this useful little set where there’s 69 bug 31 dark
32 dragon etc this is a little bit easier to read now and now that I have been it easier to read format I could also group by multiple parameters at the same time so I could do type 1 and type 2 so looking at all
The subsets so of type 1 bug to have a type 2 of electric to have a type 2 of fighting etc I can do all sorts of count and this gets really useful if you’re working with a really really massive data set and I’ll get into that quickly
I don’t know if I’ll go through the full example but imagine you had a really really big data set and you couldn’t even load it all into one data frame this group I and like count and sum is super useful because you can take your
Data frame and kind of squish it so if you’re like tracking some number of like how many times an event occurs in your data frame you can use this group I and like count things and then kind of squeeze your data frame make it smaller based on this group by function if that
Made any sense so I’m not gonna show you guys exactly just because we’ve been working with the data set that’s not that small I don’t really feel like bringing in a new data set right at the end but imagine you’re working with a data set a file that’s on the order of
Like 20 gigabytes it’s pretty dang big and you don’t really know how to best process it one thing that’s really useful about the Python pandas library is it allows you to read in like a file like that you can read it in chunks at a time so instead of reading it into all
20 gigabytes because now unless you have a very very powerful machine you’re not going to be able to load all of that into memory you can load it in let’s say 100 megabytes at a time and so normally when we are reading the CSV we would do
Like PD read CSV modified F CSV is file I’m using right now and that would load everything in so instead of doing that what we can do is we can pass in this chunk size parameter so I’m going to just say for now chunk size equals 5
Just for the example and that means 5 rows are being passed in at a time so if I did for DF in PD read CSV that means that my DF right here would be 5 rows of my total data set modified CSV and because this is rows and you might would
Rather think a bit in like terms of memory size you can do a little bit of math with the Rose to figure out how much memory that will actually be taking if you think every row is probably like maybe 10 or 20 bytes shouldn’t be much
More than that you can do some math on how big this is and if you run into an area you can always like that you don’t have enough memory you can always shrink this chunk size so we’re working with a really big data set we made a set you
Know our chunk size to a hundred thousand rows at a time which is a lot of rows but nowhere near how much that full 20 gigabytes would be but for our example we’re just loading again five rows at a time and I can show you that
That is happening so 10 or like chunk data frame and then the data frame just to see how it’s working so we have the first data frame and as you can see it’s five rows second data frame another next five rows third data frame the third set
Of five rows etc so this loaded in the data frame but in chunks of five so what’s useful with like the aggregate stuff we were just going through is you could also like to find some new data frame equals that say PD data frame and
You can give it like the same columns as you had in your original data frame this would just create a new data frame that’s empty with the same column names basically what you can do is you could let’s say like define DF group by type 1 let’s say
And get like the count of that stored in results and what you can do here is with that new data frame you defined you could do something like you can use the concat function of pandas which just appends two data frames together and you could do something like p or new data
Frame equals PD concat of the new data frame and results so basically what what this would do is always take your new data frame as you go through chunks append on results and store it back to new data frame so as you did this as you did more iterations
You be building this new data frame of all the information in your original really really large data set but because your each chunk you’re like aggregating doing some sort of group buy and count you’re shrinking that data size down so that this final new data frame has the
Meaning that comes out of that big original data frame but it’s a lot smaller you can actually do more analysis now on this shrunken down size new data frankly if that hopefully hopefully that makes sense if you need me to clarify this just leave me a
Comment down below and I’ll try to clear things up regarding that all right that’s all I’m gonna do in this video hopefully you have control you kind of feel like you have control the pandas library now if you felt like you learned something make sure to hit that
Subscribe button it would mean a lot to me I’m gonna build off of this video in future videos such as like plotting stuff in our data frames and you know kind of doing some advanced stuff using like regular expressions I don’t know if it will be tell it specifically a pandas
But a lot of useful information that you can take your panda skills and build off of so subscribe for all of that if you have any questions about anything I covered in the video you leave a comment down below and I’ll try to help you out
And clarify and also if there’s any like additional features you would love to see in pandas that I’ve missed leave a comment down below what that is and maybe I’ll make a follow-up part two to this video all right that’s all I got thank you guys again for watching and peace out You
-
Sale!
Wireless WIFI Repeater Extender Amplifier Booster 300Mbps
$29.99$14.99 Add to cartWireless WIFI Repeater Extender Amplifier Booster 300Mbps
Categories: Electronics, Wi-Fi Router, Wireless Wi-Fi Extender Tags: 300Mbps, 802.11N, Amplifier, Booster, Extender, mobile wi-fi booster, Remote, WIFI, Wireless, Wireless WIFI, Wireless WIFI Repeater, Wireless WIFI Repeater Extender, Wireless WIFI Repeater Extender Amplifier, Wireless WIFI Repeater Extender Amplifier Booster, Wireless WIFI Repeater Extender Amplifier Booster 300Mbps$29.99$14.99 -
Sale!
Full RGB Light Design Gaming Headset Headphones with Mic
$24.99$14.99 Add to cartFull RGB Light Design Gaming Headset Headphones with Mic
Categories: Electronics, Gaming, Gaming Headsets Tags: Design, Full, Full RGB Light Design Gaming Headset, Full RGB Light Design Gaming Headset Headphones, Full RGB Light Design Gaming Headset Headphones with Mic, Gamer, Gaming, Gaming Headset Headphones, gaming headset wireless, Headphone, Headphones, Headset, Light, Mic, Package, RGB$24.99$14.99 -
Sale!
Wireless BlueTooth Multi-Device Keyboard Mouse Combo
$39.99$19.99 Add to cartWireless BlueTooth Multi-Device Keyboard Mouse Combo
Categories: Electronics, Gaming, Gaming Keyboards, Keyboard Mouse Combos Tags: Combo, Keyboard, keyboard mouse combos, Mouse, MultiDevice, Set, WireKeyboard Mouse Combo, Wireless, Wireless BlueTooth Keyboard Mouse Combo, Wireless BlueTooth Keyboard Mouse Combos, Wireless BlueTooth Multi-Device Keyboard Mouse Combo, Wireless BlueTooth Multi-Device Keyboard Mouse Combos$39.99$19.99 -
Sale!
High Back Leather Executive Adjustable Swivel Gaming Chair with Headrest and Lumbar
$199.99$139.99 Add to cartHigh Back Leather Executive Adjustable Swivel Gaming Chair with Headrest and Lumbar
Categories: Gaming, Gaming Chairs Tags: Adjustable, Chair, computer chairs, Desk, Executive, Gaming, Girl, Headrest, High, High Back Leather Executive Adjustable Swivel Gaming Chair, High Back Leather Executive Adjustable Swivel Gaming Chair with Headrest, High Back Leather Executive Adjustable Swivel Gaming Chair with Headrest and Lumbar, High Back Leather Executive Adjustable Swivel Gaming Chairs, Leather, Lumbar, Office, Racing, Swivel$199.99$139.99 -
Sale!
Professional LED Light Wired Gaming Headphones with Noise Cancelling Microphone
$29.99$19.99 Select optionsProfessional LED Light Wired Gaming Headphones with Noise Cancelling Microphone
SKU: N/A Categories: Electronics, Gaming, Gaming Headsets Tags: Cancelling, Gaming, Gaming Headphones with Noise Cancelling Microphone, gaming headset, Headphones, Headset, LED, Light, Mic, Microphone, Noise, Professional, Professional LED Light Wired Gaming Headphones, Professional LED Light Wired Gaming Headphones with Noise Cancelling Microphone, Wired, Wired Gaming Headphones, Wired Gaming Headphones with Noise Cancelling Microphone$29.99$19.99 -
Sale!
Gaming Desk with LED Lights USB Power Outlets and Charging Ports
$349.99$249.99 Select optionsGaming Desk with LED Lights USB Power Outlets and Charging Ports
SKU: N/A Categories: Computer Desk, Gaming, Gaming Desk Tags: and Charging Ports, Charging, Desk, Desks, Gaming, gaming desk with led lights, Gaming Desks with LED Lights, Home, LED, Lights, Monitor, Office, Outlets, Port, Power, Room, Stand, USB, USB Power Outlets, White, Workstation$349.99$249.99 -
Sale!
Wired Mixed Backlit Anti-Ghosting Gaming Keyboard
$99.99$79.99 Add to cartWired Mixed Backlit Anti-Ghosting Gaming Keyboard
Categories: Electronics, Gaming, Gaming Keyboards Tags: Antighosting, Backlit, Blue, brown, Gaming, Gaming Keyboard, gaming keyboards, gaming keyboards and mouse, Keyboard, Laptop, Switch, Wired, Wired Mixed Backlit Anti-Ghosting Gaming Keyboard, Wired Mixed Backlit Anti-Ghosting Gaming Keyboards, Wired Mixed Backlit Gaming Keyboard$99.99$79.99 -
Sale!
Wireless Bluetooth 5.3 ANC Noise Cancellation Hi-Res Over the Ear Headphones Headset
$119.99$59.99 Add to cartWireless Bluetooth 5.3 ANC Noise Cancellation Hi-Res Over the Ear Headphones Headset
Categories: Electronics, Gaming, Gaming Headsets Tags: 5.3 ANC Noise Cancellation Hi-Res Over the Ear Headphones Headset, ANC, Audio, Bluetooth, Cancellation, Ear, Earphone, gaming headset, Headphones, Headset, Hi-Res Over the Ear Headphones Headset, HiRes, Noise, Wireless, Wireless Bluetooth 5.3 ANC Noise Cancellation Hi-Res Headphones, Wireless Bluetooth 5.3 ANC Noise Cancellation Hi-Res Over the Ear Headphones Headset, Wireless Bluetooth 5.3 ANC Noise Cancellation Hi-Res Over the Ear Headphones Headsets$119.99$59.99 -
Sale!
Wired Sports Gaming Headset Earbuds with Microphone
$19.99$9.99 Select optionsWired Sports Gaming Headset Earbuds with Microphone
SKU: N/A Categories: Gaming, Gaming Headsets Tags: Accessories, Earbud, Earphone, Earphones, Gaming, gaming headset with microphone, Headphones, Headset, IOS, Microphone, Sports, Wired, Wired Sports Gaming Headset Earbuds, Wired Sports Gaming Headset Earbuds with Microphone, Wired Sports Headset Earbuds$19.99$9.99 -
Sale!
150W Universal Multi USB Fast Charger 16 Port MAX Charging Station
$49.99$29.99 Add to cart150W Universal Multi USB Fast Charger 16 Port MAX Charging Station
Categories: Charging Stations, Electronics Tags: 150W, 150W Charging Station, 150W Universal Multi USB Charging Station, 150W Universal Multi USB Fast Charger 16 Port MAX Charging Station, 150W Universal Multi USB Fast Charger 16 Port MAX Charging Stations, 150W Universal Multi USB MAX Charging Station, 16 Port MAX Charging Station, 3.5A, Charger, Charging, Fast, laptop charging stations, Max, Multi, Port, Stand, Station, Universal, USB$49.99$29.99
Hey ya'll! I created a second channel with more Python content (including additional Pandas tips & tricks).
Please consider subscribing 😊
http://youtube.com/techtrekbykeithgalli
Thank you❤❤❤
Thanks for this amazing video but i just realized something.
This pandas csv stuff is kinda boring than numpy and matplotlib stuff 😭😭😭😭
let me know if i'm wrong but in the last example, when we are using concat to join chunks of grouped data , it will show an error of duplicate values… example, say first five rows have 4 grass pokemons so now the result df will have a row "Grass 4" which is now new_df… now imagine the next five rows has 1 grass pokemon.. so now the result table will have a row "Grass 1" which when concatenated to new_df will show a duplicacy error as per pandas citation.. (im assuming here that you added ['count'] in the code so there's only one column as there were in previous dfs)
great video btw, no meanness intended, just wanted to clarify 🙂
Vai tumara bouth bara fan hu❤
Vai tumara bouth bara fan hu❤
One of best video
Excellent video, good job!
are these still applicable in 2023
thanks for useful video
If anybody have a problem with calculating the mean of Type 1 grouped data, use this:
df= pd.read_csv('modified.csv')
df.groupby(['Type 1']).mean(numeric_only=True)
instand of this:
df= pd.read_csv('modified.csv')
df.groupby(['Type 1']).mean()
That way, it won't include string-type data in the mean and sum functions.
How to use grouping with arbitrary functions? I.e. not only sum, mean, etc.
This video was super helpful, thank you Keith!
In case anyone gets to the end of this video, around 48:00, Keith talks about the groupby operator and starts to go over the section "Aggregate Statistic using Groupby (Sum, Mean Counting)". You might run into errors due something that changed after Pandas version 2.0.0.
Instead of writing: df.groupby(["Type 1"]).mean()
Try writing: df.groupby(["Type 1"]).mean(numeric_only=True)
After version 2.0.0 the numeric_only value was changed to False versus True as it's default, causing errors such as "can not convert strings". Hope this is helpful, have a good one!
Regarding the groupby part where we are looking for the averages of all Pokemon using the .mean() Method. VS Code always shows me a error message, saying some of my data isn't in numeric format and therefore can't be added together. Is there anything I'm missing?
Great information, helped a lot thanks !
Thanks Keith.
I loved this video, it is great!
I dont understand why you big data size shrink by count()
Rare i had to place a sep for the to_csv cause everything was in 1 colum when i tried to open the file with the modified data
You legend!
20:30 hi Keith. For me, I can't add as the entries are strings and getting concantenated. How to convert a specific column of a drataframe to integer? Any one can reply if they have a solution. Thanks.
Hello, i'm just learning python for data science and I find your explanations so good!
i've run into a problem when adding the ".mean()" at 48:36, I get an error "Could not convert "and a single string of text containing the names of the pokemon. Can anybody help me out here?
Great video! Very well explained in details.
The video is excellent! I have one issue though.
When using an iloc function, the value needs to be a label or an integer.
But my programs uses Float, and I simply cannot find an alternative function that let's me access a specific value of the Float variant. Any nice tricks or clever workarounds?
Thank you so much for covering all important points quickly and in easily understandable language along with simple examples!!
I can't see the whole page only the top portion. Please help
This was a very helpful video, great tutorial, THANK YOU!
Do note doing it directly in jupyter notebook for groupby function will enable numberic_only = True to calculate the mean. However, if you are using any IDE, then you will need to manually select the targeted rows that do not share categorical data ie.
new_cols = df_v1.columns # create a list of the new table with total on column 4
df[df_v1[2:12].groupby(['Type 1', 'Type 2')].mean() #calculate mean by type 1 first category sub category type 2
self note: tomorrow continue from 15:43
Can you help me out ? If I write something df['name'] ….as examples……. although the " name " is present in the column name but I found everytime error…..
please help! using groupby and mean together threw an error saying cannot convert [all pokemon names] to numeric. i had to do this
df.select_dtypes(include='number').groupby(df["Type 1"]).mean()
you r so cute when someone is texting you
This was actually so informative and easy to follow. Many thanks!!
Good stuff!