You will hear this term a lot if you are studying data science, and it’s a very simple concept. It’s the range between the 25th and 75th percentile. If the range is between min and max, then it’s just called a range in general. So the general range and IQR are kind of the same idea, but IQR is between the 25th and 75th percentile. If you want to know more about IQR, I have a video where I showed how you can use IQR to remove outliers; the link is in the description below, please check it out. All right, back to our example of outlier removal. So how do you remove this outlier? Well, one approach is to say: I want to remove all the values that are above the 99th percentile. Here the 99th percentile will be 10 million, and we’ll write the code for this same thing.
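To make the difference between the general range and the IQR concrete, here is a minimal sketch using NumPy. The income values are an assumption: they are made-up numbers chosen so that the 25th and 75th percentiles match the ones quoted in this tutorial, not the actual CSV from the video.

```python
import numpy as np

# Hypothetical incomes (assumed values, one extreme outlier at the end)
incomes = np.array([4000, 5000, 6000, 7000, 7500, 8000, 10_000_000], dtype=float)

# 25th and 75th percentiles using NumPy's default linear interpolation
q25, q75 = np.percentile(incomes, [25, 75])

iqr = q75 - q25                              # spread of the middle 50% of the data
full_range = incomes.max() - incomes.min()   # spread from min to max

print("25th percentile:", q25)
print("75th percentile:", q75)
print("IQR:", iqr)
print("Range:", full_range)
```

Notice that the outlier blows up the general range but barely affects the IQR, which is why the IQR is handy for outlier detection.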
So you’ll see how easily we can remove that 10 million outlier using percentile. Okay, just to summarize, we looked at two use cases of percentile here. One is outlier removal in the income data set. The other use case is very common, by the way: if you have taken the SAT or GRE exam, you know that many exams nowadays have a percentile score, or relative score, rather than a percentage, or absolute score. That means if you get, let’s say, 75 percent in the examination but no one else got more marks than you, so you got the highest mark, then although your absolute percentage is 75 percent, your percentile will be 100, because it’s the maximum. So percentile is relative scoring, and it’s used a lot in exam scores.
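Here is a small sketch of the exam example. The marks are invented for illustration, and the percentile rank is computed as the share of scores less than or equal to yours, which is only one of several common definitions.

```python
import numpy as np

# Hypothetical marks out of 100 for seven students; 75 is the highest mark
marks = np.array([40, 52, 58, 61, 66, 70, 75])
my_mark = 75
max_marks = 100

# Absolute score: what fraction of the total marks I got
percentage = my_mark / max_marks * 100

# Relative score: share of students who scored at or below my mark
# (assumed definition; other conventions exist)
percentile_rank = (marks <= my_mark).mean() * 100

print("Percentage:", percentage)
print("Percentile rank:", percentile_rank)
```

The absolute score stays at 75 percent, while the relative score comes out as 100 because nobody scored higher.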
Okay, now if you’re curious about how exactly percentile is calculated, here is the page. This is a very good website, statisticshowto.com; I will put a link to it below. You can read the math, but even if you don’t want to go into the details of the math, it’s fine as long as you understand the concept.
Now, what the heck is mode? Well, mode is very simple; you guys are using it in everyday life. Let’s say you are working in a team and the team wants to go for a team lunch. Your manager will say, okay, I want to conduct a restaurant survey, so the people in the team will fill out the survey. Rob and Rafi will say they want to eat Mexican, Nina Italian, and so on. What will your manager do? Simple: he will pick the option with the maximum votes. The maximum votes are for Mexican, and that option with the maximum votes is nothing but the mode.
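The restaurant survey can be sketched in a few lines of pandas. Only the first three votes (Rob, Rafi, Nina) come from the example; the remaining votes are made up to fill out the survey.

```python
import pandas as pd

# Survey responses: Rob, Rafi, Nina, then three hypothetical teammates
votes = pd.Series(["Mexican", "Mexican", "Italian", "Indian", "Mexican", "Italian"])

# How many votes each cuisine received
vote_counts = votes.value_counts()

# mode() returns the most frequent value(s); take the first one
winner = votes.mode()[0]

print(vote_counts)
print("Mode:", winner)
```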
So mode is the most frequently occurring value in a data set. We use mode for so many things in real life; we just don’t know this fancy term called mode, so I thought I would give you a quick idea.
Let’s write Python code now. I have the income data in a CSV file and I have loaded that into a pandas data frame. Here you can see there are seven rows in total, and what we’re going to do is use percentile to remove the outlier that we have here. The first function I will call is df.describe(). This will give you basic statistics of a data frame such as count, mean, standard deviation and so on.
But look at these three important numbers. This is showing you the 25th percentile, which is 5500. If you remember from our presentation, our 25th percentile was 5500 and our 75th percentile was 7750; see, 7750. So the describe function gives you all those awesome statistics. You can also call df.income, which will give you this column, which is a pandas Series, and then call the quantile function on it. If you do quantile(0.75) it will give you the same number, 7750. If you give 0.25, it gives you this number, 5500. You can give any number between 0 and 1, by the way; you can give 0.45, for example, for the 45th percentile, and it will just work. If you do a Google search on this API, keep in mind what I said in the presentation.
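A minimal sketch of the describe and quantile calls follows. The video loads the data from a CSV file; here the data frame is built inline, and the names Mohan and Tao as well as all income values are assumptions chosen so that the percentiles match the numbers quoted above.

```python
import pandas as pd

# Hypothetical income data (assumed values, built inline instead of read from CSV)
df = pd.DataFrame({
    "name": ["Rob", "Rafi", "Nina", "Sofia", "Mohan", "Tao", "Elon Musk"],
    "income": [5000.0, 6000.0, 4000.0, 7500.0, 8000.0, 7000.0, 10_000_000.0],
})

# Basic statistics: count, mean, std, min, 25%, 50%, 75%, max
stats = df.describe()
print(stats)

# quantile() expects a fraction between 0 and 1, not a number like 25
q25 = df.income.quantile(0.25)
q75 = df.income.quantile(0.75)
q45 = df.income.quantile(0.45)

print(q25, q75, q45)
```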
I told you that there are different ways of calculating a quantile. For example, let’s look at the 25th percentile: okay, 5500. But if you look at the documentation (if you do Shift+Tab), you see that this parameter, interpolation, is linear by default. There are a couple of other values that it can take, such as lower, higher, nearest and midpoint, so let’s try lower and higher. If I set interpolation equal to lower, let me show you what it does: it actually gives you this data point, 5000. So if I do lower it gives you 5000, and if I do higher it gives you 6000. The 25th percentile falls between 5000 and 6000, and if you use linear interpolation it just interpolates between those two numbers, which is how we got 5500. Okay, now if you ask for the 0th quantile you get the minimum, 4000, and if you ask for quantile 1 you get the maximum number.
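The effect of the interpolation parameter can be checked with a short sketch. As before, the data frame is an assumed stand-in for the CSV used in the video, with made-up values that reproduce the quoted results.

```python
import pandas as pd

# Hypothetical income data (assumed values)
df = pd.DataFrame({
    "name": ["Rob", "Rafi", "Nina", "Sofia", "Mohan", "Tao", "Elon Musk"],
    "income": [5000.0, 6000.0, 4000.0, 7500.0, 8000.0, 7000.0, 10_000_000.0],
})

# The 25th percentile falls between two data points, 5000 and 6000
linear = df.income.quantile(0.25)                          # default: interpolate between them
lower = df.income.quantile(0.25, interpolation="lower")    # take the lower data point
higher = df.income.quantile(0.25, interpolation="higher")  # take the higher data point

# The extremes: quantile 0 is the minimum, quantile 1 is the maximum
q_min = df.income.quantile(0)
q_max = df.income.quantile(1)

print(linear, lower, higher, q_min, q_max)
```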
That is Mr. Elon Musk’s income; it will give you that. All right, now we want to use quantile to remove the outlier. So how about we use, let’s say, 0.99? I want to discard anything that is above the 99th percentile. The 99th percentile is about 9.4 million, so let’s store this in a variable, percentile_99. Then I will say, okay, in my data frame give me all the data points that are greater than this percentile, and it gives me Elon Musk. This is an outlier. If you want to remove it and prepare a new data frame, call it df_no_outlier, you have to reverse the condition and say that my valid values are anything less than the 99th percentile. I just removed the outlier and my data frame is ready. This is what data scientists do: they do a lot of data cleaning and data preparation, and outlier removal is a very important step in that process.
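The outlier removal just described could look roughly like this. The data frame is again an assumed stand-in for the CSV from the video, and the sketch keeps values at or below the threshold so that the filter is the exact reverse of the "greater than" condition.

```python
import pandas as pd

# Hypothetical income data (assumed values)
df = pd.DataFrame({
    "name": ["Rob", "Rafi", "Nina", "Sofia", "Mohan", "Tao", "Elon Musk"],
    "income": [5000.0, 6000.0, 4000.0, 7500.0, 8000.0, 7000.0, 10_000_000.0],
})

# Threshold: the 99th percentile of income (roughly 9.4 million for this data)
percentile_99 = df.income.quantile(0.99)

# Rows above the threshold are treated as outliers
outliers = df[df.income > percentile_99]

# Reverse the condition to keep only the valid rows
df_no_outlier = df[df.income <= percentile_99]

print(percentile_99)
print(outliers)
print(df_no_outlier)
```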
Now let’s look at the second use case, which is filling missing values. Again I have this data frame, and in it let me make Sofia’s income null so that I can show you how a NaN can be filled using the median. Row number 3 gives me Sofia, and at row number 3 I want to set the income to np.nan. Actually, I need to use iloc; iloc will give you that location. All right, so this ran, but if I look at the data frame now, something is not right. It looks like there is some problem, so I’ll do it the reverse way: I will take the column, df.income, and at index 3 I will store NaN. Now you see Sofia’s income is NaN.
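One way to blank out a single cell is sketched below. Note that this swaps in df.loc with a row label and column name rather than the column-then-index assignment used in the video, because chained assignment of that kind may not update the original frame in recent pandas versions. The data frame and its values are assumptions, and the income column is kept as float so it can hold NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical income data (assumed values); float column so NaN fits without upcasting
df = pd.DataFrame({
    "name": ["Rob", "Rafi", "Nina", "Sofia", "Mohan", "Tao", "Elon Musk"],
    "income": [5000.0, 6000.0, 4000.0, 7500.0, 8000.0, 7000.0, 10_000_000.0],
})

# Set the income in the row labelled 3 (Sofia) to NaN in a single indexing step
df.loc[3, "income"] = np.nan

print(df)
```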
Let’s assume that in real life, when you are dealing with your data, many times some of the data points will not have a value, because of a data transmission error, because the data is not available, or for whatever reason. This is what happens in a data scientist’s life: some of the values are missing and you need to substitute them with a reasonable guess. Let’s say I use the mean as the substitute. My mean for the income is this much, about 1.6 million, and it is very high because of this outlier. There is a function called fillna, and if I use it to fill the NaN with the mean (let me store the result in a new variable), it’s going to be a disaster. All these other guys have incomes of five thousand, six thousand, four thousand, but Sofia’s income is now 1.6 million, and it became this high because of the outlier.
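Here is a rough sketch of the mean fill and why it goes wrong. The data frame is an assumed stand-in for the one in the video, with Sofia's income already missing; df_mean_filled is a hypothetical variable name.

```python
import numpy as np
import pandas as pd

# Hypothetical income data (assumed values) with Sofia's income missing
df = pd.DataFrame({
    "name": ["Rob", "Rafi", "Nina", "Sofia", "Mohan", "Tao", "Elon Musk"],
    "income": [5000.0, 6000.0, 4000.0, np.nan, 8000.0, 7000.0, 10_000_000.0],
})

# mean() skips NaN by default; the single outlier drags it up to about 1.67 million
mean_income = df.income.mean()

# Fill the missing value with the mean and store the result in a new data frame
df_mean_filled = df.copy()
df_mean_filled["income"] = df_mean_filled["income"].fillna(mean_income)

print(mean_income)
print(df_mean_filled)
```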
So here is a classic case where you should not use the mean; you should use the median instead. We can copy and paste the same function here, but instead of mean you use median, and when you do that the values look more reasonable. 6500 dollars as a substitute for a value that is not available sounds far more reasonable than filling it with 1.6 million. So you can see that whenever you have outliers and the values are not in the same kind of range, you should use the median instead of the mean to fill the NaN values.
That’s all I had for this tutorial. The most important part of this tutorial is the exercise. I have this exercise for you, and the link for it can be found in the video description below. You need to use the Airbnb New York data set, which is available on Kaggle. Download the CSV file by clicking on this button and use percentile to remove the outliers. That data set has some outliers, and you want to remove them using an appropriate percentile on both the lower and the upper end.
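For reference, the median fill from the income example could be sketched like this. It uses an assumed stand-in for the income data frame from the video (not the Airbnb data set from the exercise), and df_median_filled is a hypothetical variable name.

```python
import numpy as np
import pandas as pd

# Hypothetical income data (assumed values) with Sofia's income missing
df = pd.DataFrame({
    "name": ["Rob", "Rafi", "Nina", "Sofia", "Mohan", "Tao", "Elon Musk"],
    "income": [5000.0, 6000.0, 4000.0, np.nan, 8000.0, 7000.0, 10_000_000.0],
})

# median() skips NaN by default and is barely affected by the outlier
median_income = df.income.median()

# Fill the missing value with the median and store the result in a new data frame
df_median_filled = df.copy()
df_median_filled["income"] = df_median_filled["income"].fillna(median_income)

print(median_income)
print(df_median_filled)
```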