I have 10,000 such data points with different people's heights, and I have loaded them into my Jupyter notebook. I have my data frame ready. If I do df.height (height is a column) and then call describe, it tells me I have a total of 10,000 rows, and my standard deviation for the height column is 3.84. I also get my min, max, and so on.
We already saw that we can use plus and minus 3 standard deviations to figure out the outliers. Before we do that, I would like to use the seaborn library to plot the bell curve and the histogram. The way you do that is to call the histplot function, supply your height column, and say kde is equal to True, which means it will plot the smooth curve as well. If kde is False, it just plots the histogram. Now you can clearly see this is a bell curve; it's a normal distribution. This comes up many times as a data scientist.
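The describe and histplot steps above can be sketched roughly as follows. The transcript does not name the data file, so this sketch generates synthetic heights as a stand-in; the seed, the distribution parameters, and the helper name plot_height_distribution are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in data: synthetic heights in inches. The mean and spread are chosen
# to resemble the values quoted in the video; they are assumptions.
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"height": rng.normal(loc=66.37, scale=3.85, size=10_000)})

# Summary statistics: count, mean, std, min, quartiles, max.
summary = df.height.describe()
print(summary)


def plot_height_distribution(heights):
    """Draw a histogram of heights with a KDE curve on top (hypothetical helper)."""
    # Imported here so the summary above also runs without plotting libraries.
    import seaborn as sns

    # kde=True overlays the smooth density curve; kde=False draws only the bars.
    return sns.histplot(heights, kde=True)
```

In a notebook cell, calling plot_height_distribution(df.height) should draw the histogram with the bell-shaped curve over it.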
When you're doing data exploration, you want to plot this kind of histogram to figure out whether the distribution is normal or not, and based on that you can make certain decisions. It is often said that about eighty percent of a data scientist's time is spent on data cleaning, because when the data comes in it is often messy: it has errors and it has legitimate outliers. You want to remove those outliers before building your model, and that's exactly what we're going to do here.
To remove the outliers, first you need to figure out the mean. The mean for my height is 66.36, and these heights are in inches, so 66.36 inches is my mean and my standard deviation is 3.84. We already saw the rule for this in the diagram.
I can use plus and minus 3 standard deviations to remove the outliers. If I do mean minus 3 standard deviations I get 54.82, and if I do mean plus 3 standard deviations I get 77.91. So what I'm saying is that any number between 54.82 and 77.91 is a valid number, and anything outside that is an outlier.
In a pandas data frame I can now say: if df.height is less than 54.82, then that's an outlier. I get two data points like that. I can also check the other condition, df.height greater than 77.91, and I get five such data points. See, there is someone with a height of 78.09, and so on.
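The limits and the two checks against them can be sketched like this. The data here is synthetic (an assumption, since the original file is not named), so the limits and counts will not match the 54.82, 77.91, two, and five quoted in the video. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Stand-in data: synthetic heights in inches (assumed, not the video's file).
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"height": rng.normal(loc=66.37, scale=3.85, size=10_000)})

mean = df.height.mean()
std = df.height.std()

# Anything more than 3 standard deviations from the mean counts as an outlier.
lower_limit = mean - 3 * std
upper_limit = mean + 3 * std
print(lower_limit, upper_limit)

# Boolean masks select the rows on each side of the valid range.
too_short = df[df.height < lower_limit]
too_tall = df[df.height > upper_limit]
print(len(too_short), len(too_tall))
```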
I can combine this into one condition with or: if df.height is less than 54.82 or greater than 77.91, then that's my outlier. So I found a total of 7 outliers out of 10,000 data points.
To remove these outliers, I can create a new data frame called df_no_outlier and just apply the reverse condition. What is the reverse condition? My clean data set is everything for which the height is greater than 54.82 and less than 77.91. When I check the shape, I find 9,993 rows, because seven data points are outliers. So hurray, as a data scientist you just did data cleaning: you removed the outliers, and a machine learning model built on this data frame is likely to be better.
Let's now talk about z-score. If you know standard deviation, you already know z-score; it is the same concept with a little tweak.
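Before the z-score discussion continues, the combined condition and the removal step described above can be sketched as follows. The data is again a synthetic stand-in, so the row counts will differ from the 7 and 9,993 in the video. The name df_no_outlier follows the narration.

```python
import numpy as np
import pandas as pd

# Stand-in data: synthetic heights in inches (assumed, not the video's file).
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"height": rng.normal(loc=66.37, scale=3.85, size=10_000)})

lower_limit = df.height.mean() - 3 * df.height.std()
upper_limit = df.height.mean() + 3 * df.height.std()

# One condition joined with | (or) picks up outliers on both sides.
outliers = df[(df.height < lower_limit) | (df.height > upper_limit)]

# The reverse condition joined with & (and) keeps only the valid rows.
df_no_outlier = df[(df.height > lower_limit) & (df.height < upper_limit)]

print(len(outliers))
print(df_no_outlier.shape)
```

Each comparison needs its own parentheses, because | and & bind more tightly than < and > in Python.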
I have my bell curve: in the middle I have the mean, and on the right-hand side I have plus 1 standard deviation, plus 2 standard deviations, and so on. Sigma is the symbol used for standard deviation. If I have a data point here at 2 standard deviations, then the z-score for that data point is 2. If I have a data point in the middle of 2 and 3 standard deviations, let's say 2.5, then the z-score for that data point is 2.5. Similarly, if a data point is here in the middle between minus 1 and minus 2, then the z-score of that data point is minus 1.5. So z-score is nothing but how many standard deviations away your data point is from the mean.
The z-score is for every single data point, so you can compute it for each one, and that's what we'll do for this particular data set. See here, for this height column I took the average, which is 5.25. Again, I took the standard deviation of this column.
I found it to be 0.61. Now from every single data point I can subtract the average and then divide by the standard deviation, and I get my z-score. So the formula for z-score is: data point minus average, divided by standard deviation. It's a pretty simple concept. If you know standard deviation, you already know z-score; there is no rocket science here.
Now let's apply z-score in our notebook and do the outlier removal. Here again I have my data frame with all the data points, and I want to calculate the z-score for every single data point. For that I need to create a new column called z-score. What is that column? You take your individual data point, df.height, and subtract the mean from it, which is df.height.mean(). You divide that by the standard deviation, which you get with df.height.std(). Now look at the data frame.
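Adding the z-score column might look like the sketch below. The column name zscore is an assumption (the narration only says "z-score"), and the data is a synthetic stand-in for the video's height file.

```python
import numpy as np
import pandas as pd

# Stand-in data: synthetic heights in inches (assumed, not the video's file).
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"height": rng.normal(loc=66.37, scale=3.85, size=10_000)})

# z = (x - mean) / std, computed for every row at once.
df["zscore"] = (df.height - df.height.mean()) / df.height.std()

print(df.head())
```

By construction, the new column has a mean of about 0 and a standard deviation of about 1, which is a quick way to check that the formula was applied correctly.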
I have now added a new column called z-score, and these are my individual z-scores. I want to verify how I came up with 1.94 for the first data point. It's pretty simple. My mean is this, and my standard deviation is this. Now I will use the formula. What is my formula? x, which is a data point, minus the mean, divided by the standard deviation. So I take my data point, subtract the mean, and divide by the standard deviation, and when I do that I get 1.94. See, 1.94. It's a fairly straightforward concept.
Once you have the z-score column, it becomes even easier to remove the outliers. I will first look at all the data points whose z-score is greater than 3.
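The manual check of one z-score and the first filter can be sketched like this. With the synthetic stand-in data used here, the first row's z-score will not be 1.94 and the number of rows above 3 will not necessarily be five. The column name zscore is assumed.

```python
import numpy as np
import pandas as pd

# Stand-in data: synthetic heights in inches (assumed, not the video's file).
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"height": rng.normal(loc=66.37, scale=3.85, size=10_000)})

mean = df.height.mean()
std = df.height.std()
df["zscore"] = (df.height - mean) / std

# Recompute the z-score of the first row by hand and compare with the column.
first_height = df.height.iloc[0]
manual_z = (first_height - mean) / std
print(manual_z, df.zscore.iloc[0])

# Rows more than 3 standard deviations above the mean.
high = df[df.zscore > 3]
print(high)
```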
I found five data points whose z-score is greater than 3, and for less than minus 3 there are two data points. If you want to see both of these in one shot, I can say less than minus 3 or greater than 3, and I get the same seven data points.
You can use the same technique to remove them. I can say df_no_outlier is equal to the reverse condition, the inverse condition. I will replace or with and, so my z-score has to be greater than minus 3 and less than 3, and that's my no-outlier data frame. Again you find 9,993 rows. So I removed my seven outliers and got a cleaned-up data frame on which I can perform further data analysis and even build a machine learning model.
Now comes the most important part of this tutorial, which is an exercise.
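Before moving on to the exercise, the z-score version of the removal can be sketched as below. As before, the data is synthetic and the column name zscore is assumed, so the counts will differ from the 7 and 9,993 in the video.

```python
import numpy as np
import pandas as pd

# Stand-in data: synthetic heights in inches (assumed, not the video's file).
rng = np.random.default_rng(seed=0)
df = pd.DataFrame({"height": rng.normal(loc=66.37, scale=3.85, size=10_000)})
df["zscore"] = (df.height - df.height.mean()) / df.height.std()

# Outliers on both sides in one shot: z below -3 or above 3.
outliers = df[(df.zscore < -3) | (df.zscore > 3)]

# Inverse condition: swap | for & and flip the comparisons.
df_no_outlier = df[(df.zscore > -3) & (df.zscore < 3)]

print(len(outliers))
print(df_no_outlier.shape)
```

This selects the same rows as the mean plus or minus 3 standard deviations check, since a z-score beyond 3 in either direction is the same condition expressed in units of standard deviation.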
I have given a link to the exercise page in the video description below. You can read the description and work on the exercise. Friends, working on these exercises is very important, so I want you to develop the solution yourself, and then you can click on the solution link to verify your answer against mine.
I hope this video gave you some understanding of z-score and normal distribution. Outlier removal was just one use case; there are many other use cases, which we will see as we progress in this tutorial series. By the way, the link to the playlist is given in the video description below, so please check the other videos as well. We are learning mathematics and statistics for data science and machine learning in very simple language, with a lot of practice. If you have a friend who thinks that math and statistics are hard, share this playlist with them so that they can remove that bias. These things are not that hard. You just need some discipline: learn the concepts in a clear, simple way, and then practice in Python.