In this article we will try to understand what the normal distribution, or Gaussian distribution, is. These two are terms for the same thing. We’ll also understand what a z-score is, and we’ll do some Python coding to see how exactly these concepts are used in the field of data science and machine learning. So let’s get started.

Let’s say you want to do data analysis on a data set of people’s heights, and you are wondering what kind of analysis you can do. Say you are a data scientist working for a clothing store. To produce clothes of certain sizes, the clothing company has assigned you this task, so you want to do some analysis on people’s heights. Now suppose you plot these heights on a histogram.
It will look like this. What is a histogram? It is a frequency distribution. For example, you have three data samples with heights between 5 and 5.5 feet. What are those samples? Well, 5.1, 5.2 and 5.4. So you are just plotting the height ranges and the counts of samples in each range on a simple chart, which is known as a histogram. When you draw a curve that passes through this histogram, the curve looks like a bell. You might have seen a bell in a church or a temple, and that’s why this curve is known as a bell curve.

So this is what a normal distribution is. In a normal distribution, you will have most of your data samples around the average value, and then you will have some data samples that are far away from the average on the left and right-hand sides. For example, you have some people whose heights are six or maybe around seven feet.
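The counting behind the histogram described above can be sketched in a few lines of Python. This is only an illustration: the heights are made-up values (only 5.1, 5.2 and 5.4 come from the example), and the half-foot bin width is an assumption.

```python
import numpy as np

# Hypothetical heights in feet; only 5.1, 5.2 and 5.4 come from the text.
heights = np.array([4.6, 5.1, 5.2, 5.4, 5.6, 5.7, 5.8, 5.9, 6.1, 6.3, 6.6])

# Half-foot bins: [4.5, 5.0), [5.0, 5.5), ..., with the last bin closed at 7.0.
bin_edges = [4.5, 5.0, 5.5, 6.0, 6.5, 7.0]

# np.histogram returns the number of samples that fall into each bin.
counts, edges = np.histogram(heights, bins=bin_edges)

# Print a text version of the histogram, one row per bin.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f} ft: {'#' * int(count)} ({int(count)})")
```

The bins in the middle collect the most samples and the bins at either end collect the fewest, which is the bell shape in miniature.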
You’ll have a very small percentage of people whose height is around seven feet. Similarly, you will have a small percentage of people whose height is, let’s say, less than four feet. So the idea is that in nature we find many data sets which follow the normal distribution.

For example, let’s say you are looking at the prices of two-bedroom apartments in Bangalore city. On average, most of the apartments cost around 90 lakh rupees. Let’s say I took samples of a few thousand property prices: for 90 lakh rupees I have around 280 data samples. You will have very few data samples whose price is on the higher end, and very few whose price is on the lower end. So most of the values are centered around the average, and as you go farther away from the average the number of data samples reduces.

You see similar behavior when you examine test scores for a given classroom. Let’s say you took a mathematics test that is scored out of 100, and the maximum score is 85. You will find very few people who have high marks and very few people who have very low marks; most people will have marks in the average range. Another example is employee performance. When a company evaluates its employees’ performance, the majority of employees fall in the average category, with a few best performers and a few low performers.

So we naturally see the normal distribution in many data sets, and therefore knowing the normal distribution is very important for data scientists and machine learning engineers.

Now you’ll ask me: okay, I understood the normal distribution, but how can I use this in real life? How can I use it in data analysis? The classical use case is during the data cleaning process, where you can use the normal distribution and standard deviation for outlier removal. Let’s say I use the same data set, with one change.
I have added a new data point, Smith, whose height is nine feet. An outlier is a data point whose value is very different from your typical values. Here, people’s heights are usually around five or six feet, and you don’t see a person with a height of nine feet. These kinds of outliers can occur because of an error in the data collection process or in the data transmission process. Sometimes you might have valid data points as well: I think the tallest person in recorded history was about 8 feet 11 inches tall. But when you plot them on a histogram, you can clearly see that those data points stand out; they are very far away from your regular data points. During the data science process, you want to treat the outliers.
By that, I mean you either want to remove them or you want to apply some other method to handle them. If you don’t treat outliers, they will create problems in your data science and machine learning process; your machine learning model might get skewed because of their presence. So it’s important that you treat the outliers, either by removing them or by applying some kind of transformation.

Here I have a very simple data set. When I have a data set with, let’s say, a million data points, I won’t know which ones are outliers just by looking at them. You need some kind of formula, some math, to figure out those outliers. What formula is that? Well, you first need to understand what standard deviation is. I made a very short video on that, so please go to YouTube, search for “codebasics standard deviation”, and watch that eight-minute video to get an understanding of standard deviation.
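As a quick reference, standard deviation measures how far, on average, the data points sit from the mean. Here is a minimal sketch, assuming a small made-up list of heights in feet and the population form of the formula (dividing by the number of points N rather than N - 1):

```python
import math

# Hypothetical heights in feet, made up for illustration.
heights = [5.0, 5.5, 6.0, 6.5, 7.0]

# Mean: the sum of the values divided by how many there are.
mean = sum(heights) / len(heights)

# Population variance: the average squared distance from the mean.
variance = sum((h - mean) ** 2 for h in heights) / len(heights)

# Standard deviation: the square root of the variance, in the same unit as the data.
std_dev = math.sqrt(variance)

print(f"mean = {mean:.2f} ft, standard deviation = {std_dev:.2f} ft")
```

Note that libraries differ in their defaults: NumPy’s `np.std` divides by N unless you pass `ddof=1`, while the pandas `Series.std` method divides by N - 1 by default. On large data sets the difference is negligible.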
I will now explain how statisticians use standard deviation to remove outliers. Here I have the same histogram with the bell curve again. In the middle you have the mean, your average point. On the right-hand side you see the sigma symbol (σ), which stands for standard deviation, so you have plus one standard deviation, plus two, plus three, minus one standard deviation, and so on.

For normally distributed data, mathematicians and statisticians have established that about 68.3 percent of data points fall within plus or minus one standard deviation of the mean. Similarly, about 95.5 percent of all data points fall within plus or minus two standard deviations, and about 99.7 percent fall within plus or minus three standard deviations. Now you can use this knowledge to remove outliers.
The general guideline is that any data point beyond three standard deviations from the mean, that is, greater than the mean plus three standard deviations or less than the mean minus three standard deviations, can be treated as an outlier. This is a general guideline, okay; there is no fixed rule. When the data set is small, I have seen people use two standard deviations as well. So it’s a matter of using your sense of judgment as a data scientist to figure out the correct threshold.

What we’ll do now is some Python coding, where we’ll use standard deviation to remove the outliers from our data set. I went to a particular Kaggle data set and downloaded this CSV file. The file has height and weight; I have removed weight from it, so now the file contains only people’s heights.
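Before turning to the actual file, the three-standard-deviation guideline can be sketched on synthetic data. Everything in this example is an assumption made for illustration: the heights are randomly generated (mean 5.5 feet, standard deviation 0.3 feet), and the nine-foot value is added by hand to play the role of Smith. It is not the Kaggle data.

```python
import math
import numpy as np

# Share of a normal distribution that lies within k standard deviations of the mean.
for k in (1, 2, 3):
    print(f"within {k} sigma: {math.erf(k / math.sqrt(2)):.4f}")

# Synthetic heights in feet (assumed mean 5.5, std 0.3), plus one hand-added bad value.
rng = np.random.default_rng(seed=0)
heights = np.append(rng.normal(loc=5.5, scale=0.3, size=1000), 9.0)

mean = heights.mean()
std = heights.std()

# z-score: how many standard deviations each value lies from the mean.
zscores = (heights - mean) / std

# Flag anything more than three standard deviations away on either side.
is_outlier = np.abs(zscores) > 3
outliers = heights[is_outlier]
cleaned = heights[~is_outlier]

print("outliers:", np.round(outliers, 2))
print(f"kept {len(cleaned)} of {len(heights)} values")
```

The first loop prints roughly 0.6827, 0.9545 and 0.9973, matching the percentages quoted above. The nine-foot value is flagged because its z-score is far greater than 3; a few genuinely random values may be flagged too, which is expected since about 0.3 percent of normal data lies beyond three standard deviations. Changing the threshold from 3 to 2 gives the stricter variant mentioned for small data sets.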