Using a very simple example, we will discuss what mean and median are. We will also see how these concepts are used in the field of data science and machine learning, and then we’ll write Python code to practice some of these concepts. Most important, in the end we have an exercise for you to practice on. So let’s begin.

Let’s say you want to open a luxury car showroom in Monroe Township, New Jersey. Before opening the store, you will go through some data analysis. For example, you will examine the income level of people in that area, because if income levels are high, people will buy BMWs. If people are living on government aid, they’re probably not going to buy a BMW, so it doesn’t make sense to open a showroom. So you will get the data on people’s income, and one measure you can use to figure out whether to open the store is the average income.
Here, in this case, the average income for all these people is 6,250 dollars per month, which is not that high, so I’ll probably not open that luxury car showroom. But I got lucky here. What if this data set has some outliers? An outlier is an unusual value. Let’s say Mr. Elon Musk is living in my town. He obviously has a very high income, and if I use the average now, the average would be about 1.43 million. If I use this average to make that decision, it will be a bad decision, because obviously people are not earning 1.43 million on average. We got this number just because of one outlier, so in this case using the average is not a good idea.

So what should we do here? Well, let’s do this: let’s sort the values in ascending order, and here are my values. What if, instead of using the average, I use the middle value? See, in the data set my middle value is this 7,000.
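To see the effect of the outlier in code, here is a minimal Python sketch. The individual income values are not listed in this transcript; the ones below are an assumption, chosen because they reproduce the figures quoted here (an average of 6,250 without the outlier, about 1.43 million with it, and a middle value of 7,000).

```python
from statistics import mean

# Assumed monthly incomes (not listed in the transcript); chosen to
# reproduce the figures quoted in the discussion.
incomes = [5000, 6000, 4000, 7500, 8000, 7000]
outlier = 10_000_000  # the one very high earner in town

avg_without = mean(incomes)
avg_with = mean(incomes + [outlier])

# Sort in ascending order and pick the middle element (the count is odd).
ordered = sorted(incomes + [outlier])
middle_value = ordered[len(ordered) // 2]

print(f"average without outlier: {avg_without:,.2f}")
print(f"average with outlier:    {avg_with:,.2f}")
print(f"middle value:            {middle_value:,}")
```

A single extreme value moves the average by several orders of magnitude, while the middle value of the sorted list stays close to what most people in the list earn.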
I’m using the middle value, not the average, to make that decision, and this middle value is nothing but the median.

What if my data set has an even number of data points? Before, I had seven in total, so it was easy: there was a single middle number. But now I have an even number of data points, so what is my middle number? Is it this or that? Well, the problem is easy to solve: you draw a line here, take both numbers, and take their average. So in this case the median is 7,500. Overall, if the number of data points is even, you take the two middle values and average them, and that’s your median. If the number of data points is odd, you take the middle value, and that’s your median.

So this is the first use case of the median: simple descriptive statistics. The other use case is handling missing values. Let’s say you are building a machine learning model.
This model predicts whether a person’s loan should be approved or not. Here my features are credit score and monthly income, and some of my data points have missing values. For example, I don’t know what Sofia’s monthly income is. In this case, in data science, we try to estimate the monthly income. So here again you could use the average to estimate her monthly income. What will the average be, by the way? 1.6 million. Wow! I took all these values, but because I have Musk living in my town (he’s my neighbor) and because he has a 10-million-dollar income, it is skewing our average upward. So to say that Sofia might be earning 1.6 million dollars a month is not a good idea. Here again we can use the median: you take the middle two values and average them, which gives 6,500, and that’s probably a better approximation. So although 10 million is present here, the median allowed me to come up with a reasonable guess.

So far, in the field of data science, we looked at two use cases: one was descriptive analysis for opening a car showroom, and the second was data cleaning for a loan approval model. There are many other use cases, but I think these two will give you a solid understanding of how the median is used in real life.
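The even/odd rule and the missing-value idea can be sketched in a few lines of Python. The functions `median` and `fill_missing_with_median` are hypothetical helpers written for this illustration, and the income values are made up (the actual table is not reproduced in this transcript); they are chosen only so that the two middle values average to 6,500.

```python
def median(values):
    """Middle value of the sorted data.

    With an even count there is no single middle value, so the two
    middle values are averaged.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def fill_missing_with_median(values):
    """Replace None entries with the median of the known entries."""
    known = [v for v in values if v is not None]
    fill = median(known)
    return [fill if v is None else v for v in values]


# Illustrative monthly incomes (assumed values); None marks the missing one.
monthly_income = [4000, 5000, None, 6000, 7000, 8000, 10_000_000]

known = [v for v in monthly_income if v is not None]
mean_estimate = sum(known) / len(known)  # pulled far up by the 10 million entry
median_estimate = median(known)          # average of the two middle values

print(f"mean estimate:   {mean_estimate:,.0f}")
print(f"median estimate: {median_estimate:,.0f}")
print(fill_missing_with_median(monthly_income))
```

In real code you would more likely call `statistics.median` from the standard library; the hand-written version is here only to make the even/odd rule visible.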
Now let’s look at the same income example once again. Here, obviously, 10 million is an outlier, and the definition of an outlier is basically a data point that is very different from the rest of the data points. Now let’s look at how you can remove this outlier. To understand that, we need to understand the concept of a percentile.

Here again I have sorted the values in ascending order. You take the middle value, and what it shows is that fifty percent of the data points are on the left side and fifty percent are on the right side. So this is your median, basically, and this median is nothing but the 50th percentile. What does 50th percentile mean? The 50th percentile here is 7,000, and it means that about 50 percent of my data points have a value less than 7,000. I have seven data points in total.
Three of those data points are below 7,000, which is around fifty percent. So around fifty percent of the data points have a value less than 7,000; hence 7,000 is my 50th percentile.

Now think about it: what is my 100th percentile? You need to think this way (pause the video and think): which is the value for which a hundred percent of the data points are less than or equal to that value? Well, obviously that value is 10 million, so in this case the 100th percentile is 10 million. What is the 0th percentile? Obviously 4,000.

What is the 25th percentile? Again, I want you to pause this video right now and think about it. I’m pretty sure you will be able to come up with the right answer. All right: there are 7 values in total, and 25 percent of 7 is 1.75, which is approximately 2. That means I want you to draw a line after two data points. One and two: let’s draw a line. Here is the line.
Okay, so this line, this particular point, is my 25th percentile. This point is 5,500; that is my 25th percentile. Now, just to clarify, there are different ways of calculating a percentile. I just showed you one, but another approach is to take the 5,000 data point, and a third approach is to take 6,000. So the 25th percentile could be 5,500, 5,000, or 6,000: either 5,000, which is the lower value, 6,000, which is the higher value, or a value in between. I will show you a page that gives the exact formula for this.

What is the 75th percentile? Again, pause the video and think; you should be able to come up with the answer. There are seven values in total, and seventy-five percent of seven is 5.25, which is approximately five, so draw a line after five data points. One, two, three, four, five: here is my line. What is this data point? 7,750, so that’s my 75th percentile. If you have the 25th percentile and the 75th percentile, then the range between them is called the interquartile range.
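The three ways of computing a percentile, and the interquartile range, can be sketched as follows. This uses one common convention (position `p/100 * (n - 1)` in the sorted list), which is not necessarily the exact formula on the page the speaker refers to. The `percentile` helper and its `method` argument are hypothetical names for this illustration, and the income values are assumed: they are not listed in the transcript, but they are consistent with the numbers quoted in the discussion.

```python
import math


def percentile(values, p, method="linear"):
    """p-th percentile (0 <= p <= 100) of values.

    The position in the sorted data is p/100 * (n - 1). When it falls
    between two data points, `method` picks the lower point, the higher
    point, or a linear interpolation between them.
    """
    if not values:
        raise ValueError("percentile of empty data")
    if not 0 <= p <= 100:
        raise ValueError("p must be between 0 and 100")
    ordered = sorted(values)
    rank = p / 100 * (len(ordered) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    if method == "lower":
        return ordered[lo]
    if method == "higher":
        return ordered[hi]
    if method == "linear":
        return ordered[lo] + (rank - lo) * (ordered[hi] - ordered[lo])
    raise ValueError(f"unknown method: {method}")


# Assumed monthly incomes, already including the 10 million outlier.
incomes = [4000, 5000, 6000, 7000, 7500, 8000, 10_000_000]

q1_lower = percentile(incomes, 25, "lower")    # the data point below the line
q1_higher = percentile(incomes, 25, "higher")  # the data point above the line
q1 = percentile(incomes, 25)                   # halfway between the two
q3 = percentile(incomes, 75)
iqr = q3 - q1                                  # interquartile range

print("25th percentile (lower, linear, higher):", q1_lower, q1, q1_higher)
print("75th percentile:", q3)
print("interquartile range:", iqr)
```

Note that the interquartile range is computed only from the 25th and 75th percentiles, so the 10 million entry has no influence on it.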