We need to understand the normal distribution before we move on to the log-normal distribution. Here I have a dataset of people's heights, and if you plot it on a histogram it looks like a bell curve. If you are familiar with the normal distribution, you know this shape is called a bell curve, and this is a normal distribution. There are many real-life examples that are roughly normally distributed, such as test scores, employee performance, and so on.

But let's think about a dataset of people's incomes. Most people here in the US have an income around $50,000, but on the higher end you have people like Jeff Bezos and Elon Musk, who could be earning far more than the regular population.
So this curve is right skewed. It looks very different from a normal distribution, because on the right-hand side the tail kind of never ends: people might earn 1 billion or 2 billion, so the income could be really high. Employee performance, on the other hand, stays within a limited range, and a test score can never be more than 100. That's why those distributions form a bell curve, whereas the income distribution is right skewed and its tail can get very long.

But if you apply the log function to the x axis, it becomes a normal distribution. You see, I'm adding a zero from one tick to the next: this one is multiplied by 10, and this one is multiplied by 10 again. Equal steps on a log axis mean multiplying by the same factor, and that is the fundamental idea behind log.
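To make the "adding a zero" idea concrete, here is a minimal sketch using only Python's standard library. The income values are made-up round numbers chosen for illustration; they are not taken from the chart in the video.

```python
import math

# Made-up round incomes for illustration: each one is a power of 10.
incomes = [10_000, 100_000, 1_000_000, 10_000_000, 1_000_000_000]

# log10 counts the zeros, so multiplying by 10 only adds 1 on the log axis.
log_values = [math.log10(income) for income in incomes]

for income, log_value in zip(incomes, log_values):
    print(f"{income:>13,} -> log10 = {log_value:.1f}")
```

A billion-dollar income sits a hundred thousand times further out than $10,000 on the raw axis, but only five units away on the log axis, which is why the long tail gets pulled in.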
So, again, when I apply the log function to the x axis, the distribution becomes normal. Here is what I did: I had this distribution, I applied log to the income, and it became a bell curve. When that happens, the original distribution is called a log-normal distribution. To repeat: if you get a normal distribution by applying the log function to a dataset, then that dataset is said to have a log-normal distribution.

There are other examples of log-normal distributions, such as hospitalization days. Most people spend 5, 10, or 15 days. I hope you don't have to spend any days in the hospital, but unfortunately there are some critically ill patients who spend 300 or 400 days.
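The definition above can be checked with a small simulation. This is only a sketch on synthetic data: the parameters `10.8` and `0.7` are assumptions picked so the median lands near $50,000, and the `skewness` helper is a hypothetical name written for this example, not something from the video.

```python
import math
import random
import statistics

random.seed(42)

# Synthetic "incomes" drawn from a log-normal distribution (assumed parameters).
incomes = [random.lognormvariate(10.8, 0.7) for _ in range(100_000)]

# Taking the log of every value should give back a roughly normal sample.
log_incomes = [math.log(x) for x in incomes]


def skewness(values):
    """Sample skewness: about 0 for a symmetric bell curve, > 0 if right skewed."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


print("raw mean vs median:", statistics.fmean(incomes), statistics.median(incomes))
print("skewness of raw incomes:", skewness(incomes))
print("skewness of log incomes:", skewness(log_incomes))
```

The raw values have a mean well above the median and a clearly positive skewness, which is the long right tail. After the log, the skewness is close to zero, which is what you expect from a bell curve.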
My wife works in a hospital, and she says there are people who spend many days there, so this is also a log-normally distributed graph.

Another example is advertising budget. Small or mid-tier companies will not have much of an advertising budget, but big companies, the ones with higher revenues and a lot of consumers, might have a huge budget: 500 million, 1 billion, 10 billion. So when you're doing budget analysis, you will also come across this type of log-normal distribution.

How is the log-normal distribution used in data science? We have seen this example before, but let's say you are trying to build a machine learning model that predicts whether or not to give a loan to a person. You are doing credit risk analysis, and you want to figure out whether to approve a loan for a given person.
Here you can see that Puja has a very high income. She seems to be a rich lady, and this value is quite different from the other values. If you use income as an independent variable when building your machine learning model, the model might not reach a high accuracy, because as a general rule many machine learning models perform better when the numbers are on a similar scale.

So you can apply a log transform to the income column and come up with a new column called log income. By applying the log function you get values in a similar range: you see 4.9 and 4.8, and although Puja has a high income, after applying log she has 5.7, which is in a similar range to the other numbers.

Just to summarize: log transform is a popular technique. If your data has a log-normal distribution, you apply a log transform and use the transformed column as a feature when building your machine learning model.

We're going to show the log-normal distribution using the seaborn library in Python.
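As a rough sketch of the log transform on the income column described above, assuming pandas and NumPy are available: the applicant labels and income figures below are hypothetical, chosen only so that the base-10 logs land near the 4.9, 4.8 and 5.7 values mentioned, and the column name `log_income` is also an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical loan applicants; the incomes are made up for illustration.
df = pd.DataFrame({
    "name": ["Applicant A", "Applicant B", "Applicant C", "Puja"],
    "income": [80_000, 63_000, 45_000, 500_000],
})

# Base-10 log transform: the outlier moves into the same range as the rest.
df["log_income"] = np.log10(df["income"])

print(df.round(2))
```

The raw incomes differ by a factor of more than ten, while the transformed values all fall between roughly 4.6 and 5.7. If a dataset can contain zero incomes, `np.log10` returns negative infinity for them, so a variant such as `np.log1p` is a common alternative in that case.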
I have a US income dataset which I got from the census.gov website. These are the income ranges, and, let's say, between twenty thousand and twenty-five thousand dollars there are six thousand people. I used this data, but I came up with a simplified version of the file where I have only two columns: one is the income and the other is the count. So the number of people who have an income up to $5,000 is this, between $5,000 and $10,000 is this, and so on.

I loaded that into my pandas data frame, and my data frame looks like this. I used the seaborn library to draw a bar plot, and you can see my bar plot looks log-normally distributed. By the way, I skipped all the data with more than two hundred thousand dollars of income. If I include all of that, you'll see a long tail, a very right-skewed graph. When you apply log to the x scale, it becomes more like a normal distribution. Just ignore the bar widths here, they are not uniform, but overall the chart looks more like a bell curve.

All right, I guess you now have a pretty good understanding of the log-normal distribution. The link to this code is given in the video description below. If you like this video, please share it with your friends. It's a simple concept, but we see log-normal distributions in our day-to-day life, and you will come across them while solving data science problems. If one is hurting your machine learning model's accuracy, don't forget to apply a log transform. Thank you.
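For reference, the plotting step described earlier could look roughly like the sketch below. It is not the code from the video: the counts are synthetic (drawn from a log-normal distribution with assumed parameters, then grouped into $5,000 brackets), not the census.gov figures, and it calls matplotlib's `bar` directly instead of seaborn's bar plot so that the x axis stays numeric.

```python
import random

import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-in for the simplified census file (NOT real data).
random.seed(0)
samples = [random.lognormvariate(10.8, 0.7) for _ in range(50_000)]

bracket = 5_000
upper_bounds = list(range(bracket, 200_000 + bracket, bracket))
counts = [0] * len(upper_bounds)
for value in samples:
    idx = int(value // bracket)
    if idx < len(counts):  # skip incomes above $200,000, as in the video
        counts[idx] += 1

# Two columns, as described: bracket upper bound and number of people.
df = pd.DataFrame({"income": upper_bounds, "count": counts})

fig, (ax_linear, ax_log) = plt.subplots(1, 2, figsize=(12, 4))

ax_linear.bar(df["income"], df["count"], width=bracket * 0.9)
ax_linear.set_title("Linear x axis: right skewed")

ax_log.bar(df["income"], df["count"], width=bracket * 0.9)
ax_log.set_xscale("log")  # same bars, log-scaled x axis
ax_log.set_title("Log x axis: closer to a bell curve")

# In a notebook the figure renders on its own; in a script, call plt.show().
```

On the log-scaled panel the bars have uneven visual widths, because equal dollar widths shrink toward the right of a log axis. That matches the non-uniform bar widths mentioned in the video.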