In very simple language, I will explain what a log, or logarithmic function, is and why it matters in data science and machine learning. We'll also look at some Python code. Let's get started.
Let's say I have five dollars and I put it in a bank that gives me a 5x return. If I put in five dollars it gives me 5x, and if I put in seven dollars it gives me 7x. It's a magic bank, you know, a very customer-friendly bank. After one year I will have 25 dollars, and after two years I will have 125, because every year the amount is multiplied by 5. To represent this concept in mathematics, you use something called an exponent: 5 raised to 2 is 25, and 5 raised to 3 is 125.
Now let me ask a different question, or rather the reverse question. I have 125 dollars in my bank today. I know that at some point I started with a base investment of five dollars, and I know the bank gives me a 5x return, but I don't know how long it took to get to 125. How do you find that? The answer is given by the log function: log of 125 to the base 5. The base is 5 because I started with a base investment of 5, and 125 is the amount I have today. The log tells me the answer is 3, because 5 multiplied together three times (5 x 5 x 5) is 125. So log is basically the reverse, or inverse, of the exponent function: it tells you which power of the base, 5, gives the amount, 125. (If you count strictly in years, the starting deposit is already 5 raised to 1, so 125 is reached after two years of growth; counting from a single dollar, it is exactly three years.)
The most popular log is log to the base 10, and log 10 to the base 10 is 1. In general, log x to the base x is 1. You have to remember this formula. If you have log 100 to the base 10, it can be written as log of 10 squared to the base 10. Then the 2 comes to the front, giving 2 times log 10 to the base 10, and you already know log 10 to the base 10 is 1, so this becomes 2. Similarly, log 1000 to the base 10 is 3.
Now let's see how log can be used in the data analysis process. Here I have downloaded company revenues into my pandas data frame. I have six companies, and these are the revenues.
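Before going through the revenue numbers, the exponent and log rules from a moment ago can be checked with a few lines of Python. This is only a minimal sketch using the standard math module; the variable names are chosen for illustration and are not from the original walkthrough.

```python
import math

base = 5

# Exponent: raise the base to a power.
print(base ** 2)  # 25
print(base ** 3)  # 125

# Log is the reverse question: which power of 5 gives 125?
# math.log(x, base) works in floating point, so the result is rounded
# before printing to hide tiny rounding noise.
power = math.log(125, base)
print(round(power, 6))  # 3.0

# Base-10 logs from the examples above.
log_10 = math.log10(10)      # log 10 to the base 10, close to 1
log_100 = math.log10(100)    # close to 2
log_1000 = math.log10(1000)  # close to 3
print(round(log_10, 6), round(log_100, 6), round(log_1000, 6))

# The rule "the exponent comes to the front": log10(10**2) == 2 * log10(10)
print(math.isclose(math.log10(10 ** 2), 2 * math.log10(10)))  # True
```

Rounding is used here because floating-point logs with an arbitrary base can be off by a tiny amount, so comparing with math.isclose is safer than comparing with ==.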
For example, Amazon has 386 billion dollars a year, Uber has 11, and so on. Now, if I want to compare revenues, I can use a bar chart, which is what people normally do: they use bar charts or other kinds of visualization for data analysis. So here I am using a bar chart to compare the revenues, and what I notice is that since Amazon's bar is so high, it almost flattens the smaller bars. I'm having a hard time comparing Jindal Steel and Axis Bank because their bars look almost the same. You see only one big bar, and everything else looks similar. A log axis solves this comparison problem. If you want to compare all the smaller players in a better way, you can call exactly the same plot function, but you supply an additional argument, logy=True, and you get this as a result. Now you see the bars are a little more comparable.
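The two charts can be sketched roughly like this. It assumes the revenue figures quoted in this walkthrough (in billions of dollars); only five of the six companies are named, so only five appear here. The data frame name df and the column names are my own placeholders, not necessarily the ones used in the original notebook.

```python
import pandas as pd

# Revenues in billions of dollars, as quoted in the walkthrough.
# The sixth company is not named there, so it is left out.
df = pd.DataFrame({
    "company": ["Amazon", "Uber", "Vedanta", "Axis Bank", "Jindal Steel"],
    "revenue": [386, 11, 11, 5.6, 4.7],
})

# Linear y axis: Amazon's bar dwarfs everything else.
ax_linear = df.plot(x="company", y="revenue", kind="bar", legend=False)
ax_linear.set_ylabel("revenue (billion USD)")

# Same call with logy=True: the y axis is drawn on a log scale,
# so the smaller companies become easier to tell apart.
ax_log = df.plot(x="company", y="revenue", kind="bar", legend=False, logy=True)
ax_log.set_ylabel("revenue (billion USD, log scale)")
```

In a Jupyter notebook both charts render inline; in a plain script you would also import matplotlib.pyplot and call its show function at the end.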
Here I can at least say that Jindal Steel's revenue is less than Axis Bank's. On this axis, this tick is 10 and this one is 100, so you can see that Axis Bank is close to 5, Jindal is a little less than 5, and Vedanta is more than 10. The actual values are 11 for Vedanta, 5.6 for Axis Bank, 4.7 for Jindal Steel, and so on. So using a log axis, you can do a better comparison when some values are very high and the other values are on an average scale. You will see people using a log axis a lot on occasions like this in a Jupyter notebook when performing data analysis.
Another use case of log is the log transform in machine learning.
Here is a classical problem: predicting whether a loan should be approved for a given person. For each person there are different features, such as credit score, income, and age, and based on these three columns you decide whether the loan will be approved or not. This is a standard supervised learning problem. Now, if you look at the income column, you will see that Puja's income is very high, 550,000 a year, whereas all the other people earn something like 32,000 or 77,000 a year. So you have this one data sample with a very high income, and you might have a couple of such samples. When you train a machine learning model, the sheer magnitude of this data point can negatively influence it, so basically your model can end up biased. To solve this problem, you can create a new column called log income.
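Creating that column might look like the sketch below. The three incomes are the ones from the example; every name other than Puja, the data frame name loans, and the column name log_income are hypothetical placeholders, and the credit score and age columns are left out for brevity.

```python
import numpy as np
import pandas as pd

# Incomes from the example; names other than Puja are placeholders.
loans = pd.DataFrame({
    "name": ["Puja", "Person B", "Person C"],
    "income": [550_000, 32_000, 77_000],
})

# Log transform: base-10 log of the income column.
# 550,000 becomes roughly 5.74, while 32,000 and 77,000 become
# roughly 4.51 and 4.89, so the values sit on a comparable scale.
loans["log_income"] = np.log10(loans["income"])

print(loans.round(2))
```

The raw incomes differ by a factor of about 17, while the transformed values differ by only a little more than 1, which is the point of the transform.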
You see the last column here; it is simply the log of the income, so log to the base 10 of the income column gives you this. Now Puja's income is 5.7 and the rest of the people are at 4.8, 4.5, and so on. Log brings all these values onto a more comparable scale, so you can compare them in a better way, and when you train a machine learning model on these comparable values, the model is less likely to be biased and will often give you more accurate results.
Measuring earthquakes is another classical example of a logarithmic function. When you compare, let's say, an earthquake of magnitude 5 with one of magnitude 4, you can tell one thing immediately: the magnitude 5 earthquake is 10 times more powerful, in terms of measured shaking amplitude, than the magnitude 4 one. So if you're living in California and going through an earthquake, it matters a lot whether it is a 4, a 5, or a 6. 5 is 10 times more powerful than 4, and 6 is 10 times more powerful than 5, which makes 6 a hundred times more powerful than 4.
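The "each step is 10 times" rule can be written as a tiny function. This is a sketch of the simplified rule used here (one step on the scale is a factor of 10); the function name is made up for illustration.

```python
def times_stronger(magnitude_a, magnitude_b):
    """How many times stronger quake a is than quake b on a base-10 log scale."""
    return 10 ** (magnitude_a - magnitude_b)

print(times_stronger(5, 4))  # 10
print(times_stronger(6, 4))  # 100
print(times_stronger(7, 4))  # 1000
```

So a difference of three steps on the scale is not three times stronger but a thousand times stronger.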
I have been through an earthquake in India, in the state of Gujarat, where the magnitude was somewhere around seven, and it was devastating. I had heard about earthquakes of magnitude five and six, and I always wondered why our earthquake at seven was so much more devastating than the ones at 4 and 5. The reason is that magnitude is on a logarithmic scale. So I hope this gives you some understanding of log and its use in data analysis and machine learning. There are many more use cases as well; for example, log is also used in loss functions. But this gives you a basic understanding.