Understanding ‘Maximum Likelihood Estimation’ the intuitive way

I will try to decode each of the terms, i.e.

‘Maximum’, ‘Likelihood’, ‘Estimation’

To begin with, I’ll take a toy example.

Scenario 1:

I have two groups of randomly generated data:

1. {3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5}

2. {3, 3, 5, 6, 7, 2, 1, 4, 6, 9, 11}

I know for sure that these are the only two distributions I am concerned with, and I need help finding the parent distribution of the following randomly picked subset:

Subset = {3,5}

Can you figure out which of the two distributions above the subset is more likely to have come from?

Take some time…

Yeah, you guessed right (assuming you did!). It seems pretty natural to go with distribution 1, and the intuition is correct.

But why?

When I asked my friend to answer, he said: “Well, since the first set has more 3’s and 5’s, it’s more likely that the ‘subset’ belonged to 1… common sense, bro!” I asked him if he could show me some mathematical proof, since I am more of a maths-oriented person, and he added:

“Well, a simple probability calculation explains this likelihood. In the first set, the probability of picking a 3 is 5/11 and the probability of picking a 5 is 6/11, so the joint probability of getting our ‘subset’ from distribution 1 is 5/11 × 6/11 = 30/121. In the second set, a 3 has probability 2/11 and a 5 has probability 1/11, so the joint probability is 2/11 × 1/11 = 2/121. So I went with the first option… are you happy now?”

I was delighted, because that’s the maths I was looking for to explain my intuition, and what he did was exactly a simple LIKELIHOOD ESTIMATION.
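Here is my friend’s computation as a quick sketch in Python (the function and variable names are mine):

```python
from collections import Counter

dist1 = [3, 3, 3, 3, 3, 5, 5, 5, 5, 5, 5]
dist2 = [3, 3, 5, 6, 7, 2, 1, 4, 6, 9, 11]
subset = [3, 5]

def likelihood(subset, dist):
    """Joint probability of drawing each subset value from dist."""
    counts = Counter(dist)
    p = 1.0
    for value in subset:
        p *= counts[value] / len(dist)
    return p

print(likelihood(subset, dist1))  # 30/121 ~ 0.2479
print(likelihood(subset, dist2))  # 2/121  ~ 0.0165
```

The first number is larger, so distribution 1 is the more likely parent, exactly as the intuition said.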

Cool… seems pretty easy, but you know what they say, right? Nothing is that easy. Let’s take a look at another case.

Scenario 2:

Now I just give you the subset:

{-0.67, -0.96, 1.20, 0.56, -0.58, 1.50, -0.81, -1.00, -0.074, -0.31}

and I ask you to estimate the source of this data, in other words, which probability distribution this data came from.

What would you do?

Confused? Well, you should be. I didn’t give you options 1 and 2 like in the previous example, so how do you count and calculate the probabilities? How do you find p(x) so that you can get the joint distribution of all the X’s?

Well, where you stand now is where most statisticians find themselves when they first get their hands on data and want to find the distribution a sample might have come from! Here is what they do, and probably what you should do too:

“Since we have no choice but to estimate the parent distribution, and you have to approximate it no matter what, make your life easy: assume that your subset was picked from some Gaussian distribution (why do people love the Gaussian?) and proceed the same way as before. The only difference is that, unlike the two choices you had previously, you now have infinitely many choices of σ and μ, and you have to find the parameters for which the probability of the X’s is maximized.” This is where the idea of MLE fits in; let’s see how.
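To see what this means in practice, here is a minimal sketch (assuming NumPy and SciPy are available; the two (μ, σ) pairs are arbitrary picks of mine, nothing special) that scores our subset under two candidate Gaussians, exactly the way my friend scored the two toy distributions:

```python
import numpy as np
from scipy.stats import norm

# The ten scenario-2 data points from above.
x = np.array([-0.67, -0.96, 1.20, 0.56, -0.58, 1.50,
              -0.81, -1.00, -0.074, -0.31])

# Two of the infinitely many candidate Gaussians (arbitrary picks).
candidates = [(0.0, 1.0), (2.0, 0.5)]

for mu, sigma in candidates:
    # Joint density of the sample: product of the individual densities.
    joint = np.prod(norm.pdf(x, loc=mu, scale=sigma))
    print(f"joint density under N({mu}, {sigma}^2): {joint:.3e}")
```

N(0, 1) wins this comparison by many orders of magnitude, but with infinitely many (μ, σ) pairs we clearly can’t check them one by one; we need a recipe for finding the best pair directly.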

I’ll tell you the generic way to do this:

Let’s say we have data points, lots of them; all the X’s are independent and identically distributed (meaning they follow the same probability distribution, and the selection of one data point doesn’t affect the selection of any other), and we know they come from some probability distribution function f(θ) (θ is a collective name for all the parameters, e.g. σ and μ in a Gaussian distribution, or ‘p’ in a Bernoulli distribution; we call all of them ‘θ’). We also assume that this f(θ) belongs to a certain family of probability distributions (e.g. the Gaussian family of σ’s and μ’s; I already explained why this assumption is necessary).

Now we want to calculate ‘θ’, i.e. the estimator of the parent distribution from which our data points were sampled, and we do it the same way as in our toy example:

Calculate the joint distribution (like my friend answered!):

f(x_1, x_2, \ldots, x_n \mid \theta) = f(x_1 \mid \theta) \times f(x_2 \mid \theta) \times \cdots \times f(x_n \mid \theta)

(this f can be the Gaussian distribution or whatever family you choose to work with)

Now let L be the likelihood. In the equation below we just change the subject to θ, so:

\mathcal{L}(\theta \,;\, x_1, \ldots, x_n) = f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta)

Seeing a product here, which is very cumbersome to work with (e.g. if you want to differentiate it), the clever mathematician takes the log, since it has the beautiful property of converting products into sums while preserving monotonicity (so the increasing/decreasing nature of the original function is still preserved; personally, I think the log function deserves a lot of credit for this :) ). So let’s do the same:

\ln \mathcal{L}(\theta \,;\, x_1, \ldots, x_n) = \sum_{i=1}^{n} \ln f(x_i \mid \theta)

and if we take the average:

\hat{\ell} = \frac{1}{n} \ln \mathcal{L}
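To make this concrete, plug the Gaussian density (the family we assumed in scenario 2) into the sum above; the average log-likelihood becomes:

\hat{\ell}(\mu, \sigma \,;\, x_1, \ldots, x_n) = -\frac{1}{2} \ln(2\pi) - \ln \sigma - \frac{1}{2\sigma^2} \cdot \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2

Every choice of (μ, σ) plugs into this one expression and gets a score; maximizing the score is the whole game.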

This average log-likelihood is the quantity being maximized (draw parallels to the toy example to understand this better), and the θ for which it is largest is our best estimate of the parent distribution’s parameters. Formally we denote it ‘theta hat’:

\hat{\theta}_{\mathrm{mle}} \in \underset{\theta \in \Theta}{\operatorname{arg\,max}} \ \hat{\ell}(\theta \,;\, x_1, \ldots, x_n)

‘arg max’ denotes the θ for which the joint probability of our data is maximum.
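Here is a minimal end-to-end sketch of this recipe on the scenario-2 numbers (assuming NumPy; the function and variable names are mine). For the Gaussian family the arg max has a well-known closed form: μ̂ is the sample mean and σ̂² is the 1/n sample variance, so we can compute the MLE directly and check that an arbitrary competitor scores lower:

```python
import numpy as np

# The ten scenario-2 data points from the post.
x = np.array([-0.67, -0.96, 1.20, 0.56, -0.58, 1.50,
              -0.81, -1.00, -0.074, -0.31])

def avg_log_likelihood(mu, sigma, data):
    """Average Gaussian log-likelihood: (1/n) * sum_i ln f(x_i | mu, sigma)."""
    return (-0.5 * np.log(2 * np.pi) - np.log(sigma)
            - np.mean((data - mu) ** 2) / (2 * sigma ** 2))

# Closed-form Gaussian MLE: the sample mean and the 1/n standard deviation.
mu_hat = x.mean()
sigma_hat = x.std()  # NumPy's default ddof=0 gives exactly the MLE

print(f"mu_hat = {mu_hat:.4f}, sigma_hat = {sigma_hat:.4f}")
print("avg log-likelihood at the MLE :", avg_log_likelihood(mu_hat, sigma_hat, x))
print("avg log-likelihood at (0, 1)  :", avg_log_likelihood(0.0, 1.0, x))
```

Whatever (μ, σ) you try in place of (0, 1), the score at (μ̂, σ̂) will come out at least as high; that is exactly what ‘arg max’ promises.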

So we have figured out all the pieces now:

‘Maximum’ – we maximized the joint probability of our data points, the X’s.

‘Likelihood’ – the joint probability of the data, viewed as a function of θ.

‘Estimation’ – finding the θ̂ that maximizes the likelihood; θ̂ is our estimator of the parent distribution’s parameters.

I have tried to keep this blog as light on math as possible to avoid confusing beginners. However, if you want to understand MLE more deeply, supplemented with mathematical proofs, I would recommend going through the lecture and the Wikipedia article mentioned below, where it is described in detail –

  1. Nando de Freitas Lecture – MLE
  2. MLE Wikipedia

I hope the above post was helpful! I would love to hear your feedback!

Cheers!