Introduction to Likelihood Theory/The Basic Definitions

Formal Probability Review


Let $S$ be a set contained in $\mathbb{R}^n$, and let $\mu$ be the counting measure if $S$ is discrete, the Lebesgue measure if $S$ is continuous, and a Stieltjes measure otherwise (if you don't know what a measure is, look it up in w:measure (mathematics), or just consider that if $S$ is continuous the integrals below are the usual integrals from calculus, and that the integrals reduce to summations over $S$ for discrete sets).

Definition: A function $f : S \to \mathbb{R}$ is a probability density function (abbreviated pdf) if and only if
$$f(x) \geq 0 \text{ for all } x \in S$$
and
$$\int_S f \, d\mu = 1.$$
We say that a random variable $X$ has pdf $f$ if the probability of $X$ being in any set $A \subseteq S$ is given by the expression
$$P(X \in A) = \int_A f \, d\mu$$
(if you don't know measure theory, consider that $A$ is an interval on the real line).
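
As a quick numerical illustration (a sketch added for concreteness, with the standard exponential density $f(x) = e^{-x}$ on $S = [0, \infty)$ chosen as an assumption for the example), the code below checks the two defining properties on a grid and computes $P(X \in A)$ for the interval $A = [1, 2]$ by numerical integration:

```python
import numpy as np
from scipy import integrate

# Illustrative density on S = [0, infinity): the standard exponential.
def f(x):
    return np.exp(-x)

# First property: f is nonnegative (checked on a grid of points of S).
grid = np.linspace(0.0, 50.0, 10001)
assert np.all(f(grid) >= 0.0)

# Second property: the integral of f over S is 1 (up to numerical error).
total, _ = integrate.quad(f, 0.0, np.inf)
print("integral of f over S:", total)      # ~ 1.0

# P(X in A) for the interval A = [1, 2].
p_A, _ = integrate.quad(f, 1.0, 2.0)
print("P(X in [1, 2]):", p_A)              # ~ exp(-1) - exp(-2)
```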

Exercise 1.1 - Show that   is a pdf with  ,   and  .
Exercise 1.2 - Show that we can build a distribution function using the function  , if  ,   otherwise (  is any real number,   is defined in the previous exercise) by multiplying it with an appropriate constant. Find the constant. Generalize it for any pdf defined on the real line.
Exercise 1.3 - If   has distribution   with   and  , what is the distribution of the function  ? (Calculate it, don't look it up in probability books.) In statistics, the term probability density function is often abbreviated to density.

Definition: Let $X$ be a random variable with density $f$. The Cumulative Distribution Function (cdf) of $X$ is the function defined as

$$F_X(x) = P(X \leq x) = \int_{\{t \in S : t \leq x\}} f \, d\mu.$$

This function is often called the distribution function or simply the distribution. Since the distribution uniquely determines the density, the terms distribution and density are used by statisticians as synonyms (provided no ambiguity arises from the context).
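
Here is a small sketch of how a cdf can be computed from a density by numerical integration (still using the illustrative exponential density from the sketch above; the values can be compared against the closed form $1 - e^{-x}$):

```python
import numpy as np
from scipy import integrate

def f(t):
    # Illustrative density: standard exponential on [0, infinity).
    return np.exp(-t)

def cdf(x):
    # F_X(x) is the integral of f over the part of S at or below x;
    # the density vanishes below 0, so integrating from 0 is enough.
    if x <= 0:
        return 0.0
    value, _ = integrate.quad(f, 0.0, x)
    return value

for x in (0.0, 0.5, 1.0, 2.0, 5.0):
    print(x, round(cdf(x), 4))   # compare with 1 - exp(-x)
```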
Exercise 1.4 - Prove that every cdf is nondecreasing.

Definition: Let $X$ be a random variable. We call the expectation of the function $g$ the value

$$E[g(X)] = \int_S g(x) f(x) \, d\mu(x)$$

where $f$ is the density of $X$. The expectation of the identity function is called the expectation of $X$ and is written $E[X]$.
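
A numerical sketch of this definition (the density is again the illustrative standard exponential, and $g(x) = x^2$ is an arbitrary choice): the integral from the definition is computed directly and then compared with a plain Monte Carlo average, which is the usual way expectations are approximated in practice.

```python
import numpy as np
from scipy import integrate

rng = np.random.default_rng(0)

def f(x):
    return np.exp(-x)          # illustrative density on [0, infinity)

def g(x):
    return x ** 2              # the function whose expectation we want

# E[g(X)] as the integral of g(x) f(x) over S.
exact, _ = integrate.quad(lambda x: g(x) * f(x), 0.0, np.inf)

# Monte Carlo check: average g over draws from the same density.
draws = rng.exponential(scale=1.0, size=200_000)
approx = g(draws).mean()

print("by integration :", exact)    # 2.0 for the standard exponential
print("by Monte Carlo :", approx)
```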
Exercise 1.5 - Compute the expectation of the random variables defined in Exercise 1.1.
Exercise 1.6 - Show that   for any constant  .
Exercise 1.7 - Show that   for any constant  .

In The Beginning There Were Chaos, Empirical Densities and Samples


A population is a collection of objects (a collection, not a proper set or class from a Logicist point of view) where each object has an array of measurable variables. Examples include the set of all people on earth together with their heights and weights, and the set of all fish in a lake together with artificial marks on them, where this latter case is found in capture-recapture studies (I suggest you look it up on Wikipedia and find out what a capture-recapture study is). Let $u$ be an element of a population and $X_u$ be the array of measurable variables measured on the object $u$ (for example, $u$ is a man and $X_u$ is his height and weight measured at some arbitrary instant, or $u$ is a fish and $X_u$ is $1$ if it has a man-made mark on it and $0$ otherwise). A sample of a population $P$ is a collection $u_1, u_2, \dots, u_n$ (again, not a set) where each $u_i$ is such that $u_i \in P$.

There are two main methods for generating samples: sampling without replacement and sampling with replacement. In the former, you randomly select an element $u_1$ of $P$ and call the set $S_1 = \{u_1\}$ your first subsample. Define your $(n+1)$-th subsample as the set $S_{n+1} = S_n \cup \{r(P \setminus S_n)\}$, where $r$ is a function returning a randomly chosen element of its argument. Any subsample generated using the definitions above will be called a sample without replacement; this is the more intuitive kind of sample, but also one of the most complicated to obtain in a real-world situation. In the latter, we have $S_1$ and $r$ defined in the same way as above, but in this case we take $S_{n+1} = S_n \cup \{r(P)\}$. Samples with replacement have the exquisite property that they may contain the same object more than once, appearing as different entries with the same characteristics.
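
A minimal sketch of the two schemes using Python's standard library (the toy population of labeled fish is an assumption made purely for illustration):

```python
import random

random.seed(1)

# A toy population of labeled objects.
population = [f"fish_{i}" for i in range(10)]

# Sampling WITHOUT replacement: each draw removes the chosen object
# from the pool, so no object can appear twice in the sample.
sample_without = random.sample(population, k=5)

# Sampling WITH replacement: each draw is taken from the whole
# population, so the same object may appear more than once.
sample_with = [random.choice(population) for _ in range(5)]

print("without replacement:", sample_without)
print("with replacement   :", sample_with)
```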

TO DO: Some stuff on empirical densities and example of real-world sampling techniques.

Likelihoods, Finally


Given a random vector $X = (X_1, \dots, X_n)$ with density $f(x; \theta)$, where $\theta$ is a vector of parameters, and an observation $x = (x_1, \dots, x_n)$ of $X$, we define the likelihood function associated with $x$ as

$$L(\theta) = L(\theta; x) = f(x; \theta).$$

This is a function of $\theta$, but not of $X$, of an observation $x$, or any other related quantity, for $L(\theta)$ is the restriction of the function $f(x; \theta)$, which is a function of both $x$ and $\theta$, to the subspace where the $x_i$ are fixed.
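
To make the "function of $\theta$ with $x$ fixed" point concrete, here is a sketch that fixes one observation and evaluates $L(\theta) = f(x; \theta)$ over a grid of parameter values; the exponential density $f(x; \lambda) = \lambda e^{-\lambda x}$ is an assumption chosen only for illustration.

```python
import numpy as np

def density(x, lam):
    # Exponential density f(x; lambda) = lambda * exp(-lambda * x), x >= 0.
    return lam * np.exp(-lam * x)

x_obs = 2.0                            # the observation is held fixed
lams = np.linspace(0.1, 2.0, 8)        # the parameter is what varies

for lam in lams:
    print(f"lambda = {lam:4.2f}   L(lambda) = {density(x_obs, lam):.4f}")
```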

In many applications we have that, for all $i \neq j$, $X_i$ and $X_j$ are independent. Suppose that we draw a student from a closed classroom at random, record his height $x_i$, and put him back. If we repeat the process $n$ times, the set of heights measured forms an observed vector $x = (x_1, \dots, x_n)$, and each $X_i$ has the distribution of the height of the students in that classroom. Then our independence supposition is fulfilled, as it will be for any sampling scheme with replacement. In the case where the supposition is true, the above definition of the likelihood function is equivalent to

$$L(\theta) = \prod_{i=1}^{n} f_i(x_i; \theta)$$

where $f_i$ is the probability density function of the variable $X_i$.
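
A sketch of this product form for an independent sample, continuing with the illustrative exponential density and a made-up observed vector; in practice one usually works with the logarithm of the product, since multiplying many small density values quickly underflows.

```python
import numpy as np

def density(x, lam):
    # Illustrative exponential density f(x; lambda).
    return lam * np.exp(-lam * x)

sample = np.array([0.8, 1.3, 2.1, 0.4, 1.7])   # made-up observed vector

def likelihood(lam):
    # L(lambda) as the product of the marginal densities at each x_i.
    return np.prod(density(sample, lam))

def log_likelihood(lam):
    # log L(lambda) as the sum of the log densities.
    return np.sum(np.log(density(sample, lam)))

for lam in (0.5, 1.0, 1.5):
    print(f"lambda = {lam}   L = {likelihood(lam):.6f}   log L = {log_likelihood(lam):.4f}")
```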
Exercise 3.1: Let   have a Gaussian density with zero mean and unit variance for all  . Compute the likelihood function of   and   for an arbitrary sample.

Intuitive Meaning?


This function we call likelihood is not directly related to the probability of events involving $X$ or any proper subset of it, despite its name, but it has a non-obvious relation to the probability of the sample as a whole being selected in the space of all possible samples. This can be seen if we use discrete densities (or probability mass functions). Suppose that each $X_i$ has a binomial distribution with $m$ trials and success probability $q$, and that they are independent. Then the likelihood function associated with a sample $x = (x_1, \dots, x_n)$ is

$$L(q) = \prod_{i=1}^{n} \binom{m}{x_i} q^{x_i} (1-q)^{m-x_i}$$

where each $x_i$ is in $\{0, 1, \dots, m\}$, and $\binom{m}{x_i}$ means the number of ways of choosing $x_i$ objects out of $m$. This function is the probability of this particular sample appearing among all possible samples of the same size, but this train of thought only works in discrete cases with finite sample space.
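
A sketch checking this reading in a small discrete case (the number of trials, the success probability, and the sample are all made-up numbers): the likelihood evaluated at a given $q$ is exactly the probability of observing that particular sample.

```python
from math import comb

def binom_pmf(x, m, q):
    # P(X_i = x) for a binomial with m trials and success probability q.
    return comb(m, x) * q**x * (1.0 - q)**(m - x)

m, q = 10, 0.3                 # made-up trial count and success probability
sample = [2, 4, 3, 1]          # made-up observed counts

likelihood = 1.0
for x in sample:
    likelihood *= binom_pmf(x, m, q)

# The same number, read as the probability of this exact sample occurring.
print("L(q) = P(X_1 = 2, X_2 = 4, X_3 = 3, X_4 = 1) =", likelihood)
```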
Exercise 4.1: In the Binomial case, does $L(q)$ have any probabilistic meaning? If the observed values are throws of regular fair coins, what can you expect of the function $L(q)$?

But the likelihood has a comparative meaning. Suppose that we are given two observations of $X$, namely $x^{(1)}$ and $x^{(2)}$. Then each observation defines a likelihood function, and for each fixed $\theta$, we may compare their likelihoods $L_{x^{(1)}}(\theta)$ and $L_{x^{(2)}}(\theta)$ to argue that the observation with the bigger value is the one more likely to occur. This argument is equivalent to Fisher's rant against Inverse Probabilities.
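
A small sketch of this comparative use, reusing the binomial setup above with two made-up observations: at a fixed $q$, the two likelihood values can be compared directly.

```python
from math import comb

def binom_pmf(x, m, q):
    return comb(m, x) * q**x * (1.0 - q)**(m - x)

def likelihood(sample, q, m=10):
    # L(q) for an independent binomial sample.
    L = 1.0
    for x in sample:
        L *= binom_pmf(x, m, q)
    return L

sample_1 = [2, 4, 3, 1]        # first made-up observation of X
sample_2 = [3, 2, 4, 2]        # second made-up observation of X

q = 0.3                        # the comparison is made at a fixed parameter value
L1 = likelihood(sample_1, q)
L2 = likelihood(sample_2, q)
print("L_1(q) =", L1)
print("L_2(q) =", L2)
print("the first observation is the more likely one" if L1 > L2
      else "the second observation is the more likely one")
```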

Bayesian Generalization


Even if most classical statisticians (also called "frequentists") complain, we must talk about this generalization of the likelihood function concept. Given that the vector $X$ has a density conditional on $\theta$ called $f(x \mid \theta)$, and that we have an observation $x$ of $X$ (I said $X$, forget about observations of $\theta$ in this section!), we will play a little with the function

$$L(\theta) = f(x \mid \theta) \, f(\theta)$$

where $f(\theta)$ is a density for the parameter $\theta$ itself.

Before anything, Exercise 5.1: Find two tractable discrete densities with known conditional density and compute their likelihood function. Relate $L(\theta)$ to $f(\theta \mid x)$.
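
Here is one way of playing with this function in a tractable discrete case (every specific number below is a made-up assumption, and the grid prior is only one convenient choice): $\theta$ is a binomial success probability restricted to three values, $f(\theta)$ is a discrete density on that grid, and the code evaluates $f(x \mid \theta) f(\theta)$ and normalizes it over the grid, which gives one concrete way to relate it to $f(\theta \mid x)$.

```python
from math import comb

def binom_pmf(x, m, q):
    # f(x | theta) when theta = q: binomial with m trials.
    return comb(m, x) * q**x * (1.0 - q)**(m - x)

# Discrete density f(theta) on a small grid of parameter values.
prior = {0.2: 0.25, 0.5: 0.50, 0.8: 0.25}

m, x_obs = 10, 7               # one made-up binomial observation of X

# The function above: f(x | theta) * f(theta), evaluated on the grid.
weights = {t: binom_pmf(x_obs, m, t) * p for t, p in prior.items()}

# Normalizing over the grid yields the conditional density f(theta | x).
total = sum(weights.values())
posterior = {t: w / total for t, w in weights.items()}

print("f(x | theta) f(theta):", weights)
print("f(theta | x)         :", posterior)
```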


On to Maximum Likelihood Estimation

Thank you for reading


Some comments are needed. The "?" mark in the title of the section "Intuitive Meaning?" is intentional, to show how confusing this might be. The text needs more exercises and examples from outside formal probability. The way it stands right now, it requires a good background in formal probability (at a high level) and much more experience with sampling.