Linear regression is by far the most common algorithm taught in data science, and every practitioner has heard of it and used it. However, for some problems it is not suitable, and we need to ‘generalise’ it. This is where generalized linear models (GLMs) come in: they add flexibility to your regression modelling and are an invaluable tool for data scientists to know about.
As we said above, GLMs ‘generalise’ ordinary linear regression, but what do we really mean by that?
Let’s consider the simpler linear regression model:
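$$ y = \beta_0 + \beta_1 x + \varepsilon $$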
Where β are the coefficients, x is the explanatory variable and ε are the normally distributed errors.
Let’s say we want to model how many claims calls an insurance company gets in an hour. Would linear regression be a suitable model for this problem?
The answer is no, for two reasons:
- Linear regression assumes normally distributed errors, and the normal distribution can take on negative values. However, we can’t get negative claim calls.
- The normal distribution, and hence linear regression, is continuous, whereas claims calls are discrete integers: we can’t get 1.1 calls.
Therefore, the linear regression model can’t correctly handle this problem. However, we can generalise the regression model to a probability distribution that meets the requirements above. In this case, that distribution is the Poisson (more on this later).
GLMs then simply provide a framework of how we can link our inputs to the desired outputs of the target distribution. They help unify many regression models together under one ‘mathematical umbrella.’
The basis of GLMs relies on three key components: a linear predictor, a link function, and a target distribution from the exponential family.
We will now run through what each of these things means.
This is the simplest one to understand. A Linear predictor, η, just means we have a linear sum of the inputs (explanatory variables/covariates), x, multiplied by their corresponding coefficients, β:
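$$ \eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p $$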
The link function, g, is literally responsible for ‘linking’ the linear predictor to the mean response of our target distribution, μ:
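$$ g(\mu) = \eta $$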
A requirement for GLMs is that the target distribution of the output needs to be part of the exponential family. This family contains many famous distributions that you have probably heard of, such as the Poisson, Binomial, Gamma, and Exponential.
In the GLM framework we actually use the exponential dispersion model, which is a further generalisation of the exponential family.
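Its density takes the form (a standard parameterisation; conventions vary slightly between textbooks):

$$ f(y; \theta, \phi) = \exp\!\left(\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right) $$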
This form is chosen for statistical convenience, but we don’t need to worry too much about why this is the case in this article.
Notice there are two parameters: θ, the natural or canonical parameter that relates the inputs to the outputs, and ϕ, the dispersion parameter.
Another cool fact is that the distributions in the exponential family all have conjugate priors. This makes them useful for Bayesian problems. If you want to learn more about conjugate priors, check out my article on them here:
Canonical link function
There is something called the canonical link function, which is given by:
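$$ g(\mu) = \theta $$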
So, if we can describe θ in terms of μ, then we have derived the natural link function for our target distribution!
Mean and Variance
It can be mathematically shown that the mean, E(Y), of the exponential family, is given by the following:
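$$ E(Y) = b'(\theta) $$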
Likewise, the variance, Var(Y), is given by:
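$$ Var(Y) = b''(\theta)\, a(\phi) $$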
If you want to see the proof of this derivation, refer to page 29 of the following linked book. In general, these results follow from differentiating the log-likelihood function with respect to θ.
The Poisson distribution is a famous discrete probability distribution that models the probability of an event happening a specific number of times given a known mean rate of occurrence. Check out my previous post if you want to learn more about it here:
Its PMF is given by:
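$$ P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!} $$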
- e: Euler’s number (≈ 2.718)
- x: Number of occurrences (≥ 0)
- λ: Expected number of occurrences (≥ 0), this is also the mean in the GLM notation μ
In Exponential Form
We can write the above Poisson PMF in exponential form by taking the natural log of both sides:
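$$ \ln\big(P(X = x)\big) = x \ln(\lambda) - \lambda - \ln(x!) $$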
Then, we raise both sides with respect to Euler’s number:
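$$ P(X = x) = \exp\big(x \ln(\lambda) - \lambda - \ln(x!)\big) $$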
And voila, the Poisson PMF is now in exponential form!
By matching the coefficients with the above equation and the exponential family PDF, we find the following:
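$$ \theta = \ln(\lambda), \qquad b(\theta) = e^{\theta} = \lambda, \qquad a(\phi) = 1, \qquad c(y, \phi) = -\ln(y!) $$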
Therefore, the mean and variance of the Poisson distribution is:
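$$ E(Y) = b'(\theta) = e^{\theta} = \lambda, \qquad Var(Y) = b''(\theta)\, a(\phi) = e^{\theta} = \lambda $$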
This is a known result for the Poisson distribution, and we have just derived it a different way!
The canonical link function for the Poisson distribution is then given by:
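$$ g(\mu) = \theta = \ln(\lambda) = \ln(\mu) $$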
Therefore, the Poisson regression equation is:
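$$ \ln(\mu) = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \quad \Longrightarrow \quad \mu = \exp\big(\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p\big) $$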
We can verify that the output of this equation can only be positive, so it satisfies the requirements of our insurance claim calls problem.
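To make this concrete, here is a minimal sketch of fitting a Poisson regression by Newton–Raphson on synthetic data. The dataset, coefficient values, and variable names are all illustrative, not from a real insurance problem; in practice you would reach for a library such as statsmodels rather than hand-rolling the solver.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data: one covariate, true model mu = exp(0.5 + 0.8 * x)
n = 5000
x = rng.uniform(-1, 1, n)
X = np.column_stack([np.ones(n), x])  # design matrix with an intercept column
true_beta = np.array([0.5, 0.8])
y = rng.poisson(np.exp(X @ true_beta))

# Newton-Raphson for Poisson regression with the canonical log link
beta = np.zeros(2)
for _ in range(25):
    mu = np.exp(X @ beta)              # inverse link: mu = exp(eta)
    grad = X.T @ (y - mu)              # score of the Poisson log-likelihood
    hess = X.T @ (mu[:, None] * X)     # Fisher information: X' diag(mu) X
    step = np.linalg.solve(hess, grad)
    beta = beta + step
    if np.max(np.abs(step)) < 1e-10:   # stop once the update is negligible
        break

print(beta)  # estimates should land close to true_beta = [0.5, 0.8]
```

Note that the fitted means `np.exp(X @ beta)` are positive by construction, which is exactly the property the log link buys us for count data.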
You may be wondering why I have taken you through all this arduous maths. Well, let me quickly summarise the key take-home messages:
- It is paramount to check the requirements of your problem against your target distribution to avoid nonsensical results.
- GLMs provide a mathematical first-principles approach to how you can link your input to your desired output for that specific problem.
The standard linear regression model is powerful, but it is not suitable for all problems, such as those where the output must be non-negative. For these problems we must use other distributions, like the Poisson, and GLMs provide a framework for doing so. They do this by deducing a link function from first principles, which enables you to transform your linear predictor into your desired target output distribution. GLMs are a versatile modelling tool that every data scientist should at least be aware of.