That y can be estimated as a linear function of x does not imply that x can also be estimated as a linear function of y
Consider two real-valued variables x and y, for example, the height of a father and the height of his son. The central problem of regression analysis in statistics is to guess y by knowing x, e.g., to guess the height of the son based on the height of his father¹.
The idea in linear regression is to use a linear function of x as a guess for y. Formally, this means to consider ŷ(x) = α₁x + α₀ as our guess and find α₀ and α₁ by minimizing the mean squared error between y and ŷ. Now, let’s assume that we use a huge dataset and find the best possible values for α₀ and α₁, so we know how to find the best estimate of y based on x. How can we use these best values for α₀ and α₁ to find a guess x̂(y) about x based on y? For example, if we always knew the best guess about the son’s height based on his father’s, then what would be our guess about the father’s height based on his son’s?
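As a concrete illustration of this setup, here is a minimal sketch of finding α₀ and α₁ by minimizing the mean squared error. The data are entirely synthetic; the heights, slope, and intercept are made-up numbers for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic father/son heights in cm (the numbers are illustrative only).
fathers = rng.normal(175.0, 7.0, size=10_000)
sons = 0.5 * fathers + 90.0 + rng.normal(0.0, 5.0, size=fathers.size)

# Minimizing the mean squared error E[(y - a1*x - a0)^2] over a0 and a1
# has the closed-form (ordinary least squares) solution below.
a1 = np.cov(fathers, sons)[0, 1] / fathers.var(ddof=1)
a0 = sons.mean() - a1 * fathers.mean()

sons_hat = a1 * fathers + a0   # best linear guess of a son's height
```

With enough data, a1 and a0 recover the slope and intercept used to generate the synthetic sons' heights.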
Such questions are special cases of “How can we use ŷ(x) to find x̂(y)?” Even though it may sound trivial, this question appears to be really difficult to address. In this article, I study the link between ŷ(x) and x̂(y) in both deterministic and probabilistic settings and show that our intuition for how ŷ(x) and x̂(y) relate to each other in deterministic settings cannot be generalized to probabilistic settings.
By deterministic settings, I mean situations where (i) there is no randomness and (ii) each value of x always corresponds to the same value of y. Formally, in these settings, I write y = f(x) for some function f: R → R. In such cases where x determines y with complete certainty (i.e., no randomness or noise), the best choice of ŷ(x) is f(x) itself. For example, if the height of a son is always 1.05 times his father’s height (let’s ignore the impossibility of the example for now!), then our best guess about the son’s height is to multiply the father’s height by 1.05.
If f is an invertible function, then the best choice of x̂(y) is equal to the inverse of f. In the example above, this means that the best guess about the height of a father is always the height of his son divided by 1.05. Hence, the link between ŷ(x) and x̂(y) in deterministic cases is straightforward and can be reduced to finding the function f and its inverse.
In probabilistic settings, x and y are samples of random variables X and Y. In such cases where a single value of x can correspond to several values of y, the best choice for ŷ(x) (in order to minimize the mean squared error) is the conditional expectation E[Y|X=x] — see footnote². In application-friendly words, this means that if you train a very expressive neural network to predict y given x (with a sufficiently big dataset), then your network would converge to E[Y|X=x].
Similarly, the best choice for x̂(y) is E[X|Y=y] — if you train your very expressive network to predict x given y, then it converges, in principle, to E[X|Y=y]. Hence, the question of how ŷ(x) relates to x̂(y) in probabilistic settings can be rephrased as how the conditional expectations E[Y|X=x] and E[X|Y=y] relate to each other.
The goal of this article
To simplify the problem, I focus on linear relationships, i.e., cases where ŷ(x) is linear in x. A linear deterministic relationship has a linear inverse, meaning that y = αx (for some α≠0) implies that x = βy with β = 1/α — see footnote³. The probabilistic linear relationship analogous to the deterministic relationship y = αx is

Y = αX + Z,     (Equation 1)
where Z is an additional random variable, often called ‘noise’ or the ‘error term’, whose conditional average is assumed to be zero, i.e., E[Z|X=x] = 0 for all x; note that we do not always assume that Z is independent of X. Using Equation 1, the conditional expectation of Y given X=x is (see footnote⁴)

ŷ(x) := E[Y|X=x] = αx.     (Equation 2)
Equation 2 states that the conditional expectation ŷ(x) is linear in x, so it can be seen as the probabilistic twin of the linear deterministic relationship y = αx.
In the rest of this article, I ask two questions:
- Does Equation 2 imply that x̂(y) := E[X|Y=y] = βy for some β≠0? In other words, does the linear relationship in Equation 2 have a linear inverse?
- If it is indeed the case that x̂(y) = βy, then can we write β = 1/α as in the deterministic case?
I use two counterexamples to show that, as counter-intuitive as it may sound, the answer to both questions is negative!
As the first example, let me consider the most typical setup of linear regression problems, summarized in the following three assumptions (in addition to Equation 1; see Figure 1A for visualization):
- Error term Z is independent of X.
- X has a Gaussian distribution with mean zero and variance 1.
- Z has a Gaussian distribution with mean zero and variance σ².
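These three assumptions are easy to simulate. The sketch below (with α = 0.5 and σ = 1, values chosen only for illustration) fits the best linear estimate in both directions; since X and Y are jointly Gaussian here, the best linear predictor coincides with the conditional expectation.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, sigma = 0.5, 1.0
n = 1_000_000

x = rng.normal(size=n)                 # X ~ N(0, 1)
z = rng.normal(scale=sigma, size=n)    # Z ~ N(0, sigma^2), independent of X
y = alpha * x + z                      # Equation 1

# Best linear predictor of y from x has slope Cov(X, Y) / Var(X) = alpha,
# while the best linear predictor of x from y has slope
# Cov(X, Y) / Var(Y) = alpha / (alpha^2 + sigma^2) — not 1 / alpha.
slope_y_on_x = np.cov(x, y)[0, 1] / x.var()
slope_x_on_y = np.cov(x, y)[0, 1] / y.var()
```

With these numbers, slope_y_on_x ≈ 0.5 but slope_x_on_y ≈ 0.5/(0.25 + 1) = 0.4, far from the deterministic inverse 1/α = 2.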
It is straightforward to show, after a few lines of algebra, that these assumptions imply that Y has a Gaussian distribution with mean zero and variance α² + σ². Moreover, the assumptions imply that X and Y are jointly Gaussian with mean zero and covariance matrix equal to

    [ 1        α     ]
    [ α     α² + σ²  ]
Since we have the full joint distribution of X and Y, we can derive their conditional expectations (see footnote⁵):

ŷ(x) = E[Y|X=x] = αx   and   x̂(y) = E[X|Y=y] = βy,  with β = α / (α² + σ²).     (Equation 3)
Hence, given the assumptions of our first example, Equation 2 has a linear inverse of the form x̂(y) = βy, but β is not equal to its deterministic twin 1/α — unless we have σ = 0 which is equivalent to the deterministic case!
This result shows that our intuitions about deterministic linear relationships cannot be generalized to probabilistic linear relationships. To more clearly see the true insanity of what this result implies, let us first consider α = 0.5 in a deterministic setting (σ = 0; blue curves in Figure 2A and 2B):

ŷ(x) = 0.5x   and   x̂(y) = 2y.

This means that, given a value of x, the value of y is half of x, and, given a value of y, the value of x is twice y, which appears to be intuitive. Importantly, for positive values, we always have ŷ(x) < x and x̂(y) > y. Now, let us again consider α = 0.5 but this time with σ² = 3/4 (red curves in Figure 2A and 2B). This choice of noise variance implies that β = α/(α² + σ²) = 0.5/(0.25 + 0.75) = 0.5 = α, resulting in

ŷ(x) = 0.5x   and   x̂(y) = 0.5y.
This means that, given a value of x, our best estimate of y is half of x, yet, given a value of y, our best estimate of x is also half of y! Strangely, for positive values, we always have x̂(y) < y and ŷ(x) < x — which would be impossible if the relationship were deterministic. What appears to be counter-intuitive is that Equation 1 can be rewritten as

X = (Y - Z) / α.
However, this can only imply that (as opposed to Equation 2)

x̂(y) = E[X|Y=y] = (y - E[Z|Y=y]) / α.     (Equation 4)
The twist is that, while we have E[Z|X=x] = 0 by design, we cannot say anything about E[Z|Y=y] and its dependence on y! In other words, what makes x̂(y) different from y/α is that the observation y also carries information about the error Z: e.g., if we observe a very large value of y, then, with high probability, the error Z also has a large value, which should be taken into account when estimating X.
This is the simple explanation for seemingly contradictory statements like ‘tall fathers have sons who are (on average) tall but not as tall as themselves, and, at the same time, tall sons have fathers who are (on average) tall but not as tall as their sons’!
To conclude, our example 1 shows that even if the probabilistic linear relationship ŷ(x) = αx has a linear inverse of the form x̂(y) = βy, the slope β is not necessarily equal to its deterministic twin 1/α.
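This conclusion can be checked directly by simulation, without any regression machinery: estimate E[X|Y=y] by averaging x over a narrow bin of y. The sketch below uses the article's values α = 0.5 and σ² = 3/4 and conditions on Y ≈ 2 (the bin width and conditioning point are arbitrary choices).

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, sigma2 = 0.5, 3 / 4     # the values used above, so beta = alpha = 0.5
n = 2_000_000

x = rng.normal(size=n)
y = alpha * x + rng.normal(scale=np.sqrt(sigma2), size=n)

# Estimate E[X | Y ≈ 2] by averaging x over a narrow bin of y.
y0 = 2.0
x_hat = x[np.abs(y - y0) < 0.05].mean()
# Deterministic intuition says y0 / alpha = 4; the correct answer is
# beta * y0 = 0.5 * 2 = 1.
```

The binned average lands near 1, not 4: the best estimate of x given y = 2 is indeed half of y.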
Having an inverse of the form x̂(y) = βy is only possible if E[Z|Y=y] in Equation 4 is also a linear function of y. In the second example, I make a small modification to example 1 in order to break this condition!
In particular, I assume that the variance of the error term Z depends on the random variable X — as opposed to assumption 1 in example 1. Formally, I assume (in addition to Equation 1; see Figure 1B for visualization):
- X has a Gaussian distribution with mean zero and variance 1 (same as assumption 2 in example 1).
- Given X=x, the error Z has a Gaussian distribution with mean zero and variance σ²(x) = 0.01 + 1/(1 + 2x²).
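Before deriving anything, we can already see the effect of this x-dependent error variance in simulation: if x̂(y) were linear in y, the ratio E[X | Y ≈ y₀]/y₀ would be the same for every y₀. A minimal sketch (α = 0.5; the conditioning points 0.5 and 2.0 are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, n = 0.5, 4_000_000

x = rng.normal(size=n)                               # X ~ N(0, 1)
var_z = 0.01 + 1.0 / (1.0 + 2.0 * x**2)              # x-dependent error variance
y = alpha * x + rng.normal(size=n) * np.sqrt(var_z)  # Equation 1

# A linear inverse x_hat(y) = beta * y would make this ratio constant in y0.
ratios = {}
for y0 in (0.5, 2.0):
    ratios[y0] = x[np.abs(y - y0) < 0.05].mean() / y0
```

The two ratios come out clearly different: because the noise variance is largest near x = 0, a large observed y is more plausibly explained as noise from a small x than as αx from a large x, which drags E[X|Y=y] down for large y.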
These assumptions effectively mean that, given X=x, the random variable Y has a Gaussian distribution with mean αx and variance 0.01 + 1/(1 + 2x²) (see Figure 1B). As opposed to example 1, where the joint distribution of X and Y was a Gaussian distribution, the joint distribution of X and Y in example 2 does not have an elegant form (see Figure 1C). However, we can still use Bayes’ rule and find the relatively ugly conditional density of X=x given Y=y (see Figure 3 for some examples evaluated numerically):

p(x|y) = 𝒩(x; 0, 1) · 𝒩(y; αx, 0.01 + 1/(1 + 2x²)) / p(y),   with   p(y) = ∫ 𝒩(x′; 0, 1) · 𝒩(y; αx′, 0.01 + 1/(1 + 2x′²)) dx′,
where curly N denotes the probability density of the Gaussian distribution.
We can then use numerical methods to evaluate the conditional expectation

x̂(y) = E[X|Y=y] = ∫ x p(x|y) dx
for a given y and α. Figure 2C shows x̂(y) as a function of y for α = 0.5. As counter-intuitive as it may sound, the inverse relationship is highly nonlinear — as a result of the x-dependent error variance shown in Figure 1B. This shows that the fact that y can be estimated well as a linear function of x does not imply that x can also be estimated well as a linear function of y. This is because E[Z|Y=y] in Equation 4 can have any strange functional dependence on y when we go beyond standard assumptions similar to those in example 1.
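The numerical evaluation above can be sketched in a few lines: evaluate the unnormalized posterior on a fine grid and take a Riemann-sum average (grid limits and resolution are arbitrary choices that suffice here because the prior on X is standard normal).

```python
import numpy as np

def gauss(v, mu, var):
    """Gaussian probability density N(v; mu, var)."""
    return np.exp(-0.5 * (v - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def x_hat(y, alpha=0.5):
    """E[X | Y = y] via Bayes' rule, evaluated on a fine grid."""
    x = np.linspace(-8.0, 8.0, 8001)
    var_z = 0.01 + 1.0 / (1.0 + 2.0 * x**2)          # error variance given X = x
    post = gauss(x, 0.0, 1.0) * gauss(y, alpha * x, var_z)   # ∝ p(x | y)
    # The grid spacing cancels between numerator and denominator.
    return (x * post).sum() / post.sum()
```

For example, comparing x_hat(0.5)/0.5 with x_hat(2.0)/2.0 shows the ratio x̂(y)/y is far from constant, i.e., the inverse relationship is nonlinear, matching Figure 2C.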
To conclude, our example 2 shows that the probabilistic linear relationship ŷ(x) = αx does not necessarily have a linear inverse of the form x̂(y) = βy. Importantly, the inverse relationship between x̂(y) and y depends on the characteristics of the error term Z.