1. Orthogonal Projections and Their Applications

1.1. Overview

Orthogonal projection is a cornerstone of vector space methods, with many diverse applications.

These include

  • Least squares projection, also known as linear regression

  • Conditional expectations for multivariate normal (Gaussian) distributions

  • Gram–Schmidt orthogonalization

  • QR decomposition

  • Orthogonal polynomials

  • etc

In this lecture, we focus on

  • key ideas

  • least squares regression

We’ll require the following imports:

import numpy as np
from scipy.linalg import qr

1.1.1. Further Reading

For background and foundational concepts, see our lecture on linear algebra.

For more proofs and greater theoretical detail, see A Primer in Econometric Theory.

For a complete set of proofs in a general setting, see, for example, [Roman, 2005].

For an advanced treatment of projection in the context of least squares prediction, see this book chapter.

1.2. Key Definitions

Assume $x, z \in \mathbb{R}^n$.

Define $\langle x, z \rangle = \sum_i x_i z_i$.

Recall $\|x\|^2 = \langle x, x \rangle$.

The law of cosines states that $\langle x, z \rangle = \|x\| \|z\| \cos(\theta)$, where $\theta$ is the angle between the vectors $x$ and $z$.

When $\langle x, z \rangle = 0$, then $\cos(\theta) = 0$ and $x$ and $z$ are said to be orthogonal, and we write $x \perp z$.
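For instance, here is a minimal NumPy check of these definitions (the vectors `x` and `z` below are arbitrary examples, not from the lecture):

```python
import numpy as np

x = np.array([1.0, 2.0, 2.0])
z = np.array([2.0, 0.0, 1.0])

inner = x @ z                                     # <x, z> = sum_i x_i z_i
norm_x = np.sqrt(x @ x)                           # ||x|| = sqrt(<x, x>)
cos_theta = inner / (norm_x * np.linalg.norm(z))  # law of cosines

print(inner, norm_x, cos_theta)                   # x ⊥ z would correspond to inner == 0
```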

_images/orth_proj_def1.png

For a linear subspace $S \subset \mathbb{R}^n$, we call $x \in \mathbb{R}^n$ orthogonal to $S$ if $x \perp z$ for all $z \in S$, and write $x \perp S$.

_images/orth_proj_def2.png

The orthogonal complement of linear subspace $S \subset \mathbb{R}^n$ is the set $S^{\perp} := \{x \in \mathbb{R}^n \,:\, x \perp S\}$.

_images/orth_proj_def3.png

$S^{\perp}$ is a linear subspace of $\mathbb{R}^n$

  • To see this, fix $x, y \in S^{\perp}$ and $\alpha, \beta \in \mathbb{R}$.

  • Observe that if $z \in S$, then

$$
\langle \alpha x + \beta y, z \rangle
= \alpha \langle x, z \rangle + \beta \langle y, z \rangle
= \alpha \times 0 + \beta \times 0 = 0
$$

  • Hence $\alpha x + \beta y \in S^{\perp}$, as was to be shown

A set of vectors $\{x_1, \ldots, x_k\} \subset \mathbb{R}^n$ is called an orthogonal set if $x_i \perp x_j$ whenever $i \neq j$.

If $\{x_1, \ldots, x_k\}$ is an orthogonal set, then the Pythagorean Law states that

$$
\| x_1 + \cdots + x_k \|^2 = \| x_1 \|^2 + \cdots + \| x_k \|^2
$$

For example, when $k = 2$, $x_1 \perp x_2$ implies

$$
\| x_1 + x_2 \|^2
= \langle x_1 + x_2, x_1 + x_2 \rangle
= \langle x_1, x_1 \rangle + 2 \langle x_2, x_1 \rangle + \langle x_2, x_2 \rangle
= \| x_1 \|^2 + \| x_2 \|^2
$$
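As a quick numerical sanity check, the sketch below verifies the $k = 2$ case with two hand-picked orthogonal vectors (an illustration, not part of the lecture's code):

```python
import numpy as np

x1 = np.array([1.0, 1.0, 0.0])
x2 = np.array([1.0, -1.0, 2.0])   # x1 @ x2 == 0, so x1 ⊥ x2

lhs = np.linalg.norm(x1 + x2)**2
rhs = np.linalg.norm(x1)**2 + np.linalg.norm(x2)**2
print(np.allclose(lhs, rhs))      # True, as the Pythagorean Law predicts
```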

1.2.1. Linear Independence vs Orthogonality

If $X \subset \mathbb{R}^n$ is an orthogonal set and $0 \notin X$, then $X$ is linearly independent.

Proving this is a nice exercise.

While the converse is not true, a kind of partial converse holds, as we’ll see below.

1.3. The Orthogonal Projection Theorem

What vector within a linear subspace of $\mathbb{R}^n$ best approximates a given vector in $\mathbb{R}^n$?

The next theorem answers this question.

Theorem 1.1 (Orthogonal Projection Theorem)

Given $y \in \mathbb{R}^n$ and linear subspace $S \subset \mathbb{R}^n$, there exists a unique solution to the minimization problem

$$
\hat y := \mathop{\mathrm{arg\,min}}_{z \in S} \| y - z \|
$$

The minimizer $\hat y$ is the unique vector in $\mathbb{R}^n$ that satisfies

  • $\hat y \in S$

  • $y - \hat y \perp S$

The vector $\hat y$ is called the orthogonal projection of $y$ onto $S$.

The next figure provides some intuition

_images/orth_proj_thm1.png

1.3.1. Proof of Sufficiency

We’ll omit the full proof.

But we will prove sufficiency of the asserted conditions.

To this end, let $y \in \mathbb{R}^n$ and let $S$ be a linear subspace of $\mathbb{R}^n$.

Let $\hat y$ be a vector in $\mathbb{R}^n$ such that $\hat y \in S$ and $y - \hat y \perp S$.

Let $z$ be any other point in $S$ and use the fact that $S$ is a linear subspace to deduce

$$
\| y - z \|^2
= \| (y - \hat y) + (\hat y - z) \|^2
= \| y - \hat y \|^2 + \| \hat y - z \|^2
$$

Hence $\| y - z \| \geq \| y - \hat y \|$, which completes the proof.

1.3.2. Orthogonal Projection as a Mapping

For a linear space $Y$ and a fixed linear subspace $S$, we have a functional relationship

$$
y \in Y \; \mapsto \text{ its orthogonal projection } \hat y \in S
$$

By Theorem 1.1, this is a well-defined mapping or operator from $\mathbb{R}^n$ to $\mathbb{R}^n$.

In what follows we denote this operator by a matrix $P$

  • $P y$ represents the projection $\hat y$.

  • This is sometimes expressed as $\hat E_S y = P y$, where $\hat E$ denotes a wide-sense expectations operator and the subscript $S$ indicates that we are projecting $y$ onto the linear subspace $S$.

The operator $P$ is called the orthogonal projection mapping onto $S$.

_images/orth_proj_thm2.png

It is immediate from Theorem 1.1 that for any $y \in \mathbb{R}^n$

  1. $P y \in S$ and

  2. $y - P y \perp S$

From this, we can deduce additional useful properties, such as

  1. $\| y \|^2 = \| P y \|^2 + \| y - P y \|^2$ and

  2. $\| P y \| \leq \| y \|$

For example, to prove 1, observe that $y = P y + (y - P y)$ and apply the Pythagorean law.

1.3.2.1. Orthogonal Complement

Let $S \subset \mathbb{R}^n$.

The orthogonal complement of $S$ is the linear subspace $S^{\perp}$ that satisfies $x_1 \perp x_2$ for every $x_1 \in S$ and $x_2 \in S^{\perp}$.

Let $Y$ be a linear space with linear subspace $S$ and its orthogonal complement $S^{\perp}$.

We write

$$
Y = S \oplus S^{\perp}
$$

to indicate that for every $y \in Y$ there is a unique $x_1 \in S$ and a unique $x_2 \in S^{\perp}$ such that $y = x_1 + x_2$.

Moreover, $x_1 = \hat E_S y$ and $x_2 = y - \hat E_S y$.

This amounts to another version of Theorem 1.1:

Theorem 1.2 (Orthogonal Projection Theorem (another version))

If $S$ is a linear subspace of $\mathbb{R}^n$, $\hat E_S y = P y$ and $\hat E_{S^{\perp}} y = M y$, then

$$
P y \perp M y
\quad \text{and} \quad
y = P y + M y
\quad \text{for all } y \in \mathbb{R}^n
$$

The next figure illustrates

_images/orth_proj_thm3.png

1.4. Orthonormal Basis

An orthogonal set of vectors $O \subset \mathbb{R}^n$ is called an orthonormal set if $\| u \| = 1$ for all $u \in O$.

Let $S$ be a linear subspace of $\mathbb{R}^n$ and let $O \subset S$.

If $O$ is orthonormal and $\mathrm{span}\, O = S$, then $O$ is called an orthonormal basis of $S$.

$O$ is necessarily a basis of $S$ (being independent by orthogonality and the fact that no element is the zero vector).

One example of an orthonormal set is the canonical basis $\{e_1, \ldots, e_n\}$ that forms an orthonormal basis of $\mathbb{R}^n$, where $e_i$ is the $i$-th unit vector.

If $\{u_1, \ldots, u_k\}$ is an orthonormal basis of linear subspace $S$, then

$$
x = \sum_{i=1}^k \langle x, u_i \rangle u_i
\quad \text{for all} \quad
x \in S
$$

To see this, observe that since $x \in \mathrm{span}\{u_1, \ldots, u_k\}$, we can find scalars $\alpha_1, \ldots, \alpha_k$ that verify

$$
x = \sum_{j=1}^k \alpha_j u_j
\tag{1.1}
$$

Taking the inner product with respect to $u_i$ gives

$$
\langle x, u_i \rangle = \sum_{j=1}^k \alpha_j \langle u_j, u_i \rangle = \alpha_i
$$

Combining this result with (1.1) verifies the claim.
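As an illustration, the sketch below builds an orthonormal basis of a two-dimensional subspace of $\mathbb{R}^3$ with `scipy.linalg.qr` (the matrix `X` is an arbitrary example whose columns span $S$) and verifies the expansion for an element of that subspace:

```python
import numpy as np
from scipy.linalg import qr

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
U, _ = qr(X, mode='economic')     # columns of U: an orthonormal basis of span(X)

x = X @ np.array([2.0, -1.0])     # an arbitrary element of S = span(X)
expansion = sum((x @ U[:, i]) * U[:, i] for i in range(U.shape[1]))
print(np.allclose(x, expansion))  # True: x equals the sum of its basis components
```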

1.4.1. Projection onto an Orthonormal Basis

When we have an orthonormal basis for the subspace onto which we project, computing the projection simplifies:

Theorem 1.3

If $\{u_1, \ldots, u_k\}$ is an orthonormal basis for $S$, then

$$
P y = \sum_{i=1}^k \langle y, u_i \rangle u_i,
\quad \forall \; y \in \mathbb{R}^n
\tag{1.2}
$$

Proof. Fix $y \in \mathbb{R}^n$ and let $P y$ be defined as in (1.2).

Clearly, $P y \in S$.

We claim that $y - P y \perp S$ also holds.

It suffices to show that $y - P y \perp u_i$ for any basis vector $u_i$.

This is true because

$$
\left\langle y - \sum_{i=1}^k \langle y, u_i \rangle u_i, u_j \right\rangle
= \langle y, u_j \rangle  - \sum_{i=1}^k \langle y, u_i \rangle
\langle u_i, u_j  \rangle = 0
$$

(Why is this sufficient to establish the claim that $y - P y \perp S$?)

1.5. Projection Via Matrix Algebra

Let $S$ be a linear subspace of $\mathbb{R}^n$ and let $y \in \mathbb{R}^n$.

We want to compute the matrix $P$ that verifies

$$
\hat E_S y = P y
$$

Evidently $P y$ is a linear function from $y \in \mathbb{R}^n$ to $P y \in \mathbb{R}^n$.

This reference is useful.

Theorem 1.4

Let the columns of $n \times k$ matrix $X$ form a basis of $S$. Then

$$
P = X (X' X)^{-1} X'
$$

Proof. Given arbitrary $y \in \mathbb{R}^n$ and $P = X (X' X)^{-1} X'$, our claim is that

  1. $P y \in S$, and

  2. $y - P y \perp S$

Claim 1 is true because

$$
P y = X (X' X)^{-1} X' y = X a
\quad \text{when} \quad
a := (X' X)^{-1} X' y
$$

An expression of the form $X a$ is precisely a linear combination of the columns of $X$ and hence an element of $S$.

Claim 2 is equivalent to the statement

$$
y - X (X' X)^{-1} X' y \, \perp \, X b
\quad \text{for all} \quad
b \in \mathbb{R}^k
$$

To verify this, notice that if $b \in \mathbb{R}^k$, then

$$
(X b)' [y - X (X' X)^{-1} X' y]
= b' [X' y - X' y]
= 0
$$

The proof is now complete.
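Here is a small numerical illustration of the two claims, using an arbitrary $X$ and $y$ (a sketch, not part of the lecture's code):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

P = X @ np.linalg.inv(X.T @ X) @ X.T
Py = P @ y

# Claim 1: Py lies in span(X), i.e. Py = Xa for some a
a, *_ = np.linalg.lstsq(X, Py, rcond=None)
print(np.allclose(X @ a, Py))            # True

# Claim 2: the residual y - Py is orthogonal to every column of X
print(np.allclose(X.T @ (y - Py), 0.0))  # True
```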

1.5.1. Starting with the Basis

It is common in applications to start with $n \times k$ matrix $X$ with linearly independent columns and let

$$
S := \mathrm{span}\, X := \mathrm{span} \{ \mathrm{col}_1 X, \ldots, \mathrm{col}_k X \}
$$

Then the columns of $X$ form a basis of $S$.

From Theorem 1.4, $P y = X (X' X)^{-1} X' y$ projects $y$ onto $S$.

In this context, $P$ is often called the projection matrix

  • The matrix $M = I - P$ satisfies $M y = \hat E_{S^{\perp}} y$ and is sometimes called the annihilator matrix.

1.5.2. The Orthonormal Case

Suppose that $U$ is $n \times k$ with orthonormal columns.

Let $u_i := \mathrm{col}_i\, U$ for each $i$, let $S := \mathrm{span}\, U$ and let $y \in \mathbb{R}^n$.

We know that the projection of $y$ onto $S$ is

$$
P y = U (U' U)^{-1} U' y
$$

Since $U$ has orthonormal columns, we have $U' U = I$.

Hence

$$
P y = U U' y = \sum_{i=1}^k \langle u_i, y \rangle u_i
$$

We have recovered our earlier result about projecting onto the span of an orthonormal basis.
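To confirm this numerically, the following sketch builds an orthonormal basis for $\mathrm{span}(X)$ with `scipy.linalg.qr` (the particular `X` and `y` are arbitrary illustrations) and compares $U U' y$ with the general projection formula:

```python
import numpy as np
from scipy.linalg import qr

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

U, _ = qr(X, mode='economic')            # orthonormal columns, same span as X

P = X @ np.linalg.inv(X.T @ X) @ X.T     # projection onto span(X)
print(np.allclose(U @ U.T @ y, P @ y))   # True: U U' y gives the same projection
```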

1.5.3. Application: Overdetermined Systems of Equations

Let $y \in \mathbb{R}^n$ and let $X$ be $n \times k$ with linearly independent columns.

Given $X$ and $y$, we seek $b \in \mathbb{R}^k$ that satisfies the system of linear equations $X b = y$.

If $n > k$ (more equations than unknowns), then the system is said to be overdetermined.

Intuitively, we may not be able to find a $b$ that satisfies all $n$ equations.

The best approach here is to

  • Accept that an exact solution may not exist.

  • Look instead for an approximate solution.

By approximate solution, we mean a $b \in \mathbb{R}^k$ such that $X b$ is close to $y$.

The next theorem shows that a best approximation is well defined and unique.

The proof uses Theorem 1.1.

Theorem 1.5

The unique minimizer of $\| y - X b \|$ over $b \in \mathbb{R}^k$ is

$$
\hat \beta := (X' X)^{-1} X' y
$$

Proof. Note that

$$
X \hat \beta = X (X' X)^{-1} X' y = P y
$$

Since $P y$ is the orthogonal projection onto $\mathrm{span}(X)$ we have

$$
\| y - P y \| \leq \| y - z \| \text{ for any } z \in \mathrm{span}(X)
$$

Because $X b \in \mathrm{span}(X)$

$$
\| y - X \hat \beta \| \leq \| y - X b \| \text{ for any } b \in \mathbb{R}^k
$$

This is what we aimed to show.
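For comparison, here is a sketch that checks the closed-form solution against NumPy's least squares routine (the data are arbitrary illustrations):

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # (X'X)^{-1} X'y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # library least squares
print(np.allclose(beta_hat, beta_lstsq))            # True
```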

1.6. Least Squares Regression

Let’s apply the theory of orthogonal projection to least squares regression.

This approach provides insights about many geometric properties of linear regression.

We treat only some examples.

1.6.1. Squared Risk Measures

Given pairs $(x, y) \in \mathbb{R}^K \times \mathbb{R}$, consider choosing $f : \mathbb{R}^K \to \mathbb{R}$ to minimize the risk

$$
R(f) := \mathbb{E}\, [(y - f(x))^2]
$$

If probabilities and hence $\mathbb{E}$ are unknown, we cannot solve this problem directly.

However, if a sample is available, we can estimate the risk with the empirical risk:

$$
\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{n=1}^N (y_n - f(x_n))^2
$$

Minimizing this expression is called empirical risk minimization.

The set $\mathcal{F}$ is sometimes called the hypothesis space.

The theory of statistical learning tells us that to prevent overfitting we should take the set $\mathcal{F}$ to be relatively simple.

If we let $\mathcal{F}$ be the class of linear functions, the problem is

$$
\min_{b \in \mathbb{R}^K} \; \sum_{n=1}^N (y_n - b' x_n)^2
$$

This is the sample linear least squares problem.

1.6.2. Solution

Define the matrices

$$
y :=
\begin{pmatrix}
    y_1 \\
    y_2 \\
    \vdots \\
    y_N
\end{pmatrix},
\quad
x_n :=
\begin{pmatrix}
    x_{n1} \\
    x_{n2} \\
    \vdots \\
    x_{nK}
\end{pmatrix}
= \text{$n$-th obs on all regressors}
$$

and

$$
X :=
\begin{pmatrix}
    x_1' \\
    x_2' \\
    \vdots \\
    x_N'
\end{pmatrix}
:=:
\begin{pmatrix}
    x_{11} & x_{12} & \cdots & x_{1K} \\
    x_{21} & x_{22} & \cdots & x_{2K} \\
    \vdots & \vdots &        & \vdots \\
    x_{N1} & x_{N2} & \cdots & x_{NK}
\end{pmatrix}
$$

We assume throughout that $N > K$ and $X$ is full column rank.

If you work through the algebra, you will be able to verify that $\| y - X b \|^2 = \sum_{n=1}^N (y_n - b' x_n)^2$.

Since monotone transforms don't affect minimizers, we have

$$
\mathop{\mathrm{arg\,min}}_{b \in \mathbb{R}^K} \sum_{n=1}^N (y_n - b' x_n)^2
= \mathop{\mathrm{arg\,min}}_{b \in \mathbb{R}^K} \| y - X b \|
$$

By our results about overdetermined linear systems of equations, the solution is

$$
\hat \beta := (X' X)^{-1} X' y
$$

Let $P$ and $M$ be the projection and annihilator associated with $X$:

$$
P := X (X' X)^{-1} X'
\quad \text{and} \quad
M := I - P
$$

The vector of fitted values is

$$
\hat y := X \hat \beta = P y
$$

The vector of residuals is

$$
\hat u := y - \hat y = y - P y = M y
$$

Here are some more standard definitions:

  • The total sum of squares is $:= \| y \|^2$.

  • The sum of squared residuals is $:= \| \hat u \|^2$.

  • The explained sum of squares is $:= \| \hat y \|^2$.

TSS = ESS + SSR

We can prove this easily using Theorem 1.1.

From Theorem 1.1 we have $y = \hat y + \hat u$ and $\hat u \perp \hat y$.

Applying the Pythagorean law completes the proof.
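A quick numerical verification of this decomposition, again with arbitrary illustrative data:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

P = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = P @ y                          # fitted values
u_hat = y - y_hat                      # residuals

tss = np.linalg.norm(y)**2
ess = np.linalg.norm(y_hat)**2
ssr = np.linalg.norm(u_hat)**2
print(np.allclose(tss, ess + ssr))     # True: TSS = ESS + SSR
```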

1.7. Orthogonalization and Decomposition

Let’s return to the connection between linear independence and orthogonality touched on above.

A result of much interest is a famous algorithm for constructing orthonormal sets from linearly independent sets.

The next section gives details.

1.7.1. Gram-Schmidt Orthogonalization

Theorem 1.6

For each linearly independent set $\{x_1, \ldots, x_k\} \subset \mathbb{R}^n$, there exists an orthonormal set $\{u_1, \ldots, u_k\}$ with

$$
\mathrm{span} \{x_1, \ldots, x_i\} = \mathrm{span} \{u_1, \ldots, u_i\}
\quad \text{for} \quad
i = 1, \ldots, k
$$

The Gram-Schmidt orthogonalization procedure constructs an orthogonal set $\{u_1, u_2, \ldots, u_k\}$.

One description of this procedure is as follows:

  • For $i = 1, \ldots, k$, form $S_i := \mathrm{span}\{x_1, \ldots, x_i\}$ and $S_i^{\perp}$

  • Set $v_1 = x_1$ and $u_1 := v_1 / \| v_1 \|$

  • For $i \geq 2$, set $v_i := \hat E_{S_{i-1}^{\perp}} x_i$ and $u_i := v_i / \| v_i \|$

The sequence $u_1, \ldots, u_k$ has the stated properties.

A Gram-Schmidt orthogonalization construction is a key idea behind the Kalman filter described in A First Look at the Kalman filter.

In some exercises below, you are asked to implement this algorithm and test it using projection.

1.7.2. QR Decomposition

The following result uses the preceding algorithm to produce a useful decomposition.

Theorem 1.7

If $X$ is $n \times k$ with linearly independent columns, then there exists a factorization $X = Q R$ where

  • $R$ is $k \times k$, upper triangular, and nonsingular

  • $Q$ is $n \times k$ with orthonormal columns

Proof. Let

  • $x_j := \mathrm{col}_j (X)$

  • $\{u_1, \ldots, u_k\}$ be orthonormal with the same span as $\{x_1, \ldots, x_k\}$ (to be constructed using Gram–Schmidt)

  • $Q$ be formed from columns $u_i$

Since $x_j \in \mathrm{span}\{u_1, \ldots, u_j\}$, we have

$$
x_j = \sum_{i=1}^j \langle u_i, x_j \rangle u_i
\quad \text{for } j = 1, \ldots, k
$$

Some rearranging gives $X = Q R$.

1.7.3. Linear Regression via QR Decomposition

For matrices $X$ and $y$ that overdetermine $\beta$ in the linear equation system $y = X \beta$, we found the least squares approximator $\hat \beta = (X' X)^{-1} X' y$.

Using the QR decomposition $X = Q R$ gives

$$
\hat \beta
= (R' Q' Q R)^{-1} R' Q' y
= (R' R)^{-1} R' Q' y
= R^{-1} Q' y
$$

where the last step uses the fact that $(R' R)^{-1} R' = R^{-1}$ since $R$ is nonsingular.

Numerical routines would in this case use the alternative form $R \hat \beta = Q' y$ and back substitution.
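The sketch below implements this approach with `scipy.linalg.qr` and `scipy.linalg.solve_triangular` (back substitution) and compares the result with the normal-equations formula; the data are arbitrary illustrations:

```python
import numpy as np
from scipy.linalg import qr, solve_triangular

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

Q, R = qr(X, mode='economic')
beta_qr = solve_triangular(R, Q.T @ y)        # solve R beta = Q'y by back substitution
beta_ne = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y via normal equations
print(np.allclose(beta_qr, beta_ne))          # True
```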

1.8. Exercises

Exercise 1.1

Show that, for any linear subspace $S \subset \mathbb{R}^n$, $S \cap S^{\perp} = \{0\}$.

Exercise 1.2

Let $P = X (X' X)^{-1} X'$ and let $M = I - P$. Show that $P$ and $M$ are both idempotent and symmetric. Can you give any intuition as to why they should be idempotent?

Exercise 1.3

Using Gram-Schmidt orthogonalization, produce a linear projection of $y$ onto the column space of $X$ and verify this using the projection matrix $P := X (X' X)^{-1} X'$ and also using QR decomposition, where:

$$
y :=
\begin{pmatrix}
    1 \\
    3 \\
    -3
\end{pmatrix},
$$

and

$$
X :=
\begin{pmatrix}
    1 & 0 \\
    0 & -6 \\
    2 & 2
\end{pmatrix}
$$