Introduction

Algebra is the branch of mathematics that studies mathematical symbols and the rules for manipulating them. It is a broad area that covers topics such as equations, functions, polynomials, matrices, and vectors, and it is used extensively in fields such as science, engineering, economics, and computer science. Its fundamental concepts include operations such as addition, subtraction, multiplication, and division, as well as the properties of numbers and variables.

The basics

Linear algebra lets us write a system of linear equations compactly in matrix notation. For example, the system

\[4x_1 - 5x_2 = -13\] \[-2x_1 + 3x_2 = 9\]

becomes \(Ax = b\), with

\(A = \begin{bmatrix} 4 & -5 \\ -2 & 3 \\ \end{bmatrix}\) the coefficient matrix

and \(b = \begin{bmatrix} -13 \\ 9 \\ \end{bmatrix}\) the column vector of right-hand sides.

We denote by \(a_{ij}\) the entry of A in the \(i^{th}\) row and \(j^{th}\) column.
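
As a quick numerical check, here is a minimal NumPy sketch (the array values simply encode the system above) that solves \(Ax = b\):

```python
import numpy as np

# Coefficient matrix and right-hand side from the system above
A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])
b = np.array([-13.0, 9.0])

# Solve Ax = b
x = np.linalg.solve(A, b)
print(x)                       # [3. 5.]
print(np.allclose(A @ x, b))   # True: the solution satisfies both equations
```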

With this notation in place, the product of two matrices \(A \in R^{mxn}\) and \(B \in R^{nxp}\) is the matrix \(C = AB \in R^{mxp}\), where \(C_{ij} = \sum_{k=1}^n A_{ik} B_{kj}\).

Similarly, the inner product or dot product of two vectors \(x, y \in R^n\) is the quantity \(x^T y \in R\): \(x^T y = \begin{bmatrix} x_1 & x_2 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \\ \end{bmatrix} = \sum_{i=1}^n x_i y_i\)

We have \(x^T y = y^T x\).

For the outer product of two vectors \(x \in R^m\) and \(y \in R^n\), the result is a matrix \(xy^T \in R^{mxn}\) with \((xy^T)_{ij} = x_i y_j\)

\[xy^T = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \\ \end{bmatrix} \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix} = \begin{bmatrix} x_1 y_1 & x_1 y_2 & \cdots & x_1 y_n \\ x_2 y_1 & x_2 y_2 & \cdots & x_2 y_n \\ \vdots & \vdots & \ddots & \vdots \\ x_m y_1 & x_m y_2 & \cdots & x_m y_n \\ \end{bmatrix} \in R^{mxn}\]
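
A short NumPy sketch (with small random matrices, chosen only for illustration) that checks these three products against their explicit sum definitions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # A in R^{3x4}
B = rng.standard_normal((4, 2))   # B in R^{4x2}
x = rng.standard_normal(4)
y = rng.standard_normal(4)

# Matrix product: C_ij = sum_k A_ik B_kj
C = A @ B
C_explicit = np.array([[sum(A[i, k] * B[k, j] for k in range(4))
                        for j in range(2)] for i in range(3)])
print(np.allclose(C, C_explicit))                              # True

# Inner product: x^T y = sum_i x_i y_i, and x^T y = y^T x
print(np.isclose(x @ y, sum(x[i] * y[i] for i in range(4))))   # True
print(np.isclose(x @ y, y @ x))                                # True

# Outer product: (x y^T)_ij = x_i y_j
outer = np.outer(x, y)
print(np.isclose(outer[1, 2], x[1] * y[2]))                    # True
```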

There are some properties we need to remember:

  • Matrix multiplication is associative (AB)C = A(BC)

  • Matrix multiplication is distributive A(B+C) = AB+AC

  • Matrix multiplication is generally not commutative \(AB \neq BA\)

Definitions

The identity matrix, denoted \(I \in R^{nxn}\), is a square matrix with 1s on the diagonal and 0s everywhere else.

\[I_{ij} = \begin{cases} 1, i = j \\ 0, i \neq j \\ \end{cases}\]

The identity matrix satisfies \(\forall A \in R^{mxn}: AI = A = IA\) (where the two identity matrices have the appropriate sizes, n x n and m x m respectively).

A diagonal matrix is a matrix where all off-diagonal elements are 0. It is denoted \(D = diag(d_1, d_2, ..., d_n)\) with \(D_{ij} = \begin{cases} d_i, i = j \\ 0, i \neq j \\ \end{cases}\). In this notation, I = diag(1, 1, ..., 1).
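
A small NumPy sketch of these definitions (the size n = 3 and the diagonal entries are arbitrary choices):

```python
import numpy as np

n = 3
I = np.eye(n)                       # identity matrix, 1s on the diagonal
D = np.diag([2.0, 5.0, 7.0])        # diagonal matrix diag(2, 5, 7)

A = np.arange(12.0).reshape(4, 3)   # some A in R^{4x3}

# AI = A (I here is 3x3; IA needs the 4x4 identity)
print(np.allclose(A @ I, A))                 # True
print(np.allclose(np.eye(4) @ A, A))         # True

# The identity is a special diagonal matrix: I = diag(1, ..., 1)
print(np.allclose(I, np.diag(np.ones(n))))   # True
```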

In matrix algebra, transposing flips a matrix over its diagonal, swapping rows and columns. Given a matrix \(A \in R^{mxn}\), its transpose \(A^T \in R^{nxm}\) is defined by \((A^T)_{ij} = A_{ji}\). Some properties of transposes are:

  • \[(A^T)^T = A\]
  • \[(AB)^T = B^T A^T\]
  • \[(A+B)^T = A^T + B^T\]

A square matrix \(A \in R^{nxn}\) is symmetric if \(A = A^T\). It is anti-symmetric if \(A = - A^T\). For any \(A \in R^{nxn}\), \(A + A^T\) is symmetric and \(A - A^T\) is anti-symmetric. Hence, any square matrix \(A \in R^{nxn}\) can be represented as the sum of a symmetric matrix and an anti-symmetric matrix, since \(A = \frac{1}{2} (A + A^T) + \frac{1}{2} (A - A^T)\).
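
The transpose properties and the symmetric/anti-symmetric decomposition can be verified numerically; a sketch with random square matrices (chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))

# Transpose properties
print(np.allclose((A @ B).T, B.T @ A.T))     # (AB)^T = B^T A^T
print(np.allclose((A + B).T, A.T + B.T))     # (A+B)^T = A^T + B^T

# Symmetric + anti-symmetric decomposition
sym = 0.5 * (A + A.T)
antisym = 0.5 * (A - A.T)
print(np.allclose(sym, sym.T))               # symmetric part
print(np.allclose(antisym, -antisym.T))      # anti-symmetric part
print(np.allclose(A, sym + antisym))         # A is recovered
```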

Let’s denote the set of all symmetric matrices of size n as \(S^n\), and look at different properties of them.

The trace of a square matrix \(A \in R^{nxn}\) is the sum of the diagonal elements of the matrix, \(tr(A) = \sum_{i=1}^n A_{ii}\). The trace has the following properties:

  • For \(A \in R^{nxn}, tr(A) = tr(A^T)\)

  • For \(A, B \in R^{nxn}, tr(A+B) = tr(A) + tr(B)\)

  • For \(A \in R^{nxn}, t \in R, tr(tA) = t tr(A)\)

  • For A, B such that AB is square, tr(AB) = tr(BA)

  • For A, B, C such that ABC is square, tr(ABC) = tr(BCA) = tr(CAB)
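
The cyclic property tr(AB) = tr(BA) is easy to check numerically; a minimal NumPy sketch with random matrices (shapes chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 3))

# tr(AB) = tr(BA), even though AB is 3x3 and BA is 5x5
print(np.isclose(np.trace(A @ B), np.trace(B @ A)))             # True

C = rng.standard_normal((3, 3))
D = rng.standard_normal((3, 3))
print(np.isclose(np.trace(C + D), np.trace(C) + np.trace(D)))   # True
```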

A norm of a vector, written \(\mid\mid x \mid\mid\), measures the “length” of that vector. For example, the Euclidean (\(l_2\)) norm is \(\mid\mid x\mid\mid_2 = \sqrt{\sum_{i=1}^{n} x_i^2}\). Note that \(\mid\mid x \mid\mid_2^2 = x^T x\).

Mathematically, a norm is a function \(f: R^n \to R\) that satisfies:

  • \(\forall x \in R^n, f(x) \geq 0\) (non negativity)

  • f(x) = 0 if and only if x = 0 (definiteness)

  • \(\forall x \in R^n, t \in R, f(tx) = \mid t\mid f(x)\) (homogeneity)

  • \(\forall x, y \in R^n, f(x+y) \leq f(x) + f(y)\) (triangle inequality)

Apart from the Euclidean norm, we have the \(l_1\) norm \(\mid\mid x\mid\mid_1 = \sum_{i=1}^n \mid x_i \mid\) and the \(l_{\infty}\) norm \(\mid\mid x\mid\mid_{\infty} = \max_i \mid x_i \mid\).

These are members of the family of \(l_p\) norms, parameterized by a real number \(p \geq 1\): \(\mid\mid x\mid\mid_p = (\sum_{i=1}^n \mid x_i\mid^p)^{1/p}\).

For matrices, we have the Frobenius norm:

\[\mid\mid A\mid\mid_F = \sqrt{\sum_{i=1}^m \sum_{j=1}^n A_{ij}^2} = \sqrt{tr(A^T A)}\]
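
All of these norms are available through np.linalg.norm; a sketch comparing the built-ins to the formulas above (the vector and matrix values are just illustrative):

```python
import numpy as np

x = np.array([3.0, -4.0, 12.0])
A = np.array([[1.0, 2.0], [3.0, 4.0]])

print(np.linalg.norm(x, 2), np.sqrt(np.sum(x ** 2)))      # l2 norm: 13.0
print(np.linalg.norm(x, 1), np.sum(np.abs(x)))            # l1 norm: 19.0
print(np.linalg.norm(x, np.inf), np.max(np.abs(x)))       # l-infinity norm: 12.0

# Frobenius norm equals sqrt(tr(A^T A))
print(np.linalg.norm(A, 'fro'), np.sqrt(np.trace(A.T @ A)))
```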

Properties

Next we study the linear independence of vectors. A set of vectors \(\{ x_1, x_2, ..., x_n \} \subset R^m\) is (linearly) independent if no vector can be represented as a linear combination of the remaining vectors. Conversely, if some vector can be represented as a linear combination of the remaining vectors, the vectors are linearly dependent: for example, \(x_n = \sum_{i=1}^{n-1} \alpha_i x_i\) for some scalars \(\alpha_i\). The column rank of a matrix \(A \in R^{mxn}\) is the size of the largest subset of columns of A that is linearly independent. Similarly, the row rank is the number of linearly independent rows. Since for every \(A \in R^{mxn}\) the column rank equals the row rank, we simply call it rank(A). Here are some properties of the rank:

  • For \(A \in R^{mxn}, rank(A) \leq min(m,n)\). If \(rank(A) = min(m,n)\), A is said to be full rank.

  • For \(A \in R^{mxn}, rank(A) = rank(A^T)\)

  • For \(A \in R^{mxn}, B \in R^{nxp}, rank(AB) \leq min(rank(A), rank(B))\)

  • For \(A, B \in R^{mxn}, rank(A+B) \leq rank(A) + rank(B)\)
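
NumPy exposes the rank through np.linalg.matrix_rank; a sketch with a deliberately rank-deficient matrix (the particular entries are an assumption for illustration):

```python
import numpy as np

# Third row = 2 * first row + 3 * second row, so the rank is 2, not 3
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [2.0, 3.0, 5.0]])

print(np.linalg.matrix_rank(A))        # 2
print(np.linalg.matrix_rank(A.T))      # 2: rank(A) = rank(A^T)

B = np.array([[1.0, 2.0, 3.0]])        # rank 1
print(np.linalg.matrix_rank(A @ B.T))  # 1 <= min(rank(A), rank(B))
```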

The inverse of a square matrix \(A \in R^{nxn}\) is denoted \(A^{-1}\) and is the unique matrix such that \(A^{-1} A = I = A A^{-1}\). Not all matrices have an inverse: non-square matrices do not, and neither do some square matrices. A non-invertible matrix is called singular. For a square matrix A to be invertible, it must be full rank. Here are some properties of invertible matrices:

  • \[(A^{-1})^{-1} = A\]
  • \[(AB)^{-1} = B^{-1} A^{-1}\]
  • \[A^{-T} = (A^{-1})^T = (A^T)^{-1}\]

If A is invertible, then the solution x of the system Ax = b is \(x = A^{-1} b\).
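
A sketch using the 2x2 system from the beginning of these notes, comparing the explicit inverse with np.linalg.solve:

```python
import numpy as np

A = np.array([[4.0, -5.0],
              [-2.0, 3.0]])
b = np.array([-13.0, 9.0])

A_inv = np.linalg.inv(A)
print(np.allclose(A_inv @ A, np.eye(2)))        # A^{-1} A = I
print(A_inv @ b)                                # x = A^{-1} b -> [3. 5.]

# In practice, solve(A, b) is preferred over forming A^{-1} explicitly
print(np.linalg.solve(A, b))                    # [3. 5.]
```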

Two vectors \(x, y \in R^n\) are orthogonal if \(x^T y = 0\). A vector \(x \in R^n\) is normalized if \(\mid \mid x \mid \mid_2 = 1\). A square matrix \(U \in R^{nxn}\) is orthogonal if all its columns are orthogonal to each other and are normalized; the columns are then called orthonormal. In that case \(U^T U = I = U U^T\), so the inverse of an orthogonal matrix is its transpose. If U is not square but its columns are orthonormal, then \(U^T U = I\) still holds but \(U U^T \neq I\). Multiplying a vector by an orthogonal matrix does not change the vector’s Euclidean norm, since \(\mid \mid Ux \mid \mid_2^2 = x^T U^T U x = x^T x\); hence \(\mid \mid Ux \mid \mid_2 = \mid \mid x \mid\mid_2\) for all \(x \in R^n\) and orthogonal \(U \in R^{nxn}\).
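
One way to obtain an orthogonal matrix for a numerical check is to take the Q factor of a QR decomposition of a random matrix (purely an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.standard_normal((4, 4)))   # Q of a QR factorization is orthogonal

print(np.allclose(U.T @ U, np.eye(4)))             # U^T U = I
print(np.allclose(U @ U.T, np.eye(4)))             # U U^T = I
print(np.allclose(np.linalg.inv(U), U.T))          # the inverse is the transpose

x = rng.standard_normal(4)
print(np.isclose(np.linalg.norm(U @ x), np.linalg.norm(x)))  # norm preserved
```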

The span of a set of vectors \(\{x_1, x_2, ..., x_n\}\) is the set of all vectors that can be expressed as linear combinations of them: \(span(\{x_1,...,x_n\}) = \{ v: v=\sum_{i=1}^n \alpha_i x_i, \alpha_i \in R \}\). If the \(x_i \in R^n\) are n linearly independent vectors, then their span is all of \(R^n\). The projection of a vector \(y \in R^m\) onto \(span(\{x_1, ..., x_n\})\) (with \(x_i \in R^m\)) is the vector v in the span that is closest to y in Euclidean norm.

\[Proj(y; \{x_1, ..., x_n\}) = argmin_{v \in span(\{x_1,...,x_n\})} \mid\mid y - v\mid\mid_2\]
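
When the \(x_i\) are stacked as the columns of a matrix X with full column rank (an assumption for this sketch), the projection has the closed form \(X(X^T X)^{-1} X^T y\), which coincides with the least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 2))     # columns x_1, x_2 in R^5
y = rng.standard_normal(5)

# Projection of y onto span({x_1, x_2}), assuming X has full column rank
proj = X @ np.linalg.inv(X.T @ X) @ X.T @ y

# Equivalently via least squares: argmin_w ||y - Xw||_2, then v = Xw
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(proj, X @ w))           # True

# The residual is orthogonal to every column of X
print(np.allclose(X.T @ (y - proj), 0))   # True
```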

The range of a matrix \(A \in R^{mxn}\) is the span of the columns of A:

\[R(A) = \{ v \in R^m: v = Ax, x \in R^n \}\]

The nullspace of a matrix \(A \in R^{mxn}\) is the set of all vectors that equal 0 when multiplied by A:

\[N(A) = \{x \in R^n: Ax = 0 \}\]

The determinant of a square matrix \(A \in R^{nxn}\) is a function \(det: R^{nxn} \to R\), also written \(\mid A \mid\). Writing \(A_{\backslash i, \backslash j}\) for the matrix A with its \(i^{th}\) row and \(j^{th}\) column removed, the determinant can be computed by the cofactor (Laplace) expansion:

\(\mid A \mid = \sum_{i=1}^n (-1)^{i+j} a_{ij} \mid A_{\backslash i, \backslash j} \mid\) for any \(j \in 1,2,...,n\) (expansion along column j)

\(= \sum_{j=1}^n (-1)^{i+j} a_{ij} \mid A_{\backslash i, \backslash j} \mid\) for any \(i \in 1,2,...,n\) (expansion along row i)

For a 2x2 matrix

\[A = \begin{bmatrix} 1 & 3 \\ 3 & 2 \\ \end{bmatrix}\]

the rows are [1, 3] and [3, 2]. The absolute value of the determinant of this matrix is the area of the parallelogram spanned by the two row vectors.

(Figure: the parallelogram spanned by the row vectors [1, 3] and [3, 2].)
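
A sketch computing this determinant with np.linalg.det and checking two of the properties listed below:

```python
import numpy as np

A = np.array([[1.0, 3.0],
              [3.0, 2.0]])

d = np.linalg.det(A)
print(d)               # 1*2 - 3*3 = -7
print(abs(d))          # 7: area of the parallelogram spanned by the rows

# |A| = |A^T| and |AB| = |A||B|
B = np.array([[2.0, 0.0], [1.0, 1.0]])
print(np.isclose(np.linalg.det(A.T), d))                        # True
print(np.isclose(np.linalg.det(A @ B), d * np.linalg.det(B)))   # True
```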

Here are some properties of the determinant:

  • The determinant of the identity is 1, \(\mid I \mid = 1\)

  • If we multiply a single row of \(A \in R^{nxn}\) by a scalar \(t \in R\), then the determinant becomes \(t \mid A \mid\).

  • If we exchange any two rows of A, then the determinant changes sign, becoming \(-\mid A\mid\).

  • For \(A \in R^{nxn}, \mid A \mid = \mid A^T \mid\)

  • For \(A, B \in R^{nxn}, \mid AB \mid = \mid A\mid \mid B\mid\)

  • For \(A \in R^{nxn}, \mid A \mid = 0\) if and only if A is singular (it doesn’t have a full rank and the columns are linearly dependent).

  • For \(A \in R^{nxn}\) and A non singular \(\mid A^{-1} \mid = 1 / \mid A \mid\)

The adjoint of a matrix \(A \in R^{nxn}\) is \(adj(A) \in R^{nxn}, (adj(A))_{ij} = (-1)^{i+j} \mid A_{\backslash j,\backslash i} \mid\). For any nonsingular \(A \in R^{nxn}, A^{-1} = \frac{1}{\mid A\mid} adj(A)\)

Given a square matrix \(A \in R^{nxn}\) and a vector \(x \in R^n\) the scalar value \(x^T A x\) is a quadratic form.

\[x^T A x = \sum_{i=1}^n x_i(Ax)_i = \sum_{i=1}^n x_i (\sum_{j=1}^n A_{ij}x_j) = \sum_{i=1}^n \sum_{j=1}^n A_{ij} x_i x_j\]
  • A symmetric matrix \(A \in S^n\) is positive definite (PD) if for all non-zero vectors \(x \in R^n, x^T A x > 0\)

  • A symmetric matrix \(A \in S^n\) is positive semidefinite (PSD) if for all vectors \(x \in R^n, x^T A x \geq 0\)

  • A symmetric matrix \(A \in S^n\) is negative definite (ND) if for all non zero \(x \in R^n, x^T A x < 0\)

  • A symmetric matrix \(A \in S^n\) is negative semidefinite (NSD) if for all \(x \in R^n, x^T A x \leq 0\)

  • A symmetric matrix \(A \in S^n\) is indefinite (neither positive semidefinite nor negative semidefinite) if there exists \(x_1, x_2 \in R^n\) such that \(x_1^T A x_1 > 0\) and \(x_2^T A x_2 < 0\)

If A is positive definite then -A is negative definite and vice versa; likewise, if A is positive semidefinite then -A is negative semidefinite and vice versa. If A is indefinite then so is -A. Positive definite and negative definite matrices are always full rank and hence invertible.
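
For a symmetric matrix, definiteness can be checked through the signs of its eigenvalues (a standard criterion, used here only as a numerical check). A sketch with a matrix that is positive definite by construction:

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((4, 4))

A = B.T @ B + 0.1 * np.eye(4)     # symmetric and positive definite by construction

# For symmetric A: positive definite <=> all eigenvalues > 0
eigvals = np.linalg.eigvalsh(A)
print(np.all(eigvals > 0))        # True

# The quadratic form x^T A x is positive for a (random) non-zero x
x = rng.standard_normal(4)
print(x @ A @ x > 0)              # True

# -A is then negative definite
print(np.all(np.linalg.eigvalsh(-A) < 0))   # True
```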

Eigen decomposition

For a square matrix \(A \in R^{nxn}\) we say \(\lambda \in C\) is an eigenvalue of A and \(x \in C^n\) is the corresponding eigenvector if \(Ax = \lambda x, x \neq 0\). This simply means that multiplying A by x gives a new vector in the same direction as x but scaled by the factor \(\lambda\). For any eigenvector \(x \in C^n\) and scalar \(c \in C\), \(A(cx) = cAx = c\lambda x = \lambda(cx)\), so cx is also an eigenvector; for this reason we usually normalize eigenvectors to have length 1. The eigenvalue equation can be rewritten as \((\lambda I - A) x = 0, x \neq 0\). This has a non-zero solution x if and only if \((\lambda I - A)\) has a non-empty nullspace, which is the case exactly when \((\lambda I - A)\) is singular, i.e. its determinant is 0: \(\mid \lambda I - A \mid = 0\). This determinant is a polynomial in \(\lambda\) of degree n (the characteristic polynomial), and its n (possibly complex) roots are the n eigenvalues. We then substitute each eigenvalue back into \((\lambda I - A)x = 0\) to find the corresponding eigenvector. Here are some properties of eigenvalues and eigenvectors:

  • The trace of A is the sum of its eigenvalues: \(tr(A) = \sum_{i=1}^n \lambda_i\)

  • The determinant of A is the product of its eigenvalues: \(\mid A\mid = \prod_{i=1}^n \lambda_i\)

  • If A is diagonalizable, the rank of A is the number of its non-zero eigenvalues

  • If A is nonsingular then \((\frac{1}{\lambda_i})\) is an eigenvalue of \(A^{-1}\) with the associated eigenvector \(x_i\)

  • The eigenvalues of a diagonal matrix D are the diagonal entries.

If A has n linearly independent eigenvectors, we can collect them as the columns of a matrix V and the eigenvalues into a vector \(\lambda\), giving the eigendecomposition \(A = V diag(\lambda) V^{-1}\). Similar to the eigendecomposition, the singular value decomposition (SVD) provides another way to factorize a matrix, into singular vectors and singular values. The SVD is more broadly applicable, since every real matrix has one. The SVD is \(A = U D V^T\); if A is m x n, then U is m x m, D is m x n, and V is n x n, where U and V are orthogonal and D is a (rectangular) diagonal matrix holding the singular values.
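
A sketch verifying both factorizations with np.linalg.eig and np.linalg.svd (random matrices as input; the eigendecomposition part assumes A is diagonalizable, which holds for this random example):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 3))

# Eigendecomposition: A = V diag(lambda) V^{-1} (eigenvalues may be complex)
lam, V = np.linalg.eig(A)
print(np.allclose(A, V @ np.diag(lam) @ np.linalg.inv(V)))   # True

# tr(A) = sum of eigenvalues, |A| = product of eigenvalues
print(np.isclose(np.trace(A), lam.sum().real))               # True
print(np.isclose(np.linalg.det(A), lam.prod().real))         # True

# SVD: A = U D V^T, which exists for any real matrix, square or not
M = rng.standard_normal((4, 3))
U, s, Vt = np.linalg.svd(M)          # s holds the singular values
D = np.zeros((4, 3))
D[:3, :3] = np.diag(s)
print(np.allclose(M, U @ D @ Vt))    # True
```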

For matrices that are not invertible, there is the Moore-Penrose pseudoinverse: \(A^+ = \lim_{\alpha \to 0} (A^T A + \alpha I)^{-1} A^T\). In practice it is computed as \(A^+ = V D^+ U^T\), where U, D, and V come from the SVD of A. The pseudoinverse \(D^+\) of a diagonal matrix D is obtained by taking the reciprocal of each non-zero element and then transposing the resulting matrix.
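
np.linalg.pinv computes the Moore-Penrose pseudoinverse through the SVD; a sketch with a tall (non-square) random matrix, which has full column rank with probability 1:

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 2))        # tall, not invertible in the usual sense

A_pinv = np.linalg.pinv(A)             # computed via the SVD internally
print(A_pinv.shape)                    # (2, 4)

# For full column rank A, the pseudoinverse reduces to (A^T A)^{-1} A^T
print(np.allclose(A_pinv, np.linalg.inv(A.T @ A) @ A.T))   # True

# A^+ A = I (but A A^+ is only a projection, not the identity)
print(np.allclose(A_pinv @ A, np.eye(2)))                  # True
```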

Matrix calculus

Let \(f: R^{mxn} \to R\) be a function that takes as input a matrix A and returns a real value. Then the gradient of f is the matrix of partial derivatives:

\[\nabla_A f(A) \in R^{mxn} = \begin{bmatrix} \frac{\partial f(A)}{\partial A_{11}} & \cdots & \frac{\partial f(A)}{\partial A_{1n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(A)}{\partial A_{m1}} & \cdots & \frac{\partial f(A)}{\partial A_{mn}} \\ \end{bmatrix}\]

The Hessian of a function \(f: R^n \to R\) is the n x n matrix of second-order partial derivatives:

\[\nabla_x^2 f(x) \in R^{nxn} = \begin{bmatrix} \frac{\partial^2 f(x)}{\partial x_1^2} & \cdots & \frac{\partial^2 f(x)}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f(x)}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 f(x)}{\partial x_n^2} \\ \end{bmatrix}\]
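
As a concrete example tying this back to quadratic forms (a standard result, stated here without derivation): for \(f(x) = x^T A x\), the gradient is \((A + A^T)x\) and the Hessian is \(A + A^T\). A sketch checking the gradient against finite differences (random values as assumptions):

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((3, 3))
x = rng.standard_normal(3)

f = lambda x: x @ A @ x               # quadratic form f(x) = x^T A x

# Analytic gradient and Hessian of the quadratic form
grad = (A + A.T) @ x
hess = A + A.T
print(np.allclose(hess, hess.T))      # the Hessian is symmetric

# Central finite-difference check of the gradient
eps = 1e-6
grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(grad, grad_fd, atol=1e-5))   # True
```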