Variational calculus


A function space, \mathfrak F, is a (topological) vector space whose elements are functions with a common domain. We assume that the functions in the space are differentiable to any order as needed.

A functional is a function \mathcal F: \mathfrak F \to \mathbb R.

It should be noted that a functional eats a function and returns a real number, rather than eating a value of a function. So, if \mathcal F is a functional and f(x)\in \mathfrak F a function, we write \mathcal F[f], which is a real number. It does not matter what the value of f(x) is at any particular x; a functional sees the whole function f:D \to C, i.e. the rule.

Remark: A functional is not a composition of functions like h(x):=f\circ g(x), because there f acts on the value of g(x), not on the rule g.

Example: The integral F[f]=\int_x f(x)p(x)\mathrm d x is a functional; the same holds for the sum F[f]=\sum_i f(x_i)p(x_i). Note that f can be a constant function like f=c\in \mathbb R or the identity function f(x)=x. We can also write the functional as F[\cdot]=\int_x (\cdot)p(x)\mathrm d x to better denote that it eats a function.
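
In code, such an integral functional can be sketched as a higher-order function: it eats the rule f and returns a number. This is only an illustration; the helper name make_functional and the midpoint-rule quadrature are my own choices, not part of the text.

```python
def make_functional(p, a, b, n=10_000):
    """Return the functional F[f] = integral_a^b f(x) p(x) dx (midpoint rule)."""
    h = (b - a) / n
    def F(f):
        # F eats the rule f, not a single value f(x)
        return sum(f(a + (k + 0.5) * h) * p(a + (k + 0.5) * h) for k in range(n)) * h
    return F

# weight p(x) = 1 on [0, 1]
F = make_functional(p=lambda x: 1.0, a=0.0, b=1.0)
print(F(lambda x: x))    # identity function: about 0.5
print(F(lambda x: 2.0))  # constant function: about 2.0
```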

Adding two functions in a function vector space can be interpreted in two ways. Let f,g \in \mathfrak F and \varepsilon\in \mathbb R. Then, f + \varepsilon g can be interpreted as 1) perturbing f with \varepsilon g, or 2) h:=f + \varepsilon g is a function produced by moving along the direction of g in the function space and reaching the function h. The functions are members of the vector space, hence f + \varepsilon g = f + \varepsilon \|g\|\,(g/\|g\|) = f + \varepsilon \|g\| \hat g, where the norm is induced by an inner product on the function space. The term \hat g is the unit vector of g, and its direction relative to \hat f = f/\|f\| can be quantified using the space’s inner product.

Variation of a function

Definition: For functions f, \eta \in \mathfrak F and \varepsilon \in \mathbb R, the term \delta f(x):=\varepsilon \eta (x) is called a variation of f, for an arbitrary function \eta(x).

Let u:D\subset \mathbb R \to \mathbb R and F:\mathbb R \to \mathbb R be two (fixed) functions. Then, the variation of the composition F(u) is defined as,

    \[\delta F := F(u(x)+\delta u(x)) - F(u(x)) \quad \forall x \in D\]

where \delta u = \varepsilon \eta depends on \varepsilon. Note that \delta F is a function of x and \varepsilon on the domain D of u.

Observing that \delta F is a function of \varepsilon for a fixed \eta, the variation of F can be linearized for small variations of u. Letting \varepsilon_0 =0 and \varepsilon close to zero, we can use the Taylor series and write,

    \[\delta F := F(u(x)+\varepsilon \eta(x)) - F(u(x)) = F(u) + \frac{\mathrm d F(u+\varepsilon \eta)}{\mathrm d \varepsilon}\bigg|_{\varepsilon =\varepsilon_0=0}\varepsilon + \mathcal O(\varepsilon^2) - F(u(x)) \approx  \frac{\mathrm d F(u+\varepsilon \eta)}{\mathrm d \varepsilon}\bigg|_{\varepsilon =\varepsilon_0=0}\varepsilon \]

Letting y(\varepsilon):=u+\varepsilon \eta leads to,

    \[\delta F =\frac{\mathrm d F(y)}{\mathrm d y}\bigg|_{y(0)}\frac{\mathrm dy}{\mathrm d \varepsilon}\bigg|_{\varepsilon=0}\varepsilon=\frac{\mathrm d F(y)}{\mathrm d y}\bigg|_{y(0)}\eta\varepsilon=\frac{\mathrm d F(y)}{\mathrm d y}\bigg|_{u(x)}\delta u = \frac{\mathrm d F(u)}{\mathrm d u}\delta u\quad \forall x \in D \qquad (1)\]

which is called the linearized or the first variation of the function F(u) due to variation in its argument u.
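
As a quick numerical sanity check of Eq. 1 (with assumed choices F(u)=u^2, u(x)=\sin x, and \eta(x)=\cos x, which are mine, not from the text): the difference between the exact variation and its linearization should be of order \varepsilon^2.

```python
import math

# Check Eq. 1 for F(u) = u^2, where dF/du = 2u, so the first variation is 2 u delta_u
u   = lambda x: math.sin(x)
eta = lambda x: math.cos(x)   # an arbitrary (test) function
eps = 1e-6
x   = 0.7                     # any point of the domain

exact      = (u(x) + eps * eta(x)) ** 2 - u(x) ** 2  # delta F by definition
linearized = 2 * u(x) * (eps * eta(x))               # (dF/du) * delta u
print(abs(exact - linearized))  # of order eps**2
```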

Similarly, if F:\mathbb R^n \to \mathbb R with F=F(u_1, u_2,\cdots, u_n) and u_i:\mathbb R\to \mathbb R, then,

    \[\delta F =\sum_{i=1}^n \frac{\partial F}{\partial u_i}\delta u_i \quad \forall x\in D\]

Lemma 1: If F,G:\mathbb R\to \mathbb R and u:D\subset \mathbb R \to \mathbb R, then \delta (FG)=F\delta G + G\delta F. Proof is as follows.

    \[\delta (FG)=\frac{\mathrm d (FG)}{\mathrm d u}\delta u =F\frac{\mathrm dG}{\mathrm d u}\delta u + G\frac{\mathrm dF}{\mathrm d u}\delta u = F\delta G + G\delta F\]

Lemma 2: If u:D\to \mathbb R, then \delta \frac{\mathrm d u}{\mathrm dx}=\frac{\mathrm d }{\mathrm dx} \delta u. Because \text {LHS}=\frac{\mathrm d }{\mathrm d \varepsilon}\frac{\mathrm d (u+\varepsilon \eta)}{\mathrm d x}\bigg|_{\varepsilon=0}\varepsilon= \frac{\mathrm d \eta(x)}{\mathrm d x}\varepsilon=\text{RHS}

Variation of a functional

Variation of a functional should capture its change when there is an infinitesimal change in its argument, i.e. in a function.

Definition: For a functional \mathcal F, the term \delta \mathcal F := \mathcal F[f + \delta f] - \mathcal F[f] is called the variation of \mathcal F due to/in the direction of the variation of f.

To evaluate \delta \mathcal F, we observe that \mathcal F[f+\varepsilon \eta] is a function of \varepsilon for fixed f and \eta, i.e. \mathcal F_{f,\eta}(\varepsilon):=\mathcal F[f+\varepsilon \eta] is a real-valued function on \mathbb R.

Therefore, the rule \mathcal F_{f,\eta} is not a functional anymore; it is a function. The change in its argument, f+\varepsilon \eta, is caused by the change in \varepsilon. Hence, we can write the following limit (if it exists).

    \[\lim_{h \to 0} \frac{\mathcal F[f +(\varepsilon_0 + h) \eta]-\mathcal F[f +\varepsilon_0 \eta]}{h}=\frac{\mathrm d \mathcal F[f+\varepsilon \eta]}{\mathrm d \varepsilon}\bigg|_{\varepsilon=\varepsilon_0}\]

With that limit defined, the Taylor expansion of the function \mathcal F_{f,\eta}(\varepsilon) about zero is,

    \[\mathcal F_{f,\eta}(\varepsilon) = \mathcal F_{f,\eta}(\varepsilon)\bigg|_{\varepsilon = 0} + \frac{\mathrm d \mathcal F_{f,\eta}(\varepsilon)}{\mathrm d \varepsilon}\bigg|_{\varepsilon =0} \varepsilon +\mathcal O(\varepsilon^2)\]

or, keeping in mind that f and \eta are fixed and not functions of \varepsilon, the above can be written as,

    \[\mathcal F[f + \varepsilon \eta] = \mathcal F[f + \varepsilon \eta]\bigg|_{\varepsilon = 0} + \frac{\mathrm d\mathcal F[f+\varepsilon \eta]}{\mathrm d \varepsilon}\bigg|_{\varepsilon =0} \varepsilon +\mathcal O(\varepsilon^2)\]

Accordingly, the variation of the functional for small \varepsilon becomes,

(2)   \[\delta \mathcal F= \frac{\mathrm d\mathcal F[f+\varepsilon \eta]}{\mathrm d \varepsilon}\bigg|_{\varepsilon =0} \varepsilon \]

which is called the first variation of the functional or the Gateaux derivative of the functional. The function \eta is referred to as a test function.

The formula for the variation of a functional cannot be extended with partial derivatives as in Eq. 1, because \mathrm d\mathcal F/\mathrm du for u:=f+\varepsilon \eta is not defined in the function space. However, for a functional in the form of an integral (or sum) operator, the variation passes to the function inside the operator. In other words, the variation operator moves inside the integral operator.

Example: let \mathcal F[y] =\int_I \left(\left(\frac{\mathrm dy}{\mathrm dx}\right)^2 -wy\right) \mathrm dx, where I=[a,b] and y,w:\mathbb R \to \mathbb R, be a functional. Then, the variation of \mathcal F is as follows (with \delta y = \varepsilon \eta).

    \[\begin{split} \delta \mathcal F[y] &=\delta \int_I (\frac{\mathrm dy}{\mathrm dx})^2 -wy \mathrm dx =\frac{\mathrm d}{\mathrm d\varepsilon} \int_I \left((\frac{\mathrm d(y+\delta y)}{\mathrm dx})^2 -w(y+\delta y) \mathrm dx\right)\bigg|_{\varepsilon=0}\varepsilon\\&=\int_I \frac{\mathrm d}{\mathrm d\varepsilon}\bigg|_{\varepsilon=0}(\frac{\mathrm d(y+\delta y)}{\mathrm dx})^2\varepsilon -w\frac{\mathrm d}{\mathrm d\varepsilon}\bigg|_{\varepsilon=0}(y+\delta y)\varepsilon \mathrm dx =\int_I \delta(\frac{\mathrm d y}{\mathrm dx})^2 -w\delta y \mathrm dx\\&\text{by Eq. 1 }= \int_I 2(\frac{\mathrm d y}{\mathrm dx})\delta(\frac{\mathrm d y}{\mathrm dx}) -w\delta y \mathrm dx\end{split}\]

As seen in the example, the variation of the functional is transferred to the variations of functions.
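
The result of this example can be checked numerically. With assumed choices I=[0,1], y(x)=x^2, w(x)=1 and \eta(x)=\sin(\pi x) (mine, not from the text), the directly computed \delta\mathcal F agrees with \int_I (2 y'\,\delta y' - w\,\delta y)\,\mathrm dx up to order \varepsilon^2; the helper integrate is illustrative.

```python
import math

def integrate(g, a=0.0, b=1.0, n=2000):
    # midpoint rule on [a, b]
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

y    = lambda x: x * x             # assumed y on I = [0, 1]
dy   = lambda x: 2 * x
w    = lambda x: 1.0               # assumed weight function
eta  = lambda x: math.sin(math.pi * x)
deta = lambda x: math.pi * math.cos(math.pi * x)
eps  = 1e-5

F = lambda f, df: integrate(lambda x: df(x) ** 2 - w(x) * f(x))

# delta F by definition vs. the transferred (first-variation) form
direct = F(lambda x: y(x) + eps * eta(x),
           lambda x: dy(x) + eps * deta(x)) - F(y, dy)
first_var = integrate(lambda x: 2 * dy(x) * eps * deta(x) - w(x) * eps * eta(x))
print(abs(direct - first_var))  # of order eps**2
```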

The basic problem of variational calculus

Let \mathcal F: \mathfrak F \to \mathbb R, i.e. a real-valued functional, then the basic problem of variational calculus is expressed as finding y^* \in \mathfrak F for which \mathcal F attains a minimum (or -\mathcal F attains a maximum), i.e.

    \[y^* :=\arg \min_{y\in \mathfrak F}(\mathcal F[y]) = \arg \max_{y\in \mathfrak F}(-\mathcal F[y]) \]

To solve this problem, a necessary condition can be established. Let y^* be a minimum of \mathcal F and v\in \mathfrak F be another function. Perturbing y^* as y^*+\varepsilon v, we can write

    \[\mathcal F[y^*+\varepsilon v] \ge \mathcal F[y^*] \ \forall v\in\mathfrak F\]

For any fixed v, the term \mathcal F[y^*+\varepsilon v] becomes a function of \varepsilon. This function has a minimum at \varepsilon = 0, because \mathcal F[y^*+0\cdot v]=\mathcal F[y^*]\le \mathcal F[y^*+\varepsilon v]. This means

    \[\frac{\mathrm d \mathcal F[y^*+\varepsilon v]}{\mathrm d \varepsilon} \bigg|_{\varepsilon=0}=0 \iff \frac{\mathrm d \mathcal F[y^*+\varepsilon v]}{\mathrm d \varepsilon} \bigg|_{\varepsilon=0} \varepsilon =0 \ \forall \varepsilon \in \mathbb R\]


Because v is an arbitrary function and \varepsilon v=:\delta y^* is a variation of y^* (by definition), we can write,

Proposition: If a functional \mathcal F[y] attains a minimum at y^* \in \mathfrak F, then the variation of \mathcal F at y^* is zero for all variations of y^*. In notations,

    \[ y^* =\arg \min_{y\in \mathfrak F}(\mathcal F[y]) \implies \delta \mathcal F[y^*]=0 \ \forall \delta y^*\]

Definition: For a functional \mathcal F[y], a function y_s such that \delta \mathcal F[y_s]=0 is called a stationary point of the functional. Here \delta \mathcal F[y_s]=0 means,

    \[\delta \mathcal F[y_s]=\frac{\mathrm d \mathcal F[y_s+\delta y_s]}{\mathrm d \varepsilon} \bigg|_{\varepsilon=0} \varepsilon =0 \ \forall \varepsilon \in \mathbb R \text{ and }\delta y_s\in\mathfrak F \]

The proposition says that if y^* minimizes a functional, then it is also a stationary point.
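
A concrete illustration of the proposition (a sketch, not from the text): taking for granted from standard variational calculus, which is not derived in these notes, that y^*(x)=x(1-x)/4 makes \mathcal F[y]=\int_0^1 (y'^2 - y)\,\mathrm dx stationary among functions with y(0)=y(1)=0, its first variation is numerically zero for test directions vanishing at the end points.

```python
import math

def integrate(g, a=0.0, b=1.0, n=4000):
    # midpoint rule on [a, b]
    h = (b - a) / n
    return sum(g(a + (k + 0.5) * h) for k in range(n)) * h

# Candidate stationary point of F[y] = int_0^1 (y'^2 - y) dx with y(0)=y(1)=0:
# y*(x) = x(1 - x)/4, so y*'(x) = (1 - 2x)/4  (assumed known here).
dystar = lambda x: (1 - 2 * x) / 4

def first_variation(eta, deta):
    # delta F[y*] = int (2 y*' eta' - eta) dx  (taking eps = 1; delta F is linear in eps)
    return integrate(lambda x: 2 * dystar(x) * deta(x) - eta(x))

# Two test directions vanishing at the end points
v1 = first_variation(lambda x: math.sin(math.pi * x),
                     lambda x: math.pi * math.cos(math.pi * x))
v2 = first_variation(lambda x: x * (1 - x),
                     lambda x: 1 - 2 * x)
print(v1, v2)  # both numerically close to 0
```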

Normality check


Definition: Quantiles are cut-off points partitioning the domain of a probability distribution (the support/range of the corresponding random variable) into intervals with equal probabilities. The probability distribution can be in the form of a pmf or a pdf. In the case of a sample (observed data), which is discrete, quantiles are defined similarly; however, the probability is defined based on the relative frequency of occurrence.

Definition: q-quantiles partition the domain of a distribution, or the sorted (domain of) data, into q\in \mathbb N parts with equal probabilities. For q-quantiles, there are q-1 cut-off points. The probability of each part is 1/q because the range of the probability measure (set function) is [0,1].

Example: The 4-quantiles of a normal distribution use 3 points to partition the domain of the distribution as (-\infty,Q_1], (Q_1, Q_2], (Q_2,Q_3], (Q_3, +\infty) where P(-\infty,Q_1]=P(Q_1, Q_2]=P(Q_2,Q_3]=P(Q_3, +\infty)=1/4.

In the above example, Q_1, Q_2 and Q_3 are the first, second, and the third 4-quantiles.

Definition [quantiles of a population]: Let i\in \mathbb N. The i-th q-quantile of a distribution of a random variable X is defined as x_i such that

    \[P[X\le x_i]=F_X(x_i)\ge i/q \text{ and } P[X<x_i]\le i/q\]

or equivalently

    \[x_i = \inf \{x\in \mathbb R: F_X(x) \ge i/q\}\]

The above definition uses two conditions, or equivalently the infimum, to make sure the quantity is well defined for non-continuous distributions.
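
For a non-continuous distribution, the infimum form can be sketched directly in code; a fair die is used as the example, and support, pmf, and quantile are illustrative names, not a standard API.

```python
# i-th q-quantile via x_i = inf{x : F_X(x) >= i/q}, for a fair die
support = [1, 2, 3, 4, 5, 6]
pmf = [1 / 6] * 6

def quantile(i, q):
    cdf = 0.0
    for x, p in zip(support, pmf):
        cdf += p                      # running value of F_X(x)
        if cdf >= i / q - 1e-12:      # tolerance for floating-point sums
            return x

print(quantile(1, 2))  # median: 3, since F_X(3) = 1/2 >= 1/2
print(quantile(1, 4))  # first 4-quantile: 2, since F_X(2) = 1/3 >= 1/4
```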

Remark: The sequence of quantile values is non-decreasing, since the corresponding intervals of the domain of the distribution are ordered.

Definition: p-quantiles are the same as q-quantiles but with p\in [0,1]\subset\mathbb R. The mathematical definition is the same as for q-quantiles, with i/q replaced by the real number p.

Example: The 0.95-quantile of the length of some object is 4 cm. The probability of the length (the random variable here) being less than or equal to 4 cm is 0.95. In other words, in the long run, or with the frequency interpretation of probability, 95% of the measurements/data of the length are below or equal to 4 cm.

Example: The only 2-quantile is called the median. So, the median partitions the support of a random variable (i.e. the domain of its distribution) into two subsets having the same probability of 0.5.

Remark: Quantiles partition the area under the pdf graph of a distribution into regions with equal area.

Quantiles of a sample/data: Let s_1, \dots, s_N be a finite sequence of observed values of a random variable (iid random variables constituting the sample). If the values are sorted in ascending order and have equal probability, i.e. P(\{s_i\})=1/N, then the index of the i-th q-quantile value is calculated as,

    \[I_p\frac{1}{N} = i/q \implies I_p= N\frac{i}{q}\]

If I_p is not an integer, then round up to the next integer to get the appropriate index. Therefore, the i-th q-quantile value is s_{I_p}.
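
The index rule above can be sketched as follows; sample_quantile is an illustrative helper name, not a standard API.

```python
import math

def sample_quantile(sorted_data, i, q):
    """Value of the i-th q-quantile: index I_p = N*i/q, rounded up
    to the next integer when fractional (1-based, as in the text)."""
    N = len(sorted_data)
    I_p = math.ceil(N * i / q)
    return sorted_data[I_p - 1]   # shift to Python's 0-based indexing

data = sorted([7, 1, 5, 3, 9, 2, 8, 4, 6, 10])  # N = 10, sorted ascending
print(sample_quantile(data, 1, 2))  # I_p = ceil(5.0) = 5 -> 5th value = 5
print(sample_quantile(data, 1, 4))  # I_p = ceil(2.5) = 3 -> 3rd value = 3
```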

Q-Q Plot

Lemma: Let X and Y be two random variables on the same probability space. If Y=aX+b with a>0, i.e. an increasing linear transformation, then the i-th q-quantiles in the supports (ranges) of X and Y are related as

    \[y_i = ax_i+b\]


    \[\begin{split}& F_Y(y)=P[Y\le y]=P[aX+b\le y]=P[X\le \frac{y-b}{a}]=F_X(\frac{y-b}{a}) \\& \text{ if } F_X(x_i)=F_Y(y_i)=\frac{i}{q} \implies F_X(x_i)=F_X(\frac{y_i-b}{a}) \\& \therefore ax_i + b = y_i\end{split}\]

The converse of this lemma also holds and can be readily proved: X and Y are related as Y=aX+b if, for all i-th q-quantiles x_i and y_i of their distributions, y_i = ax_i+b.
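
The lemma can be checked empirically with numpy; the sample sizes, seed, and the choice a=3, b=2 are mine, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)
y = 3.0 * x + 2.0                     # Y = aX + b with a = 3 > 0, b = 2

probs = [0.25, 0.50, 0.75]            # quartile levels i/q
qx = np.quantile(x, probs)
qy = np.quantile(y, probs)
# quantiles follow the same linear relation: y_i = a*x_i + b
print(np.max(np.abs(qy - (3.0 * qx + 2.0))))  # essentially 0
```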

Q-Q plot: if the values of q-quantiles of the distributions of two random variables are plotted against each other in a Cartesian coordinate system and the graph shows a linear relation/function, then the distributions are the same up to a linear transformation of the random variables.

Finding out whether a sample of a random variable, i.e. observed data, comes from a particular theoretical distribution works as follows.

1- The discrete values \{v_i,\ i=1,\cdots, N\} are sorted in ascending order and each value is regarded as a quantile cut-off point with P(X\le v_i)=i/N (since P(X=v_i)=1/N). Note that as the values get larger the quantiles do as well, since the sequence of quantiles is increasing and the sample values are sorted ascendingly.

2- The corresponding quantiles of the theoretical distribution are found as \{q_i: P(Y\le q_i) = i/N,\ i=1,\cdots,N\}.

3- The scatter plot of the pairs \{(q_i,v_i): i=1,\cdots, N\} is drawn.

4- Using a regression line, check whether the graph follows a straight line. If yes, the sample is from the same distribution as the theoretical one, up to a linear transformation.

Sometimes, as with a normal distribution, the domain of the distribution is unbounded. Then the last quantile, corresponding to the 100th percentile (P(Y\le q_N) = N/N=1), would be infinite. In this case, the quantiles of the theoretical distribution are found according to \{q_i: P(Y\le q_i) = (i-0.5)/N,\ i=1,\cdots,N\}. There are also other formulas in this regard.
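
The (i-0.5)/N recipe can be sketched manually with scipy's standard-normal inverse CDF (norm.ppf); the synthetic sample, seed, and chosen loc/scale are mine, for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
v = np.sort(rng.normal(loc=5.0, scale=2.0, size=200))  # sorted sample, N = 200
N = len(v)

# Theoretical standard-normal quantile levels (i - 0.5)/N, i = 1..N,
# avoiding the infinite quantile at probability N/N = 1
levels = (np.arange(1, N + 1) - 0.5) / N
q = norm.ppf(levels)

# If the sample is normal, the pairs (q_i, v_i) fall close to a line whose
# slope/intercept estimate the scale/location of the sample
slope, intercept = np.polyfit(q, v, 1)
print(slope, intercept)  # close to 2 and 5
```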

Normality check can be done using a Q-Q plot assuming a normal distribution for the theoretical distribution. The following is how to do it by Python.

import statsmodels.api as sm

# data: 1-D np.array of observations
# fit=True standardizes the data by estimated location/scale; line='r' adds a regression line
fig = sm.qqplot(data, fit=True, line='r')

Geometry and Euclidean space


1- Axioms of Geometry

Euclidean space, denoted by \mathbb E, is a space that contains the elements of Euclidean geometry and satisfies the axioms (postulates) of Euclidean geometry. A geometry or a geometric system is an axiomatic system, being a collection of the following entities.

  1. Undefined/abstract terms or primitives: These are abstract elements that can be interpreted based on context.
  2. Defined terms (if necessary to have): Terms that are defined using the primitives.
  3. Axioms: The statements that are accepted without proof. They set relations within and between the primitives.
  4. A system of logic.
  5. Theorems: The statements that can be proved using the axioms and the system of logic.

A mathematical model/representation can fit a geometry/geometric system. A model contains elements that are explicit interpretations of the undefined/abstract terms of the geometry and are compatible with the geometric axioms. One model for Euclidean geometry is to define a point as (the imagination of) an exact location, which has no dimension/size, in the space. A line is defined as a straight line: an infinitely long object in two (opposite) directions that has no width or thickness and uniquely exists (or can be defined/constructed) between any two points. In the material world we can approximately visualize points and lines according to what we see. In other words, we model physical objects, or fit mathematical models to them. For example, we can consider a point as a relatively small physical dot, a computer pixel, or a light spot (modeling a physical dot by the notion of a point defined as a location in space). The infinite trajectory of any object (light, the trace of a pen, a long edge/ridge, etc.) whose length, i.e. its longest dimension, follows the definition of the straight line can be considered/imagined as a straight line.

There are two main axiomatic systems for Euclidean geometry.

1- Hilbert’s Postulates.

The primitives are sets of points, lines, and planes. These are not generally or necessarily subsets of each other; for example, a line need not be considered as a set of points. Any physical object or non-physical notion that satisfies the axioms of Hilbert’s system can be recognized or interpreted as the primitives. Notions like “a point lying on a line” are then defined.

In Hilbert’s system, the notions of a ray, a line segment, a vector as a directed line segment, an angle, and polygons are also defined based on the primitives and axioms. Hilbert’s system is purely geometrical, in that nothing is postulated concerning numbers and arithmetic. Although this system has notions for comparing line segments and comparing angles with each other, it does not have a metric for the distance between two points or a measure of angle. If the axioms of real numbers are considered along with the continuity axiom of Hilbert’s system [1], then it can be proved that there is a bijective map between points on a line and real numbers. Thereby, a distance function between two points and a measure of length can be defined for line segments. A measure for angles can also be constructed. These measures inject numbers and arithmetic into the Euclidean geometry founded on Hilbert’s axioms.

2- Birkhoff’s postulates (axioms of a metric geometry)

Birkhoff’s axioms [2] of Euclidean geometry directly include an axiom on the existence of a map between real numbers and points on a line. This brings real numbers and arithmetic into the system, quantifies notions, and facilitates proofs, because the axioms of real numbers can be used in relation with geometric terms. Birkhoff’s system is as follows.

Primitive/undefined objects: the abstract geometry of Birkhoff, \mathcal A, consists of a set \mathcal P whose elements are called points, together with a collection \mathcal L of non-empty subsets of \mathcal P, called lines. So, L\in\mathcal L \implies L\subset \mathcal P.

Primitive terms: point, line, coordinate function of a line, half-line, bundle of half lines (BHLs), and coordinate functions of BHLs.

Note that a line should not be assumed to be a straight line. It can be interpreted as a straight line, but is not limited to that.

1- Axiom on lines: If A and B are two distinct points, then there exists one and only one line containing A and B. I.e., if A,B\in \mathcal P and A\ne B, then \exists! L\in\mathcal L such that A,B\in L.

Definition: A set of points is said to be collinear if this set is a subset of a line. Two sets are collinear if the union of these two sets is collinear.

2- Axiom on coordinate function of a line: There exists associated with each line L, a nonempty class X of one-to-one mappings x of L onto the field \mathbb R of real numbers. If x\in X and if x^*:L\to \mathbb R is any one-to-one mapping, then x^*\in X if and only if for all A,B\in L,

    \[|x(A)-x(B)| = |x^*(A)-x^*(B)|\]

For a line L, the elements of X are called coordinate functions or ruler of L.

The above axiom is called the ruler placement axiom and indicates that the different maps x:L\to \mathbb R are like rulers along a line. It does not matter where a ruler is placed; all rulers give the same value of |x(A)-x(B)| for two fixed points A and B. This guarantees that the members of X are well-defined.
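
In coordinates, it can be shown from this axiom that any two rulers of the same line differ by x^*(A) = \pm x(A) + c, and all of them report the same distances. A small numeric check of this (reruler is a hypothetical helper, and the coordinates are made up):

```python
# Coordinates of three points A, B, C under some ruler x of a line
x_coords = {"A": 0.0, "B": 2.5, "C": -1.0}

def reruler(coords, s, c):
    # a new ruler x*(P) = s*x(P) + c with s = +1 or -1
    return {P: s * v + c for P, v in coords.items()}

max_err = 0.0
for s in (+1.0, -1.0):                # shifted and flipped rulers
    xs = reruler(x_coords, s, 7.3)
    for P in x_coords:
        for Q in x_coords:
            d_old = abs(x_coords[P] - x_coords[Q])
            d_new = abs(xs[P] - xs[Q])
            max_err = max(max_err, abs(d_old - d_new))
print(max_err)  # floating-point noise only
```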

As a result of the axiom, a line (and later a ray and a segment) has uncountably many points.

Precedence relation: Because a coordinate function is one-to-one, any sequence of points is mapped to a monotonic sequence of real numbers. Therefore, we can define the precedence relation between two different points A,B of a line as A\prec B \iff x(A) < x(B).

Lemma: For any two points A,B on a line with a fixed ruler, either A \prec B or B \prec A.

Lemma: Fixing the precedence relation for two points of a line determines a class of rulers for the line such that the precedence relation (for all points) is the same for all of these rulers.

Definition: A precedence relation on a line can define a direction for the line. Let A\prec B. Then we can say the direction, or the sense of direction, of the line is from the point A to the point B. Similarly, the direction of the line is from B to A if B\prec A. Therefore, two different directions can be defined for a line. The direction of a line can be represented by an arrow, or a directed line segment; the latter is called a geometric vector.

The distance between two points A,B\in L is denoted by |AB| or |BA| and is defined to be the unique non-negative number |x(A)-x(B)|, where x is an arbitrary member of X. As a result, the distance between two arbitrary points is calculated by constructing the line between them and using a coordinate function of that line; by the ruler placement axiom, the result does not depend on which ruler is used.

Betweenness relation: For A,B,C\in L, i.e. on the same line, the point B is between the points A and C if either x(A)<x(B)<x(C) or x(C)<x(B)<x(A). It can be proved that this relation is independent of the coordinate function. As a result, if B is between A and C, then |AB|+|BC|=|AC|.

If O and A are two distinct points of a line, we call the set of points P on that line such that O is not between A and P a half-line or a ray with end-point O. In speaking of a ray (O,A), the first point O always represents the end point. Two rays are collinear if their points belong to the same line.

For A,B\in L, the set of points containing A, B, and all points between them is called a segment of L. A segment without its end points is called an interval. A segment is denoted as \overline{AB} and its length is |AB|.

Definition: Two line segments \overline{AB} and \overline{CD}, of the same line or of two different lines, are called congruent if |AB|=|CD|. This binary relation is denoted as \overline{AB} \cong \overline{CD}.

3- Axioms on bundles and coordinates of the rays of bundles

Definitions: Certain subclasses (subsets) of the class (set) of all rays with the same end point are called bundles. The common end point O of the rays of a bundle is called the vertex of the bundle; a bundle with vertex O is denoted as B_o. An angle is an unordered couple of rays with the same end point O, which is called the vertex of the angle; the rays are called the sides of the angle. An angle is straight if its sides are distinct and collinear (on the same line, i.e. subsets of the same line).

The word certain in the above definition is important. It says that only specific subsets of the set of all rays with the same end point/vertex are categorized as bundles. Firstly, this means different bundles can share the same vertex. Secondly, by the following axiom on the coordinate functions of bundles, and later the definition of a plane, it will be clear that the rays of a bundle must all belong to the same plane.

Axiom on bundles: If a and b are two non-collinear rays with the same endpoint O, then there exists one and only one bundle B_o containing these rays. If they are collinear then they (their points) can belong to other bundles with different vertices.

Axiom on the coordinate functions of bundles (protractor axiom): There exists, associated with each bundle B_o, a nonempty class \Phi of one-to-one mappings \phi of B_o onto the equivalence classes of real numbers modulo 2\pi (i.e. x=y \text{ mod } 2\pi \iff x = y + 2k\pi,\ k\in \mathbb Z). If \phi is a member of \Phi and if \phi^* is any one-to-one mapping of B_o onto the equivalence classes of real numbers modulo 2\pi, then \phi^* is a member of \Phi if and only if, for all l,m\in B_o, |\phi(l)-\phi(m)| = |\phi^*(l)-\phi^*(m)| modulo 2\pi. The elements of \Phi are called the coordinate functions of B_o. If x\in \mathbb R, then [x] denotes the equivalence class modulo 2\pi containing x, and \bar x is the real number of the class [x] such that 0\le \bar x < 2\pi.

Measure of an angle: if l,m\in B_o, A\in l and B\in m, then the measure of the angle AOB is denoted by \angle AOB and defined as the minimum of \overline{\phi(l)-\phi(m)} and \overline{\phi(m)-\phi(l)}, where \phi is a coordinate function associated with the bundle B_o. The number \angle AOB is independent of the coordinate function, and it can be proved that the measure of an angle is independent of the bundle it belongs to.

Continuity axiom: If B_o is a bundle of vertex O, and if A, B are distinct nonvertex points of noncollinear rays of the bundle, then to every point P on the segment AB, there exists a ray OC of B_o containing P such that [\angle AOP + \angle POB]=[\angle AOB]. Conversely if a ray OC of the bundle B_o is such that [\angle AOC+ \angle COB] = [\angle AOB] then there exist a point P belonging simultaneously to the ray OC and to the segment AB.

Theorem: The measure of an angle is \pi if and only if this angle is straight.

Theorem: If m and n are two noncollinear rays of a bundle B_o, then there exists one and only one coordinate function \phi such that \bar \phi (m) = 0 and \bar \phi(n)<\pi. This theorem says that angles can be measured intuitively in the usual way, by applying a protractor (plain or half-disk) so that one side of the angle coincides with the zero of the protractor and the other side corresponds to a number less than \pi (the measure of the angle).

Corollary: If OA, OB and OC are three distinct rays of a bundle and if OA and OB are collinear, then \angle AOC + \angle COB =\pi.

Corollary: If l is a ray of a bundle B_o and if 0<a<\pi, then there exist two and only two distinct rays m and n of the bundle such that \angle lm = \angle ln = a.

Lemma: Let l be an element of a bundle B_o and let 0<\alpha, \beta<\pi. If m\in B_o is such that \angle lm = \alpha, then there exists a ray n\in B_o with the following properties: (a) \angle ln = \beta, (b) for all points A \in m and for all points B \in n such that A\ne B, the segment AB has a point P in common with the unique line containing the ray l.

Two distinct lines having a point in common determine six angles. Two of them have \pi for their measure, and the four remaining ones form two sets, each set consisting of two distinct angles with the same measure.

Definition: Two distinct lines having a point in common are said to be perpendicular if the four angles with measures not equal to 0 or \pi have the same measure i.e., \pi/2.

Lemma: If l is a line and P is a point not on l, then there exists one and only one line containing P and perpendicular to l.

Definition: A plane is defined to be the set of all points belonging to the rays of a bundle B_o; this set will be denoted by \{B_o\}.

Theorem: If two distinct points of a line are in a plane, then the whole line is in the plane.

Theorem: A plane is uniquely defined by 3 non-collinear points. In other words, two planes coincide if and only if they have three non-collinear points in common.

Theorem: If two planes have two collinear points in common, they have a line in common.

Definition: If two lines of a plane have no point in common, then they are parallel.

Theorem: In a plane, from a given point not on a given line there exists one and only one perpendicular to that line, and from a given point not on a given line, there exists one and only one parallel to that line.

4- Axiom and theorems on triangles

A triangle is an unordered set of three distinct points. The points are the vertices of the triangle. The three segments defined by the vertices of a triangle are the sides of the triangle. The three angles defined by the sides of a triangle are the angles of the triangle. A triangle is assumed to be proper, meaning that the vertices are non-collinear. In the context of triangles, for instance a triangle ABC, the measure of an angle, say the angle ABC with vertex B, will be denoted \angle B. Two triangles are similar if the vertices can be labelled A, B, C and A', B', C' in such a way that AB/A'B'=BC/B'C' = CA/C'A' AND \angle A=\angle A', \angle B = \angle B', \angle C =\angle C'. The notation / denotes the ratio of lengths.

Axiom of similarity: If two triangles ABC and A'B'C' are such that AB/A'B'=BC/B'C' and \angle B = \angle  B', then they are similar.

Birkhoff wrote the above statement as an axiom. But, I think it can be proved; see Ref 1. Based on the above axiom, several theorems on the similarity of triangles can be obtained. They are skipped here; see Ref 2.

Two important theorems on triangles are as follows.

Theorem (Euclidean geometry): The sum of the measures of the angles of a triangle is equal to \pi.

Theorem: If a triangle ABC is such that \angle  A = \pi/2, then (AB)^2+ (AC)^2 = (BC)^2 (proof based on similarity of triangles).

5- Axiom of 3-dimensional Euclidean space: There exists a point not on a given plane.

Geometric forms: A geometric form in a geometric space is a subset of points satisfying a certain condition. This subset with its condition is also called a locus of points. For example, lines, triangles, and planes are geometric forms. Other geometric forms are circles, polygons, or any imaginable form. As an example, a Euclidean circle is a set of points in a Euclidean plane that all have the same distance from a fixed point. A sphere is a set of points in a Euclidean 3D space that all have the same distance from a fixed point.

2- Some definitions and theorems

2-1 general stuff

Definition: Let a straight line l intersect a plane \alpha at a point O. The line is said to be perpendicular to the plane if it is perpendicular to all straight lines lying in the plane and passing through the point O.

Theorem: A line intersecting a plane at a point O is perpendicular to this plane if and only if it is perpendicular to some two distinct straight lines lying in the plane and passing through the point O (for a proof see [1] page 84).

2-2 Congruent transformation (translation)*

* Congruent transformation is referred to as congruent translation in Ref 1.

Definition: Let a,b be two (different) lines. Then, a mapping f:a\to b is called a congruent transformation of the straight line a to the straight line b if for any two points X and Y on the line a the condition of congruence \overline{f(X)f(Y)} \cong \overline{XY} is fulfilled. A congruent transformation can also be composed of a general rotation and a pure translation, as explained on page 134 of Ref 1.

Assume two lines with particular directions and points as shown. We can define two mappings f^+ and f^- performing congruent transformation of the first line (black) to the second line (blue) as

    \[ f^+(O) = Q\quad f^+(E_+)=F_+ \quad\text{ and } \quad f^-(O) = Q\quad f^-(E_+)=F_-\]

Theorem: For any point O on a straight line a with a distinguished direction and for any point Q on another straight line b with a distinguished direction there are exactly two mappings f: a\to b performing congruent transformation of the line a to the line b. The first of them, {f^+}_{OQ}, preserves the precedence of points, i.e. X \prec Y implies {f^+}_{OQ}(X) \prec {f^+}_{OQ}(Y). The second mapping, {f^-}_{OQ}, inverts the precedence of points, i.e. X \prec Y implies {f^-}_{OQ}(Y) \prec {f^-}_{OQ}(X).

2-2-1 Translation and inversion

Let a be a line and two points O,Q \in a forming a vector \vec {OQ}. Fixing a direction for the line using a vector \vec {AB}, we can define a mapping \rho_{OQ}:a \to a as \rho_{OQ}:= {f^+}_{OQ}. This mapping is called congruent translation by the vector \vec {OQ}. The vector \vec {AB} determines the order of points on the line. The mapping preserves the precedence of the points and is uniquely defined by its action of mapping the initial point O of the vector \vec {OQ} to its terminal point Q. Therefore, it maps any point X on the line to another point Y on the line such that X and Y construct a vector \vec {XY} and \overline{OQ} \cong \overline{XY}.

The proof is as follows. Given that \rho_{OQ}(O)=Q, and an arbitrary point X\in a such that \rho_{OQ}(X)=Y, we can write \overline{OX} \cong \overline{QY}. Without loss of generality we can assume O\prec X and therefore Q \prec Y. Now assume that Q\prec X. Therefore, \overline{OX} = \overline{OQ} \cup \overline{QX}. Since \overline{OX} \cong \overline{QY}, we should have X\prec Y. Otherwise, Y is between Q and X and hence \overline{OX} = \overline{OQ} \cup \overline{QY} \cup \overline{YX}, contradicting \overline{OX} \cong \overline{QY}. Now, we can write \overline{QY} = \overline{QX} \cup \overline{XY}. Hence,

    \[\overline{OQ} \cup \overline{QX}  \cong \overline{QX} \cup \overline{XY} \implies \overline{OQ} \cong  \overline{XY}\]

which completes the proof.

This mapping is called congruent translation by a vector \vec{OQ} because it models the translation of a particle that moves from a point O to a point Q along a line.

Lemma: Because \rho_{OQ} preserves the precedence on the line, it preserves the initial and terminal points marked on a vector. Note that the direction of a line is set independently; the initial and terminal points of a vector on a line do not indicate their precedence.

Lemma: If O=Q, then \rho_{OO} is the identity map.

Remark: Later, when we define the affine space, any map that takes a point X, adds a fixed free vector to it, and returns another point is a congruent translation along the line passing through the point X and defined (including its direction) by the free vector.

If we let \eta_{OQ}:= {f^-}_{OQ}, then this map does not preserve the precedence and is called inversion.

3- Geometric vectors and Euclidean geometric vector space

A geometric vector is a directed line segment (abstract) in the sense that one of the segment's endpoints is distinguished/marked with respect to the other. A line segment can therefore define two distinct geometric vectors. For a line segment AB, its two vectors are denoted as \vec{AB} and \vec{BA}. The first and last points of a vector are respectively called the initial and the terminal points.

Definition: The/a zero vector is a vector whose initial and terminal points coincide.

Note: Saying that a vector belongs to or lies on a line or a plane means its end points belong to/lie on the line/plane.

Definition [EV1]: Two vectors \vec {AB} and \vec {CD} lying on one line are called equal if they are co-directed and if the segment AB is congruent (pure coincident if translated) to the segment CD.

There can be 3 types of vectors:

1- Position vectors or (pure) geometric vectors: A position vector is a vector whose position is fixed in the space.

2- Sliding vectors: A sliding vector is a vector that is free to slide along its line. In other words, a sliding vector is a class of mutually equivalent vectors in the sense of the definition EV1. A sliding vector has infinitely many representatives lying on a given line. They are called geometric realizations of this sliding vector.

3- Free vectors: The classes of mutually equal vectors in Euclidean geometry space are called free vectors. Geometric vectors composing a class (of free vectors) are called geometric realizations of a free vector. The concept of equal vectors in space is as follows.

Definition: Two vectors \vec {AB} and \vec {CD} NOT lying on one line are called codirected if they lie on parallel lines, and the segment BD connecting their end points does not intersect the segment AC connecting their initial points.
Definition: Two vectors \vec{AB} and \vec{CD} in the space are called equal if they are codirected and if the segment AB is congruent to the segment CD. The congruency of straight line segments can be inferred from their lengths. The equality of vectors in the space is reflexive, symmetric, and transitive.

Note that the concept of parallel lines is defined in a plane, i.e. two different lines are parallel if they lie in one plane and have no intersection.

For any free vector \vec {a} and for any point A there is a geometric realization \vec {AB} of \vec {a} with the initial point A.

Free vectors, lines and planes: Realizing a free vector with a particular initial point produces a directed line segment (with initial and terminal points) and hence the corresponding line containing the point and the vector. Realizing two non-collinear vectors with the same initial point defines a plane containing the point and the two vectors.

Euclidean geometric vector space: Generally a vector is not limited to geometric vectors. A vector is in fact a member of a vector space. Geometric vectors defined as directed line segments also belong to a vector space. A vector space is a set on which an addition operation and multiplication by a scalar (from \mathbb R or any other field) are defined and the set is closed under these operations (some other axioms also exist in the definition of a vector space). Geometric vectors belong to a space called the Euclidean geometric vector space, denoted as \mathbb E.

Remark: Since \mathbb E is closed under these operations, its members are free vectors, i.e. the classes of mutually equal vectors. This means the space is not restricted to a set of realizations of geometric vectors.

Remark: When working with \mathbb E, i.e. the space of free geometric vectors, arbitrary realizations of the vectors in the geometric space are used. In this regard, they are treated as directed line segments. Also, we have all the properties of the geometric space (points, lines, planes, axioms, theorems, etc). So, we can develop theorems and show that they are independent of a particular realization.

In the geometric vector space, multiplication by a positive scalar scales the length of the vector (line segment) in the same direction. If the scalar is negative, the direction of the vector is reversed and the length is then scaled. To add vectors \vec{AB} and \vec{CD}, treat them as free vectors and translate \vec{CD} so that its initial point C coincides with the terminal point B of \vec{AB}; if D denotes the translated terminal point, the vector \vec{AD} is the result of \vec{AB} + \vec{CD}.
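In coordinates, the scaling and head-to-tail addition just described reduce to component-wise operations; a minimal sketch (an orthonormal coordinate representation is assumed, which the text introduces only later):

```python
import numpy as np

# Coordinates of two free vectors (assumed orthonormal frame).
AB = np.array([1.0, 2.0, 0.0])
CD = np.array([3.0, -1.0, 0.0])

# Head-to-tail addition reduces to component-wise addition of coordinates.
AD = AB + CD

# A negative scalar reverses the direction and scales the length.
scaled = -2.0 * AB
assert np.isclose(np.linalg.norm(scaled), 2.0 * np.linalg.norm(AB))
assert np.allclose(scaled / np.linalg.norm(scaled),
                   -AB / np.linalg.norm(AB))  # opposite unit vectors
```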

Like any other vector space, the Euclidean geometric vector space has bases, and hence vectors can be written in terms of the basis vectors. This generates coordinates for the vectors. The dimension of the Euclidean geometric vector space is at most 3; the three-dimensional physical space is denoted as \mathbb E^3.

Other than properties and operations regarding vectors, the definition of angle also exists in the Euclidean geometric vector space. The angle between two vectors is the angle between their lines when both vectors are treated as free vectors and their geometric realizations share the same initial point.

3- Conditionally geometric vectors:

If we consider the locations in the physical space as the notion of points and construct the Euclidean geometry, then a vector from a spatial point to another is called a displacement vector. This vector binds some two points of the geometric space. The length of this vector is a geometric length defined based on the ruler placement axiom. The unit of the length is defined by a length scale like the meter.

A displacement vector is a geometric vector as it is in the geometric space. A geometric vector has a direction and a relative orientation/angle (with other geometric objects). Some physical phenomena like velocity, acceleration, force, etc. also have the sense of geometric direction and geometric orientation in the physical space. Moreover, they are bound to some points in the space, i.e. the physical effect is bound or recognized at a point. For example we say "force at a point/location" or "velocity of an object at a point/location". Therefore, in the physical space recognized as a model of the Euclidean geometric space, the bounding point, orientation, and direction of this type of vector have geometric representations, i.e. a point, an angle, a sense of direction. However, these vectors do not have a geometric length; moreover, their bounding point can be recognized as either the initial or the terminal point of a vector. Therefore, there is no direct geometric representation of such a vector unless a geometric length is assumed through choosing some scale factor. Any non-geometric vector that can have a geometric representation upon setting a geometric orientation, direction, and length is called a conditionally geometric vector.

3-1 Coordinates of a vector in the spatial Euclidean geometric vector space

Let \vec{u}, \vec{v}, \vec{w} be an ordered set (sequence) of linearly independent geometric (free) vectors in \mathbb E^3, i.e. they are not all co-planar. Then, they form a basis for the space and any vector \vec{r} is a linear combination of them, i.e. \vec{r}=x\vec{u} +y\vec{v}+z\vec{w}. The scalars [x,y,z] are called the coordinates of \vec{r}. For any basis \mathcal B, there is a unique coordinate function defined as [\cdot]_{\mathcal B}:\mathbb E^3 \to \mathbb R^3.
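Concretely, the coordinate function amounts to solving a linear system whose columns are the basis vectors; a sketch with an assumed skewed basis:

```python
import numpy as np

# A skewed (non-orthogonal) basis u, v, w for E^3, stored as matrix columns.
u = np.array([1.0, 0.0, 0.0])
v = np.array([1.0, 1.0, 0.0])
w = np.array([0.0, 1.0, 1.0])
B = np.column_stack([u, v, w])
assert abs(np.linalg.det(B)) > 1e-12  # non-coplanar, hence a basis

r = np.array([2.0, 3.0, 1.0])
# The coordinates [x, y, z] of r in the basis solve B @ [x, y, z] = r.
xyz = np.linalg.solve(B, r)
assert np.allclose(B @ xyz, r)
```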


3-2 Dot/scalar product

Noting that there is a notion of angle and a definition of the trigonometric functions in \mathbb E^3, we can define a function, called the dot or scalar product, as \cdot : \mathbb E^3 \times \mathbb E^3 \to \mathbb R acting as \vec{a} \cdot \vec{b}:=\|\vec a\|\|\vec b\|\cos (\theta), where \| \vec{v}\| returns the length of the vector, i.e. of the line segment of the vector. As a result, perpendicularity of vectors implies \vec a \cdot \vec b = 0.
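In an orthonormal coordinate frame (an assumption; the coordinate form of the dot product is derived later in the text), the definition agrees with the component-wise product:

```python
import numpy as np

# vec a and vec b enclose a 45-degree angle (illustrative coordinates).
a = np.array([1.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 0.0])

# |a| |b| cos(theta) agrees with the coordinate dot product a @ b.
lhs = np.linalg.norm(a) * np.linalg.norm(b) * np.cos(np.pi / 4)
assert np.isclose(lhs, a @ b)

# Perpendicular vectors have a zero dot product.
assert np.isclose(np.dot([1.0, 0.0, 0.0], [0.0, 5.0, 0.0]), 0.0)
```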

3-2-1 Orthogonal projection onto a line

Theorem: For any nonzero (geometric) vector \vec a and for any vector \vec b, there are two unique vectors \vec b_\parallel and \vec b_\perp such that \vec b_\parallel is collinear to \vec a, and \vec b_\perp is perpendicular to \vec a, and they both satisfy \vec b = \vec b_\parallel +  \vec b_\perp.

The proof is by geometric realization of the vectors, using the theorem that says there is a plane containing \vec a and \vec b, and the theorem that states there is a perpendicular line to another line l from any point not on l.

Definition: For any vectors \vec b and \vec a\ne \vec 0, the mapping \pi_{\vec a}: \vec b \mapsto \vec b_{\parallel} is called the orthogonal projection of \vec b onto the direction of \vec a or onto the line containing (collinear with) \vec a.

Theorem [OP1]: For each vector \vec a \ne \vec 0 and for a vector \vec b:

    \[ \vec b_{\parallel} = \pi_{\vec a}(\vec b) = \frac{\vec b \cdot \vec a}{\|\vec a\|^2}\vec a\]

Proof: To prove that two geometric vectors are equal, we should prove that their lengths are equal and that they are co-directed. The first part can be proved by the definition of the dot product and \|\vec b_{\parallel}\| = \|\vec b\| |\cos \phi| as in the previous figure. The second part is proved by showing that \vec b_{\parallel} is in the same direction as the RHS for 0\le \phi <\pi/2, \phi = \pi/2, and \pi/2 < \phi \le \pi.

Lemma [OP2]: For any \vec a\ne \vec 0 the sum of two vectors collinear to \vec a is a vector collinear to \vec a and the sum of two vectors (each) perpendicular to \vec a is a vector perpendicular to \vec a.

Proof: The first part is straightforward. For the second part, if \vec b and \vec c are parallel the proof is also straightforward, so assume \vec b \nparallel \vec c. Build a realization of \vec b as \vec{OB} at an arbitrary point O and a realization of \vec c as \vec {BC}; since the realizations are not parallel, they construct a plane. The resultant vector \vec {OC} = \vec {OB} + \vec {BC} lies in the plane and is a realization of \vec b + \vec c. Now realize \vec a at O as \vec{OA} and at B as \vec {BD}. Each realization corresponds to a line; since OA\perp OB and BD \perp BC (and OA \parallel BD), we get OA \perp BC. Therefore, OA is perpendicular to two intersecting lines in the plane, and hence to all lines in the plane, including OC. This implies that \vec a = \vec{OA} is perpendicular to \vec b + \vec c = \vec{OC}.

Theorem: \pi_{\vec a} is a linear mapping, i.e. \pi_{\vec a}(\vec b + \vec c) = \pi_{\vec a}(\vec b) + \pi_{\vec a}(\vec c) and \pi_{\vec a}(\alpha \vec b) = \alpha \pi_{\vec a}(\vec b).

Proof: Let \vec d:= \vec b + \vec c. Using the orthogonal projection, we can write, \vec b = \vec b_{\parallel} + \vec b_{\perp}, \vec c = \vec c_{\parallel} + \vec c_{\perp}, and \vec d = \vec d_{\parallel} + \vec d_{\perp} with respect to \vec a. Therefore,

    \[\vec d = \vec b + \vec c = (\vec b_{\parallel} + \vec c_{\parallel}) + (\vec b_{\perp} + \vec c_{\perp})\]

By the lemma (OP2), (\vec b_{\parallel} + \vec c_{\parallel})\parallel \vec a and (\vec b_{\perp} + \vec c_{\perp}) \perp \vec a. Therefore, \vec d_{\parallel} = \vec b_{\parallel} + \vec c_{\parallel} and \vec d_{\perp} = \vec b_{\perp} + \vec c_{\perp}. Using theorem OP1, we can deduce,

    \[\pi_{\vec a}(\vec d) = \pi_{\vec a}(\vec b+\vec c)= \vec d_{\parallel} = \vec b_{\parallel} + \vec c_{\parallel} = \pi_{\vec a}(\vec b) + \pi_{\vec a}(\vec c)\]

The second property can be proved by theorem OP1.
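Theorem OP1 and the linearity just proved can also be checked numerically; a minimal sketch in coordinates:

```python
import numpy as np

def proj(a, b):
    """Orthogonal projection of b onto the direction of a (theorem OP1)."""
    return (b @ a) / (a @ a) * a

a = np.array([2.0, 0.0, 0.0])
b = np.array([1.0, 2.0, 3.0])
c = np.array([-4.0, 5.0, 0.5])

# b splits into components parallel and perpendicular to a.
b_par = proj(a, b)
b_perp = b - b_par
assert np.isclose(b_perp @ a, 0.0)

# Linearity, as proved above.
assert np.allclose(proj(a, b + c), proj(a, b) + proj(a, c))
assert np.allclose(proj(a, 3.0 * b), 3.0 * proj(a, b))
```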

3-2-2 Properties of the dot product (theorem)

1- \vec a\cdot \vec b = \vec b\cdot \vec a
2a- (\alpha \vec a)\cdot \vec b = \alpha (\vec a \cdot \vec b)
2b- \vec a \cdot (\alpha \vec b) = \alpha (\vec a \cdot \vec b)
3a- (\vec a + \vec b)\cdot \vec c = \vec a \cdot \vec c + \vec b \cdot \vec c
3b- \vec a \cdot (\vec b + \vec c) = \vec a \cdot \vec b + \vec a \cdot \vec c
4- \vec a \cdot \vec a = \|\vec a\|^2 \ge 0 and \vec a \cdot \vec a=0 \implies \vec a = \vec 0

Proof: Only the third property will be proved, as the rest can be readily proved using the definition.

If \vec c =\vec 0, then it is a trivial case. If not, we can use theorem OP1 and write

    \[\pi_{\vec c}(\vec a + \vec b) - \pi_{\vec c}(\vec a) -\pi_{\vec c}(\vec b) = \frac{(\vec a + \vec b)\cdot \vec c - \vec a \cdot \vec c - \vec b \cdot \vec c }{\|\vec c\|^2}\vec c\]

By the linearity of the orthogonal projection, the LHS of the above equation is zero; hence the proof follows by setting the RHS equal to zero. The property 3b is then proved by using the first property and 3a, which is already proved.

In terms of type, the dot product is a bilinear map.

3-2-3 Calculation of the scalar product through the coordinates of vectors in a skew-angular basis

Definition: a skew-angular basis (SAB) for \mathbb E^3 is an ordered set of non-coplanar vectors. The angle between each pair is not necessarily a right angle.

Let \vec e_1, \vec e_2, \vec e_3 be a SAB. Also consider two free vectors \vec a and \vec b. Each vector can then be written in terms of the basis vectors as,

    \[\vec a = \sum a_i\vec e_i \quad \text{ and } \vec b = \sum b_i\vec e_i\]

where a_i and b_i are the coordinates of \vec a and \vec b respectively.

Using the Einstein notation and the properties of the dot product, the dot product \vec a \cdot \vec b is written as

    \[\vec a \cdot \vec b = a_ib_j\,(\vec e_i\cdot \vec e_j)\]

The terms g_{ij}:=\vec e_i \cdot \vec e_j do not depend on the vectors \vec a and \vec b, but only on the lengths of the \vec e_i and the angles between them. If the values g_{ij} are collected in a symmetric matrix, the matrix is called the Gram matrix of the basis \vec e_1, \vec e_2, \vec e_3. The above equation can be written using matrix multiplication as

    \[\vec a \cdot \vec b = a^TGb \quad \text{ or } = b^TGa\]

where a^T =[a_1\ a_2\ a_3], G=[g_{ij}] and b = [b_1\ b_2\ b_3]^T. This equation expresses the dot product through the coordinates of vectors in a skew-angular basis.

Proposition: If the basis of the space is an orthonormal basis, i.e. \|\vec e_i\|=1 for every basis vector and \vec e_i\cdot \vec e_j=0 for every pair of distinct basis vectors, then G becomes the identity matrix and \vec a \cdot \vec b = a^Tb = b^Ta.

3-3 Concept of orientation and the cross/vector product

Definition: An ordered triple of non-coplanar vectors \vec a,\vec b, \vec c is called a right triple if, when they are geometrically realized with the same initial point and observed from the end/terminal of the third vector, the shortest rotation from the first vector toward the second vector is seen as a counterclockwise rotation. If the rotation is seen clockwise, then the triple is called a left triple.

Definition: The property of ordered triples of non-coplanar vectors to be right or left is called their orientation.

Definition: The cross/vector product is defined as \times : \mathbb E^3 \times \mathbb E^3 \to \mathbb E^3 acting as \vec{a} \times \vec{b}:=\big (\|\vec a\|\|\vec b\|\sin (\theta) \big ) \hat{n}, where 0 \le \theta \le \pi and \hat n is a unit vector perpendicular to both \vec a and \vec b, pointing in the direction defined by the right-hand rule, i.e. the vectors \vec a, \vec b, \vec a \times \vec b form a right triple.
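In coordinates (a right-handed orthonormal frame is assumed), the defining properties can be checked numerically; the positive-determinant test below is a common algebraic stand-in for the right-triple condition:

```python
import numpy as np

a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])
c = np.cross(a, b)

# c is perpendicular to both factors and |c| = |a| |b| sin(theta).
assert np.isclose(c @ a, 0.0) and np.isclose(c @ b, 0.0)
assert np.isclose(np.linalg.norm(c),
                  np.linalg.norm(a) * np.linalg.norm(b) * np.sin(np.pi / 2))

# (a, b, a x b) is a right triple: the determinant of the matrix with
# these rows is positive.
assert np.linalg.det(np.vstack([a, b, c])) > 0
```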

Definition: The orthogonal complement of a free vector \vec a is the set of all free vectors \vec x perpendicular to \vec a, i.e. \alpha_{\vec a}=\{\vec x: \vec x \perp \vec a\}.

The orthogonal complement \alpha_{\vec a} of a vector \vec a can be visualized as a plane, or as vectors in a plane, if one of the geometric realizations of the vector is taken as \vec {OA}. Indeed, all vectors starting from the initial point O and perpendicular to \vec {OA} have their ending points in (belonging to) the plane \alpha.

In Section 3-2-1, it was shown that if a vector \vec a \ne \vec 0 is given, then any vector \vec b can be written as

    \[\vec b = \vec b_{\parallel} + \vec b_{\perp} = \pi_{\vec a}(\vec b) + \vec b_{\perp}\]

where \vec b_{\parallel} =  \pi_{\vec a}(\vec b) is the orthogonal projection of \vec b onto the direction of \vec a, and \vec b_{\perp} is the component of \vec b perpendicular to \vec a and hence to \vec b_{\parallel}. The vector \vec b_{\perp} is called the orthogonal projection onto a plane perpendicular to \vec a, or the orthogonal projection onto the orthogonal complement of \vec a.

Denoting \vec b_{\perp} as \pi_{\perp \vec a}(\vec b), we can write,

    \[\vec b = \pi_{\vec a}(\vec b) + \pi_{\perp \vec a}(\vec b) \tag{OP1}\]

Theorem [OPP]: The function \pi_{\perp \vec a} is a linear mapping, i.e. \pi_{\perp \vec a}(\vec b + \vec c) = \pi_{\perp \vec a}(\vec b) + \pi_{\perp \vec a}(\vec c) and \pi_{\perp \vec a}(\alpha \vec b)= \alpha \pi_{\perp \vec a}(\vec b).

Proof: Use Eq. OP1 and write (\vec b + \vec c)= \pi_{\vec a}(\vec b + \vec c) + \pi_{\perp \vec a}(\vec b + \vec c) and use the linearity of \pi_{\vec a} and then arrange the terms and use Eq. OP1 again. The same procedure can be done for the second condition.

Rotation about an axis:

Let \vec a, \vec b be two non-zero vectors. Lay the vector \vec b at some arbitrary point B to get the realization \vec {BO}. Then lay the vector \vec a as \vec {OA}. This non-zero vector defines a line OA, which we take as the rotation axis. For this setup, we define a function \theta_{\vec a}^\varphi: \vec{BO} \mapsto \vec {CO}, which for now we call the rotation of \vec {BO} about the axis OA. The rotation function works as follows.

\theta_{\vec a}^\varphi(\vec {BO}) = \vec {BO} \iff B \in OA. Otherwise,

    \[\theta_{\vec a}^\varphi(\vec {BO}) = \vec{CO} \ne \vec {BO} \]

such that:
1- B and C belong to a plane \alpha perpendicular to OA.
2- \|\vec {CO}\| = \|\vec {BO}\|
3- The angle between the vectors \vec {PB} and \vec {PC}, where P is the intersection point of OA and the plane \alpha, equals \varphi; it is a signed angle with respect to the counterclockwise direction.

The vector \vec {CO} is a realization of any vector \vec c equivalent to \vec {CO} by parallel translation. This is also true for \vec {BO} being a realization of \vec b. Moreover, parallel translation of \vec a indicates that the rotation function does not depend on the realization of the rotation-axis vector. Therefore, we can re-define the rotation function over the vector space as:

    \[\theta_{\vec a}^\varphi: \mathbb E^3 \to \mathbb E^3\]

Remark: If \vec b \parallel \vec a, then \theta_{\vec a}^\varphi(\vec b)=\vec b.

Theorem [RF]: The rotation function \theta_{\vec a}^\varphi: \mathbb E^3 \to \mathbb E^3 is a linear mapping.
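The text states this theorem without an explicit formula. Rodrigues' rotation formula is a standard coordinate realization of \theta_{\vec a}^\varphi (an assumption here, not the author's construction), and the linearity can be checked numerically with it:

```python
import numpy as np

def rotate(a, phi, b):
    """Rotate b about the axis along a by angle phi (Rodrigues' formula)."""
    k = a / np.linalg.norm(a)
    return (b * np.cos(phi)
            + np.cross(k, b) * np.sin(phi)
            + k * (k @ b) * (1.0 - np.cos(phi)))

axis = np.array([0.0, 0.0, 1.0])
b = np.array([1.0, 2.0, 0.5])
c = np.array([-0.5, 1.0, 3.0])
phi = 0.7

# Linearity of the rotation map (theorem RF).
assert np.allclose(rotate(axis, phi, b + c),
                   rotate(axis, phi, b) + rotate(axis, phi, c))
assert np.allclose(rotate(axis, phi, 2.5 * b), 2.5 * rotate(axis, phi, b))

# A vector parallel to the axis is left unchanged (the remark above).
assert np.allclose(rotate(axis, phi, axis), axis)
```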

The relation of the cross product with projections and rotations:

For two non-collinear vectors \vec a and \vec b we can write \|\vec c\|=\|\vec a\|\|\vec b\| \sin \varphi and sketch the following realization.

The resultant vector \vec c lies in the plane \alpha perpendicular to \vec b (because \vec c\perp \vec b, there is a unique plane that contains \vec c and is perpendicular to \vec b). The orthogonal projection of \vec a onto the plane is \vec a_{\perp} = \pi_{\perp\vec b} (\vec a), whose length is \|\vec a_{\perp}\| = \|\vec a\|\sin \varphi.

3-3-1 Properties of the cross product (theorem)

2- Affine geometric space

A geometric vector is defined as a directed line segment, which has two endpoints. Here, the final produced object, the vector, belongs to a vector space like \mathbb E^3, but its fundamental components, the endpoints, belong to the geometric space, like \mathcal G. The line segment is implied by the two endpoints because of the axiom of Euclidean geometry stating that there is a unique line through every two distinct points. So, there should be a connection between the geometric space and the vector space. This is established by the concept of an affine space as follows.

Definition: Let \mathbb V be a vector space over the field \mathbb F, and let \mathcal S\ne \emptyset be a set. The addition +:\mathcal S \times \mathbb V \to \mathcal S, written as p + \vec v, is defined for any p\in \mathcal S and \vec v \in \mathbb V with the following conditions:

1- p + \vec 0 = p
2- (p + \vec v) + \vec u = p + (\vec v + \vec u)
3- For any q\in \mathcal S there exists a unique vector \vec v \in \mathbb V such that q=p + \vec v.

Then \mathcal S is called an affine space.

With the above abstract definition, the affine Euclidean geometric space is defined by letting \mathbb V = \mathbb E^3 and \mathcal S = \mathcal G, where the geometric space \mathcal G is inherently a set of points (by axiom). The notion is clear: a point plus a vector takes us to another point; these points become the endpoints of the realization of the geometric vector.
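A minimal sketch of the affine structure, with points located by coordinate triples (a representation the text has not fixed; it is assumed here for illustration):

```python
import numpy as np

class Point:
    """A point of the affine space S, located by an assumed coordinate triple.

    Sketches the axioms: p + 0 = p, (p + v) + u = p + (v + u), and for any
    q there is a unique vector v with q = p + v.
    """
    def __init__(self, coords):
        self.coords = np.asarray(coords, dtype=float)

    def __add__(self, v):      # point + free vector -> point
        return Point(self.coords + v)

    def __sub__(self, other):  # q - p -> the unique connecting vector
        return self.coords - other.coords

p = Point([1.0, 2.0, 3.0])
v = np.array([0.5, -1.0, 2.0])
q = p + v

assert np.allclose(q - p, v)                            # uniqueness condition
assert np.allclose((p + np.zeros(3)).coords, p.coords)  # p + 0 = p
```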

3- Models of Euclidean geometry

One model for Euclidean geometry is to define a point as (the imagination of) an exact location which has no dimension/size in the space. A line is defined as a straight line: an infinitely long object in any (two) directions that has no width or thickness and uniquely exists (or can be defined/constructed) between any two points. A flat surface that is infinitely wide and has zero thickness is a model of a plane. A flat surface is an object in which any two of its points define a line segment completely contained in the surface. This model of geometry can be referred to as the physical space geometry.

Another model for Euclidean geometry is as follows. Let S=\mathbb R^2 =\{(x,y)| x,y\in \mathbb R\} be the set of points, i.e. a point is now defined as an ordered pair in \mathbb R^2. Define lines as: 1- a vertical line \mathbb R^2 \supset L_a:=\{ (x,y)\in \mathbb R^2 |x=a\}, 2- a non-vertical line as \mathbb R^2 \supset L_{m,b}:=\{(x,y)\in \mathbb R^2 |y=mx+b\}. Then, it can be proved that the structure \{S, \mathcal L:=\{L_a\} \cup \{L_{m,b}\}\} satisfies the axioms of lines in Euclidean geometry. The model \{\mathbb R^2, \mathcal L\} is called the Euclidean plane. A coordinate function (axiom of ruler) of a line in this space is constructed as follows.

For a line l, assume that there is a distance function d on l. As previously defined, a function f: l \to \mathbb R is called a ruler or a coordinate system for l if (1) f is a bijection, and (2) for each pair of points P and Q on l, |f(P) - f(Q)| = d(P,Q). This equation is called the ruler equation, and the value f(P) is called the coordinate of P with respect to f.

For this space, if the Euclidean distance defined as d_E(P,Q)=\big( (x_1-x_2)^2 + (y_1-y_2)^2 \big)^{1/2} is considered as the distance function, the following rulers can be obtained for lines in the space.

Case 1: If l = L_a and P\in l, i.e. P=(a,y_P). Then, for P,Q\in l: d_E(P,Q) = ((a-a)^2 + (y_P-y_Q)^2)^{1/2}=|y_P-y_Q|. Therefore, by the definition of the ruler function we can define f:l\to \mathbb R as f(P)=y_P as the coordinate function of a vertical line (the function can be proved to be a bijection).

Case 2: If l = L_{m,b} and P\in l, i.e. P=(x,y), where y=mx+b. Then, d_E(P,Q) =\sqrt{1+m^2}|x_P-x_Q|. This implies that a ruler function for this type of line is f(P)=x_P\sqrt{1+m^2}.
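Both ruler functions can be checked against the ruler equation |f(P) - f(Q)| = d_E(P,Q); a small sketch (the point values are illustrative):

```python
import math

def d_E(P, Q):
    """Euclidean distance between points P = (x, y) and Q."""
    return math.hypot(P[0] - Q[0], P[1] - Q[1])

# Case 1: vertical line x = a; ruler f(P) = y_P.
P, Q = (2.0, 1.0), (2.0, 5.0)
assert math.isclose(abs(P[1] - Q[1]), d_E(P, Q))

# Case 2: line y = m x + b; ruler f(P) = x_P * sqrt(1 + m^2).
m, b = 2.0, 1.0

def ruler(P):
    return P[0] * math.sqrt(1 + m * m)

P2, Q2 = (0.0, b), (3.0, m * 3.0 + b)
assert math.isclose(abs(ruler(P2) - ruler(Q2)), d_E(P2, Q2))
```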

The model \{\mathbb R^2, \mathcal L, d\} is called the Euclidean metric geometry.

The only main axiom left to make \{\mathbb R^2, \mathcal L, d\} a complete model of the Euclidean geometry is the axiom of coordinate functions of bundles. Therefore, a measure on angles is needed. If l_1 and l_2 are two lines intersecting at a point O=(x_o,y_o), with points A=(x_1,y_1)\in l_1 and B=(x_2, y_2)\in l_2, then a measure of the angle \angle AOB satisfying the axioms of this measure is defined as:

    \[ \theta = m(\angle AOB) :=\cos^{-1} \frac{\vec {OA} \cdot \vec{OB}}{\|OA\| \|OB\|}\]

where \vec {OA}:=(x_1-x_o,y_1-y_o), \vec {OB}:=(x_2-x_o,y_2-y_o), and \vec {OA} \cdot \vec{OB} = (x_1-x_o)(x_2-x_o) + (y_1-y_o)(y_2-y_o). Moreover, \|OA\|=d_E(O,A) and \|OB\|=d_E(O,B).

A remark about this formula is that it uses the inverse of the trigonometric function \cos, which has not been defined yet. Also, \vec{OA} might be interpreted as a vector, which is not defined either. The dot operation is also familiar and can be interpreted as a vector dot product. So far we have not defined these terms as they are conventionally defined; we just wanted to show that such a measure exists for the geometric model. Hence we have a model of the Euclidean geometry in which points are interpreted as pairs of real numbers. In this geometry, vectors can be defined, i.e. for A,B\in \mathbb R^2, \vec{AB} =B-A:=(x_B-x_A, y_B-y_A) is a vector which is a member of a vector space, here \mathbb R^2 when it carries the structure of a vector space. The function \cos and its inverse need the definition of a circle to get defined. Although a circle (as per its notion) can be defined in this geometry, there is a purely algebraic way to define \cos ^{-1}, as per chapter 5 in Ref 3.
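The angle measure above is straightforward to compute; a minimal sketch (the point coordinates are illustrative):

```python
import math

def angle_measure(O, A, B):
    """m(angle AOB) from the coordinate formula in the text."""
    oa = (A[0] - O[0], A[1] - O[1])
    ob = (B[0] - O[0], B[1] - O[1])
    dot = oa[0] * ob[0] + oa[1] * ob[1]
    return math.acos(dot / (math.hypot(*oa) * math.hypot(*ob)))

# The coordinate axes meet at a right angle; collinear rays give angle zero.
assert math.isclose(angle_measure((0, 0), (1, 0), (0, 1)), math.pi / 2)
assert math.isclose(angle_measure((0, 0), (1, 0), (2, 0)), 0.0, abs_tol=1e-9)
```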

The motivation for this geometry, which can also be extended to \mathbb R^3, and for the above formulations is the concept of coordinates of points in the physical space geometry. This notion arises from the notion of coordinate systems explained in the next section.

2- Coordinate systems

Let \mathcal G represent the physical Euclidean geometric space consisting of points, lines, planes, line segments, vectors, etc.

2-1 Cartesian coordinate system

2-2 Other coordinate systems

[1] R. A. Sharipov, Foundations of Geometry for University Students and High-School Students.
[3] R. S. Millman and G. D. Parker, Geometry: A Metric Approach with Models.

An introductory note to FEM

Notations and preliminaries

Conventions: Here, I write the indices of second-order tensors as subscripts for convenience. I use the vector-and-covector related index position when needed. Dot is a dot product. A vector in \mathbb R^n is (isomorphically) considered as a single-column matrix for matrix operations. By \mathbb R^n, we mean the Euclidean vector space \mathbb E^n. For a function f:\mathbb R^n \to \mathcal S, we actually mean a function from the Euclidean vector space \mathbb E^n to a set \mathcal S. This means f(x_1,\cdots, x_n) receives the coordinates of a vector in \mathbb E^n. In a Cartesian coordinate system for points, any point can be located by a Euclidean vector, and hence the unit vectors of the coordinate vector field and the unit vectors of the coordinate system are the same. Therefore, f:\mathbb R^n \to \mathcal S can be interpreted as a function on the points of the geometric space. In general, points in a geometric space can be located by any curvilinear coordinate system, e.g. spherical or polar coordinate systems, and f:D\subset\mathbb R^n\to \mathcal S can be defined on the coordinates of the points in those coordinate systems. In these cases, the coordinates of a point, represented for example as (r, \beta, \gamma)\in \mathbb R^3, do not form a vector (i.e. a member of a vector space); therefore, we should distinguish between the coordinates of a point and the coordinate vector field. This is necessary when defining total derivatives, directional derivatives, curls, etc.

Convention: \{e_i\}_{i=1}^n is an orthonormal basis of \mathbb R^n.

P1) Tensor contraction: tensor contraction refers to the process of summing over a pair of repeated indices. This reduces the order of a tensor by 2.

N1) Matrix multiplication: C=AB in Einstein notation is C_{ij}=A_{ik}B_{kj} which is written in tensor notation convention as C=A\cdot B where the central dot denotes the single contraction. Note the position of the dummy index before and after the dot.

N2) Double contraction: double contraction is denoted as c = A:B and refers to c =A_{ij}B_{ji}.

N3) For a vector a=a_ie_i \in \mathbb R^n (a vector space) and f:S\subset\mathbb R^n \to \mathbb R defined as f:(a_1,\cdots,a_n)\mapsto y, the gradient is defined as \nabla f = \frac{\partial f}{\partial a_i}e_i \in \mathbb R^n. The gradient vector is denoted by a short form as \frac{\partial f}{\partial a}.

N4) Dyadic (tensor) product of two vectors: for two vectors a=a_ie_i and b=b_ie_i, their dyadic or tensor product is defined as C = a\otimes b := a_ib_je_i\otimes e_j. The coordinates of the tensor C are then C_{ij} = a_ib_j, collected in \mathbb R^{n\times n} if \{e_i\otimes e_j\}_{i,j=1}^n is set as a basis. The coordinates can also be written as [C] = [a][b]^T where [\cdot] returns and collects the coordinates into a matrix.
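In coordinates, the dyadic product is the outer product of the coordinate columns; a quick sketch:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Coordinates of the dyadic product C = a (x) b: C_ij = a_i * b_j.
C = np.outer(a, b)

assert C.shape == (3, 3)
assert np.isclose(C[1, 2], a[1] * b[2])
# [C] = [a][b]^T in matrix form.
assert np.allclose(C, a.reshape(3, 1) @ b.reshape(1, 3))
```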

P2) For a vector valued function F:S\subset \mathbb R^n \to \mathbb R^n, such that \mathbb R^n is a vector space and n>1, the divergence is defined as \nabla \cdot F = \frac{\partial F_i}{\partial  x_i} \in \mathbb R where F = F_ie_i.

P3) Gauss' divergence theorem: let F:S\subset \mathbb R^{2 \text{ or } 3}\to  \mathbb R^{2 \text{ or } 3}, and let V\subset \mathbb R^{2 \text{ or } 3} be a region with a (smooth or piecewise smooth) boundary \partial V, which is a manifold of dimension one less than that of V. Then,

    \[\int_V \nabla\cdot F \mathrm dV = \int_{\partial V} F\cdot \hat n \mathrm dS\]

where \hat n is the outward unit normal vector field on \partial V. For the outward direction to be well defined, \partial V, i.e. a bounding surface or curve, must be orientable.
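As a sanity check, the theorem can be verified numerically for a simple 2D field; the field F(x, y) = (x^2, y) on the unit square is an assumption chosen for illustration (div F = 2x + 1):

```python
import numpy as np

n = 400
h = 1.0 / n
mid = (np.arange(n) + 0.5) * h          # midpoint quadrature nodes
X, Y = np.meshgrid(mid, mid)

# Volume integral of div F = 2x + 1 over the unit square.
volume_integral = np.sum(2 * X + 1) * h * h

# Flux through the four edges (outward normals (+-1, 0) and (0, +-1)).
flux = (np.sum(np.ones(n)) * h * 1.0**2   # right edge x = 1: F.n = x^2 = 1
        + np.sum(np.ones(n)) * h * 1.0    # top edge   y = 1: F.n = y = 1
        + 0.0                             # left (x=0) and bottom (y=0): F.n = 0
        )

assert np.isclose(volume_integral, flux, rtol=1e-6)
```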

P4) A second-order tensor-valued function is a function L:\mathbb R^n \to \mathcal L, where \mathcal L is the space of linear maps. The components of the resultant tensor are L_{ij}, which are in fact the elements of the linear map's matrix formed under chosen bases for the vector spaces of the domain and range of the linear map.

P5) For a second-order tensor-valued function, i.e. a function T:\mathbb R^n \to \mathcal L where \mathcal L is the space of linear maps, the divergence operator is defined as \nabla \cdot T = v \iff v_i = \frac{\partial T_{ij}}{\partial x_j}, where T_{ij} are the components (coordinates) of the tensor T. The components of the tensor are in fact the elements of the linear map’s matrix formed under a particular choice of a basis of the vector space.

Shape modes

A shape mode can be explained in the following simple and plain way.

Consider the following population of n shapes; each is a line segment with a circular bump or a pit.

Fig. 1. A population of shapes

The location of the bump/pit is measured by its distance from one end of the line as shown in Fig. 1. The status of the bump/pit can be described by a binary variable being either B or P. Assume that the mean location of the bumps/pits falls at the middle point of the line segment, and on average it is more probable that a line segment has a bump rather than a pit. Therefore, on average the following shape is expected, i.e. a bump at the middle.

Fig. 2. The mean shape.

Considering the mean shape and other shapes in the population, two kinds of shape variations can be deduced: a) there will be either a bump or a pit, b) a bump or a pit located on a line can be at any distance to the left or right of the middle point of the line. So, any shape in the shape population can be described by a combination of these two shape variations from the mean shape; this is the notion of shape modes. Simply put, a shape mode is a shape variation from the mean shape (of a population) that is independent of the other possible shape variations (e.g. the location of the bump/pit vs the status of the bump/pit).

For sophisticated biological shapes, shape modes are more interesting and informative, but also more difficult to capture. For example, see Figure 11 of my publication investigating the mean shape and shape modes of the talus.

Python code snippets

Reading a CSV of numbers to a list

import csv
with open(fileName, 'r') as read_obj:
    csv_reader = csv.reader(read_obj)
    list_of_csv = list(csv_reader)
# list_of_csv is a list[list[str]] where each inner list contains a row of the csv.

Writing a list to a CSV

import csv
with open(fileName, 'w', newline='') as f:
    writer = csv.writer(f)  # create the csv writer
    # write the header row to the csv file
    writer.writerow(fields)  # fields is a list of column names or a string as a header
    writer.writerows(listOfRows)  # each row is a list/iterable (List[List])
    # Or use writer.writerow(aRow) in a loop over a list/iterable of rows

Writing a File

with open(filename, 'w') as f:
    for element in an_iterator:  # list, dict, etc.
        f.write(rowString(element))  # rowString is a function or expression that
                                     # makes a string out of the content of element

Sorting a list of lists based on a specific index

z = [['a',2], ['v', 0.256]]
sorted(z, key=lambda item: item[1])

Matrix decomposition

Spectral (Eigenvalue) decomposition (EVD)

A matrix A\in \mathbb R^{n\times n} is diagonalizable if and only if it has n linearly independent eigenvectors. Therefore, A = U\Gamma U^{-1} where the columns of U are the eigenvectors of A and \Gamma is a diagonal matrix of the corresponding eigenvalues \lambda_i. Consequently, the square matrix can be written as the following sum.

    \[A= \sum_{i=1}^n \lambda_i u_iv_i^T\]

where (\lambda_i \in \mathbb R, u_i\in \mathbb R^n) is the i-th eigen pair, v_i^T is the i-th row of U^{-1}. Here, we only considered cases with real eigenvalues.

If A is a symmetric matrix, always having real eigenvalues and being diagonalizable, we can write A = Q\Gamma Q^T where Q contains columns of orthonormal eigenvectors of A. Therefore,

    \[A = \sum_{i=1}^n \lambda_i u_iu_i^T\]

This representation is called the spectral decomposition or eigenvalue decomposition (EVD) of A. As can be observed, the decomposition expresses A as a linear combination of n rank-1 matrices in which the coefficients of the combination are the eigenvalues of A. Note that Q is not unique.
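The decomposition can be sketched numerically; the following minimal example (assuming NumPy and an arbitrary symmetric test matrix) rebuilds A from the rank-1 terms \lambda_i q_iq_i^T:

```python
import numpy as np

# An arbitrary symmetric example matrix (eigenvalues 1 and 3).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is for symmetric matrices: Q has orthonormal eigenvector columns.
lam, Q = np.linalg.eigh(A)

# Reconstruct A as a sum of rank-1 matrices λ_i q_i q_i^T.
A_rebuilt = sum(l * np.outer(q, q) for l, q in zip(lam, Q.T))
```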

Singular value decomposition (SVD)

Theorem [SVD1]: If A\in \mathbb R^{n\times n} is of rank k, then A can be factored as

    \[A =U\Sigma V^T\]

where U and V are n\times n orthogonal matrices and \Sigma is an n\times n diagonal matrix whose main diagonal has k positive entries and n-k zeros.


We can prove that the matrix B:= A^TA is symmetric. So, it has an EVD as

    \[B:=A^TA = VDV^T\]

where the columns of V contain orthonormal eigenvectors of B, and D contains the eigenvalues of B, which are all non-negative: if (\lambda, u) is an eigen pair of B, then

    \[\|Au\|^2 = Au\cdot Au = u\cdot A^TAu =  u\cdot Bu = u\cdot \lambda u= \lambda u\cdot u = \lambda \|u\|^2 \ge 0 \implies \lambda \ge 0\]

Therefore, the entries of D, being eigenvalues of B, are all non-negative. Note that u cannot be the zero vector as it is an eigenvector of B, but Au can be the zero vector; therefore we used \|Au\| \ge 0.

As by a rank theorem \text{rank} (A) = \text{rank} (A^TA)=\text{rank} (D), and by the assumption \text{rank}(A)=k, there are k positive entries and n - k zeros on the main diagonal of D.

Let’s consider the sets of eigen pairs of B:=A^TA as

    \[\{(\lambda_i, v_i)| \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_k > 0,\ i=1,\cdots, k\} \text { and } \{(\lambda_{k+j}, v_{k+j})| \lambda_{k+1}=\lambda_{k+2}=\cdots=\lambda_{n} = 0,\ j=1,\cdots, n-k \} \]

where v_is are orthonormal (they exist as B is symmetric), i.e. i\ne j \iff v_i\cdot v_j=0 and \|v_i\|=1.

Considering the set of image vectors \{ Av_i\}_{i=1}^n, we can write Av_i\cdot Av_j = v_i\cdot Bv_j = \lambda_j (v_i\cdot v_j) =0 \text{ for } i\ne j. Therefore, the Av_is are orthogonal.

Also, \|Av_i\|^2 = \lambda_i\|v_i\|^2=\lambda_i. Therefore, \|Av_i\| = \sqrt{\lambda_i} for i=1\cdots n.

Consequently, for \{\lambda_i\}_{i=1}^k, the set \{ Av_i\}_{i=1}^k is an orthogonal set of non-zero vectors in the column space of A (because for any vector x and matrix M, the vector b=Mx equals \sum_i x_i M_{:,i}, i.e. a linear combination of the columns of M). Because \text{rank}(A)=k, the column space of A has dimension k, and hence \{ Av_i\}_{i=1}^k is an orthogonal basis for the column space of A. Normalizing these vectors as u_i:=Av_i/\|Av_i\| = (1/\sqrt{\lambda_i}) Av_i, i=1,\cdots,k, we can find an extended set of orthonormal basis vectors for \mathbb R^n as,

    \[\{u_1, \cdots, u_k, u_{k+1}, \cdots, u_n | u_i\in \mathbb R^n,  \|u_i\|=1, u_i\cdot u_j = 0 \iff i\ne j\}\]

Note that Av_i=0 for i=k+1, \cdots, n because \|Av_i\| = \sqrt{\lambda_i} and \lambda_i=0 for those i. Therefore, we needed to extend \{u_1, \cdots, u_k\} to a set of n orthonormal vectors spanning \mathbb R^n.

Now defining a matrix U\in \mathbb R^{n\times n} as (u_i is in a column vector form)

    \[ U :=\begin{bmatrix} u_1 & u_2 & \cdots & u_k & u_{k+1} & \cdots & u_n \end{bmatrix} \]

and a diagonal matrix \Sigma\in \mathbb R^{n\times n} as

    \[ \Sigma_{ij}:= \sqrt{\lambda_i}\delta_{ij}\]

where the diagonal entries \sqrt{\lambda_i} are zero for i>k, we can write

    \[\begin{split} U\Sigma &= \begin{bmatrix} \sqrt{\lambda_1}u_1 & \cdots & \sqrt{\lambda_k}u_k & 0 & \cdots & 0 \end{bmatrix}\\ &= \begin{bmatrix} Av_1 & \cdots & Av_k & Av_{k+1} & \cdots & Av_n \end{bmatrix} \\ &= AV \implies U\Sigma = AV \end{split}\]

where the columns of V collect v_is. So, since V is orthogonal, then

    \[A = U\Sigma V^T\]

Note that the entries on the main diagonal of \Sigma are NOT the eigenvalues of A but rather the square roots of the (non-negative) eigenvalues of A^TA. These values are called the singular values of A. Also, the u_is and v_is are respectively called the left and the right singular vectors of A.

Remark: The singular values of a square matrix A are the square roots of the eigenvalues of A^TA. The right singular vectors of A are the eigenvectors of A^TA.

Theorem [singular value decomposition of square matrices]: A square matrix A\in \mathbb R^{n\times n} of rank k has a singular value decomposition as

    \[A = U\Sigma V^T =\begin{bmatrix} u_1 & \cdots & u_k & u_{k+1} & \cdots & u_n \end{bmatrix}\begin{bmatrix} \sigma_1  \\ & \ddots \\ & & \sigma_k \\ & & & 0 \\ & & & & \ddots \\ & & & & & 0\end{bmatrix} \begin{bmatrix} v_1^T \\ \vdots \\ \vdots \\ v_n^T \end{bmatrix} \]

where \{v_i\}_1^n are orthonormal eigenvectors of A^TA and \sigma_i=\sqrt{\lambda_i} is associated with the eigenpair (v_i,\lambda_i). The column vectors of V are ordered according to \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_k > 0. Also, u_i = (1/\sigma_i) Av_i for i\le k, and \{u_{k+1},\cdots, u_n\} is such that it extends \{u_1,\cdots,u_k\} to an orthonormal basis for \mathbb R^n.

Remark: U and V are not unique. If we say THE SVD of A, we refer to a specific decomposition, not to uniqueness of the decomposition.
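The two facts above can be checked numerically; a minimal sketch assuming NumPy and a random test matrix (note np.linalg.svd returns singular values in descending order, so the eigenvalues of A^TA are reversed to match):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))     # arbitrary full-rank test matrix

U, s, Vt = np.linalg.svd(A)

# Eigenvalues of A^T A, sorted descending to match the ordering of s.
lam = np.linalg.eigvalsh(A.T @ A)[::-1]
singular_from_eig = np.sqrt(np.clip(lam, 0.0, None))  # clip tiny negatives

# A = U Σ V^T should reconstruct A.
reconstruction_error = np.abs(A - U @ np.diag(s) @ Vt).max()
```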

SVD of invertible matrices

If A\in \mathbb R^{n\times n} is invertible, then k=\text{rank} (A) = n and hence all the elements on the diagonal of \Sigma are non-zero, and there is no need to extend the u_is as there are already n orthonormal vectors.

SVD of symmetric matrices

If A\in \mathbb R^{n\times n} is symmetric, then A^T=A and hence A^TA = AA =: A^2. Therefore, if \text{rank}(A) = k, then A has k non-zero eigenvalues. If (\lambda_i, v_i) is an eigenpair of A, then (\lambda_i^2, v_i) is an eigenpair of A^2, because,

    \[Av_i = \lambda_i v_i \implies A^2v_i=\lambda_i Av_i =\lambda_i^2 v_i\]

Hence, in the SVD of the symmetric matrix A:

1- the singular values of A are \sigma_i = \sqrt{\lambda_i^2}=|\lambda_i|.
2- the right singular vectors of A are the v_is, being the normalized eigenvectors of A.
3- the left singular vectors of A are u_i=Av_i/\|Av_i\|=(\lambda_i/|\lambda_i|)v_i=\pm v_i for i=1,\cdots,k.

Consequently, the SVD and EVD of a symmetric matrix are the same if the eigenvalues of the matrix are non-negative; otherwise the decompositions agree up to the sign of a left (or right) singular vector.
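The relation \sigma_i = |\lambda_i| can be sketched with NumPy on an arbitrary symmetric example matrix chosen to have a negative eigenvalue:

```python
import numpy as np

# Symmetric example matrix with eigenvalues -1 ± 2√2 (one negative).
A = np.array([[1.0, 2.0],
              [2.0, -3.0]])

lam = np.linalg.eigvalsh(A)                   # eigenvalues of A
s = np.linalg.svd(A, compute_uv=False)        # singular values of A
```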

Reduced SVD

The zero rows and columns of \Sigma in an SVD are redundant and can be removed; therefore we can write the following reduced SVD,

    \[A =U\Sigma V^T \equiv \tilde U_{n\times k} \tilde \Sigma_{k\times k} \tilde V^T_{k\times n} =\begin{bmatrix} u_1 & \cdots & u_k  \end{bmatrix}\begin{bmatrix} \sigma_1  \\ & & \ddots \\ & & & \sigma_k \end{bmatrix} \begin{bmatrix} v_1^T \\ \vdots  \\ v_k^T \end{bmatrix} \]

Also the corresponding reduced SVD expansion of A is,

    \[A = \sum_{i=1}^k \sigma_i u_i v_i^T \]

Reduced rank SVD

Considering a reduced SVD expansion of A with rank k, suppose that the singular values \sigma_{r+1}, \cdots, \sigma_{k} are sufficiently small (relative to \sigma_i, \text{ for } i\le r) that dropping them leads to an approximation of A within an intended error. Therefore, we can define a reduced rank approximation or rank r approximation of A as,

    \[A\approx A_r = \sum_{i=1}^r \sigma_i u_i v_i^T \]

It can be proved that \text{rank}(A_r) = r.

In numerical computations, the reduced expansion of A_r needs much less memory than its full rank SVD expansion.
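A rank r approximation can be sketched by truncating an SVD; the test matrix and the choice r = 2 below are arbitrary. The check on the error uses the known fact that the spectral-norm error of an SVD truncation equals the first dropped singular value.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))   # arbitrary full-rank test matrix

U, s, Vt = np.linalg.svd(A)
r = 2
# A_r = Σ_{i=1}^r σ_i u_i v_i^T, written as a product of the truncated factors.
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

rank_Ar = np.linalg.matrix_rank(A_r)
err = np.linalg.norm(A - A_r, 2)  # spectral-norm error of the truncation
```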

Polar decomposition

A shallow summary of the probability theory

1- Measure space

1-1 Sigma algebra

Definition: an algebra on a set \Omega is a collection of subsets, \mathcal F, that is
1) closed under complement, i.e. E\in\mathcal F\implies E^c\in \mathcal F.
2) closed under finite unions.
3) \emptyset \in \mathcal F.
4) as a result of (1) and (2): closed under finite intersection.
5) as a result of above \Omega\in\mathcal F.

Definition: Given a set \Omega, a sigma algebra (σ-algebra) on \Omega is a collection \mathcal{A}\subset 2^\Omega such that \mathcal{A}\neq\emptyset and \mathcal{A} is,
1) closed under complement, i.e. E\in\mathcal{A}\implies E^c\in \mathcal{A}.
2) closed under countable unions (including countably infinite unions).
3) \emptyset \in \mathcal A
4) as a result of (1) and (2): closed under countable intersection.
5) as a result of above \Omega\in\mathcal{A}.

Example: let E\subset \Omega then \sigma (E):=\mathcal{A}=\{\emptyset,E,E^c,\Omega\} is a sigma algebra generated by E.

Definition: The σ-algebra generated by a collection (set), E, of sets is denoted as \sigma (E) and is defined as the intersection of all sigma-algebras containing E. It is the smallest sigma algebra which contains all of the sets in E.
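For a finite \Omega, the closure properties of a generated σ-algebra can be checked mechanically. A minimal Python sketch for \sigma(E)=\{\emptyset,E,E^c,\Omega\} from the example above (the concrete sets are arbitrary choices):

```python
# Toy finite example: Ω = {1,2,3,4}, E = {1,2}, σ(E) = {∅, E, E^c, Ω}.
omega = frozenset({1, 2, 3, 4})
E = frozenset({1, 2})
sigma_E = {frozenset(), E, omega - E, omega}

# Closure properties of a σ-algebra (finite case, so unions are finite).
closed_under_complement = all(omega - a in sigma_E for a in sigma_E)
closed_under_union = all(a | b in sigma_E for a in sigma_E for b in sigma_E)
closed_under_intersection = all(a & b in sigma_E for a in sigma_E for b in sigma_E)
```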

Example: Borel σ-algebra is generated by (the collection of) open sets of a topological space. For instance if we consider the standard metric topology on \mathbb{R}, i.e. open intervals are open sets, then the following example sets are in the Borel σ-algebra generated by the open sets:
(a,b),\ [a,b],\ \{a,b,c,..,z\},\ \{a,b,c, ...\},\ \{a\}\cup (c,d),\ (a,b],\ (a,b)\cup (c,d), and any other combinations of open, half-open, closed sets, and singletons.

For a topological space (\Omega, \mathcal O), the Borel sigma algebra is denoted by \mathcal B := \sigma(\mathcal O); any member of \mathcal B is called a Borel set.

Proposition: The intersection of σ-algebras is a σ-algebra. Their union may not be a σ-algebra.

1-2 Measurable space

For a set \Omega and a σ-algebra \mathcal{A} on \Omega, the pair (\Omega,\mathcal{A}) is called a measurable space.

Any set in \mathcal{A} is said to be a measurable set.

1-3 Non-negative countably additive set function

Given a set \Omega and a collection/family of subsets R of \Omega, i.e. R\subset 2^\Omega, we say \mu is a set function on R if \mu assigns to every A\in R a number \mu(A) of the extended real number system \mathbb{R}\cup\{-\infty,+\infty\}. For a non-negative set function, we consider \mu:R\to[0,+\infty]. The symbol \infty is included in the range of the function as we are considering the extended non-negative reals (\mathbb{R}^+\cup\{+\infty\}).

The function \mu is countably additive iff for all A_i\in R with A_i\cap A_j = \emptyset for i\neq j, i.e. disjoint sets,

    \[\mu (\bigcup\limits_{i=1}^{\infty} A_i) = \sum_{i=1}^\infty \mu(A_i)\]

Properties of non-negative countably additive set functions, i.e. \mu:R\to[0,+\infty]

1- \mu (\bigcup\limits_{i=1}^{n} A_i) = \sum_{i=1}^n \mu(A_i) if i\neq j \implies A_i\cap A_j = \emptyset.

2- \mu (\emptyset) = 0.

3- If B\subset A then \mu(B) \le \mu(A) (i.e. being monotonic); and also \mu(A\setminus B) = \mu(A)-\mu(B) provided \mu(B)<\infty. Note that the results are not necessarily correct if B\not\subset A.

4- The countable subadditivity property: \mu (\bigcup\limits_{i=1}^{n} A_i) \le \sum_{i=1}^n \mu(A_i). This also holds for n\to \infty.

5- Let A_1\subset A_2 \subset \cdots be an increasing sequence of nested sets such that A=\bigcup\limits_{i=1}^{\infty} A_i; then \lim_{i\to\infty} \mu(A_i)=\mu(A). This is a limit of a sequence.

6- Let A_1\supset A_2 \supset \cdots be a decreasing sequence of nested sets such that A=\bigcap\limits_{i=1}^{\infty} A_i and \mu(A_1)<\infty; then \lim_{i\to\infty} \mu(A_i)=\mu(A).

1-4 Measure (non-negative)

A measure \mu on a measurable space (\Omega,\mathcal{A}) is a non-negative countably additive set function
\mu: \mathcal{A} \to[0,+\infty].

1-5 Measure space

The triple (\Omega,\mathcal A, \mu) is called a measure space.

1-5-1 Counting measure

For a finite set A, \mu_c (A) is defined as the number of members inside the set.

1-5-2 Lebesgue measure

Definition: A cube or rectangular box in \mathbb  R^n is a Cartesian product of closed intervals in \mathbb R as,
Q = [a_1,b_1]\times \cdots \times [a_n, b_n].

The volume of Q is then defined as \mathrm V (Q) := \prod_{i=1}^n(b_i-a_i). This is a non-negative number.
For example, for n equal to 1, 2 and 3, this volume becomes the length of an interval, the area of a rectangular region, and the volume of a cubic region in \mathbb R^n, respectively.

Definition: The Lebesgue outer/exterior measure of a set A\subset \mathbb R^n is defined as,

    \[\mu^*(A) = \inf \{ \sum_i \mathrm V (Q_i)| A\subset \bigcup_i Q_i \} \]

A term S:=\bigcup_i Q_i is called a covering of A when A\subset S.

The infimum in the definition means we want to take the tightest outer approximation/fit.

The properties of the outer measure:
0- 0 \le \mu^* \le \infty
1- \mu^*(\emptyset) = 0
2- \mu^* is monotone.
3- \mu^* is countably subadditive.

Definition: A set A\subset \mathbb R^n is Lebesgue measurable iff for any \varepsilon > 0, there exists an open set U\supset A such that \mu^* (U\setminus A)\le \varepsilon. If a set is Lebesgue measurable, we say it is measurable and write its measure as \mu(A)=\mu^*(A).

Lebesgue measure satisfies all the properties of a measure, i.e. properties of non-negative countably additive set functions as above.


1- By Lebesgue measure, every n-dimensional cube in \mathbb R^n is measurable and satisfies \mu (Q) = \mu(\mathring Q)=\mathrm V(Q), and \mu(\partial Q)=0 where \mathring Q and \partial Q denote the interior and the boundary of Q.

Definition: A subset A\subset \mathbb R^n is said to have Lebesgue measure of zero if for every \varepsilon > 0 there exist cubes Q_1, Q_2, \cdots such that A\subset \bigcup\limits_{i=1}^{\infty} Q_i and \sum_{i=1}^{\infty} \mu(Q_i) = \sum \mathrm V (Q_i) < \varepsilon.

By this definition, any k-dimensional subspace A of \mathbb R^n with k < n, and hence any subset of A, has measure zero. For example, when using the Lebesgue measure in \mathbb R^3, the measure returns zero for singletons (and any countable set of points), lines, and surfaces. The measure is non-zero only for sets representing volumes in \mathbb R^3, i.e. sets containing cubes.

Any countable union of sets having measures zero has measure zero.

2- Lebesgue measure is countably additive for disjoint sets. However, this condition can be relaxed: the boundary of an n-dimensional cube/interval has measure zero, and hence it does not matter whether the boundaries of boxes intersect or not, as long as their interiors are disjoint. By definition, two cubes are non-overlapping iff their interiors are disjoint, i.e. \mathring Q_1 \cap \mathring Q_2 = \emptyset.

If \{Q_1, \cdots, Q_n\} is a finite collection of non-overlapping n-dimensional cubes (closed intervals), then \mu (\cup Q_i) = \sum \mu(Q_i).

3- Every open set A \subset \mathbb R^n can be written as a countably infinite union of non-overlapping cubes/closed intervals (Q_is) in \mathbb R^n. In other words, A=\bigcup\limits_{i=1}^{\infty} Q_i. Therefore,
\mu (A)=\mu (\bigcup\limits_{i=1}^{\infty} Q_i)=\sum_{i=1}^{\infty} \mu(Q_i)=\mathrm V(A).

The above proposition is based on the fact that a finite union of closed sets is closed, while a countably infinite union of closed sets may not be closed; an open set can be produced by a countably infinite union of closed sets.

1-6 Cartesian product spaces and measures

Definition: For a finite set of measurable spaces \{(\Omega_i, \mathcal A_i)\}_{i=1}^n, then the product σ-algebra \mathcal A on \Omega_1\times\cdots\times\Omega_n is the σ-algebra generated by all the sets of the form A_1\times\cdots\times A_n where A_i\in \mathcal A_i. Note that A_i refers to any set in \mathcal A_i.

Theorem: [Product measure] Let \{(\Omega_i, \mathcal A_i, \mu_i)\}_{i=1}^n be a set of measure spaces with \mu_i<\infty and (\Omega,\mathcal A) be the product measurable space. Then there exists a unique product measure \mu creating the product measure space (\Omega,\mathcal A,\mu) such that \mu(E=A_1\times\cdots\times A_n )= \prod_1^n \mu_i(A_i) for A_i\in \mathcal A_i.

Example: Consider a measure space (\mathbb R, \mathcal B, \mu_I) where \mu_I is the Lebesgue measure. For I\in \mathcal B being a real interval [a_1,a_2], we have l(I):=\mu_I(I)=a_2-a_1, which is the length of the interval. The following holds for product spaces: for \mathbb R^n the product σ-algebra contains n-dimensional cubes Q = I_1\times \cdots \times I_n and hence \mu(Q)=\mathrm V (Q) = l(I_1)l(I_2)\cdots l(I_n). Therefore, we can consider a measure space as (\mathbb R^n, \mathcal B^n, \mathrm V).

2- Probability space

2-1 Probability measure

A probability measure P is a measure (non-negative countably additive set function) on a measurable space (\Omega,\mathcal{A}) such that P(\Omega)=1.

The properties of a measure (1 to 6 above), plus the provable property P(A^c)=1-P(A), are called Kolmogorov’s axioms of probability theory. Since P(\Omega)=1, then \forall E \in \mathcal A,\ P(E)\le 1. Hence, 0 \le P \le 1.

2-2 Probability space

Definition: A random experiment is an experiment or a process (measurement, action, etc) for which we cannot certainly (deterministically) predict its outcome. It does not mean that the process does not follow any rule and randomly produces outcomes; rather, the rule or the factors affecting the outcome are not explicitly known to us.

A probability space is a measure space as (\Omega,\mathcal A, P). In this case, \Omega is called the sample space and contains all possible direct outcomes of a random experiment (trial). Each set in σ-algebra \mathcal A is a collection of outcomes and is called an event. In other words, an event is a measurable subset of \Omega; and \mathcal A is the collection of all the events.

A simple event is an event that comprises only a single (direct) outcome (a member of sample space). It means, \{\omega\} where \omega \in \Omega is a simple event.

Note that an event is a member of \mathcal A (and not necessarily every subset of \Omega is an event). An outcome can belong to one or several particular events in addition to being a member of \Omega. The occurrence of an event (realistically or by assumption) means that the (observed) outcome of a trial of the experiment belongs (\in) to that event, i.e. a set. For instance, \Omega is the event that always occurs and \emptyset never occurs (meaning that a trial of an experiment always has an outcome).

The occurrence of an event (set) consisting of a union of events means the occurrence of either (or all) of the events in the union. For example, we denote this as E=E_1\cup E_2 \cup E_3. This implies OR.

The intersections of events implies AND.

Example (discrete): An experiment of tossing a perfect coin can have either of two outcomes tail or head. Therefore, \Omega = \{T,H\}. A σ-algebra on \Omega can be a set like,
\mathcal A = \sigma(\Omega) = \{\{T,H\}, \emptyset\}\subset 2^\Omega or a set like
\mathcal A = \{\{T\}, \{H\}, \{T,H\}, \emptyset \}.
Then, P:\mathcal A \to [0,1].

The probability measure can be defined by setting P(\{T\})=a and P(\{H\})=1-a where 0\le a \le 1. If tail and head have the same chance of occurring, then they have the same probability (measure), equal to 0.5.

Example (continuous): An experiment of measuring the weather temperature. The outcome of a trial of this experiment can be any number in an interval \Omega = [a,b]; for example [-50,70]. The set of events, i.e. the σ-algebra, is the Borel σ-algebra \mathcal B on \mathbb R or [a,b] for this experiment. Not only is a single measurement an event, but also any open/closed interval subset of \Omega is an event. Note that \Omega = [a,b] is an open set in the subspace topology of [a,b].

A probability measure can be defined as P:\mathcal B \to [0,1] such that P(E)=l(E)/l([a,b]) for E\in \mathcal B.

Example: When an experiment contains trials or sub-experiment (performed sequentially or at the same time), the sample space is a set of tuples i.e. finite sequences. As a remark and example, if the sample space contains ordered pairs i.e. \Omega=\{(\omega_i,\omega'_j)| i=1\cdots n , j=1\cdots m\} then for an event like \{(\omega_2,\omega'_7)\} we can write,

    \[P(\{(\omega_2,\omega'_7)\})= P(\{(\omega_2,\omega'_j)| j=1\cdots m\} \cap \{(\omega_i,\omega'_7)|i=1\cdots n\})\]

Example: A process of independently tossing two coins (simultaneously or one after another) or tossing a coin twice.

    \[\Omega =\{(T,T), (T,H), (H,H), (H,T)\}\]

    \[\mathcal A=\{ \{(T,T)\}, \{(T,H)\}, \cdots \{(T,T), (T,H)\}, \cdots, \{(T,T), (T,H), (H,T)\}, \cdots, \Omega, \emptyset\}=2^\Omega\]

Note that an event like E = \{(T,T), (T,H)\} means E = \{(T,T)\} \cup \{(T,H)\} which is interpreted as OR, i.e either of the outcomes (are considered).

A probability measure can be constructed based on the counting measure as,

    \[P(E)= \frac{\mu_c(E)}{\mu_c(\Omega)}\]
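This counting-measure construction can be sketched directly for the two-coin example above; exact fractions are used for the ratio of counting measures:

```python
from fractions import Fraction

# Sample space of the two-coin experiment.
omega = {('T', 'T'), ('T', 'H'), ('H', 'H'), ('H', 'T')}

def P(event):
    # P(E) = μ_c(E) / μ_c(Ω), the ratio of counting measures.
    return Fraction(len(event), len(omega))

# The event "the first coin shows tails".
p_first_tail = P({('T', 'T'), ('T', 'H')})
```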

Example: Repeating an experiment/process until success, i.e a predetermined desired outcome.

    \[\Omega = \{ S, (F,S), (F,F,S), \cdots, (F,F,F,\cdots,F,S), \cdots \}\]

    \[\mathcal A =  \{ \Omega, \emptyset, \{S\}, \{(F,S)\}, \{(F,F,S)\}, \{S, (F,F,F,S)\}, \cdots\}\]

An event like \{S, (F,F,F,S)\} means we considered either observing the success at the first trial, or the success at the 4th trial of the experiment.

We cannot directly use a counting measure to set a probability measure for this experiment because \mu_c (\Omega)=\infty. However, we note that each simple event can be interpreted as an event of repeating the same (sub-) experiment (trial). For example, \{(F,F,F,F,S)\} is an event of repeating the experiment five times. If we denote its sample space by \Omega_5 and observe that \mu_c(\Omega_5)=2^5, we can conclude that P(\{(F,F,F,F,S)\})=\frac{1}{2^5}. Therefore, if \omega\in \Omega and S = \{\omega\} is a simple event, then P(S)=\frac{1}{2^n} where n is the size of the n-tuple representing the event. To see that P(\Omega)=1 we write,

    \[P(\Omega)=P(\bigcup_{i=1}^\infty S_i)=\sum_{i=1}^\infty P(S_i)= \sum_{i=1}^\infty \frac{1}{2^i} = 1\]
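A quick numerical sketch of this sum, assuming a fair coin: a simple event represented by an n-tuple has probability 1/2^n, and the partial sums approach 1.

```python
# P of the simple event whose tuple has length n, for a fair coin.
def p_simple(n):
    return 0.5 ** n

# Partial sum of Σ_{n>=1} 1/2^n; it converges to 1.
partial_total = sum(p_simple(n) for n in range(1, 60))
```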

2-2-1 Product probability space

Proposition: Let \{(\Omega_i, \mathcal A_i, P_i)\}_{i=1}^n be a set of probability spaces and (\Omega,\mathcal A) be the product measurable space. Then, there exists a unique product probability measure P creating the product probability space (\Omega,\mathcal A, P) such that P(E=A_1\times\cdots\times A_n )= \prod_1^n P_i(A_i) for A_i\in \mathcal A_i.

This proposition explains the probability space of an experiment consisting of independent trials or sub-experiments.

Example (trials of the same experiment): an experiment which is repeated n times with the same possible outcomes. Each repetition is called a trial. We assume independence of the trials.

Since all the trials have the same possible outcomes, \Omega_1=\cdots = \Omega_n; also, P_1=\cdots = P_n because the trials are identical. By independence, we have P(E=A_1\times\cdots\times A_n )= \prod_1^n P_i(A_i) for A_i\in \mathcal A_i.

Example (sub-experiment): an experiment of tossing a coin followed by/or simultaneously (independently) rolling a dice.

    \[\begin{split}\Omega &= \{(T,1), (T,2), \cdots, (H,6)\}\\\Omega_1 &= \{T,H\},\quad (\Omega_1, \mathcal A_1, P_1)\\\Omega_2 &= \{1,2,3,4,5,6\}, \quad (\Omega_2, \mathcal A_2, P_2)\\&\implies \Omega = \Omega_1 \times \Omega_2\\&\mathcal A =\{\{(T,1)\}, \cdots, \{(H,6)\}, \cdots, \Omega, \emptyset\} = 2^\Omega\end{split}\]

Using the proposition we can, for example, write

    \[P(\{(T,1), (H,6)\})= P(\{(T,1)\}) + P(\{(H,6)\}) = P_1(\{T\})P_2(\{1\}) + P_1(\{H\})P_2(\{6\}) \]
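A small sketch of this computation, assuming a fair coin and a fair dice as in the example:

```python
from fractions import Fraction

# Marginal probability measures on the two sub-experiments.
P1 = {'T': Fraction(1, 2), 'H': Fraction(1, 2)}
P2 = {d: Fraction(1, 6) for d in range(1, 7)}

def P(event):
    # event is a set of (coin, die) outcomes; the simple events are disjoint,
    # so P(event) is the sum of the product-measure probabilities.
    return sum(P1[c] * P2[d] for c, d in event)

# P({(T,1),(H,6)}) = P1({T})P2({1}) + P1({H})P2({6})
p = P({('T', 1), ('H', 6)})
```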

2-3 Random variable

Definition: Let (\Omega, \mathcal A) and (S, \mathcal G) be measurable spaces. A function f:\Omega \to S is called a measurable function (with respect to the σ-algebras) if f^{-1}(X)\in \mathcal A for every X\in \mathcal G. This reads f is measurable if every (measurable) set in \mathcal G has a (measurable) pre-image under f.

Using the definition of a subset and properties of the pre-image, the above can be written as: f is measurable if f^{-1}(\mathcal G)\subset \mathcal A.

Above function can be referred to as a map between measurable spaces and denoted as f:(\Omega, \mathcal A)\to (S, \mathcal G).

Definition: If f:\Omega \to \mathbb R is measurable from (\Omega, \mathcal A) to (\mathbb R, \mathcal B), such that \mathcal B is the Borel σ-algebra on \mathbb R, then f is called a Borel (measurable) function or a random variable. It can be proved that the following functions are random variables (with respect to \mathcal B).

1- All continuous or piecewise continuous functions f:\Omega \to \mathbb R.

2- If f,g:\Omega \to \mathbb R are random variables, then f+g, fg, f^g, f/g, \min(f,g) and \max(f,g) are random variables.

3- For f_i measurable, the limits \sum_i^\infty f_i(x) and \lim_{i\to\infty} f_i(x) are random variables if the limits exist.

4- If f and g are random variables, then their compositions are also random variables.

5- f is a random variable if and only if f^{-1}\big((a,\infty)\big)\in \mathcal A for all a\in \mathbb R.

Definition: the value of a random variable for a specific input is called the realization of the random variable. For a random variable X and its value x at a particular \omega \in \Omega, we write x=X(\omega).

Definition: the range of a random variable X is R_X:=X(\Omega)\subset \mathbb R and is called the support of X.

Example: The experiment of tossing a perfect coin. Let’s define the following measurable sample space,

    \[(\Omega=\{H,T\}, \mathcal A=2^\Omega = \{\{H\}, \{T\}, \{H,T\}, \emptyset \})\]

and a function f:\Omega \to \mathbb R as f(H)=0.7 and f(T)=1.0. Then, this function is a random variable because, considering the Borel measurable space (\mathbb R, \mathcal B), the expression f^{-1}(S)\in \mathcal A holds for any S\in \mathcal B. Note that \emptyset\in \mathcal A and f^{-1}(S)=\emptyset if S does not intersect the range of f; here, the range/support is \{0.7, 1.0\}. Also note that f is a continuous function if the subspace topology for the domain is considered. Therefore, f is a random variable in this regard as well.

In the above example T and H are symbols, they can be anything like numbers, or sets of numbers or symbols. For example, in an experiment of measuring heights of people the sample space can be set as (\mathbb R^+, \mathcal B), and we can define a random variable X:\mathbb R^+ \to \mathbb R as,

    \[X(\omega) = \begin{cases} 1 &\text {if } \omega \ge 1.7\\0 &\text {if } \omega < 1.7\end{cases}\]
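This indicator-style random variable is straightforward to sketch directly; the threshold 1.7 follows the definition above:

```python
# X(ω) = 1 if ω >= 1.7, else 0 — a realization for each sampled height ω.
def X(omega):
    return 1 if omega >= 1.7 else 0

# Realizations at a few (arbitrary) sample heights; the support is {0, 1}.
support = {X(w) for w in (1.2, 1.7, 2.0)}
```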

Example: The experiment of measuring/reading some physical variable, e.g. temperature or force. The measurable sample space can be defined as (\Omega = \mathbb R, \mathcal B). Then, the identity function f:\mathbb R \to \mathbb R as f(\omega)=\omega is a random variable with respect to (\Omega = \mathbb R, \mathcal B) and (\mathbb R, \mathcal B) .

Example: The experiment of selecting 2 objects (O_i) out of n objects. The selection is without replacement and the order of selection does not matter. The sample space is \Omega = \{ \{O_i, O_j\} | i,j\in \{1,2,\cdots n\}, \quad i\ne j \}. This set contains C_r = n!/(2(n-2)!) members. The event space is \mathcal A = 2^\Omega. A random variable, X:\Omega \to \mathbb R, can be defined as the total weight of the (pair of) selected objects. Note that the codomain of the random variable is \mathbb R but its range is a discrete set of numbers. The measurable space regarding the random variable is (\mathbb R, \mathcal B), which itself can be considered as a sample space. In this regard, the experiment can be re-defined as measuring the weight of a pair of objects selected from the collection.

2-4 Random vector

Let (\Omega, \mathcal A) and (\mathbb R^n, \mathcal B^n) be measurable spaces with \mathcal B^n being the Borel σ-algebra on \mathbb R^n. A random vector (X_1, X_2, \cdots, X_n) is a multivariable function/map

    \[\begin{split}&(X_1, X_2, \cdots, X_n):\Omega \to \mathbb R^n\\& \omega \mapsto (X_1(\omega), X_2(\omega), \cdots, X_n(\omega))\end {split}\]

such that for any set B\in \mathcal B^n,

    \[(X_1, X_2, \cdots, X_n)^{-1}(B)=\{\omega \in \Omega| \big(X_1(\omega), X_2(\omega), \cdots, X_n(\omega)\big)\in B \}\in \mathcal A\]

Note that (\mathbb R^n, \mathcal B^n) is a measurable space where \mathbb R^n can be regarded as a sample space and \mathcal B^n as the event set.

For a measurable space (\Omega, \mathcal A), it can be shown that if Y_i is a random variable Y_i:\Omega \to \mathbb R, then a multivariable function (Y_1, Y_2, \cdots Y_n):\Omega \to \mathbb R^n is a random vector.

Example: Consider the experiment of selecting a single object (or a person) from a collection of n objects (the population), each denoted and indexed as \omega_i. Regarding the measurable space, the sample space is \Omega = \{\omega_1,\cdots, \omega_n\} and the event set, i.e. σ-algebra is 2^\Omega. A random vector X: \Omega\to \mathbb R^4 can be defined as,

    \[(X_1(\omega), X_2(\omega),X_3(\omega), X_4(\omega)):=(\text{length of object},\text{height of object},\text{weight of object},  \text{color of object})\]

The color of an object is already quantified by a color code or RGB values.

Note that the set S = \{(x_1,x_2,x_3,x_4) | x_i\in \mathbb R \}, where x_i is the value/realization of X_i, is itself a sample space of all possible outcomes of the set of measurements. Therefore, we can equivalently re-define the experiment as: the experiment of measuring the length, height, weight, and color (code) of an object in a population. By this approach, the sample space is not the discrete one, i.e. all the n objects, but all possible 4-tuples of real numbers, i.e. the entire \mathbb R^4. The σ-algebra or event set in this case is \mathcal B^4.

Example (Sampling from a population): randomly selecting k objects or measuring/observing k quantities out of a countable (finite or infinite) or uncountable collection of objects or quantities. For instance, selecting k individuals out of a population, or measuring the lengths of k objects. In this case the measurable space is,

    \[(\Omega =\{(O_1,\cdots, O_k)\}, \mathcal A=2^\Omega)\]

in the case of objects, and

    \[(\Omega =\{(q_1,\cdots, q_k)| q_i\in \mathbb R\}, \mathcal A=\mathcal B^k)\]

in the case of measurements.

We can define a random vector X:\Omega \to \mathbb R^k as, (X_1(\omega), \cdots, X_k(\omega)) where,

    \[X_i(\omega)=f_i(\omega_i)\quad \text{s.t}\quad f_i:\omega_i\mapsto y\in \mathbb R\]

Remark: When an experiment is about measurement(s) of real quantity(ies) or it consists of repetition of the measurement(s), the sample space can be considered as the Cartesian product of real spaces \mathbb R or its subsets. Then, each member of the sample space can be reached by a random variable/vector being the identity function.

Example: measuring n different quantities or repeating a measurement of the same quantity n times has a sample space \Omega = I_1\times\cdots\times I_n \equiv \{(\omega_1, \cdots, \omega_n)| \omega_i\in I_i\subset\mathbb R\}; if the same quantity is being measured, then I_i=I_j\ \forall i,j.

Definition: the range of a random vector X, R_X:=X(\Omega)\subset \mathbb R^n, is called the support of X.

Note that if S_X is the support of a random vector/variable X, and S_Y is the support of another random vector/variable Y, then the support of the random vector Z:=(X,Y) satisfies S_Z\subseteq S_X\times S_Y, with equality e.g. when X and Y are independent.

2-5 Distributions of random variables and vectors

Definition: Let X: (\Omega, \mathcal A, P)\to (\mathbb R, \mathcal B) be a random variable, i.e. a measurable function X:\Omega \to \mathbb R. Then, the function P \circ X^{-1} is a probability measure on (\mathbb R, \mathcal B) and is called the distribution of X under P.

The term P \circ X^{-1} is also referred to as the induced probability measure or the pushforward measure on (\mathbb R, \mathcal B). To prove that it is a probability measure, we show that it satisfies the properties of a probability measure. Note that because of the properties of the measure P and the pre-image X^{-1}, the function P \circ X^{-1} is a measure, i.e. a set function; therefore,

    \[P \circ X^{-1} :\mathcal B\to [0,1]\]

As a result, given a probability space (\Omega, \mathcal A, P) and a random variable X, we can construct a probability space (\mathbb R, \mathcal B, P_X) with P_X(E):=P \circ X^{-1}(E) for E\in \mathcal B. For the sake of notation, we can write:

    \[S\in\mathcal B,\quad P_X(S)=P\circ X^{-1}(S)=P(\{\omega|X(\omega)\in S\})\equiv P(\{X\in S\})\equiv P(X\in S)\]

Note that \{\omega|X(\omega)\in S\} \in \mathcal A.
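As a minimal numerical sketch of the pushforward measure on a finite space (assuming a fair die; the helper `P_X` is illustrative), the measure of a Borel set S is the P-measure of its pre-image:

```python
from fractions import Fraction

# Finite probability space: a fair die, Omega = {1,...,6}, uniform P.
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}

# Random variable X(w) = w mod 2 (1 for odd outcomes, 0 for even).
def X(w):
    return w % 2

def P_X(S):
    """Pushforward measure P o X^{-1}: total P of the pre-image of S."""
    return sum(P[w] for w in omega if X(w) in S)

p_odd = P_X({1})     # P(X = 1) = P({1, 3, 5})
total = P_X({0, 1})  # P_X of the whole range must be 1
```

Here `P_X({1})` sums P over the pre-image \{1,3,5\}, illustrating that P_X inherits countable additivity and total mass 1 from P.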

Definition: Let (\Omega, \mathcal A, P) be a probability space and X:=(X_1, X_2, \cdots, X_n):\Omega \to \mathbb R^n be a random vector. The induced probability measure P\circ X^{-1} on (\mathbb R^n, \mathcal B^n) is called the distribution of the random vector or the joint distribution of X_1, X_2, \cdots and X_n.

Note that for X:=(X_1, X_2, \cdots, X_n):\Omega \to \mathbb R^n, the inverse image X^{-1} means,

    \[S\in\mathcal B^n,\quad X^{-1}(S)=\{\omega|X(\omega)\in S\}\]

If S is as S=\{(s_1,\cdots , s_n)| s_i\in I_i \text{ and } I_i\subset \mathbb R\}\equiv S=I_1\times\cdots\times I_n, then,

    \[X^{-1}(S)=X_1^{-1}(I_1)\cap\cdots \cap X_n^{-1}(I_n)\]

Remark: For any set S\subset \mathbb R^n,\ n\ge 1, when we write P(X\in S) we imply using the distribution of X and recruiting P_X(S) := P\circ X^{-1}(S).

Remark: Note that P_X(S) := P\circ X^{-1}(S) is a probability measure; by the properties of an inverse image, if S_1 and S_2 are disjoint then X^{-1}(S_1) \cap X^{-1}(S_2) = \emptyset, i.e. disjoint; hence, P_X(S_1 \cup S_2)=  P_X(S_1) + P_X(S_2). This is true for any countable union of disjoint sets in the range/codomain of X.

2-6 Distribution functions of random variables and vectors

The distribution of a random variable is a probability measure, i.e. a set function. Here, the distribution function or the cumulative probability distribution function of the random variable is defined.

Definition: Let X: (\Omega, \mathcal A, P)\to (\mathbb R, \mathcal B) be a random variable. The function F: \mathbb R \to [0,1] defined as,

    \[F(x):=P\circ X^{-1}\big ( (-\infty, x]\big ) = P_X((-\infty, x])\equiv P(X \le x)\]

is called the cumulative distribution function (CDF) of X.

Properties of the distribution function

  1. Bounded as 0\le F(x) \le 1.
  2. Non-decreasing (monotone), i.e. x_1 < x_2 \implies F(x_1)\le F(x_2).
  3. F(-\infty)=0 and F(+\infty)=P(\Omega)=1.
  4. Right-continuous, i.e. F(x^+):=\lim_{t\to x^+}F(t)=F(x).
  5. From the left, F(x^-)=P\circ X^{-1}\big ( (-\infty, x] \setminus \{x\} \big ) =P_X((-\infty, x]) - P_X(\{x\}) = F(x) - P_X(\{x\}) holds.

By above we can conclude,

  1. P(X\in (a,b]):=P_X((a,b]) = F(b) - F(a).
  2. P(X\in (a,b)):=P_X((a,b)) = F(b^-) - F(a).
  3. P(X\in [a,b)):=P_X([a,b)) = F(b^-) - F(a^-).
  4. P(X\in [a,b]):=P_X([a,b]) = F(b) - F(a^-).

Proofs: use B\subset A \implies \mu(A \setminus B)=\mu (A) - \mu (B) and,

  1. use P_X((a,b])=P_X((-\infty,b]\setminus (-\infty, a]).
  2. use P_X((a,b))=P_X((-\infty,b)\setminus (-\infty, a]) and note that P_X((-\infty,b)) = F(b^-).
  3. use P_X([a,b))=P_X((-\infty,b)\setminus (-\infty, a)).
  4. use P_X([a,b])=P_X((-\infty,b]\setminus (-\infty, a)).
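The four interval formulas can be checked on a small discrete distribution, where F has jumps so that F(b^-) \ne F(b) (a sketch; the three-atom distribution is made up):

```python
from fractions import Fraction

# Discrete X with atoms at 1, 2, 3, each of probability 1/3.
atoms = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}

def F(x):
    """CDF: P(X <= x)."""
    return sum(p for a, p in atoms.items() if a <= x)

def F_minus(x):
    """Left limit F(x^-) = P(X < x) = F(x) - P(X = x)."""
    return F(x) - atoms.get(x, Fraction(0))

# The four interval formulas for a = 1, b = 2:
p_open_closed = F(2) - F(1)              # P(X in (1, 2]) -> only the atom at 2
p_open_open   = F_minus(2) - F(1)        # P(X in (1, 2)) -> no atoms
p_closed_open = F_minus(2) - F_minus(1)  # P(X in [1, 2)) -> only the atom at 1
p_closed      = F(2) - F_minus(1)        # P(X in [1, 2]) -> atoms at 1 and 2
```

Each formula indeed counts exactly the atoms inside the corresponding interval.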

Example: Let ([0,1), \mathcal B([0,1)), P) be a probability space. The probability measure for this space is defined as,

    \[\forall E\in \mathcal B([0,1)),\quad P(E)=\frac{\mu(E)}{\mu([0,1))}=\mu(E)\]

where \mu is the Lebesgue measure (of intervals in \mathbb R).

Consider a random variable X: [0,1)\to \mathbb R defined as X(\omega)=-\log_e(1-\omega). Its distribution function is determined as follows.

    \[F_X (x)= P\circ X^{-1}((-\infty, x])\]

Because X(\omega) is a strictly increasing function (\frac{\mathrm{d}X}{\mathrm d \omega} > 0) in its domain, it has an inverse function (which is also a strictly increasing function). Therefore, X^{-1}:  R \to [0,1) is \omega = X^{-1}(x)=1-e^{-x} where R = [0, +\infty) is the range of X(\omega). This leads to,

    \[\begin{split}F_X(x) &= P\circ X^{-1}((-\infty, x])= P\circ X^{-1}((-\infty, 0)\cup [0, x])=P (X^{-1}((-\infty, 0))\cup X^{-1}([0, x]))\\&=P\circ X^{-1}((-\infty, 0)) + P\circ X^{-1}([0, x])=P(\emptyset) + P([0, 1-e^{-x}])\\&=0 + \mu([0, 1-e^{-x}])= 1-e^{-x}\quad\text{s.t}\quad x\in [0,\infty)\\&\therefore F_X(x)=\begin{cases}0 & x<0 \\ 1-e^{-x} & x\ge 0 \end{cases}\end{split}\]
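The result can be checked by simulation: drawing \omega uniformly from [0,1) and applying X(\omega)=-\log_e(1-\omega) should reproduce the CDF 1-e^{-x} (a Monte Carlo sketch; the sample size and test points are arbitrary choices):

```python
import math
import random

random.seed(0)

# X(omega) = -ln(1 - omega) with omega uniform on [0, 1).
N = 100_000
samples = [-math.log(1.0 - random.random()) for _ in range(N)]

def empirical_cdf(x):
    """Fraction of samples <= x."""
    return sum(1 for s in samples if s <= x) / N

# Compare the empirical CDF against F_X(x) = 1 - e^{-x} at a few points.
errs = [abs(empirical_cdf(x) - (1 - math.exp(-x))) for x in (0.5, 1.0, 2.0)]
```

This is the inverse-transform view: X is the inverse of the CDF applied to a uniform variable, so its distribution is exponential with rate 1.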

Definition: Let (\Omega, \mathcal A, P) be a probability space and X:=(X_1, X_2, \cdots, X_n):\Omega \to \mathbb R^n be a random vector, and P_X := P\circ X^{-1} be the induced probability measure on (\mathbb R^n, \mathcal B^n). Then, the function F_X: \mathbb R^n \to [0,1] defined as,

    \[\begin{split}F_X(x)&:=P\circ X^{-1}\big ( (-\infty, x_1]\times\cdots\times (-\infty, x_n] \big ) = P_X\big ( (-\infty, x_1]\times\cdots\times (-\infty, x_n] \big ) \\ &= P\big(X_1^{-1}((-\infty, x_1])\cap\cdots \cap X_n^{-1}((-\infty, x_n])\big) = P(\{\omega\in \Omega| X_1(\omega)\le x_1, \cdots, X_n(\omega)\le x_n \})\\&\equiv P(X_1\le x_1, \cdots, X_n\le x_n)\end{split}\]

is called the (multivariate) CDF of the random vector X or the joint CDF of X_1, X_2, \cdots and X_n.

Properties of a multivariate CDF

  1. Bounded as 0\le F(x) \le 1.
  2. Monotonically non-decreasing for each of its variables.
  3. Right-continuous in each of its variables.
  4. \lim_{x_1,\cdots,x_n\to +\infty} F(x_1,\cdots,x_n)=1 and \lim_{x_i\to -\infty} F(x_1,\cdots,x_n)=0 for all i.

2-6-1 Marginal distribution function

Setting a probability space (\Omega, \mathcal A, P), let X:=(X_1, \cdots , X_n) be a random vector and F_{X_1, \cdots, X_n } :=F_X be the CDF of X. Also, F_{X_i} denotes the CDF of X_i for any i=1,\cdots , n. Then, for any particular random variable, X_i, we can write,

    \[\begin{split}&\lim_{\underset{j\ne i}{x_j\to +\infty}} F_{X_1, \cdots, X_n }(x_1, \cdots, x_i, \cdots, x_n)\\&=P\circ X^{-1}\big ( (-\infty, \infty)\times\cdots\times (-\infty, x_i] \times\cdots \times (-\infty, \infty) \big )\\&=P\circ X^{-1}(\mathbb R\times\cdots\times (-\infty, x_i]\times \cdots\times \mathbb R)=P(X_1^{-1}(\mathbb R)\cap\cdots\cap X_i^{-1}((-\infty,x_i])\cap\cdots\cap X_n^{-1}(\mathbb R))\\&=P(\Omega\cap\cdots\cap X_i^{-1}((-\infty,x_i]) \cap\cdots\cap \Omega)=P\circ X_i^{-1}((-\infty,x_i])\\&=F_{X_i}(x_i)\end{split}\]

Remark: For any subset of random variables, a joint marginal CDF can be obtained through the limit. For example, F_{X_1,X_3}(x_1,x_3)=\lim_{(x_2,x_4)\to \infty} F_{X_1,X_2,X_3,X_4}(x_1,x_2,x_3,x_4).

2-7 Probability density function

For the sake of abbreviation, we define the following notation for x=X(\omega) where X:\Omega\to \mathbb R^n,

    \[ X\le x \equiv X_1\le x_1, \cdots, X_n\le x_n\]

For a random variable/vector X:(\Omega, \mathcal A, P)\to (\mathbb R^n, \mathcal B^n), in some cases, there is a non-negative integrable function f:\mathbb R^n \to \mathbb R, such that,

    \[P(X \le x) = F(x) = \int_{(-\infty, x_1]\times\cdots\times (-\infty,x_n]} f(\tilde x) \mathrm d \tilde x \]

where \mathrm d \tilde x pertains to the differential volume of \mathbb R^n, and f(x) is called the probability density function (pdf) of the random variable/vector X. This allows us to write

    \[\frac{\mathrm dF}{\mathrm dx}=f(x) \equiv \frac{\partial^n F}{\partial x_1 \cdots  \partial x_n} = f(x_1,\cdots,x_n) \]

By this definition,

    \[P(X \in S) \equiv P_X(S) \equiv P\circ X^{-1}(S)= \int_{S} f(x) \mathrm dx \]
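For the exponential CDF derived in the earlier example, the relation between F and f can be verified numerically: the derivative of F recovers f, and integrating f over an interval recovers the probability F(b)-F(a) (a sketch; step sizes are arbitrary):

```python
import math

F = lambda x: 1 - math.exp(-x)  # CDF of the earlier example
f = lambda x: math.exp(-x)      # its pdf, f = dF/dx

# dF/dx = f(x): central-difference check at x = 1.
h = 1e-6
deriv = (F(1 + h) - F(1 - h)) / (2 * h)

# P(X in (a, b]) = F(b) - F(a) as the integral of f: midpoint rule.
a, b, n = 0.2, 1.5, 10_000
dx = (b - a) / n
integral = sum(f(a + (k + 0.5) * dx) for k in range(n)) * dx
```

Both checks agree to high accuracy, illustrating the two directions of the pdf/CDF relation.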

Definition: For a random vector, the pdf is referred to as the joint pdf.

Proposition: Let f(x):\mathbb R^n \to \mathbb R be the pdf of a random vector X; then,

    \[ f(x) = \lim_{\Delta x \to 0}\frac{P(x < X \le x+\Delta x)}{\Delta x_1\cdots \Delta x_n}\]

Proof: use P(x < X \le x+ \Delta x)= \int_x^{x + \Delta x} f(\tilde x) \mathrm d\tilde x \approx f(x)\,\Delta x_1\cdots \Delta x_n for small \Delta x.

Remark: a random variable/vector X:\Omega \to \mathbb R^n whose probability distribution P_X is determined by a pdf (on its support) defines a probability space as (\mathbb R^n, \mathcal B^n, P_X) or (S,\mathcal B(S), P_X) where S is the support of X, \mathcal B(S) is the Borel sigma algebra on S, and P_X(E)=\int_E f_X(x) \mathrm d x. Note that if E is a set of measure zero, e.g. a point or a line in a 2D surface or 3D volume, then P(E)=0.

Example: Let X be a real-valued random variable and Y=(Y_1, Y_2) be a random vector. If f_{X,Y}(x,y)\equiv f_{X,Y_1,Y_2}(x,y_1,y_2) is their joint pdf, then we can write,

    \[P(X\in A\subset \mathbb R, Y\in B\subset \mathbb R^2)=\int_A \int_B \ f_{X,Y}(x,y) \mathrm dx \mathrm dy = \int_S  f_{X,Y_1,Y_2}(x,y_1,y_2) \mathrm dx \mathrm d y_1 \mathrm d y_2\]

where S=\{(x,y_1,y_2) | x\in A, (y_1,y_2)\in B\}=A\times B. Note that \mathrm d y_1 \mathrm d y_2 is the product Lebesgue measure of an element of area in B.

Remark: The support of a random vector X\in \mathbb R^n is not always a Cartesian region. It can be any measurable set S\subset \mathbb R^n.

Calculating P(X\in A, Y\in B)

For two random variables/vectors X:\Omega \to \mathbb R^m and Y:\Omega \to \mathbb R^n collected in a random vector (X,Y):\Omega \to \mathbb R^{m+n} having a (joint) pdf, we can write,

    \[P(X\in A, Y\in B)=P\circ (X,Y)^{-1}(\{(x,y)| x\in A, y\in B\})=\underset{\{(x,y)| x\in A, y\in B\}}{\int} f_{X,Y}(x,y)\mathrm dx \mathrm dy\]

If the sets A\subset \mathbb R^m and B\subset \mathbb R^n do not depend on each other, i.e. are not related, then the support of the integration is a rectangular/box region and can be expressed as A\times B.

2-7-1 Marginal pdf

Let X:=(X_1,\cdots,X_n) be a random vector with the following CDF,

    \[ F_{X_1,\cdots, X_n}(x_1, \cdots,x_n)=\int_{-\infty}^{x_1}\cdots  \int_{-\infty}^{x_n}  f_{X_1,\cdots, X_n}(\tilde x_1, \cdots, \tilde x_n)\mathrm d\tilde x_1\cdots \mathrm d\tilde x_n \]


The marginal CDF of X_j is obtained by letting all the other arguments tend to +\infty and rearranging the order of integration,

    \[ \begin{split}F_{X_j}(x_j)&=\int_{-\infty}^{\infty}\cdots \int _{-\infty}^{x_j}\cdots   \int_{-\infty}^{\infty}  f_{X_1,\cdots, X_n}(\tilde x_1, \cdots, \tilde x_n)\mathrm d\tilde x_1\cdots \mathrm d\tilde x_n \\F_{X_j}(x_j)&= \int _{-\infty}^{x_j} \left (  \int_{-\infty}^{\infty}\cdots  \int_{-\infty}^{\infty}  f_{X_1,\cdots, X_n}(\tilde x_1, \cdots, \tilde x_n)\prod_{i\ne j}\mathrm d\tilde x_i \right ) \mathrm d \tilde x_{j} \\& \implies  F_{X_j}(x_j) =  \int _{-\infty}^{x_j} g(\tilde x_j) \mathrm d \tilde x_j\end{split}\]

indicating that f_{X_j}(x_j) =  g(x_j), i.e.,

    \[ f_{X_j}(x_j) =  \int_{-\infty}^{\infty}\cdots  \int_{-\infty}^{\infty}  f_{X_1,\cdots, X_n}(\tilde x_1, \cdots, x_j, \cdots, \tilde x_n)\prod_{i\ne j}\mathrm d\tilde x_i\]

which is referred to as the marginal pdf.
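A small numerical check of marginalization, assuming the made-up joint pdf f(x,y)=x+y on [0,1]^2 (a valid pdf, since it integrates to 1; its marginal is f_X(x)=x+1/2):

```python
# Joint pdf f(x, y) = x + y on [0, 1]^2 (integrates to 1 over the square).
f = lambda x, y: x + y

def marginal_fx(x, n=10_000):
    """f_X(x): integrate f(x, y) over y in [0, 1] by the midpoint rule."""
    dy = 1.0 / n
    return sum(f(x, (k + 0.5) * dy) for k in range(n)) * dy

# Analytically, f_X(x) = x + 1/2; compare at x = 0.3.
err = abs(marginal_fx(0.3) - (0.3 + 0.5))
```

Integrating out y collapses the two-dimensional density onto the x-axis, exactly as in the displayed formula.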

Remark: we can obtain the joint marginal pdf of a subset of random variables within a sequence/vector (tuple) of random variables by integrating the joint pdf with respect to the random variables that are not in the subset.

Definition: a random variable/vector is called discrete if the support (i.e. the range) of X is a countable set in \mathbb R^n. In this case, the integration becomes a (discrete) sum, and the probability density function is referred to as the probability mass function.

When the joint pdf or CDF of random variables/vectors (X_1,\cdots, X_n):\Omega \to \mathbb R^n is considered (calculated or assumed), the underlying probability space (\Omega, \mathcal F, P) is the combined probability space of the random variables if they have different probability spaces. There is no need to know the underlying probability space and measure of each random variable/vector to find its pdf/CDF; instead, we can use marginal pdfs and CDFs to determine the pdf and CDF of each random variable/vector or any subset of them.

2-8 Independence of events and random variables/vectors

2-8-1 Conditional probability

Let (\Omega, \mathcal E, P_{\Omega}) be a probability space and A, B\in \mathcal E be two events (sets). The occurrence of A (or B) means observing any member(s) of the set A (or B) in a random experiment. Therefore, the occurrence of A\cap B, i.e. A \text{ AND } B, means observing any member(s) of A\cap B, i.e. the occurrence of both A and B. The probability of A\cap B can be determined by the measure P_{\Omega} as P_{\Omega}(A\cap B).

Now assume that the outcome of the experiment is known to be in the set A. In other words, the event A has been observed/occurred, or is assumed to have occurred. Given this, we want to determine the probability of B. This is referred to as conditional probability and written as P_{\Omega}(B|A), read as the probability of the event B given that the event A has been observed. To determine P_{\Omega}(B|A), two points should be considered:

1- Saying given or conditional on A means partitioning the sample space \Omega into A and \Omega\setminus A, and therefore considering only A as the set of possible outcomes. Consequently, a probability space as (A, \mathcal A, P_A) can be established. Note that \mathcal A \subset \mathcal E.

2- P_{\Omega}(B|A) measures the probability of A\cap B as P_A(A\cap B), i.e. P_{\Omega}(B|A)=P_A(A\cap B). In other words, the conditional probability is a measure on the restricted sample space A.

For (A, \mathcal A, P_A), we must have P_A(A)=1 and 0\le P_A(S)\le 1 for any S\in \mathcal A. To satisfy these requirements of the probability measure, P_A is defined as,

    \[\forall S\in \mathcal A,\quad  P_A(S):=\frac{P_\Omega(S)}{P_\Omega(A)}\]

This satisfies the aforementioned conditions and also the other properties of a measure, as P_\Omega(A) is a constant and P_\Omega is already a legitimate (probability) measure. Consequently,


    \[P_{\Omega}(B|A)=P_A(A\cap B)=\frac{P_\Omega(A\cap B)}{P_\Omega(A)}\]

With respect to the above, conditional probability is defined.

Definition: Let (\Omega, \mathcal F, P) be a probability space, and let A, B \in \mathcal F be two events such that P(B)\ne 0. The conditional probability of A given B is

    \[P(A|B)=\frac{P(A\cap B)}{P(B)}\]

Remark: the conditioning operator \cdot\,|\,C partitions the sample space and sets the conditional sample space, C, as the current sample space to be considered.

The conditional probability/measure P(\cdot | \cdot) can be viewed as a set function of either its first or its second argument:

1- The function P(\cdot | B): \mathcal F \to [0,1] is a probability measure on \Omega and hence the following holds:
a) P(A|B) \ge 0 for every A\in \mathcal F.
b) P(\Omega | B) = 1
c) For a countable (finite or infinite) sequence of disjoint events A_i with A=\bigcup_i A_i we have P(A|B)=\sum_i P(A_i | B)

Also note that P(B|B)=1.

2- The function P(A | \cdot): \mathcal F \to [0,1] is obviously a set function, but it is not a probability measure. It satisfies a formula called the formula of total probability. For a countable disjoint partition \{B_i\}_{i\in I} of the space \Omega, where B_i\in \mathcal F, \Omega = \cup_i B_i and P(B_i) \ne 0, the following holds for any fixed A\in \mathcal F,

    \[P(A) = \sum_i P(A\cap B_i) = \sum_i P(A | B_i)P(B_i)\]
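The formula of total probability can be verified on a finite space (a sketch assuming a fair die; the event A and the partition are arbitrary illustrative choices):

```python
from fractions import Fraction

# Finite space: a fair die, Omega = {1,...,6}, uniform P.
P = {w: Fraction(1, 6) for w in range(1, 7)}

def prob(E):
    return sum(P[w] for w in E)

def cond(A, B):
    """P(A|B) = P(A & B) / P(B)."""
    return prob(A & B) / prob(B)

A = {2, 3, 5}                         # the outcome is prime
partition = [{1, 2}, {3, 4}, {5, 6}]  # disjoint partition of Omega

# Law of total probability: P(A) = sum_i P(A|B_i) P(B_i).
total = sum(cond(A, B) * prob(B) for B in partition)
```

Summing the conditional probabilities weighted by the partition probabilities reassembles P(A) exactly.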

Bayes’ formula

In the case of partitioning \Omega=\bigcup_i B_i, the events B_i are (sometimes) referred to as hypotheses and the probabilities P(B_i) are called prior (or pre-experiment) probabilities. As the result of performing an experiment or a trial of an experiment, an event, say A, occurs based on the observation of the experiment’s outcome. Based on the observed event, we want to determine which of the hypotheses B_i is most likely. Therefore, we first find the probability of each B_i given the observation A; the largest value then indicates the most likely hypothesis. This means finding the maximum of P(B_i | A) over i\in I by writing

    \[P(B_i | A) = \frac{P(B_i\cap A)}{P(A)}=\frac{P(A|B_i)P(B_i)}{\sum_k P(B_k)P(A | B_k)}\]

where P(B_i | A) is called the posterior (or post-experiment) probability. The above expression is called Bayes’ formula.

Example: Fire-related events in a forest can be the occurrence of smoke (S) and/or dangerous fires (DF). The probability (relative frequency) of (the occurrence of) dangerous fires is P(DF)=0.01. The probability of observing smoke is P(S)=0.1. Also, we know that 95% of DFs make smoke (among DFs, 95% have smoke), meaning that P(S|DF)=0.95. If smoke is detected, how probable is it that there is a DF? We need to calculate P(DF | S) as,

    \[P(DF | S)=\frac{P(DF)P(S|DF)}{P(S)}=\frac{(0.01)(0.95)}{(0.1)}=0.095\]
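Carrying out the arithmetic (a trivial sketch with the numbers given above; the variable names are illustrative):

```python
# Quantities given in the example (relative frequencies).
P_DF = 0.01          # P(dangerous fire)
P_S = 0.1            # P(smoke)
P_S_given_DF = 0.95  # P(smoke | dangerous fire)

# Bayes' formula: P(DF | S) = P(S | DF) P(DF) / P(S).
P_DF_given_S = P_S_given_DF * P_DF / P_S
```

So detected smoke indicates a dangerous fire only about 9.5% of the time, because dangerous fires are rare a priori.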

For this example, the sample space (all the possibilities) contains the coupled statuses of smoke and fire in the forest; a status can be expressed by a pair (smoke status, fire status). Therefore,

    \[\Omega = \{(\text{S}, \text{DF}),  (\text{S}, \text{No DF}),  (\text{No S}, \text{DF}),  (\text{No S}, \text{No DF}) \}\]

Then, the following events are defined:

1- There is smoke (hence observed) in the forest: S:=\{(\text{S}, \text{DF}), (\text{S}, \text{No DF})\}.

2- There is DF in the forest: DF:= \{(\text{S}, \text{DF}), (\text{No S}, \text{DF}) \}

3- There is a DF and smoke: DF \cap S= \{(\text{S}, \text{DF})\}

4- DF given observation/existence of smoke: DF|S is DF \cap S= \{(\text{S}, \text{DF})\} BUT in a (conditional) sample space S:=\{(\text{S}, \text{DF}), (\text{S}, \text{No DF})\}.

5- and …

And the following probabilities were given (perhaps calculated through their relative frequencies):

    \[P(S)=0.1, P(DF)=0.01, P(S|DF)=0.95\]

Note that \Omega can be partitioned (into disjoint sets) based on the fire or the smoke status. For example:

    \[\Omega =DF\ \cup \ \text{No } DF := \{ (\text{S}, \text{DF}) , (\text{No S}, \text{DF}) \} \cup \{ (\text{S}, \text{No DF}),  (\text{No S}, \text{No DF}) \}\]

And similarly \Omega= S\ \cup\ \text{No } S.

Remark: Sometimes we may consider the occurrence of several events as the condition in a probability of an event. This simply means the intersection of the conditioning events, i.e.

    \[P(A|B_1,\cdots, B_n)=\frac{P(A\cap B_1\cap\cdots \cap B_n)}{ P(B_1\cap\cdots \cap B_n) }\]

2-8-2 Independence

Definition: Regarding \mathcal F, two events A_1 and A_2 are called independent if P(A_1\cap A_2)=P(A_1)P(A_2), or equivalently (when the conditional probabilities are defined) P(A_1 | A_2)=P(A_1) and P(A_2 | A_1)=P(A_2).

The events of \emptyset and \Omega are independent of any event.

Lemma: If A and B are independent events, then each (\bar A, B), (A,\bar B), and (\bar A, \bar B) is also an independent pair of events.

Proof: use \bar A=\Omega \setminus A and P(\bar A) = 1 - P(A) for any set A.

Definition: A finite set of events \{A_i\}_{i=1} ^n, all in \mathcal F, is called pairwise independent if P(A_i\cap A_j)=P(A_i)P(A_j) for any i\ne j.

Definition: A finite set of events \{A_i\}_ {i=1}^n, all in \mathcal F, is called completely independent if P(\cap_ {i=1}^n A_i)=\prod_{i=1}^n P(A_i).

Definition: A finite set of events \{A_i\}_ {i=1}^n, all in \mathcal F, is called mutually independent if every event is independent of any intersection of the other events, i.e.

    \[\begin{split} P(A_i\cap A_j)&=P(A_i)P(A_j) \\  P(A_i\cap A_j\cap A_k)&=P(A_i)P(A_j)P(A_k)\\ \cdots \\ P(\cap_ {i=1}^n A_i)&=\prod_{i=1}^n P(A_i)\end{split}\]

Therefore, there are \binom{n}{2} + \binom{n}{3} + \cdots +  \binom{n}{n} conditions to be satisfied.
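These conditions can be checked exhaustively on a small example (a sketch assuming three fair coin flips, for which the events "flip i is heads" are mutually independent; the helper names are illustrative):

```python
import math
from fractions import Fraction
from itertools import combinations, product

# Omega: three fair coin flips, uniform measure.
omega = list(product("HT", repeat=3))
p_atom = Fraction(1, len(omega))

def prob(E):
    return p_atom * len(E)

# A_i: the i-th flip is heads.
A = [{w for w in omega if w[i] == "H"} for i in range(3)]

def mutually_independent(events):
    """Check P(intersection) = product of P's for every sub-collection of size >= 2."""
    for k in range(2, len(events) + 1):
        for sub in combinations(events, k):
            if prob(set.intersection(*sub)) != math.prod(prob(E) for E in sub):
                return False
    return True

ok = mutually_independent(A)
n_conditions = sum(math.comb(3, k) for k in range(2, 4))  # C(3,2) + C(3,3) = 4
```

For n = 3 there are exactly 4 factorization conditions, matching the count \binom{n}{2} + \cdots + \binom{n}{n} above.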

Remark: It should be noted that pairwise independence does not imply complete independence nor mutual independence. However, mutual independence implies the other ones. See Feller (1971) for examples.

Example: Disjoint events A, B with A \cap B=\emptyset and P(A)P(B)\ne 0 are not independent.

Product spaces and independent experiments/trials

Consider a set of finite number of probability spaces \{(\Omega_i, \mathcal F_i, P_i)\}_{i=1}^n and the Cartesian product space (\Omega = \Omega_1\times \cdots\times \Omega_n, \mathcal F, P) where \mathcal F and P are the sigma algebra and probability measure of the product space. The product sample space is a set of ordered tuples as \{(\omega_1, \cdots, \omega_n)| \omega_i\in \Omega_i\}.

Definition: A set of experiments (a compound experiment) with their sample spaces collected as \{\Omega_i\}_{i=1}^n is called independent if its sample space \Omega is the Cartesian product of the sample spaces of the experiments, i.e. \Omega = \Omega_1\times \cdots\times \Omega_n. In this regard, the sigma algebra and the probability measure of a set of independent experiments are, by convention, the sigma algebra of the product space and the product measure on this space. This means if A_i\in \mathcal F_i and A:= A_1\times \cdots \times A_n \in \mathcal F, then

    \[P(A_1\times \cdots \times A_n)= \prod_{i=1}^n P_i(A_i)\]

A product space (with the aforementioned sigma algebra and measure) can be used to describe the sample space of a succession/sequence of experiments or trials of an experiment (i.e. the same experiment repeated several times) if the set of the experiments or trials is independent, i.e. the sample space of the set of experiments can be expressed as a product space. A sequence of experiments may or may not impose/indicate any realistic (time) order of the experiments.

As a side note, an example of a dependent (not independent) set of experiments can be as follows. Experiment 1 is choosing a number from \Omega_1:=\{1,2,4,5,6\} and experiment 2 is choosing a letter out of \Omega_2 = \{a,b,c,d,e,z\}. Now we can define a compound experiment such that if 5 is chosen in experiment 1, then z is no longer available in experiment 2. Therefore, the combined/sequential sample space is not equal to \Omega_1\times \Omega_2, as it depends on the result of the first experiment; clearly, the sample space misses (5,z).

We can define independent trials of an experiment by independently repeating an experiment. In this case, \Omega_0:=\Omega_1=\cdots = \Omega_n and P_0 :=P_1= \cdots = P_n, resulting in the space of the whole experiment (\Omega_0^n, \mathcal F, P) and

    \[P(A_1\times \cdots \times A_n)= \prod_{i=1}^n P_0(A_i)\]

Remark: When a product space is (assumed to be) the sample space of a sequence of experiments, it is assumed that the outcome of each (sub) experiment/trial does not affect the outcomes of any other experiment/trial.

Proposition: A set of events

    \[\{(A_1\times \Omega_2 \times \cdots \times \Omega_n), (\Omega_1 \times A_2 \times \Omega_3 \times \cdots \times \Omega_n ), \cdots,  (\Omega_1 \times \cdots \times \Omega_{n-1}\times A_n)\}\]

of independent experiments or trials (of an experiment) where A_i \in \mathcal F_i, is mutually independent.

Proof: since the experiments or the trials are independent, any arbitrary n number of events \{A_i | A_i \in \mathcal F_i\}_{i=1}^n satisfies

    \[P(A_1\times \cdots \times A_n)= \prod_{i=1}^n P_i(A_i)\]

The set A_1\times \cdots \times A_n \subset \Omega_1\times \cdots \times \Omega_n can be written as

    \[A_1\times \cdots \times A_n  = (A_1\times \Omega_2\times\cdots \times \Omega_n) \cap (\Omega_1\times A_2 \times\cdots \times \Omega_n) \cap \cdots \cap  (\Omega_1\times \Omega_2 \times \cdots \times \Omega_{n-1}\times A_n) \]

Letting B_i:=\Omega_1\times \cdots\times A_i \times \cdots \times \Omega_n, we can write,

    \[ P(\cap_{i=1}^n B_i)= \prod_{i=1}^n P_i(A_i)\]

which implies P(B_i) = P_i(A_i) because P on \Omega as the product space of \Omega_is is a product measure indicating P(B_i)= P_1(\Omega_1)\cdots P_i(A_i)\cdots P_n(\Omega_n) where P_k(\Omega_k)=1.

Since the same argument applies to any sub-collection of the B_is, P(\cap_i B_i)= \prod_i P(B_i) holds for every sub-collection, and the set of events \{B_i\}_{i=1}^n is mutually independent \blacksquare

Remark: Sometimes the events A_i (each of a different experiment) are said to be mutually independent; this in fact means that the B_is (of the combined experiment) are mutually independent. In view of the above, this is more about the independence of the experimental procedures themselves.

Remark: Regarding a product sample space, P(\Omega_1\times A_i \times \cdots \times \Omega_n)=P_i(A_i)

Remark: For independent experiments, the (joint) probability of observing the (joint) event A_1 \times \cdots \times A_n, or literally the probability of observing A_1 and A_2 and \cdots and A_n, is

    \[P(A_1\times \cdots \times A_n)=  \prod_{i=1}^n P_i(A_i) \]

Example: An experiment consists of flipping a coin and independently rolling a die, simultaneously or successively. If C stands for the coin flip and D for the die roll, the following holds:

    \[\begin{split} \Omega_C=\{H,T\},\quad \mathcal F_C &=\{\{H\}, \{T\}, \Omega_C, \emptyset\},\quad \Omega_D=\{1,2,3,4,5,6\},\quad \mathcal F_D=2^{\Omega_D}\\\Omega= \Omega_C\times \Omega_D &= \{(\omega_1, \omega_2)| \omega_1\in \Omega_C, \omega_2\in \Omega_D    \}, \quad \mathcal F = 2^\Omega\end{split}\]

If A_C\in \mathcal F_C and A_D \in \mathcal F_D, then P(A_C \times A_D ) = P_C(A_C)P_D(A_D).

Example: assume that we observed T in the coin flip; what is the probability of observing an odd number in rolling the die?

In the product space (whole experiment), the event regarding observing T is \{T\}\times \Omega_D, and the event of observing an odd number is \Omega_C\times \{1,3,5\}. Therefore,

    \[\begin{split}&P(\{\Omega_C\times \{1,3,5\}\}|\{T\}\times \Omega_D)=\frac{P(\Omega_C\times \{1,3,5\}\cap \{T\}\times \Omega_D)}{P(\{T\}\times \Omega_D)}=\frac{P(\{T\}\times \{1,3,5\})}{P(\{T\}\times \Omega_D)}\\&=\frac{P(\{(T,1)\}\cup \{(T,3)\} \cup \{(T,5)\})}{P_C(\{T\})P_D( \Omega_D)}=\frac{P(\{(T,1)\})+ P(\{(T,3)\}) + P(\{(T,5)\})}{P_C(\{T\})}\\&=\frac{P(\{T\}\times \{1\})+ P(\{T\}\times \{3\}) + P(\{T\}\times \{5\})}{P_C(\{T\})}=\frac{P_C(\{T\})P_D(\{1\})+ P_C(\{T\})P_D(\{3\}) + P_C(\{T\})P_D(\{5\})}{P_C(\{T\})}\\&= P_D(\{1\})+ P_D(\{3\}) +P_D(\{5\})\end{split}\]
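The chain of equalities above can be reproduced by direct enumeration on the product space (a sketch; the set names `tails` and `odd` are illustrative):

```python
from fractions import Fraction
from itertools import product

# Product sample space of a fair coin and a fair die, with product measure.
omega = list(product("HT", range(1, 7)))
P = {w: Fraction(1, 2) * Fraction(1, 6) for w in omega}

def prob(E):
    return sum(P[w] for w in E)

tails = {w for w in omega if w[0] == "T"}    # {T} x Omega_D
odd = {w for w in omega if w[1] % 2 == 1}    # Omega_C x {1, 3, 5}

# P(odd | tails) = P(odd & tails) / P(tails) = P_D({1, 3, 5}).
p = prob(odd & tails) / prob(tails)
```

The conditional probability collapses to the die's own measure of \{1,3,5\}, i.e. 1/2, exactly as the derivation concludes.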

Remark: For n independent experiments with discrete (countable) sample spaces, we can write

    \[P(\{(\omega_1,\cdots, \omega_n)\}) = \prod_{i=1}^n P_i(\{\omega_i\}) \]

where (\omega_1,\cdots, \omega_n)\in \Omega and \{(\omega_1,\cdots, \omega_n)\}\in \mathcal F is a simple event. Also, \{\omega_i\} \in \mathcal F_i is a simple event.

Example: Consider a set of independent experiments on selecting two real numbers from two intervals; therefore, we can write \Omega_1 =[a,b], \Omega_2 =[c,d], and \Omega = [a,b] \times  [c,d]. \mathcal F_1 and \mathcal F_2 are the Borel sets on the corresponding sample spaces. Defining events A=[1.0,2)\in \mathcal F_1 and B= (70, 110]\in \mathcal F_2, we can write P(A\times B) = P(A\times \Omega_2 \cap \Omega_1\times B) =P_1(A)P_2(B) and say the probability of observing a number in A (or equivalently in A\times \Omega_2) AND a number in B (or equivalently in \Omega_1 \times B) is P_1(A)P_2(B).

2-8-3 Independence of random variables/vectors

Definition [independent pair of random variables]: Consider (\Omega, \mathcal F, P) and random variables X_1:\Omega \to \mathbb R and X_2:\Omega \to \mathbb R. We say X_1 and X_2 are independent if for any x,y \in \mathbb R, the events A:=\{\omega: X_1(\omega) \le x\}\in \mathcal F and B:=\{\omega: X_2(\omega) \le y\}\in \mathcal F are independent, i.e. P(A\cap B)=P(A)P(B). Equivalently, we say X_1 and X_2 are independent if for any sets S_1, S_2 \in \mathcal B, the events X_1^{-1}(S_1)\in \mathcal F and X_2^{-1}(S_2) \in \mathcal F are independent.

Proposition: By the above, and considering the joint CDF of the combined random variables (as a random vector) X_1 and X_2 as F_{X_1, X_2}(x_1, x_2), we can say that two random variables are independent iff,

    \[F_{X_1, X_2}(x_1, x_2)= F_{X_1}(x_1)F_{X_2}(x_2),\quad  \forall x_1, x_2\]

This is also applicable to their pdfs if they exist, i.e.

    \[f_{X_1, X_2}(x_1, x_2)= f_{X_1}(x_1)f_{X_2}(x_2)\]

Definition: A finite set of random variables \{X_i\}_{i=1}^n is pairwise independent if every pair of the random variables is independent.

Definition: A finite set of random variables \{X_i\}_{i=1}^n is mutually independent if for any set of numbers \{x_1, \cdots, x_n\} the events \{X_1 \le x_1\},\cdots , \{X_n \le x_n\} are mutually independent events.

Proposition: By the above definition, the combined random variables (as a random vector) \{X_1,\cdots , X_n\} are mutually independent iff

    \[ F_{X_1, \cdots , X_n}(x_1, \cdots, x_n)= F_{X_1}(x_1)\cdots F_{X_n}(x_n)  ,\quad  \forall x_1, \cdots , x_n \]

Because the above holds for every x_i in the codomain/range of each random variable, then by setting x_k=\infty for any k, we can write,

    \[\begin{split}F_{X_1, \cdots, X_n}(x_1, \cdots, x_{k-1}, \infty, x_{k+1},\cdots, x_n)&= F_{X_1}(x_1)\cdots\ F_{X_k}(\infty) \cdots F_{X_n}(x_n)\\  &= F_{X_1}(x_1)\cdots F_{X_{k-1}}(x_{k-1}) F_{X_{k+1}}(x_{k+1}) \cdots F_{X_n}(x_n) \end{split} \]

which indicates that if F_{X_1, \cdots , X_n}(x_1, \cdots, x_n)= F_{X_1}(x_1)\cdots F_{X_n}(x_n) for all x_1, \cdots , x_n, then any number and combination of the random variables also satisfies the product rule, and hence they are mutually independent.

Remark: The above proposition also holds for the pdfs if available.

Proposition: The above propositions can be also expressed in terms of other Borel sets.

Definition: Two random vectors (on the same space) X:\Omega \to \mathbb R^m and Y:\Omega \to \mathbb R^n are called independent if F_{X,Y}(x,y)=F_X(x)F_Y(y) for all x\in \mathbb R^m and y\in \mathbb R^n.

Definition: n random variables \{X_i\}_1^n are called independent and identically distributed (iid) if F_{X_1, \cdots , X_n}(x_1, \cdots, x_n)= F_{X_1}(x_1)\cdots F_{X_n}(x_n) for all x_i\in \mathbb R, and F_{X_i}=F_{X_j} for all i,j\in \{1,\cdots, n\}.

Example: Consider an experiment of randomly selecting a person from a group and measuring his/her height and weight. The sample space (of the experiment) \Omega is the group of people, and the height and weight of a randomly selected person are two random variables denoted as H:\Omega \to \mathbb R and W:\Omega \to \mathbb R. Obviously, H and W are not (necessarily) independent because F_{H,W}(h,w)\ne F_H(h)F_W(w). This is because in this case we know from the nature of the experiment that the height and weight of a person are not independent of each other. To elucidate this, consider the range/support of H denoted as S_H\subset \mathbb R and the range of W as S_W\subset \mathbb R. If we randomly choose a number from S_H and then want to randomly choose a number from S_W, we get restricted to a subset of S_W rather than the entire set (because not every height naturally corresponds to every weight value). This indicates that the sample space of the combined experiment of choosing height and weight values (from S_H and S_W) is not equal to S_H\times S_W. As an example, the heights of infants do not correspond to the weights of adults.

Now, if we repeat this experiment n times to measure the heights and weights of n individuals, then each (height, weight) pair of measurements pertaining to each trial is independent of the other pairs of measurements. We can also say that the measurements of the same type are independent, i.e. the measurements of the heights/weights are independent. Considering these two views we can deduce:

    \[\{ X_1=(H_1, W_1), \cdots,  X_n=(H_n, W_n)\},  \quad F_{X_i}(h_i, w_i)\equiv F_{H_i,W_i}(h_i,w_i) = F_{H,W}(h_i,w_i) \]

and then,

    \[F_{X_1,\cdots, X_n}( x_1,\cdots, x_n) = \prod_{i=1}^n F_{H,W}(h_i,w_i) \]

Similarly, considering the same-type measurements,

    \[\{H_i\}_{i=1}^n,  \{W_i\}_{i=1}^n, \quad \text{with } F_{H_i}(h_i) = F_{H}(h_i), \text{ and }  F_{W_i}(w_i) = F_{W}(w_i) \]

results in

    \[ F_{H_1,\cdots, H_n} (h_1,\cdots, h_n)= \prod_{i=1}^n F_{H}(h_i), \quad  F_{W_1,\cdots, W_n}(w_1,\cdots, w_n) = \prod_{i=1}^n F_{W}(w_i)\]

Remark: It may not always be clear that two or more measurements expressed by random variables/vectors are independent, even if we assume a Cartesian product of real intervals for their combined sample space. In other words, the combined sample space of independent random variables is the Cartesian product of the sample spaces, but not vice versa.

2-8-4 Conditional CDF and PDF

Let X:(\Omega, \mathcal F, P) \to \mathbb R^m and Y:(\Omega, \mathcal F, P) \to \mathbb R^n be two random vectors/variables and A\subset \mathbb R^m, B\subset \mathbb R^n. By the conditional probability we can write

    \[P(X\in A | Y\in B)=\frac{P( X\in A , Y\in B)}{P(Y\in B)} \equiv  \frac{P( X^{-1}(A) \cap Y^{-1}(B))}{P(Y^{-1}(B))} \]

Now let F_{X,Y}(x,y) be the joint CDF of the random vectors/variables, A=\prod_{i=1}^m (-\infty, x_i] and B=\prod_{i=1}^n (-\infty, y_i]; therefore,

    \[ F_{X|Y}(x|y) := P(X\in A | Y\in B)=\frac{F_{X,Y}(x,y)}{F_Y(y)}, \quad x=(x_1,\cdots, x_m),  y=(y_1,\cdots, y_n) \]

where F_{X|Y}(x|y) as a function of x is called the conditional CDF of X given Y\le y.

Remark: In the above, we would use the combined sample space and measure if the sample spaces were different. Then, the CDF of each variable can be obtained through the marginal distribution. Recall that A\in \mathcal F_1 appears as A\times \Omega_2 \in \mathcal F for (\Omega_1\times \Omega_2,\mathcal F, P).

Let X and Y be two continuous random vectors/variables with a joint pdf f_{X,Y}(x,y). We can write,

    \[ \begin{split} P(x<X<x+\Delta x| y<Y<y+\Delta y)&=\frac{P(x<X<x+\Delta x, y<Y<y+\Delta y)}{ P(y<Y<y+\Delta y)}\\&=\frac{\int_{[x,x+\Delta x]\times [y,y+\Delta y]}f_{X,Y}(x,y)\mathrm d x \mathrm d y}{\int_{[y,y+\Delta y]}f_{Y}(y)\mathrm d y}\end{split}\]

By letting \Delta x, \Delta y to be small enough, we can write,

    \[P(x<X<x+\Delta x| y<Y<y+\Delta y) \approx \frac{f_{X,Y}(x,y)\|\Delta x\| \|\Delta y\|}{f_Y(y)\|\Delta y\|} = \frac{f_{X,Y}(x,y)}{f_Y(y)}\|\Delta x\|\]

Therefore, by the proposition given in sec. 2.7,

    \[ \lim_{\Delta x, \Delta y \to 0} \frac{P(x<X<x+\Delta x| y<Y<y+\Delta y)}{\|\Delta x\|}= \lim_{\Delta x, \Delta y \to 0}   \frac{f_{X,Y}(x,y)}{f_Y(y)}\frac{\|\Delta x\|}{\|\Delta x\|} = \frac{f_{X,Y}(x,y)}{f_Y(y)} \]

indicating that the term \lim_{\Delta x, \Delta y \to 0} \frac{P(x<X<x+\Delta x| y<Y<y+\Delta y)}{\|\Delta x\|} =  \frac{f_{X,Y}(x,y)}{f_Y(y)} is a pdf in terms of x; this is called the conditional pdf of X given Y=y. The conditional pdf is denoted as,

    \[f_{X|Y}(x|y):= \frac{f_{X,Y}(x,y)}{f_Y(y)} \implies  f_{X,Y}(x,y) = f_{X|Y}(x|y)f_Y(y) \]
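A minimal numerical sketch, assuming a hypothetical joint pdf f(x,y)=x+y on the unit square (which integrates to 1): the ratio f_{X,Y}/f_Y integrates to 1 over x for fixed y, but not over y for fixed x.

```python
# Numerical check of f_{X|Y}(x|y) = f_{X,Y}(x,y)/f_Y(y) for a
# hypothetical joint pdf f(x, y) = x + y on the unit square.
def integrate(g, a, b, n=1000):
    """Composite trapezoid rule for a function of one variable."""
    h = (b - a) / n
    return h * (g(a) / 2 + sum(g(a + i * h) for i in range(1, n)) + g(b) / 2)

f = lambda x, y: x + y                                   # joint pdf on [0,1]^2
f_Y = lambda y: integrate(lambda x: f(x, y), 0.0, 1.0)   # marginal pdf: y + 1/2

# As a function of x (y fixed), the conditional pdf integrates to 1:
y0 = 0.3
fy0 = f_Y(y0)
assert abs(integrate(lambda x: f(x, y0) / fy0, 0.0, 1.0) - 1.0) < 1e-6
# ...but as a function of y (x fixed) it does not integrate to 1:
mass_in_y = integrate(lambda y: f(0.3, y) / f_Y(y), 0.0, 1.0, n=400)
assert abs(mass_in_y - 1.0) > 1e-2
```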

It can be checked that f_{X|Y}(x|y) is a pdf as a function of x but not as a function of y.

Note that f_Y(y) can be obtained from the related marginal pdf of f_{X,Y}(x,y).

Using f_{X|Y}(x|y) we can write,

    \[ P(X\in S | Y=y)=\int_S  f_{X|Y}(x|y) \mathrm d x\]

Note that y is a given and hence is a fixed value.

For two random variables/vectors having a joint pdf, and two sets A and B, we can write,

    \[P(X\in A | Y \in B)=\frac{P(X\in A, Y\in B)}{P(Y\in B)} = \frac{\int_{S} f_{X,Y}(x,y)\mathrm d x \mathrm d y}{\int_B f_Y(y)\mathrm d y} = \frac{\int_{S} f_{X|Y}(x|y)f_Y(y)\mathrm d x \mathrm d y}{\int_B f_Y(y)\mathrm d y}  \]

where S:=\{(x,y)| x\in A, y\in B\}.

Remark: We can also restrict or put conditions on the values of a random variable itself. For example X|X\in B where B\subset S_X, meaning that an event (B) regarding some values of X has already been observed. Therefore,

    \[P(X\in A|X\in B)=\frac{P(X\in A, X\in B)}{P(X\in B)} = \frac{P(X^{-1}(A\cap B))}{P(X\in B)}\]

2-8-5 About X|Y

Consider two random variables X,Y:\Omega\to \mathbb R, and their combination into a random vector Z:=(X,Y):\Omega\to \mathbb R^2. By writing X|Y\in B, we refer to the values of X \in\mathbb R obtained by restricting the values of (X,Y) with the rule Y\in B; this means

    \[S_{X|Y\in B} :=\{x| (x,y)\in \mathbb R^2 \text{ and } y\in B\subset \mathbb R\}\]

or, considering the supports of the random variables leads to

    \[  S_{X|Y\in B} :=\{x| (x,y)\in S_Z \text{ and } y\in B\subset S_Y\} \]

Therefore, the function X(\omega)|Y\in B defined as X|Y\in B: X^{-1}( S_{X|Y\in B})\subset \Omega \to \mathbb R is a random variable. The σ-algebra of its measurable space is then generated by S_{X|Y\in B}. In mathematical notation:

    \[X|Y\in B :=\left (\Omega^*:= X^{-1}( S_{X|Y\in B}), \sigma (\Omega^*)\right )\to (\mathbb R, \mathcal B)\]

Note that taking \mathcal B as the sigma algebra of the codomain of X|Y\in B is fine, because the pre-image of any set not intersecting the range of this random variable is the empty set, which is in \sigma (\Omega^*).

Because the condition X|Y\in B restricts \Omega, the probability measure of the measure space (\Omega^*, \sigma(\Omega^*),P^*) must be a conditional probability measure based on the probability measure of (\Omega,\mathcal F, P). Therefore,

    \[P^*(E^*)=\frac{P(E^*)}{P(\Omega^*)}\]

where E^*\in \sigma(\Omega^*). Note how P^* is the normalized/scaled version of P. Also note that \Omega^*, E^*\in \mathcal F as well.

As a result, if E :=\{x| (x,y)\in \mathbb R^2 \text{ and } x\in A\subset \mathbb R  \text{ and }  y\in B\subset \mathbb R\}, then

    \[P(X\in A|Y\in B)= \frac {P\circ X^{-1}(E)}{P(\Omega^*)}\]

This defines the distribution of X|Y\in B.

As shown before, the term P(X\in A|Y\in B) is a probability measure with respect to its first argument.

We can also write,

    \[P(X\in A|Y\in B)=\frac{P(X^{-1}(A)\cap Y^{-1}(B))}{P(Y^{-1}(B))} \]

which is equivalent to \frac {P\circ X^{-1}(E)}{P(\Omega^*)}, because X^{-1}(E) and X^{-1}(A)\cap Y^{-1}(B) refer to the same subset of \Omega, and \Omega^* can also be expressed as Y^{-1}(B), because

    \[\begin{split}  S_{X|Y\in B} &:=\{x| (x,y)\in S_Z \text{ and } y\in B\} \implies \text{ therefore } \forall (x,y), x\in S_{X|Y\in B} \iff y\in B\\\Omega^*&= X^{-1} ( S_{X|Y\in B} )=\{\omega|x=X(\omega)\in S_{X|Y\in B}\}=  \{\omega|y=Y(\omega)\in B\}=Y^{-1}(B) \end{split}\]

The following diagram demonstrates the relation between the sets and their pre-images.

As shown before, when Y=y for a fixed y, and there is a joint pdf for X and Y, we can have a conditional pdf of X given Y=y as f_{X|Y}(x|y); in other words the conditional pdf is the pdf of the random variable/vector X|Y.

A more general case is when a set of given/observed values is considered for Y instead of a single value. Therefore,

    \[P(x\le X \le x+\Delta x| Y\in B)=\frac{P(x\le X \le x+\Delta x , Y\in B )}{P(Y\in B)} \approx \frac{(\int_B f_{X,Y}(x,y) \mathrm d y)\|\Delta x\|}{\int_B f_Y(y)\mathrm d y} \]

leads to the expression of the conditional pdf of X given Y\in B as

    \[f_{X|Y\in B}(x|y\in B)= \frac{\int_B f_{X,Y}(x,y) \mathrm d y}{\int_B f_Y(y)\mathrm d y} =  \frac{\int_B f_{X|Y}(x|y)f_Y(y) \mathrm d y}{\int_B f_Y(y)\mathrm d y}  \]

which is a function of x.

Example: For Z:=(X,Y):\Omega \to \mathbb R^2, let \{(\omega; x,y)| \omega \in \Omega, x,y\in \mathbb R\} be as

    \[\{(\omega_1;0,0),  (\omega_2;0,1),  (\omega_3;1,2),  (\omega_4;1,2),  (\omega_5;1,3),  (\omega_6;2,1),  (\omega_7;2,2),  (\omega_8;2,3),  (\omega_9;3,7),  (\omega_{10};4,6)\}\]

Now, let B=\{2,3\}; accordingly, S_{X|Y\in B}=\{1,2\}. So, we can write

    \[\Omega^* :=X^{-1}( S_{X|Y\in B})=\{\omega_3, \omega_4, \omega_5, \omega_7, \omega_8\}, \quad \mathcal F^*:=\sigma(\Omega^*)=2^{\Omega^*},\quad \mathcal A := \sigma( S_{X|Y\in B})=2^{ S_{X|Y\in B}} \]

and define a random variable as

    \[X|Y\in B: (\Omega^*,\mathcal F^*)\to  (S_{X|Y\in B} \text{ or } \mathbb R, \mathcal A \text{ or } \mathcal B )\]

such that,

    \[ X(\omega_3)|Y\in B = X(\omega_4)|Y\in B  = X(\omega_5)|Y\in B  =1,\quad  X(\omega_7)|Y\in B = X(\omega_8)|Y\in B  =2\]
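The example above can be checked by counting. Assuming, for illustration, that the ten outcomes are equally likely (the text does not fix a measure on \Omega), the conditional pmf of X given Y\in B follows directly:

```python
from fractions import Fraction

# The ten outcomes (x, y) from the example; assuming, hypothetically,
# that each omega is equally likely.
outcomes = [(0, 0), (0, 1), (1, 2), (1, 2), (1, 3),
            (2, 1), (2, 2), (2, 3), (3, 7), (4, 6)]
B = {2, 3}

restricted = [x for (x, y) in outcomes if y in B]    # Omega* seen through X
assert sorted(set(restricted)) == [1, 2]             # S_{X|Y in B}

# Conditional pmf of X given Y in B, by counting within Omega*
# (the normalization by P(Omega*) is the division by len(restricted)):
pmf = {x: Fraction(restricted.count(x), len(restricted))
       for x in set(restricted)}
assert pmf[1] == Fraction(3, 5) and pmf[2] == Fraction(2, 5)
```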

Example: An integer x is randomly and uniformly chosen from integers 1 to N, then another integer y is randomly and uniformly chosen from 1 to x. Find the distribution of X, Y|X, (X,Y), and Y.

Let’s define a probability space (\Omega, \mathcal F, P) and two random variables X,Y:\Omega\to \mathbb R. We’ll dig into this space later. From the assumptions, the distribution of X is P(X\in A) = \#A/N. Then, for a single value, P(X=x)=1/N. In the second step, another value y is chosen based on the observed result of choosing x. According to the assumptions, y is uniformly chosen at random from the set E:=\{1,\cdots,x\}. This sets a distribution for Y|X=x as P(Y\in B|X=x)=\#B/\#E=\#B/x. Then, for a single value of y, P(Y=y|X=x)=1/x.

The distribution of (X,Y) is P(X=x, Y=y)=P(Y=y|X=x)P(X=x)=\frac{1}{xN}.

Since the random variables X, Y, and the vector (X,Y) are discrete, P(X=x), P(Y=y) and P(X=x,Y=y) are the pmfs of the random variables and vector. If the pmf of the random vector is denoted by \rho_{X,Y}(x,y), the distribution of Y can be calculated as

    \[ P(Y=y)=\rho_Y(y) = \sum_{x=y}^N  \frac{1}{xN} = \frac{1}{N} \sum_{x=y}^N  \frac{1}{x} \]

where the sum starts at x=y because \rho_{X,Y}(x,y)=0 for x<y.

Just for checking

    \[ P(X=x)=\rho_X(x) = \sum_{y=1}^x  \frac{1}{xN} = \frac{1}{xN}\, x=\frac{1}{N}\]

Now let’s dig into the probability space. The experiment is choosing an integer x\le N and then choosing another one as y\le x. Therefore, the sample space, i.e. all possible outcomes of the experiment, is

    \[\Omega = \{(\omega_1,\omega_2)| \omega_1\le N, \omega_2\le \omega_1\}\]

Note that \Omega does not contain all the ( \omega_1,\omega_2 ) pairs as in \{(\omega_1,\omega_2 )|  \omega_1 \le N,  \omega_2 \le N\}.

The random variables X and Y previously defined are as

    \[ X(\omega)=\omega_1, Y(\omega)=\omega_2\]

We can set a probability measure based on a counting measure on this space. From the assumptions, X has a uniform distribution, meaning that P(X=x)=\frac{1}{N}. This means P(\{(\omega_1,\omega_2)| \omega_1=x, \omega_2 \le x\}) = \frac{1}{N}. Also, P(Y=y|X=x)=1/x. Therefore, we can write

    \[P(X=x, Y=y)=P(Y=y|X=x)P(X=x)=\frac{1}{xN}\implies P(X=x, Y=y)=P(X^{-1}(x)\cap Y^{-1}(y))= P(\{(\omega_1,\omega_2)\})=\frac{1}{\omega_1 N}\]

which defines the probability measure on the sample space.

For example the probability of the simple event \{(2,1)\}\in \mathcal F if N=10 is 1/(2\times 10)=1/20.

Just for demonstration; if N=10, then

    \[P((3,2))=1/30 = P(X=3,Y=2)=P(Y=2|X=3)P(X=3)=(1/3)(1/10)=1/30\]
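The computations in this example can be verified by enumerating the joint pmf (here with N=10, as in the demonstration above); note that the marginal of Y sums over x\ge y only:

```python
from fractions import Fraction

N = 10
# Joint pmf rho(x, y) = 1/(x*N) on the triangular sample space y <= x <= N.
joint = {(x, y): Fraction(1, x * N)
         for x in range(1, N + 1) for y in range(1, x + 1)}

assert sum(joint.values()) == 1                 # a valid pmf
assert joint[(3, 2)] == Fraction(1, 30)         # matches the check in the text

# Marginals: X is uniform; Y sums over x >= y only.
rho_X = lambda x: sum(p for (u, v), p in joint.items() if u == x)
rho_Y = lambda y: sum(p for (u, v), p in joint.items() if v == y)
assert all(rho_X(x) == Fraction(1, N) for x in range(1, N + 1))
assert rho_Y(2) == Fraction(1, N) * sum(Fraction(1, x) for x in range(2, N + 1))
```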

Example: If we say X has a particular distribution on an interval [0,Y] with Y being a random variable, it means X|Y=y follows that distribution for each y\in S_Y.

2-9 Functions of random variables/vectors

Side note: if f(x) is a function f:A\to B and g(y) is a function g:B\to C, then g\circ f(x):=g(f(x)) is a function g\circ f:A\to C. As for the notation used for a random variable/vector x=X(\omega), the term g(X) is a function g\circ X:\Omega \to C, whereas g(x) indicates a function g:\mathbb R^n\to C.

Let X:(\Omega, \mathcal F, P)\to \mathbb R be a random variable (i.e. a function) and g:\mathbb R\to \mathbb R be a function. Then, as we know, g\circ X:\Omega \to \mathbb R, i.e. Y(\omega):=g(X(\omega)), is also a random variable.

More generally, let (X_1,\cdots,X_n):\Omega \to \mathbb R^n be a random vector and g:\mathbb R^n \to \mathbb R^m be a function. Then, g\circ X:\Omega \to \mathbb R^m is a random vector/variable.

Example: In an experiment, a person is randomly selected from a group and his/her weight and height are recorded; then the ratio of weight to height is calculated. If W and H are the random variables representing the weight and height, then X=W/H becomes a random variable for their ratio. Note that W,H:\Omega \to \mathbb R and X:\Omega \to \mathbb R. The quotient function is g:\mathbb R^2\to \mathbb R acting as g(a,b)=a/b.

Example: Repeating the process of measuring a real quantity n times produces a random vector (X_1, \cdots, X_n):\Omega \to \mathbb R^n. A function g:\mathbb R^n \to \mathbb R defined as g(x)=\frac 1n\sum_{i=1}^n x_i creates a random variable when used as H=g\circ X(\omega)=\frac 1n\sum_{i=1}^n X_i(\omega).

Proposition: Let X:(\Omega, \mathcal F,P) \to \mathbb R^n be a random vector and P_X:=P\circ X^{-1}:\mathcal B^n \to [0,1] be the distribution of X. If g:\mathbb R^n \to \mathbb R^m and H:=g\circ X:\Omega\to \mathbb R^m, we can consider S\subset \mathbb R^m and determine the distribution of H, i.e. P_H, in terms of the distribution of X as follows,

    \[\begin{split}P_H(S) &= P(H\in S)\equiv P(\{\omega| H(\omega) \in S\})=P(g(X)\in S)\\&=P(X\in g^{-1}(S))=P(X^{-1}(g^{-1}(S)))\equiv P(\{\omega| \omega \in  X^{-1}(g^{-1}(S))\}) = P_X(g^{-1}(S))\end{split}\]

where g^{-1} is the pre-image. If g is bijective then g^{-1} is the inverse function.
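A discrete sketch of the proposition, assuming a hypothetical fair die pushed forward through g(x)=x\bmod 2: the distribution of H=g(X) is obtained by measuring pre-images under g.

```python
from fractions import Fraction

# Pushforward of a die's distribution under g(x) = x mod 2:
# P_H(S) = P_X(g^{-1}(S)).
P_X = {x: Fraction(1, 6) for x in range(1, 7)}
g = lambda x: x % 2

def P_H(S):
    # take the pre-image of S under g, then measure it with P_X
    preimage = [x for x in P_X if g(x) in S]
    return sum(P_X[x] for x in preimage)

assert P_H({1}) == Fraction(1, 2)   # odd faces: {1, 3, 5}
assert P_H({0, 1}) == 1             # whole codomain has full measure
```

Here g is not bijective, so g^{-1} is genuinely a pre-image rather than an inverse function.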

By the above proposition, we can obtain the CDF of H in terms of the distribution of X. Setting S:=(-\infty,h_1]\times\cdots \times (-\infty, h_m] and h=(h_1,\cdots ,h_m), we can write

    \[\begin{split}F_H(h)&=P(H\le h)=P(H_1\le h_1, \cdots, H_m \le h_m)\equiv P(H\in (-\infty,h_1]\times\cdots \times (-\infty, h_m] )\\&= P_X(g^{-1}((-\infty,h_1]\times\cdots \times (-\infty, h_m]))\end{split}\]

If X has a pdf, f_X(x), then

    \[F_H(h) = P_X(g^{-1}((-\infty,h_1]\times\cdots \times (-\infty, h_m])) =\int_{g^{-1}((-\infty,h_1]\times\cdots \times (-\infty, h_m])}f_X(x)\mathrm d x\]

and consequently f_H(h)=\mathrm d F_H(h) /\mathrm d h.

Example: Let X,Y:\Omega \to \mathbb R, find the distribution of Z=X+Y.

In this problem, g:\mathbb R^2\to \mathbb R. If we collect the random variables into a vector and let P_{X,Y} be the joint distribution of (X,Y), we can write P_Z(S) =  P_{X,Y}(g^{-1}(S)) in general. For finding the CDF of Z we can write

    \[ F_Z(z)=P(X+Y\le z)=P_{X,Y}(\{(x,y)| x+y\le z\})\]

And if there is a joint pdf for (X,Y), we can conclude

    \[F_Z(z)=\int_{\{(x,y)| x+y\le z\}} f_{X,Y}(x,y)\mathrm dx\mathrm dy \quad \text{and } f_Z(z)=\mathrm dF_Z(z)/\mathrm dz\]
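A discrete instance of this construction, assuming two hypothetical independent fair dice: the pmf of Z=X+Y collects the joint measure of each line x+y=z, and the CDF measures the half-plane x+y\le z.

```python
from fractions import Fraction
from collections import defaultdict

# Distribution of Z = X + Y for two independent fair dice:
# P_Z(z) accumulates P_{X,Y} over the set {(x, y) | x + y = z}.
pmf_Z = defaultdict(Fraction)
for x in range(1, 7):
    for y in range(1, 7):
        pmf_Z[x + y] += Fraction(1, 36)

assert pmf_Z[7] == Fraction(6, 36)

# CDF of Z at z is the joint measure of the half-plane x + y <= z:
F_Z = lambda z: sum(p for s, p in pmf_Z.items() if s <= z)
assert F_Z(12) == 1 and F_Z(3) == Fraction(3, 36)
```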

2-10 Expectation (mean/expected value)

Theorem: Let X:(\Omega, \mathcal F, P)\to (\mathbb R, \mathcal B) and P_X(A):= P\circ X^{-1}(A)\ \forall A\in \mathcal B be the distribution of X. P_X is a probability measure for the probability space (\mathbb R, \mathcal B, P_X). Then, the following change of variable is permitted in the Lebesgue integration; either integral is defined iff the other is.

    \[\int_{\Omega} g(X(\omega))\mathrm dP = \int_{\mathbb R} g(x) \mathrm dP_X\]

where g:\mathbb R \to \mathbb R. Note that the integrations are with respect to the measures P and P_X. For a clearer notation, we can write

    \[\int_{\Omega} g(X(\omega))\, P(\mathrm d\omega) = \int_{\mathbb R} g(x)\, P_X(\mathrm dx)\]

This theorem also holds when integrating over E \in \mathcal F and A=X(E)\in \mathcal B as the regions of integration.
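The change of variable can be illustrated in the discrete case, where both integrals become finite sums. The example below assumes a hypothetical uniform measure on \Omega=\{1,\cdots,6\} and a deliberately non-injective X:

```python
from fractions import Fraction

# Change of variable: the integral of g(X(omega)) dP over Omega equals
# the integral of g(x) dP_X over the range of X (sums, in the discrete case).
Omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in Omega}        # uniform measure on Omega
X = lambda w: w % 3                           # a non-injective random variable
g = lambda x: x * x

lhs = sum(g(X(w)) * P[w] for w in Omega)      # integrate over Omega

# distribution P_X = P o X^{-1} on the range {0, 1, 2}
P_X = {x: sum(P[w] for w in Omega if X(w) == x) for x in {0, 1, 2}}
rhs = sum(g(x) * p for x, p in P_X.items())   # integrate over the range

assert lhs == rhs == Fraction(5, 3)
```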

Definition: Let X:(\Omega, \mathcal F, P)\to \mathbb R be a random variable. The expectation or the expected value, or the mean of X is defined as

    \[\mathrm E[X]=\int_\Omega X(\omega) \mathrm dP\]

where the integral is a Lebesgue integration with respect to the probability measure.

This integral can be transferred onto the support or codomain of the random variable. If we set g:x\mapsto x, i.e. the identity function, and use the above theorem, we can write

    \[\mathrm E[X]=\int_\Omega X(\omega) \mathrm dP = \int_{\mathbb R \text{ or } X(\Omega) } x \mathrm dP_X  \]

The expectation of a function of a random variable can also be obtained. If H=g(X), then,

    \[\mathrm E[H]=\int_\Omega g(X(\omega)) \mathrm dP = \int_{\mathbb R \text{ or } X(\Omega) } g(x) \mathrm dP_X \]

If we consider the CDF of X, we can write,

    \[\mathrm E[X] =   \int_{\mathbb R \text{ or } X(\Omega) } x \mathrm d F(x)  \]

where F(x) = P (X\le x).

Mean of a discrete random variable

For a random variable with a discrete support S=X(\Omega), the above integral becomes a sum as

    \[\mathrm  E[X] =\sum_{x\in S} xP(X^{-1}(x)) = \sum_{x\in S} xP_X(x)\]

Mean of a continuous random variable with a pdf f(x)

In this case, the support of the random variable is an uncountable measurable subset of \mathbb R. Therefore, the Lebesgue integration becomes a Riemann integration over S=X(\Omega) as

    \[\mathrm  E[X] = \int_{\mathbb R \text{ or } S} xf(x)\mathrm dx\]

Note that \mathrm dP_X =\mathrm dF_X = f(x)\mathrm dx indicates the probability of a measurable differential element/set.

Remark: Integration over \mathbb R is fine because the pdf vanishes where x is not in S.
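A numerical sketch, assuming a hypothetical pdf f(x)=2x on [0,1] (so \mathrm E[X]=\int_0^1 x\cdot 2x\,\mathrm dx=2/3):

```python
# Mean of a continuous random variable with a hypothetical pdf
# f(x) = 2x on [0, 1]; the pdf vanishes outside the support.
def integrate(g, a, b, n=2000):
    """Composite trapezoid rule."""
    h = (b - a) / n
    return h * (g(a) / 2 + sum(g(a + i * h) for i in range(1, n)) + g(b) / 2)

f = lambda x: 2 * x                          # pdf on the support [0, 1]
mean = integrate(lambda x: x * f(x), 0.0, 1.0)
assert abs(mean - 2 / 3) < 1e-6
```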

Example: If H=g(X) where X is a random variable with a pdf f_X(x), determine \mathrm E(g(X)) by using \mathrm  E[X] = \int_{S_X} xf(x)\mathrm d x.

Without loss of generality, we can assume S_H=[a,b]; then we can partition the support of H into disjoint sets as

    \[S_H = [h_1=a,h_2)\cup \cdots \cup [h_{n-2},h_{n-1}) \cup [h_{n-1},h_{n}=b]\]

By choosing \tilde h_i \in [h_i,h_{i+1}) and using the definition of the expected value,

    \[\begin{split} \mathrm E(H)&\approx \sum_i \tilde h_i P(H\in [h_i, h_{i+1})) = \sum_i \int_{g^{-1}([h_i,h_{i+1}))} \tilde h_i f(x)\mathrm d x \\ &=  \int_{g^{-1}([h_1=a,h_{2}))} \tilde h_1 f(x)\mathrm d x + \cdots +   \int_{g^{-1}([h_{n-1},h_n=b])} \tilde h_{n-1} f(x)\mathrm d x    \end{split}\]

The value \tilde h_i is a single value in each interval of the partition of S_H. If we replace \tilde h_i with the function g(x) restricted to the corresponding interval, i.e. [h_i, h_{i+1}), each integral term becomes exact; therefore,

    \[\begin{split} \mathrm E(H) &=  \int_{g^{-1}([h_1=a,h_{2}))}  g(x) f(x)\mathrm d x + \cdots +   \int_{g^{-1}([h_{n-1},h_n=b])} g(x) f(x)\mathrm d x   =  \int_{\big \cup_i g^{-1}([h_i,h_{i+1}))} g(x) f(x)\mathrm d x \\ &=  \int_{ g^{-1}(\big \cup_i [h_i,h_{i+1}))} g(x) f(x)\mathrm d x =  \int_{ g^{-1}(S_H)} g(x) f(x)\mathrm d x =  \int_{S_X} g(x) f(x)\mathrm d x    \end{split}\]
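The identity derived above (often called the law of the unconscious statistician) can be checked numerically. Assuming X uniform on [0,1] and H=g(X)=X^2, so that F_H(h)=\sqrt h and f_H(h)=1/(2\sqrt h) on (0,1], both routes give \mathrm E[H]=1/3:

```python
import math

# LOTUS check: E[g(X)] computed as the integral of g(x) f_X(x) dx
# should match the mean computed from the distribution of H = g(X).
def integrate(fun, a, b, n=2000):
    """Composite trapezoid rule."""
    h = (b - a) / n
    return h * (fun(a) / 2 + sum(fun(a + i * h) for i in range(1, n)) + fun(b) / 2)

g = lambda x: x * x
lotus = integrate(lambda x: g(x) * 1.0, 0.0, 1.0)      # f_X(x) = 1 on [0, 1]

# via the pdf of H: f_H(h) = d(sqrt(h))/dh = 1/(2 sqrt(h)) on (0, 1];
# start slightly above 0 to avoid the (integrable) singularity there.
f_H = lambda h: 1.0 / (2.0 * math.sqrt(h))
direct = integrate(lambda h: h * f_H(h), 1e-12, 1.0)

assert abs(lotus - 1.0 / 3.0) < 1e-6
assert abs(direct - lotus) < 1e-3
```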

Remark: In general, the expectation of a random variable is the weighted average of all possible states/values, with the weights representing their (differential) probabilities. The expectation describes something like the average result of a random variable.

Proposition: The expectation of a random vector X:=(X_1,\cdots,X_n) is the vector of expectations of each variable, i.e.

    \[\mathrm E_{X}[X] = (\mathrm E_{X_1}[X_1],\cdots, \mathrm E_{X_n}[X_n] )\]

where the subscript of each \mathrm E indicates the random variable whose distribution is involved in calculating the expectation. To show this proposition, we assume that the components of X have a joint pdf and write,