401-4623-00: Time Series Analysis
Section 1
Characteristics of Time Series
Swiss Federal Institute of Technology Zurich
Eidgenössische Technische Hochschule Zürich
Last Edit Date: 01/04/2025
Disclaimer and Terms of Use:
We do not guarantee the accuracy and completeness of the summary content. Some of the course material may not be included, and some of the content in the summary may not be correct. You should use this file properly and legally. We are not responsible for any results from using this file.
This personal note is adapted from Professor Fadoua Balabdaoui and Professor Nicolai Meinshausen. Please contact us to delete this file if you think your rights have been violated.
This work is licensed under a Creative Commons Attribution 4.0 International License.
Objectives of time series analysis¶
Goals of time series analysis can be classified into one of the following:
- Compact description of data as $X_t = T_t + S_t + Y_t$, where $X_t$ is the observed time series, $T_t$ a trend, $S_t$ a seasonal component and $Y_t$ stationary noise. This can aid interpretation, for example through seasonal adjustment of unemployment figures. 
- Hypothesis testing. We might for example want to test whether the trend component $T_t$ vanishes for summer rainfall figures in Zurich over the last 10 years. 
- Prediction. Examples are predicting unemployment data, the strength of El Niño, airline passenger numbers, or the next word in a text. Prediction is sometimes only possible via simulation, as when trying to forecast hurricane intensity for the next decade at a specific location. 
- Control / Causality / Reinforcement learning. One example is the impact of monetary policy (interest rates) on inflation, where the causal impact can be quite different (possibly even of different sign) from the pure observational correlation. Another is the optimal filling and draining of lakes for energy storage. 
Stochastic process¶
A stochastic process is a mathematical model for a time series.
A stochastic process is a collection of random variables $(X_t(\omega); t \in T)$. Alternative view: a stochastic process is a random function from $T$ to $\mathbb{R}$.
A basic distinction is between continuous and discrete equispaced time $T$. Models in continuous time are preferred for irregular observation points. In this course, we will restrict ourselves mostly to discrete equispaced time and, if not stated otherwise, use $T = \mathbb{Z}$.
In all interesting cases, there is dependence between the random variables at different times. Hence we need to consider joint distributions, not only marginals. Gaussian stochastic processes have a joint Gaussian distribution for any finite number of time points.
A stochastic process describes what different time series (when different $\omega$'s are drawn) could look like. In most cases, we observe only one realization $x_t(\omega)$ of the stochastic process (a single $\omega$). Hence we need additional assumptions if we want to draw conclusions about the joint distributions (which involve many $\omega$'s) from a single realization. The most common such assumption is stationarity.
Stationarity means the same behavior of the observed time series in different time windows. Mathematically, it is formulated as invariance of (joint) distributions when time is shifted. Stationarity justifies taking averages over time (mathematically, one needs ergodicity in addition).
Examples¶
- White noise $X_t = W_t$, where $W_t \sim WN(0, \sigma^2)$, that is $W_t \sim F$ iid from some distribution $F$ with mean 0 and variance $\sigma^2$. A special case is Gaussian white noise, where $F = \mathcal{N}(0, \sigma^2)$. 
- Harmonic oscillations plus (white) noise, - $$X_t = \sum_{k = 1}^K \alpha_k \cos (\lambda_k t + \phi_k) + W_t$$ - where $W_t \sim WN(0, \sigma^2)$ is a white noise process as above and $K$, $\alpha_k$, $\lambda_k$, $\phi_k$ are unknown parameters. 
- Moving averages. For example - $$X_t = \frac{1}{3} (W_t + W_{t - 1} + W_{t - 2})$$ - where $W_t \sim WN(0, \sigma^2)$. 
- Auto-regressive processes. For example - $$X_t = 0.9 X_{t - 1} + W_t$$ - plus initial conditions. 
- Random Walk (special case of an auto-regressive process) - $$X_t = X_{t - 1} + W_t$$ - or, with drift, - $$X_t = X_{t - 1} + 0.2 + W_t.$$ 
- Auto-regressive conditional heteroscedastic models - $$X_t = \sqrt{1 + 0.9 X_{t - 1}^2} W_t$$ - where again $W_t \sim WN(0, \sigma^2)$. 
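The example processes above can be simulated in a few lines. The following is a minimal sketch with NumPy, assuming Gaussian white noise with $\sigma = 1$, a starting value of 0 for the recursive processes, and an arbitrary series length and seed; none of these choices come from the course material.

```python
import numpy as np

rng = np.random.default_rng(0)            # fixed seed, chosen only for reproducibility
n, sigma = 500, 1.0                       # illustrative length and noise level
w = rng.normal(0.0, sigma, size=n + 2)    # Gaussian white noise (two extra values for the MA lags)

# White noise X_t = W_t
white = w[2:]

# Moving average X_t = (W_t + W_{t-1} + W_{t-2}) / 3
ma = (w[2:] + w[1:-1] + w[:-2]) / 3.0

# AR(1): X_t = 0.9 X_{t-1} + W_t, started at 0 (assumed initial condition)
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = 0.9 * ar[t - 1] + white[t]

# Random walk without drift and with drift 0.2, both started at 0
walk = np.cumsum(white)
walk_drift = np.cumsum(white + 0.2)

# ARCH: X_t = sqrt(1 + 0.9 X_{t-1}^2) W_t, started at 0 (assumed initial condition)
arch = np.zeros(n)
for t in range(1, n):
    arch[t] = np.sqrt(1.0 + 0.9 * arch[t - 1] ** 2) * white[t]
```

Plotting these arrays side by side is a quick way to see the qualitative differences, for instance the wandering level of the random walk and the bursts of volatility in the ARCH path.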
Measures of dependence¶
We want to summarize the distribution of a stochastic process ($X_t$) by the first two moments.
- The mean function of the process ($X_t$) is defined as - $$\mu_t = \mathbb{E}[X_t] = \int_{-\infty}^{\infty} x f_t(x) dx$$ - where $f_t$ is the density of $X_t$ (if it exists). 
- The auto-covariance function (ACVF) is defined for all $s, t \in \mathbb{Z}$ as - $$\gamma (s, t) = \text{Cov}(X_s, X_t) = \mathbb{E}[(X_s - \mu_s)(X_t - \mu_t)].$$ 
- The auto-correlation function (ACF) is defined as - $$\rho (s, t) = \frac{\gamma(s, t)}{\sqrt{\gamma(s, s)\gamma(t, t)}}.$$ 
- The cross-covariance for two time series ($X_t$) and ($Y_t$) is defined as - $$\gamma_{X, Y}(s, t) = \text{Cov}(X_s, Y_t).$$ 
We can also use a subscript $X$ for the first three definitions but usually drop it for notational simplicity.
In vector notation, we can thus write for a collection of $n$ time-points that the vector
$$(X_1, \cdots, X_n)$$
has mean
$$(\mu_1, \cdots, \mu_n)$$
and covariance matrix
$$ \begin{pmatrix} \gamma(1, 1) & \gamma(1, 2) & \gamma(1, 3) & \cdots & \gamma(1, n)\\ \gamma(2, 1) & \gamma(2, 2) & \gamma(2, 3) & \cdots & \gamma(2, n)\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \gamma(n, 1) & \gamma(n, 2) & \gamma(n, 3) & \cdots & \gamma(n, n)\\ \end{pmatrix} $$
The first two moments of any collection of $n$ random variables can be described as above. For time series, we would like to see translational invariance in time, which will be called stationarity.
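The mean vector and covariance matrix above are averages over realizations $\omega$, not over time. As a rough illustration, the sketch below assumes we could draw many independent realizations (rarely possible with real data) of the moving-average example $X_t = \frac{1}{3}(W_t + W_{t-1} + W_{t-2})$ and estimates both quantities across them. The number of paths, the Gaussian noise and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_paths, n = 20_000, 5        # number of realizations and of time points (illustrative)
sigma = 1.0

# Each row is one realization of X_1, ..., X_n for the three-term moving average
w = rng.normal(0.0, sigma, size=(n_paths, n + 2))
x = (w[:, 2:] + w[:, 1:-1] + w[:, :-2]) / 3.0

mu_hat = x.mean(axis=0)                 # estimate of (mu_1, ..., mu_n)
gamma_hat = np.cov(x, rowvar=False)     # estimate of the n x n matrix with entries gamma(s, t)

# For this process the exact values are mu_t = 0, gamma(t, t) = sigma^2 / 3,
# gamma(t, t + 1) = 2 sigma^2 / 9, gamma(t, t + 2) = sigma^2 / 9, and 0 beyond lag 2.
```

Printing `gamma_hat` shows entries that are roughly constant along each diagonal, which is the translational invariance the text refers to.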
Stationarity¶
Strictly stationary¶
Definition: A time series $\{X_t\}_t$ is strictly stationary if and only if the distribution of $(X_{t_1}, \cdots, X_{t_k})$ is identical to the distribution of $(X_{t_{1} + h}, \cdots, X_{t_{k} + h})$ for all $k \in \mathbb{N}^{+}$, time points $t_1, \cdots, t_k \in \mathbb{Z}$ and shifts $h \in \mathbb{Z}$.
If $\{X_t\}_t$ is strictly stationary and has finite second moments, then
- $\exists\mu \in \mathbb{R}$ such that $\mu_t = \mu$ for all $t \in \mathbb{Z}$, that is the mean is constant. 
- $\gamma(s, t) = \gamma(s + h, t + h)$ for all $s, t, h \in \mathbb{Z}$, that is the covariance is invariant under time-shifts, and we can write $\gamma$ without the second argument as - $$\gamma(k) = \gamma(k, 0), \forall k \in \mathbb{Z}.$$ 
We can also use just these two properties of the first two moments (which are implied by strict stationarity) to define weak stationarity.
Weakly stationary¶
Definition: A time series $\{X_t\}_t$ is weakly stationary (or just stationary henceforth) if and only if $\{X_t\}_t$ has finite variance and
- the mean function $\mu_t$ does not depend on $t \in \mathbb{Z}$ 
- the autocovariance $\gamma(s, t)$ depends on $s, t$ only through $|s - t|$ and we use again the notation - $$\gamma(k) = \gamma(k, 0), \forall k \in \mathbb{Z}.$$ 
The mean and covariance of a collection of $n$ consecutive observations, for example $(X_1, \cdots, X_n)$, are now
$$(\mu, \cdots, \mu)$$
and covariance matrix
$$ \begin{pmatrix} \gamma(0) & \gamma(1) & \gamma(2) & \cdots & \gamma(n - 1)\\ \gamma(1) & \gamma(0) & \gamma(1) & \cdots & \gamma(n - 2)\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \gamma(n - 1) & \gamma(n - 2) & \gamma(n - 3) & \cdots & \gamma(0)\\ \end{pmatrix} $$
in contrast to the general case discussed above. The invariance under time shifts is now easily visible.
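The structure of this matrix (constant along every diagonal) means it is fully determined by $\gamma(0), \cdots, \gamma(n - 1)$. A small sketch of how it could be built with NumPy follows; the function name and the example values of $\gamma$ are hypothetical and only serve as illustration.

```python
import numpy as np

def stationary_cov_matrix(gamma):
    """Return the n x n matrix with entries gamma(|i - j|), given gamma(0), ..., gamma(n - 1)."""
    gamma = np.asarray(gamma, dtype=float)
    idx = np.arange(len(gamma))
    # |i - j| is the lag between positions i and j; use it to index into gamma
    return gamma[np.abs(idx[:, None] - idx[None, :])]

# Hypothetical autocovariance values, chosen only to make the pattern visible
G = stationary_cov_matrix([1.0, 0.5, 0.25, 0.125])
```

Here `G[i, j]` depends only on `abs(i - j)`, so shifting both indices by the same amount leaves the entry unchanged.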
Note that a strictly stationary time series with finite variance is always weakly stationary.
Moreover, if the distribution of $(X_{t_{1}}, \cdots, X_{t_{k}})$ is multivariate Gaussian for all $k \in \mathbb{N}$ and $t_{1}, \cdots, t_{k} \in \mathbb{Z}$, then weak stationarity implies strict stationarity (this follows since a multivariate Gaussian distribution is uniquely determined by its mean and covariance).
Examples¶
We look at the same examples as above and see whether they are (weakly) stationary.
- White noise $X_t = W_t$ with $W_t \sim WN(0, \sigma^2)$, that is $W_t$ is iid with distribution $F$ with mean 0 and variance $\sigma^2$. The expected value is - $$\mathbb{E}[X_t] = 0, \forall t \in \mathbb{Z}$$ - The variance is $\mathbb{E}[X_t^2] = \sigma^2$ for all $t$ and the auto-covariance is - $$ \gamma(t+h, t) = \begin{cases} 0 & h \neq 0 \\ \mathrm{Var}(X_t) = \sigma^2 & h = 0 \end{cases} $$ - We can thus write $\gamma(t+h, t) = \gamma(h)$ for all $t$ and white noise is (weakly) stationary. The ACF is given by - $$ \rho(h) = \frac{\gamma(h)}{\gamma(0)} = \begin{cases} 0 & h \neq 0 \\ 1 & h = 0 \end{cases}. $$ 
- The harmonic oscillator is in general not stationary as the mean function $\mu_t = \sum_{k = 1}^K \alpha_k \cos (\lambda_k t + \phi_k)$ is not constant in time. 
- Moving average. Say - $$X_t = W_t + \theta W_{t - 1}$$ - where $W_t \sim WN(0, \sigma^2)$. Then - $$\mathbb{E}[X_t] = 0$$ - and - $$ \gamma(t + h, t) = \mathrm{Cov}(X_t, X_{t + h}) = \mathbb{E}[X_t X_{t + h}] = \begin{cases} \sigma^2(1 + \theta^2) & h = 0 \\ \sigma^2 \theta & h \in \{-1, 1\} \\ 0 & |h| > 1 \end{cases}, $$ - and we can again write $\gamma(h) = \gamma(t + h, t)$ for all $t$, and the process is weakly stationary. The ACF is then - $$ \rho(h) = \frac{\gamma(h)}{\gamma(0)} = \begin{cases} 1 & h = 0 \\ \frac{\theta}{1 + \theta^2} & h \in \{-1, 1\} \\ 0 & |h| > 1 \end{cases}. $$ 
- AR(1) process. 
- Random Walk - $$X_t = \sum_{j = 1}^t W_j, W_j \sim WN(0, \sigma^2).$$ - First, - $$\mathbb{E}[X_t] = 0, \forall t \ge 1$$ - and second - $$\gamma(s, t) = \mathrm{Cov}(X_s, X_t) = \mathrm{Cov}\left( \sum_{j = 1}^s W_j, \sum_{j = 1}^t W_j \right) = \mathbb{E}\left[\sum_{j = 1}^{\min\{s, t\}} W_j^2 \right] = \min\{s, t\} \sigma^2.$$ - Since $\gamma(s, t)$ depends on $\min\{s, t\}$ and not only on $|s - t|$ (in particular, $\mathrm{Var}(X_t) = t \sigma^2$ grows with $t$), the Random Walk is not stationary. 
- ARCH model - $$X_t = \sqrt{1 + \phi X_{t - 1}^2} W_t, W_t \sim WN(0, \sigma^2).$$ - First, - $$\mathbb{E}(X_t) = 0, \forall t \in \mathbb{Z}.$$ - Second, for $0 \le \phi \sigma^2 < 1$, the process is weakly stationary since - $$\gamma(t, t+h) = \mathrm{Cov}(X_t, X_{t + h}) = 0, h \neq 0,$$ - and the variance $\gamma(t, t)$ is time-invariant with - $$\mathbb{E}[X_t^2] = \mathbb{E}[1 + \phi X_{t - 1}^2] \sigma^2 = \frac{\sigma^2}{1 - \phi \sigma^2}.$$ - The ACF is hence $\rho(h) = \mathbb{1}\{h = 0\}$, just as for a white noise process. Note, however, that while $X_{t - 1}$ and $X_t$ are uncorrelated, they are not independent. For example, $|X_{t - 1}|$ and $|X_t|$ or $X_{t - 1}^2$ and $X_t^2$ will in general have a positive correlation in this model. 
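The remark on the ARCH model (uncorrelated but not independent) can be checked numerically. The sketch below simulates one long path, assuming Gaussian noise, $\phi = 0.3$, $\sigma = 1$, a starting value of 0 and a burn-in of 1000 steps; all of these are illustrative choices, not values from the notes.

```python
import numpy as np

rng = np.random.default_rng(2)
n, phi, sigma = 100_000, 0.3, 1.0      # illustrative values with phi * sigma^2 < 1
w = rng.normal(0.0, sigma, size=n)

x = np.zeros(n)                        # X_0 = 0 is an assumed starting value
for t in range(1, n):
    x[t] = np.sqrt(1.0 + phi * x[t - 1] ** 2) * w[t]

x = x[1000:]                           # drop a burn-in so the starting value has little influence

var_theory = sigma**2 / (1.0 - phi * sigma**2)                # variance formula from the text
var_sample = x.var()
corr_levels = np.corrcoef(x[:-1], x[1:])[0, 1]                # lag-1 correlation of X_t
corr_squares = np.corrcoef(x[:-1] ** 2, x[1:] ** 2)[0, 1]     # lag-1 correlation of X_t^2
```

In such a run `corr_levels` should be close to zero while `corr_squares` is clearly positive, and `var_sample` should be near `var_theory`.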
While stationarity above refers to weak stationarity, all weakly stationary examples above are also strictly stationary.
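The closed-form results for the moving average and the random walk in the examples above can be compared with simulations. This is a sketch under assumed settings (Gaussian noise with $\sigma = 1$, $\theta = 0.6$, arbitrary sample sizes and seed); the helper `ma1_acf` is a hypothetical name introduced here.

```python
import numpy as np

def ma1_acf(theta, h):
    """Theoretical ACF of X_t = W_t + theta * W_{t-1} at lag h."""
    h = abs(h)
    if h == 0:
        return 1.0
    if h == 1:
        return theta / (1.0 + theta**2)
    return 0.0

rng = np.random.default_rng(3)

# MA(1): compare the formula with the lag-1 correlation of one long simulated path
theta, n = 0.6, 100_000
w = rng.normal(size=n + 1)
x = w[1:] + theta * w[:-1]
rho1_sample = np.corrcoef(x[:-1], x[1:])[0, 1]

# Random walk: compare gamma(s, t) = min(s, t) * sigma^2 with an estimate across many paths
n_paths, steps = 20_000, 10
walks = np.cumsum(rng.normal(size=(n_paths, steps)), axis=1)   # column j holds X_{j+1}
cov_3_7 = np.cov(walks[:, 2], walks[:, 6])[0, 1]               # estimate of gamma(3, 7), exact value 3
```

Note the different kinds of averaging: for the stationary MA(1) process a time average over one path is used, whereas for the non-stationary random walk the covariance is estimated across independent paths.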
Properties of the autocovariance for stationary time series¶
In general, for a stationary time series,
- The variance is given by $\gamma(0) = \mathbb{E}[(X_t - \mu)^2] \ge 0$. 
- $|\gamma(h)| \le \gamma(0)$ for all $h \in \mathbb{Z}$. This follows by Cauchy-Schwarz as - $$ \begin{align} |\gamma(h)| &= |\mathbb{E}[(X_t - \mu)(X_{t+h} - \mu)]| \\ &\le \left[ \mathbb{E}[(X_t - \mu)^2] \mathbb{E}[(X_{t + h} - \mu)^2] \right]^{\frac{1}{2}} \\ &= \left[\gamma(0)^2\right]^{\frac{1}{2}} = \gamma(0). \end{align}$$ 
- $\gamma(-h) = \gamma(h)$ follows trivially. 
- $\gamma$ is positive semi-definite, that is for all $a \in \mathbb{R}^n$ (and any choice of $n \in \mathbb{N}$), - $$\sum_{i,j=1}^n a_i \gamma(i - j) a_j \ge 0.$$ - Consider the variance of $(X_1, \cdots, X_n)a = \sum_{i = 1}^n a_i X_i$, where $a \in \mathbb{R}^n$ is a column vector: - $$0 \le \mathrm{Var}\left( \sum_{i = 1}^n a_i X_i \right) = \sum_{i,j = 1}^n a_i a_j \mathrm{Cov}(X_i, X_j) = \sum_{i,j=1}^n a_i \gamma(i - j) a_j,$$ - which completes the proof. 
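Positive semi-definiteness restricts which functions can be autocovariances at all. As an illustration, the sketch below checks the smallest eigenvalue of the matrix with entries $\gamma(i - j)$ for a valid MA(1) autocovariance and for a made-up candidate with lag-1 correlation 0.8 and zero correlation beyond. The function names, $n = 8$ and the numerical values are assumptions for this example.

```python
import numpy as np

def toeplitz_from(gamma):
    """Matrix with entries gamma(|i - j|) built from gamma(0), ..., gamma(n - 1)."""
    gamma = np.asarray(gamma, dtype=float)
    idx = np.arange(len(gamma))
    return gamma[np.abs(idx[:, None] - idx[None, :])]

def min_eigenvalue(gamma):
    """Smallest eigenvalue of the symmetric matrix with entries gamma(|i - j|)."""
    return float(np.linalg.eigvalsh(toeplitz_from(gamma)).min())

n = 8

# MA(1) with theta = 0.5 and sigma^2 = 1: gamma(0) = 1.25, gamma(1) = 0.5, zero beyond lag 1
gamma_ma1 = np.zeros(n)
gamma_ma1[0], gamma_ma1[1] = 1.25, 0.5

# Made-up candidate that is not an autocovariance: lag-1 correlation 0.8, zero beyond lag 1
gamma_bad = np.zeros(n)
gamma_bad[0], gamma_bad[1] = 1.0, 0.8
```

`min_eigenvalue(gamma_ma1)` is positive, whereas `min_eigenvalue(gamma_bad)` is negative, so some linear combination $\sum_i a_i X_i$ would need a negative variance and no stationary process can have the second function as its autocovariance.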
Estimating the auto-covariance¶
For observations $x_1, \cdots, x_n$ of a stationary time series, estimate the mean, auto-covariance and auto-correlation as follows
- Sample mean $\hat{\mu} = \bar{x} = \frac{1}{n} \sum_{i = 1}^n x_i$. 
- Sample auto-covariance function is, for $-n \le h \le n$ - $$\hat{\gamma}(h) = \frac{1}{n} \sum_{t = 1}^{n - |h|} (x_{t + |h|} - \bar{x})(x_t - \bar{x}),$$ - and set to 0 otherwise. 
- Sample auto-correlation is given by - $$\hat{\rho}(h) = \frac{\hat{\gamma}(h)}{\hat{\gamma}(0)}$$ 
Note that $\hat{\gamma}(h)$ is identical to the sample covariance of $(X_1, X_{1 + h}), \cdots, (X_{n - h}, X_n)$, except that we normalize by $n$ instead of $n - h$ to keep $\hat{\gamma}$ positive semi-definite.
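The estimators above translate almost directly into code. The following is a minimal NumPy sketch; the function names `sample_acvf` and `sample_acf` are chosen here for illustration.

```python
import numpy as np

def sample_acvf(x, h):
    """Sample auto-covariance at lag h, normalized by n (not n - |h|)."""
    x = np.asarray(x, dtype=float)
    n, h = len(x), abs(h)
    if h >= n:
        return 0.0                       # no overlapping pairs at this lag
    d = x - x.mean()                     # centre with the overall sample mean
    return float(np.sum(d[h:] * d[: n - h]) / n)

def sample_acf(x, h):
    """Sample auto-correlation at lag h."""
    return sample_acvf(x, h) / sample_acvf(x, 0)

x = [1.0, 2.0, 3.0, 4.0, 5.0]            # toy data, for illustration only
```

For the toy data, `sample_acvf(x, 0)` is 2.0, `sample_acvf(x, 1)` is 0.8 and `sample_acf(x, 1)` is 0.4. Because of the division by $n$, estimates at large lags are shrunk towards zero.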
Properties of the sample ACF¶
The four properties of the auto-covariance function also hold for the sample auto-covariance function:
- $\hat{\gamma}(-h) = \hat{\gamma}(h)$ holds trivially. 
- $\hat{\gamma}$ is positive semi-definite. - We can write - $$ \hat{\Gamma}_n = \begin{pmatrix} \hat{\gamma}(0) & \hat{\gamma}(1) & \hat{\gamma}(2) & \cdots & \hat{\gamma}(n - 1)\\ \hat{\gamma}(1) & \hat{\gamma}(0) & \hat{\gamma}(1) & \cdots & \hat{\gamma}(n - 2)\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \hat{\gamma}(n - 1) & \hat{\gamma}(n - 2) & \hat{\gamma}(n - 3) & \cdots & \hat{\gamma}(0)\\ \end{pmatrix} = \frac{1}{n}MM^T, $$ - where the $n \times (2n - 1)$ dimensional matrix $M$ is given by - $$ M = \begin{pmatrix} 0 & \cdots & \cdots & 0 & \tilde{X}_1 & \tilde{X}_2 & \tilde{X}_3 & \cdots & \tilde{X}_{n - 1} & \tilde{X}_{n}\\ 0 & \cdots & 0 & \tilde{X}_1 & \tilde{X}_2 & \tilde{X}_3 & \cdots & \cdots & \tilde{X}_{n} & 0\\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots\\ 0 & \tilde{X}_1 & \tilde{X}_2 & \tilde{X}_3 & \cdots & \tilde{X}_{n} & 0 & \cdots & \cdots & 0\\ \tilde{X}_1 & \tilde{X}_2 & \tilde{X}_3 & \cdots & \tilde{X}_{n} & 0 & \cdots & \cdots & 0 & 0\\ \end{pmatrix} $$ - where $\tilde{X}_t = X_t - \hat{\mu}$. Hence, for any $a \in \mathbb{R}^n$, - $$a^T \hat{\Gamma}_n a = \frac{1}{n} a^T M M^T a = \frac{1}{n} ||M^T a||_2^2 \ge 0,$$ - which completes the proof. 
- $\hat{\gamma}(0) \ge 0$ and $|\hat{\gamma}(h)| \le \hat{\gamma}(0)$ follow from the positive semi-definiteness of $\hat{\gamma}$. 
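The factorization used in the proof of positive semi-definiteness can be verified numerically. The sketch below builds the shifted matrix $M$ for a short data vector and compares $\frac{1}{n} M M^T$ with the matrix of sample autocovariances; the random data, the length 6 and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=6)          # any data vector works; random values used for illustration
n = len(x)
d = x - x.mean()                # centred values, the X-tilde of the text

# Row i (0-based) of M holds the centred data starting in column n - 1 - i, zeros elsewhere
M = np.zeros((n, 2 * n - 1))
for i in range(n):
    M[i, n - 1 - i : 2 * n - 1 - i] = d

# Sample autocovariances with normalization 1 / n, arranged as a matrix with entries gamma_hat(|i - j|)
gamma_hat = np.array([np.sum(d[h:] * d[: n - h]) / n for h in range(n)])
idx = np.arange(n)
Gamma_hat = gamma_hat[np.abs(idx[:, None] - idx[None, :])]

smallest_eig = np.linalg.eigvalsh(Gamma_hat).min()
```

`Gamma_hat` agrees with `M @ M.T / n` up to rounding, and `smallest_eig` is non-negative up to rounding, as the proof predicts.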
Transforming to stationarity¶
Several steps / strategies, not always in the same order:
1. Plot the time series: look for trends, seasonal components, step changes, outliers etc. 
2. Transform data so that residuals are stationary. 
   - Estimate and subtract trend $T_t$ and seasonal components $S_t$. 
   - Differencing. 
   - Nonlinear transformations (log, $\sqrt{\cdot}$, etc.). 
3. Fit a stationary model to the residuals. This yields an overall model for the data. 
For the first option in step 2 (estimating and subtracting trend and seasonal components), we can use non-parametric estimation (with large bandwidth) to get the trend $T_t$ and smoothing (with medium bandwidth) to get the seasonal component. The seasonal component can also be estimated as the empirical average of the detrended data in, for example, each calendar month (if the seasonal pattern repeats yearly).
For the second option in step 2 (differencing), define the lag-1 difference operator via
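One possible way to estimate trend and seasonal component is sketched below on synthetic monthly data. It uses a centred 12-month moving average as a simple stand-in for the non-parametric trend estimator (not necessarily the smoother meant in the course) and calendar-month averages of the detrended data for the seasonal component. The data-generating values, the seed and all variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(5)
n_years, period = 20, 12
n = n_years * period
t = np.arange(n)

# Synthetic monthly data: linear trend + yearly seasonal pattern + white noise (all values made up)
season_true = 2.0 * np.sin(2 * np.pi * np.arange(period) / period)
x = 0.05 * t + season_true[t % period] + rng.normal(0.0, 0.5, size=n)

# Trend: centred moving average over one full period (end weights halved because the period is even)
weights = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend = np.full(n, np.nan)               # undefined for the first and last half period
trend[period // 2 : n - period // 2] = np.convolve(x, weights, mode="valid")

# Seasonal component: average of the detrended values in each calendar month, centred to sum to zero
detrended = x - trend
season_hat = np.array([np.nanmean(detrended[m::period]) for m in range(period)])
season_hat -= season_hat.mean()

# Remaining series, a candidate for a stationary model (NaN where the trend is undefined)
residual = x - trend - season_hat[t % period]
```

Averaging over exactly one period removes a seasonal pattern that sums to zero and leaves a linear trend unchanged, which is why the moving average isolates the trend here.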
$$(\nabla X)_t = (1 - B)X_t = X_t - X_{t - 1},$$
where $B$ is the backshift operator defined via
$$(BX)_t = X_{t - 1}.$$
- For a linear trend, that is if - $$X_t = \mu + \beta t + N_t,$$ - with $N_t$ the noise process, we have - $$(1 - B)X_t = \beta + (1 - B)N_t.$$ - If the differenced noise $(1 - B)N_t$ is stationary, we can estimate the slope $\beta$ from data as the mean of the differenced time series $\nabla X$. 
- For a polynomial trend with noise, that is if - $$X_t = \sum_{j = 1}^k \beta_j t^j + N_t,$$ - difference $k$ times to get - $$\nabla^k X_t = (1 - B)^k X_t = k!\beta_k + (1 - B)^k N_t.$$ - If the $k$-times differenced noise $(1 - B)^k N_t$ is stationary, we can estimate $k!\beta_k$, and hence the highest-order coefficient $\beta_k$, from the mean of the $k$-times differenced time series. 
- For a seasonal variation of length $s$, define lag-$s$ differencing as - $$(1 - B^s)X_t = X_t - B^s X_t = X_t - X_{t - s},$$ - where $B^s$ is the backshift operator applied $s$ times. If - $$X_t = T_t + S_t + N_t,$$ - and $S_t$ has period $s$, then - $$(1 - B^s)X_t = T_t - T_{t - s} + (1 - B^s)N_t,$$ - so the seasonal component has been removed and we can then proceed as in the linear or polynomial trend case above, depending on the nature of the trend.
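The three differencing cases above can be tried out on synthetic data. The sketch below assumes Gaussian white noise for $N_t$ and made-up trend coefficients; the helper `difference` is a hypothetical name introduced for this example.

```python
import numpy as np

def difference(x, lag=1, times=1):
    """Apply (1 - B^lag) to x `times` times; each pass shortens the series by `lag` values."""
    x = np.asarray(x, dtype=float)
    for _ in range(times):
        x = x[lag:] - x[:-lag]
    return x

rng = np.random.default_rng(6)
t = np.arange(400)

# Linear trend: the mean of the differenced series estimates the slope beta (true value 0.2)
x_lin = 3.0 + 0.2 * t + rng.normal(size=t.size)
beta_hat = difference(x_lin).mean()

# Quadratic trend (k = 2): the mean of the twice-differenced series estimates 2! * beta_2 (true beta_2 = 0.05)
x_quad = 0.05 * t**2 + rng.normal(size=t.size)
beta2_hat = difference(x_quad, times=2).mean() / 2.0

# Purely seasonal series with period s = 12 and no noise: lag-12 differencing removes it
s = 12
x_seas = np.cos(2 * np.pi * t / s)
seas_diff = difference(x_seas, lag=s)
```

Here `beta_hat` and `beta2_hat` should be close to the true coefficients, and `seas_diff` is zero up to rounding. Note that differencing also changes the noise: even for white noise $N_t$, the differenced noise $(1 - B)N_t$ is correlated at lag 1.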