Timeseries Analysis
Table of Contents
Prerequisites
Moments
Moments of a distribution are just physics-envy jargon for the basic “dimensions” describing a distribution. Moments can be raw (taken about the origin, 0) or “centered” on the mean (which standardizes them for comparison). They are computed using integrals because you need a sense of how the probability mass of the function is distributed. They can be defined for both continuous and discrete distributions.
- The first “moment” is the center / mean of the distribution (Expectation, \(E[X]\))
- The second “moment” is the variance (\(E[(X-\mu)^{2}]\)), or how wide the distribution is. A wider spread always creates a greater variance, because larger deviations from the mean inflate the average squared value
- The third is skewness (\(S[X]=E[\frac{(X-\mu)^{3}}{\sigma^{3}}]\)), which describes the relative sizes of the tails (left skew = longer left tail), i.e. how symmetrical the distribution is
- The fourth is kurtosis, \(K[X] =E[\frac{(X-\mu)^{4}}{\sigma^{4}}]\), which measures how fat the tails are and how high the peak is
Formulas:
the \(l\)th moment: \(m'_{l} = E[X^{l}] = \int_{-\infty}^{\infty}x^{l}f(x)dx\)
the \(l\)th central moment: \(m_{l} = E[(X-\mu)^{l}] = \int_{-\infty}^{\infty}(x-\mu)^{l}f(x)dx\)
Statistical Estimation
Method of Moments - given data \(\{X_{1},...,X_{n}\}\):
- sample mean: \(\hat{\mu}=\bar{X} = \frac{\Sigma_{i=1}^{n}X_{i}}{n}\)
- sample variance: \(\hat{\sigma}^{2} = S^{2} = \frac{\Sigma_{i=1}^{n}(X_{i}-\hat{\mu})^{2}}{n-1}\)
- sample skewness: \(\hat{S}(X_{1},...,X_{n}) = \frac{\Sigma_{i=1}^{n}(X_{i}-\hat{\mu})^{3}}{(n-1)\hat{\sigma}^{3}}\)
- sample kurtosis: \(\hat{K}(X_{1},...,X_{n}) = \frac{\Sigma_{i=1}^{n}(X_{i}-\hat{\mu})^{4}}{(n-1)\hat{\sigma}^{4}}\)
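A quick R sketch of these estimators on a simulated sample (the data and variable names are just for illustration):
## method-of-moments estimates on a simulated sample
x = rnorm(200)                                         ## example data: 200 standard normal draws
n = length(x)
mu.hat = mean(x)                                       ## sample mean
sigma2.hat = sum((x - mu.hat)^2)/(n - 1)               ## sample variance
sigma.hat = sqrt(sigma2.hat)
skew.hat = sum((x - mu.hat)^3)/((n - 1)*sigma.hat^3)   ## sample skewness (approx. 0 for normal data)
kurt.hat = sum((x - mu.hat)^4)/((n - 1)*sigma.hat^4)   ## sample kurtosis (approx. 3 for normal data)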
Maximum Likelihood Estimation (MLE):
- tbd
Multiple Linear Regression
Models the linear relationship between the response variable and the predicting variables: \(Y_{i} = \beta_{0} + \beta_{1}x_{i,1} + ... + \beta_{p}x_{i,p} + \varepsilon_{i}\). The deviations (epsilons), or error terms, are the differences between the response variable and the linear function of the predictors.
Assumptions:
- Linearity / Mean 0 Assumption: \(E(\varepsilon_{i}) = 0\) - expected value of errors is 0 across all errors
- Constant Variance Assumption: \(Var(\varepsilon_{i}) = \sigma^{2}\) - the model is not more accurate for some parts than for others
- Independence Assumption: \({ \varepsilon_{1},...,\varepsilon_{n}}\) are independent random variables - the response variables are independently drawn from the data generating process
- for confidence intervals, hypothesis testing: \(\varepsilon_{i}\) ~ Normally distributed
Parameter Estimation: \((\beta_{0},\beta_{1},...,\beta_{p}),\sigma^{2}\)
- to estimate \((\beta_{0},\beta_{1},...,\beta_{p})\), find values of \((\hat{\beta_{0}},\hat{\beta_{1}},...,\hat{\beta_{p}})\) that minimize the squared error. the betas are just the coefficients of the line (hyperplane) equation
- minimize: \(\Sigma_{i=1}^{n} (y_{i} - \hat{y_{i}})^{2} = \Sigma_{i=1}^{n} (y_{i} - (\hat{\beta_{0}}+\hat{\beta_{1}}x_{i,1} + ... + \hat{\beta_{p}}x_{i,p}))^{2} = (y-X\hat{\beta})^{T}(y-X\hat{\beta})\)
- \(y\) is the stacked vector of responses, \(X\) is the design matrix where each column corresponds to one predicting variable, \(\beta\) is the stacked vector of parameters
- By linear algebra(Orthogonal Decomposition Theorem) or differentiation: \(X^{T}(y-\hat{y})=X^{T}(y-X\hat{\beta})=0\) so \(X^{T}X\hat{\beta}=X^{T}y\)
- if \(X^{T}X\) is invertible, \(\hat{\beta}=(X^{T}X)^{-1}X^{T}y\)
- if it’s not invertible, then we have a multicollinearity issue
Fitted Values:
- FVs (\(\hat{y}\)) are the predicted values that correspond to the input data. fitted values are derived from the linear model by replacing the true coefficients with the estimated ones. basically, it’s what the model would predict for a given set of input data. the differences between the response values and the fitted values are the residuals (error terms)
- \(\hat{y} = X\hat{\beta}\) and \(\hat{\beta}=(X^{T}X)^{-1}X^{T}y\)
- so \(\hat{y}=X\hat{\beta}=X(X^{T}X)^{-1}X^{T}y = Hy\)
- where \(H\equiv X(X^{T}X)^{-1}X^{T}\) is called the Hat Matrix because multiplying \(y\) by \(H\) gives \(\hat{y}\)
Residuals:
- i.e. the estimated errors: \(\hat{\varepsilon} = y - \hat{y} = y - X\hat{\beta} = y - Hy = (I - H)y\)
- to estimate \(\sigma^{2}\) (variance): \(\hat{\sigma}^{2} = \frac{\hat{\varepsilon}^{T}\hat{\varepsilon}}{(n-p-1)}\)
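A minimal R sketch of the matrix formulas above on simulated data (the names and dimensions are illustrative; lm() gives the same estimates):
## least squares by hand: beta-hat, hat matrix, fitted values, residuals, sigma^2
set.seed(1)
n = 100; p = 2
X = cbind(1, matrix(rnorm(n*p), n, p))       ## design matrix with an intercept column
y = drop(X %*% c(1, 2, -0.5)) + rnorm(n)     ## simulated response
beta.hat = solve(t(X) %*% X, t(X) %*% y)     ## (X'X)^{-1} X'y
H = X %*% solve(t(X) %*% X) %*% t(X)         ## hat matrix
y.hat = H %*% y                              ## fitted values
res = y - y.hat                              ## residuals
sigma2.hat = sum(res^2)/(n - p - 1)          ## estimate of the error variance
## check against lm(): coef(lm(y ~ X[,-1]))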
Model Interpretation:
- \(\hat{\beta}_{0}\) - the estimated expected value of the response variable when all the predictor variables are set to 0, also the intercept
- \(\hat{\beta}_{i}\) - the estimated expected change in the value of the response variable associated with one unit of change in the ith predictor variable, holding all other predictors fixed.
Sampling Distribution:
- ???
Confidence Intervals:
- using t-sampling distribution: \(\hat{\beta}_{j}\)
- tbd
Basic Stuff
- Stationarity - temporal dependence is similar across time, the mean, variance and autocorrelation are constant. this is an assumption for certain modeling techniques. after you remove trend and seasonality the residual process is often stationary
- Homoskedasticity - constant variance; Heteroskedasticity - non-constant variance
- Univariate Analysis:
- ARMA Model (Auto-Regressive Moving Average) - conditional mean model, stationary
- ARIMA (Auto-Regressive Integrated Moving Average) - extends ARMA to non-stationary series using differencing (when a series is non-stationary due to trend)
- Seasonal ARIMA
- GARCH (Generalized Autoregressive Conditional Heteroskedasticity) - conditional variance model, also stationary but with time-variant conditional variance (variance given past data will vary in time), also called volatility
- extended GARCH
- Joint ARIMA-GARCH
- Multivariate Analysis:
- VAR (Vector Autoregressive) model
- Granger Causality (lag between time series)
- State Space Model
- Functional Data Analysis
Module 1
Time Series Basics:
- Data: \(Y_{t}\) where t indexes time
- Model: \(Y_{t}=m_{t}+s_{t}+X_{t}\)
- m is a trend component, it doesn’t capture large variations, only smooth gradual overall changes
- s is a seasonality component with known periodicity d \((s_{t}=s_{t+d})\) such that \(\Sigma_{j=1}^{d}s_{j}=0\)
- \(X_{t}\) is a stationary component; its probability distribution doesn’t change with time
- approach: \(m_{t}\) and \(s_{t}\) are first estimated and subtracted from \(Y_{t}\), resulting in \(X_{t}\)
Trend Estimation
- Estimate and remove, or difference the data to remove it directly?
- estimation methods:
- moving average
- parametric regression (linear, quadratic, etc)
- non-parametric regression
Moving Average (non-parametric, the simplest version of kernel smoothing):
- Estimate the trend \(m\) for time \(t\) (\(m_{t}\)) with a moving window of \(d\),
- by taking the average of the time values within the neighborhood of t
- (q is the distance on either side of t)
- if d is even, then d = 2q, use: \(\hat{m_{t}} = \frac{1}{d}[\frac{x_{t-q}}{2}+x_{t-q+1}+x_{t-q+2}+...+\frac{x_{t+q}}{2}]\)
- if odd then d = 2q+1, use: \(\hat{m_{t}} = \frac{1}{d}\Sigma_{j=-q}^{q}x_{t+j}\)
- bias-variance trade-off:
- if the window width is large, then the trend estimate is smooth (low variance, but potentially higher bias)
- if the width is small, then the trend estimate is not smooth (low bias, but higher variance)
Parametric Regression
- estimate the trend \(m_{t}\) assuming a polynomial in t
- formula: \(m_{t} = \beta_{0} + \beta_{1}t +\beta_{2}t^{2} +...+\beta_{p}t^{p}\)
- usually use a small polynomial order (p=1 or 2 at most); for higher-order polynomials it’s better to use non-parametric methods, especially if p isn’t known in advance
- estimation approach: fit a linear regression model where the predicting variables are \((t, t^{2},...,t^{p})\)
- if the polynomial order is large, can use model selection to select among predicting variables (caution: strong correlation among predicting variables since they’re all powers of t)
Non-Parametric Regression:
- Kernel Regression: \(m_{t} = m(t) = \Sigma_{i=1}^{n}l_{i}(t)X_{t_{i}}\) where \(l_{i}(t)\) is a weight function depending on a kernel function.
- Local Polynomial regression: an extension of the kernel regression and the polynomial regression - fit a polynomial within a width of a data point
- Other approaches: splines regression, wavelets, orthogonal basis function decomposition
- Which one to choose?
- Local polynomial regression is preferred over kernel regression since it overcomes boundary problems, and its performance does not depend on the design of the time points
- other methods depending on the level of smoothness of the function you are estimating
- for estimating trend, local polynomial or splines will work best in most cases
Trend Code
Reading Data In, plotting time series:
data = read.table("avTempAtlanta.txt", header=T)
names(data) ## "year", "jan", ... , "dec", "annual"
head(data)
## get all temps in one vector:
## select only month columns (exclude year and annual),
## as.vector takes a matrix of columns and makes into one vector,
## so you have to transpose columns so it takes the rows of
## temp data and sews it together seamlessly
temp = as.vector(t(data[,-c(1,14)]))
temp = ts(temp, start = 1879, frequency = 12) ##convert vector into time series
ts.plot(temp, ylab="Temperature")
Moving Average in R:
## create equally spaced time points
time.pts = c(1:length(temp))
time.pts = c(time.pts-min(time.pts))/max(time.pts)
## fit a moving average
mav.fit = ksmooth (time.pts, temp, kernel = "box")
temp.fit.mav = ts(mav.fit$y, start=1879, frequency = 12)
## is there a trend?
ts.plot(temp,ylab="Temperature") # plot data
lines(temp.fit.mav, lwd=2,col="purple") ## plot trend line
abline(temp.fit.mav[1],0,lwd=2,col="blue") ## plot horizontal line for comparison
Parametric Regression:
## fit a parametric quadratic polynomial
x1 = time.pts
x2 = time.pts^2
lm.fit = lm(temp~x1+x2)
summary(lm.fit)
## Is there a trend?
temp.fit.lm = ts(fitted(lm.fit), start=1879, frequency=12)
ts.plot(temp, ylab="Temperature")
lines(temp.fit.lm,lwd=2,col="green")
abline(temp.fit.mav[1],0,lwd=2,col="blue")
Local Polynomial Regression:
loc.fit = loess(temp~time.pts)
temp.fit.loc = ts(fitted(loc.fit),start=1879, frequency=12)
## splines trend estimation
library(mgcv)
gam.fit = gam(temp~s(time.pts)) ##transform to splines fit, otherwise it will do a linear model
temp.fit.gam = ts(fitted(gam.fit),start=1879,frequency=12)
## is there a trend?
ts.plot(temp, ylab="Temperature")
lines(temp.fit.loc,lwd=2,col="brown")
lines(temp.fit.gam,lwd=2,col="red")
abline(temp.fit.loc[1],0,lwd=2,col="blue")
Compare all estimated trends:
all.val = c(temp.fit.mav, temp.fit.lm, temp.fit.gam, temp.fit.loc)
ylim = c(min(all.val),max(all.val))
ts.plot(temp.fit.lm, lwd=2, col="green", ylim=ylim, ylab="Temperature")
lines(temp.fit.mav, lwd=2,col="purple") ## plot trend line
lines(temp.fit.gam,lwd=2,col="red")
lines(temp.fit.loc,lwd=2,col="brown")
legend(x=1900, y=64, legend=c("MAV", "LM", "GAM", "LOESS"),lty=1,col=c("purple", "green", "red", "brown"))
Seasonality Estimation
Seasonality is a regular and predictable change that recurs with a specific known period, such as weekly, monthly or yearly. Periodicity refers more generally to any type of periodic pattern (the two are used interchangeably in the course). Cycles refer to changes that don’t happen with a regular period and are hence not predictable.
Estimation methods:
- Seasonality Average:
- assume d data points for # of periods in each seasonal cycle (i.e. 12 for monthly data)
- then you compute the average “weights” of a given time in the cycle, averaged across all cycles in data (for 3 years, average the first month of each year)
- example: \(w_{1} = (Y_{1}+Y_{13}+Y_{25})/3\), \(w_{2} = (Y_{2}+Y_{14}+Y_{26})/3\)
- continue to do so for all weights, then average all 12 weights to get the corresponding value \(a = (w_{1} + ... + w_{12})/12\)
- the seasonal effect (by periodic ordinal) is equal to \(w_{k}\) minus the average of all weights. example: \(s_{1} = w_{1} - a\)
- for \(k=1,2,...,d\) compute the average \(w_{k}\) of: {\(Y_{k+jd}\), with k+jd in the time domain }
- then: \(\hat{s}_{k}=w_{k} - \frac{1}{d}\Sigma_{j=1}^{d}w_{j}\)
- in plain language, the seasonal effect (by periodic ordinal) is that position’s average minus the overall average of all the \(w_{j}\)
- Parametric regression:
- fit a mean for each seasonality group (such as month) using linear regression
- use a cosine-sine curve to fit seasonal component
- Seasonal Means Model (can do both seasonality and trend):
- model: \(Y_{t}=\mu+s_{t}+X_{t}\) with \(\Sigma_{j=1}^{d}s_{j}=0\)
- approach: fit a mean for each seasonality group using linear regression
- basically an ANOVA model - group by periodicity group, treating seasonality as a categorical variable: group k: \(Y_{t}\) for \(t=k+jd\)
- Dummy variables: \(C_{k}=1\) if \(t=k+jd\) and 0 otherwise
- for example, for monthly data spanning 3 years, you’ll have 36-element column vectors of 1s and 0s, 12 dummy variables total (one for each month); then you fit a regression model with the 12 dummy variables as the predictors
- fit a linear regression model with d-1 dummy variables (with intercept) or d dummy variables (without intercept) - because the variables are linearly dependent
- if the periodicity group is much longer (like 52 weeks in a year), then the model will have a huge number of regression coefficients. we generally prefer models with fewer coefficients, so it may be better to use the alternative method below ->
- Cosine-Sine Model:
- model: \(Y_{t} = \mu + s_{t} + X_{t}\) with \(\Sigma_{j=1}^{d}s_{j}=0\)
- approach: assume a cosine curve \(s_{t} = \beta cos(2\pi ft + \phi)\) for seasonality, specified by three parameters:
- amplitude = \(\beta\)
- frequency (1/f is the period) = \(f\)
- phase = \(\phi\) - sets the arbitrary origin on the time axis
- amplitude and phase do not enter the expression linearly so it’s inconvenient to estimate them, but you can reparameterize them using trig identities:
- like so: \(s_{t} = \beta cos(2\pi ft + \phi) = \beta_{1}cos(2\pi ft)-\beta_{2}sin(2\pi ft)\)
- with \(\beta_{1} = \beta cos(\phi)\) and \(\beta_{2} = \beta sin(\phi)\)
- fit a linear regression: \(Y_{t}\) ~ \(\beta_{1}cos(2\pi ft)-\beta_{2}sin(2\pi ft)\) where B1 and B2 are regression coefficients
- if seasonality has multiple frequencies (e.g. month, week), we can use different values of f (2 predicting variables for each f)
- when we fit the cosine curve, we don’t know the amplitude and phase, we use the reparameterization into a linear combination of sine and cosine with frequency 1/period (e.g. 1/12 for monthly)
- Vars: \(X_{1}=cos(2\pi ft), X_{2}=sin(2\pi ft), f=1/12\)
- Regression: \(Y\) ~ \(X_{1}, X_{2}\)
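A manual version of the cosine-sine fit in R, using the monthly temp series from the Trend Code section above (the dynlm/harmon code later does the same thing; the variable names here are just illustrative):
## build the cosine and sine predictors for f = 1/12 and fit by ordinary regression
tt = 1:length(temp)
X1 = cos(2*pi*tt/12)
X2 = sin(2*pi*tt/12)
cs.fit = lm(temp ~ X1 + X2)
summary(cs.fit)       ## the coefficients on X1 and X2 estimate beta_1 and beta_2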
Removing Trend & Seasonality
- Step 1: estimate the trend \(\hat{m}_{t}\) for \(t=q + 1 ...n-q\)
- Step 2: estimate seasonal components:
- for k = 1,2,…d, compute the average \(w_{k}\) of: {\(x_{k+jd}-\hat{m}_{k+jd}, q < k + jd \leq n-q\)}
- then \(\hat{s}_{k} = w_{k} - \frac{1}{d}\Sigma_{j=1}^{d}w_{j}\)
- Step 3: re-estimate the trend from the “deseasonalized data” \(d_{t} = X_{t} - \hat{s}_{t}\)
- a new set of estimates \(\hat{m}_{t}\) of the trend based on the deseasonalized data
- Since using regression we can simultaneously do the seasonality and trend:
- Seasonality: set the predicting variables
- dummy variables for the seasonal effects (ANOVA)
- cosine and sine variables
- Trend: set the approach
- Parametric Regression - Polynomial predicting variables
- nonparametric regression
- Joint Modeling:
- linear regression: seasonality predicting variables and polynomial predicting variables in t
- semiparametric regression: nonparametric model for the trend and linear model predicting variables for seasonality
- Differencing to Remove Trend & Seasonality:
- define \(\nabla_{d} Y_{t} = Y_{t} - Y_{t -d} = (1-B^{d})Y_{t}\)
- then apply the differencing operator:
\(\nabla_{d} Y_{t}\) = \(\nabla_{d} m_{t} + \nabla_{d} s_{t} + \nabla_{d} X_{t}\)
= \(m_{t} - m_{t-d} + s_{t} - s_{t-d} + X_{t} - X_{t-d}\)
= \(m_{t} - m_{t-d} + X_{t} - X_{t-d}\)
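A short R sketch of differencing applied to the temp series (lag 1 targets the trend, lag 12 the monthly seasonality):
## differencing with the diff() function
d.trend  = diff(temp, lag = 1)               ## removes a (locally linear) trend
d.season = diff(temp, lag = 12)              ## removes the monthly seasonal component
d.both   = diff(diff(temp, lag = 12))        ## remove both
ts.plot(d.both, ylab = "Differenced series")
acf(d.both, lag.max = 48)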
Seasonality Code
Seasonality: Seasonal Models
library(dynlm)
## Seasonal Means Model
## Drop january ( with intercept)
model1 = dynlm(temp~season(temp))
summary(model1)
##Seasonal means effect (without intercept)
model2 = dynlm(temp~ season(temp)-1)
summary(model2)
## Cosine-sine model
model3 = dynlm(temp~harmon(temp))
summary(model3)
model4 = dynlm(temp~harmon(temp,2)) ## adding harmonics to add to the explanatory power and increase r2
summary(model4)
Compare Models:
## Seasonal Means Model
st1 = coef(model2)
## Cosine-sine model
st2 = fitted(model4)[1:12]
## Compare Seasonality Estimates
plot(1:12,st1,lwd=2, type="l",xlab="Month",ylab="Seasonality")
lines(1:12, st2, lwd=2, col="brown")
Seasonality & Trend: Parametric Model
## fit parametric model for both trend and seasonality
x1 = time.pts
x2 = time.pts^2
lm.fit=dynlm(temp~x1+x2+harmon(temp,2))
summary(lm.fit)
dif.fit.lm = ts((temp-fitted(lm.fit)), start=1879, frequency=12)
ts.plot(dif.fit.lm,ylab="Residual Process")
Seasonality & Trend: Nonparametric Model
- the gam function in R allows for fitting non-parametric and linear components jointly
- we specify that the trend in time is non-parametric, while not transforming the seasonality. we add the harmonics to the non-parametric trend
- the harmonic function in this code is part of the TSA library
## fit nonparametric model for both trend and seasonality
library(TSA) ## provides the harmonic() function
har2 = harmonic(temp,2)
gam.fit=gam(temp~s(time.pts)+har2)
dif.fit.gam = ts((temp-fitted(gam.fit)),start=1879,frequency=12)
## Compare approaches:
ts.plot(dif.fit.lm,ylab="residual process", col="brown")
lines(dif.fit.gam,col="blue")
Stationarity
- \(X_{t}\) is generally considered to be the stationary component after the trend and seasonality have been removed
- Intuitive understanding: stability over time, constant dependence pattern. somewhat predictable in the short term. the only possible patterns in it are seemingly random and unpredictable
- Removing trend and seasonality does not necessarily make a time series stationary, especially if the process itself is nonstationary (e.g. a random walk). there also could be residual trend or seasonality left in the residual process
- How to Evaluate Stationarity?
- the autocovariance function \(\gamma_{x}(r,s)\) quantifies the linear relationship between the process at two time points.
- if the process is stationary, then \(\gamma_{x}(r,s)\) depends only on the time difference |r-s|, not on the specific values of r and s.
- {X_t} is (weakly) stationary if:
- \(E[X_{t}] = m\) for all \(t \in \mathbb{Z}\) - constant mean
- \(V[X_{t}] < \infty\) for all \(t \in \mathbb{Z}\) - finite variance (second moment)
- \(\gamma_{x}(r,s) = \gamma_{x}(r+t, s+t)\) for all \(r,s,t \in \mathbb{Z}\) - constant autocovariance - autocovariance does not change when shifted in time (i.e. the dependence between \(X_{r}\) and \(X_{s}\) is the same for when it’s shifted \(X_{r+t}, X_{s+t}\)). it only depends on lag, not on time. unconditional constant variance over time
- it’s weak because the conditions are first and second-order properties of the stochastic process, but it’s possible for the higher moments to vary with time. there exists a more strict definition of stationarity
- if the auto-covariance function of the time series depends on the lag only and if it decays fast to 0, then conditions 1 and 3 hold, this is the pattern we seek for stationarity
- condition 2 can only be evaluated by plotting the time series and looking for any extreme change point in the data
- Examples of stationary time series:
- White noise process - assumes uncorrelated data
- IID (independent and identically distributed) with finite second moment - assumes independent data
- difference? - independence implies uncorrelatedness but not vice versa
- Example of non-stationary time series:
- random walk, not stationary because \(V(S_{t}) = t \sigma^{2}\)
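A simulation sketch contrasting the two cases (white noise is stationary; the cumulative-sum random walk is not):
## white noise vs random walk
set.seed(2)
z = rnorm(500)                ## white noise: constant mean and variance
s = cumsum(z)                 ## random walk: variance grows like t*sigma^2
par(mfrow = c(2, 2))
ts.plot(z); acf(z)            ## ACF dies out immediately
ts.plot(s); acf(s)            ## ACF decays very slowly: nonstationary pattern
par(mfrow = c(1, 1))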
Autocovariance:
- the covariance of any two variables of the stochastic process generating the time series. it is unknown but can be estimated
- for example if we consider time points r and s, the autocovariance of X_r and X_s is the expectation of the mean-centered variables X_r and X_s
- auto-covariance of a time series: {\(X_{t},t \in \mathbb{Z}\)}:
\(\gamma_{x}(r,s) = E[(X_{r} - E[X_{r}]) \cdot (X_{s} - E[X_{s}])]\). The notation is basically logic notation for a time series: \(X_{t}\) is a sequence of random variables indexed by t, and \(t \in \mathbb{Z}\) means the time series is defined for all integer values of t (it could hypothetically extend infinitely into both past and future); the whole “phrase” represents a discrete-time stochastic process
Autodependence
- time-dependence means the variables making up a time series are correlated in time
- also referred to as “serial correlation”
- serial correlation in a univariate time series is measured by auto-covariance, meaning the degree of similarity between the time series observed at a given time point and a lagged version of itself
- auto-dependence is when there is time dependence between variables in a univariate time series (a sequence of measurements of a single variable)
- direct dependence = what happens at time t directly impacts what happens at a different time s
- indirect dependence - what happens at time t indirectly impacts what happens at time s through other data in the time series (like skipping a chain link)
- Marginal regression:
- \(X_{t}\) ~ \(X_{t - 1}\) direct
- \(X_{t}\) ~ \(X_{t - 2}\) indirect
- Conditional Regression:
- \(X_{t}\) ~ \(X_{t - 1} + X_{t - 2}\) direct relationship between \(X_{t}\) and \(X_{t - 2}\) given \(X_{t - 1}\)
- Autocorrelation is a scaled autocovariance: \(\rho_{x}(r,s) = \frac{\gamma_{x}(r,s)}{\sqrt{V[X_{r}]V[X_{s}]}}\)
- it means dividing the covariance of the two variables by the product of their standard deviations
- because it’s scaled, it’s between -1 and 1, so it’s easier to interpret than auto-covariance
Autocovariance Function (ACF)
- Autocovariance (or autocorrelation) Function (ACF) is a measure of direct and indirect dependence that, by assumption, depends only on the lag \(h\) between two variables in the time series
- applies to stationary processes only
- for a stationary time series {\(X_{t}\)}, the ACF is: \(\gamma_{x}(h) = Cov(X_{t+h}, X_{t})\)
- properties of ACF:
- \(\gamma_{x}(0) \geq 0\) - autocovariance of lag 0 (variance of x_t) is positive
- \(\mid \gamma_{x}(h)\mid \leq \gamma_{x}(0)\) - the absolute value of the autocovariance at lag h is bounded by the variance of \(X_{t}\)
- \(\gamma_{x}(h) = \gamma_{x}(-h)\) - the autocovariance function is symmetric, so the autocovariance at lag h is the same as at lag -h
- autocorrelation function of a stationary time series is \(\rho_{x}(h)=\frac{\gamma_{x}(h)}{\gamma_{x}(0)}\) - the autocovariance at lag h divided by the autocovariance at lag 0. it carries the same information, but because of the rescaling, \(\rho_{x}(0)=1\) (the autocorrelation at lag 0 equals 1)
Partial Autocovariance Function (PACF)
- \(\alpha_{x}(h)\) - measure of direct dependence of two vars that are \(h\) lag apart (conditional on the in-between vars so it can measure direct dependence)
- whereas the ACF is the marginal dependence not controlling for the in-between variables
- assume a stationary time series \(\{X_{t}\}\) where the autocovariance function decays to 0 as \(h\) increases: \(\gamma_{x}(h) \rightarrow 0\) as \(h \rightarrow \infty\)
- derivation process:
- predict \(X_{t}\) using Multiple linear regression model
- regress \(X_{t}\) ~ \(X_{t-1}+X_{t-2}+...+X_{t-h}\)
- estimate the regression coefficients: \(\alpha_{1}\) (for \(X_{t-1}\)), \(\alpha_{2}\) (for \(X_{t-2}\)),…,\(\alpha_{h}\) (for \(X_{t-h}\))
- PACF at lag \(h\) is \(\alpha_{h}\)
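A sketch of this derivation in R on a simulated AR(1) series (the lag h and the AR coefficient are arbitrary choices), compared against the built-in pacf():
## PACF at lag h as the last coefficient of the conditional regression
set.seed(3)
x.sim = arima.sim(model = list(ar = 0.6), n = 500)
h = 3
lag.mat = embed(as.numeric(x.sim), h + 1)           ## columns: x_t, x_{t-1}, ..., x_{t-h}
fit = lm(lag.mat[, 1] ~ lag.mat[, -1])
coef(fit)[h + 1]                                    ## alpha_h: coefficient on x_{t-h}
pacf(x.sim, plot = FALSE)$acf[h]                    ## built-in estimate, for comparison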
Estimating Autocovariance:
- given {\(x_{1},...,x_{n}\)} observations of a stationary time series {\(X_{t}\)}, to estimate the autocovariance \(\gamma_{x}(\cdot)\) of {\(X_{t}\)}:
- the sample autocovariance function of lag h is the sum of the products between any two mean-centered observations of the time series observed within a lag h
- \(\hat{\gamma}_{x}(h) = \frac{1}{n}\Sigma_{j=1}^{n-h}(x_{j+h} - \bar{x})(x_{j}-\bar{x})\), where \(0 \leq h < n\)
- with \(\hat{\gamma}_{x}(h) = \hat{\gamma}_{x}(-h)\) , \(-n \leq h \leq 0\) , where \(\bar{x}=\frac{1}{n}\Sigma_{j=1}^{n}x_{j}\)
- the sample autocorrelation is simply the sample autocovariance at h divided by the sample autocovariance at 0: \(\hat{\rho}_{x}(h) = \frac{\hat{\gamma}_{x}(h)}{\hat{\gamma}_{x}(0)}\) , where \(\mid h\mid < n\)
- Don’t forget about uncertainty! the sample autocorrelation is not the same thing as the actual autocorrelation, it’s only an estimate given the data. the true parameters are actually unknown for a stochastic time series
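The sample autocovariance written out directly in R on the temp series, checked against acf() with type = "covariance" (h = 2 is an arbitrary choice):
## sample autocovariance at lag h
h = 2
x = as.numeric(temp); n = length(x); xbar = mean(x)
gamma.h = sum((x[(1 + h):n] - xbar) * (x[1:(n - h)] - xbar))/n
gamma.h
acf(temp, lag.max = h, type = "covariance", plot = FALSE)$acf[h + 1]   ## same value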
Autocovariance and Stationarity
- Autocovariance applies generally to all time series, both stationary and nonstationary processes: \(Cov(X_{t+h}, X_{t})\)
- for stationary time series it only depends on lag, for nonstationary it can depend on time and lag
- the autocovariance function (as a function of the lag only) is defined only for stationary time series,
- but sample autocovariance/estimated ACF can be used to evaluate both stationary and nonstationary time series for stationarity
ACF Plot
- \(\hat{\gamma}_{x}(h)\) or \(\hat{\rho}_{x}(h)\) plotted against h
- since stationarity is the same for h and -h, we only need to plot positive values of h
- Plausible Stationarity: ACF plot displays large value for some small lag values h, then quickly goes to zero
- Non-stationarity: ACF plot displays a seasonal pattern or it does not decay fast to zero
- the lack of non-stationary pattern in the acf plot does not necessarily guarantee stationarity of the time series
- therefore it’s important to complement the ACF plot with the raw timeseries plot as well as hypothesis testing, etc.
ACF Tests
- pay attention to null vs alt hypothesis
- Unit root tests (e.g. Dickey-Fuller or Augmented Dickey-Fuller (ADF) tests)
- H_0: nonstationarity vs H_A: stationarity
- interpretation: plausible stationarity if rejecting the null
- KPSS Test (Kwiatkowski-Phillips-Schmidt-Shin)
- H_0: stationary or trend stationarity vs HA: nonstationarity
- interpretation: plausible stationarity if FAIL to reject null
Stationarity Code
Sample ACF plots: acf() plots the sample autocorrelation by default, but can also be used for the sample autocovariance (type = "covariance")
## ACF for the temp time series
acf(temp, lag.max=12*4, main="")
## ACF for the residual process
acf(dif.fit.lm, lag.max=12*4, main="") ## linear model
acf(dif.fit.gam, lag.max=12*4, main="") ## linear model and splines for trend
Testing for Stationarity
## ADF test applied to residual process (adf.test and kpss.test are in the tseries package)
library(tseries)
adf.test(dif.fit.gam)
#KPSS test applied to the residual process
kpss.test(dif.fit.gam)
Linear Processes & Prediction
- A time series {X_t} is a linear process if it can be represented as a linear combination of a white noise process, where the sum of the absolute value of the coefficients in the linear combination is finite
- \(X_{t} = \Sigma_{j=-\infty}^{\infty}\psi_{j}Z_{t-j}\) , where \(\{Z_{t}\}\) ~ \(WN(0,\sigma^{2})\) and \(\{\psi_{j}\}\) is a sequence of constants with \(\Sigma_{j=-\infty}^{\infty} \mid \psi_{j}\mid < \infty\)
- Moving Average - special case of linear process: if \(\psi_{j} = 0\) for all j < 0 then {X_t} is called a moving average or \(MA(\infty)\) process. this is the case we will assume in the course since it involves only past and current shocks
- Linearity Assumption: prediction relies on this
- Parsimonious Representation: use fewer (finite number) relationships with past and current random shocks, resulting in the ARMA process
- finite sum across all relationships is interpreted as “stability” over time - meaning reliable predictions
- An AR(1) process (autoregression 1 process) is defined to be the stationary solution of:
\(X_{t}-\phi X_{t-1} = Z_{t}\) where {Z_t} ~ \(WN(0,\sigma^{2}), \mid \phi \mid < 1\) and Z_t is uncorrelated with X_s for all s < t
- if \(\mid \phi \mid < 1\), the AR(1) can be shown to be written equivalently as a linear process
- if \(\mid \phi \mid \geq 1\), the AR(1) is not a linear process
- AR(1) does not look like a linear process representation
- not all linear representations are linear processes but they can be translated into a linear process
- In certain cases (eg some AR(1) processes), a process can be stationary while having significant measured ACF lags.
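A quick simulation of a causal AR(1) (phi = 0.8, an arbitrary choice) showing that a stationary process can still have many significant ACF lags:
## stationary AR(1): geometric ACF decay, PACF cut-off after lag 1
set.seed(4)
x.ar1 = arima.sim(model = list(ar = 0.8), n = 500)
acf(x.ar1, lag.max = 30)
pacf(x.ar1, lag.max = 30)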
Prediction for a Stationary Time Series
- Data = {X_t} a linear process
- Short term prediction relies on the linearity assumption
- Objective: Predict \(X_{n+h}\) in terms of values {X_1,…,X_n}
- \(P_{n}X_{n+h}\) defines the prediction at \(h\) time points ahead (called h-step ahead prediction) given that the time series has been observed up to time point ‘n’
- the best linear unbiased predictor (BLUP) of \(X_{n+h}\) in terms of {X_1,…,X_n} is defined to be a linear combination of the n past data points with different weights given by the vector of coefficients \(a_{0}\) to \(a_{n}\):
- \(P_{n}X_{n+h} = a_{0} + a_{1}X_{n} + a_{2}X_{n-1} + ... + a_{n}X_{1}\) which minimises mean squared error (MSE)
- in this formula, a_1 is the coeff for the most recent data point in the time series, a2 is for the next most recent, and so on.
Best Linear Predictor
- Data {X_t} a linear process with E[X_t] = 0
- if E[X_t] is non-zero, we can remove the constant mean by subtracting the average across all time series values
- the BLP of \(X_{n+h}\) in terms of {X_1,…,X_n}
- is the linear combination of the data up to point n: \(P_{n}X_{n+h} = \Sigma_{i=1}^{n}a_{i}X_{n+1-i}\)
- where alpha_n is the vector of coefficients in the linear combination of past data in the linear prediction: \(\alpha_{n} = (a_{1},...,a_{n})^{T}\)
- the coefficients of alpha_n are directly related to the autocovariance function of the time series through \(\Gamma_{n}\alpha_{n}=\gamma_{n}(h)\)
- with \(\Gamma_{n} = [\gamma_{x}(i-j)]_{i,j=1,...,n}\)
\(\Gamma_{n}\) is a matrix of the autocovariance function at i minus j, where i and j take values from 1 to n. It is an n×n matrix whose diagonal holds the unconditional variance of the time series (the autocovariance function at h=0). Each row of this matrix is a combination of the autocovariance function at lags 0 up to n-1 in a different order. For example, the first row consists of the ACF at h=0, h=1, up to h=n-1 in this exact order; the 2nd row is the ACF at h=1, h=0, h=1, h=2 up to h=n-2; and so on.
- and \(\gamma_{n}(h)=(\gamma_{x}(h),\gamma_{x}(h+1),...,\gamma_{x}(h+n-1))^{T}\)
- the MSE is \(E[(X_{n+h}-P_{n}X_{n+h})^{2}]=\gamma_{x}(0)-\alpha_{n}^{T}\gamma_{n}(h)\)
- ** add more gamma stuff here
- the Gamma matrix can get very large, but there are recursive methods (the algorithms below) that speed up solving the system and incorporating new data
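A sketch of the best linear predictor computed from sample autocovariances (one-step-ahead, using the last 20 points of a simulated mean-zero series; the series, sizes, and coefficients are illustrative):
## best linear predictor: solve Gamma_n alpha_n = gamma_n(h) for h = 1
set.seed(5)
x = as.numeric(arima.sim(model = list(ar = 0.8), n = 300))
m = 20                                                           ## number of past points used
g = acf(x, lag.max = m, type = "covariance", plot = FALSE)$acf   ## gamma(0), ..., gamma(m)
Gam = toeplitz(g[1:m])                                           ## Gamma_n = [gamma(i - j)]
a = solve(Gam, g[2:(m + 1)])                                     ## alpha_n
pred = sum(a * rev(tail(x, m)))                                  ## P_n X_{n+1} (mean-zero series)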
Durbin-Levinson Algorithm
- tbd
Innovations Algorithm
- tbd
Module 2
- ARMA Model (Autoregressive Moving Average)
- Concept: the observed time series is just one realization of a stochastic process, so it’s practically just one “observation” regardless of the length of the timeseries
- a time series is an ARMA(p,q) process if {X_t}:
- it’s weakly stationary
- \(Z_{t}\) ~ \(WN(0,\sigma^{2})\)
- such that \(X_{t}-\phi_{1}X_{t-1} - ... -\phi_{p}X_{t-p} = Z_{t} - \theta_{1}Z_{t-1}-...-\theta_{q}Z_{t-q}\)
- it consists of two parts: the left half is the AR (auto-regressive) part modeling the relationships between X_t and past or lagged variables. in this formulation, \(p\) is called the autoregressive order of the model. this is basically an intuitive way to model a time series on past data
- the second half is the MA (moving average). it’s a linear combination of a white noise process, (a sequence of shocks), because it’s assumed that the shocks are uncorrelated (not necessarily independent), which guarantees the stationarity of the ARMA process
- using the MA model we can model a time series at point t to not only be affected by variations (shocks) at time t, but other times before it as well
- the ARMA model is specified by two orders. it can include one or both.
- AR of order P: dependence and predictability based on the past data, assuming that the future will resemble the past. the dependence is up to p timepoints in the past
- MA of order q: a linear combination of random past shocks (AKA innovations) originating from an unknown white noise process we can’t measure. the process Z_t remains unknown, regardless of how much data we have.
- The MA model is similar to the AR model except that instead of being a linear combination of past data, it’s a linear combination of past white noise terms. the MA model “sees” such shocks directly at each current value of the model, whereas in an AR(p) model, the shocks are only seen indirectly via regression on the previous terms.
- the MA model only sees the last “q” shocks, whereas the AR(p) model takes all past shocks into account, albeit in a decaying weak manner
- the ARMA formulation is a broader concept, whereas an ARMA process is by definition stationary
- the MA in ARMA is not the same as the moving average model used to model trends!
- More compact form: \(\phi(B)X_{t}=\theta(B)Z_{t}\)
- where B: backward/backshift operator, \(BX_{t} = X_{t-1}\) and \(B^{k}X_{t}=X_{t-k}\). when applied to X_t, it outputs the lagged time series; if you raise it to the power k, it outputs the time series lagged by k points
- \(\phi\) is a polynomial of order p with coefficients given by the AR portion of the model, it’s called the autoregressive polynomial
- \(\theta\) is a polynomial of order q with coefficients given by the MA portion of the model, called the moving average polynomial
- for any autocovariance function \(\gamma(\cdot)\) such that \(\lim_{h \rightarrow \infty}\gamma(h)=0\), the covariance decreases as we increase the lag h. the dependence becomes smaller and smaller as we increase the lag h between two time points; it has a fast-decaying autocovariance. if this condition holds, we can find an ARMA process with the given autocovariance function (it exists). we can apply a linear ARMA to any process with a fast-decaying autocovariance, and use it to predict easily.
White Noise Process:
- white noise processes don’t have to have the random variables normally distributed, can be exponential, etc.
- ACF should look like 1 big bar at lag 0 and then almost nothing
- the heights (and signs) of the ACF spikes at nonzero lags correspond to the coefficients of the MA process
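A simulated illustration (Gaussian noise and an MA(1) with coefficient 0.7, both arbitrary choices):
## white noise vs a pure MA(1) process
set.seed(6)
z = rnorm(500)                                      ## white noise (any iid finite-variance noise works)
acf(z, lag.max = 30)                                ## one spike at lag 0, roughly nothing after
x.ma1 = arima.sim(model = list(ma = 0.7), n = 500)  ## MA(1)
acf(x.ma1, lag.max = 30)                            ## a single significant spike at lag 1, then cut-off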
Formulation vs Stationarity
- Not all ARMA formulations represent stationary timeseries
- AR(1) (Random Walk): \(X_{t}-X_{t-1} = Z_{t}\) and \(X_{t}-BX_{t}=Z_{t} \rightarrow (1-B)X_{t} = Z_{t}\)
- \(\phi(B)X_{t} = Z_{t}\) therefore \(\theta(B) = 1\)
- ARMA(1,q): \(X_{t}-X_{t-1} = Z_{t} - \theta_{1}Z_{t-1}-...-\theta_{q}Z_{t-q}\)
- basically AR(1) on the left and normal MA(q) on the right, also nonstationary
how to tell if stationary?
- Stationarity is established by the AR polynomial ONLY
- translate to compact format:
- example ARMA(2,3): \(X_{t}-2X_{t-1} + 2X_{t-2} = Z_{t}+.5Z_{t-1} -.2Z_{t-2}-Z_{t-3}\)
- AR Coefficients: \(\phi_{1}=2, \phi_{2}=-2\)
- AR polynomial: \(\phi(z) = 1-\phi_{1}z-\phi_{2}z^{2}=1-2z+2z^{2}\)
- MA Coefficients: \(\theta_{1}=-0.5,\theta_{2}=.2,\theta_{3}=1\)
- MA Polynomial: \(\theta(z)= 1-\theta_{1}z-\theta_{2}z^{2}-\theta_{3}z^{3}=1+.5z-.2z^{2}-z^{3}\)
- stationarity condition: a stationary ARMA solution exists and is unique if and only if \(\phi(z) \neq 0\) for all \(z \in \mathbb{C}\) such that \(\mid z \mid = 1\) (no zeroes of the AR polynomial \(\phi(z)\) lie on the unit circle, i.e. none with \(\mid z \mid = 1\))
- all MA formulations with Z_t white noise are stationary since \(\phi(z)=1\): the AR polynomial of a pure MA process is simply equal to 1, so it can’t have zeroes on the unit circle, since it doesn’t have any zeroes, period
- step 1: solve \(\phi(z)=0\) and obtain any solutions, check if they are on the unit circle
ARMA: Prediction
- to predict values in an ARMA process, it must be stationary and have “causality”
- Causality indicates that the time series depends only on past/lagged values. causality implies stationarity but not the other way around
- formal definition is that we can invert the ARMA model into an MA model, express it as a linear process of Z_t(white noise) ->
- An ARMA process {X_t} is a causal function of Z_t if there exists constants \(\{\psi_{j}\}\) such that the sum of the coefficients of the linear process is finite: \(\Sigma_{j=0}^{\infty}\mid \psi_{j}\mid < \infty\) and \(X_{t} = \Sigma_{j=0}^{\infty}\psi_{j}Z_{t-j}\) for all \(t \in \mathbb{Z}\)
- there are many ways to represent a process, and the formulation chosen depends on the context of the problem
Causal Process:
- for Causality, we only need the AR polynomial
- \(X_{t}\) is causal if and only if:
- \(\phi(\cdot)\) and \(\theta(\cdot)\) have no common zeroes
- all roots of the AR polynomial \(\phi(z)\) lie outside the unit circle: \(\mid z \mid > 1\)
- step by step:
- solve \(\phi(z)=0\) and obtain all solutions (if the AR order is p, then we have p solutions: z_1, …, z_p)
- take the modulus / abs value of the solutions:\(\mid z_{1}\mid, ..., \mid z_{p}\mid\)
- if ALL \(\mid z_{1}\mid, ..., \mid z_{p}\mid\) are >1 we have causality, if at least one is smaller, then we don’t have causality
- Example: Pure AR(1) process \((1-\phi B)X_{t}=Z_{t}\)
- the AR polynomial is \(\phi(z)=1-\phi z\)
- set to 0: \(0=1-\phi z\), the zero is \(z=1/\phi\)
- modulus is \(\mid 1/\phi\mid\)
- Stationarity: A unique stationary solution exists if and only if \(\mid 1/\phi\mid \neq 1\), i.e. \(\mid\phi\mid \neq 1\)
- Causality: \(X_{t}\) is causal if and only if \(\mid 1/\phi\mid > 1\) or \(\mid\phi\mid < 1\)
Invertible Process:
- for invertibility we only need MA polynomial
- an ARMA process {X_t} is invertible if there exists constants \(\pi_{j}\) such that \(\Sigma_{j=0}^{\infty}\mid \pi_{j} \mid < \infty\) (the sum of the abs val of the constants is finite) and,
- \(Z_{t} =\Sigma_{j=0}^{\infty} \pi_{j} X_{t-j}\) for all \(t \in \mathbb{Z}\) (the linear combination of past X_ts weighted by \(\pi_{j}\) is a white noise process)
- evaluating invertibility: the zeroes of the MA polynomial are outside of the unit circle, the abs val of the solutions to MA = 0 are >1
- invertibility implies neither stationarity nor causality
Conditions:
- Stationarity: the roots of \(\phi(z)\) (AR) have modulus \(\neq 1\) (none on the unit circle)
- Causality: the roots of \(\phi(z)\) (AR) have modulus > 1 (all outside the unit circle)
- Invertibility: the roots of \(\theta(z)\) (MA) have modulus > 1 (all outside the unit circle)
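A sketch of these root checks in base R, using the ARMA(2,3) example from above (polyroot takes the polynomial coefficients in increasing order of power):
## stationarity / causality / invertibility via the polynomial roots
ar.coef = c(2, -2)                      ## phi_1, phi_2
ma.coef = c(-0.5, 0.2, 1)               ## theta_1, theta_2, theta_3
ar.roots = polyroot(c(1, -ar.coef))     ## zeroes of phi(z) = 1 - 2z + 2z^2
ma.roots = polyroot(c(1, -ma.coef))     ## zeroes of theta(z) = 1 + .5z - .2z^2 - z^3
Mod(ar.roots)   ## both ~0.71: off the unit circle (stationary) but not all > 1 (not causal)
Mod(ma.roots)   ## invertible only if all of these are > 1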
PACF and Causal ARMA Process
- an ARMA process {X_t} is a causal process (with a corresponding linear representation in terms of Z_t) if there exist constants \(\{\psi_{j}\}\) with \(\Sigma_{j=0}^{\infty}\mid \psi_{j}\mid < \infty\) such that \(X_{t} = \Sigma_{j=0}^{\infty} \psi_{j} Z_{t-j}\) for all \(t \in \mathbb{Z}\)
- then its ACF of lag h is \(\gamma_{x}(h)=\sigma^{2}\Sigma_{j=0}^{\infty}\psi_{j} \psi_{j+\mid h \mid}\)
- to use the formula, we need to find the coefficients \(\{\psi_{j}, j=0,1,2,...\}\) first
- coefficients of the causal function: \(\psi(z) = \Sigma_{j=0}^{\infty}\psi_{j}z^{j}=\frac{\theta(z)}{\phi(z)}\) OR \(\psi(z)\phi(z)=\theta(z), \mid z \mid \leq 1\)
- you can identify the order of an MA process because it cuts off at the order q
- use sample PACF of a pure AR(p) process: \(\alpha_{x}(h)=0\) for h > p
- use sample ACF of a pure MA(q) process: \(\rho_{x}(h)=0\) for h > q
parameter estimation:
- pure AR model:
- linear regression (straightforward, but the errors need to be normal; it is the MLE under normality, and is subject to multicollinearity)
- yule-walker equations: method of moments approach, no distribution assumptions, no multicolinearity risk
- steps: assuming an AR model, multiply both sides by \(X_{t-j}\) for j = 0 to p and take expectations. for each j you get one equation (p+1 total), with p+1 parameters (\(\phi_{1},...,\phi_{p},\sigma^{2}\)); then you solve the linear system of equations
- alternative matrix format: \(\Gamma_{p}\bar{\phi} = \gamma_{p}(1)\) and \(\sigma^{2}=\gamma(0)-\bar{\phi}^{T}\gamma_{p}(1)\)
- \(\Gamma_{p}\) is the matrix whose ij element is the auto-covariance at lag (i-j); \(\gamma_{p}(1)\) is the vector of the auto-covariance at lags 1 up to p.
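A sketch of Yule-Walker estimation in R via ar.yw(), on a simulated AR(2) (the coefficients 0.5 and 0.3 are arbitrary):
## Yule-Walker estimation of a pure AR model
set.seed(7)
x = arima.sim(model = list(ar = c(0.5, 0.3)), n = 500)
yw.fit = ar.yw(x, order.max = 2, aic = FALSE)   ## solves the Yule-Walker system
yw.fit$ar                                       ## estimated phi_1, phi_2
yw.fit$var.pred                                 ## estimated sigma^2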
ARMA Estimation:
- Pure AR (under normality) - linear regression
- Pure AR (other distribution) - Yule-Walker
- Pure MA or ARMA - MLE
Order Selection & Residual Processes:
- Parameters: AR coefficients, MA coefficients, mean, std dev
- usually demean the process, so it’s 0
- the orders are fixed but need to be selected
- we can use the ACF to identify the order of an MA process
- we can use the PACF to determine the order of an AR process
- AICC (modified AIC) Approach:
- fit models with varying orders
- compute AICC and find best one (min AICC)
- residuals:
- must be scaled/standardized to identify outliers
- properties of residuals should match the process (uncorrelated if white noise, normal if normal)
- test for uncorrelation:
- Sample ACF / PACF plots: if the residuals are uncorrelated, (almost) no lags should fall outside the confidence bands
- Portmanteau tests: H0 uncorrelated (large p-values for good fit)
- Ljung-Box test
- Mcleod-Li test
- also use QQ and histogram of residuals
- Normality Assumption: Shapiro Wilk (W), Jarque-Bera (JB), D’Agostino ( D or Y), Anderson-Darling (A)
- Variance-stabilizing transformation - the Box-Cox power transformation (choose/estimate the lambda parameter)
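A sketch of the selection-plus-diagnostics workflow, applied to the simulated AR(2) series x from the Yule-Walker sketch above (or any stationary series); arima() reports AIC, while the course's AICc adds a small-sample correction:
## search over small ARMA orders, then check the residuals
best = NULL
for (p in 0:3) for (q in 0:3) {
  fit = try(arima(x, order = c(p, 0, q), method = "ML"), silent = TRUE)
  if (!inherits(fit, "try-error") && (is.null(best) || AIC(fit) < AIC(best))) best = fit
}
best
res.std = residuals(best)/sqrt(best$sigma2)      ## standardized residuals
acf(res.std)                                     ## should look like white noise
Box.test(res.std, lag = 10, type = "Ljung-Box", fitdf = sum(best$arma[1:2]))  ## Ljung-Box test
qqnorm(res.std); qqline(res.std)                 ## normality check
shapiro.test(res.std)                            ## Shapiro-Wilk test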
ARIMA Model
- ARIMA is an ARMA formulation but not an ARMA process, because X_t is non-stationary
- if \(Y_{t}\) (\((1-B)^{d}X_{t}=Y_{t}\)) is \(X_{t}\) differenced d times, and \(Y_{t}\) is a causal ARMA(p,q) process, then \(X_{t}\) is an ARIMA(p,d,q) process, with d roots of its AR polynomial on the unit circle (at z = 1), since the original process is not stationary
- Prediction: for best linear prediction of an ARMA process we usually assume it to be causal, but for an ARIMA this assumption does not hold
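A minimal sketch of fitting and forecasting an ARIMA in R (the random-walk-like simulated series and the (1,1,1) order are arbitrary choices):
## the d argument of arima() handles the differencing
set.seed(8)
y.nonstat = cumsum(rnorm(300)) + 0.05*(1:300)   ## nonstationary: stochastic trend plus drift
fit.arima = arima(y.nonstat, order = c(1, 1, 1))
fit.arima
predict(fit.arima, n.ahead = 5)                 ## forecasts with standard errors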
Interpreting ARMA output:
- S.E. (standard error) - smaller is better
- Sigma^2 (Variance of residuals) - smaller is better
- Log Likelihood - higher is better
- AIC - lower is better
Module 3
Terms
- Independent means independent for any linear or nonlinear transformation
- independence can’t be reasonably checked, it would take too long to check infinite transformations, so it is only ever established through the experiment design which generates the data
- uncorrelated means only independent for linear transformation, it might not be independent with a nonlinear transformation
- the data used in the class is from observational studies, so independence can’t be assumed
- unconditional variance = instantaneous variability, assumed constant for a stationary timeseries
- conditional variance = volatility, depends on previous observations, may not be constant for a stationary timeseries
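A small simulation sketch of the distinction: an ARCH(1)-type series (the parameters 0.2 and 0.7 are arbitrary) is roughly uncorrelated, but its squares are not, which is the signature of time-varying conditional variance:
## constant vs time-varying conditional variance
set.seed(9)
n = 500
e = rnorm(n); x.arch = numeric(n); sig2 = numeric(n)
sig2[1] = 0.2; x.arch[1] = e[1]*sqrt(sig2[1])
for (t in 2:n) {
  sig2[t] = 0.2 + 0.7*x.arch[t - 1]^2      ## conditional variance depends on the past
  x.arch[t] = e[t]*sqrt(sig2[t])
}
acf(x.arch)      ## the series itself: roughly uncorrelated
acf(x.arch^2)    ## the squared series: clearly correlated (volatility clustering)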
Module 4
Definitions
- a multivariate timeseries is a matrix of multiple univariate timeseries stacked together and aligned by time
Stationarity
- multivariate timeseries is weakly stationary if (regular univariate conditions):
- constant mean \(E[Y_{t}]\) that does not depend on t
- finite variance V[Y_t]
- within-series dependence or auto-covariance: \(Cov(Y_{i,t}, Y_{i,t-k})\) depends on the lag k only, for each series i=1,…,n
- but also for dependence between time series, (also second moment) we also need:
- contemporaneous covariance does not depend on t for all finite elements (measuring dependence between timeseries over the same time period)
- lag or cross-covariance: (dependence between timeseries over lagged time)
- contemporaneous dependence matrix:
- measures dependence between timeseries at contemporaneous time points. diagonals are variance of the univariate time series, the off diagonals are contemporaneous dependence between variables time series
- lagged dependence matrix measures dependence at lagged time points:
- diagonal is the k lag of univariate timeseries
- off-diagonal is k-lag cross dependence b/w variables
VAR Model
- if the coefficients are not statistically significant, it’s possible they are just 0
- covariance matrix:
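The residual covariance matrix mentioned above appears in the summary() output of a fitted VAR. A sketch using the third-party vars package (an assumption, not named in these notes) on two illustrative simulated series:
## VAR fit, lag selection, and a Granger causality test
library(vars)
set.seed(10)
Y = data.frame(y1 = as.numeric(arima.sim(list(ar = 0.5), 200)),
               y2 = as.numeric(arima.sim(list(ar = 0.3), 200)))
VARselect(Y, lag.max = 8)$selection          ## information criteria for choosing the lag order p
var.fit = VAR(Y, p = 1)
summary(var.fit)                             ## coefficient estimates and residual covariance matrix
causality(var.fit, cause = "y1")             ## Granger causality: does y1 help predict y2?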