# Time Series Model Building Process

*Revision as of 17:56, 16 January 2020*


## Introduction

Linear time series models, e.g., ARIMA models for univariate time series, are a popular tool for modeling the dynamics of a time series and predicting its future values. The methodology is popular not only in statistics, econometrics, and science, but also in machine learning and business applications. The key question is how to build the model. We have to choose the form of the model, in particular the number of lags for the so-called autoregressive part (usually denoted as p) and for the moving-average part (usually denoted as q). Common guides usually provide a 'cookbook' for model selection; see for example ARIMA Model – Complete Guide to Time Series Forecasting in Python. The goal of this simulation is to compare four basic methods for selecting the number of lags (p) for the autoregressive part.

## Problem Definition

The autoregressive process of order p, AR(p), has the form

$$ x_t = c + \phi_1 x_{t-1} + \phi_2 x_{t-2} + \dots + \phi_p x_{t-p} + \varepsilon_t, $$

where $\varepsilon_t$ is a random error term.

However, we observe only the realized values and we do not know the hyperparameter p. Therefore, we must estimate it. The methods considered are the following.

### GETS (General-to-Specific)

We first start with a sufficiently high order p (`pmax` in the code) and perform a statistical test of whether the last lag has a statistically significant effect, i.e., we test

$$ H_0: \phi_p = 0 \quad \text{against} \quad H_1: \phi_p \neq 0. $$

If the effect is insignificant, we reduce p by one and repeat until the parameter of the last lag is significant, i.e., until the last lag has a significant effect.
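As a minimal sketch of this loop, assuming the AR(p) model is fitted by OLS with an intercept and significance is judged by a two-sided t-test at roughly the 5% level (the function names and the 1.96 threshold are illustrative assumptions, not taken from the original code):

```python
import numpy as np

def fit_ar_ols(x, p):
    """Fit an AR(p) with intercept by OLS.
    Returns (coefficients, t-statistics, residual variance)."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    y = x[p:]
    # Design matrix: a constant column plus the p lagged series.
    X = np.column_stack([np.ones(T - p)] + [x[p - j:T - j] for j in range(1, p + 1)])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, beta / se, sigma2

def gets_select(x, pmax, tcrit=1.96):
    """General-to-specific: shrink the order until the highest lag is significant."""
    for p in range(pmax, 0, -1):
        _, tstats, _ = fit_ar_ols(x, p)
        if abs(tstats[-1]) > tcrit:  # t-statistic of lag p
            return p
    return 0  # no lag is significant
```

In each pass only the t-statistic of the highest lag matters; the lower lags stay in the model regardless of their own significance.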

### Specific-to-General

The second approach works similarly to GETS; the only difference is that we first start with a model without any lags, i.e.,

$$ x_t = c + \varepsilon_t, $$

and we keep adding lags as long as the newly added lag is significant.
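This loop can be sketched in the same spirit, again assuming an OLS fit with intercept and a two-sided t-test at roughly the 5% level (function names and the 1.96 threshold are illustrative assumptions):

```python
import numpy as np

def last_lag_tstat(x, p):
    """OLS fit of an AR(p) with intercept; return the t-statistic of lag p."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    y = x[p:]
    X = np.column_stack([np.ones(T - p)] + [x[p - j:T - j] for j in range(1, p + 1)])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return (beta / se)[-1]

def stg_select(x, pmax, tcrit=1.96):
    """Specific-to-general: add lags while the newly added lag is significant."""
    p = 0
    while p < pmax and abs(last_lag_tstat(x, p + 1)) > tcrit:
        p += 1
    return p
```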

### BIC (Bayesian Information Criteria)

The Bayesian information criterion is based on the likelihood of the model and the number of its parameters. The criterion penalizes the model for a high number of parameters, guarding against potential over-fitting, and rewards the model for a good fit to the data. The lower the BIC value, the better the model (a slightly counterintuitive convention). We therefore select the model with the lowest BIC.
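As a sketch, for an OLS-fitted Gaussian AR(p) the BIC can be computed from the maximum-likelihood residual variance as $T\ln\hat{\sigma}^2 + k\ln T$ with constants dropped, where $k$ is the number of estimated parameters. This concrete form is one common convention, not taken from the original code:

```python
import numpy as np

def ar_bic(x, p):
    """Gaussian BIC of an OLS-fitted AR(p): n*log(sigma2_ML) + k*log(n), constants dropped."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    y = x[p:]
    X = np.column_stack([np.ones(T - p)] + [x[p - j:T - j] for j in range(1, p + 1)])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    n, k = X.shape
    sigma2_ml = resid @ resid / n  # ML (not bias-corrected) residual variance
    return n * np.log(sigma2_ml) + k * np.log(n)

def bic_select(x, pmax):
    """Pick the order with the lowest BIC."""
    return int(np.argmin([ar_bic(x, p) for p in range(pmax + 1)]))
```

One caveat of this simple sketch: the effective sample size $n = T - p$ differs across candidate orders; a stricter comparison would fit all models on the same trimmed sample.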

### AIC (Akaike Information Criteria)

The Akaike information criterion works in exactly the same way as BIC, but its penalty term is slightly different.
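In their textbook form, the two criteria differ only in the penalty attached to the number of parameters $k$ (with $\hat{L}$ the maximized likelihood and $T$ the sample size):

```latex
\mathrm{AIC} = 2k - 2\ln\hat{L}, \qquad
\mathrm{BIC} = k\ln T - 2\ln\hat{L}
```

Since $\ln T > 2$ once $T > 7$, BIC penalizes extra lags more heavily than AIC and therefore tends to choose smaller models.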

## Method and Simulation

The solution consists of three key elements: the data generating process, the AR(p) model, and the simulation.

### Data Generating Process

To answer the question of which model selection method is best, we create artificial data. We assume that the data are generated by an AR(1) process

$$ x_t = c + \phi_1 x_{t-1} + \varepsilon_t $$

with random error $\varepsilon_t \sim N(0, \sigma^2)$. Therefore, we must generate random numbers for the errors and iterate the equation described above. The initial value of the process, $x_0$, is set to the theoretical mean of the process, $c/(1-\phi_1)$.
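A minimal sketch of this data generating process, assuming Gaussian errors; the parameter values below are illustrative, not the ones from the original experiment:

```python
import numpy as np

def simulate_ar1(c, phi, sigma, T, rng):
    """Generate x_t = c + phi * x_{t-1} + eps_t with eps_t ~ N(0, sigma^2).
    The process starts at its theoretical mean c / (1 - phi)."""
    x = np.empty(T)
    x[0] = c / (1.0 - phi)
    for t in range(1, T):
        x[t] = c + phi * x[t - 1] + rng.normal(0.0, sigma)
    return x

rng = np.random.default_rng(42)
series = simulate_ar1(c=1.0, phi=0.5, sigma=1.0, T=200, rng=rng)
```

Starting at the theoretical mean avoids a burn-in transient at the beginning of each simulated series.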

### AR(p) Model

Given a series {x_1, x_2, …, x_T} of length T, this component estimates the AR model for a given order p.
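The estimation step can be sketched as a regression of $x_t$ on a constant and its own $p$ lags. OLS is an assumption of this sketch; for Gaussian errors, conditional maximum likelihood gives the same point estimates:

```python
import numpy as np

def estimate_ar(x, p):
    """Estimate an AR(p) with intercept by regressing x_t on (1, x_{t-1}, ..., x_{t-p})."""
    x = np.asarray(x, dtype=float)
    T = len(x)
    y = x[p:]                                          # left-hand side: x_p, ..., x_{T-1}
    lags = [x[p - j:T - j] for j in range(1, p + 1)]   # j-th column is x_{t-j}
    X = np.column_stack([np.ones(T - p)] + lags)
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta                                        # [c, phi_1, ..., phi_p]
```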

### Simulation

We perform n Monte Carlo simulations. In each simulation, the data generating process is used to create a time series. The AR(p) model is then estimated for p = 0, 1, …, pmax, and the best model is selected according to each of the selection criteria described above. We therefore obtain, for each of the n simulations, four estimates of p corresponding to the described methods. The final step is to compute the percentage of correctly estimated p for each of the four methods.
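Putting the pieces together, the experiment can be sketched as below. All four selectors are re-implemented inline so the sketch is self-contained; the OLS fit, the 1.96 t-test threshold, the parameter defaults, and the function names are illustrative assumptions, not taken from the original code:

```python
import numpy as np

def _fit(x, p):
    """OLS AR(p) fit: returns (t-stat of last coefficient, ML residual variance, n, k)."""
    T = len(x)
    y = x[p:]
    X = np.column_stack([np.ones(T - p)] + [x[p - j:T - j] for j in range(1, p + 1)])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    n, k = X.shape
    s2_ml = resid @ resid / n
    se = np.sqrt((resid @ resid / (n - k)) * np.diag(np.linalg.inv(X.T @ X)))
    return (beta / se)[-1], s2_ml, n, k

def select_orders(x, pmax, tcrit=1.96):
    """Return (p_GETS, p_StG, p_BIC, p_AIC) for one series."""
    stats = [_fit(x, p) for p in range(pmax + 1)]
    aic = [n * np.log(s2) + 2 * k for _, s2, n, k in stats]
    bic = [n * np.log(s2) + k * np.log(n) for _, s2, n, k in stats]
    # GETS: largest order whose last lag is significant.
    p_gets = next((p for p in range(pmax, 0, -1) if abs(stats[p][0]) > tcrit), 0)
    # Specific-to-general: add lags while the newest lag is significant.
    p_stg = 0
    while p_stg < pmax and abs(stats[p_stg + 1][0]) > tcrit:
        p_stg += 1
    return p_gets, p_stg, int(np.argmin(bic)), int(np.argmin(aic))

def mc_success(n_sim, T, pmax, c=1.0, phi=0.5, sigma=1.0, true_p=1, seed=0):
    """Share of simulations in which each method recovers true_p."""
    rng = np.random.default_rng(seed)
    hits = np.zeros(4)
    for _ in range(n_sim):
        x = np.empty(T)
        x[0] = c / (1.0 - phi)  # start at the theoretical mean
        for t in range(1, T):
            x[t] = c + phi * x[t - 1] + rng.normal(0.0, sigma)
        hits += np.array(select_orders(x, pmax)) == true_p
    return hits / n_sim  # success rates: [GETS, StG, BIC, AIC]
```

The returned vector is the per-method fraction of simulations in which the estimated p equals the true order of the data generating process.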