Statistical Methods for Discrete Response, Time Series, and Panel Data

by Ryan Delgado, Aditi Khullar, Aaron Olson - Thu 08 March 2018
Tags: #r #time series #sarima #xts

# Question 1: Forecasting using a SARIMA model

The file ECOMPCTNSA.csv contains quarterly data of E-Commerce Retail Sales as a Percent of Total Sales. The data can be found at: https://fred.stlouisfed.org/series/ECOMPCTNSA.

```{r setup, include=FALSE}
library(car)
library(dplyr)
library(quantmod)
library(ggplot2)
library(ggfortify)
library(astsa)
library(forecast)
library(xts)

options("getSymbols.warning4.0"=FALSE)
```

Build a Seasonal ARIMA model and generate quarterly forecast for 2017. Make sure you use all the steps of building a univariate time series model between lecture 6 and 9, such as checking the raw data, conducting a thorough EDA, justifying all modeling decisions (including transformation), testing model assumptions, and clearly articulating why you chose your given model. Measure and discuss your model's performance. Use both in-sample and out-of-sample model performance. When estimating your model, exclude the series from 2015 and 2016. For the out-of-sample forecast, measure your model's performance in forecasting the quarterly E-Commerce retail sales in 2015 and 2016. Discuss the model performance. Also forecast beyond the observed time-period of the series. Specifically, generate quarterly forecast for 2017.

## Introduction

In this lab, we'll analyze Quarterly Retail E-Commerce Sales as a Percent of Total Sales data sourced from the U.S. Bureau of the Census. This time series measures the share of total sales in the United States that can be attributed to E-Commerce. Prior to our analysis, this data was retrieved from the [Federal Reserve Economic Database](https://research.stlouisfed.org/) and stored in a local CSV file. Our goals for this analysis are to understand the properties of the time series and forecast future values. We will achieve this by thoroughly exploring the data, building a statistical model, and evaluating the assumptions and results of our model using standard statistical tests and measures.

We will proceed as follows:

1. We'll read the raw data into the appropriate data structure, and examine it for any quirks or data quality issues (e.g. missing data) that need to be addressed before proceeding.
2. We'll then conduct a thorough Time Series Exploratory Data Analysis. We'll look for evidence of seasonality, trend, and other properties that will need to be accounted for during the modeling phase. 
3. Next, we'll model the series using the findings from the EDA and standard approaches. We'll suggest an appropriate class of models, and fit several models to find one that best explains the data. For each model, we'll evaluate the goodness of fit and the assumptions we must satisfy in order to make inferences and predictions.
4. Among multiple valid models, we'll choose the best one. We'll use the AIC to evaluate goodness of fit and test that the model meets the assumptions these models require in order to ensure accurate forecasts and inferences.
5. We'll then generate out-of-sample forecasts. We'll visually compare our forecasts with the actual value, and quantitatively evaluate the performance.

## Initial Examination

The data is stored locally in a csv file that we can read into a `data.frame` and transform into a `ts` object. We'll look at the first few observations and plot a histogram of the series' values:

```{r, fig.width=4, fig.height=2}
df <- read.csv('ECOMPCTNSA.csv')
cbind(head(df), tail(df))

ggplot(df, aes(x=ECOMPCTNSA)) +
  geom_histogram(color='black', fill='blue') +
  ggtitle('Distribution of ECOMPCTNSA')
```

**Observations:**

* We do not see any missing values in the data.
* The values at the end of the series are consistently greater than those at the start, though still within roughly the same order of magnitude.
* The histogram shows the data is not normally distributed, and does not have any significant outliers worthy of further exploration.

Nothing in the data seems aberrant or needs to be addressed. Let's proceed to the Time Series Exploratory Data Analysis.

## Exploratory Data Analysis

We'll start by converting our data to a `ts` object and plotting it:

```{r, fig.width=4, fig.height=2}
series <- ts(df$ECOMPCTNSA, start=c(1999,4), frequency=4)
autoplot(series, ts.colour='blue', ts.size=0.75) +
  ggtitle('Time Series Plot of ECOMPCTNSA') +
  xlab('Time') + ylab('ECOMPCTNSA')
```

**Observations:**

* There is a strong upward trend in the series, with the trend appearing to grow exponentially with time.
* The amplitude of the seasonal component increases with the trend (i.e. non-constant variance).
* There does not appear to be much randomness in the series. The trend and seasonal components dominate.

Because our series is non-negative and exponentially growing, a log transformation will be useful for handling the exponential growth in the trend and seasonality. The components of the log-transformed series will grow linearly instead of exponentially, and thus will be easier to model using a linear time series model. Let's perform this transformation and plot the resulting series:

```{r, fig.width=4, fig.height=2}
log.series <- ts(log(df$ECOMPCTNSA), start=c(1999,4), frequency=4)
autoplot(log.series, ts.colour='darkgreen', ts.size = 0.75) +
  ggtitle('Time Series plot of log(ECOMPCTNSA)') +
  xlab('Time') + ylab('log(ECOMPCTNSA)')
```

The series still has a strong upward trend, but the trend is now mostly linear and the seasonality is stable. Let's now explore the trend, seasonal, and random components separately by decomposing this log-transformed series:

```{r, fig.width=4, fig.height=3}
decomposed <- decompose(as.ts(log.series))
autoplot(decomposed)
```

**Observations:**

* We can more clearly see that the trend grows linearly and the seasonal component is stable. 
* Because the trend component is linear, we can handle it in the modeling phase using a first-order difference of the series.
* We see that the seasonality peaks at the end of each year, suggesting the frequency of the seasonality is annual/four quarters. We can handle this in the modeling phase by including seasonal differencing and AR/MA components with a base lag of 4.
* The random component appears weakly stationary, though it does not appear to resemble Gaussian White Noise. Thus, there appears to be persistence in this component.

Let's apply the seasonal and non-seasonal differencing to the series and examine the ACF and PACF plots of the result:

```{r, fig.width=4, fig.height=3}
diff(log.series, lag = 4) %>% diff() %>% ggtsdisplay()
```

Despite the differencing, the ACF and PACF plots continue to show statistically significant autocorrelation at the 4th lag. Interestingly, no other lags show statistically significant autocorrelation. In the PACF plot, this apparently seasonal autocorrelation decays gradually with each yearly increment: strongly negative at lag 4, weakly negative at lag 8, weakly positive at lag 12, and even more weakly positive at lag 16. The ACF plot shows a strong negative autocorrelation at lag 4 with a sharp cut-off afterwards (i.e. no slow decay). This behavior is typical of an MA component, and thus suggests the model likely has a seasonal first-order MA component, i.e. $SARIMA(0,1,0)(0,1,1)_4$. We will experiment with this in the _Modeling_ section, as well as with a seasonal AR component.

**Key Takeaways:**

* The trend and seasonal components exhibited exponential growth. We handled this by log-transforming the series.
* The trend of the log-transformed series is linear, which can be handled with first-differencing in the time series model.
* The series has annual seasonality. Since the time series has a quarterly frequency, we can incorporate seasonal differencing and seasonal AR/MA terms with a lag order of 4 into the model.
* The random component exhibits serial correlation with a seasonal component, which is likely a first-order MA component.

## Modeling

Since the log-transformed series exhibits a linear trend and annual seasonality, we'll use a Seasonal AutoRegressive Integrated Moving Average (SARIMA) model. SARIMA models are useful for better understanding or predicting future values of a univariate time series with a seasonal component, which makes it perfect for our purposes. In order for the inferences and forecasts of this model to be valid, it must meet several assumptions:

* The residual series is independently distributed from its lagged values. This can be evaluated visually with the ACF and PACF plots of the residuals, and statistically with the Ljung-Box test.
* The residual series has a zero mean.
* The residuals are normally distributed and resemble Gaussian White Noise. We can test that the residuals are normally distributed with the Shapiro-Wilk test.

In order to find the "best" model, we'll fit multiple SARIMA models with different specifications of the AR and MA lag orders in both the seasonal and nonseasonal components of the series. We'll evaluate each model by its Akaike Information Criterion (AIC).

Gaussian residuals are important for ensuring good forecast performance. Given the small sample size of this dataset (about 60 observations in the training set), however, it may be difficult to obtain Gaussian residuals. We will therefore proceed by fitting two different models:

* A SARIMA model with lag orders determined using EDA and ACF/PACF plots, but relaxing the Gaussian residuals assumption.
* A SARIMA model determined programmatically, requiring Gaussian-distributed residuals.

Before proceeding, we will split the time series into "train" and "test" series to fit the model and later evaluate its out-of-sample forecast performance:

```{r}
series.train <- window(log.series, c(1999,4), c(2014,4))
series.test <- window(log.series, c(2015,1))
```

### Model 1: Derived from Intuition & Manual Experimentation

We'll first fit a $SARIMA(0,1,0)(0,1,1)_4$ model, as discussed in the EDA:

```{r, fig.width=6, fig.height=3}
ma <- Arima(series.train, c(0,1,0), seasonal = list(order = c(0,1,1), period = 4))
summary(ma)

# Test residual serial independence
Box.test(ma$residuals, type = "Ljung-Box")

# Examine normality
shapiro.test(ma$residuals)

par(mfrow = c(1, 2))
acf(ma$residuals, main='Seasonal MA(1) \nResiduals ACF')
pacf(ma$residuals, main='Seasonal MA(1) \nResiduals PACF')
```

**Observations:**

* We see that the seasonal MA(1) coefficient is highly statistically significant.
* We do not reject $H_0$ in the Box-Ljung test, but we _do_ reject the $H_0$ in the Shapiro-Wilk test. We have strong evidence that the model's residuals _are not_ normally distributed.
* Visually, we do not see any statistically significant lag orders in the residual ACF and PACF plots.
* The AICc is -201.31.

Everything in the results looks acceptable, except for the apparent non-normally distributed residuals. Let's see if a seasonal AR(1) model has different results:
```{r, fig.width=6, fig.height=3}
ar <- Arima(series.train, c(0,1,0), seasonal = list(order = c(1,1,0), period = 4))
summary(ar)

# Test residual serial independence
Box.test(ar$residuals, type = "Ljung-Box")

# Examine normality
shapiro.test(ar$residuals)

par(mfrow = c(1, 2))
acf(ar$residuals, main='Seasonal AR(1) \nResiduals ACF')
pacf(ar$residuals, main='Seasonal AR(1) \nResiduals PACF')
```

The ACF and PACF plots show no statistically significant lags, meaning the model appears to produce serially independent residuals. The Box-Ljung test confirms that the residuals are statistically consistent with white noise. However, we again reject $H_0$ in the Shapiro-Wilk test, indicating the residuals are not Gaussian.

Among these two "hand-fit" models, we'll choose the seasonal AR(1) model as our "best" one since its AICc is slightly lower. We'll evaluate this model's out-of-sample performance against a model that meets the Gaussian assumption in the Forecasting section.
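As a quick check, the AICc values of the two hand-fit models can be read directly from the fitted objects:

```{r}
# Compare AICc of the two hand-fit models (lower is better)
c(seasonal.MA1 = ma$aicc, seasonal.AR1 = ar$aicc)
```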

### Model 2: Require Gaussian White Noise

Both of the previous models had a p-value < 0.05 for the Shapiro-Wilk test, indicating the residuals were not Gaussian distributed. This will likely result in prediction intervals that are not as accurate compared to a model with normally distributed residuals.

In order to find a model that meets this assumption while still achieving a low AICc score, we'll programmatically iterate over different combinations of nonseasonal and seasonal lag orders and evaluate each combination. At each iteration, we'll:

  1. Test the independence and normality of the model residuals using the Ljung-Box and Shapiro-Wilk tests, ensuring that the residuals are independent and normally distributed.
  2. Compare the AICc with the AICc of the previous best model. If the current model's AICc is lower than the previous model's AICc (and it does not violate the independence and normality assumptions), the lag orders and AICc of this model will become the new "bests".

We'll try up to 2 lags in the seasonal AR and MA components, and up to 4 lags in the non-seasonal components. We'll keep the seasonal and non-seasonal differencing constant at 1, based on the strong trend and seasonal components we found in the EDA:

```{r, fig.width=4, fig.height=3}
aicc.best <- 100000
mod.best <- 0

for (P in 0:2) {
  for (Q in 0:2) {
    for (p in 0:4) {
      for (q in 0:4) {
        mod <- Arima(series.train, order = c(p,1,q),
                     seasonal = list(order = c(P,1,Q), period = 4), method = 'ML')
        if ((Box.test(mod$residuals, type = "Ljung-Box")$p.value > 0.05) &
            (shapiro.test(mod$residuals)$p.value > 0.05)) {
          if (mod$aicc < aicc.best) {
            aicc.best <- mod$aicc
            mod.best <- mod
          }
        }
      }
    }
  }
}

summary(mod.best)

mod.best$residuals %>% ggtsdisplay()
```

The best-fitting model was a $SARIMA(4,1,1)(0,1,0)_4$. This model did not show evidence of non-normally distributed residuals, and had a slightly higher AICc score compared to Model 1 (-201.76 vs. -205.99), indicating that it may not be as accurate as the first model in out-of-sample forecasts. The only non-seasonal AR coefficient with statistical significance is the 4th lag. It will be interesting to see how the forecasts of these two models compare.

## Forecasting

Now we will generate forecasts from both models and evaluate their out-of-sample forecast performance:

```{r}
# Write a RMSE function to compute RMSE values
rmse <- function(pred, actual){
  return(sqrt(mean((pred-actual)^2)))
}

# Generate forecasts
ar_forecast <- forecast(ar, 12)
mod.best_forecast <- forecast(mod.best, 12)
# Plot series with both models & model forecasts
plot(series, main = 'E-Commerce Retail Sales and Model Performance', lwd = 1.5)
lines(exp(fitted.values(ar)), col = "red", lty = 2, lwd = 1.5)
lines(exp(fitted.values(mod.best)), col = "blue", lty = 4, lwd = 1.5)
lines(exp(ar_forecast$mean), col = "green", lty = 2, lwd = 1.5)
lines(exp(mod.best_forecast$mean), col = "pink", lty = 4, lwd = 1.5)
legend(2000,8, legend = c("ECOMPCTNSA Data", "AR Model", "Gaussian Model", "AR forecast", "Gaussian forecast"), 
       col = c("black", "red", "blue", "green", "pink"), 
       lty=c(1,2,4,2,4), cex = 0.8)

print("Best Fit Model: ")
print(rmse(series.test,forecast(ar, 8)$mean))
print("Model with Gaussian Residuals: ")
print(rmse(series.test,forecast(mod.best, 8)$mean))
```

The forecasted values for both models are similar. They both undershoot the actual values, and the RMSE scores differ by only 0.0004.
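Note that these RMSE values are computed on the log scale (`series.test` is a window of the log-transformed series). As a quick sketch, the same comparison on the original percent-of-sales scale:

```{r}
# Out-of-sample RMSE on the original (percent of total sales) scale
rmse(exp(series.test), exp(forecast(ar, 8)$mean))
rmse(exp(series.test), exp(forecast(mod.best, 8)$mean))
```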

Let's look at the 12-quarter ahead forecasts with confidence intervals:

```{r}
ar_forecast$lower = exp(ar_forecast$lower)
ar_forecast$upper = exp(ar_forecast$upper)
ar_forecast$mean = exp(ar_forecast$mean)
ar_forecast$x = exp(ar_forecast$x)
mod.best_forecast$lower = exp(mod.best_forecast$lower)
mod.best_forecast$upper = exp(mod.best_forecast$upper)
mod.best_forecast$mean = exp(mod.best_forecast$mean)
mod.best_forecast$x = exp(mod.best_forecast$x)
```

```{r, fig.width=8, fig.height=3}
par(mfrow = c(1, 2))

# For the manually fit seasonal model
plot(ar_forecast)

# Best fit model with Gaussian residuals
plot(mod.best_forecast)
```

We see that the forecasts are predicting that the prevailing trend and increasing seasonality will continue, with the confidence intervals widening as time progresses.

Below we explicitly define both models:

Let $z_t = \log(y_t)$, and let $w_t = (z_t - z_{t-1}) - (z_{t-4} - z_{t-5})$ denote the first- and seasonally-differenced log series.

**Model 1: Best fit without the Gaussian residuals requirement:**

$$
w_t = -0.6666\, w_{t-4} + \omega_t
$$

**Model 2: Best fit with the Gaussian residuals requirement:**

$$
w_t = 0.1577\, w_{t-1} + 0.1743\, w_{t-2} - 0.1395\, w_{t-3} - 0.6743\, w_{t-4} + \omega_t - 0.2237\, \omega_{t-1}
$$
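For reference, the coefficient estimates used in these equations can be read directly off the fitted model objects:

```{r}
# Coefficient estimates underlying the equations above
coef(ar)
coef(mod.best)
```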

Despite the first model's parsimony, lower AICc score, and slightly better forecast performance than the second model, we prefer the second model since it meets the Gaussian residuals assumption. We think that a model with Gaussian residuals will generally produce more reliable and accurate forecast intervals than a model without Gaussian residuals.
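One way to sanity-check this preference is to see how many of the eight held-out quarters (2015-2016) fall inside each model's 95% prediction interval. A minimal sketch, comparing on the log scale against `series.test`:

```{r}
# Fraction of held-out quarters covered by each model's 95% prediction interval
ar_fc_test   <- forecast(ar, 8)
best_fc_test <- forecast(mod.best, 8)
mean(series.test >= ar_fc_test$lower[, 2] & series.test <= ar_fc_test$upper[, 2])
mean(series.test >= best_fc_test$lower[, 2] & series.test <= best_fc_test$upper[, 2])
```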

## Conclusion

After analyzing the E-Commerce data via EDA and ACF/PACF plots, we fit two SARIMA models to the data and evaluated their out-of-sample forecast performance. Our first model, a $SARIMA(0,1,0)(1,1,0)_4$ model, was crafted based on our EDA findings. Despite achieving a low AICc and serially independent residuals, this model's residuals showed evidence of a non-normal distribution. Non-Gaussian residuals could result in faulty prediction intervals, so we fit a second model programmatically: we iterated over different combinations of seasonal and non-seasonal AR and MA lag orders and chose the model with the lowest AICc among those whose residuals showed no evidence of serial dependence or non-normality. This model was a $SARIMA(4,1,1)(0,1,0)_4$. Further modeling utilizing a daily, weekly, or monthly period (rather than quarterly) may help improve model performance, since the time series only extends back to approximately Q4 1999 (when e-commerce began to become available). With the data available, we chose the $SARIMA(4,1,1)(0,1,0)_4$ model since it meets all of the assumptions and is suitable for forecasting; however, further analysis using more data points should be completed to confirm that it remains a strong-performing model, since its parameters do not match the intuition we derived from the EDA.

In terms of forecasting and out-of-sample RMSE scores, the performance differences between the $SARIMA(0,1,0)(1,1,0)_4$ and $SARIMA(4,1,1)(0,1,0)_4$ models were negligible, and the forecast plots looked very similar. Perhaps with a longer series or noisier data we would see larger differences between these two models.



# Question 2: Learning how to use the xts library

## Materials covered in Question 2 of this lab

  - Primarily the references listed in this document:

      - "xts: Extensible Time Series" by Jeffrey A. Ryan and Joshua M. Ulrich. 2008. (xts.pdf)
      - "xts FAQ" by xts Development Team. 2013 (xts_faq.pdf)
      - xts_cheatsheet.pdf

# Task 1:

  1. Read
    A. The **Introduction** section (Section 1), which only has 1 page of reading, of "xts: Extensible Time Series" by Jeffrey A. Ryan and Joshua M. Ulrich
    B. The first three questions in "xts FAQ"
        a. What is xts?
        b. Why should I use xts rather than zoo or another time-series package?
        c. How do I install xts?
    C. The "A quick introduction to xts and zoo objects" section in this document

  2. Read the "A quick introduction to xts and zoo objects" section of this document

# A quick introduction to xts and zoo objects

### xts
```xts```
  - stands for eXtensible Time Series
  - is an extended zoo object
  - is essentially matrix + (time-based) index (aka, observation + time)

  - xts is a constructor or a subclass that inherits behavior from its parent (zoo); in fact, it extends the popular zoo class. As such, most zoo methods work for xts
  - is a matrix object; subsets always preserve the matrix form
  - importantly, xts objects are indexed by a formal time object, so the data is time-stamped
  - The two most important arguments are ```x``` for the data and ```order.by``` for the index. ```x``` must be a vector or matrix. ```order.by``` is a vector of the same length or number of rows as ```x```; it must be a proper time or date object and be in increasing order

# Task 2:

  1. Read 
    A. Section 3.1 of "xts: Extensible Time Series" by Jeffrey A. Ryan and Joshua M. Ulrich

    B. The following questions in "xts FAQ"
        a. How do I create an xts index with millisecond precision?
        b. OK, so now I have my millisecond series but I still can't see the milliseconds displayed. What went wrong?

  2. Follow the following section of this document


# Creating an xts object and converting to an xts object from an imported dataset

We will create an `xts` object from a matrix and a time index. First, let's create a matrix and a time index. The matrix, as created, is not yet associated with the time index.

```{r include=FALSE}
#rm(list = ls())
library(knitr)
opts_chunk$set(tidy.opts=list(width.cutoff=60),tidy=TRUE)
# Set working directory
setwd("~/Berkeley MIDS/W271 Statistical Methods for Discrete Response, Time-Series, and Panel Data/repo/W271_Lab3")
# Create a matrix
x <- matrix(rnorm(200), ncol=2, nrow=100)
colnames(x) <- c("Series01", "Series02")
str(x)
head(x,10)

idx <- seq(as.Date("2015/1/1"), by = "day", length.out = 100)
str(idx)
head(idx)
tail(idx)
```

In a nutshell, xts is a matrix indexed by a time object. To create an xts object, we "bind" the object with the index. Since we have already created a matrix and a time index (of the same length as the number of rows of the matrix), we are ready to "bind" them together. We will name it X.

```{r}
library(xts)
X <- xts(x, order.by=idx)
str(X)
head(X,10)
```

As you can see from the structure of an xts object, it contains both a data component and an index, indexed by an object of class Date.

The ```xts()``` constructor:

```{r, eval=FALSE}
xts(x=NULL,
    order.by=index(x),
    frequency=NULL,
    unique=NULL,
    tzone=Sys.getenv("TZ"))
```

As mentioned previously, the two most important arguments are x and order.by. In fact, we only used these two arguments to create the xts object above.

Given an xts object, one can deconstruct it into its data and index components.

Deconstructing xts:

```coredata()``` is used to extract the data component:

```{r}
head(coredata(X),5)
```

```index()``` is used to extract the index (aka the times):

```{r}
head(index(X),5)
```

# Conversion to xts from other time-series objects

We will use the same dataset, "bls_unemployment.csv", that we used in the last live session to illustrate the functions below.

```{r}
df <- read.csv("bls_unemployment.csv", header=TRUE, stringsAsFactors = FALSE)

# Examine the data structure
  str(df)
  names(df)
  head(df)
  tail(df)

#table(df$Series.id, useNA = "always")
#table(df$Period, useNA = "always")

# Convert a column of the data frame into a time-series object
unemp <- ts(df$Value, start = c(2007,1), end = c(2017,1), frequency = 12)
  str(unemp)
  head(cbind(time(unemp), unemp),5)

# Now, let's convert it to an xts object
df_matrix <- as.matrix(df)
  head(df_matrix)
  str(df_matrix)
  rownames(df)

unemp_idx <- seq(as.Date("2007/1/1"), by = "month", length.out = length(df[,1]))
  head(unemp_idx)

unemp_xts <- xts(df$Value, order.by = unemp_idx)
  str(unemp_xts)
  head(unemp_xts)
```

# Task 3:

  1. Read
    A. Section 3.2 of "xts: Extensible Time Series" by Jeffrey A. Ryan and Joshua M. Ulrich

  2. Follow the following section of this document

# Merging and modifying time series

One of the key strengths of xts is that it is easy to join data by column and row using only a few different functions. It makes creating time series datasets almost effortless.

The important criterion is that the xts objects must be of identical type (e.g. integer + integer), or be a POSIXct date vector, or be atomic vectors of the same type (e.g. numeric), or be a single NA. Merging does not work on data.frames with various column types.

The major function is merge. It works like cbind or SQL's join:

Let's look at an example. It assumes that you are familiar with concepts of inner join, outer join, left join, and right join.
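Before the real-data example, here is a minimal, self-contained sketch (using two made-up toy series, since the Twitter example below requires a network connection) of how the four join types behave:

```{r}
# Toy illustration of merge() join types on two overlapping xts series
idx1 <- as.Date("2015-01-01") + 0:4
idx2 <- as.Date("2015-01-03") + 0:4
a <- xts(1:5,   order.by = idx1)
b <- xts(11:15, order.by = idx2)
merge(a, b, join = "inner")   # only the 3 overlapping dates
merge(a, b, join = "outer")   # all 7 dates, NAs where a series has no value
merge(a, b, join = "left")    # the dates of a
merge(a, b, join = "right")   # the dates of b
```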

```{r}
library(quantmod)
getSymbols("TWTR")
head(TWTR)
str(TWTR)
```

Note that the data obtained from the getSymbols function of the quantmod library is already an xts object. As such, we can merge it directly with our unemployment rate xts object constructed above. Nevertheless, it is instructive to examine the data using the View() function to ensure that you understand the number of observations resulting from the joined series.

```{r}
# 1. Inner join
TWTR_unemp01 <- merge(unemp_xts, TWTR, join = "inner")
  str(TWTR_unemp01)
  head(TWTR_unemp01)

# 2. Outer join (filling the missing observations with 99999)
# Basic argument use
TWTR_unemp02 <- merge(unemp_xts, TWTR, join = "outer", fill = 99999)
  str(TWTR_unemp02)
  head(TWTR_unemp02)
  #View(TWTR_unemp02)

# Left join
TWTR_unemp03 <- merge(unemp_xts, TWTR, join = "left", fill = 99999)
  str(TWTR_unemp03)
  head(TWTR_unemp03)
  #View(TWTR_unemp03)

# Right join
TWTR_unemp04 <- merge(unemp_xts, TWTR, join = "right", fill = 99999)
  str(TWTR_unemp04)
  head(TWTR_unemp04)
  #View(TWTR_unemp04)
```

# Missing value imputation

xts also offers methods that allow filling missing values using the last or previous observation. Note that I include this simply to point out that it is possible. I by no means certify that this is the preferred method of imputing missing values in a time series. As I mentioned in the live session, the specific method to use for missing value imputation is completely context dependent.

Filling missing values from the last observation

```{r}
# First, let's replace the "99999" values with NA and then examine the series.

# Let's examine the first few dozen observations with NA
TWTR_unemp02['2013-10-01/2013-12-15'][,1]

# Replace observations with "99999" with NA and store in a new series
unemp01 <- TWTR_unemp02[, 1]
unemp01['2013-10-01/2013-12-15']
str(unemp01)
head(unemp01)
#TWTR_unemp02[, 1][TWTR_unemp02[, 1] >= 99990] <- NA

unemp02 <- unemp01
unemp02[unemp02 >= 99990] <- NA

cbind(unemp01['2013-10-01/2013-12-15'], unemp02['2013-10-01/2013-12-15'])

# Impute the missing values (stored as NA) with the last observation
TWTR_unemp02_v2a <- na.locf(TWTR_unemp02[,1], 
                            na.rm = TRUE, fromLast = TRUE) 
unemp03 <- unemp02
unemp03 <- na.locf(unemp03, na.rm = TRUE, fromLast = FALSE);

# Examine the pre- and post-imputed series
cbind(TWTR_unemp02['2013-10-01/2013-12-30'][,1], TWTR_unemp02_v2a['2013-10-01/2013-12-15'])

cbind(unemp01['2013-10-01/2013-12-15'], unemp02['2013-10-01/2013-12-15'],
unemp03['2013-10-01/2013-12-15'])
```

Another missing value imputation method is linear interpolation, which can also be easily done in xts objects. In the following example, we use linear interpolation to fill in the NA in between months. The result is stored in unemp04. Note in the following the different ways of imputing missing values.

```{r}
unemp04 <- unemp02
unemp04['2013-10-01/2014-02-01']
unemp04 <- na.approx(unemp04, maxgap=31)
unemp04['2013-10-01/2014-02-01']

round(cbind(unemp01['2013-10-01/2013-12-15'], unemp02['2013-10-01/2013-12-15'],
unemp03['2013-10-01/2013-12-15'],
unemp04['2013-10-01/2013-12-15']),2)
```

# Calculate difference in time series

A very common operation on time series is to take a difference of the series to transform a non-stationary series into a stationary series. First-order differencing takes the form $x(t) - x(t-k)$, where $k$ denotes the number of time lags. Higher-order differences are simply the reapplication of a difference to each prior result (like a second derivative, or a difference of the difference).
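As a minimal sketch (with a made-up series), a second-order difference is just a first difference applied twice:

```{r}
# The two columns below are identical: differences = 2 vs. diff applied twice
sq <- xts(c(1, 4, 9, 16, 25, 36), order.by = as.Date("2015-01-01") + 0:5)
cbind(diff(sq, differences = 2), diff(diff(sq)))
```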

Let's use the unemp_xts series as examples:

```{r}
str(unemp_xts)
unemp_xts

diff(unemp_xts, lag = 1, differences = 1, log = FALSE, na.pad = TRUE)

# calculate the first difference of unemp_xts using lag and subtraction
#unemp_xts - lag(unemp_xts, k = 1)

# calculate the first-order 12-month difference of unemp_xts
diff(unemp_xts, lag = 12, differences = 1)
```

# Task 4:

  1. Read
    A. Section 3.4 of "xts: Extensible Time Series" by Jeffrey A. Ryan and Joshua M. Ulrich

    B. The following questions in "xts FAQ"
        a. I am using apply() to run a custom function on my xts series. Why the returned matrix has different dimensions than the original one?

period.apply() aggregates over a particular period (month, day, etc.). If the time series has, for instance, a one-second period but we apply a function by minute, the returned series has roughly 1/60th as many rows as the original.
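A minimal sketch of that dimension change, using a made-up one-second series aggregated to one-minute means:

```{r}
# 600 one-second observations reduce to 10 one-minute means
sec_idx <- seq(as.POSIXct("2015-01-01 00:00:00"), by = "sec", length.out = 600)
x_sec <- xts(rnorm(600), order.by = sec_idx)
x_min <- period.apply(x_sec, endpoints(x_sec, on = "minutes"), mean)
c(nrow(x_sec), nrow(x_min))
```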

  2. Follow the following two sections of this document

# Apply various functions to time series

The family of apply functions is perhaps one of the most powerful in R. For time series, xts provides period.apply, which takes (1) a time series, (2) an index of endpoints, and (3) a function to apply. It takes the following general form:

```{r, eval=FALSE}
period.apply(x, INDEX, FUN, ...)
```

As an example, we use the Twitter stock price series (to be precise, the daily closing price), create an index storing the endpoints corresponding to the weeks of the daily series, and apply a function to calculate the weekly mean.

```{r}
# Step 1: Identify the endpoints; in this case, we use a weekly time interval. That is, we extract the end index of each week of the series

#View(TWTR)
head(TWTR)
TWTR_ep <- endpoints(TWTR[,4], on = "weeks")
#TWTR_ep

# Step 2: Calculate the weekly mean
TWTR.Close_weeklyMean <- period.apply(TWTR[, 4], INDEX = TWTR_ep, FUN = mean)
head(round(TWTR.Close_weeklyMean,2),8)
```

The power of the apply functions really comes with the use of custom-defined functions. For instance, we can easily compute several summary statistics of the weekly closing price at once:

```{r}
f <- function(x) {
  mean <- mean(x)
  quantile <- quantile(x,c(0.05,0.25,0.50,0.75,0.95))
  sd <- sd(x)

  result <- c(mean, sd, quantile)
  return(result)
}
head(round(period.apply(TWTR[, 4], INDEX = TWTR_ep, FUN = f),2),10)
```

# Calculate basic rolling statistics of series by month

Using rollapply, one can calculate rolling statistics of a series:

```{r}
# Calculate rolling mean over a 10-day period and print it with the original series
head(cbind(TWTR[,4], rollapply(TWTR[, 4], 10, FUN = mean, na.rm = TRUE)),15)
```

# Task 5:

  1. Read AMAZ.csv and UMCSENT.csv into R as R DataFrames
```{r}
AMAZ <- read.csv("AMAZ.csv")
  head(AMAZ)
  tail(AMAZ)
  str(AMAZ)
  summary(AMAZ)

UMCSENT <- read.csv("UMCSENT.csv")
  head(UMCSENT)
  tail(UMCSENT)
  str(UMCSENT)
  summary(UMCSENT)
```

AMAZ appears to contain daily values starting on 2007-01-03 and ending on 2013-01-13. UMCSENT appears to contain monthly values starting on 1978-01-01 and ending on 2017-09-01.

  2. Convert them to xts objects

```{r}
AMAZ$Index <- strptime(AMAZ$Index, "%Y-%m-%d")
AMAZ_xts <- xts(AMAZ[, -1], order.by = AMAZ[,1])
head(AMAZ_xts)
str(AMAZ_xts)

UMCSENT$Index <- strptime(UMCSENT$Index, "%Y-%m-%d")
UMCSENT_xts <- xts(UMCSENT[, -1], order.by = UMCSENT[,1])
head(UMCSENT_xts)
str(UMCSENT_xts)
```
  3. Merge the two sets of series together, preserving all of the observations in both sets of series

    a. Fill all of the missing values of the UMCSENT series with -9999

```{r}
merged <- merge(UMCSENT_xts, AMAZ_xts, join = "outer")
UMCSENT01 <- merged[,1]
UMCSENT01[is.na(UMCSENT01)] <- -9999
merged[,1] <- UMCSENT01
tail(UMCSENT01)
rbind(head(merged), merged['2011-07-01/2011-08-09'], tail(merged))
```

    b. Then create a new series, named UMCSENT02, from the original UMCSENT series and replace all of the -9999 values with NAs

```{r}
UMCSENT02 <- UMCSENT01
UMCSENT02[UMCSENT02 < -9990] <- NA
```

    c. Then create a new series, named UMCSENT03, and replace the NAs with the last observation

```{r}
UMCSENT03 <- UMCSENT02
UMCSENT03 <- na.locf(UMCSENT02, na.rm = TRUE, fromLast = FALSE)
```

    d. Then create a new series, named UMCSENT04, and replace the NAs using linear interpolation

```{r}
UMCSENT04 <- UMCSENT02
UMCSENT04 <- na.approx(UMCSENT04, maxgap=31)
```

    e. Print out some observations to ensure that the merge and the missing value imputation are done correctly. I leave it up to you to decide exactly how many observations to print; do something that makes sense. (Hint: Do not print out the entire dataset!)

```{r}
binded <- cbind(UMCSENT01, UMCSENT02, UMCSENT03, UMCSENT04, merged[,-1])
binded['2011-07-01/2011-08-09']
```
  4. Calculate the daily return of the Amazon closing price (AMAZ.close), where daily return is defined as $(x(t)-x(t-1))/x(t-1)$. Plot the daily return series.

```{r}
daily_return <- diff(AMAZ_xts[,4])/lag(AMAZ_xts[,4])
plot(daily_return)
```
  5. Create a 20-day and a 50-day rolling mean series from the AMAZ.close series.

```{r}
# Calculate rolling mean over a 20-day period and print it with the original series
Twenty_day_rolling <- cbind(AMAZ_xts[,4], rollapply(AMAZ_xts[,4], 20, FUN = mean, na.rm = TRUE))
head(Twenty_day_rolling,15)
plot(Twenty_day_rolling[,2])

# Calculate rolling mean over a 50-day period and print it with the original series
Fifty_day_rolling <- cbind(AMAZ_xts[,4], rollapply(AMAZ_xts[,4], 50, FUN = mean, na.rm = TRUE))
head(Fifty_day_rolling,15)
plot(Fifty_day_rolling[,2])
```
