Time Series Analysis - A 25-Year History of Driving Records in the United States

by Himal Suthar, XT Nguyen, Aaron Olson - Thu 08 March 2018
Tags: #python #gradient boost #decision tree #spark #pyspark

Description of the Lab

In this lab, you are asked to answer the question "Do changes in traffic laws affect traffic fatalities?" To do so, you will conduct the tasks specified below using the data set driving.Rdata, which includes 25 years of data that cover changes in various state drunk driving, seat belt, and speed limit laws.

Specifically, this data set contains data for the 48 continental U.S. states from 1980 through 2004. Various driving laws are indicated in the data set, such as the alcohol level at which drivers are considered legally intoxicated. There are also indicators for per se laws - where licenses can be revoked without a trial - and seat belt laws. A few economic and demographic variables are also included. A description of each of the variables is included with the dataset.

Exercises:

Question 1

  1. Load the data. Provide a description of the basic structure of the dataset, as we have done throughout the semester. Conduct a very thorough EDA, which should include both graphical and tabular techniques, on the dataset, including both the dependent variable totfatrte and the potential explanatory variables. You need to write a detailed narrative of your observations of your EDA. Reminder: giving an "output dump" (i.e. providing a bunch of graphs and tables without description and hoping your audience will interpret them) will receive a zero in this exercise.

```{r echo = FALSE}
library(ggplot2)
library(dplyr)
library(mcprofile)
library(car)
library(Hmisc)
library(ggcorrplot)
library(plm)
library(GGally)
library(EnvStats)

load("driving.RData")
df <- data
head(df)

describe(df)

table(df$year, df$state)
```

During our analysis we utilized the describe function in order to print summary statistics for each variable in the dataset. Due to its length, we have excluded that printout here.

Within the dataset, we note that there are 48 continental states with panel data collected from 1980 to 2004. There are no missing values, and using the table function we can see that the dataset is structured by state and by year, with a single row for each state-year combination.

## Variable Transformations and EDA
```{r}
table(df$sl75)
```

There are also several variables, such as sl75 and bac08, that are primarily binary (0 or 1) but take fractional values when a law was in effect for only part of a year. Analyzing these variables, most values are either 0 or 1, with only a handful (out of the entire dataset) falling strictly between 0 and 1. For this analysis we have chosen to assign values less than 0.5 to 0 (False) and values greater than or equal to 0.5 to 1 (True).

```{r}
# Make variables binary and transform other variables
# bac_binary = 1 if either a 0.08 or 0.10 BAC law was in effect for at least half the year
df$bac_binary <- ifelse(df$bac08 >= 0.5 | df$bac10 >= 0.5, 1, 0)

# Create speed variable with the intent of creating one variable denoting the speed limit
df$speed <- NA
df$speed[df$sl55 >= 0.5] <- 55
df$speed[df$sl65 >= 0.5] <- 65
df$speed[df$sl70 >= 0.5] <- 70
df$speed[df$sl75 >= 0.5] <- 75
df$speed[df$sl70plus == 1] <- 80

# Continue to make variables binary
df$sl55_binary <- ifelse(df$sl55 >= 0.5, 1, 0)
df$sl65_binary <- ifelse(df$sl65 >= 0.5, 1, 0)
df$sl70_binary <- ifelse(df$sl70 >= 0.5, 1, 0)
df$sl75_binary <- ifelse(df$sl75 >= 0.5, 1, 0)
df$sl70plus_binary <- ifelse(df$sl70plus >= 0.5, 1, 0)
df$bac08_binary <- ifelse(df$bac08 >= 0.5, 1, 0)
df$bac10_binary <- ifelse(df$bac10 >= 0.5, 1, 0)
df$gdl_binary <- ifelse(df$gdl >= 0.5, 1, 0)
df$zerotol_binary <- ifelse(df$zerotol >= 0.5, 1, 0)
df$perse_binary <- ifelse(df$perse >= 0.5, 1, 0)
df$seatbelt_binary <- ifelse(df$seatbelt >= 0.5, 1, 0)
```

After transforming the variables, we begin the EDA of the dataset.

```{r}
# Histogram of totfatrte variable
ggplot(df, aes(x = totfatrte)) +
  geom_histogram() +
  ggtitle('Fatality Rate (Per 100,000 population) Histogram') +
  theme_bw()

ggplot(df, aes(x = totfatrte, fill = as.factor(bac_binary))) +
  geom_histogram(alpha = 0.7) +
  theme(legend.position = c(0.9, 0.9)) +
  theme_bw() +
  ggtitle('Fatality Rate Histogram by Presence of Blood Alcohol Limit (BAC) Laws')

ggplot(df, aes(x = totfatrte, fill = as.factor(seatbelt))) +
  geom_histogram(alpha = 0.7) +
  theme(legend.position = c(0.9, 0.9)) +
  theme_bw() +
  labs(title = 'Fatality Rate Histogram by Presence of Seatbelt laws',
       subtitle = '=0 if none, =1 if primary, =2 if secondary')
```

The histograms above show the distribution of total fatalities per 100,000 people, shaded by two categories: whether the state had a seatbelt law in effect (0 for no law, 1 for a primary law, 2 for a secondary law) and whether the state had a BAC law in effect (0 if a law was in place for less than half the year, 1 if it was in place for at least half the year). The distributions have a similar shape across the different categories, each with a long right tail. This potentially indicates that changes in these two laws account for only limited differences in the fatality rate.

In order to further our analysis we also looked at the distributions of miles driven, unemployment, and perc14_24. From the histograms we can see that miles driven follows an approximately normal distribution with a long right tail. Unemployment also has a long right tail and would benefit from a log transformation to make it approximately normal. Perc14_24 does not appear to approximate a normal distribution, and applying a log transformation has only a limited effect, so we chose to leave that variable as is for this analysis. (A sketch of these histograms follows.)
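
The histograms referenced in this paragraph are not reproduced above; a minimal sketch that generates them, assuming vehicmilespc, unem, and perc14_24 are the intended columns:

```{r fig.width=10, fig.height=3}
# Sketch: histograms of the three continuous covariates discussed above
ggplot(df, aes(x = vehicmilespc)) + geom_histogram(bins = 30) + theme_bw() +
  ggtitle('Vehicle Miles per Capita')

ggplot(df, aes(x = unem)) + geom_histogram(bins = 30) + theme_bw() +
  ggtitle('Unemployment Rate')

ggplot(df, aes(x = perc14_24)) + geom_histogram(bins = 30) + theme_bw() +
  ggtitle('Percent of Population Aged 14-24')
```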

We know that our panel dataset is organized by year and by state. We further our EDA below by exploring the dependent variable (totfatrte) amongst these two indexes.

```{r warning = FALSE, fig.width=10, fig.height=4}
ggplot(df, aes(as.factor(year), totfatrte)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  theme_bw() +
  ggtitle('Fatality Rates by Year')

ggplot(df, aes(as.factor(state), totfatrte)) +
  geom_boxplot() +
  geom_jitter(width = 0.2) +
  theme_bw() +
  ggtitle('Fatality Rates by State')

# xyplot comes from the lattice package
library(lattice)
xyplot(totfatrte ~ year | state, data = df, as.table = TRUE)
```

The boxplot across years shows an overall decreasing trend until approximately 1992, after which total fatalities per 100,000 people remain approximately constant. The variance is also not constant across years, indicating heterogeneity in totfatrte over time.

In the total fatalities by state boxplot, we can see that each state has a distinct distribution of total fatality rates. This is likely due to a number of different effects, such as road hazard conditions, the population driving on the roadways, inclement weather, etc. The purpose of our analysis and model is to determine the effectiveness of traffic laws on the total fatality rate.

Total fatalities per 100,000, broken out by state and by year, indicate that there is serial correlation, so pooled OLS is likely a poor fit. Instead, a fixed effects model is a good candidate for analyzing this dataset and question. A quick check of the panel structure is sketched below.
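
Before fitting panel models, we can confirm the balanced state-by-year structure that plm will use. A minimal sketch, assuming the state and year columns index the panel:

```{r}
# Sketch: confirm the panel dimensions plm will use (balanced panel of states over 1980-2004)
pdf_check <- pdata.frame(df, index = c("state", "year"))
pdim(pdf_check)
```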

We previously analyzed the histogram of totfatrte grouped by bac and seatbelt traffic laws. Below we extend this analysis by observing the effect of these laws over time. 

```{r fig.width=10, fig.height=4}
ggplot(df, aes(as.factor(year), totfatrte)) +
  geom_boxplot(aes(colour = as.factor(bac_binary))) +
  theme_bw() +
  ggtitle("Fatalities by Year by Blood Alcohol Limit (BAC) Law Presence") 

ggplot(df, aes(as.factor(year), totfatrte)) +
  geom_boxplot(aes(colour = as.factor(seatbelt))) +
  ggtitle("Fatalities by Year by Seatbelt Law Presence") +
  theme_bw() +
  theme(legend.position = "bottom")
```

Above are the boxplots of totfatrte broken out by the seatbelt and bac_binary variables across time. Neither plot shows a strong effect of either variable on the yearly fatality rates, though we can see that both laws were introduced and/or modified as time progressed.

Our dependent variable totfatrte is defined as the total fatalities per 100,000 population. Intuitively, we know that the risk associated with driving increases the more a population drives, so it is important to analyze how totfatrte varies with vehicmilespc (vehicle miles driven per capita). Furthermore, we have generated a correlation plot matrix for all variables (except the year indicators) to determine if there are correlations among other variables that have not yet been analyzed.

```{r fig.width=7, fig.height=7}
ggcorrplot(cor(df[, -c(31:55)]), type = 'upper') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Correlation plot for Driving Dataset")
```

The overall correlation plot does not show any surprising correlations beyond those previously discussed. Certain law variables are correlated with year, indicating that those laws changed over the years covered by the dataset.

Let's now take a look at a correlation matrix of the variables perc14_24, unem, and vehicmilespc with totfatrte.

```{r}
ggpairs(df[, c('totfatrte', 'perc14_24', 'unem', 'vehicmilespc')])
```

There is a modest correlation between totfatrte and each of perc14_24, unem, and vehicmilespc, with perc14_24 showing the strongest correlation. From the histograms we can see a slight skew in the unem distribution, which can be corrected with a log transformation. We also see a very different range of values for vehicmilespc compared with all the other variables, so a normalization or scaling step is recommended; we will scale this variable to a 0-100 range prior to fitting a model with it.

Let's also take a look at these values by year and state to check for heterogeneity across both dimensions.

```{r}
# Conditional box-plot helper
conditional_plot <- function(data, plotvar, condvar, title) {
  ggplot(data, aes(as.factor(condvar), plotvar)) +
    geom_boxplot() +
    geom_jitter(width = 0.2, size = 0.1) +
    ggtitle(title) +
    theme_bw()
}

# perc14_24 by year
conditional_plot(df, df$perc14_24, df$year, "perc14_24 by year")

# perc14_24 by state
conditional_plot(df, df$perc14_24, df$state, "perc14_24 by state")

# unem by year
conditional_plot(df, df$unem, df$year, "unem by year")

# unem by state
conditional_plot(df, df$unem, df$state, "unem by state")

# vehicmilespc by year
conditional_plot(df, df$vehicmilespc, df$year, "vehicmilespc by year")
```

As seen in the charts above, there is a definitive negative trend in perc14_24 over time and a slight negative, somewhat cyclical trend in unem over time. Additionally, vehicmilespc increases over time, which could reflect increasing overall travel or other factors, and its variance also appears to increase with time. Neither perc14_24 nor unem shows obvious heterogeneity by state, but all three variables show heterogeneity by year.

From our EDA we have determined that unem, perc14_24, and vehicmilespc are likely significant variables for predicting the total fatality rate.

Question 2.

2.1 How is our dependent variable of interest totfatrte defined?

From the dataset description, totfatrte is the total number of fatalities per 100,000 population. Note that population does not necessarily mean drivers, so the rate may not precisely represent the number of persons on the roadways. A quick consistency check against the raw counts is sketched below.
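
A minimal sanity check on this definition, assuming the dataset also contains the raw totfat fatality counts and statepop population figures:

```{r}
# Sketch: totfatrte should equal total fatalities per 100,000 population
all.equal(df$totfatrte, 100000 * df$totfat / df$statepop)
```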

2.2. What is the average of this variable in each of the years in the time period covered in this dataset?

```{r}
avg_year <- df %>%
  group_by(year) %>%
  summarise(avg_rate = mean(totfatrte, na.rm = TRUE))

avg_year

ggplot(avg_year, aes(x = year, y = avg_rate)) +
  geom_point() +
  ggtitle('Average totfatrte per Year')
```

In the plot above we generally see a decreasing trend until the year 1992, after which totfatrte is approximately constant. Below we fit a linear model using year dummy variables as the explanatory variables; we expect to see an overall decrease in totfatrte across the years.

2.3 Estimate a linear regression model of totfatrte on a set of dummy variables for the years 1981 through 2004.

```{r}
fit.year <- lm(totfatrte ~ d80 + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88 + d89 + d90 +
                 d91 + d92 + d93 + d94 + d95 + d96 + d97 + d98 + d99 + d00 + d01 + d02 + d03,
               data = df)
summary(fit.year)
```

2.4 What does this model explain? Describe what you find in this model.

The model explains how the fatality rate, averaged across states, changes in each year compared to the base year of 2004. For example, the coefficient on d88 is 4.16271, which means that on average the total fatality rate was 4.16 higher in 1988 (20.89) than in 2004 (16.72), matching the average fatality rates by year reported above.

2.5 Did driving become safer over this period? Please provide a detailed explanation.

From the estimated coefficients alone, which are all positive, we can say that, compared to the base year of 2004, the previous years had higher fatality rates.

From the regression's test statistics and the estimated parameters, we can see a statistically significant effect for the years 1980 through 1991. From 1992 onwards, however, the difference from the 2004 base year is not statistically significant.

Therefore, we would say that driving became safer from 1980 through 1991, but there is insufficient evidence to conclude the same for the years 1992 through 2004. This claim can be formalized with a joint test, sketched below.
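
A sketch of such a joint test, using linearHypothesis from the already-loaded car package to test whether the 1992-2003 year effects are jointly indistinguishable from the 2004 base year:

```{r}
# Sketch: joint F-test that the 1992-2003 dummies are all zero (i.e., equal to the 2004 base year)
post91 <- paste0("d", c(92:99, "00", "01", "02", "03"), " = 0")
linearHypothesis(fit.year, post91)
```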

Question 3

3.1 Expand your model in Exercise 2 by adding variables bac08, bac10, perse, sbprim, sbsecon, sl70plus, gdl, perc14_24, unem, vehicmilespc, and perhaps transformations of some or all of these variables.

```{r}
# Apply transformations as previously discussed in the EDA section
df$log_unem <- log(df$unem)
df$norm_vehicmilespc <- 100 * df$vehicmilespc / (max(df$vehicmilespc) - min(df$vehicmilespc))

# Fit model
fit.multi <- lm(totfatrte ~ d80 + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88 + d89 + d90 +
                  d91 + d92 + d93 + d94 + d95 + d96 + d97 + d98 + d99 + d00 + d01 + d02 + d03 +
                  bac10_binary + bac08_binary + perse_binary + sbprim + sbsecon + sl70plus_binary +
                  gdl_binary + perc14_24 + log_unem + norm_vehicmilespc,
                data = df)
summary(fit.multi)
```

3.2 Please explain carefully your rationale, which should be based on your EDA, behind any transformation you made. If no transformation is made, explain why transformation is not needed.

For this model we are using binarized versions of the variables bac10, bac08, perse, sl70plus, and gdl, as only a small proportion of the dataset has values between 0 and 1 for these variables. The rationale is explained in more detail in the Variable Transformations and EDA section above.

We decided to use a log transformation on the unem variable because the histogram indicated that a log transformation would make the distribution of unemployment rates approximately normal.

We kept the variables sbprim and sbsecon as is since they only take on values 0 and 1.

We kept the variable perc14_24 as is since it can be treated as a continuous variable for this model.

For the vehicmilespc variable, the range is quite large in comparison to the ranges of all other variables (ignoring vehicmiles) as well as our dependent variable. We therefore applied a normalization step to scale this variable to a range of 0 to 100. This does complicate interpretation of its coefficient. The minimum and maximum values for this variable, respectively, are:

```{r}
print(min(df$vehicmilespc))
print(max(df$vehicmilespc))
```

From our model summary, the coefficient for norm_vehicmilespc is 0.41121; therefore, a 1,000-mile increase corresponds to an estimated increase of $\Delta totfatrte = 0.41121 \times \frac{100 \times 1000}{18390 - 4372} = 2.933$ in the totfatrte variable.

3.3 How are the variables bac08 and bac10 defined? Interpret the coefficients on bac08 and bac10.

The bac08 and bac10 variables indicate whether the state had a BAC limit of 0.08 or 0.10, respectively, during that year.

The coefficients on bac10_binary and bac08_binary are -1.131 (SE 0.3608) and -2.207 (SE 0.4890), respectively. This means that the presence of a BAC 0.10 limit is associated with a decrease in the fatality rate of 1.131, and the presence of a BAC 0.08 limit with a decrease of 2.207. Intuitively, this makes sense: the lower the blood alcohol limit, the stricter the traffic law, and hence the larger the reduction in the fatality rate.

3.4 Do per se laws have a negative effect on the fatality rate?

The per se law has only a weakly statistically significant effect (at the 0.05 level) on totfatrte. Its coefficient of -0.5396 indicates that, in general, it had a negative (decreasing) effect on the total fatality rate.

3.5 What about having a primary seat belt law? (Note that if a law was enacted sometime within a year the fraction of the year is recorded in place of the zero-one indicator.)

The primary seat belt law (sbprim) did not have a statistically significant effect (even at the 0.1 level) on totfatrte. Its coefficient of -0.3530 indicates that, in general, it had a negative effect on the total fatality rate. Intuitively this makes sense, but the effect is not statistically significant.

Question 4

4.1 Reestimate the model from Exercise 3 using a fixed effects (at the state level) model.

```{r}
model.fe2 <- plm(totfatrte ~ d80 + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88 + d89 + d90 +
                   d91 + d92 + d93 + d94 + d95 + d96 + d97 + d98 + d99 + d00 + d01 + d02 + d03 +
                   bac10_binary + bac08_binary + perse_binary + sbprim + sbsecon + sl70plus_binary +
                   gdl_binary + perc14_24 + log_unem + norm_vehicmilespc,
                 data = df,
                 index = c("state", "year"),
                 model = "within")

summary(model.fe2)
pbgtest(model.fe2)
```

4.2 How do the coefficients on bac08, bac10, perse, and sbprim compare with the pooled OLS estimates?

The coefficients on bac08 and bac10 are statistically significant in both the pooled OLS and FE models, but less strongly so in the FE model. Also, the estimated effects of the blood alcohol limit laws in the FE model are less than half of those in the pooled OLS model.

The coefficients on perse and sbprim are either weakly statistically significant or not significant at all in the pooled OLS model, but both are strongly statistically significant in the fixed effects (state-level) model, although the effect estimates are much lower in the FE model. The estimates are placed side by side in the sketch below.
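
As a reference for this comparison, a minimal sketch that pulls the relevant coefficients from the two fitted models:

```{r}
# Sketch: pooled OLS vs. state fixed-effects estimates for the key law variables
law_vars <- c("bac08_binary", "bac10_binary", "perse_binary", "sbprim")
round(cbind(pooled_OLS = coef(fit.multi)[law_vars],
            fixed_effects = coef(model.fe2)[law_vars]), 3)
```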

4.3 Which set of estimates do you think is more reliable?

The FE model should be more reliable than the pooled OLS model, because it takes into account variation both over time and across states, including unobserved state-level effects.

```{r}
ggplot(df, aes(as.factor(year), model.fe2$residuals)) +
  geom_boxplot() +
  geom_jitter(width = 0.2, size = 0.1) +
  theme_bw() +
  ggtitle('Model Residuals by Year')
```

4.4 What assumptions are needed in each of these models? Are these assumptions reasonable in the current context?

Assumptions for the FE model

Random sample from the cross section

This is presumed to be true during the data collection for this dataset.

Each explanatory variable changes over time and no perfect linear relationships exist among the explanatory variables.

This was analyzed via the describe function during the EDA analysis and is true.

Strict Exogeneity

Below we check the condition $\mathrm{cov}(x_{itj}, \epsilon_{is}) = 0$ for all $s$ and $t$, which is required for strict exogeneity.

```{r}
# Iterate over regressors and years to compute cov(x_itj, e_is) for the strict exogeneity check
df$fitted.fe <- df$totfatrte - residuals(model.fe2)
df$residuals.fe <- residuals(model.fe2)

df_model <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df_model) <- c('year', 'variable', 'covariance')

for (t in unique(df$year)) {
  for (variable in colnames(model.fe2$model)[-1]) {
    for (year_s in unique(df$year)) {
      cv <- cov(df[df$year == t, ][variable], df[df$year == year_s, ]$residuals.fe)
      df_model <- rbind(df_model, data.frame(year = t, variable = variable, covariance = cv[1]))
    }
  }
}

ggplot(df_model, aes(x = year, y = covariance, col = variable)) +
  geom_point() +
  xlab("Year") + ylab("Covariance") +
  ggtitle("Strict Exogeneity Assumption - Fixed Effects Model")
```

From the produced plot we can see that not all covariance values are close to zero, so this assumption is violated. We therefore do not have an unbiased model; however, the fixed effects estimator may still be consistent.

When analyzing the covariance plot, all variables besides vehicmilespc are close to zero. We normalized vehicmilespc to see whether the correlation would decrease, but it did not decrease enough to state that the covariance is zero. From our EDA we observed an overall increasing trend in vehicmilespc and an overall decreasing trend in totfatrte, and from the plot above we conclude that the strict exogeneity assumption is violated. However, we choose to include vehicmilespc because, intuitively, miles driven is an explanatory variable for the total fatality rate (if vehicmilespc were 0, total fatalities from auto accidents would be zero).

No Serial Correlation or Heteroskedasticity in the Error Term:

Homoskedasticity

To get valid standard errors on the coefficients, we need a constant error variance. From the residuals vs. fitted plot (shown below), there appears to be an increase in variance as the fitted values increase, indicating heteroskedasticity.

```{r echo=FALSE}
library(lmtest)
bptest(model.fe2)
```

From the Breusch-Pagan test we can see that there is heteroskedasticity in the model. One remedy is to compute heteroskedasticity-robust standard errors from the robust variance-covariance matrix, which we do below.

```{r}
coeftest(model.fe2, vcov = vcovHC(model.fe2, cluster = 'group'))
```

Comparing the standard errors, we can see that the robust errors push some coefficients toward non-significance, particularly perc14_24 and bac08; bac10 is only significant at the 0.10 level using robust standard errors.

Serial Correlation:

We must also check for serial correlation, which can be done using the Breusch-Godfrey test.

```{r}
pbgtest(model.fe2)
```

The test statistic has a p-value < 0.05, indicating that serial correlation is present in our fixed effects model. The presence of serial correlation further invalidates the classical standard error computation; one possible remedy is sketched below.
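
One option, sketched here under the assumption that the installed plm version provides the Driscoll-Kraay vcovSCC estimator, is to report standard errors robust to both heteroskedasticity and serial correlation:

```{r}
# Sketch: Driscoll-Kraay (SCC) standard errors, robust to heteroskedasticity and serial correlation
coeftest(model.fe2, vcov = vcovSCC(model.fe2))
```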

We also examine the remaining standard OLS assumptions:

Zero conditional mean

```{r fig.height = 3, fig.width = 4, fig.align = "center"}
df$fitted.fe <- df$totfatrte - residuals(model.fe2)
df$residuals.fe <- residuals(model.fe2)

ggplot(df, aes(fitted.fe, residuals.fe)) +
  geom_point() +
  stat_smooth(method = "loess") +
  geom_hline(yintercept = 0, col = "red", linetype = "dashed") +
  xlab("Fitted values") + ylab("Residuals") +
  ggtitle("Residual vs Fitted Plot") +
  theme_bw()
```

#### Normally Distributed Error
To conduct hypothesis t-tests on the model coefficients, we need approximately normally distributed errors.
```{r fig.height = 3, fig.width = 4, fig.align = "center"}
ggplot(df, aes(sample=residuals.fe)) +
    stat_qq() + stat_qq_line() +
    xlab("Theoretical Quantiles") + ylab("Standardized Residuals") +
    ggtitle("Normal Q-Q")

We can see that there is fairly significant deviation from the QQ line, indicating that the residuals are not normally distributed.

Question 5

  1. Would you prefer to use a random effects model instead of the fixed effects model you built in Exercise 4? Please explain.

```{r}
model.re2 <- plm(totfatrte ~ d80 + d81 + d82 + d83 + d84 + d85 + d86 + d87 + d88 + d89 + d90 +
                   d91 + d92 + d93 + d94 + d95 + d96 + d97 + d98 + d99 + d00 + d01 + d02 + d03 +
                   bac10_binary + bac08_binary + perse_binary + sbprim + sbsecon + sl70plus_binary +
                   gdl_binary + perc14_24 + log_unem + norm_vehicmilespc,
                 data = df, index = c("state", "year"), model = "random")
summary(model.re2)
phtest(model.fe2, model.re2)
```

Despite the Hausman test results, we prefer the fixed effects model because the random effects model assumes that the individual state-specific effects are uncorrelated with the independent variables, and this assumption is not satisfied here. Moreover, the random and fixed effects models produce similar coefficients and levels of significance. Therefore, we decide to go with the fixed effects model.

Question 6

Suppose that vehicmilespc, the number of miles driven per capita, increases by 1,000. Using the FE estimates, what is the estimated effect on totfatrte? Please interpret the estimate.

Based on the fixed effects model in part 4, the coefficient for norm_vehicmilespc is 0.133128. Because this is a normalized variable, we have to undo the scaling to interpret the effect: $$\Delta totfatrte = 0.133128 \times \frac{100 \times 1000}{18390 - 4372} = 0.9497$$ Therefore, a 1,000-mile increase in vehicmilespc is associated with an estimated 0.95 increase in totfatrte. This calculation is reproduced in the sketch below.
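
This back-transformation can be reproduced directly from the fitted model; a minimal sketch using the scaling constants computed earlier:

```{r}
# Sketch: implied change in totfatrte for a 1,000-mile increase in vehicmilespc
rng <- max(df$vehicmilespc) - min(df$vehicmilespc)
unname(coef(model.fe2)["norm_vehicmilespc"] * 100 * 1000 / rng)
```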

Question 7

If there is serial correlation or heteroskedasticity in the idiosyncratic errors of the model, what would be the consequences on the estimators and their standard errors?

Serial correlation or heteroskedasticity in the idiosyncratic errors does not by itself bias the point estimates, but it does cause the model's standard errors to be incorrect. As described by Wooldridge, the central limit theorem can be invoked for first-differenced models; however, for fixed effects models this is not necessarily the case when AR(1) effects have not been accounted for. We can use the Breusch-Godfrey test to test for serial correlation.

```{r}
pbgtest(model.fe2)
```

We can see that the test indicates serial correlation, so our computed (classical) standard errors will likely be inaccurate.
