Goodness of fit in Rstudio

After estimating Ordinary Least Squares or OLS in Rstudio, we must ensure that the model is a good fit. The model must satisfy the assumptions of OLS including heteroscedasticity, no multicollinearity, normality of residuals and no autocorrelation. We can apply several tests and generate plots to check for the goodness of fit of a model.

Here, we will estimate the same OLS model as we did in the earlier post on OLS estimation. After this, we will illustrate the estimation of various tests and procedures essential for determining the fit of any OLS model. Our analysis will focus on the following:

  • Checking the normality assumption of residuals using plots.
  • Reporting the Standard errors, P-values, R-square, Adjusted R-square, Confidence intervals and Anova results.
  • Testing for heteroscedasticity using several tests and plots.
  • Check for the problem of multicollinearity.
  • Perform tests of autocorrelation and generate Autocorrelation (ACF) and Partial autocorrelation (PACF) plots.

Goodness of fit: code and results

We will use the “mtcars” data available in “datasets” package to estimate an OLS model. First step involves loading the libraries and the dataset, followed by the analysis.

Here we will illustrate the tests for Heterocedasticity, Multicollinearity and Autocorrelation that can be conducted easily using R or Rstudio.

OLS Estimation, Confidence Intervals, Anova and Normality of Residuals

The “lm” is used to estimate the OLS model. For Goodness of fit, the function automatically estimates Standard Errors, P-values, R-square and Adjusted R-square.

To supplement this, we can use confint to estimate the confidence intervals of the coefficients and anova to do an F-test.

Fitted values and residuals/error terms can be obtained using fitted and resid commands. To check the distribution of residuals graphically, we can plot a histogram using hist.

stargazer(ols1 <- lm(mpg ~ IV), type = "text", style = "all", title = "OLS results")
## 
## OLS results
## ======================================================
##                            Dependent variable:        
##                     ----------------------------------
##                                    mpg                
## ------------------------------------------------------
## IVhp                              -0.018              
##                                  (0.012)              
##                                 t = -1.519            
##                                 p = 0.141             
## IVcyl                            -0.942*              
##                                  (0.551)              
##                                 t = -1.709            
##                                 p = 0.099             
## IVwt                            -3.167***             
##                                  (0.741)              
##                                 t = -4.276            
##                                 p = 0.0002            
## Constant                        38.752***             
##                                  (1.787)              
##                                 t = 21.687            
##                                 p = 0.000             
## ------------------------------------------------------
## Observations                        32                
## R2                                0.843               
## Adjusted R2                       0.826               
## Residual Std. Error          2.512 (df = 28)          
## F Statistic         50.171*** (df = 3; 28) (p = 0.000)
## ======================================================
## Note:                      *p<0.1; **p<0.05; ***p<0.01
conf <- confint(ols1, level = 0.95)
stargazer(conf, type = "text", title = "Confidence Intervals")
## 
## Confidence Intervals
## =========================
##             2.5 %  97.5 %
## -------------------------
## (Intercept) 35.092 42.412
## IVhp        -0.042 0.006 
## IVcyl       -2.070 0.187 
## IVwt        -4.684 -1.650
## -------------------------
anova <- anova(ols1)
stargazer(anova, type = "text", title = "Anova")
## 
## Anova
## ============================================
## Statistic N  Mean   St. Dev.   Min     Max  
## --------------------------------------------
## Df        2 15.500   17.678     3      28   
## Sum Sq    2 563.024 546.456  176.621 949.427
## Mean Sq   2 161.392 219.322   6.308  316.476
## F value   1 50.171           50.171  50.171 
## Pr(> F)   1  0.000              0       0   
## --------------------------------------------
ycap <- fitted(ols1)
error <- resid(ols1)
hist(error, breaks = 5)

Heteroscedasticity

We will use the skedastic package to apply various tests of heteroscedasticity.

We can also use the plot command to get some insights into the behaviour of residuals by plotting them with fitted values.

An interesting application of the plot command is to use it on the model object ols1 that we estimated: plot(ols1). This automatically generated 4 different types of plots that give insights into normality as well as heteroscedasticity.

plot(ycap,error) # Residual vs fitted plot

plot(ols1) # 4 different types of plot

lmtest::bptest(ols1)
## 
##  studentized Breusch-Pagan test
## 
## data:  ols1
## BP = 2.9351, df = 3, p-value = 0.4017
skedastic::white_lm(ols1) # White's Heteroscedasticity Test
## # A tibble: 1 × 5
##   statistic p.value parameter method       alternative
##       <dbl>   <dbl>     <dbl> <chr>        <chr>      
## 1      5.83   0.442         6 White's Test greater
skedastic::breusch_pagan(ols1) # Breusch-Pagan Test
## # A tibble: 1 × 5
##   statistic p.value parameter method                alternative
##       <dbl>   <dbl>     <dbl> <chr>                 <chr>      
## 1      2.94   0.402         3 Koenker (studentised) greater
skedastic::cook_weisberg(ols1) # Cook-Weisberg Test
## # A tibble: 1 × 5
##   statistic p.value parameter method alternative
##       <dbl>   <dbl>     <int> <chr>  <chr>      
## 1      3.48   0.324         3 mult   greater
skedastic::glejser(ols1) # Glejser Test
## # A tibble: 1 × 4
##   statistic p.value parameter alternative
##       <dbl>   <dbl>     <dbl> <chr>      
## 1      2.49   0.478         3 greater
skedastic::hetplot(ols1, horzvar = "fitted.values")

Multicollinearity

To check for multicollinearity in the model, we can look at the correlations using the cor command.

Using the car package, we will estimate the Variance Inflation Factors (VIF). As a reule of thumb, multicollinearity is a serious problem if VIF > 10.

stargazer(cor(IV), type = "text", title = "Correlations")
## 
## Correlations
## =====================
##      hp    cyl   wt  
## ---------------------
## hp    1   0.832 0.659
## cyl 0.832   1   0.782
## wt  0.659 0.782   1  
## ---------------------
vif1 <- vif(lm(mpg ~ hp + cyl + wt, data = ols))
stargazer(vif1, type = "text", title = "VIF")
## 
## VIF
## =================
## hp     cyl   wt  
## -----------------
## 3.258 4.757 2.580
## -----------------

Autocorrelation

Autocorrelation and Partial Autocorrelation plots are widely used in Econometrics. Here, acf and pacf commands are used on the residuals to determine their autocorrelation structure.

In addition to this, we can also apply the Durbin-Watson and Breusch-Godfrey test of serial correlation.

acf(error, lag.max = 6, type = "correlation")

pacf(error, lag.max = 6)

lmtest::dwtest(ols1) # Durbin-Watson test for AR(1) only
## 
##  Durbin-Watson test
## 
## data:  ols1
## DW = 1.6441, p-value = 0.1002
## alternative hypothesis: true autocorrelation is greater than 0
lmtest::bgtest(ols1, order = 6)
## 
##  Breusch-Godfrey test for serial correlation of order up to 6
## 
## data:  ols1
## LM test = 5.0855, df = 6, p-value = 0.5329

Leave a Reply