OLS in Rstudio: Ordinary Least Squares

The method of Ordinary Least Squares (OLS) is one of the basic types of analysis in Econometrics. It is widely used in other fields of study as well. R programming language can be used to apply OLS. We are going to implement OLS in Rstudio because it provides an easy-to-use environment for implementing R programming code.

In this post, we will discuss how OLS can be easily implemented in Rstudio. We are going to report the results of the model along with the code. This will enable us to understand the working of R code and the role of packages in R or Rstudio. To learn more about the theory of OLS and its interpretation, go to Ordinary Least Squares Estimation or Interpretation of Coefficients: OLS.

OLS in Rstudio: packages

The “install.packages” command can be used to install any packages that are needed. There are thousands of open source packages available to install in R. However, there are certain packages that are useful for Econometric analysis. Some of these are:

  • tidyverse: tidyverse is a collection of packages that are very useful for visualisation, cleaning and manipulating data.
  • stargazer: this package can be used to present the R output in several table formats widely used in research journals or simply to make the output more organised.
  • skedastic: a package to implement heteroscedasticity tests.
  • car: this package also provides various functions for Econometric analysis.
  • haven: this package can be used to load datasets of various formats such as CSV, Excel, Stata files etc.
  • datasets: as the name suggests, it contains a number of datasets that are readily available to load and be used for analysis.
  • magrittr: it allows us to use a pipe “%>%” which can be used to pass values forward into functions.
  • rmarkdown: we will use this package to generate the HTML report below.

In this post, we are going to use some of these functions to estimate an OLS model in Rstudio. Let us estimate our OLS model.

Rstudio: Code and results

First, we have to load all the libraries and the dataset that we will be using for illustration. Here, we will use the “mtcars” data available in the “datasets” package.

Next, we will extract the vectors and matrices that we are going to use for OLS estimation. We can name the vectors or columns of the matrices as needed:

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(magrittr)
library(stargazer)
library(haven)
library(skedastic)
library(car)
library(datasets)
ols <- datasets::mtcars
mpg <- cbind(ols$mpg)
hp <- cbind(ols$hp)
IV <- cbind(ols$hp, ols$cyl, ols$wt)
colnames(IV, do.NULL = FALSE)
## [1] "col1" "col2" "col3"
colnames(IV) <- c("hp", "cyl", "wt")

Basic Plots

Let us look at some basic plots of the data:

  1. A scatterplot of mileage per gallon (mpg) and horsepower (hp) showing the a negative relationship between the two.
  2. A histogram to see the distribution of mileage per gallon (mpg).
plot(mpg ~ hp, data = ols)

hist(mpg)

Correlations

We can easily estimate the correlation coefficients using the “cor” command. Using the Stargazer package, we can make the table look neat and easy to read as seen in the last table.

cor(mpg, IV)
##              hp       cyl         wt
## [1,] -0.7761684 -0.852162 -0.8676594
cor(IV)
##            hp       cyl        wt
## hp  1.0000000 0.8324475 0.6587479
## cyl 0.8324475 1.0000000 0.7824958
## wt  0.6587479 0.7824958 1.0000000
stargazer(cor(IV), type = "text", title = "Correlations of IV (Independent variables)")
## 
## Correlations of IV (Independent variables)
## =====================
##      hp    cyl   wt  
## ---------------------
## hp    1   0.832 0.659
## cyl 0.832   1   0.782
## wt  0.659 0.782   1  
## ---------------------

OLS Estimation

The “lm” function can be used to specify the OLS model. We specify the dependent variable “mpg” followed by the independent variables. Here, we are using a matrix of independent variable specified as “~ IV”. We can also specifiy each independent variable separately “~ hp + cyl + wt”.

We are again using Stargazer to customize the output. In the first output, we are using the table format accepted in the American Economic Review using the command “style =”aer”. There are many other format available as well.

Without Stargazer, the output is reported as a table shown by the command “summary(ols1)”.

Moreover, it’s easy to estimate confidence intervals and Anova with simple commands: confint and anova.

stargazer(ols1 <- lm(mpg ~ IV), type = "text", style = "aer", title = "OLS results AER")
## 
## OLS results AER
## ==========================================================
##                                      mpg                  
## ----------------------------------------------------------
## IVhp                                -0.018                
##                                    (0.012)                
##                                                           
## IVcyl                              -0.942*                
##                                    (0.551)                
##                                                           
## IVwt                              -3.167***               
##                                    (0.741)                
##                                                           
## Constant                          38.752***               
##                                    (1.787)                
##                                                           
## Observations                          32                  
## R2                                  0.843                 
## Adjusted R2                         0.826                 
## Residual Std. Error            2.512 (df = 28)            
## F Statistic                 50.171*** (df = 3; 28)        
## ----------------------------------------------------------
## Notes:              ***Significant at the 1 percent level.
##                     **Significant at the 5 percent level. 
##                     *Significant at the 10 percent level.
stargazer(ols1 <- lm(mpg ~ IV), type = "text", style = "all", title = "OLS results")
## 
## OLS results
## ======================================================
##                            Dependent variable:        
##                     ----------------------------------
##                                    mpg                
## ------------------------------------------------------
## IVhp                              -0.018              
##                                  (0.012)              
##                                 t = -1.519            
##                                 p = 0.141             
## IVcyl                            -0.942*              
##                                  (0.551)              
##                                 t = -1.709            
##                                 p = 0.099             
## IVwt                            -3.167***             
##                                  (0.741)              
##                                 t = -4.276            
##                                 p = 0.0002            
## Constant                        38.752***             
##                                  (1.787)              
##                                 t = 21.687            
##                                 p = 0.000             
## ------------------------------------------------------
## Observations                        32                
## R2                                0.843               
## Adjusted R2                       0.826               
## Residual Std. Error          2.512 (df = 28)          
## F Statistic         50.171*** (df = 3; 28) (p = 0.000)
## ======================================================
## Note:                      *p<0.1; **p<0.05; ***p<0.01
summary(ols1)
## 
## Call:
## lm(formula = mpg ~ IV)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9290 -1.5598 -0.5311  1.1850  5.8986 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## IVhp        -0.01804    0.01188  -1.519 0.140015    
## IVcyl       -0.94162    0.55092  -1.709 0.098480 .  
## IVwt        -3.16697    0.74058  -4.276 0.000199 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263 
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11
conf <- confint(ols1, level = 0.95)
stargazer(conf, type = "text", title = "Confidence Intervals")
## 
## Confidence Intervals
## =========================
##             2.5 %  97.5 %
## -------------------------
## (Intercept) 35.092 42.412
## IVhp        -0.042 0.006 
## IVcyl       -2.070 0.187 
## IVwt        -4.684 -1.650
## -------------------------
anova <- anova(ols1)
stargazer(anova, type = "text", title = "Anova")
## 
## Anova
## ============================================
## Statistic N  Mean   St. Dev.   Min     Max  
## --------------------------------------------
## Df        2 15.500   17.678     3      28   
## Sum Sq    2 563.024 546.456  176.621 949.427
## Mean Sq   2 161.392 219.322   6.308  316.476
## F value   1 50.171           50.171  50.171 
## Pr(> F)   1  0.000              0       0   
## --------------------------------------------

Residuals and fitted values

To estimate the fitted values and the residuals, we can use the “fitted” and “resid” commands.

Here, we will also plot a histogram and boxplot to see how the error term or residuals are distributed. From the graphs, we can see that the residuals look approximately normal but we have a few outliers to deal with.

ycap <- fitted(ols1)
summary(ycap)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.14   15.97   20.12   20.09   24.52   28.93
error <- resid(ols1)
summary(error)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.9290 -1.5598 -0.5311  0.0000  1.1850  5.8986
hist(error, breaks = 5)

boxplot(error)

Leave a Reply