library(tidyverse)
This notebook is based on Chapter 21 of R for Data Science.
Einstein once said:
“Insanity is doing the same thing over and over and expecting different results.”
That is true for life. And it is true for programming. Looping or iteration is about doing the same thing over and over again and returning exactly the same results.
Let’s revisit the big picture. In data science we have data and functions that do stuff to those data. For instance, calculate the mean of a column. That’s pretty much it.
Iteration most often enters the picture when you need to use the same function many times (e.g., calculate the mean of many different columns).
This is what summarise() does for you under the hood:
# load the diamonds data
data("diamonds")
# average price, table, depth
diamonds %>%
  summarise(mean(price), mean(table), mean(depth))
In this example it applies the function mean() to three different columns.
R is a functional programming language. In a nutshell that means loops are “wrapped” inside functions which are then applied to entire vectors (rather than individual elements).
Something really confusing about R (to me, at least) is the community’s attitude towards loops. Beginners are told to never use loops because they are slow and instead use “vectorised” solutions. But as we will shortly see, “vectorised” solutions are just loops in a faster language (C). And the slowness of a loop depends not on the loop itself but what you do inside the loop.
Moreover, loops are really useful for one reason: they are easy to understand and they are an important construct to master when writing pseudocode, the most important programming language of all.
The tidyverse leans into the idea of functional programming with its family of map operators from the package purrr (which ships with library(tidyverse)).
Pseudocode is executable code for humans rather than computers. Basically it’s general instructions on how to write code. You write down the basic idea. Implementing the idea is then just a matter of converting the pseudocode into a language the computer can read (R, Python, Visual Basic, FORTRAN, whatever).
Thinking in these terms is very important when writing programs that iterate. Why? Because you have to think very carefully about what the loop takes in and what it should return. And, more importantly, this helps you think about whether you need a loop at all to solve your problem. If you don't need a loop, don't use a loop.
Later in this notebook we will write a program that uses iteration to simulate the Central Limit Theorem. The pseudocode is:
for each i in N simulations:
1. draw a random sample of the data
2. estimate a linear model
3. extract the coefficient of interest
4. return the coefficient
endfor
All we have to do is re-write the instructions in R. Not that this is trivial. But the pseudocode gives us a North Star to point to in case we get lost.
Consider this vector:
# vector of integers 1 to 10
x = 1:10

# view it
x
[1] 1 2 3 4 5 6 7 8 9 10
What if we wanted to write a loop that takes each element, multiplies it by 2, and adds it to a new vector called x2?
The pseudocode would be something like this:
make an empty vector called "x2" the same length as x
for each element in x:
multiply it by 2 and attach it to the corresponding place in x2
endfor
First we have to define the output of the loop, x2. It's a vector, we want it to hold numbers, or "doubles", and it needs to have the same length as x:
x2 = vector(mode = "double", length = length(x))

x2
[1] 0 0 0 0 0 0 0 0 0 0
Why define the output before you have the output? Because it’s faster. Why? Because by pre-allocating memory you use memory more efficiently. Failure to pre-allocate when iterating is one of the main reasons why loops are “slow”. See the appendix for more details.
Next we need to set up the loop skeleton. This mainly consists of the iterator for each element in x. What this really says is "if x has \(n\) elements, run the loop one time for each element and in order (i.e., 1, then 2, then 3, etc.)".
There are multiple ways you can do this in R:
for(i in 1:length(x)){
print(i)
}
or use seq_along():
for(i in seq_along(x)){
print(i)
}
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
(The book explains why you might prefer seq_along().)
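The short version of the seq_along() argument, as a sketch (the empty vector y here is a made-up example, not from the notebook): 1:length() misbehaves on zero-length vectors, seq_along() does not.

```r
# why seq_along() is safer than 1:length() (base R only)
y = double(0)    # a zero-length vector

1:length(y)      # 1 0 -- counts DOWN from 1 to 0, so a loop body would run twice!
seq_along(y)     # integer(0) -- the loop body runs zero times, as it should
```

This matters whenever a loop's input might be empty, e.g. after a filter that removes every row.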
Finally we define the body or "guts" of the loop: multiply the \(i^{th}\) element of x by 2 and attach it to the \(i^{th}\) element of x2. Remember vector indexing in R works like this: vector[index number]. So the iterator i will serve as the vector index for x and x2:
for(i in seq_along(x)){
  x2[i] = x[i]*2
}
Let’s view the results:
x2
[1] 2 4 6 8 10 12 14 16 18 20
It worked!
Write a program that loops over x and returns a vector x_squared with the square of each element of x:
# define the output
x_squared = vector(mode = "double", length = length(x))

# run the loop
for(i in seq_along(x)){
  x_squared[i] = x[i]^2
}

# view the output
x_squared
[1] 1 4 9 16 25 36 49 64 81 100
It makes no sense to write a loop that multiplies x by 2 when we could just do:

x*2
[1] 2 4 6 8 10 12 14 16 18 20
“Vectorisation” refers to a function that takes as its argument a vector. In our case it means multiplying each element by 2 “at the same time”. But this is of course impossible. You would need a quantum computer!
What is really going on is that * is a function, and it is "vectorised" because it calls a function that runs the same loop we made, only in C, a much faster language. Here is the source code.
The important lesson here is that all calculations applied to each element of a vector use loops. The only question is whether the loop is run in a faster or slower language. And this is why looping is “discouraged” in R when there are “vectorised” solutions available. Long story short, don’t re-invent the wheel.
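As a quick sanity check (a sketch using the x from above), the hand-written loop and the vectorised expression agree exactly:

```r
# the loop result and the vectorised result are identical
x = 1:10
x2 = vector(mode = "double", length = length(x))
for(i in seq_along(x)){
  x2[i] = x[i]*2
}

identical(x2, x*2)   # TRUE
```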
What if you want to make your loop more portable? Use it on any vector? Wrap it inside a function! This is the heart of functional programming.
Write a function called square that takes a numerical vector and returns a vector with the square of each element of the input vector. Test it out on this vector:
# vector of integers 1 to 20
test_vector = 1:20
square = function(x){
  # first define the output
  output = vector(mode = "double", length = length(x))
  for(i in seq_along(x)){
    output[i] = x[i]^2
  }
  return(output)
}
square(x = test_vector)
[1] 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 361 400
Just for the sake of practice, let's reinvent mean(). Write a function called my_mean that calculates the mean of a numerical vector. Recall the formula for the mean:
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i \]
You can use length() to calculate \(n\), but try not to use sum() in your function! Bonus if you also avoid using length().
my_mean = function(x){
  # 1. calculate the sum of the vector
  ## (if we used sum() it would just be sum(x))
  ## define an empty object called "sum"
  sum = 0
  ## now loop over the input and add each element to sum
  for(i in seq_along(x)){
    sum = sum + x[i]
  }
  # 2. get the length of the vector
  ## start with an empty value
  n = 0
  ## then for each element of the vector...
  for(i in seq_along(x)){
    n = n + 1 # add one!
  }
  # 3. the mean is sum/n
  mean = sum/n
  # 4. return it!
  return(mean)
}
Confirm that my_mean() returns the same results as mean() by calculating average price in diamonds inside a call to summarise():
diamonds %>%
  summarise(my_mean(price), mean(price))
(Again, don't use my_mean() in real life because mean() is written in C and therefore much faster.)
The tidyverse way of looping is "mapping" through the map functions in purrr. The idea is to "map" a function to many objects "all at once" (i.e., the loops are written in C).
Consider this data set of standard normal random variables:
df <- tibble(
  a = rnorm(n = 5, mean = 0, sd = 1),
  b = rnorm(n = 5, mean = 0, sd = 1),
  c = rnorm(n = 5, mean = 0, sd = 1),
  d = rnorm(n = 5, mean = 0, sd = 1)
)

df
Say you wanted to calculate the mean of each column. You could use summarise(), or you could write a loop that goes over each column. But a more efficient approach is to apply, or map, the function mean():
df %>%
  map(mean)
$a
[1] -0.1717594
$b
[1] -0.7320268
$c
[1] 0.7291849
$d
[1] 0.2108693
The map functions all do the same thing but return different data types. For instance, map_dbl returns a numeric vector, while map returns a list.
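A minimal illustration of that difference (a sketch, assuming purrr is loaded, e.g. via the tidyverse):

```r
library(purrr)

map(1:3, sqrt)       # a list of three numbers
map_dbl(1:3, sqrt)   # the same three numbers as a plain numeric vector
```

Use the typed variants (map_dbl, map_chr, map_lgl, etc.) when you know what the output should be; they also error loudly if a result doesn't fit, which catches bugs early.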
Lists are major data objects in R – the equivalent of “dictionaries” in Python (a “list” in Python is a vector in R). Unlike vectors a list can hold many different data types. One element of a list can be a tibble, another element a vector, and so on.
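For example (a sketch with made-up contents), one list can mix a string, a numeric vector, and another list, and named elements are pulled out with $:

```r
# a heterogeneous list (base R only)
my_list = list(
  name    = "diamonds",
  numbers = c(1.5, 2.5),
  nested  = list(a = TRUE)
)

my_list$name       # "diamonds"
my_list$nested$a   # TRUE
```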
When you work with data frames / tibbles you don't spend much time with lists. But it's useful to know how to work with lists because pretty much all objects in R are lists under the hood.
How do you know? You're dealing with a list-type object if you can index its attributes with $. Tibbles are lists:
df$a
[1] -0.5797285 0.1317783 1.5385788 -0.1586313 -1.7907944
and so are lm objects:
# estimate a linear model
model = lm(formula = log(price) ~ carat + table + depth, data = diamonds)

# view the structure of the output with str()
str(model)
List of 12
$ coefficients : Named num [1:4] 8.0451 1.97893 -0.00854 -0.02181
..- attr(*, "names")= chr [1:4] "(Intercept)" "carat" "table" "depth"
$ residuals : Named num [1:53940] -0.902 -0.849 -0.914 -0.952 -0.969 ...
..- attr(*, "names")= chr [1:53940] "1" "2" "3" "4" ...
$ effects : Named num [1:53940] -1808.476 216.847 2.137 6.906 -0.954 ...
..- attr(*, "names")= chr [1:53940] "(Intercept)" "carat" "table" "depth" ...
$ rank : int 4
$ fitted.values: Named num [1:53940] 6.69 6.64 6.7 6.76 6.78 ...
..- attr(*, "names")= chr [1:53940] "1" "2" "3" "4" ...
$ assign : int [1:4] 0 1 2 3
$ qr :List of 5
..$ qr : num [1:53940, 1:4] -2.32e+02 4.31e-03 4.31e-03 4.31e-03 4.31e-03 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:53940] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:4] "(Intercept)" "carat" "table" "depth"
.. ..- attr(*, "assign")= int [1:4] 0 1 2 3
..$ qraux: num [1:4] 1 1.01 1.02 1
..$ pivot: int [1:4] 1 2 3 4
..$ tol : num 1e-07
..$ rank : int 4
..- attr(*, "class")= chr "qr"
$ df.residual : int 53936
$ xlevels : Named list()
$ call : language lm(formula = log(price) ~ carat + table + depth, data = diamonds)
$ terms :Classes 'terms', 'formula' language log(price) ~ carat + table + depth
.. ..- attr(*, "variables")= language list(log(price), carat, table, depth)
.. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:4] "log(price)" "carat" "table" "depth"
.. .. .. ..$ : chr [1:3] "carat" "table" "depth"
.. ..- attr(*, "term.labels")= chr [1:3] "carat" "table" "depth"
.. ..- attr(*, "order")= int [1:3] 1 1 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(log(price), carat, table, depth)
.. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:4] "log(price)" "carat" "table" "depth"
$ model :'data.frame': 53940 obs. of 4 variables:
..$ log(price): num [1:53940] 5.79 5.79 5.79 5.81 5.81 ...
..$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
..$ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
..$ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language log(price) ~ carat + table + depth
.. .. ..- attr(*, "variables")= language list(log(price), carat, table, depth)
.. .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:4] "log(price)" "carat" "table" "depth"
.. .. .. .. ..$ : chr [1:3] "carat" "table" "depth"
.. .. ..- attr(*, "term.labels")= chr [1:3] "carat" "table" "depth"
.. .. ..- attr(*, "order")= int [1:3] 1 1 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(log(price), carat, table, depth)
.. .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:4] "log(price)" "carat" "table" "depth"
- attr(*, "class")= chr "lm"
And the summary() of a linear model object is a list:
summary_model = summary(model)

str(summary_model)
List of 11
$ call : language lm(formula = log(price) ~ carat + table + depth, data = diamonds)
$ terms :Classes 'terms', 'formula' language log(price) ~ carat + table + depth
.. ..- attr(*, "variables")= language list(log(price), carat, table, depth)
.. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:4] "log(price)" "carat" "table" "depth"
.. .. .. ..$ : chr [1:3] "carat" "table" "depth"
.. ..- attr(*, "term.labels")= chr [1:3] "carat" "table" "depth"
.. ..- attr(*, "order")= int [1:3] 1 1 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(log(price), carat, table, depth)
.. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:4] "log(price)" "carat" "table" "depth"
$ residuals : Named num [1:53940] -0.902 -0.849 -0.914 -0.952 -0.969 ...
..- attr(*, "names")= chr [1:53940] "1" "2" "3" "4" ...
$ coefficients : num [1:4, 1:4] 8.0451 1.97893 -0.00854 -0.02181 0.10143 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "(Intercept)" "carat" "table" "depth"
.. ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)"
$ aliased : Named logi [1:4] FALSE FALSE FALSE FALSE
..- attr(*, "names")= chr [1:4] "(Intercept)" "carat" "table" "depth"
$ sigma : num 0.396
$ df : int [1:3] 4 53936 4
$ r.squared : num 0.848
$ adj.r.squared: num 0.848
$ fstatistic : Named num [1:3] 100085 3 53936
..- attr(*, "names")= chr [1:3] "value" "numdf" "dendf"
$ cov.unscaled : num [1:4, 1:4] 0.065616 0.000307 -0.000363 -0.000728 0.000307 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:4] "(Intercept)" "carat" "table" "depth"
.. ..$ : chr [1:4] "(Intercept)" "carat" "table" "depth"
- attr(*, "class")= chr "summary.lm"
So if you want to access only the R-squared of the summary object:
summary_model$r.squared
[1] 0.8477212
or the residual standard error sigma (the standard deviation of the residuals):

summary_model$sigma
[1] 0.3959568
Why is this important? Because you might want to run many models and then look at just one piece of each model.
We know how to run a regression like so:
diamonds %>%
  lm(formula = log(price) ~ carat, data = .) %>%
  summary()
Call:
lm(formula = log(price) ~ carat, data = .)
Residuals:
Min 1Q Median 3Q Max
-6.2844 -0.2449 0.0335 0.2578 1.5642
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.215021 0.003348 1856 <2e-16 ***
carat 1.969757 0.003608 546 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3972 on 53938 degrees of freedom
Multiple R-squared: 0.8468, Adjusted R-squared: 0.8468
F-statistic: 2.981e+05 on 1 and 53938 DF, p-value: < 2.2e-16
But what if we want to split the data by some grouping variable (e.g., cut) and then estimate the model for each tibble? The function split() will split our data into a list:
diamonds %>%
  split(.$cut)
$Fair
$Good
$`Very Good`
$Premium
$Ideal
and the output is a list:
diamonds %>%
  split(.$cut) %>%
  class()
[1] "list"
so we can map the linear model lm(formula = log(price) ~ carat, data = .) onto each element of the list:
diamonds %>%
  split(.$cut) %>%
  map(function(x) lm(formula = log(price) ~ carat, data = x)) # notice the x! It's the argument to this "lambda" function (an anonymous function defined on the fly)
$Fair
Call:
lm(formula = log(price) ~ carat, data = x)
Coefficients:
(Intercept) carat
6.785 1.251
$Good
Call:
lm(formula = log(price) ~ carat, data = x)
Coefficients:
(Intercept) carat
6.164 1.977
$`Very Good`
Call:
lm(formula = log(price) ~ carat, data = x)
Coefficients:
(Intercept) carat
6.115 2.088
$Premium
Call:
lm(formula = log(price) ~ carat, data = x)
Coefficients:
(Intercept) carat
6.298 1.853
$Ideal
Call:
lm(formula = log(price) ~ carat, data = x)
Coefficients:
(Intercept) carat
6.147 2.124
and then we can map the summary() function to each estimated model:
diamonds %>%
  split(.$cut) %>%
  map(function(x) lm(formula = log(price) ~ carat, data = x)) %>%
  map(summary)
$Fair
Call:
lm(formula = log(price) ~ carat, data = x)
Residuals:
Min 1Q Median 3Q Max
-3.2527 -0.2229 0.0548 0.2632 1.2471
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.78482 0.02321 292.26 <2e-16 ***
carat 1.25091 0.01990 62.86 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4122 on 1608 degrees of freedom
Multiple R-squared: 0.7108, Adjusted R-squared: 0.7106
F-statistic: 3951 on 1 and 1608 DF, p-value: < 2.2e-16
$Good
Call:
lm(formula = log(price) ~ carat, data = x)
Residuals:
Min 1Q Median 3Q Max
-2.80292 -0.26122 0.06681 0.26879 1.53479
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.16355 0.01201 513.4 <2e-16 ***
carat 1.97750 0.01247 158.6 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3965 on 4904 degrees of freedom
Multiple R-squared: 0.8369, Adjusted R-squared: 0.8368
F-statistic: 2.515e+04 on 1 and 4904 DF, p-value: < 2.2e-16
$`Very Good`
Call:
lm(formula = log(price) ~ carat, data = x)
Residuals:
Min 1Q Median 3Q Max
-4.7868 -0.2603 0.0432 0.2629 1.5414
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.115141 0.007261 842.2 <2e-16 ***
carat 2.087750 0.007824 266.9 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3951 on 12080 degrees of freedom
Multiple R-squared: 0.855, Adjusted R-squared: 0.855
F-statistic: 7.121e+04 on 1 and 12080 DF, p-value: < 2.2e-16
$Premium
Call:
lm(formula = log(price) ~ carat, data = x)
Residuals:
Min 1Q Median 3Q Max
-4.0973 -0.2471 0.0324 0.2515 1.5328
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.298191 0.006587 956.1 <2e-16 ***
carat 1.852788 0.006395 289.7 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3869 on 13789 degrees of freedom
Multiple R-squared: 0.8589, Adjusted R-squared: 0.8589
F-statistic: 8.394e+04 on 1 and 13789 DF, p-value: < 2.2e-16
$Ideal
Call:
lm(formula = log(price) ~ carat, data = x)
Residuals:
Min 1Q Median 3Q Max
-4.1397 -0.2229 0.0210 0.2383 1.4408
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.146784 0.004852 1266.8 <2e-16 ***
carat 2.123797 0.005878 361.3 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3735 on 21549 degrees of freedom
Multiple R-squared: 0.8583, Adjusted R-squared: 0.8583
F-statistic: 1.305e+05 on 1 and 21549 DF, p-value: < 2.2e-16
and even extract a single element of each summary, e.g. the R-squared value:
diamonds %>%
  split(.$cut) %>%
  map(function(x) lm(formula = log(price) ~ carat, data = x)) %>%
  map(summary) %>%
  map(~.$r.squared)
$Fair
[1] 0.7107574
$Good
[1] 0.8368502
$`Very Good`
[1] 0.8549665
$Premium
[1] 0.8589056
$Ideal
[1] 0.8583142
Write a program that splits diamonds by color and estimates log price as a function of carats and depth on each split. Which color's model had the lowest residual standard error (sigma)?
diamonds %>%
  split(.$color) %>%
  map(function(x) lm(formula = log(price) ~ carat + depth, data = x)) %>%
  map(summary) %>%
  map_dbl(~.$sigma)
D E F G H I J
0.3508852 0.3274551 0.3450613 0.3617249 0.3767596 0.3697129 0.3682481
Base R has a set of apply functions that work the same (and are just as fast) as the map family of functions. Read more here.
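For a rough flavour (a sketch, base R only): lapply() plays the role of map(), and sapply() roughly the role of map_dbl(), simplifying the result to a vector when it can.

```r
df_base = data.frame(a = 1:5, b = 6:10)

lapply(df_base, mean)   # a list, like map()
sapply(df_base, mean)   # a named numeric vector: a is 3, b is 8
```

One caveat: sapply() guesses the output type, whereas map_dbl() guarantees it, which is why purrr's typed variants are often preferred in scripts.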
Consider a linear regression of log prices as a function of table:
lm(formula = log(price) ~ table, data = diamonds)
Call:
lm(formula = log(price) ~ table, data = diamonds)
Coefficients:
(Intercept) table
3.65905 0.07184
The coefficient on table in the full data set is 0.07. But regression coefficients are random variables. To see this, draw a random sample and re-run the regression (play this chunk a few times):
diamonds %>%
  slice_sample(n = 30) %>%
  lm(formula = log(price) ~ table, data = .)
Call:
lm(formula = log(price) ~ table, data = .)
Coefficients:
(Intercept) table
13.6470 -0.1022
The coefficient bounces around with each new sample. Why? Because regression coefficients (like all sample statistics estimating population parameters) are random variables!
The Central Limit Theorem says that the sampling distribution of a regression coefficient will be normal and centered at its expected value (i.e., the population parameter). So if we were to run this experiment many times with diamonds (draw a sample, run a regression) we would see a distribution of regression coefficients centered at around 0.07.
Let's use iteration to recover the sampling distribution of the regression coefficient on table and see the Central Limit Theorem in action.
The goal with this example is to get you thinking about how you can take one action (draw a sample, run a regression), turn it into a function, and then run that function many times.
Let’s start with one random sample:
diamonds %>%
  slice_sample(n = 30) %>%
  lm(formula = log(price) ~ table, data = .)
Call:
lm(formula = log(price) ~ table, data = .)
Coefficients:
(Intercept) table
7.351740 0.008632
We're interested in the coefficient on table. We can extract coefficients from an lm object with the function coef():
diamonds %>%
  slice_sample(n = 30) %>%
  lm(formula = log(price) ~ table, data = .) %>%
  coef()
(Intercept) table
-5.1135756 0.2300307
coef() returns a numeric vector:
diamonds %>%
  slice_sample(n = 30) %>%
  lm(formula = log(price) ~ table, data = .) %>%
  coef() %>%
  class()
[1] "numeric"
whose second element is the coefficient on table. That's the piece of information we want to keep.
Use the examples above to write code that takes diamonds, draws a random sample of 30 observations, runs the regression (log price as a function of table), and assigns the coefficients to a vector called betas.
betas = diamonds %>%
  slice_sample(n = 30) %>%
  lm(formula = log(price) ~ table, data = .) %>%
  coef()
Assign the second element of betas to the object beta_1:

beta_1 = betas[2]
View beta_1:

beta_1
table
-0.01759162
If we want to avoid creating an intermediary object betas and only keep beta_1, we can subset with .[2] after coef():
diamonds %>%
  slice_sample(n = 30) %>%
  lm(formula = log(price) ~ table, data = .) %>%
  coef() %>%
  .[2]
table
0.1908739
Turn the code you just wrote into a function called lm_sampler that takes an argument sample_size (defaulting to 30) and returns only the coefficient on table.
lm_sampler = function(sample_size = 30){
  beta_1 = diamonds %>%
    slice_sample(n = sample_size) %>%
    lm(formula = log(price) ~ table, data = .) %>%
    coef() %>%
    .[2]
  return(beta_1)
}
Play it a few times:
lm_sampler(sample_size = 30)
table
-0.0206476
Now we just need to scale this to run many samples and collect many coefficients. This will be the main body of our loop.
Suppose we want to run 100 simulations:

n_sims = 100
Let's set up the output of our loop, an empty numeric vector called betas of length n_sims:

# first set up an empty vector
betas = vector("double", length = n_sims)
And now let’s think about the loop.
At a high level our loop will look like this:
for(i in seq_along(betas)){
# sample, regress, return the coefficient
}
Since lm_sampler returns a single number (the regression coefficient), on the i-th run of lm_sampler we assign the coefficient to the i-th element of betas.
Let’s run it:
for(i in seq_along(betas)){
  betas[i] = lm_sampler(sample_size = 30)
}
View the output:
betas
[1] 0.0373648384 0.1007167488 0.0374748548 0.0816243558 0.0981774766
[6] 0.1009007360 0.0900070346 0.0803810247 0.1587780382 0.1567878424
[11] -0.0053931381 0.1506902626 0.0380578391 0.0950906853 0.3782696511
[16] -0.0053336283 0.0652926029 0.0083030814 0.0143193305 -0.0294493646
[21] 0.0390856536 0.2500974314 0.0886805951 0.2131391297 -0.1782070591
[26] 0.1960762921 0.0756415413 0.0523193609 0.1161074096 0.1418404986
[31] 0.0880674226 -0.0315184461 0.0434383968 -0.0974639621 0.0814845778
[36] 0.1820401017 0.0103756568 0.1303736830 0.0142663357 0.0112442308
[41] 0.1801361092 0.0078340515 0.0061664808 -0.1203981318 0.1306002967
[46] -0.1365979645 0.1775004131 0.0999527896 0.0372036437 0.2037224385
[51] 0.1501231849 0.0886167517 0.1479924396 0.1394686654 0.0349873651
[56] 0.1427380952 0.0183281236 0.0684611352 -0.0067121591 0.0574731123
[61] 0.1715373982 -0.0719094601 0.1792294578 0.0358817575 0.0787585497
[66] 0.0776506816 0.1227480494 0.0809027890 -0.1029299303 0.0733569716
[71] 0.0695671772 -0.0739436751 -0.0614158975 0.0362911940 0.0960662777
[76] 0.0505582896 0.1751830458 0.0753878923 0.0012809378 0.1080359369
[81] 0.1276328861 0.1160432669 -0.1387371489 0.0891342990 0.0565193663
[86] 0.1864235394 0.1547382152 0.1663916156 -0.0151275471 0.0722751502
[91] 0.0200290214 -0.0233323415 0.2440803068 0.1521616741 0.1188900243
[96] 0.1338587673 -0.0100580282 -0.0539397797 -0.0001000065 0.0668474832
Great! Now let's plot the distribution. We need to convert betas to a tibble so we can use ggplot:
betas_tbl = tibble(betas)
Now plot:
ggplot(data = betas_tbl, aes(x = betas)) +
geom_histogram()
The Central Limit Theorem says that if we keep running simulations (i.e. infinite simulations), eventually this distribution will become perfectly normal and centered at the “true” value of \(\beta_1\).
Use the code above to write lm_sampler2 with functional programming. It should take two arguments, sample_size = 30 and n_sims = 100, run the loop, and return a tibble.
lm_sampler2 = function(sample_size = 30, n_sims = 100){
  betas = vector("double", length = n_sims)
  for(i in seq_along(betas)){
    betas[i] = diamonds %>%
      slice_sample(n = sample_size) %>%
      lm(formula = log(price) ~ table, data = .) %>%
      coef() %>%
      .[2]
  }
  betas_tbl = tibble(betas)
  return(betas_tbl)
}
Run lm_sampler2 for 10 simulations and plot the sampling distribution. Since lm_sampler2 returns a tibble you can pipe the output straight to ggplot:
lm_sampler2(sample_size = 30, n_sims=10) %>%
ggplot(data = ., aes(x = betas)) +
geom_histogram()
Run lm_sampler2 for 1000 simulations and plot the sampling distribution:
lm_sampler2(sample_size = 30, n_sims=1000) %>%
ggplot(data = ., aes(x = betas)) +
geom_histogram()
That’s the Central Limit Theorem!
A nice extension is to re-write lm_sampler2 so it also extracts the p-values of each regression. Then you can plot the hypothesis tests of the simulations and see how the false-positive error rate varies with a) the sample size in each simulation and b) the number of simulations. We leave this to the reader.
Consider a program that loops over a vector with 10^3 elements and squares each element.

big_vector = 1:10^3
No pre-allocation:
output_naive = c()

length(output_naive)
[1] 0
With pre-allocation:
output_preallocated = vector(mode = "double", length = length(big_vector))

length(output_preallocated)
[1] 1000
Run the speed tests. This requires the package microbenchmark.
Heads up! This might take a while to run on your machine.
speed_tests = microbenchmark::microbenchmark(
  ## the naive loop
  "naive" = for(i in seq_along(big_vector)){
    output_naive = append(output_naive, big_vector[i]^2)
  },
  ## the preallocated loop
  "preallocated" = for(i in seq_along(big_vector)){
    output_preallocated[i] = big_vector[i]^2
  },
  times = 100 # number of times to run the test
)

# plot the output
ggplot2::autoplot(speed_tests)
replicate()
R has a built-in function called replicate() that will run a function call many times. Learn more here.
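As a sketch of how replicate() could collapse the whole simulation loop above into one line (fake_sampler here is a made-up stand-in for lm_sampler(), since the real one needs the diamonds data):

```r
# stand-in sampler: the mean of a random normal sample
fake_sampler = function(sample_size = 30){
  mean(rnorm(sample_size))
}

# run it 100 times and collect the results in a vector -- no explicit loop,
# no pre-allocation needed
betas = replicate(100, fake_sampler(sample_size = 30))
length(betas)   # 100
```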
map with regular expressions
Here is an example of mapping in a question I answered on Piazza.
The OP has a data set where each row is a firm and a column records violations. An example cell entry looks like this:
"23. PROPER DATE MARKING AND DISPOSITION - Comments: MUST PROVIDE DATE MARKINGS TO TCS/RTE FOODS, PREPARED ONSITE AND HELD IN COOLERS OVER 24 HRS. INSTRUCTED TO PROVIDE CONSUME BY DATES AND MAINTAIN. PRIORITY FOUNDATION 7-38-005 NO CITATION ISSUED. | 37. FOOD PROPERLY LABELED; ORIGINAL CONTAINER - Comments: MUST LABEL FOOD STORAGE CONTAINERS WITH COMMON FOOD NAMES; FLOUR, SUGAR, ETC., WHEN FOOD HAS BEEN REMOVED FROM ORIGINAL PACKAGING. INSTRUCTED TO MAINTAIN. | 41. WIPING CLOTHS: PROPERLY USED & STORED - Comments: MUST STORE WIPING CLOTHS IN PREP AREAS IN SANITIZER BUCKET BETWEEN USES TO PREVENT CROSS CONTAMINATION. INSTRUCTED TO MAINTAIN. | 47. FOOD & NON-FOOD CONTACT SURFACES CLEANABLE, PROPERLY DESIGNED, CONSTRUCTED & USED - Comments: FOUND TORN REFRIGERATION GASKETS ON MULTIPLE REFRIGERATION UNITS THROUGHOUT FACILITY. INSTRUCTED FACILITY TO REPLACE ALL TORN GASKETS AND MAINTAIN" x1 =
"36. THERMOMETERS PROVIDED & ACCURATE - Comments: 4-204.112(B) NOTED NO THERMOMETERS INSIDE FOUR DISPLAY COOLER UNITS CONSPICUOUSLY POSTED TO MONITOR THE AMBIENT AIR TEMPERATURE OF EQUIPMENT. INSTRUCTED TO EQUIP ALL REFRIGERATION UNITS WITH ACCURATE AND WORKING THERMOMETERS. | 53. TOILET FACILITIES: PROPERLY CONSTRUCTED, SUPPLIED, & CLEANED - Comments: 5-501.17 OBSERVED NO COVERED RECEPTACLE (WASTE CAN WITH LID) IN UNISEX EMPLOYEE WASHROOM. INSTRUCTED TO PROVIDE COVERED WASTE RECEPTACLE. " x2 =
Suppose you wanted to count the number of unique violations for each firm and make that into a new column. Looks like each violation is uniquely identified with a number (e.g., "23. PROPER DATE MARKING AND DISPOSITION"), and specifically this pattern: "##. " (i.e., number, then period, then space).
This is a problem of regular expressions. You can use stringr to do regular expressions. It ships with the tidyverse.
You can solve this problem with str_match_all(), a function that looks for and returns all instances of a text pattern. The hardest part is correctly specifying the pattern (regex is a huge pain). In our case, we need to use "[0-9]+\\.\\s". This says:
one or more digits ([0-9]+)...
followed by a literal period (\\.)...
followed by a whitespace character (\\s)
x1 %>%
  str_match_all("[0-9]+\\.\\s")
[[1]]
[,1]
[1,] "23. "
[2,] "37. "
[3,] "41. "
[4,] "47. "
and
x2 %>%
  str_match_all("[0-9]+\\.\\s")
[[1]]
[,1]
[1,] "36. "
[2,] "53. "
str_match_all() returns a list. You can convert it to a vector with unlist():
x1 %>%
  str_match_all("[0-9]+\\.\\s") %>%
  unlist()
[1] "23. " "37. " "41. " "47. "
and now you can count the unique violations just by taking the length of the resulting vector:
x1 %>%
  str_match_all("[0-9]+\\.\\s") %>%
  unlist() %>%
  length()
[1] 4
Now, these entries are in columns of a tibble:
df = tibble("violations" = c(x1, x2))

df
So the challenge is to map this procedure onto each element of the column.
First code-up the violations-counter in a function:
get_violations = function(x){
  n_violations = vector("double", length = length(x))
  for(i in seq_along(x)){
    n_violations[i] = x[i] %>%
      str_match_all("[0-9]+\\.\\s") %>%
      unlist() %>%
      length()
  }
  return(n_violations)
}
then map it to the data and add a new column:
df$n_violations = df %>%
  select(violations) %>%
  map(get_violations) %>%
  unlist()
giving you:
df %>%
  select(n_violations)