Directions

There are 38 questions. Each question is tagged with [QUESTION]. Each question is worth 3 points. There are a total of 114 points.

Enter your answers in the chunks provided. Do not create additional chunks. Do not code or type anywhere outside the chunks.

There are two types of chunks. One for code, the other for writing.

Code chunks

Code chunks look like this:

# your code here

Delete “# your code here” and replace it with your code.

Writing chunks

The writing chunks look like this:

your answer here

You can write normal Markdown inside the writing chunks.

Delete “your answer here” and replace it with your written answer.

Load the following packages

Load the following packages:

library(tidyverse)
library(broom)

1 Warm up

[QUESTION] Calculate:

\[ \frac{10^2 + 4^3}{10^3 + 4^2} \]

(10^2 + 4^3)/(10^3 + 4^2)
## [1] 0.1614173

[QUESTION] Re-write the following code using pipes:

mean(exp(sqrt(1:10)))
## [1] 11.55345
1:10 %>% 
  sqrt() %>% 
  exp() %>% 
  mean() 
## [1] 11.55345

[QUESTION] Draw 1000 random normal numbers with a mean of 3 and a standard deviation of 1. Then calculate the average of the 30th to 300th observations.

set.seed(2020)
rnorm(n = 1000, mean = 3, sd = 1) %>% 
  .[30:300] %>% 
  mean()
## [1] 2.929095

[QUESTION] Using diamonds, filter out (i.e., remove) diamonds whose cut is “Ideal”, then plot the distribution of log prices by cut. Put each distribution in its own panel.

diamonds %>% 
  filter(cut != "Ideal") %>% 
  ggplot(data = ., aes(x = log(price))) + 
  geom_density() + 
  facet_wrap(~cut)

2 The value of (statistical) life

The Environmental Protection Agency (EPA) estimates the value of human life in US dollars. This allows the government to weigh the costs and benefits of policies when human lives are at stake.

While the EPA cannot place a value on each and every life, it can value a statistical life – the average life.

The EPA estimate of an unconditional statistical life (i.e., not depending on factors like age, income and other characteristics) is $7.4 million.

2.1 Converting values

However, this estimate of $7.4 million is in 2006 dollars. We can convert this to dollars in another year with the following formula:

\[ X_b = X_a\frac{CPI_b}{CPI_a} \] where:

  • \(X_a\) is some dollar value in year a
  • \(X_b\) is the converted dollar value in year b
  • \(CPI_b\) is the consumer price index (CPI) in year b
  • \(CPI_a\) is the consumer price index (CPI) in year a

[QUESTION] Write a function called dollars_converter that implements the formula above. It should take three arguments:

  • value for \(X_a\)
  • cpi_a for \(CPI_a\)
  • cpi_b for \(CPI_b\)
dollars_converter = function(value, cpi_a, cpi_b){
  new_value = value*(cpi_b / cpi_a)
  return(new_value)
}
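
A quick sanity check on dollars_converter: identical CPIs should return the value unchanged, and a doubled CPI should double it.

dollars_converter(value = 100, cpi_a = 150, cpi_b = 150) # no change in prices: returns 100
dollars_converter(value = 100, cpi_a = 100, cpi_b = 200) # prices doubled: returns 200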

The CPI in 2006 was 201.6. In 2018 it was 251.1. Use dollars_converter to convert the value of a statistical life from 2006 dollars (7.4, in millions) to 2018 dollars.

dollars_converter(value = 7.4, cpi_a = 201.6, cpi_b = 251.1)
## [1] 9.216964

2.2 Present value

Suppose a policy is proposed that will save future lives. How much are those lives worth right now?

The present value of an asset (including a human life) is the asset’s future value expressed in the present. The general formula is:

\[ PV = \frac{R_t}{(1 + d)^t} \]

where:

  • \(R_t\) is the future value of some asset
  • \(d\) is the discount rate (a number between 0 and 1) – the rate at which people discount benefits realized only in the future
  • \(t\) is time in years

[QUESTION] Write a function called present_value that calculates the present value of a statistical life as a function of time and the discount rate.

The function present_value should take three arguments:

  • r for \(R_t\) (defaulting to the value of a statistical life in 2018 dollars)
  • d for \(d\)
  • t for \(t\)
present_value = function(r=9.216964, d, t){
  pv = r / ((1 + d)^t)
  return(pv)
}
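
A quick check of present_value: at t = 0 nothing is discounted, so the function should return r itself.

present_value(d = 0.05, t = 0) # (1 + d)^0 = 1, so this returns the default r, 9.216964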

[QUESTION] Calculate the present value of a statistical life five years from now. Assume a discount rate of 5%.

present_value(d = 0.05, t = 5)
## [1] 7.221732

[QUESTION] Calculate the present value of a statistical life fifty years from now. Assume a discount rate of 5%.

present_value(d = 0.05, t = 50)
## [1] 0.8037536

[QUESTION] Calculate the present value of a statistical life five years from now, but this time assume a discount rate of 10%.

present_value(d = 0.10, t = 5)
## [1] 5.723009

[QUESTION] Create a vector called time with the values one to fifty. Create an empty numeric vector called pvs the same length as time. Then write a loop that calculates the present value of a statistical life for one to fifty years into the future. Assume a 5% discount rate.

time = 1:50
pvs = vector(mode = "double", length = length(time))
for(i in seq_along(time)){
  pvs[i] = present_value(d = 0.05, t = i)
}
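
Because the arithmetic inside present_value is vectorized, the same result is available without an explicit loop; a one-line sketch using the time vector from above:

pvs_vectorized = present_value(d = 0.05, t = time) # (1 + d)^t is computed element-wise over time
all.equal(pvs, pvs_vectorized) # TRUE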

[QUESTION] Create a tibble out of time and pvs. Then plot pvs over time.

tibble(time, pvs) %>% 
  ggplot(data = ., aes(x = time, y = pvs)) + 
  geom_line()

[QUESTION] Write a function called present_value_simulator that calculates the present value of a statistical life. The function will loop over a user-specified time interval and calculate the present value of a statistical life for a given discount rate.

The function should have four arguments:

  • r: the value of a statistical life (defaulting to the value of a statistical life in 2018 dollars)
  • d: the discount rate (a number between 0 and 1)
  • t_start: the starting time period (an integer)
  • t_end: the ending time period (an integer)

The function should return a tibble where the first column is the time vector and the second column holds the corresponding present values.

present_value_simulator = function(r = 9.216964, d, t_start, t_end){
  time = t_start:t_end
  pvs = vector(mode = "double", length = length(time))
  for(i in seq_along(time)){
    pvs[i] = r / ((1 + d)^time[i])
  }
  pvs_tble = tibble(time, pvs)
  return(pvs_tble)
}
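
The loop inside present_value_simulator can likewise be avoided. A loop-free sketch (the name present_value_simulator_v2 is just for illustration):

present_value_simulator_v2 = function(r = 9.216964, d, t_start, t_end){
  tibble(time = t_start:t_end) %>% 
    mutate(pvs = r / (1 + d)^time) # vectorized over the whole time column
}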

[QUESTION] Use present_value_simulator to simulate the present value of a statistical life in 2018 dollars for t_start = 0, t_end = 30, and a discount rate of 10%. Pipe the output to ggplot and plot the present values over time.

present_value_simulator(d = 0.10, t_start = 0, t_end = 30) %>% 
  ggplot(data = ., aes(x = time, y = pvs)) + 
  geom_line()

3 COVID-19

Johns Hopkins collects and publishes data on COVID-19 every day; the daily report files live in their GitHub repository, which we read from directly below.

Load the data from 11/30/20:

covid19 = read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/11-30-2020.csv")

[QUESTION] Write code that finds the five countries with the highest number of confirmed cases as of 11/30/20.

covid19 %>% 
  group_by(Country_Region) %>% 
  summarise(total_cases = sum(Confirmed)) %>% 
  slice_max(total_cases, n=5)
## # A tibble: 5 x 2
##   Country_Region total_cases
##   <chr>                <dbl>
## 1 US                13670332
## 2 India              9462809
## 3 Brazil             6335878
## 4 France             2276874
## 5 Russia             2275936

[QUESTION] Filter the data to keep only observations from the US and assign them to the object covid_us:

covid_us = covid19 %>% 
  filter(Country_Region == "US")

For the rest of this section use covid_us.

[QUESTION] Which five states have the most cases?

covid_us %>% 
  group_by(Province_State) %>% 
  summarise(total_cases = sum(Confirmed)) %>% 
  slice_max(total_cases, n=5)
## # A tibble: 5 x 2
##   Province_State total_cases
##   <chr>                <dbl>
## 1 Texas              1291458
## 2 California         1244952
## 3 Florida             999319
## 4 Illinois            726304
## 5 New York            656413

[QUESTION] Which five states have the most deaths?

covid_us %>% 
  group_by(Province_State) %>% 
  summarise(total_cases = sum(Deaths)) %>% 
  slice_max(total_cases, n=5)
## # A tibble: 5 x 2
##   Province_State total_cases
##   <chr>                <dbl>
## 1 New York             34625
## 2 Texas                22271
## 3 California           19308
## 4 Florida              18597
## 5 New Jersey           16993

[QUESTION] In the state with the most cases, select the columns for state, county, confirmed cases and deaths, then find the county with the most cases:

covid_us %>% 
  filter(Province_State == "Texas") %>% 
  select(Province_State, Admin2, Confirmed, Deaths) %>% 
  slice_max(Confirmed)
## # A tibble: 1 x 4
##   Province_State Admin2 Confirmed Deaths
##   <chr>          <chr>      <dbl>  <dbl>
## 1 Texas          Harris    191513   3009

[QUESTION] Create a variable that calculates deaths over confirmed cases. Call this variable “cfr” (for “case fatality ratio”). You should get the same results as “Case_Fatality_Ratio”. Heads up! There are some entries with zero cases but non-zero deaths, and vice versa (this could be a data entry issue). Make sure you account for this when creating cfr. Use ifelse and give the value NA to entries for which you cannot calculate a case-fatality ratio.

covid_us = covid_us %>% 
  mutate(cfr = ifelse(Confirmed > 0, 100*(Deaths / Confirmed), NA)) # in percent; NA when there are no confirmed cases
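
To confirm that cfr tracks the published Case_Fatality_Ratio column, compare the two side by side:

covid_us %>% 
  select(Province_State, Admin2, cfr, Case_Fatality_Ratio) %>% 
  head()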

[QUESTION] What are the five entries (state or territory, plus county) with the highest case-fatality ratios?

covid_us %>% 
  select(Province_State, Admin2, cfr) %>% 
  slice_max(cfr, n = 5)
## # A tibble: 5 x 3
##   Province_State Admin2       cfr
##   <chr>          <chr>      <dbl>
## 1 Illinois       Unassigned 388. 
## 2 Puerto Rico    Unassigned  78.3
## 3 Minnesota      Unassigned  15.3
## 4 Montana        Petroleum   12.5
## 5 Texas          Kenedy      12.5

[QUESTION] Filter out the maximum value of cfr, then plot the cumulative distribution of cfr and use geom_vline to indicate what percent of the data is below a 5% mortality rate. Color the vertical line red.

covid_us %>% 
  filter(cfr < 78) %>% # removes the two extreme outliers (both "Unassigned" entries)
  ggplot(data = ., aes(x = cfr)) + 
  stat_ecdf() + 
  geom_vline(xintercept = 5, color = "red")
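
The exact height of the ECDF at a 5% case-fatality ratio can also be computed directly rather than read off the plot; this sketch reuses the same outlier filter:

covid_us %>% 
  filter(cfr < 78) %>% 
  summarise(share_below_5 = mean(cfr < 5)) # proportion of entries with cfr under 5%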

4 Boston Property

In this section you will explore data on property assessments in Boston. The data come from Analyze Boston, where you can read more about the data set.

4.1 Data

Load all the data from FY18:

boston_property = read_csv("https://raw.githubusercontent.com/lrdegeest/r_for_data_science/main/data/boston_property.csv")

The variable AV_TOTAL shows property assessments in US dollars.

[QUESTION] The data are large and messy, so let’s look at a sample. In the following chunk, fill in the filter call to filter out (i.e., remove) observations with no property assessment or dwelling (i.e., where AV_TOTAL, AV_LAND, LIVING_AREA or YR_BUILT are equal to 0):

set.seed(02132)
boston_property_sample = boston_property %>% 
  filter(AV_TOTAL != 0 & AV_LAND != 0 & LIVING_AREA != 0 & YR_BUILT != 0) %>% # FILL THIS IN
  drop_na(R_TOTAL_RMS) %>% # remove "properties" with no rooms
  slice_sample(prop = 0.8) # draw an 80% sample

For the next several questions make sure you are using boston_property_sample.

[QUESTION] Plot the distribution of property assessments (AV_TOTAL).

ggplot(data = boston_property_sample, aes(x = AV_TOTAL)) + 
  geom_histogram()

[QUESTION] Next, create a variable called log_AV_TOTAL that calculates the log of property prices and add it to the data.

boston_property_sample = boston_property_sample %>% 
  mutate(log_AV_TOTAL = log(AV_TOTAL))

[QUESTION] Plot the distribution of log property assessments (log_AV_TOTAL).

ggplot(data = boston_property_sample, aes(x = log_AV_TOTAL)) + 
  geom_histogram()

[QUESTION] Why bother calculating log property assessments?

To normalize the distribution. That way the average property assessment is representative and we can run regressions on it (or do other statistics).

[QUESTION] Confirm that log_AV_TOTAL is approximately normally distributed by showing that the mean and median are similar:

boston_property_sample %>% 
  summarise(mean(log_AV_TOTAL), median(log_AV_TOTAL))
## # A tibble: 1 x 2
##   `mean(log_AV_TOTAL)` `median(log_AV_TOTAL)`
##                  <dbl>                  <dbl>
## 1                 13.2                   13.1

Let’s focus the rest of our analysis on the zipcode with the most observations.

[QUESTION] Group observations by ZIPCODE, count the number of observations in each ZIPCODE, then identify the ZIPCODE with the most observations.

boston_property_sample %>% 
  group_by(ZIPCODE) %>% 
  summarise(n_zipcode = n()) %>% 
  slice_max(n_zipcode, n = 1)
## # A tibble: 1 x 2
##   ZIPCODE n_zipcode
##   <chr>       <int>
## 1 02132        6050
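
The count() helper collapses the group_by / summarise / n() pattern into one call; an equivalent one-liner:

boston_property_sample %>% 
  count(ZIPCODE, sort = TRUE) %>% # the count column is named n
  slice(1)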

[QUESTION] Create a new data set called boston_property_sample_ZIPCODE but replace “ZIPCODE” with the actual zipcode (e.g., if the zipcode with the most observations was 02138, the data would be boston_property_sample_02138). Create this data set by filtering observations in that zipcode.

boston_property_sample_02132 = boston_property_sample %>% 
  filter(ZIPCODE == "02132")

4.2 Summarizing the data

[QUESTION] Calculate the mean, median, standard deviation and count of log property assessments in boston_property_sample_ZIPCODE.

boston_property_sample_02132 %>% 
  summarise(mean(log_AV_TOTAL), median(log_AV_TOTAL), sd(log_AV_TOTAL), n())
## # A tibble: 1 x 4
##   `mean(log_AV_TOTAL)` `median(log_AV_TOTAL)` `sd(log_AV_TOTAL)` `n()`
##                  <dbl>                  <dbl>              <dbl> <int>
## 1                 13.1                   13.1              0.258  6050

[QUESTION] Make a scatter plot with log prices on the y-axis and the number of bedrooms (R_BDRMS) on the x-axis. Include a regression line.

boston_property_sample_02132 %>% 
  ggplot(data = ., aes(x = R_BDRMS, y = log_AV_TOTAL)) + 
  geom_point() + 
  geom_smooth(method = 'lm')

[QUESTION] Interpret the plot.

There seems to be a positive, linear relationship between the number of bedrooms and a property’s log price.

4.3 Modeling the data

[QUESTION] Estimate a linear model of log property prices with lm and store the model in object prices_model. Model log prices as a function of:

  • the total living area (LIVING_AREA)
  • the number of bedrooms (R_BDRMS)
  • the number of full bathrooms (R_FULL_BTH)
  • the number of half bathrooms (R_HALF_BTH)
  • the number of fireplaces (R_FPLACE)
prices_model = boston_property_sample_02132 %>% 
  lm(formula = log_AV_TOTAL ~ LIVING_AREA + R_BDRMS + R_FULL_BTH + R_HALF_BTH  + R_FPLACE, 
    data = .)

[QUESTION] Use summary to view the results.

prices_model %>% 
  summary()
## 
## Call:
## lm(formula = log_AV_TOTAL ~ LIVING_AREA + R_BDRMS + R_FULL_BTH + 
##     R_HALF_BTH + R_FPLACE, data = .)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27291 -0.08016  0.00640  0.08874  0.94724 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 1.243e+01  6.965e-03 1784.965   <2e-16 ***
## LIVING_AREA 2.763e-04  4.421e-06   62.507   <2e-16 ***
## R_BDRMS     1.858e-03  2.508e-03    0.741    0.459    
## R_FULL_BTH  4.455e-02  3.853e-03   11.563   <2e-16 ***
## R_HALF_BTH  8.025e-02  3.550e-03   22.605   <2e-16 ***
## R_FPLACE    6.523e-02  3.086e-03   21.136   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1413 on 6044 degrees of freedom
## Multiple R-squared:  0.6996, Adjusted R-squared:  0.6993 
## F-statistic:  2815 on 5 and 6044 DF,  p-value: < 2.2e-16

[QUESTION] Interpret the coefficient on R_BDRMS and indicate whether the coefficient is statistically significant and why (or why not). Be sure to state the null and alternative hypotheses.

An additional bedroom is associated with a 1.858e-03 (0.00185) increase in log price, on average and controlling for the other variables. This is the same as \(e^{0.00185} = 1.001852\), or about a 0.2% increase in raw price.

The coefficient is not statistically significant at any conventional significance level (10%, 5%, 1%). This is because the p-value (0.459) is above these thresholds. We fail to reject the null hypothesis (no relationship between bedrooms and prices) and thus cannot support the alternative hypothesis (there is a relationship between bedrooms and prices).
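
The percent interpretation can also be pulled straight from the fitted model with broom::tidy (loaded at the top), avoiding hand transcription; a sketch:

prices_model %>% 
  tidy() %>% 
  filter(term == "R_BDRMS") %>% 
  mutate(pct_change = 100 * (exp(estimate) - 1)) # roughly 0.19% per additional bedroom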

4.4 Evaluating the model

[QUESTION] Predict the raw price (i.e., in regular dollars, not log dollars) of a property with a living area of 2154 square feet, 4 bedrooms, 2 full bathrooms, 1 half bathroom and 1 fireplace. Hint:

\[ e^{\log(x)} = x \]

predicted_log_price = 1.243e+01 + 2.763e-04*2154 + 1.858e-03*4 + 4.455e-02*2 + 8.025e-02*1 + 6.523e-02*1
exp(predicted_log_price)
## [1] 577903.5
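
Typing coefficients by hand invites rounding and transcription errors. As a cross-check, predict() uses the full-precision coefficients stored in prices_model (new_property is a hypothetical one-row tibble built for this question):

new_property = tibble(LIVING_AREA = 2154, R_BDRMS = 4, 
                      R_FULL_BTH = 2, R_HALF_BTH = 1, R_FPLACE = 1)
predict(prices_model, newdata = new_property) %>% exp() # raw dollars; should land close to the hand calculation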

[QUESTION] Use broom::augment to calculate predicted prices and the residuals. Then plot the distribution of residuals.

prices_model %>% 
  augment() %>% 
  ggplot(data = ., aes(x = .resid)) + 
  geom_histogram()

[QUESTION] Interpret the plot.

The distribution of residuals is approximately normal and centered at zero. That means the model’s errors average out: it is as likely to over-predict as to under-predict log prices, and most predictions are close to the observed values.