There are 37 questions. Each question is tagged with [QUESTION]. Each question is worth 3 points. There are a total of 111 points.
Enter your answers in the chunks provided. Do not create additional chunks. Do not code or type anywhere outside the chunks.
There are two types of chunks. One for code, the other for writing.
Code chunks look like this:
# your code here
Delete # your code here and replace it with your code.
The writing chunks look like this:
your answer here
You can write normal Markdown inside the writing chunks.
Delete “your answer here” and replace it with your written answer.
Load the following packages:
library(tidyverse)
library(broom)
[QUESTION] Calculate:
\[ \frac{10^2 + 4^3}{10^3 + 4^2} \]
(10^2 + 4^3)/(10^3 + 4^2)
## [1] 0.1614173
[QUESTION] Re-write the following code using pipes:
mean(exp(sqrt(1:10)))
## [1] 11.55345
1:10 %>%
sqrt() %>%
exp() %>%
mean()
## [1] 11.55345
[QUESTION] Draw 1000 random normal numbers with a mean of 3 and a standard deviation of 1. Then calculate the average of the 30th to 300th observations.
set.seed(2020)
rnorm(n = 1000, mean = 3, sd = 1) %>%
.[30:300] %>%
mean()
## [1] 2.929095
[QUESTION] Using diamonds, filter out (i.e., remove) diamonds whose cut is “Ideal”, then plot the distribution of log prices by cut. Put each distribution in its own panel.
diamonds %>%
  filter(cut != "Ideal") %>%
  ggplot(data = ., aes(x = log(price))) +
  geom_density() +
  facet_wrap(~cut)
The Environmental Protection Agency (EPA) estimates the value of human life in US dollars. This allows the government to weight the costs and benefits of policies when human lives are at stake.
While the EPA cannot place a value on each and every life, it can value statistical life – the average life.
The EPA estimate of an unconditional statistical life (i.e., not depending on factors like age, income and other characteristics) is $7.4 million.
However, this estimate of $7.4 million is in 2006 dollars. We can convert this to dollars in another year with the following formula:
\[ X_b = X_a\frac{CPI_b}{CPI_a} \]
where \(X_a\) is the value in year \(a\) dollars, \(X_b\) is the value in year \(b\) dollars, and \(CPI_a\) and \(CPI_b\) are the consumer price index values in years \(a\) and \(b\).
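For instance, converting $100 from a year with a CPI of 200 to a year with a CPI of 250 (hypothetical index values) inflates the value by 25%:

\[ X_b = 100 \times \frac{250}{200} = 125 \]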
[QUESTION] Write a function called dollars_converter that implements the formula above. It should take three arguments:

- value for \(X_a\)
- cpi_a for \(CPI_a\)
- cpi_b for \(CPI_b\)

dollars_converter = function(value, cpi_a, cpi_b){
  new_value = value*(cpi_b / cpi_a)
  return(new_value)
}
The CPI index in 2006 was 201.6. In 2018 it was 251.1. Use dollars_converter to convert the value of a statistical life from 2006 dollars (7.4) to 2018 dollars.
dollars_converter(value = 7.4, cpi_a = 201.6, cpi_b = 251.1)
## [1] 9.216964
Suppose a policy is proposed that will save future lives. How much are those lives worth right now?
The present value of an asset (including a human life) is the asset’s future value expressed in the present. The general formula is:
\[ PV = \frac{R_t}{(1 + d)^t} \]
where \(R_t\) is the asset's value \(t\) periods in the future and \(d\) is the discount rate.
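As a quick sanity check with round numbers: $100 received ten years from now, discounted at 5%, is worth about $61.39 today:

\[ PV = \frac{100}{(1.05)^{10}} \approx 61.39 \]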
[QUESTION] Write a function called present_value that calculates the present value of a statistical life as a function of time and the discount rate.
The function present_value should take three arguments:

- r for \(R_t\) (defaulting to the value of a statistical life in 2018 dollars)
- d for \(d\)
- t for \(t\)

present_value = function(r = 9.216964, d, t){
  pv = r / ((1 + d)^t)
  return(pv)
}
[QUESTION] Calculate the present value of a statistical life five years from now. Assume a discount rate of 5%.
present_value(d = 0.05, t = 5)
## [1] 7.221732
[QUESTION] Calculate the present value of a statistical life fifty years from now. Assume a discount rate of 5%.
present_value(d = 0.05, t = 50)
## [1] 0.8037536
[QUESTION] Calculate the present value of a statistical life five years from now, but this time assume a discount rate of 10%.
present_value(d = 0.10, t = 5)
## [1] 5.723009
[QUESTION] Create a vector called time with the values one to fifty. Create an empty numeric vector called pvs the same length as time. Then write a loop that calculates the present value of a statistical life for one to fifty years into the future. Assume a 5% discount rate.

time = 1:50
pvs = vector(mode = "double", length = length(time))
for(i in seq_along(time)){
  pvs[i] = present_value(d = 0.05, t = i)
}
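Because R's arithmetic operators are vectorized, the same present values can be computed without an explicit loop. A minimal sketch (re-defining present_value as above so the snippet stands alone):

```r
# present_value as defined earlier in the document
present_value = function(r = 9.216964, d, t){
  r / ((1 + d)^t)
}

# Vectorized call: passing the whole time vector returns all 50
# present values at once, equivalent to the loop above
time = 1:50
pvs = present_value(d = 0.05, t = time)
```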
[QUESTION] Create a tibble out of time and pvs. Then plot pvs over time.
tibble(time, pvs) %>%
ggplot(data = ., aes(x = time, y = pvs)) +
geom_line()
[QUESTION] Write a function called present_value_simulator that calculates the present value of a statistical life. The function will loop over a user-specified time interval and calculate the present value of a statistical life for a given discount rate.
The function should have three arguments:

- d: the discount rate (a number between 0 and 1)
- t_start: the starting time period (an integer)
- t_end: the ending time period (an integer)

The function should return a tibble where the first column is the time vector and the second column contains the present values.
present_value_simulator = function(r = 9.216964, d, t_start, t_end){
  time = t_start:t_end
  pvs = vector(mode = "double", length = length(time))
  for(i in seq_along(time)){
    pvs[i] = r / ((1 + d)^time[i])
  }
  pvs_tble = tibble(time, pvs)
  return(pvs_tble)
}
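For comparison, here is a loop-free sketch of the same simulator, using vectorized discounting (the name present_value_simulator_v is hypothetical, chosen to avoid clobbering the original; same default for r is assumed):

```r
library(tibble)

# Loop-free variant: build the time vector and discount it in one
# vectorized expression, returning the same two-column tibble
present_value_simulator_v = function(r = 9.216964, d, t_start, t_end){
  time = t_start:t_end
  tibble(time = time, pvs = r / ((1 + d)^time))
}

res = present_value_simulator_v(d = 0.10, t_start = 0, t_end = 30)
```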
[QUESTION] Use present_value_simulator to simulate the present value of a statistical life in 2018 dollars for t_start = 0, t_end = 30, and a discount rate of 10%. Pipe the output to ggplot and plot the present values over time.
present_value_simulator(d = 0.10, t_start = 0, t_end = 30) %>%
ggplot(data = ., aes(x = time, y = pvs)) +
geom_line()
Johns Hopkins collects and publishes data every day on COVID-19 (see here).
Load the data from 11/30/20:
covid19 = read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/11-30-2020.csv")
[QUESTION] Write code that finds the five countries with the highest number of confirmed cases as of 11/30/20.
covid19 %>%
  group_by(Country_Region) %>%
  summarise(total_cases = sum(Confirmed)) %>%
  slice_max(total_cases, n = 5)
## # A tibble: 5 x 2
## Country_Region total_cases
## <chr> <dbl>
## 1 US 13670332
## 2 India 9462809
## 3 Brazil 6335878
## 4 France 2276874
## 5 Russia 2275936
[QUESTION] Filter the data to keep only observations from the US and assign them to the object covid_us:

covid_us = covid19 %>%
  filter(Country_Region == "US")
For the rest of this section use covid_us.
[QUESTION] Which five states have the most cases?
covid_us %>%
  group_by(Province_State) %>%
  summarise(total_cases = sum(Confirmed)) %>%
  slice_max(total_cases, n = 5)
## # A tibble: 5 x 2
## Province_State total_cases
## <chr> <dbl>
## 1 Texas 1291458
## 2 California 1244952
## 3 Florida 999319
## 4 Illinois 726304
## 5 New York 656413
[QUESTION] Which five states have the most deaths?
covid_us %>%
  group_by(Province_State) %>%
  summarise(total_cases = sum(Deaths)) %>%
  slice_max(total_cases, n = 5)
## # A tibble: 5 x 2
## Province_State total_cases
## <chr> <dbl>
## 1 New York 34625
## 2 Texas 22271
## 3 California 19308
## 4 Florida 18597
## 5 New Jersey 16993
[QUESTION] In the state with the most cases, select the columns for state, county, confirmed cases and deaths, then find the county with the most cases:
covid_us %>%
  filter(Province_State == "Texas") %>%
  select(Province_State, Admin2, Confirmed, Deaths) %>%
  slice_max(Confirmed)
## # A tibble: 1 x 4
## Province_State Admin2 Confirmed Deaths
## <chr> <chr> <dbl> <dbl>
## 1 Texas Harris 191513 3009
[QUESTION] Create a variable that calculates deaths over confirmed cases. Call this variable “cfr” (for “case fatality ratio”). You should get the same results as “Case_Fatality_Ratio”. Heads up! There are some entries with zero cases and non-zero deaths, and vice versa (this could be a data entry issue). Make sure you account for this when creating cfr. Use ifelse and give the value NA to entries for which you cannot calculate a case-fatality ratio.

covid_us = covid_us %>%
  mutate(cfr = ifelse(Confirmed > 0, 100*(Deaths / Confirmed), NA))
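An alternative to ifelse worth knowing: dplyr's na_if() replaces a zero denominator with NA before dividing, so the NA propagates through the division automatically. A sketch on toy data (the counts below are made up for illustration):

```r
library(dplyr)

# Toy counts (hypothetical) including the zero-cases edge case
toy = tibble(Confirmed = c(100, 0, 50), Deaths = c(2, 1, 0))

# na_if(Confirmed, 0) turns 0 into NA, so the division yields NA
# instead of Inf -- same effect as the ifelse() approach above
toy = toy %>% mutate(cfr = 100 * Deaths / na_if(Confirmed, 0))
```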
[QUESTION] What are the five states or territories with the highest case-fatality ratios?
covid_us %>%
  select(Province_State, Admin2, cfr) %>%
  slice_max(cfr, n = 5)
## # A tibble: 5 x 3
## Province_State Admin2 cfr
## <chr> <chr> <dbl>
## 1 Illinois Unassigned 388.
## 2 Puerto Rico Unassigned 78.3
## 3 Minnesota Unassigned 15.3
## 4 Montana Petroleum 12.5
## 5 Texas Kenedy 12.5
[QUESTION] Filter out the maximum value of cfr, then plot the cumulative distribution of cfr and use geom_vline to indicate what percent of the data is below a 5% mortality rate. Color the vertical line red.

covid_us %>%
  filter(cfr < 78) %>%
  ggplot(data = ., aes(x = cfr)) +
  stat_ecdf() +
  geom_vline(xintercept = 5, color = "red")
In this section you will explore data on property assessments in Boston. The data come from Analyze Boston. You can read more about the data here.
Load all the data from FY18:
boston_property = read_csv("https://raw.githubusercontent.com/lrdegeest/r_for_data_science/main/data/boston_property.csv")
The variable AV_TOTAL shows property assessments in US dollars.
[QUESTION] The data are very large and messy so let’s look at a sample. In the following chunk, fill in the filter call to filter out (i.e., remove) observations with no property assessment or dwelling (i.e., where AV_TOTAL, AV_LAND, LIVING_AREA and YR_BUILT are equal to 0):

set.seed(02132)
boston_property_sample = boston_property %>%
  filter(AV_TOTAL != 0 & AV_LAND != 0 & LIVING_AREA != 0 & YR_BUILT != 0) %>%
  drop_na(R_TOTAL_RMS) %>% # remove "properties" with no rooms
  slice_sample(prop = 0.8) # draw an 80% sample
For the next several questions make sure you are using boston_property_sample.
[QUESTION] Plot the distribution of property assessments (AV_TOTAL).
ggplot(data = boston_property_sample, aes(x = AV_TOTAL)) +
geom_histogram()
[QUESTION] Next, create a variable called log_AV_TOTAL that calculates the log of property prices and add it to the data.

boston_property_sample = boston_property_sample %>%
  mutate(log_AV_TOTAL = log(AV_TOTAL))
[QUESTION] Plot the distribution of log property assessments (log_AV_TOTAL).
ggplot(data = boston_property_sample, aes(x = log_AV_TOTAL)) +
geom_histogram()
[QUESTION] Why bother calculating log property assessments?
To normalize the distribution. That way the average property assessment is representative and we can run regressions on it (or do other statistics).
[QUESTION] Confirm that log_AV_TOTAL is approximately normally distributed by showing that the mean and median are similar:

boston_property_sample %>%
  summarise(mean(log_AV_TOTAL), median(log_AV_TOTAL))
## # A tibble: 1 x 2
## `mean(log_AV_TOTAL)` `median(log_AV_TOTAL)`
## <dbl> <dbl>
## 1 13.2 13.1
Let’s focus the rest of our analysis on the zipcode with the most observations.
[QUESTION] Group observations by ZIPCODE, count the number of observations in each ZIPCODE, then identify the ZIPCODE with the most observations.

boston_property_sample %>%
  group_by(ZIPCODE) %>%
  summarise(n_zipcode = n()) %>%
  slice_max(n_zipcode, n = 1)
## # A tibble: 1 x 2
## ZIPCODE n_zipcode
## <chr> <int>
## 1 02132 6050
[QUESTION] Create a new data set called boston_property_sample_ZIPCODE but replace “ZIPCODE” with the actual zipcode (e.g., if the zipcode with the most observations was 02138, the data would be boston_property_sample_02138). Create this data set by filtering observations in that zipcode.

boston_property_sample_02132 = boston_property_sample %>%
  filter(ZIPCODE == "02132")
[QUESTION] Calculate the mean, median, standard deviation and count of log property assessments in boston_property_sample_ZIPCODE.

boston_property_sample_02132 %>%
  summarise(mean(log_AV_TOTAL), median(log_AV_TOTAL), sd(log_AV_TOTAL), n())
## # A tibble: 1 x 4
## `mean(log_AV_TOTAL)` `median(log_AV_TOTAL)` `sd(log_AV_TOTAL)` `n()`
## <dbl> <dbl> <dbl> <int>
## 1 13.1 13.1 0.258 6050
[QUESTION] Make a scatter plot of log prices on the y-axis and the number of bedrooms (R_BDRMS) on the x-axis. Include a regression line.

boston_property_sample_02132 %>%
  ggplot(data = ., aes(x = R_BDRMS, y = log_AV_TOTAL)) +
  geom_point() +
  geom_smooth(method = 'lm')
[QUESTION] Interpret the plot.
There seems to be a positive, linear relationship between the number of bedrooms and a property’s log price.
[QUESTION] Estimate a linear model of log property prices with lm and store the model in the object prices_model. Model log prices as a function of:

- living area (LIVING_AREA)
- bedrooms (R_BDRMS)
- full bathrooms (R_FULL_BTH)
- half bathrooms (R_HALF_BTH)
- fireplaces (R_FPLACE)

prices_model = boston_property_sample_02132 %>%
  lm(formula = log_AV_TOTAL ~ LIVING_AREA + R_BDRMS + R_FULL_BTH + R_HALF_BTH + R_FPLACE,
     data = .)
[QUESTION] Use summary to view the results.

prices_model %>%
  summary()
##
## Call:
## lm(formula = log_AV_TOTAL ~ LIVING_AREA + R_BDRMS + R_FULL_BTH +
## R_HALF_BTH + R_FPLACE, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27291 -0.08016 0.00640 0.08874 0.94724
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.243e+01 6.965e-03 1784.965 <2e-16 ***
## LIVING_AREA 2.763e-04 4.421e-06 62.507 <2e-16 ***
## R_BDRMS 1.858e-03 2.508e-03 0.741 0.459
## R_FULL_BTH 4.455e-02 3.853e-03 11.563 <2e-16 ***
## R_HALF_BTH 8.025e-02 3.550e-03 22.605 <2e-16 ***
## R_FPLACE 6.523e-02 3.086e-03 21.136 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1413 on 6044 degrees of freedom
## Multiple R-squared: 0.6996, Adjusted R-squared: 0.6993
## F-statistic: 2815 on 5 and 6044 DF, p-value: < 2.2e-16
[QUESTION] Interpret the coefficient on R_BDRMS and indicate whether the coefficient is statistically significant and why (or why not). Be sure to state the null and alternative hypotheses.
An additional bedroom is associated with a 1.858e-03 (0.00185) increase in log price, on average and controlling for other variables. This is the same as \(e^{0.00185} = 1.001852\), or about a 0.2% increase in raw price.
The coefficient is not statistically significant at any conventional false positive rate (10%, 5%, 1%). This is because the p-value (0.459) is above these thresholds. We fail to reject the null hypothesis (no relationship between bedrooms and prices) and thus cannot support the alternative hypothesis (there is a relationship between bedrooms and prices).
[QUESTION] Predict the raw price (i.e., in regular dollars, not log dollars) of a property with a living area of 2154 square feet, 4 bedrooms, 2 full bathrooms, 1 half bathroom and 1 fireplace. Hint:
\[ e^{log(x)} = x \]
predicted_log_price = 1.243e+01 + 2.765e-04*2154 + 1.858e-03*4 + 4.455e-02*2 + 8.025e-02*1 + 6.523e-02*1
exp(predicted_log_price)
## [1] 578152.5
[QUESTION] Use broom::augment to calculate predicted prices and the residuals. Then plot the distribution of residuals.
prices_model %>%
  augment() %>%
  ggplot(data = ., aes(x = .resid)) +
  geom_histogram()
[QUESTION] Interpret the plot.
The distribution of residuals is approximately normal and centered at zero. That means that, on average, our model's errors cancel out when predicting the log prices of properties.