PS05

*Note:* I worked through this problem set with Pietro and Filippo.

# Review of Key Theoretical Concepts

## Question 1

##### In the 2008 presidential election, in the 29 states won by Barack Obama, average annual per capita income was \$44,500, with a standard deviation of about \$5,400. In the 21 states won by John McCain, average annual per capita income was \$38,500, with a standard deviation of about \$4,600.

##### a) Test the hypothesis that average annual per capita income is the same in Obama and McCain states. State the type of hypothesis test you perform. Obtain the t score and compare the p-value to conventional significance benchmarks.

Under $H_0$ the two population means are equal, so we test whether the difference in sample means is distinguishable from zero. This is a two-sample (Welch) test of the difference in means:

$$
t = \frac{\bar X_1 - \bar X_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} = \frac{44500 - 38500}{\sqrt{\frac{5400^2}{29} + \frac{4600^2}{21}}} \approx 4.23
$$

Under the normal approximation the two-sided critical value is 1.96; using the t distribution with the Welch degrees of freedom (about 47), it is roughly 2.01. Since 4.23 exceeds both, we reject $H_0$ at the 5% level, and indeed at the 1% level: the two-sided p-value is well below 0.001.

Equivalently, a 95% confidence interval for the difference in means excludes zero:

$$
(\bar X_1 - \bar X_2) \pm 1.96 \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} \approx 6000 \pm 1.96 \times 1419 \approx (3219,\ 8781)
$$
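The same computation can be sketched in R directly from the summary statistics given in the question:

```{r}
# Two-sample (Welch) t statistic from the summary statistics
xbar1 <- 44500; s1 <- 5400; n1 <- 29   # Obama states
xbar2 <- 38500; s2 <- 4600; n2 <- 21   # McCain states

se_diff <- sqrt(s1^2 / n1 + s2^2 / n2)   # standard error of the difference
t_stat  <- (xbar1 - xbar2) / se_diff     # about 4.23

# Welch-Satterthwaite degrees of freedom
df <- se_diff^4 / ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))

2 * pt(-abs(t_stat), df)                 # two-sided p-value
```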

##### b) It is often supposed that the higher a person's income, the more likely they are to vote Republican. Does the result of the hypothesis test above necessarily contradict this supposition? If so, explain why; if not explain how they might be consistent with each other.

Not necessarily. The hypothesis test compares state-level averages, so inferring individual-level voting behavior from it would be an ecological fallacy. Without micro-data we cannot draw conclusions about individuals: it is entirely consistent that richer individuals are more likely to vote Republican while richer states lean Democratic, for example if poor voters in rich states vote heavily for Democrats.
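A quick simulation illustrates how the two patterns can coexist (all numbers here are invented for illustration): within each state richer individuals lean Republican, yet the richer state has the lower Republican vote share overall.

```{r}
set.seed(1)
# Hypothetical states: A is rich with a low Republican baseline,
# B is poor with a high Republican baseline
income_a <- rnorm(1000, mean = 50000, sd = 5000)
income_b <- rnorm(1000, mean = 35000, sd = 5000)

# Within each state, the probability of voting Republican RISES with income
rep_a <- rbinom(1000, 1, plogis(-2 + (income_a - 50000) / 10000))
rep_b <- rbinom(1000, 1, plogis( 1 + (income_b - 35000) / 10000))

mean(rep_a)  # rich state: low Republican share
mean(rep_b)  # poor state: high Republican share
```

The state-level comparison (rich state less Republican) says nothing about the individual-level slope, which is positive within both states.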

## Question 2

##### Consider the linear regression model $E(Y |X_1, X_2) = α + β_1X_1 + β_2X_2 + β_3X_1X_2$, estimated by ordinary least squares.

##### a) What null and alternative hypotheses and what test statistic would you use to investigate whether $X_2$ adds any explanatory power to the model?

$X_2$ enters the model through both $\beta_2$ and the interaction term $\beta_3$, so the null hypothesis is joint: $H_0: \beta_2 = \beta_3 = 0$ against $H_1:$ at least one of $\beta_2, \beta_3$ is nonzero. The appropriate test statistic is an F-statistic comparing the restricted model (all terms involving $X_2$ dropped) with the full model.
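Since $X_2$ appears in both $\beta_2 X_2$ and $\beta_3 X_1 X_2$, its explanatory power can be checked jointly with an F-test on nested models via `anova()`. A minimal sketch with simulated data (all names and numbers hypothetical):

```{r}
set.seed(42)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + 0.5 * x2 - 0.3 * x1 * x2 + rnorm(n)

full       <- lm(y ~ x1 * x2)   # includes x2 and the interaction
restricted <- lm(y ~ x1)        # drops every term involving x2

anova(restricted, full)          # F-test of H0: beta2 = beta3 = 0
```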

##### b) What null and alternative hypotheses and what test statistic would you use to investigate whether X2 affects the effect of X1 on E(Y \|X)?

Whether $X_2$ affects the effect of $X_1$ concerns only the interaction: $H_0: \beta_3 = 0$ against $H_1: \beta_3 \neq 0$, tested with $t = \frac{\hat\beta_3}{SE(\hat\beta_3)}$.

##### c) In the regression model as written, what is the standard error of the total effect of X1 on E(Y \|X)? Express it in general terms that can be obtained from the variance-covariance matrix.

The total effect of $X_1$ on $E(Y|X)$ is $\partial E(Y|X) / \partial X_1 = \beta_1 + \beta_3 X_2$, so its standard error depends on the value of $X_2$:

$$
SE(\hat\beta_1 + \hat\beta_3 X_2) = \sqrt{Var(\hat\beta_1) + X_2^2\, Var(\hat\beta_3) + 2 X_2\, Cov(\hat\beta_1, \hat\beta_3)}
$$

where the variances and the covariance are taken from the estimated variance-covariance matrix of the coefficients.
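As a sketch of how this is obtained in R from `vcov()` (simulated data; the evaluation point $X_2 = 1$ is arbitrary):

```{r}
set.seed(7)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 1 + 2 * x1 + x2 + 0.5 * x1 * x2 + rnorm(n)

fit <- lm(y ~ x1 * x2)
V   <- vcov(fit)       # variance-covariance matrix of the coefficients

x2_star <- 1           # evaluate the effect of x1 at x2 = 1
effect  <- coef(fit)["x1"] + coef(fit)["x1:x2"] * x2_star
se      <- sqrt(V["x1", "x1"] + x2_star^2 * V["x1:x2", "x1:x2"] +
                2 * x2_star * V["x1", "x1:x2"])
c(effect = unname(effect), se = unname(se))
```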

# R session

## Question 1

##### We exploit the linear relationship between the Nazis' vote share Yi and the proportion of blue-collar voters Xi given in the equation above by regressing the former on the latter. That is, fit the following linear regression model,

##### $$E(Y_i|X_i) = \alpha + \beta X_i$$

##### Compute the estimated slope coefficient, its standard error, and the 95% confidence interval. Give a substantive interpretation of each quantity.

```{r}
#| message: false
#| warning: false
#| echo: false

library(tidyverse)
library(modelsummary)
library(kableExtra)
library(broom)
library(knitr)
```

```{r}
#| message: false
#| warning: false

nazisdata <- read_csv("nazis.csv")

# raw vote counts vary widely across precincts, so use the Nazi vote share

nazisdata$prop_votes <- nazisdata$nazivote / nazisdata$nvoter

# we do a classic lm model

modelq1 <- lm(prop_votes ~ shareblue, data = nazisdata)

# now we present it

list(
"Nazi vote share" = modelq1) |>
  modelsummary(
    coef_map = c("(Intercept)" = "Constant",
    "shareblue" = "Blue-collar voters"),
    conf_level = 0.95,
    statistic = c("std.error", "conf.int"),
    gof_map = c("nobs", "r.squared", "p.value"),
    stars = TRUE,
    title = "Linear regression")
```
There does not appear to be a meaningful relationship. The constant is 0.396 with a standard error of 0.017 and a 95% confidence interval of \[0.363, 0.428\]; the three asterisks (\*\*\*) indicate it is significantly different from zero (p \< 0.001). Substantively, a precinct with no blue-collar voters is predicted to give the Nazis about 39.6% of the vote.

The coefficient on "Blue-collar voters" is 0.065 with a standard error of 0.052 and a 95% confidence interval of \[-0.037, 0.168\]. The interval includes zero and the table shows no asterisks, so the coefficient is not statistically significant at conventional levels.

The standard error is almost as large as the coefficient itself, so the estimated effect of the blue-collar share on the Nazi vote share is imprecise: there is a relatively large margin of error around the estimate.

## Question 2

##### Based on the fitted regression model from the previous question, predict the average Nazi vote share Yi given various proportions of blue-collar voters Xi. Specifically, plot the predicted value of Yi (the vertical axis) against various values of Xi within its observed range (the horizontal axis) as a solid line. Add 95% confidence intervals as dashed lines. Give a substantive interpretation of the plot.

```{r}
#| message: false
#| warning: false

# first we sequence the share of blue-collar workers

model2_pred <- data.frame(
  shareblue = seq(from = min(nazisdata$shareblue),
                  to = max(nazisdata$shareblue),
                  length.out = 100))

# prediction at each grid value

model2_pred$nazivote <- predict(modelq1, newdata = model2_pred)

# we also need the coef intervals:

conf_interval <- predict(modelq1, newdata = model2_pred,  interval = "confidence",
                        level = 0.95)

# and now we create the data frame for our plot

merged_predict <- cbind(conf_interval, model2_pred)

# plotting what we need

ggplot(merged_predict, aes(x = shareblue, y = nazivote)) +
  geom_line() +
  geom_line(aes(x = shareblue, y = lwr), linetype = "dashed", 
             color = "brown3") +
  geom_line(aes(x = shareblue, y = upr), linetype = "dashed",
             color = "brown3") +
  labs(
    title = "Prediction of average Nazi vote share",
    subtitle = "given various proportion of blue-collars voters",
    x = "Proportion of blue-collar voters",
    y = "Nazi vote share"
  ) +
  theme_bw() +
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))
```
The confidence band widens substantially at higher blue-collar shares, so the positive slope cannot be distinguished from a flat line over much of the range. The likely reason is data sparsity: few precincts have very high proportions of blue-collar voters, which inflates the uncertainty there.

## Question 3

##### Fit the following alternative linear regression model,

##### $$E(Y_i|X_i) = \alpha^* X_i + (1 - X_i)\beta^*$$

##### Note that this model does not have an intercept. How should one interpret α∗ and β∗? How are these parameters related to the linear regression model given in Question 1?

```{r}
#| message: false
#| warning: false

nazisdata$other <- (1 - nazisdata$shareblue)
m2a <- lm(nazivote/nvoter ~ 0 + shareblue + other, nazisdata)

list(
"Model" = m2a) |>
  modelsummary(
    coef_map = c("other" = "Others",
    "shareblue" = "Blue-collar voters"),
    gof_map = c("nobs", "r.squared", "p.value"),
    stars = TRUE,
    title = "Alternative regression")
```
$\alpha^*$ is the expected Nazi vote share in a precinct composed entirely of blue-collar voters ($X_i = 1$), and $\beta^*$ is the expected share in a precinct with no blue-collar voters ($X_i = 0$). Expanding the model gives $E(Y_i|X_i) = \beta^* + (\alpha^* - \beta^*)X_i$, so this is a reparameterization of the Question 1 model with $\alpha = \beta^*$ and $\beta = \alpha^* - \beta^*$.

Both coefficients are positive, which makes sense since both groups cast some Nazi votes. Note that the reported $R^2$ of about 0.9 is not comparable to the Question 1 model: in a regression without an intercept, $R^2$ is computed around zero rather than around the mean of $Y$, which mechanically inflates it.
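The relationship to the Question 1 model can be checked numerically: expanding $\alpha^* X + \beta^*(1 - X)$ gives $\beta^* + (\alpha^* - \beta^*)X$, so $\alpha = \beta^*$ and $\beta = \alpha^* - \beta^*$. A small sketch with made-up vote shares (only the algebra matters here):

```{r}
set.seed(3)
x <- runif(100)                           # proportion blue-collar
y <- 0.4 + 0.1 * x + rnorm(100, 0, 0.05)  # simulated vote shares

m1 <- lm(y ~ x)                 # Question 1 model: alpha + beta * x
m2 <- lm(y ~ 0 + x + I(1 - x))  # no-intercept model: a * x + b * (1 - x)

coef(m2)[2]                # equals alpha (the intercept of m1)
coef(m2)[1] - coef(m2)[2]  # equals beta (the slope of m1)
```

The two models span the same column space, so the fitted values are identical; only the parameterization differs.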

## Question 4

##### Fit a linear regression model where the overall Nazi vote share is regressed on the proportion of each occupation. The model should contain no intercept and five predictors, each representing the proportion of a certain occupation type. Interpret the estimate of each coefficient and its 95% confidence interval. What assumption is necessary to permit your interpretation?

```{r}
#| message: false
#| warning: false

modelq4 <- lm(prop_votes ~ 0 + shareblue + shareself + sharewhite + sharedomestic + shareunemployed, data = nazisdata)

list(
  "Model" = modelq4) |>
  modelsummary(
    coef_map = c("shareblue" = "Proportion of blue-collar voters",
                 "shareself" = "Proportion of self-employed voters",
                 "sharewhite" = "Proportion of white-collar voters",
                 "sharedomestic" = "Proportion of domestically employed voters",
                 "shareunemployed" = "Proportion of unemployed voters"),
    conf_level = 0.95,
    statistic = c("std.error", "conf.int"),
    gof_map = c("nobs", "r.squared", "p.value", "conf.int"),
    stars = TRUE,
    title = "Linear regression model")
```
The coefficients for blue-collar, white-collar, and self-employed voters are statistically significant (p \< 0.001) and positive, with the self-employed showing the largest estimated impact on the Nazi vote share. Each coefficient can be read as the estimated proportion of that occupation group voting for the Nazis.

This interpretation rests on an ecological-inference assumption: within every precinct, members of a given occupation must vote for the Nazis at the same rate, regardless of the precinct's composition. Since we observe only aggregates, not individual votes, this assumption may be misleading.

## Question 5

##### Finally, we consider a model-free approach to ecological inference. That is, we ask how much we can learn from the data alone without making an additional modeling assumption. For each precinct, obtain the smallest value that is logically possible for Wi1 by considering the scenario in which all non-blue-collar voters in precinct i vote for the Nazis. Express this value as a function of Xi and Yi. Similarly, what is the largest possible value for Wi1? Calculate these bounds, keeping in mind that the value for Wi1 cannot be negative or greater than 1. Finally, compute the bounds for the nation-wide proportion of blue-collar voters who voted for the Nazis (i.e., combining the blue-collar voters from all precincts by computing their weighted average based on the number of blue-collar voters). Give a brief substantive interpretation of the results.

```{r}
#| message: false
#| warning: false

modelq5 <- lm(nazivote/nvoter ~ other, nazisdata)

modelq5_pred <- expand.grid(
other = seq(from = 0, to = 1, by = .01),
nazivote = nazisdata$nazivote)

modelq5_pred$nazivote = predict(modelq5, newdata = modelq5_pred)

conf_interval2 <- predict(modelq5, newdata = modelq5_pred,  
                         interval = "confidence",
                         level = 0.95)

merged_predict2 <- cbind(conf_interval2, modelq5_pred)

merged_predict2$bad_case <- 1 - merged_predict2$lwr 
merged_predict2$good_case <- 1 - merged_predict2$upr

ggplot(merged_predict2, aes(x = other, y = bad_case)) +
  geom_line() +
  labs(
    title = "Minimum proportion of blue-collar voters voting for the Nazis",
    x = "Other workers",
    y = "Blue collar"
  ) +
  theme_bw() +
   theme(plot.title = element_text(hjust = 0.5))

ggplot(merged_predict2, aes(x = other, y = good_case)) +
  geom_line()  +
  labs(
    title = "Maximum proportion of blue-collar voters voting for the Nazis",
    x = "Other workers",
    y = "Blue collar"
  ) +
  theme_bw() +
   theme(plot.title = element_text(hjust = 0.5))

```
```{r}
#| message: false
#| warning: false

# weights: number of blue-collar voters in each precinct
blue_voters <- nazisdata$nvoter * nazisdata$shareblue
nazisdata$average_mean <- blue_voters / sum(blue_voters)

modelq52 <- lm(nazivote/nvoter ~ average_mean, nazisdata)

modelq52_pred <- expand.grid(
other = seq(from = 0, to = 1, by = .01),
nazivote = nazisdata$nazivote)

modelq52_pred$nazivote = predict(modelq52, newdata = modelq52_pred)

conf_interval3 <- predict(modelq52, newdata = modelq52_pred,  
                         interval = "confidence",
                         level = 0.95)  

lower_bound <- conf_interval3[, "lwr"]
upper_bound <- conf_interval3[, "upr"]

min_lower_bound <- min(lower_bound)
max_upper_bound <- max(upper_bound)

min_lower_bound
max_upper_bound
```
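The bounds themselves can also be computed directly from the accounting identity in the question, with no regression at all. A sketch, assuming the `nazis.csv` columns `shareblue`, `nvoter`, and `nazivote`, but run here on simulated precincts so it is self-contained:

```{r}
set.seed(9)
# Simulated stand-in for nazis.csv
d <- data.frame(shareblue = runif(50, 0.1, 0.9),
                nvoter    = sample(500:5000, 50, replace = TRUE))
d$nazivote <- rbinom(50, d$nvoter, 0.3 + 0.2 * d$shareblue)

Y <- d$nazivote / d$nvoter   # Nazi vote share
X <- d$shareblue             # proportion of blue-collar voters

# Smallest W_i1: assume every non-blue-collar voter voted Nazi first
w_min <- pmax(0, (Y - (1 - X)) / X)
# Largest W_i1: assume every Nazi vote came from blue-collar voters
w_max <- pmin(1, Y / X)

# Nation-wide bounds: average weighted by the number of blue-collar voters
w <- d$nvoter * X
c(lower = sum(w * w_min) / sum(w), upper = sum(w * w_max) / sum(w))
```

On real data such intervals tend to be wide, which is precisely the point of the model-free approach: the data alone only bound $W_{i1}$, and narrowing the bounds requires modeling assumptions like those in Question 4.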