---
title: "Problem Set 2"
author: "Pablo Pardavila Romero"
format: pdf
editor: visual
---

# Review of Key Theoretical Concepts

## Question 1: Examples of specification error

**Read a published article that uses regression modeling and is on a topic of interest to you. Write a few paragraphs evaluating and criticizing the article, addressing some of the issues we have discussed--- such as measurement, data visualization, modeling, assumptions, model checking, interactions, and transformations. The point of this exercise is not to come up with a comprehensive critique of the article but rather to review the key points of our course content so far in the context of a live example.**

In the article "Measuring the spiral of silence in contexts of political violence: The Basque case", Llera, Rabadán & León (2022) test the effect of terrorist attacks on the so-called "spiral of silence." They hypothesize that terrorist attacks increase the probability of being afraid to talk about politics. To do so, they develop a conceptual framework that explains how terrorism causes fear and a lack of freedom, and that defines what the spiral of silence is. They also discuss the main methodological concerns about the measurement of their variables and how they address them. Finally, they carry out a regression analysis and present the results.

The novelty of this article is that Llera, Rabadán & León use a direct measure of fear ("fear to talk about politics"), in contrast with past research, which measured *fear* as a perception of it in the environment. This directly captures each individual's willingness to talk about politics depending on the virulence of terrorist attacks.

\newpage

# R Session: Estimating the Effects of Wealth

## Question 1: Wealth and Corruption

**a) Provide a brief overview of the data. How many countries does the data set include? Which countries are the least and most corrupt? Which countries are the least and most wealthy?**

```{r}
#| message: false
#| warning: false

library(tidyverse)
library(modelsummary)
library(haven)
library(skimr)
library(marginaleffects)

corruption <- read_dta("corruption.dta")

datasummary_skim(corruption)

most_corrupt <- corruption$cname[which.min(corruption$ti_cpi)]
most_corrupt
least_corrupt <- corruption$cname[which.max(corruption$ti_cpi)]
least_corrupt

least_gdp <- corruption$cname[which.min(corruption$undp_gdp)]
least_gdp
most_gdp <- corruption$cname[which.max(corruption$undp_gdp)]
most_gdp
```

The dataset contains information for 170 countries. The most corrupt country in our data is Bangladesh, and the least corrupt is Finland. Regarding wealth, the country with the lowest GDP per capita is Sierra Leone and the one with the highest is Luxembourg.

\newpage

**b) Run a regression of corruption on GDP per capita. Make a nicely formatted table from the output. Write down the regression line equation. Interpret the intercept and the regression coefficient.**

```{r}
#| message: false
#| warning: false

model_q1b <- lm(ti_cpi ~ undp_gdp, data = corruption)

modelsummary(model_q1b,
             coef_rename = 
               c("(Intercept)" = "Intercept",
                 "undp_gdp" = "GDP/Capita"),
             statistic = "p.value",
             stars = TRUE
                 )
```
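The regression line has the form `ti_cpi_hat = b0 + b1 * undp_gdp`, where `b0` is the predicted score for a country with a GDP per capita of zero and `b1` is the predicted change in the score for each additional unit of GDP per capita. The following is a minimal sketch of how the equation can be written out from a fitted `lm` object. It uses a small made-up data frame (`toy` is hypothetical and only borrows the column names of `corruption.dta`), so the numbers it prints are not results for the real data.

```r
# Toy data with the same column names as the problem set (values are made up)
toy <- data.frame(
  undp_gdp = c(1000, 5000, 10000, 20000, 30000),
  ti_cpi   = c(2.1, 3.0, 4.2, 6.8, 8.9)
)

fit <- lm(ti_cpi ~ undp_gdp, data = toy)
b <- coef(fit)

# b[1] is the intercept: predicted score when GDP per capita is 0
# b[2] is the slope: predicted change in the score per extra unit of GDP per capita
equation <- sprintf("ti_cpi_hat = %.3f + %.6f * undp_gdp", b[1], b[2])
equation
```

Replacing `fit` with `model_q1b` should give the equation for the actual data.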

\newpage

**c) Interpret the substantive relevance of the results. To do so, predict the level of corruption in countries with GDP per capita that corresponds to the 25th percentile and a country with GDP per capita in the 75th percentile (including confidence intervals). Would you describe the effect as large or small?**

```{r}
#| message: false
#| warning: false

# GDP per capita at the 25th and 75th percentiles
gdp_quartiles <- data.frame(
  undp_gdp = quantile(corruption$undp_gdp, probs = c(0.25, 0.75), na.rm = TRUE)
)

# Predicted corruption score at each percentile, with 95% confidence intervals
predict(model_q1b, newdata = gdp_quartiles, interval = "confidence")
```

The predicted corruption scores at the 25th and 75th percentiles of GDP per capita look similar, so the effect seems to be small.

**d) Make a scatterplot of the corruption variable (Y) versus GDP per capita (X). Label the points with country names or abbreviations and add the regression line.**

```{r}
#| message: false
#| warning: false

ggplot(corruption, aes(x = undp_gdp, y = ti_cpi, label = ccodealp)) +
  geom_point() +
  geom_smooth(method = "lm", color = "maroon2") +
  geom_text(hjust=0, vjust=0) +
  labs(x = "GDP/capita", y = "Corruption") +
  theme_bw()
```

**e) What countries are unusually corrupt and lacking in corruption given their level of GDP? Study the residual values of corruption (i.e. the values that cannot be explained or predicted by using information about GDP). Include a residual plot, labeling any potentially interesting outliers.**

```{r}
#| message: false
#| warning: false

model_q1e <- broom::augment(model_q1b, data = corruption)

ggplot(data = model_q1e, mapping = aes(y = .resid, 
                                       x = .fitted)) +
  geom_point() +
  geom_smooth(color = "maroon2") +
  geom_text(data = model_q1e |> 
              slice_max(abs(.resid), n = 6),
            mapping = aes(label = ccodealp)) +
  theme_bw()
```

**f) Include a squared term to account for a potential non-linear relationship. Is there evidence for a u-shaped or reverse u-shaped relationship? According to the model, at what GDP per capita do you expect the highest/lowest level of corruption?**

```{r}
#| message: false
#| warning: false

model_q1f <- lm(ti_cpi ~ undp_gdp + I(undp_gdp^2), data = corruption)

modelsummary(model_q1f,
             fmt = fmt_decimal(digits = 10), # many digits, because the squared-term coefficient is very small
             coef_map =  
               c("(Intercept)" = "Intercept",
                 "undp_gdp" = "GDP/Capita",
                 "I(undp_gdp^2)" = "GDP/Capita squared"),
             gof_map = c("nobs", "r.squared"),
             stars = TRUE)

# Visualize the fitted curve
plot_predictions(model_q1f, condition = c("undp_gdp"))+
  theme_bw()
```
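For a quadratic model `y = b0 + b1 * x + b2 * x^2`, the sign of `b2` gives the shape (negative: reverse U, positive: U), and setting the derivative `b1 + 2 * b2 * x` to zero gives the turning point `x* = -b1 / (2 * b2)`. The following sketch shows the calculation on made-up data (`toy` is hypothetical and constructed to follow an exact reverse-U curve), so its output says nothing about the real data set.

```r
# Toy data following an exact reverse-U curve (values are made up)
toy <- data.frame(undp_gdp = seq(1000, 40000, by = 1000))
toy$ti_cpi <- 2 + 4e-4 * toy$undp_gdp - 6e-9 * toy$undp_gdp^2

fit <- lm(ti_cpi ~ undp_gdp + I(undp_gdp^2), data = toy)
b1 <- coef(fit)[["undp_gdp"]]
b2 <- coef(fit)[["I(undp_gdp^2)"]]

# Setting the derivative b1 + 2 * b2 * x to zero gives the turning point
turning_point <- -b1 / (2 * b2)

# b2 < 0: reverse U (turning point is a maximum); b2 > 0: U (a minimum)
shape <- if (b2 < 0) "reverse U" else "U"

turning_point
shape
```

Applying the same two lines to `coef(model_q1f)` should give the GDP per capita at which the fitted curve peaks or bottoms out. If that value lies outside the observed range of `undp_gdp`, the relationship is monotonic over the data.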

**g) Run a regression of corruption on logged GDP per capita. Interpret the regression coefficient. Which model do you prefer?**

```{r}
#| message: false
#| warning: false

model_q1g <- lm(ti_cpi ~ log(undp_gdp), data = corruption)

modelsummary(model_q1g,
             coef_rename = 
               c("(Intercept)" = "Intercept",
                 "log(undp_gdp)" = "Log GDP/Capita"),
             statistic = "p.value",
             stars = TRUE)
```
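In a level-log model the coefficient `b` on `log(undp_gdp)` is read in terms of relative changes in GDP: a 1% increase in GDP per capita is associated with a change of `b * log(1.01)` (roughly `b / 100`) in the corruption score, and a doubling with a change of `b * log(2)`. The following is a minimal sketch on made-up data (`toy` is hypothetical, not taken from `corruption.dta`).

```r
# Toy data (values are made up, not taken from corruption.dta)
toy <- data.frame(
  undp_gdp = c(500, 1000, 2000, 4000, 8000, 16000, 32000),
  ti_cpi   = c(2.0, 2.4, 3.1, 3.9, 5.2, 6.8, 8.5)
)

fit <- lm(ti_cpi ~ log(undp_gdp), data = toy)
b <- coef(fit)[["log(undp_gdp)"]]

# Predicted change in the score for a 1% increase in GDP per capita (about b / 100)
effect_1pct <- b * log(1.01)

# Predicted change in the score when GDP per capita doubles
effect_double <- b * log(2)

c(one_percent = effect_1pct, doubling = effect_double)
```

The same arithmetic applied to `coef(model_q1g)` should give the corresponding quantities for the actual data.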

**h) Given your analysis, write up a short paragraph describing what the causal mechanism between wealth and corruption could be that would explain the effect that you observe in the data.**

\newpage

## Question 2: Wealth and Infant Mortality

**a) Examine the distribution of per capita income and infant mortality. Make a scatter plot of per capita income and infant mortality. Then compare this to a scatter plot of logged per capita income and logged infant mortality. Which plot appears to represent a linear relationship between the variables?**

```{r}
#| message: false
#| warning: false

infmort <- read_dta("infantmortality.dta")

ggplot(infmort, aes(x = income, y = infant)) +
  geom_point() +
  labs(x = "GDP/capita", y = "Infant Mortality") +
  theme_bw()

ggplot(infmort, aes(x = log(income), y = log(infant))) +
  geom_point() +
  labs(x = "Log GDP/capita", y = "Log Infant Mortality") +
  theme_bw()
```

The second plot, with both variables logged, shows a much more clearly linear relationship.

**b) Run a regression of log infant mortality on log income controlling for the region of the world (using Asia as the baseline) and whether countries are oil-exporting or not. Interpret the coefficients carefully.**

```{r}
#| message: false
#| warning: false

#Asia as a baseline: factor + relevel
infmort$region <- as.factor(infmort$region)
infmort$region <- relevel(infmort$region, ref = "Asia")

model_q2b <- lm(log(infant) ~ log(income) + region + oil, data = infmort)

modelsummary(model_q2b,
             coef_rename = 
               c("(Intercept)" = "Intercept",
                 "log(income)" = "Log GDP/Capita",
                 "regionAfrica" = "Africa",
                 "regionAmericas" = "Americas",
                 "regionEurope" = "Europe",
                 "oilyes" = "Oil exporting country"),
             statistic = "p.value",
             stars = TRUE
                 )
```

**c) Now include an interaction between the oil dummy and income. Interpret the results and try to include informative plots in your writeup. Which model specification do you prefer?**

```{r}
#| message: false
#| warning: false

model_q2c <- lm(log(infant) ~ log(income)*oil + region, data = infmort)

modelsummary(model_q2c,
             coef_map = 
               c("(Intercept)" = "Intercept",
                 "log(income)" = "Log GDP/Capita",
                 "regionAfrica" = "Africa",
                 "regionAmericas" = "Americas",
                 "regionEurope" = "Europe",
                 "oilyes" = "Oil exporting country",
                 "log(income):oilyes" = "Log GDP/Capita x Oil"),
             statistic = "p.value",
             stars = TRUE
                 )
```

Preferred: interaction term model.

**d) Based on your preferred model specification, calculate the expected levels of infant mortality in European countries for mean levels of income and oil export (including confidence intervals). Describe this result in one or two short sentences. Compare the model prediction with the actual average among European countries. Are there discrepancies? Why or why not?**

```{r}
#| message: false
#| warning: false

# Recode oil as a 0/1 dummy and keep Asia as the reference region
infmort <- infmort |>
  mutate(oil = ifelse(oil == "no", 0, 1),
         region = fct_relevel(region, "Asia"))

infmort$log_income <- log(infmort$income)

model_q2d <- lm(log(infant) ~ log_income*oil, data = infmort, subset = region == "Europe")

# The names in coef_map must match the model terms ("oil", "log_income:oil")
list(model_q2d) |> 
  modelsummary(coef_map = c(
                            "(Intercept)" = "Constant",
                            "log_income" = "Log income",
                            "oil" = "Oil",
                            "log_income:oil" = "Log income x Oil"),
               gof_map = c("nobs", "r.squared"),
               stars = TRUE)

plot_predictions(model_q2d, condition = list("log_income", "oil"))+
  theme_bw()
```
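Because the outcome is `log(infant)`, a prediction and its confidence interval come out on the log scale and have to be exponentiated before they can be compared with observed mortality rates. The following sketch shows one way to do that, assuming a made-up data frame (`toy` and `scenario` are hypothetical; the region term is left out to keep the example short).

```r
# Toy data (values are made up, not taken from infantmortality.dta)
toy <- data.frame(
  income = c(100, 300, 800, 2000, 5000, 200, 600, 1500, 3000, 4500),
  infant = c(200, 120, 70, 30, 15, 150, 140, 120, 110, 100),
  oil    = factor(rep(c("no", "yes"), each = 5))
)

fit <- lm(log(infant) ~ log(income) * oil, data = toy)

# Hypothetical scenario: a non-oil-exporting country at the mean income
scenario <- data.frame(income = mean(toy$income), oil = "no")

# Prediction and 95% confidence interval on the log scale
pred_log <- predict(fit, newdata = scenario, interval = "confidence")

# Back-transform the estimate and both interval bounds to the original scale
pred_rate <- exp(pred_log)
pred_rate

# Observed average among the comparable toy countries, for comparison
mean(toy$infant[toy$oil == "no"])
```

Some discrepancy between the two numbers is to be expected. The exponentiated prediction refers to a country at one specific income level, and exponentiating a mean of logs gives a geometric rather than an arithmetic mean, which is usually lower.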

**e) A journalist working for The Economist has heard about your fascinating work and is interested in reporting your result in an article. He approaches you and asks you to provide him with an easily understandable plot of how income in dollars affects infant mortality rates in oil exporting vs. non-oil exporting countries including any uncertainty in your estimates. Create such a visualization and explain it in a few sentences.**
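One possible way to build such a plot is to predict over a grid of incomes in dollars for both groups, exponentiate the predictions and their confidence bounds, and draw one line with an uncertainty band per group. The sketch below assumes a made-up data frame (`toy` is hypothetical, and the region term is left out for brevity), so the figure it produces only illustrates the layout.

```r
library(ggplot2)

# Toy data (values are made up, not taken from infantmortality.dta)
toy <- data.frame(
  income = c(100, 300, 800, 2000, 5000, 200, 600, 1500, 3000, 4500),
  infant = c(200, 120, 70, 30, 15, 150, 140, 120, 110, 100),
  oil    = factor(rep(c("no", "yes"), each = 5))
)

fit <- lm(log(infant) ~ log(income) * oil, data = toy)

# Grid of incomes in dollars for both groups, within the observed range
grid <- expand.grid(
  income = seq(200, 4500, length.out = 50),
  oil    = c("no", "yes")
)

# Predict on the log scale, then back-transform estimate and bounds
pred <- exp(predict(fit, newdata = grid, interval = "confidence"))
grid <- cbind(grid, pred)

p <- ggplot(grid, aes(x = income, y = fit, color = oil, fill = oil)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2, color = NA) +
  geom_line() +
  labs(x = "Income per capita (dollars)",
       y = "Predicted infant mortality",
       color = "Oil exporter", fill = "Oil exporter") +
  theme_bw()
p
```

Each line shows the predicted infant mortality at a given income, and the shaded band is the 95% confidence interval. Where the bands of the two groups overlap, the difference between oil-exporting and other countries is not clearly distinguishable from zero.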