
# Review of Key Theoretical Concepts

## Question 1: Examples of specification error

**Read a published article that uses regression modeling and is on a topic of interest to you. Write a few paragraphs evaluating and criticizing the article, addressing some of the issues we have discussed, such as measurement, data visualization, modeling, assumptions, model checking, interactions, and transformations. The point of this exercise is not to come up with a comprehensive critique of the article but rather to review the key points of our course content so far in the context of a live example.**

\newpage

# R Session: Estimating the Effects of Wealth

## Question 1: Wealth and Corruption
**a) Provide a brief overview of the data. How many countries does the data set include? Which countries are the least and most corrupt? Which countries are the least and most wealthy?**

```{r}
#| message: false
#| warning: false

library(tidyverse)
library(modelsummary)
library(haven)
library(skimr)

corruption <- read_dta("Data/corruption.dta")

datasummary_skim(corruption)

most_corrupt <- corruption$cname[which.min(corruption$ti_cpi)]
most_corrupt
least_corrupt <- corruption$cname[which.max(corruption$ti_cpi)]
least_corrupt

least_gdp <- corruption$cname[which.min(corruption$undp_gdp)]
least_gdp
most_gdp <- corruption$cname[which.max(corruption$undp_gdp)]
most_gdp
```
The dataset contains information for 170 countries. The most corrupt country in our data is Bangladesh, and the least corrupt is Finland. Regarding wealth, the country with the lowest GDP per capita is Sierra Leone and the one with the highest is Luxembourg.

\newpage

**b) Run a regression of corruption on GDP per capita. Make a nicely formatted table from the output. Write down the regression line equation. Interpret the intercept and the regression coefficient.**
```{r}
#| message: false
#| warning: false

model_q1b <- lm(ti_cpi ~ undp_gdp, data = corruption)

modelsummary(model_q1b,
             coef_rename = 
               c("(Intercept)" = "Intercept",
                 "undp_gdp" = "GDP/Capita"),
             statistic = "p.value",
             stars = TRUE
                 )
```
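
The regression line for this model has the general form sketched below. The hatted coefficients stand for the estimates reported in the table and are deliberately not filled in here.

$$
\widehat{\text{ti\_cpi}}_i = \hat{\beta}_0 + \hat{\beta}_1 \, \text{undp\_gdp}_i
$$

Here $\hat{\beta}_0$ is the predicted corruption score for a hypothetical country with a GDP per capita of zero, and $\hat{\beta}_1$ is the expected change in the score for a one-unit increase in GDP per capita.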

\newpage

**c) Interpret the substantive relevance of the results. To do so, predict the level of corruption in countries with GDP per capita that corresponds to the 25th percentile and a country with GDP per capita in the 75th percentile (including confidence intervals). Would you describe the effect as large or small?**
```{r}
#| message: false
#| warning: false

# GDP per capita at the 25th and 75th percentiles
gdp_quartiles <- quantile(corruption$undp_gdp, probs = c(0.25, 0.75), na.rm = TRUE)

# Predicted corruption score with 95% confidence intervals at those two values
predict(model_q1b,
        newdata = data.frame(undp_gdp = gdp_quartiles),
        interval = "confidence")
```
The predicted corruption scores at the two quartiles look similar, so the effect of GDP per capita seems to be small.
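
One way to judge substantive size is to compare the gap between the two predictions with the spread of the outcome. The sketch below uses simulated data (the data frame `toy` and its columns `x` and `y` are made up for illustration and are not the corruption data). It shows that in a linear model the gap equals the slope times the interquartile range of the predictor.

```r
# Simulated stand-in data; `toy`, `x`, and `y` are hypothetical names
set.seed(1)
toy <- data.frame(x = runif(100, min = 500, max = 40000))
toy$y <- 2 + 0.0002 * toy$x + rnorm(100, sd = 1)

fit <- lm(y ~ x, data = toy)

# Predictor values at the 25th and 75th percentiles
x_q <- quantile(toy$x, probs = c(0.25, 0.75))

# Predictions with 95% confidence intervals at those two values
pred <- predict(fit, newdata = data.frame(x = x_q), interval = "confidence")

# Gap between the two predictions; equals slope * interquartile range
gap <- unname(pred[2, "fit"] - pred[1, "fit"])
slope_times_iqr <- unname(coef(fit)["x"] * diff(x_q))

# Express the gap relative to the outcome's standard deviation
gap_in_sd <- gap / sd(toy$y)
```

A gap that is a large fraction of the outcome's standard deviation, with clearly separated confidence intervals, would point to a substantively large effect.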

**d) Make a scatterplot of the corruption variable (Y) versus GDP per capita (X). Label the points with country names or abbreviations and add the regression line.**
```{r}
#| message: false
#| warning: false

ggplot(corruption, aes(x = undp_gdp, y = ti_cpi, label = ccodealp)) +
  geom_point() +
  geom_smooth(method = "lm", color = "orange") +
  geom_text(hjust=0, vjust=0) +
  labs(x = "GDP/capita", y = "Corruption") +
  theme_bw()
```

**e) What countries are unusually corrupt and lacking in corruption given their level of GDP? Study the residual values of corruption (i.e. the values that cannot be explained or predicted by using information about GDP). Include a residual plot, labeling any potentially interesting outliers.**
```{r}
#| message: false
#| warning: false

model_q1e <- broom::augment(model_q1b, data = corruption)

ggplot(data = model_q1e, mapping = aes(y = .resid, 
                                       x = .fitted)) +
  geom_point() +
  geom_smooth(color = "orange") +
  # Label the four observations with the largest absolute residuals
  geom_text(data = model_q1e |>
              slice_max(abs(.resid), n = 4),
            mapping = aes(label = ccodealp),
            vjust = -0.5) +
  theme_bw()
```
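
Alongside the plot, a ranked list of residuals can make the outliers easier to name. The sketch below uses simulated data (`toy`, `unit`, `x`, and `y` are hypothetical names, not columns of the corruption data) and plain base R sorting. Since the lowest `ti_cpi` value is treated as the most corrupt country above, a large positive residual would presumably mean less corruption than GDP predicts, and a large negative one more.

```r
# Simulated stand-in data; `toy`, `unit`, `x`, and `y` are hypothetical names
set.seed(2)
toy <- data.frame(unit = paste0("U", 1:50), x = rnorm(50))
toy$y <- 1 + 0.5 * toy$x + rnorm(50)
toy$y[7] <- toy$y[7] + 10   # plant one large positive outlier

fit <- lm(y ~ x, data = toy)
toy$resid <- resid(fit)

# Sort units from the most negative to the most positive residual
ranked <- toy[order(toy$resid), c("unit", "resid")]

head(ranked, 3)  # outcomes far below what x predicts
tail(ranked, 3)  # outcomes far above what x predicts
```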


**f) Include a squared term to account for a potential non-linear relationship. Is there evidence for a u-shaped or reverse u-shaped relationship? According to the model, at what GDP per capita do you expect the highest/lowest level of corruption?**
```{r}
#| message: false
#| warning: false

model_q1f <- lm(ti_cpi ~ undp_gdp + I(undp_gdp^2), data = corruption)

modelsummary(model_q1f,
             fmt = fmt_decimal(digits = 10), # coefficients on raw GDP are tiny, so show many digits
             coef_map =
               c("(Intercept)" = "Intercept",
                 "undp_gdp" = "GDP/Capita",
                 "I(undp_gdp^2)" = "GDP/Capita squared"),
             gof_map = c("nobs", "r.squared"),
             stars = TRUE)

# Visualize the fitted quadratic relationship
marginaleffects::plot_predictions(model_q1f, condition = "undp_gdp") +
  theme_bw()
```
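
For a quadratic fit $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x + \hat{\beta}_2 x^2$, the turning point lies at $x^* = -\hat{\beta}_1 / (2\hat{\beta}_2)$. It is a maximum (reverse u-shape) when $\hat{\beta}_2 < 0$ and a minimum (u-shape) when $\hat{\beta}_2 > 0$. A minimal sketch of that calculation follows. The helper `turning_point()` and the simulated data frame `toy` are hypothetical and only illustrate the formula.

```r
# Hypothetical helper: turning point of a fitted quadratic y ~ x + I(x^2)
turning_point <- function(model, linear, squared) {
  b <- coef(model)
  x_star <- unname(-b[linear] / (2 * b[squared]))
  shape <- if (b[squared] < 0) "inverted U (maximum)" else "U (minimum)"
  list(x = x_star, shape = shape)
}

# Simulated stand-in data with a known peak at x = 3
set.seed(3)
toy <- data.frame(x = runif(200, min = 0, max = 6))
toy$y <- 10 - (toy$x - 3)^2 + rnorm(200, sd = 0.5)

fit <- lm(y ~ x + I(x^2), data = toy)
tp <- turning_point(fit, linear = "x", squared = "I(x^2)")
tp
```

Applied to the model above, the call would be `turning_point(model_q1f, "undp_gdp", "I(undp_gdp^2)")`. A turning point outside the observed range of GDP per capita would mean the fitted curve is monotonic over the data.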


**g) Run a regression of corruption on logged GDP per capita. Interpret the regression coefficient. Which model do you prefer?**
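
As a reminder for the interpretation, a level-log specification relates the outcome to proportional changes in the predictor. A short sketch, with $\hat{\beta}_1$ denoting the estimated coefficient on `log(undp_gdp)` and $p$ a proportional increase in GDP per capita:

$$
\widehat{\text{ti\_cpi}}_i = \hat{\beta}_0 + \hat{\beta}_1 \log(\text{undp\_gdp}_i)
\quad\Longrightarrow\quad
\Delta \widehat{\text{ti\_cpi}} = \hat{\beta}_1 \log(1 + p)
$$

For a 1% increase ($p = 0.01$) the predicted change is approximately $\hat{\beta}_1 / 100$, and for a doubling of GDP per capita ($p = 1$) it is $\hat{\beta}_1 \log 2 \approx 0.693\,\hat{\beta}_1$.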

```{r}
#| message: false
#| warning: false

model_q1g <- lm(ti_cpi ~ log(undp_gdp), data = corruption)

modelsummary(model_q1g,
             coef_rename = 
               c("(Intercept)" = "Intercept",
                 "log(undp_gdp)" = "Log GDP/Capita"),
             statistic = "p.value",
             stars = TRUE
                 )
```