---
title: "Problem Set 1"
author: "Jesús Martín Godoy"
format: pdf
editor: visual
---

# Review of Key Theoretical Concepts

## Question 1: Examples of specification error

a)  **Describe a nonexperimental research situation–real or contrived–in which failure to control statistically for an omitted variable induces a correlation between the error and an explanatory variable, producing erroneous conclusions.**

In a hypothetical scenario, let us consider a study examining the relationship between job satisfaction and job performance among employees in a company. The researcher collects data on these two variables and finds a significant negative correlation between job satisfaction and job performance, concluding that employees who report higher levels of job satisfaction tend to have lower job performance.

However, the researcher fails to control statistically for workload, which influences both job satisfaction and job performance. Employees with heavier workloads may report lower job satisfaction because of stress and exhaustion, while at the same time registering higher measured performance simply because they complete more tasks. Because workload is omitted from the analysis, it is absorbed into the error term, and since it is also related to job satisfaction, the error becomes correlated with the explanatory variable.

Without controlling for workload, the negative correlation observed between job satisfaction and job performance may be spurious. It may not be job satisfaction that drives job performance, but rather workload acting as a confounding variable. Failing to account for workload could lead to erroneous conclusions, such as the belief that higher job satisfaction harms performance, and to misguided interventions that target job satisfaction rather than workload.
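
The mechanism can be illustrated with a small simulation. This is only a sketch on invented data: the variable names, the effect sizes, and the assumption that job satisfaction has no true effect on performance are all hypothetical choices made for illustration.

```{r}
# Simulated (hypothetical) data: workload lowers satisfaction and raises
# measured performance; satisfaction itself has no effect on performance
set.seed(1)
n <- 5000
workload <- rnorm(n)
satisfaction <- -0.8 * workload + rnorm(n)
performance <- 0 * satisfaction + 1.0 * workload + rnorm(n)

# "Short" regression omits workload, so workload sits in the error term
short_fit <- lm(performance ~ satisfaction)
# "Long" regression controls for workload
long_fit <- lm(performance ~ satisfaction + workload)

coef(short_fit)["satisfaction"]  # clearly negative despite a true effect of zero
coef(long_fit)["satisfaction"]   # close to zero once workload is controlled for
```

Under these assumptions, the short regression attributes the effect of workload to job satisfaction, which is the erroneous conclusion described above.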

b)  **Describe an experiment–real or contrived–in which faulty experimental practice induces an explanatory variable to become correlated with the error, compromising the validity of the results produced by the experiment.**

Let us imagine a pharmaceutical company conducting a clinical trial to test the effectiveness of a new drug for reducing cholesterol levels in patients with high cholesterol. The experiment involves two groups: the treatment group receiving the new drug, and the control group receiving a placebo.

However, due to faulty experimental practice, the blinding protocol is not strictly followed. Some researchers inadvertently become aware of which participants are in the treatment group and which are in the control group. As a result, these researchers might unintentionally treat the two groups differently, providing more attention, encouragement, or support to the treatment group because they believe in the efficacy of the new drug.

This unintentional differential treatment introduces a systematic bias into the experiment. The encouragement and support provided by the researchers are not recorded, so they form part of the error term, and because they are given disproportionately to the treatment group, the treatment indicator becomes correlated with that error. As a result, any observed difference in cholesterol levels between the treatment and control groups may not be due solely to the drug; it may also reflect the researchers' unequal handling of the two groups.

c)  **Is it fair to conclude that a researcher is never able absolutely to rule out the possibility that an explanatory variable of interest is correlated with the error? Is experimental research no better than observational research in this respect? Explain your answer.**

Yes, it is fair to say that a researcher can never absolutely rule out a correlation between an explanatory variable of interest and the error, but experimental research is still better placed than observational research in this respect. In an experiment, randomization and control groups make the treatment independent of other influences on the outcome by design, so any correlation with the error arises only from chance imbalance in a finite sample or from faulty practice such as broken blinding. In observational research, researchers rely on statistical techniques like regression analysis to control for potential confounders, and there is always a risk that unmeasured or unknown variables influence the results. Thus, while experimental research generally provides stronger evidence of causality because the researcher manipulates the explanatory variable, neither type of research can guarantee that explanatory variables are uncorrelated with the error.
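
A short simulation can make the contrast concrete. It is a sketch under invented assumptions: `ability` stands in for any unobserved influence on the outcome, the true treatment effect is set to 1, and the selection rule in the observational case is hypothetical.

```{r}
set.seed(2)
n <- 5000
ability <- rnorm(n)  # unobserved, so it ends up in the error term

# Observational setting: high-ability units are more likely to take the treatment
treat_obs <- rbinom(n, 1, plogis(ability))
# Experimental setting: a coin flip, independent of ability by design
treat_rand <- rbinom(n, 1, 0.5)

true_effect <- 1
y_obs <- true_effect * treat_obs + 2 * ability + rnorm(n)
y_rand <- true_effect * treat_rand + 2 * ability + rnorm(n)

cor(treat_obs, ability)    # clearly positive
cor(treat_rand, ability)   # near zero, but not exactly zero in a finite sample

est_obs <- coef(lm(y_obs ~ treat_obs))[["treat_obs"]]      # overstates the effect
est_rand <- coef(lm(y_rand ~ treat_rand))[["treat_rand"]]  # close to the true effect
c(observational = est_obs, randomized = est_rand)
```

Even in the randomized case the sample correlation is only approximately zero, which is why the possibility can be made small but not ruled out absolutely.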

# R Session: Efficacy of Small-class Size in Early Education

```{r}
library(tidyverse)  # attaches dplyr, so a separate library(dplyr) call is not needed
```

```{r}
star <- read.csv("star.csv")
```

## Question 1

Create a new factor variable called kinder in the data frame. This variable should recode classtype by changing integer values to their corresponding informative labels (e.g., change 1 to small etc.). Similarly, recode the race variable into a factor variable with four levels (white, black, hispanic, others) by combining Asians and Native Americans as the others category. For the race variable, overwrite the original variable in the data frame rather than creating a new one.

```{r}
# Adding variable "kinder"
recode_a <- c("small", "regular", "regular_aid")
star$kinder <- factor(star$classtype, levels = 1:3, labels = recode_a)
```

```{r}
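# Recoding "race" variable: collapse Asian (3) and Native American (5) into
# the existing "others" code (6). Assumed coding: 1 = white, 2 = black, 4 = hispanic.
# The numeric codes are kept as level names because later chunks filter on them.
star$race <- as.character(star$race)
star$race <- ifelse(star$race %in% c("3", "5"), "6", star$race)
star$race <- factor(star$race)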
```
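
The question asks for informative labels rather than numeric codes. One possible way to collapse and label the levels in a single step is sketched below on a toy vector. The coding (1 = white, 2 = black, 3 = Asian, 4 = hispanic, 5 = Native American, 6 = others) is an assumption about the codebook, and `race_codes` and `race_labelled` are hypothetical names that do not touch `star`.

```{r}
# Toy vector of race codes (assumed coding, see above)
race_codes <- c(1, 2, 3, 4, 5, 6, 1, 2)

race_labelled <- factor(race_codes)
# Assigning a named list to levels() renames and merges levels at once
levels(race_labelled) <- list(
  white    = "1",
  black    = "2",
  hispanic = "4",
  others   = c("3", "5", "6")
)

table(race_labelled)
```

If this labelling were applied to `star$race`, any later filter written against the numeric codes, such as `race %in% c(1, 2, 4)`, would need to use the labels instead.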

## Question 2

How does performance on fourth grade reading and math tests for those students assigned to a small class in kindergarten compare with those assigned to a regular-sized class? Do students in the smaller classes perform better? Use a difference-in-means test to make this comparison while removing missing values. Give a brief substantive interpretation of the results. Then replicate the same analysis using a regression approach.

```{r}
# Difference-in-means test

small_class <- star |> filter(kinder == "small")
regular_class <- star |> filter(kinder == "regular")

# Difference-in-means for reading (Welch two-sample t-test).
# t.test() drops missing values itself and has no diff or na.rm arguments.
reading_test <- t.test(small_class$g4reading, regular_class$g4reading)
reading_test

# Difference-in-means for math
math_test <- t.test(small_class$g4math, regular_class$g4math)
math_test
```

```{r}
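# Reading: with "regular" as the reference level, the coefficient on "small"
# equals the small-minus-regular difference in means. lm() drops rows with
# missing scores on its own; na.omit(star) would also discard rows that are
# missing on unrelated variables.
lm(g4reading ~ relevel(kinder, ref = "regular"), data = star)

# Math
lm(g4math ~ relevel(kinder, ref = "regular"), data = star)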
```
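
The equivalence between the two approaches can be checked on simulated data. The sketch below uses invented scores and hypothetical object names (`demo`, `tt`, `fit`) and is not based on the STAR data.

```{r}
set.seed(3)
# Simulated (hypothetical) scores for two class types
demo <- data.frame(
  kinder = factor(rep(c("small", "regular"), each = 200),
                  levels = c("regular", "small")),
  score = c(rnorm(200, mean = 725, sd = 50), rnorm(200, mean = 720, sd = 50))
)

tt <- t.test(score ~ kinder, data = demo)  # estimates are ordered by factor level
fit <- lm(score ~ kinder, data = demo)

diff_means <- unname(tt$estimate[2] - tt$estimate[1])  # small minus regular
slope <- unname(coef(fit)["kindersmall"])

all.equal(diff_means, slope)
```

The point estimates coincide. The standard errors can differ slightly, because `t.test()` defaults to the Welch version with unequal variances while `lm()` assumes a common error variance.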

## Question 3

Instead of comparing just average scores of reading and math tests between those students assigned to small classes and those assigned to regular-sized classes, look at the entire range of possible scores. To do so, compare a high score, defined as the 66th percentile, and a low score (the 33rd percentile) for small classes with the corresponding score for regular classes. These are examples of quantile treatment effects. Does this analysis add anything to the analysis based on mean in the previous question?

```{r}
quantiles_data_reading <- star |>
  group_by(kinder) |>
  summarise(quantile_66 = quantile(g4reading, probs = 0.66, na.rm = TRUE),
            quantile_33 = quantile(g4reading, probs = 0.33, na.rm = TRUE),
            .groups = "drop")

quantiles_data_reading
```

```{r}
quantiles_data_math <- star |>
  group_by(kinder) |>
  summarise(quantile_66 = quantile(g4math, probs = 0.66, na.rm = TRUE),
            quantile_33 = quantile(g4math, probs = 0.33, na.rm = TRUE),
            .groups = "drop")

quantiles_data_math
```

While the analysis based on means provides insight into overall performance differences, the quantile analysis reveals nuances in performance at different points of the score distribution. For reading, while the 66th percentile scores are similar across class sizes, students in regular-sized classes exhibit slightly lower scores at the 33rd percentile compared to those in small classes. In math, while the 66th percentile scores are consistent across class sizes, small classes demonstrate slightly higher scores at the 33rd percentile compared to regular-sized classes. This suggests that class size may have differential effects on student performance, particularly at the lower end of the score distribution, which may not be apparent from mean-based comparisons alone.
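
A quantile treatment effect is the difference between the same quantile in two groups. A minimal sketch on simulated scores follows; the means and spreads are invented, and `small_scores` and `regular_scores` are hypothetical names.

```{r}
set.seed(4)
# Simulated (hypothetical) test scores with different spreads
small_scores <- rnorm(500, mean = 725, sd = 45)
regular_scores <- rnorm(500, mean = 720, sd = 55)

probs <- c(0.33, 0.66)
# Quantile treatment effects: small minus regular at each percentile
qte <- quantile(small_scores, probs) - quantile(regular_scores, probs)
qte
```

Because the two simulated distributions differ in spread as well as in mean, the effect need not be the same at both percentiles, which is the kind of pattern a comparison of means cannot reveal.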

## Question 4

Some students were in small classes for all four years that the STAR program ran. Others were assigned to small classes for only one year and had either regular classes or regular classes with an aid for the rest. How many such students of each type are in the data set? Create a contingency table of proportions using the kinder and yearssmall variables. Does participation in more years of small classes make a greater difference in test scores? Compare the average and median reading and math test scores across students who spent different numbers of years in small classes.

```{r}
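# Number of students by kindergarten class type and years spent in small classes
contingency_table <- table(star$kinder, star$yearssmall)
contingency_table

# Contingency table of proportions using the kinder and yearssmall variables
prop.table(contingency_table)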
```
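
Proportions can be taken over the whole sample or within each class type. The toy example below, built on a hypothetical six-row data frame called `toy`, shows the difference through the `margin` argument of `prop.table()`.

```{r}
# Toy (hypothetical) data with the same two variables
toy <- data.frame(
  kinder = c("small", "small", "small", "regular", "regular", "regular_aid"),
  yearssmall = c(4, 1, 4, 0, 1, 0)
)

counts <- table(toy$kinder, toy$yearssmall)
counts

overall_props <- prop.table(counts)          # shares of the whole sample
row_props <- prop.table(counts, margin = 1)  # shares within each class type
row_props
```

With `margin = 1` each row sums to one, which makes it easier to read how students who started in a given class type are spread across years spent in small classes.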

```{r}
scores_summary <- star |> 
  group_by(yearssmall) |> 
  summarise(avg_reading = mean(g4reading, na.rm = TRUE),
            median_reading = median(g4reading, na.rm = TRUE),
            avg_math = mean(g4math, na.rm = TRUE),
            median_math = median(g4math, na.rm = TRUE))

# Print summary statistics
print(scores_summary)
```

## Question 5

We examine whether the STAR program reduced the achievement gaps across different racial groups. Begin by comparing the **average reading and math** test scores **between white and minority students** (i.e., Blacks and hispanics) among those students who were assigned to **regular classes** with no aid. Conduct the same comparison among those students who were assigned to small classes. Give a brief substantive interpretation of the results of your analysis. Then replicate the same analysis using a regression approach.

```{r}
average_scores_race_class <- list(
  Regular_No_Aid = star |> 
    filter(kinder == "regular" & race %in% c(1, 2, 4)) |> 
    group_by(race) |> 
    summarise(avg_reading_regular = mean(g4reading, na.rm = TRUE),
              avg_math_regular = mean(g4math, na.rm = TRUE)),
  Small_Class = star |> 
    filter(kinder == "small" & race %in% c(1, 2, 4)) |> 
    group_by(race) |> 
    summarise(avg_reading_small = mean(g4reading, na.rm = TRUE),
              avg_math_small = mean(g4math, na.rm = TRUE))
)

average_scores_race_class
```

```{r}
# Fit linear regression models for reading scores using the pipe operator
reading_model_regular <- star |>
  filter(kinder == "regular" & race %in% c(1, 2, 4)) %>%
  lm(formula = g4reading ~ race, data = .)

reading_model_small <- star |>
  filter(kinder == "small" & race %in% c(1, 2, 4)) %>%
  lm(formula = g4reading ~ race, data = .)

# Fit linear regression models for math scores using the pipe operator
math_model_regular <- star |>
  filter(kinder == "regular" & race %in% c(1, 2, 4)) %>%
  lm(formula = g4math ~ race, data = .)

math_model_small <- star |>
  filter(kinder == "small" & race %in% c(1, 2, 4)) %>%
  lm(formula = g4math ~ race, data = .)

# Summarize regression models
summary(reading_model_regular)
summary(reading_model_small)
summary(math_model_regular)
summary(math_model_small)
```
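
Fitting separate models by class type shows the gap in each group, but it does not directly estimate how much the gap changes. One possible alternative, sketched here on simulated data, is a single model with an interaction term. The data-generating numbers, the `gap_demo` data frame, and the binary `small` and `minority` indicators are all hypothetical.

```{r}
set.seed(5)
n <- 4000
gap_demo <- data.frame(
  small = rbinom(n, 1, 0.5),     # 1 = small class, 0 = regular class
  minority = rbinom(n, 1, 0.3)   # 1 = minority student, 0 = white student
)
# Hypothetical process: gap of -30 in regular classes, -20 in small classes
gap_demo$score <- with(
  gap_demo,
  720 + 5 * small - 30 * minority + 10 * small * minority + rnorm(n, sd = 40)
)

gap_fit <- lm(score ~ small * minority, data = gap_demo)
coef(gap_fit)
```

In this parameterization the coefficient on `minority` is the gap in regular classes, and the coefficient on `small:minority` is the change in that gap for small classes, so a positive interaction corresponds to a narrower gap.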

## Question 6

We consider the long-term effects of kindergarten class size. Compare high school graduation rates across students assigned to different class types. Also, examine whether graduation rates differ by the number of years spent in small classes. Finally, as done in the previous question, investigate whether the STAR program has reduced the racial gap between white and minority students’ graduation rates. Briefly discuss the results.

```{r}
# Filter data for each class type and remove missing values
small_class <- star %>% filter(classtype == 1 & !is.na(hsgrad))
regular_class <- star %>% filter(classtype == 2 & !is.na(hsgrad))
regular_aid_class <- star %>% filter(classtype == 3 & !is.na(hsgrad))

# Calculate graduation rates for each class type
graduation_rates <- list(
  Small_Class = sum(small_class$hsgrad) / nrow(small_class),
  Regular_Class = sum(regular_class$hsgrad) / nrow(regular_class),
  Regular_Aid_Class = sum(regular_aid_class$hsgrad) / nrow(regular_aid_class)
)

# Print graduation rates
print(graduation_rates)
```
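
Because `hsgrad` is coded 0/1, the graduation rate in a group is simply the mean of the indicator after dropping missing values. A compact grouped version of the same calculation is sketched below on a hypothetical toy data frame, `grad_toy`.

```{r}
# Toy (hypothetical) data: 1 = graduated, 0 = did not, NA = unknown
grad_toy <- data.frame(
  kinder = c("small", "small", "small", "regular", "regular",
             "regular_aid", "regular_aid"),
  hsgrad = c(1, 1, NA, 1, 0, 0, 1)
)

# Mean of a 0/1 indicator by group, ignoring missing values
rates <- tapply(grad_toy$hsgrad, grad_toy$kinder, mean, na.rm = TRUE)
rates
```

This gives the same numbers as dividing the sum of the indicator by the number of non-missing rows in each group.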

```{r}

# Filter data for each number of years spent in small classes and remove missing values
years_in_small_class <- star %>%
  filter(!is.na(yearssmall)) %>%
  group_by(yearssmall) %>%
  summarise(grad_rate = mean(hsgrad, na.rm = TRUE))

# Print graduation rates by number of years spent in small classes
print(years_in_small_class)

```

```{r}

# Graduation rates for white (1), black (2), and hispanic (4) students
# across different class types; the "others" category is excluded because
# the question defines minority students as Blacks and hispanics
grad_rates_by_race <- star %>%
  filter(race %in% c(1, 2, 4)) %>%
  group_by(race, classtype) %>%
  summarise(grad_rate = mean(hsgrad, na.rm = TRUE), .groups = "drop")

# Print graduation rates by race and class type
print(grad_rates_by_race)
```