---
title: "Problem Set 3"
subtitle: "AQMSS II - MA in Social Sciences - IC3JM"
author: "Pablo Pardavila Romero"
format: html
editor: visual
---

# Theoretical Concepts

### Question 1: Missing data

*What is the difference between censoring on the dependent variable and a truncated dependent variable? Describe their consequences for estimation in linear regression models and explain potential fixes. Is one of the fixes superior to the other and if so how?*

As we saw in class, there can be a problem of missing data on the dependent variable for various reasons. Two of them are censored data on the dependent variable, and a truncated dependent variable.

**Censoring of the DV** happens when the true values are only observed above or below a certain threshold; beyond it, we only know that the value is at least (or at most) the threshold. For example, many surveys place very high earners in a top category such as `>ABCD million USD$`: we cannot know how much higher the incomes in that category go, so we are "losing data" by putting them all in one basket. The independent variables are still observed for the censored cases. If our dataset has censored data and we run an OLS regression, we treat the recorded values as if they were the true ones, which leads to biased and inconsistent estimates. This can be solved with a Tobit model which, assuming a normal error distribution, accounts for the threshold in its specification. The model can handle censoring from below, from above, or at both ends of the distribution.
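
A minimal simulation sketch of this point, not taken from the course material: the data, the true slope of 2, and the top-coding threshold `cutoff` are all made up for illustration. It compares naive OLS on a top-coded outcome with a Gaussian censored regression (Tobit) fitted with `survival::survreg()`.

```{r Sketch_censoring}
library(survival)  # ships with R; survreg() fits Gaussian censored regressions

set.seed(123)
n      <- 2000
x      <- rnorm(n)
y_star <- 1 + 2 * x + rnorm(n)        # latent outcome, true slope = 2 (assumed)
cutoff <- 2                           # hypothetical top-coding threshold
y_obs  <- pmin(y_star, cutoff)        # values above the cutoff are recorded as the cutoff
uncensored <- as.numeric(y_star < cutoff)  # 1 = observed exactly, 0 = right-censored

# Naive OLS treats the top-coded values as if they were the true ones
ols_slope <- unname(coef(lm(y_obs ~ x))["x"])

# Tobit: Gaussian likelihood that accounts for right-censoring at the cutoff
tobit_fit   <- survreg(Surv(y_obs, uncensored, type = "right") ~ x,
                       dist = "gaussian")
tobit_slope <- unname(coef(tobit_fit)["x"])

round(c(true = 2, ols = ols_slope, tobit = tobit_slope), 2)
```

In this setup roughly a third of the cases are top-coded, so the OLS slope should be clearly attenuated while the Tobit slope should stay close to the true value.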

**Truncated DV** means that observations whose DV falls beyond a certain threshold are excluded from the sample altogether, so the missingness affects both the dependent and the independent variables. That is, when we do not observe `y` we cannot observe `x` either. For example, some databases on political violence only include conflicts with more than 25 deaths per year, so below 25 deaths/year we will not see any data. We know that a country may have fewer than 25 deaths per year, but we cannot know anything else about it because it is outside of the dataset. As with censored data, a truncated DV can lead to biased estimators in our regressions, because the sample is selected on the outcome. To solve this we could use a truncated regression, which requires knowing the truncation threshold.
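
To make the contrast between the two cases concrete, here is a sketch of the likelihood contributions. The notation is mine and rests on the usual assumptions of both models: a linear latent model $y_i^* = x_i'\beta + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$ and a known lower threshold $c$ (such as the 25 deaths in the example above).

Under truncation from below, only cases with $y_i > c$ enter the sample, so each observed case contributes a rescaled density:

$$
f(y_i \mid x_i, y_i > c) = \frac{\frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)}{1 - \Phi\!\left(\frac{c - x_i'\beta}{\sigma}\right)}
$$

Under censoring from below, all cases stay in the sample. With $d_i = 1$ when $y_i^* > c$ and $d_i = 0$ otherwise, each case contributes:

$$
L_i = \left[\frac{1}{\sigma}\,\phi\!\left(\frac{y_i - x_i'\beta}{\sigma}\right)\right]^{d_i} \left[\Phi\!\left(\frac{c - x_i'\beta}{\sigma}\right)\right]^{1 - d_i}
$$

The difference is that censored cases still contribute information through the probability of falling below $c$ given their observed $x_i$, whereas truncated cases contribute nothing because they never enter the data.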

Based on what we saw in class, I think that neither fix is intrinsically superior to the other: the choice depends on our research question, the nature of the data, the underlying DGP, the type of missing data (MCAR, MAR, MNAR), etc.

### Question 2: The iid assumption

*Discuss the consequences of endogenous selection on the dependent variable as well as potential fixes. How is endogenous selection related to the three types of missingness we have discussed in class (MCAR, MAR, NI)?*

Endogenous selection on the dependent variable refers to a situation where the probability of observing the dependent variable is influenced by unobserved factors that also affect the outcome (and that may be related to the independent variables). While it is not just a case of missing data, it is related to the three types of missingness.

If our data are missing completely at random (MCAR), the missingness is related to neither observed nor unobserved data, so it only leads to higher uncertainty in our estimators (but no bias problem). This is because, with values missing completely at random, we can assume that the observed data have a structure similar to the complete data, since the missingness pattern is not related to the data.

If our data are missing at random (MAR), the probability of missingness depends on variables we can observe. In this case we do not necessarily face a bias problem, and with imputation we could increase the efficiency of the estimators. While we could predict the unobserved data with the observed data, I do not think that this would solve an endogenous selection problem, because we would not be solving any problem of omitted variables.

Finally, if our data are missing not at random (MNAR or NI; i.e., the missing values could only be predicted with the data we are missing), then we face a serious problem of endogenous selection on the dependent variable: the missingness is related to unobservable factors that affect both the DV and the missingness.
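
The three cases can be illustrated with a small simulation. This is only a sketch under assumptions I chose for illustration: a true slope of 2, logistic missingness mechanisms, and listwise deletion followed by OLS. In the third mechanism, missingness depends on `y` itself, which stands in for selection on unobservables.

```{r Sketch_missingness}
set.seed(42)
n <- 5000
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)                      # true slope = 2 (assumed)

# OLS slope using only the rows that remain observed
slope <- function(keep) unname(coef(lm(y[keep] ~ x[keep]))[2])

keep_mcar <- runif(n) < 0.6                    # MCAR: missingness ignores the data
keep_mar  <- runif(n) < plogis(1 - 1.5 * x)    # MAR: depends only on the observed x
keep_mnar <- runif(n) < plogis(1 - 1.5 * y)    # MNAR: depends on the outcome itself

slopes <- c(full = slope(rep(TRUE, n)),
            mcar = slope(keep_mcar),
            mar  = slope(keep_mar),
            mnar = slope(keep_mnar))
round(slopes, 2)
```

Under these assumptions the MCAR and MAR slopes should stay near 2 (with more noise than the full sample), while the MNAR slope should be pulled towards zero because the sample is selected on the outcome.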

# R Session: Working with Panel Data

### Question 1: Finding a dataset

For this part of the problem set, I will use the **V-Dem dataset** to try to find some interesting relationships among variables that I could use for a future project. This is an opportunity to have a first look at the dataset and play around a little.

To start, I will do some **preparation** to work in R and **load** the main **packages**, as well as load the **data**:

```{r Packages_and_Data}
#| message: false
#| warning: false

library(tidyverse)
library(modelsummary)

vdem_complete <- read_csv("C:/Users/pablo/Desktop/Master Applications/UC3M MA CCSS/1º MA CCSS/AQMSS II/rlabaqmss2/dataaqmss2/vdem.csv")

# Obs: 27192 | Var: 4176
```

These data come in **panel-data form**, containing observations for countries over time, which fulfills one of the requirements of this question.
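
One quick way to check the panel structure is to verify that each country-year pair appears only once and to count the years available per country. The sketch below uses a made-up toy data frame (`toy_panel`) rather than the V-Dem file, so that it runs on its own; the same two checks could be applied to `country_id` and `year` in the real data.

```{r Sketch_panel_check}
# Toy country-year panel standing in for the real data (values are made up)
toy_panel <- data.frame(
  country_id = c(1, 1, 1, 2, 2, 3),
  year       = c(2000, 2001, 2002, 2000, 2001, 2001)
)

# In a panel, each unit-period pair should appear at most once
has_duplicates <- anyDuplicated(toy_panel[, c("country_id", "year")]) > 0

# Yearly observations per country; unequal counts mean an unbalanced panel
years_per_country <- table(toy_panel$country_id)
is_balanced <- length(unique(as.vector(years_per_country))) == 1

has_duplicates
is_balanced
```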

### Question 2: Doing some tidying

As indicated in the chunk `{r Packages_and_Data}`, the raw dataset contains more than 4,000 variables. Working with such a big dataset is extremely complicated. For that reason I will **keep** only a set of **key variables** to customize my own dataset. Note that I will not use all of these variables for my analyses in this problem set; however, I want to keep all *potentially* useful variables:

-   **Identifier variables**: `country_name` (text), `country_id` (numeric), `country_text_id` (text), `year` (date), `COWcode` (numeric). These variables may be of great help in case I need to merge this dataset with other datasets, in order to identify countries, events, year of events, etc.

-   **V-Dem Democracy Indices**: `v2x_polyarchy` (numeric \[0, 1\]), `v2xel_frefair` (numeric \[0, 1\]), `v2elfrfair` (ordinal \[0 to 4\]), `v2elaccept` (categorical -\> ordinal \[0, 1, 2, 3, 4\]). Macro-level indices that account for features of democracy. I am keeping the electoral democracy index, the clean elections index, and the election free and fair variable. I also think it could be interesting to keep the variable that accounts for election losers accepting results.

    `e_chga_demo` (dichotomous) is Cheibub *et al.*'s (2010) measure of democracy.

-   **Regime**: `v2regoppgroupsact` (ordinal \[0 to 13\]), `v2regimpoppgroup` (ordinal \[0 to 13\]), `v2regoppgroupssize` (ordinal \[0 to 4\]), `v2regopploc` (ordinal \[0 to 4\]). Four variables that indicate whether there are explicit and active regime opposition groups (and who they are), which of them is the most important opposition group, how large that group is, and where it is located (facing opposition from 100 landowners is not the same as facing it from 10,000 rural dwellers).

-   **Legislature**: `v2lgbicam` (ordinal \[0 to 2\]) to know if a regime is bicameral or not; `v2lgdsadlo` (ordinal \[0 to 4\]) and `v2lgdsadlobin` (dichotomous) these two account for the representation of disadvantaged social groups.

-   **Civil Liberty**: `v2clacjust` (ordinal \[0 to 4\]) accounts for social class equality in respect for civil liberty; `v2clrgwkch` (categorical \[0 to 21\]) and `v2clsnlpct` (percentage) account for areas of the country in which respect for civil liberties is weaker and % of total population living there.

-   **Political Equality**: `v2pepwrses` (ordinal \[0 to 4\]) and `v2pepwrsoc` (ordinal \[0 to 4\]) account for the distribution of power by socioeconomic position and by social group, respectively.

-   **Exclusion**: `v2pepwrgeo` (ordinal \[0 to 4\]) indicates the distribution of power by urban-rural location.

-   **Civic space, Political Violence and Conflict**: `v2cacamps` (ordinal \[0 to 4\]) measures the level of political polarization (note that, judging by the description in the codebook, it is closer to an affective polarization index that is subject to the criteria of the surveyed experts).

    `v2cagenmob` (ordinal \[0 to 4\]) and `v2caconmob` (ordinal \[0 to 3\]) indicate the amount of mass mobilization and its concentration in the capital. `v2cademmob` (ordinal \[0 to 4\]) indicates the amount of mass mobilization for democracy.

    `v2caviol` (ordinal \[0 to 4\]) measures political violence by non-state actors.

    `e_civil_war` (dichotomous) and `e_miinterc` (dichotomous) are variables about political violence within a country, that is, civil wars and internal conflicts.

    Some of these variables have been added up to indices that could be of good help. All of them are continuous variables (ranging from low-0 to high-1):

    `v2xpe_exlecon`, `v2xpe_exlgeo`, `v2xpe_exlpol`, `v2xpe_exlsocgr` are indices of (political) exclusion by socioeconomic group, urban-rural location, political group, and social group, respectively.

    `v2x_clphy` is an index that indicates to what extent physical integrity is respected (in ordinal form: `e_v2x_clphy`).

    `v2xcl_prpty` is an index of property rights.

    `v2xcs_ccsi` is an index of civil society robustness.

-   **Background factors**:
    `e_peaveduc` (numeric) is the average years of education among citizens older than 15.
    
    `e_area` (numeric) land area of the country in square kilometers.
    
    `e_regiongeo` (categorical \[0 to 19\]) indicates the geographical region of the world in which the country is located; `e_regionpol` (10 categories) and `e_regionpol_6C` (reduced 6-category variable) indicate its geopolitical region.
    
    `e_migdpgro` (numeric) is the GDP per capita growth rate; `e_migdppc` (numeric) is GDP per capita, and `e_migdppcln` (numeric) is the same variable transformed by the natural logarithm.
    
    `e_miinflat` (numeric) annual inflation rate
    
-   **Natural Resource Wealth**: `e_total_resources_income_pc` (numeric) is the real value of a country's petroleum, coal, natural gas, and metal produced per capita.

-   **Demography**: `e_mipopula` (numeric) total population (in thousand people); also from the World Bank `e_wb_pop`.

    `e_miurbpop` (numeric) total urban population.

Since the code chunk is very long and I have already explained the variables that I intend to keep, I have added an option to hide the code in the HTML format. You can click below to see the whole chunk of code.

```{r Reduced_Dataset}
#| code-fold: true
#| code-summary: "Show the selected variables" 

vdem_reduced <- select(vdem_complete,
    # Identifier variables
                       country_name,
                       country_id,
                       country_text_id,
                       year,
                       COWcode,
    
    # V-Dem Democracy Indices
                       v2x_polyarchy,
                       v2xel_frefair,
                       v2elfrfair,
                       v2elaccept,
                       e_chga_demo,
    
    # Regime
                       #v2regoppgroupsact,
                       v2regimpoppgroup,
                       v2regoppgroupssize,
                       v2regopploc,
    
    # Legislature
                       v2lgbicam,
                       v2lgdsadlo,
                       v2lgdsadlobin,
    
    # Civil Liberty
                       v2clacjust,
                       #v2clrgwkch,
                       v2clsnlpct,
    
    # Political Equality
                       v2pepwrses,
                       v2pepwrsoc,
    
    #Exclusion
                       v2pepwrgeo,
    
    # Civic space, Political Violence and Conflict
                       v2cacamps,
                       v2cagenmob,
                       v2caconmob,
                       v2cademmob,
                       v2caviol,
                       e_civil_war,
                       e_miinterc,
                       v2xpe_exlecon,
                       v2xpe_exlgeo,
                       v2xpe_exlpol,
                       v2xpe_exlsocgr,
                       v2x_clphy,
                       #e_v2x_clphy,
                       v2xcl_prpty,
                       v2xcs_ccsi,
    
    # Background factors
                       e_peaveduc,
                       e_area,
                       e_regiongeo,
                       e_regionpol,
                       e_regionpol_6C,
                       e_migdpgro,
                       e_migdppc,
                       e_migdppcln,
                       e_miinflat,
    
    # Natural Resource Wealth
                       e_total_resources_income_pc,
    
    # Demography
                       e_mipopula,
                       e_wb_pop,
                       e_miurbpop
                        
) 

# Obs: 27192 | Var: 48

# IMPORTANT NOTE: For some reason, the variables v2regoppgroupsact, v2clrgwkch, e_v2x_clphy are not found in the dataset. (Note to myself: try to download the latest version of the data. If that still does not work, ask Patrick).

```
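
Regarding the variables that are not found: one possible way to handle this without commenting lines in and out is to list the wanted names in a character vector, report which ones are absent, and select with `any_of()`, which skips names that do not exist. The sketch below uses a made-up toy tibble, and `not_in_data` is a hypothetical variable name.

```{r Sketch_any_of}
library(dplyr)

# Toy data frame; `not_in_data` is a hypothetical name that is absent from it
toy <- tibble(country_name = c("A", "B"),
              year         = c(2000, 2001),
              v2caviol     = c(0.3, -1.2))
wanted <- c("country_name", "year", "v2caviol", "not_in_data")

# Names that were requested but do not exist in the data
missing_vars <- setdiff(wanted, names(toy))

# any_of() keeps the columns that exist and silently skips the rest
toy_selected <- select(toy, any_of(wanted))

missing_vars
names(toy_selected)
```

Printing `missing_vars` first keeps the skipped names visible, so a variable that disappears in a new data version does not go unnoticed.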


Having customized my potential dataset, I will now create a new object with **fewer variables** to do this problem set. In this new object I will just keep a **dependent variable** (`v2caviol`), a key **independent variable** (`v2pepwrgeo`), and a set of **control variables**.

```{r Problem_Set_Dataset}
#| code-fold: true
#| code-summary: "Show the selected variables for this Problem Set" 

vdem_ps3 <- select(vdem_reduced,
                    
    # Dependent variable (Internal Conflicts by non-state actors)
                    v2caviol,                
    
    # Key Independent variable (Power distribution by urban-rural location)
                    v2pepwrgeo,
    
    # Controls
         ## Country, region, and year
                    country_name,
                    country_id, 
                    e_regionpol_6C,
                    year, 
        
        ## Democracy/Dictatorship
                    e_chga_demo, 
        
        ## Political (affective?) polarization
                    v2cacamps,
    
        ## Mass mobilization
                    v2cagenmob,
    
        ## Index of exclusion by urban-rural location
                    v2xpe_exlgeo,
    
        ## Index of property rights
                    v2xcl_prpty,
    
        ## Index of civil society robustness
                    v2xcs_ccsi,
    
        ## Other demographic and economic controls
            ### Population
                    e_mipopula,
    
            ### Logged GDP per capita
                    e_migdppcln ,
    
            ### Annual inflation rate
                    e_miinflat,
    
            ### Natural resources value
                    e_total_resources_income_pc, 
    
            ### Average years of education
                    e_peaveduc
                    
    )

# Obs: 27192 | Var: 17

# NOTE: I do not know which of the two variables (country_name or country_id) to use to run a FE model. For this reason I am keeping both. 

```
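
On the question in the note above: as long as each country name maps to exactly one id, both identifiers define the same groups, so a fixed-effects model should give the same slope with either. The sketch below checks this on a made-up toy panel using country dummies in `lm()`; all variable names and values are hypothetical. The one practical caveat is that a numeric id must enter as a factor, otherwise it is treated as a continuous regressor.

```{r Sketch_fe_identifier}
set.seed(1)
# Toy panel: 5 hypothetical countries observed over 10 years
toy <- expand.grid(country_id = 1:5, year = 2001:2010)
toy$country <- paste0("Country_", LETTERS[toy$country_id])  # one name per id
country_effect <- c(-2, -1, 0, 1, 2)[toy$country_id]
toy$x <- rnorm(nrow(toy)) + country_effect
toy$y <- 0.5 * toy$x + country_effect + rnorm(nrow(toy))

# Country fixed effects as dummies: the identifier must enter as a factor
fe_by_id   <- lm(y ~ x + factor(country_id), data = toy)
fe_by_name <- lm(y ~ x + factor(country), data = toy)

# Both identifiers define the same groups, so the slope on x coincides
c(by_id = coef(fe_by_id)[["x"]], by_name = coef(fe_by_name)[["x"]])
```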


I will also **rename** these variables to make them more readable and work more efficiently with them:

```{r Renaming}
#| message: false
#| results: false

vdem_ps3 <- rename(vdem_ps3,

    # Dependent variable (Internal Conflicts by non-state actors)
                    political_violence = v2caviol,              
    
    # Key Independent variable (Power distribution by urban-rural location)
                    power_distribution = v2pepwrgeo,
    
    # Controls
         ## Country, region, and year
                    country = country_name,
                    country_id = country_id,
                    region = e_regionpol_6C,
                    year = year,
        
        ## Democracy/Dictatorship
                    democracy_dictatorship = e_chga_demo,
        
        ## Political (affective?) polarization
                    polarization = v2cacamps,
    
        ## Mass mobilization
                    mobilization = v2cagenmob,
    
        ## Index of exclusion by urban-rural location
                    exclusion_urb_rur = v2xpe_exlgeo,
    
        ## Index of property rights
                    property_rights = v2xcl_prpty,
    
        ## Index of civil society robustness
                    civil_society_robustness = v2xcs_ccsi,
    
        ## Other demographic and economic controls
            ### Population
                    population = e_mipopula,
    
            ### Logged GDP per capita
                    log_gdp_pc = e_migdppcln,
    
            ### Annual inflation rate
                    inflation_rate = e_miinflat,
    
            ### Natural resources value
                    resources = e_total_resources_income_pc,
    
            ### Average years of education
                    education = e_peaveduc
          )

```

In this reduced dataset for this problem set there are a total of **17 variables** and 27,192 observations. Here is a brief overview of the dataset:

```{r Overview_of_data}

skimr::skim(vdem_ps3)

```

Finally, I will create some **informative plots** to visualize the distribution of each variable:

```{r Distribution_plot_DV}
#| warning: false
#| message: false

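# What do I want to plot? I want to see (for the 6 regions?) how the political violence indicator has evolved over time.

# Check the structure of political_violence: why on earth do negative numbers appear? (Possibly because V-Dem reports measurement-model estimates on a latent scale rather than the original 0 to 4 categories; to be confirmed in the codebook.)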

ggplot(filter(vdem_ps3, year >= 1950),
       aes(x = year, y = political_violence)) +
  geom_boxplot(mapping = aes(group = year)) +
  labs(x = "Year", y = "Political Violence Index")
  #+ facet_wrap(~region, nrow = 2)

```


```{r Distribution_plot_IV}



```

```{r Correlation_graph_CVs}



```


### Question 3: Estimating a (naïve) regression model

### Question 4: Addressing the data issue and estimating a new model