WLS data prep

Merging and processing of data from the Wisconsin Longitudinal Study

Jonatan Pallesen

The long-form data from the Wisconsin Longitudinal Study is downloaded here, and the educational attainment polygenic scores (edu PGS) are downloaded here.

A subset of variables is selected, the two data sets are merged on idpub and rtype, and a few further transformations are applied.


pacman::p_load(tidyverse, janitor, haven, magrittr, naniar, rap, feather)

source('../../src/extra.R', echo = F, encoding="utf-8")

save_subset <- function(){
  # Read the raw WLS Stata file and keep (and rename) only the variables used downstream
  stata <- read_dta("data/wls_bl_13_06.dta") %>% 
    select(
      idpub, rtype, familypub, birthyear = z_brdxdy, gender = z_sexrsp, swiiq_t, gwiiq_bm, 
      ses57, relfml, z_relr75, household_income = z_yfam74, income = z_yrer74, nkids75 = z_kidsno, 
      eduyears = z_edeqyr, ri001re, si001re, spouse_iq = z_spwiiq_bm, attractiveness = meanrat_trunc, 
      health92 = z_mx001rer, z_cmkdb1, z_cmkdb2, z_cmkdb3, z_cmkdb4, z_cmkdb5, z_cmkdb6, z_cmkdb7, 
      z_cmkdb8, z_cmkdb9, z_rd00401, z_rd00402, z_rd00403, z_rd00404, z_rd00405, z_rd00406, z_rd00407,
      z_rd00408, z_rd00409, party_affiliation = z_iz102rer, politics = z_iz103rer, grades = hsrscorq) %>% 
    # Treat the codes -1, -2 and -3 as missing in every column
    replace_with_na_all(~.x %in% c(-1, -2, -3)) %>%
    drop_na(gender) %>% 
    mutate(gender = ifelse(gender == 1, "male", "female"))
  # Read the polygenic scores; pgs_ea3_mtag is renamed to pgs_edu
  pgs <- read_dta("data/Lee_idpub_shuffled.dta") %>% select(pgs_edu = pgs_ea3_mtag, idpub, rtype)
  # Keep rows from both sources, matched on person (idpub + rtype)
  full_join(stata, pgs, by = c("idpub", "rtype")) %>% write_feather("data/subset.feather")
}
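The merge step can be illustrated on toy data. This is only a sketch: the two small tables and all their values are invented, and the rtype codes "g" and "s" are assumed here to stand for graduate and sibling. It shows that a full join on both keys keeps people who appear in only one of the two sources, with NA in the columns from the other source.

```r
library(dplyr)

# Invented stand-ins for the phenotype file and the PGS file
phenotypes <- tibble(
  idpub    = c(1, 1, 2),
  rtype    = c("g", "s", "g"),
  eduyears = c(12, 16, 14)
)
scores <- tibble(
  idpub   = c(1, 2, 3),
  rtype   = c("g", "g", "g"),
  pgs_edu = c(0.3, -1.1, 0.8)
)

# Match on person (idpub + rtype) and keep unmatched rows from both sides
merged <- full_join(phenotypes, scores, by = c("idpub", "rtype"))
print(merged)
```

Rows present in only one table survive the join, so later steps that need both a phenotype and a score have to drop the incomplete rows explicitly.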


df <- read_feather("data/subset.feather")

The number of children reported in 1975 is updated with the number reported in 1992, where available. Only biological children are counted in 1992. A 1992 count of zero is treated as missing, so the 1975 value is kept in that case.

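df %<>% 
  # Recode each 1992 child-relationship variable to 1 where it equals 2, else 0
  mutate_at(vars(starts_with("z_rd004")), ~ifelse(. == 2, 1, 0)) %>%
  mutate(
    nkids92 = rowSums(select(., starts_with("z_rd004")), na.rm=T),
    nkids92 = ifelse(nkids92 == 0, NA, nkids92),
    # Prefer the 1992 count; fall back to the 1975 count
    nkids = coalesce(nkids92, nkids75))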
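The update rule can be tried on a small invented table. This is a sketch under assumptions: only two of the nine z_rd004 columns are included, the values are made up, and the reading that code 2 marks a biological child is inferred from the recode above rather than taken from the WLS codebook. It uses across() as a present-day equivalent of mutate_at().

```r
library(dplyr)

# Invented data: nkids75 plus two of the 1992 child-relationship columns
toy <- tibble(
  nkids75   = c(1, 3, NA, 2),
  z_rd00401 = c(2, NA, 2, 1),
  z_rd00402 = c(2, NA, 1, 1)
)

toy_kids <- toy %>%
  # 1 where the code equals 2, 0 otherwise (NA stays NA)
  mutate(across(starts_with("z_rd004"), ~ ifelse(.x == 2, 1, 0))) %>%
  mutate(
    nkids92 = rowSums(across(starts_with("z_rd004")), na.rm = TRUE),
    nkids92 = ifelse(nkids92 == 0, NA, nkids92),  # zero counts as "no 1992 report"
    nkids   = coalesce(nkids92, nkids75)          # 1992 value wins when present
  )
print(toy_kids)
```

Because a zero count is turned into NA before coalesce(), a person with no code-2 children in 1992 keeps the 1975 value instead of being set to zero.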

The socio-economic status of the family (ses57) and the religion of the family (relfml) are the same for all siblings, so NA values are replaced with the values known from a sibling.

df %<>% group_by(idpub) %>%
  fill(ses57, relfml) %>% 
  ungroup()
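A short sketch of how a grouped fill behaves, using invented values. tidyr's fill() defaults to `.direction = "down"`, so within each family a missing value is only filled from a row that comes before it. Whether that matters for the real data depends on row order, which is not shown here; the `"downup"` variant below is an illustration, not something the original code uses.

```r
library(dplyr)
library(tidyr)

# Invented families: the known ses57 value sits in a different position in each
family <- tibble(
  idpub = c(1, 1, 2, 2),
  rtype = c("g", "s", "g", "s"),
  ses57 = c(20, NA, NA, 35)
)

# Default direction: fill downward within each family
filled_down <- family %>% group_by(idpub) %>% fill(ses57) %>% ungroup()

# Fill in both directions within each family
filled_both <- family %>%
  group_by(idpub) %>%
  fill(ses57, .direction = "downup") %>%
  ungroup()

print(filled_down)
print(filled_both)
```

In family 2 the known value comes after the missing one, so only the `"downup"` version fills it.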

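IQ scores for siblings (swiiq_t) and graduates (gwiiq_bm) are each standardized to mean 0 and SD 1 and combined into a common variable, iq_std, which uses the graduate score when present and the sibling score otherwise. A second common variable, iq100, rescales iq_std to mean 100 and SD 15.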

The polygenic score pgs_ea3_mtag (renamed pgs_edu when the data was read) is likewise standardized to mean 0 and SD 1; the standardized values overwrite pgs_edu.

df %<>%  
  mutate(
    iq_siblings_std = stdize(swiiq_t),
    iq_graduates_std = stdize(gwiiq_bm),
    iq_std = ifelse(!is.na(iq_graduates_std), iq_graduates_std, iq_siblings_std),
    iq100 = iq_std * 15 + 100,
    pgs_edu = stdize(pgs_edu)
  )
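The helper stdize() comes from src/extra.R and its definition is not shown here. The sketch below assumes it is an ordinary z-score that ignores missing values; `stdize_sketch` and the raw scores are hypothetical and only illustrate the arithmetic of the standardize, combine and rescale steps.

```r
# Hypothetical stand-in for stdize(): z-score with missing values ignored
stdize_sketch <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Invented raw scores; NA means the score is not available for that row
graduates_raw <- c(90, 100, 110, NA, NA)
siblings_raw  <- c(NA, NA, 95, 105, NA)

iq_graduates_std <- stdize_sketch(graduates_raw)
iq_siblings_std  <- stdize_sketch(siblings_raw)

# Graduate score when present, sibling score otherwise
iq_std <- ifelse(!is.na(iq_graduates_std), iq_graduates_std, iq_siblings_std)

# Rescale to the conventional IQ metric
iq100 <- iq_std * 15 + 100
print(round(iq100, 1))
```

Each score is standardized within its own group before the two are combined, so a value of 0 in iq_std means "average for that group's test", not average across both tests pooled together.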

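Analyses using PGS typically should not include several siblings from the same family. A single member per family is therefore chosen: among the members with non-missing pgs_edu and nkids, the one with the lowest rtype within each idpub is kept.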

df_nosibs <- df %>% 
  drop_na(pgs_edu, nkids) %>% 
  group_by(idpub)  %>% 
  top_n(-1, rtype) %>% 
  ungroup()
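The selection rule can be checked on a small invented table. As a sketch, it assumes rtype takes the values "g" and "s", so that "g" sorts first and is preferred whenever that person has complete data.

```r
library(dplyr)
library(tidyr)

# Invented people: family 1 has two complete members, family 2 has a graduate
# with a missing score, family 3 has only a sibling with missing nkids
people <- tibble(
  idpub   = c(1, 1, 2, 2, 3),
  rtype   = c("g", "s", "g", "s", "s"),
  pgs_edu = c(0.2, 0.5, NA, -0.4, 1.0),
  nkids   = c(2, 3, 1, 2, NA)
)

one_per_family <- people %>%
  drop_na(pgs_edu, nkids) %>%  # only members with both values can be chosen
  group_by(idpub) %>%
  top_n(-1, rtype) %>%         # keep the lowest rtype within the family
  ungroup()
print(one_per_family)
```

Because incomplete rows are dropped before the choice is made, family 2 is represented by the sibling, and family 3 drops out entirely.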

Save the data.

df %>% write_feather("data/wls.f")

df_nosibs %>% write_feather("data/wls_nosibs.f")