WLS data prep

Merging and processing of data from the Wisconsin Longitudinal Study

Jonatan Pallesen
05-19-2019


The long-form data from the Wisconsin Longitudinal Study (WLS) are downloaded here, and the educational attainment polygenic scores (PGS) are downloaded here.

A subset of variables is selected, the two data sets are merged on idpub and rtype, and a few further transformations are applied.



library(pacman)

p_load(tidyverse, janitor, haven, magrittr, naniar, rap, feather)

source('../../src/extra.R', echo = F, encoding="utf-8")

save_subset <- function(){
  stata <- read_dta("data/wls_bl_13_06.dta") %>% 
    select(
      idpub, rtype, familypub, birthyear = z_brdxdy, gender = z_sexrsp, swiiq_t, gwiiq_bm, 
      ses57, relfml, z_relr75, household_income = z_yfam74, income = z_yrer74, nkids75 = z_kidsno, 
      eduyears = z_edeqyr, ri001re, si001re, spouse_iq = z_spwiiq_bm, attractiveness = meanrat_trunc, 
      health92 = z_mx001rer, z_cmkdb1, z_cmkdb2, z_cmkdb3, z_cmkdb4, z_cmkdb5, z_cmkdb6, z_cmkdb7, 
      z_cmkdb8, z_cmkdb9, z_rd00401, z_rd00402, z_rd00403, z_rd00404, z_rd00405, z_rd00406, z_rd00407,
      z_rd00408, z_rd00409, party_affiliation = z_iz102rer, politics = z_iz103rer, grades = hsrscorq) %>% 
    replace_with_na_all(~.x %in% c(-1, -2, -3)) %>%
    drop_na(gender) %>% 
    mutate(gender = ifelse(gender == 1, "male", "female"))
  
  pgs <- read_dta("data/Lee_idpub_shuffled.dta") %>% select(pgs_edu = pgs_ea3_mtag, idpub, rtype)
  
  full_join(stata, pgs, by = c("idpub", "rtype")) %>% write_feather("data/subset.feather")
}  

save_subset()

df <- read_feather("data/subset.feather")
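
The recoding and merge steps in save_subset() can be tried on toy data. The sketch below is only an illustration: the two small tables and their values are invented, and the NA recoding uses dplyr's across() as a stand-in for the naniar call above.

```r
library(dplyr)

# Toy stand-ins for the two .dta files; all values are invented.
survey <- tibble(
  idpub    = c(1, 1, 2),
  rtype    = c("g", "s", "g"),
  eduyears = c(12, -2, 16)      # -2 plays the role of a missing-data code
)
scores <- tibble(
  idpub   = c(1, 2, 3),
  rtype   = c("g", "g", "g"),
  pgs_edu = c(0.3, -1.1, 0.7)
)

# Recode -1, -2 and -3 as NA in every column of the survey table.
survey_clean <- survey %>%
  mutate(across(everything(), ~ replace(.x, .x %in% c(-1, -2, -3), NA)))

# full_join keeps rows that appear in either table.
merged <- full_join(survey_clean, scores, by = c("idpub", "rtype"))
```

Because this is a full join, a person found in only one of the two files still gets a row, with NA in the columns that come from the other file.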


The number of children reported in 1975 is replaced by the number reported in 1992 when the latter is available. Only biological children are counted in the 1992 figure.


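df %<>% 
  # Flag each 1992 child slot: 1 if coded 2 (counted as a biological child), else 0
  mutate_at(vars(starts_with("z_rd004")), ~ifelse(. == 2, 1, 0)) %>%
  mutate(
    nkids92 = rowSums(select(., starts_with("z_rd004")), na.rm = TRUE),
    # A sum of 0 is treated as "no 1992 information"
    nkids92 = ifelse(nkids92 == 0, NA, nkids92),
    # Prefer the 1992 count, fall back to the 1975 count
    nkids = coalesce(nkids92, nkids75))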
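
As a small illustration of this update rule, the sketch below applies the same logic to three invented rows with only two child slots. It uses across(), the current dplyr idiom, in place of mutate_at(); the data are made up.

```r
library(dplyr)

# Invented rows: two child slots instead of nine.
toy <- tibble(
  nkids75   = c(1, 1, 3),
  z_rd00401 = c(2, 1, NA),
  z_rd00402 = c(2, NA, NA)
)

toy_out <- toy %>%
  # 1 if the slot is coded 2, otherwise 0 (NA stays NA)
  mutate(across(starts_with("z_rd004"), ~ ifelse(.x == 2, 1, 0))) %>%
  mutate(
    nkids92 = rowSums(across(starts_with("z_rd004")), na.rm = TRUE),
    nkids92 = ifelse(nkids92 == 0, NA, nkids92),
    nkids   = coalesce(nkids92, nkids75)
  )
```

Only the first row has children coded 2, so only there does the 1992 count replace the 1975 count. Note that a 1992 sum of zero is turned into NA, so such rows fall back to the 1975 value.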

The socio-economic status of the family (ses57) and the religion of the family (relfml) are the same for all siblings, so NA values are replaced with the values known from a sibling.


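df %<>% group_by(idpub) %>%
  # "downup" fills from a sibling's row in either direction; the default only fills downwards
  fill(ses57, relfml, .direction = "downup") %>% 
  ungroup()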
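
How tidyr's fill() behaves within a group depends on its .direction argument. The sketch below, on invented values, compares the default ("down") with "downup".

```r
library(dplyr)
library(tidyr)

# Two invented families; each has one member with ses57 missing.
toy <- tibble(
  idpub = c(1, 1, 2, 2),
  ses57 = c(NA, 20, 35, NA)
)

# Default direction: values are only carried downwards within a group.
filled_down <- toy %>%
  group_by(idpub) %>%
  fill(ses57) %>%
  ungroup()

# "downup": fill downwards first, then upwards for anything still missing.
filled_both <- toy %>%
  group_by(idpub) %>%
  fill(ses57, .direction = "downup") %>%
  ungroup()
```

With the default direction, the first row of family 1 stays NA because the known value sits below it. With "downup" both families are fully filled.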


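IQ scores for siblings (swiiq_t) and graduates (gwiiq_bm) are each standardized to mean 0 and standard deviation 1 and combined into a common variable, iq_std, which uses the graduate score when available and the sibling score otherwise. A second variable, iq100, rescales iq_std to the conventional IQ scale (mean 100, standard deviation 15).

The polygenic score (pgs_ea3_mtag in the source file, renamed pgs_edu here) is likewise standardized to mean 0 and standard deviation 1, overwriting pgs_edu.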


df %<>%  
  mutate(
    iq_siblings_std = stdize(swiiq_t),
    iq_graduates_std = stdize(gwiiq_bm),
    iq_std = ifelse(!is.na(iq_graduates_std), iq_graduates_std, iq_siblings_std),
    iq100 = iq_std * 15 + 100,
    pgs_edu = stdize(pgs_edu)
  )
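
stdize() is defined in extra.R, which is not shown here. The sketch below assumes it is an ordinary z-score that ignores missing values; stdize_sketch is a hypothetical stand-in, not the actual helper, and the scores are invented.

```r
library(dplyr)

# Hypothetical stand-in for stdize(): z-score, ignoring NA (an assumption).
stdize_sketch <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# Invented scores; each row has at most one of the two measures missing.
toy <- tibble(
  swiiq_t  = c(90, 100, 110, NA),
  gwiiq_bm = c(NA, 95, 105, 115)
)

toy_out <- toy %>%
  mutate(
    iq_siblings_std  = stdize_sketch(swiiq_t),
    iq_graduates_std = stdize_sketch(gwiiq_bm),
    # Graduate score if present, otherwise sibling score
    iq_std = coalesce(iq_graduates_std, iq_siblings_std),
    iq100  = iq_std * 15 + 100
  )
```

Each measure is standardized against its own mean and standard deviation before the two are combined, so iq_std is expressed relative to the group the score came from.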


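For analyses using PGS, we typically do not want several siblings from the same family, so a single member is kept per family: after dropping rows with missing pgs_edu or nkids, the row with the lowest rtype value within each idpub is retained.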


df_nosibs <- df %>% 
  drop_na(pgs_edu, nkids) %>% 
  group_by(idpub)  %>% 
  top_n(-1, rtype) %>% 
  ungroup()
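
The selection can be illustrated on invented rows. The rtype values "g" and "s" used below are assumptions made for the example; with character values, top_n(-1, rtype) keeps the alphabetically first one in each group.

```r
library(dplyr)
library(tidyr)

# Invented families; rtype values "g" and "s" are assumed for illustration.
toy <- tibble(
  idpub   = c(1, 1, 2, 3, 3),
  rtype   = c("g", "s", "s", "g", "s"),
  pgs_edu = c(0.1, 0.4, -0.2, NA, 0.9),
  nkids   = c(2, 3, 1, 2, 2)
)

toy_nosibs <- toy %>%
  drop_na(pgs_edu, nkids) %>%   # rows without PGS or child count are removed first
  group_by(idpub) %>%
  top_n(-1, rtype) %>%          # keep the lowest rtype within each family
  ungroup()
```

Family 1 keeps its "g" row, family 2 has only an "s" row, and family 3 keeps its "s" row because the "g" row was dropped for a missing pgs_edu. The order matters: dropping incomplete rows before selecting means a family is represented by another member when the preferred one lacks data.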


Save the data.


df %>% write_feather("data/wls.f")

df_nosibs %>% write_feather("data/wls_nosibs.f")