Regression to the mean

Analysis of regression to the mean in children of exceptional parents

Jonatan Pallesen
10-10-2018

Table of Contents


Introduction

What is the expectation of the distribution of a trait of offspring of parents with a trait value above the mean? We expect the child also to have higher than average, but less so than the parents, due to regression to the mean. In this investigation I look specifically at the trait of IQ.

According to wikipedia, the heritability of IQ is “between 57% and 73% with some more-recent estimates as high as 80% and 86%.” I choose 0.7 for the calculations here. Assortative mating for IQ is possibly around 0.4.

If the heritability of IQ is 0.7 that means that 70% of the variance is determined by heritable factors, that is, additive genetic variants. The remaining 30% of variance is explained by other factors that are not inherited. These elements are assumed to be uncorrelated and normally distributed.

The IQ of a is thus determined by:

\[\begin{equation} IQ = \sqrt{h^2} * additive + \sqrt{1 - h^2} * \mathcal{N}(\mu = 0, \sigma = 1) \end{equation}\]

The children inherit the additive portion from their two parents, while the other factors are again uncorrelated and randomly distributed with mean 0.

In this document, I use these values and formulas to simulate a population, and use it to investigate regression to the mean effects.

Further details about the model

Assortative mating is simulated by generating a mate factor, of which IQ determines 0.4 of the variance, and the rest is random. The population is then paired based on this mate factor.

The additive genetic values of offspring can’t simply be half of the parents values, since this would deplete variation. According to this paper, “Bossert (unpublished) pioneered a phenotypic version of this model, in which inheritance is represented by a fixed ‘‘segregation kernel’’ that approximates the distribution of full-sib phenotypes as a Gaussian with mean equal to the midparent value and fixed variance, independent of the midparent.” I use this same idea and sample the children’s additive values from a Gaussian distribution with variance = 1/2:

\[\begin{equation} child_additive = \mathcal{N}(\mu = \frac{\textit{parent1_additive} + \textit{parent2_additive}}{2}, \sigma = \frac{1}{2}) \end{equation}\]

This maintains variance at a steady level with no assortative mating, but gives a slight variance inflation if there is assortative mating.


Simulations

imports


library(pacman)

p_load(tidyverse, pander)

source('../../src/extra.R', echo = F, encoding="utf-8")

n <- 10**6

set.seed(10)

simulation code


h <- 0.7

assortative_mating <- 0.4



gen_parent <- function(n, h, race_factor, prefix){
  tibble(
    additive = rnorm(n),
    shared_family = rnorm(n),
    other = rnorm(n),
    mate_factor = rnorm(n),
    IQ = sqrt(h) * additive + sqrt(1 - h) * other + race_factor,
    mate_order = mate_factor * sqrt(1 - assortative_mating) + IQ * sqrt(assortative_mating),
    IQ100 = round(IQ * 15 + 100, 0),
  ) %>% 
    select_all(~glue("{prefix}_{.}"))
}

add_child <- function(df, n, race_factor, prefix, race = "white"){
  df %>% 
    mutate(
      additive = rnorm(n, mean = (parent1_additive + parent2_additive) / 2, sd = sqrt(1/2)),
      other = rnorm(n),
      IQ = sqrt(h) * additive + sqrt(1 - h) * other + race_factor,
      IQ100 = round(IQ * 15 + 100, 0),
      race = race
    ) %>% rename_at(vars(-contains("_")), ~glue("{prefix}_{.}"))
}

dff <- 
  bind_cols(
    gen_parent(n, h, 0, "parent1"), 
    gen_parent(n, h, 0, "parent2")
  ) %>% 
  add_child(n, 0, "child1") %>% add_child(n, 0, "child2")

df <- 
  bind_cols(
    gen_parent(n, h, 0, "parent1") %>% arrange(parent1_mate_order), 
    gen_parent(n, h, 0, "parent2") %>% arrange(parent2_mate_order)
  ) %>% 
  add_child(n, 0, "child1") %>% add_child(n, 0, "child2")


show simulated values


tribble(~variable, ~without_assortative_mating, ~with_assortative_mating,
  "parent_IQ_mean", mean(dff$parent1_IQ100), mean(df$parent1_IQ100),
  "parent_IQ_sd", sd(dff$parent1_IQ100), sd(df$parent1_IQ100),
  "parent_additive_sd", sd(dff$parent1_additive), sd(df$parent1_additive),
  "child_additive_sd", sd(dff$child1_additive), sd(df$child1_additive),
  "parent_additive_corr", cor(dff$parent1_additive, dff$parent2_additive), cor(df$parent1_additive, df$parent2_additive),
  "parent_IQ_corr", cor(dff$parent1_IQ, dff$parent2_IQ), cor(df$parent1_IQ, df$parent2_IQ),
  "sibling_IQ_cor", cor(dff$child1_IQ, dff$child2_IQ), cor(df$child1_IQ, df$child2_IQ),
  "child_IQ_mean", mean(dff$child1_IQ100), mean(df$child1_IQ100),
  "child_IQ_sd", sd(dff$child1_IQ100), sd(df$child1_IQ100)
)
variable without_assortative_mating with_assortative_mating
parent_IQ_mean 100 100
parent_IQ_sd 15 15
parent_additive_sd 1 1
child_additive_sd 1 1.07
parent_additive_corr -0.000475 0.282
parent_IQ_corr -0.000982 0.4
sibling_IQ_cor 0.349 0.409
child_IQ_mean 100 100
child_IQ_sd 15 15.7

These values are in agreement with those listed on wikipedia. The model with assortative mating slightly inflates the standard deviation compared to the arbitrary baseline, but I think this is acceptable for modelling purposes.


Illustrations

Regression to the mean

Parent1 IQ = 130 and parent2 IQ = 100. The midparent value is plotted with a dashed line.

plot


close <- function(n, m) {(n >= m -3) & n <= m + 3}

plot_ridges <- function(df, midval, ylow = 70, yhigh = 150){
  plot_ridges_q(df, "child1_IQ100", "child1_race") +
  scale_x_continuous(breaks = seq(ylow, yhigh, 10), limits = c(ylow, yhigh)) +
  geom_vline(xintercept = midval, linetype = "dashed") +
  labs(x = "Child IQ", y = "")
    
}

one_ridge <- function(p){
  p + labs(y = "") + 
    theme(
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank()) +
    scale_y_discrete(expand = c(0.1, 0))
}

df %>%
  filter(close(parent1_IQ100, 130), close(parent2_IQ100, 100)) %>% 
  plot_ridges(115) %>% one_ridge()


Slightly larger RTM

Both parents have IQ 130.

plot


df %>%
  filter(close(parent1_IQ100, 130), close(parent2_IQ100, 130)) %>% 
  plot_ridges(130) %>%  one_ridge()


Upwards RTM

Trait values also regress upwards towards 100, if the parents have low values.

plot


df %>%
  filter(close(parent1_IQ100, 80), close(parent2_IQ100, 60)) %>% 
  plot_ridges(70, 40, 120) %>%  one_ridge()


Sibling RTM

If one childs IQ is known to be high (130), the expected IQ distribution of that childs siblings is significantly lower.

plot


df %>%
  filter(close(child2_IQ100, 130)) %>% 
  plot_ridges(130)  %>% one_ridge() + labs(x = "Sibling IQ")


RTM only happens in one generation

Regression to the mean only happens in the next generation, it does not go on forever. For instance, let’s assume that we select a group of people with IQ at least 130 to settle another planet. We will call these the first generation. The second generation will then have lower average IQ due to regression to the mean. I then also simulate a third generation based on mating within the second generation. And we see that they have the same average IQ as the second generation.

code


dfh <- df %>% filter(parent1_IQ100 > 130, parent2_IQ100 > 130)

children <- bind_rows(
  dfh %>% select(additive = child1_additive, IQ = child1_IQ100, first_gen = parent1_IQ100),
  dfh %>% select(additive = child2_additive, IQ = child2_IQ100, first_gen = parent2_IQ100)
)

nn <- nrow(children) / 2

parent1 <- children %>% 
  head(nn) %>% 
  mutate(
    mate_factor = rnorm(nn),
    mate_order = mate_factor * sqrt(1 - assortative_mating) + IQ * sqrt(assortative_mating)
    ) %>% 
  arrange(mate_order) %>% 
  select(parent1_additive = additive, second_gen = IQ, first_gen)

parent2 <- children %>% 
  tail(nn) %>% 
  mutate(
    mate_factor = rnorm(nn),
    mate_order = mate_factor * sqrt(1 - assortative_mating) + IQ * sqrt(assortative_mating)
    ) %>% 
  arrange(mate_order) %>% 
  select(parent2_additive = additive)

dff <- bind_cols(parent1, parent2) %>% 
  add_child(nn, 0, "child") %>% 
  rename(third_gen = child_IQ100)

ylow <- 80; yhigh <- 170

dff %>% select(first_gen, second_gen, third_gen) %>% 
  gather(key = "generation", value = "IQ") %>% 
  plot_ridges_q("IQ", "generation") +
  scale_x_continuous(breaks = seq(ylow, yhigh, 10), limits = c(ylow, yhigh))


RTM in populations with different means

If the mean trait value of a population is lower, the regression to the mean will be larger. For example, there is frequently found a ~1SD gap in IQ scores between blacks and whites in US. Thus, if we compare the children of two black parents with high IQ (130) with the children of two white parents with equally high IQ, the distribution of the expected values of the black children will be slightly lower.

plot


set.seed(1)

f <- 10

whites <- bind_cols(gen_parent(n, h, 0, "parent1"), gen_parent(n, h, 0, "parent2")) %>% 
  add_child(n, 0, "child1", "white")

blacks <- bind_cols(gen_parent(n*f, h, -1, "parent1"), gen_parent(n*f, h, -1, "parent2")) %>% 
  add_child(n*f, -1, "child1", "black")

dfc <- bind_rows(whites, blacks)

dfc %>%
  filter(close(parent1_IQ100, 130), close(parent2_IQ100, 130)) %>% 
  plot_ridges(130)


Egression from the mean

Regression to the mean and egression from the mean. Children of parents that have high values tend to move closer to the mean. This is illustrated with red color in the figure below. Of course, variation in the population has to be maintained, and that happens when parents that are close to normal have children that are further from the mean, illustrated in blue in the figure below.

plot


df %>% 
  sample_frac(0.001) %>% 
  mutate(
    midparent = (parent1_IQ100 + parent2_IQ100) /2,
    effect = case_when(
      abs(midparent - 100) > abs(child1_IQ100 - 100) + 10 ~ "regression to the mean", 
      abs(midparent - 100) + 10 < abs(child1_IQ100 - 100) ~ "egression from the mean",
      T ~ "keeps around same value")
         ) %>% 
  ggplot(aes(y = child1_IQ100, x = midparent, color = effect)) +
  geom_point(alpha = 0.1) +
  ylim(50, 150) +
  xlim(50, 150) +
  guides(colour = guide_legend(override.aes = list(size=4, alpha = 1))) +
  scale_color_manual(values=c("steelblue", "#D3D3D3", "brown4"))