How to transform your data

Is log transformation the best way?

Jonatan Pallesen
08-23-2019

Introduction

Real data is often exponential. For example, let’s say that the base value of a company is $50,000, and that it increases by 3% for each unit of quality (q), so that value = 50000 * 1.03^q.


library(pacman)

p_load(tidyverse, magrittr, pander, scales, broom, rap)

source('../../src/extra.R', echo = F, encoding="utf-8")

set.seed(1)


n <- 1000

scale <- 50000

dfo <- tibble(
  q = sample(0:100, n, replace = T),
  value = scale * 1.03^q
) 

ggplot(dfo, aes(x = q, y = value)) +
  geom_point(size = 1)


Assume we are not aware of this, and we are trying to model the relationship between q and value from the data. Then we could log transform the value and fit a linear model.


ggplot(dfo, aes(x = q, y = log(value))) +
  geom_point(size = 1)


In this case we have a perfect straight line, and thus perfect prediction.
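As a quick check (this fit is not part of the original code), we can run the linear model on the log scale and back-transform the coefficients; they should recover the parameters we started from, since log(value) = log(50000) + q * log(1.03), i.e. an intercept of about 10.82 and a slope of about 0.0296.


# Sanity check: fit on the log scale and back-transform the coefficients
fit <- lm(log(value) ~ q, data = dfo)
coef(fit)       # intercept ≈ log(50000) ≈ 10.82, slope ≈ log(1.03) ≈ 0.0296
exp(coef(fit))  # back-transformed: ≈ 50000 and ≈ 1.03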

But in real life there is often noise. Here I add some random noise to the data (both noise on q inside the exponent and additive noise on the value), and also set all negative values to 0, another common feature of real-life data.


add_error <- function(df){
  df %>% mutate(
    e1 = rnorm(n, 0, 10),              # noise on q inside the exponent
    e2 = runif(n, -2*scale, 2*scale),  # additive noise on the value
    ye0 = scale * 1.03^(q + e1) + e2,
    value = ifelse(ye0 < 0, 0, ye0))   # clamp negative values to 0
}

df <- dfo %>% add_error()  

ggplot(df, aes(x = q, y = value)) +
  geom_point(size = 1, alpha = 0.5)

What is the best transformation in such real life cases?

Analysis

We cannot simply perform a log transformation, since we can’t take the log of 0. A common approach is to take the log of value + 1. Alternatively, we could use another transformation, such as the square root or a higher root. Another possibility is to try to guess the scale (in this case the base value of $50,000 we set) and take the log of value + scale; I plot this as “log_plus_scale”.

Let’s look at these four transformations, along with the untransformed values, applied to the data without noise.


get_transformations <- function(df) {
  df %>% mutate(
    untransformed = value,
    log_plus1 = log(value + 1),
    log_plus_scale = log(value + scale),
    square_root = value^(1/2),
    power_1_4 = value^(1/4))
}

plotit <- function(df){
  df %>% pivot_longer(c(untransformed, log_plus1, log_plus_scale, power_1_4, square_root), names_to = "transformation", values_to = "transformed_value") %>% 
  ggplot(aes(x = q, y = transformed_value)) +
  geom_point(alpha = 0.5, size = 1) +
  geom_smooth(method = "lm", formula = y ~ x) +
  facet_wrap(~transformation, scales = "free", nrow = 3, ncol = 2)
}

dfo %<>%  get_transformations()

plotit(dfo)

Log + 1 fits the line almost exactly, since adding 1 is negligible compared to values of $50,000 and above, and thus gives essentially perfect prediction. The other models show some deviation. The untransformed model is clearly a worse fit than the others, stressing the importance of transforming the data.

A way of quantifying the prediction strength is by looking at the R2 values.


get_r2 <- function(f, df) {lm(f, df) %>% glance() %>% pull(r.squared)}

calc_r2 <- function(df){
  tibble(
    transformation = c(
      "q ~ untransformed",
      "q ~ log_plus1",
      "q ~ log_plus_scale",
      "q ~ square_root",
      "q ~ power_1_4")) %>% 
    rap("R<sup>2</sup>" = double() ~ get_r2(transformation, df))
}
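If the experimental rap package is not available, the same table can be built with purrr::map_dbl, which is already loaded as part of the tidyverse. This is just a sketch of an equivalent helper; calc_r2_purrr is a made-up name, and the column is labelled plainly as R2.


# Sketch of the same table without rap, using purrr::map_dbl
calc_r2_purrr <- function(df){
  formulas <- c("q ~ untransformed", "q ~ log_plus1", "q ~ log_plus_scale",
                "q ~ square_root", "q ~ power_1_4")
  tibble(
    transformation = formulas,
    R2 = map_dbl(formulas, get_r2, df = df))
}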

calc_r2(dfo)
transformation R2
q ~ untransformed 0.879
q ~ log_plus1 1
q ~ log_plus_scale 0.994
q ~ square_root 0.965
q ~ power_1_4 0.991

All the transformations have a good fit for the data without noise added, with the log models performing best.


Now let’s look at their performance when we add noise.


df %<>% get_transformations()

plotit(df)

Here we see that the log_plus1 model deviates substantially from a straight line because of the 0 values: a value of 0 is mapped to log(0 + 1) = 0, while observations near the base value of $50,000 map to about 10.8, so the zeros become extreme outliers on the log scale. The other models perform better here. We can also see this quantified in the R2 values:


calc_r2(df)
transformation R2
q ~ untransformed 0.67
q ~ log_plus1 0.335
q ~ log_plus_scale 0.761
q ~ square_root 0.758
q ~ power_1_4 0.665


The log_plus_scale transformation still performs well. However, this assumes that our guess of the base value, $50,000, was exactly right. The other transformations may be more appealing, since they do not rely on such an accurate guess; how much the fit depends on the guess can be checked directly, as sketched below.
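As a quick sensitivity check (not part of the original analysis; df_check, log_low_guess and log_high_guess are names made up for illustration), we can refit the log + offset model with offsets that are off by a factor of ten in either direction and compare the R2 values:


# How much does the log + offset fit depend on guessing the offset correctly?
df_check <- df %>% mutate(
  log_low_guess  = log(value + scale / 10),  # offset guessed as $5,000
  log_high_guess = log(value + scale * 10))  # offset guessed as $500,000

c(low  = get_r2("q ~ log_low_guess",  df_check),
  high = get_r2("q ~ log_high_guess", df_check))


However, the performance of the transformations may also vary depending on the range of the data. Let’s try extending the range of q values and see what happens.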


df <- tibble(
  q = sample(0:300, n, replace = T),
  value = scale * 1.03^q
) %>% add_error() %>% 
  get_transformations()

df %>% plotit()


calc_r2(df)
transformation R2
q ~ untransformed 0.48
q ~ log_plus1 0.829
q ~ log_plus_scale 0.982
q ~ square_root 0.751
q ~ power_1_4 0.913

We see that with this larger data range, the more strongly compressing transformations (the log-based ones and the fourth root) perform well, while not transforming the data performs very poorly.


Conclusion