Dishonest dishonesty study

A deep dive into a fraudulent study from Dan Ariely.
Inside Science

Author: Jonatan Pallesen

Published: August 21, 2021

Introduction

In 2012, Shu, Mazar, Gino, Ariely, and Bazerman published a three-study paper reporting that dishonesty can be reduced by asking people to sign a statement of honest intent before providing information (i.e., at the top of a document) rather than after providing information (i.e., at the bottom of a document). This study is quite well-known, and has gathered many citations.

Recently, the excellent blog Datacolada found that this study was fraudulent. They performed a thorough analysis here. One of the columns was shown not to be genuine; instead, it was created by adding a random number to another column. The data was also duplicated, and the duplicated rows were, amusingly, written in a different font. These things are obvious from looking at the data. But what is less obvious is how exactly the data was manipulated in order to show the intended effect of signing at the top. I found this a rather fascinating question, and have been puzzling over it for a while.

The data is such that the manipulations must have been performed in a certain way and order. It is inconceivable that the data would come to look like this if it had not been purposefully manipulated to make the hypothesis true. And only one person had both access to the data and the motive to make the hypothesis true: Dan Ariely.

Also, the data manipulations were hilariously inept.


Data detective introduction

The first thing to consider is these numbers:

Code
library(jsmp)      # the author's personal helper package; assumed to attach the tidyverse and supply the color c1 used below
library(gt)        # summary tables
library(scales)    # comma/percent label formatting
library(patchwork) # combining plots with p1 + p2
set.seed(1)

d <- function(df){
   df |> gt() |>
      tab_options(
         data_row.padding = px(0),
         table.font.size = 16,
         table.align = "left",
         table.margin.left = px(0),
         table.border.top.style = "hidden",
         table.border.bottom.style = "hidden"
         ) |>
      cols_align(align = "left") |>
      cols_width(
         gt::everything() ~ px(150),
         c(where(is.numeric)) ~ px(150))
}


dff <- readxl::read_excel("DATA/DrivingdataAll with font.xlsx") |> 
   mutate(distance_car1 = update_car1 - baseline_car1)

df <- dff |> filter(font == "Calibri")

dff |> group_by(condition) |> 
   summarise(
      mean_baseline = mean(baseline_car1) |> round(),
      mean_update = mean(update_car1) |> round(),
      mean_distance = mean(distance_car1) |> round()
   ) |> 
   d()
condition     mean_baseline   mean_update   mean_distance
Sign Bottom   74946           98568         23623
Sign Top      59945           86150         26205


The baseline is the distance reading when the driver receives the car, and the update is the distance on return. This is self-reported, so it’s possible to write in an amount that is lower than the true one, and thereby save some money. The distance driven is the difference between the updated value and the baseline. If signing at the top gives more honesty, then this self-reported distance driven would be higher. And it is indeed ~2,600 higher. So far so good.

But the weird thing is that the baseline values are actually a lot higher for the Sign Bottom group. This is the initial reading when they receive the car, and should be roughly equal for the two groups.

If you are fabricating data, it makes no sense to make it this way. So why did the fraudster do it?

Looking at various attributes of the data, I believe there is only one plausible route, which includes a series of bungled steps. I will go through these in the following. (Note that all the fraudulent aspects are already documented in the Datacolada post and the appendix. This post is about figuring out how the fraud was performed, and about recreating the steps and their effects in a synthetic data set.)


Fraud Step 1

Adding a random value to (most of the) Sign Bottom baseline values.

Fraudster logic: If we start with data that has roughly equal values for baseline and update in the two conditions, then if we add something to the baseline values for only the Sign Bottom group, then this group will have lower mean_distance. (Since distance is update - baseline.)
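As a toy illustration of the arithmetic (made-up numbers, not from the data):

Code
# Adding a constant to the baseline lowers the computed distance by exactly
# that constant, since distance = update - baseline.
baseline <- 50000
update <- 75000
update - baseline             # 25000 driven
update - (baseline + 10000)   # 15000 driven: this group now appears to report less driving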


We can see that this was done from two attributes in the data:


1. Sign Bottom baseline is 15,000 higher on average than Sign Top.

This is shown on the table in the previous section.


2. Sign Bottom baseline values show that they have added a random number to them.

There is a thorough explanation of this in the Datacolada post. The short version is that humans tend to like reporting round values, such as those divisible by 1000. So these numbers will be more common in the actual data. But once you add a random number to them, this attribute will disappear.
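A minimal sketch of why this detection works, using hypothetical round values rather than the real data:

Code
# Start from values that are all divisible by 1000, as humans tend to report.
x <- rep(seq(10000, 100000, by = 1000), 50)
mean(x %% 1000 == 0)   # 100%

# After adding a uniform random integer, only ~1 in 1000 remain divisible.
y <- x + sample(0:33000, length(x), replace = TRUE)
mean(y %% 1000 == 0)   # ~0.1%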


Code
analyse <- function(df, v){
   tribble(
      ~Attribute, ~Percentage,
      "Divisible by 1000", (df |> filter(!!sym(v) %% 1000 == 0) |> nrow() / nrow(df)),
      "Equal to 0", (df |> filter(!!sym(v) == 0) |> nrow() / nrow(df))
   ) |> 
    mutate(Percentage = scales::percent(Percentage, accuracy = 0.1))
}

analyse(df |> filter(condition == "Sign Top"), "baseline_car1") |> 
   d() |> 
   tab_header("Sign Top")
Sign Top
Attribute           Percentage
Divisible by 1000   35.1%
Equal to 0          3.2%
Code
analyse(df |> filter(condition == "Sign Bottom"), "baseline_car1") |> 
   d() |> 
   tab_header("Sign Bottom")
Sign Bottom
Attribute           Percentage
Divisible by 1000   5.7%
Equal to 0          0.2%


We can see that the Sign Top group has the characteristically human large share of values divisible by 1000, whereas these have mostly disappeared from Sign Bottom.

Throughout these steps I will use a recreation dataset that starts out looking like the original untampered data set. I will then apply the predicted fraudulent steps to this data set, and confirm that it looks like what we see in the fraudulent data set.

Here I will add random(0, 33000) to 90% of the baseline for Sign Bottom:

Code
o_half <- dff |> filter(condition == "Sign Top", font == "Calibri")

o <- 
   bind_rows(
      o_half,
      o_half |> mutate(condition = "Sign Bottom")
   ) |> 
   mutate(font = "Calibri")

sample1 <- sample(o |> filter(condition == "Sign Bottom") |> pull(id),
                  nrow(o |> filter(condition == "Sign Bottom")) * 0.9, replace = F)

o <- o |> mutate(
   r1 = sample(0:33000, nrow(o), replace = T),
   baseline_car1 = ifelse(condition == "Sign Bottom" & id %in% sample1, baseline_car1 + r1, baseline_car1),
   id_v2 = row_number())


p1 <- df |> 
   filter(baseline_car1 < 230000) |> 
   ggplot(aes(x = baseline_car1, fill = condition)) +
   theme_minimal() +
   geom_histogram(alpha = 0.5, position = "identity") +
   ggeasy::easy_move_legend(to = "bottom") +
   scale_x_continuous(labels = comma) +
   labs(title = "Fraudulent data")

p2 <- o |> 
   filter(baseline_car1 < 230000) |> 
   ggplot(aes(x = baseline_car1, fill = condition)) +
   theme_minimal() +
   geom_histogram(alpha = 0.5, position = "identity") +
   ggeasy::easy_move_legend(to = "bottom") +
   scale_x_continuous(labels = comma) +
   labs(title = "Data recreation")

p1 + p2


We can see that they look nearly identical, confirming that this step would lead to data with the observed attributes.

The reasoning for adding this to the Sign Bottom baseline makes some sense. If you keep the updated distances as they are, then increasing the bottom baseline will make the driven distances shorter for the Sign Bottom than Sign Top.

It would have made more sense to increase the updated distances for Sign Top, though. Then the fraud would work in the sense that the driven distances for Sign Top would be higher, as intended, and there wouldn’t be the weird attribute that the Sign Bottom baselines were already higher, which is what made people suspicious.


Fraud Step 2

Make duplicates of all the entries, and add a small amount to each copy

Fraud logic: To get a higher sample size. Adding the small amount hides the duplication.

This is where he amusingly used a different font for the copies in the other spreadsheet, making them easy to distinguish. The Datacolada post shows that he added random(1, 1000) to the baseline for each of the copies.

We know that this step must have been performed between Fraud Step 1 and Fraud Step 3, because the manipulation in Fraud Step 1 is copied in the duplicates, but that of Fraud Step 3 is not.
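In the recreation data set this step looks as follows (the same code appears in the full recreation further below):

Code
# Fraud Step 2 in the recreation: copy every row, add random(1, 1000) to each
# copied baseline, and tag the copies with the other font. The originals and
# the duplicates are kept as separate "sheets", since Fraud Step 3 was
# applied to each sheet independently.
dupe <- o |> mutate(
   r3 = sample(1:1000, nrow(o), replace = T),
   baseline_car1 = baseline_car1 + r3,
   font = "Cambria")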


Fraud Step 3

Creating the updated distance values from scratch by adding random(0, 50,000) to the baseline value

Fraud logic: This is the most hilariously inept step performed. The fraudster seems to have a bad sense for numbers. But social scientists need to publish these kinds of studies to succeed, so he has to try and work with numbers as best he can using Excel.

It was quite puzzling to figure out why this seemingly meaningless step was performed, but I believe I have figured out the explanation.

After adding a random number to the Sign Bottom baseline values, there will sometimes be values where the baseline value is higher than the updated value. This will result in negative values for distance driven, which is obviously nonsensical. Perhaps these values showed up in some reported summary table, which made the fraudster notice it.
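We can check this in the recreation data set: after Fraud Step 1, some Sign Bottom rows have an inflated baseline that exceeds the (unchanged) update value:

Code
# Count impossible rows in the recreation after Fraud Step 1: a baseline
# higher than the update means a negative distance driven.
o |> 
   mutate(distance_car1 = update_car1 - baseline_car1) |> 
   filter(distance_car1 < 0) |> 
   count(condition)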

And then he panicked. So he created new updated values by adding random(0, 50,000) to the baseline values. This would solve his problem with negative distances driven, since the updated values are always higher than the baseline values. However, it would also ruin the effect he had created in Fraud Step 1: now the difference in distance driven between Sign Top and Sign Bottom would instead be determined by these random numbers, and would thus fall back to ~0.

So why didn’t he here just go back to the original data set, and add to the updated values for Sign Top instead? It’s hard to understand. Perhaps he had already done a lot of work on the sheet without saving the work between the steps, and it would be hard to start over from scratch. Another possibility is that since he didn’t realize it initially, he still did not think of this more obvious solution.

Whatever his reasoning, the data clearly shows that creating the updated distance values from scratch is what he did:

Code
df |> ggplot(aes(x = distance_car1)) +
   geom_histogram(boundary = 1, fill = c1, color = NA) +
   scale_x_continuous(labels = comma) +
   theme_minimal()


The values are from a random uniform distribution, and stop abruptly at 50,000. The only way for this to happen is if the data are generated by… a random uniform distribution going from 0 to 50,000.
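One rough way to quantify this (an approximate check, since the distances are integers while ks.test assumes a continuous distribution):

Code
# Compare the distances against a uniform distribution on [0, 50000].
# If Fraud Step 3 is as described, this should not reject uniformity.
ks.test(df$distance_car1, "punif", 0, 50000)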

I recreate this step in the recreation data set, and verify that it looks similar to the actual fake data:

Code
# Regenerate the update values from scratch: baseline + random(0, 50000).
o <- o |> mutate(
   r2 = sample(0:50000, nrow(o), replace = T),
   update_car1 = baseline_car1 + r2,
   distance_car1 = update_car1 - baseline_car1)

p1 <- df |> 
   filter(update_car1 < 280000) |> 
   ggplot(aes(x = update_car1, fill = condition)) +
   theme_minimal() +
   ggeasy::easy_move_legend(to = "bottom") +
   geom_histogram(alpha = 0.5, position = "identity") +
   scale_x_continuous(labels = comma) +
   labs(title = "Fraudulent data")

p2 <- o |> 
   filter(update_car1 < 280000) |> 
   ggplot(aes(x = update_car1, fill = condition)) +
   theme_minimal() +
   geom_histogram(alpha = 0.5, position = "identity") +
   ggeasy::easy_move_legend(to = "bottom") +
   scale_x_continuous(labels = comma) +
   labs(title = "Data recreation")

p1 + p2


Fraud Step 4

Reassign labels for a small subset of the data set, so that Sign Top gets higher values of distance driven

Fraud logic: After his amusingly dumb Fraud Step 3, he has solved the problem of the negative distances driven, but he introduced a new one: There is now no longer the desired effect that the Sign Top distance driven is higher. This fraud step reintroduces that effect.

He could perhaps solve this problem by adding some random value to the updated values for Sign Top. However, we can see that this is not what happened, since none of the distance driven values exceed 50,000. And given that the distance driven values were defined by random(0, 50,000) in Fraud Step 3, adding some further value to them would bring some of the values above 50,000.
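A quick check of this in the fraudulent data:

Code
# If further amounts had been added on top of random(0, 50000),
# some distances would exceed 50,000. None do.
df |> summarise(
   max_distance = max(distance_car1),
   n_above_50k = sum(distance_car1 > 50000))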

Also if we look at the histogram, the difference is not caused solely by a skewing of the Sign Top values, but there is an equal skewing of the Sign Bottom values in the opposite direction.

Code
df |> ggplot(aes(x = distance_car1, fill = condition)) +
   theme_minimal() +
   geom_histogram(position = "identity", alpha = 0.5, boundary = 1, binwidth = 2000)


So instead I believe he did something very close to this:

  • Look at a small subset of the dataset.
  • Rearrange it from low to high distance driven.
  • Assign Sign Bottom to the lowest half of this subset, and Sign Top to the upper half (a minimal sketch follows below).
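A minimal sketch of this relabeling, mirroring the full recreation in the next section:

Code
# Take a random 1/8 subset, sort it by distance driven, assign the lower
# half "Sign Bottom" and the upper half "Sign Top", then put it back.
s <- o |> sample_n(nrow(o) / 8, replace = F) |> 
   arrange(distance_car1) |> 
   mutate(condition = ifelse(ntile(distance_car1, 2) == 2, "Sign Top", "Sign Bottom"))

o <- bind_rows(s, o |> anti_join(s, by = "id_v2"))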

The interesting thing about this is that it is actually quite a clever step. If he had done this from the beginning, not only would the study have shown the desired result, it would have been very difficult to detect fraud in the study. There would be none of the easily detectable signs of fraud, such as round numbers divisible by 1,000 being rare. It is somewhat surprising that he used such a clever approach, after the bungling in the earlier steps.

Unfortunately for the fraudster, he for some reason kept the manipulations performed in Fraud Steps 1-3, even though they are not necessary to achieve the desired result, but still make the data set look suspicious.


Data detective analysis details

Figuring out this step was the hardest part of the puzzle for me. There are multiple attributes of the data set that have to add up, most importantly:

  • The assigning of the labels has to happen after the distance driven values are generated. This can be seen since the distance driven follows a uniform distribution, which is hard to arrive at in other ways.

  • Some of the Sign Bottom baseline values have had random numbers added to them. This can be seen since they are on average about 15,000 higher than the Sign Top values.

  • Not all of the Sign Bottom baseline values have had random numbers added to them. This can be seen since 5.7% of them are divisible by 1000; if a random number had been added to all of them, only ~0.1% would be divisible by 1000. Also, there is a small number of 0 values, which would not be present if a positive number had been added to every value.

  • The duplicated rows have identical condition labels most of the time, but occasionally they are non-identical.

So this leaves only one way to solve the puzzle of how the numbers were generated: first Fraud Steps 1-3 were performed, and then the labels were rearranged separately in the two Excel sheets.

Let us try to recreate these fraud steps using the recreation data set, and see what happens:


Code
o_half <- dff |> filter(condition == "Sign Top", font == "Calibri")

o <- 
   bind_rows(
      o_half,
      o_half |> mutate(condition = "Sign Bottom")
   ) |> 
   mutate(font = "Calibri")

sample1 <- sample(o |> filter(condition == "Sign Bottom") |> pull(id),
                  nrow(o |> filter(condition == "Sign Bottom")) * 0.9, replace = F)

o_plus <- o |> mutate(
   r1 = sample(0:40000, nrow(o), replace = T),
   baseline_car1 = ifelse(condition == "Sign Bottom" & id %in% sample1, baseline_car1 + r1, baseline_car1),
   id_v2 = row_number())

dupe <- o_plus |> 
  mutate(
    r3 = sample(1:1000, nrow(o_plus), replace = T),
    baseline_car1 = baseline_car1 + r3,
    font = "Cambria"
    )

o_plus2 <- o_plus |> mutate(
   r2 = sample(0:50000, nrow(o_plus), replace = T),
   update_car1 = baseline_car1 + r2,
   distance_car1 = update_car1 - baseline_car1)

o1 <- o_plus2 |> sample_n(nrow(o) / 8, replace = F)
o2 <- o_plus2 |> anti_join(o1, by = "id_v2")
o1 <- o1 |> arrange(distance_car1) |> 
   mutate(g = ntile(distance_car1, 2)) |> 
   mutate(condition = ifelse(g == 2, "Sign Top", "Sign Bottom"))

oo <- bind_rows(o1, o2)

dupe_plus2 <- dupe |> mutate(
   r2 = sample(0:50000, nrow(dupe), replace = T),
   update_car1 = baseline_car1 + r2,
   distance_car1 = update_car1 - baseline_car1)

d1 <- dupe_plus2 |> sample_n(nrow(dupe_plus2) / 8, replace = F)
d2 <- dupe_plus2 |> anti_join(d1, by = "id_v2")
d1 <- d1 |> arrange(distance_car1) |> 
   mutate(g = ntile(distance_car1, 2)) |> 
   mutate(condition = ifelse(g == 2, "Sign Top", "Sign Bottom"))

dd <- bind_rows(d1, d2)

synth <- bind_rows(dd, oo)

synth |> group_by(condition) |>
   summarise(
      mean_baseline = mean(baseline_car1),
      mean_distance = mean(distance_car1)
   ) |>
   d() |>
   tab_header("Recreation data set mean values")
Recreation data set mean values
condition     mean_baseline   mean_distance
Sign Bottom   76703.02        23305.74
Sign Top      61274.58        26435.05


The recreated data set has the trait that the mean baseline is ~15k higher in Sign Bottom, while the mean distance is ~3k higher in Sign Top, similar to the actual fraudulent Excel sheet.


Code
# Share of baseline values with a given attribute, for one condition and font.
get_stat <- function(df, condition1, font1, trait){
   m <- df |> filter(condition == condition1)
   if (font1 != "Both") m <- m |> filter(font == font1)
   
   n <- switch(trait,
      "Divisible by 1000" = m |> filter(baseline_car1 %% 1000 == 0) |> nrow(),
      "Divisible by 100"  = m |> filter(baseline_car1 %% 100 == 0) |> nrow(),
      "Divisible by 10"   = m |> filter(baseline_car1 %% 10 == 0) |> nrow(),
      "Equal to 0"        = m |> filter(baseline_car1 == 0) |> nrow())
   
   n / nrow(m)
}

# Compute every attribute percentage for each condition × font combination,
# listed in the same order as the table below.
combos <- tribble(
   ~Condition, ~Font,
   "Sign Top", "Cambria",
   "Sign Top", "Calibri",
   "Sign Bottom", "Cambria",
   "Sign Bottom", "Calibri",
   "Sign Bottom", "Both",
   "Sign Top", "Both") |> 
   tidyr::expand_grid(
      Attribute = c("Divisible by 1000", "Divisible by 100",
                    "Divisible by 10", "Equal to 0"))

combos |> 
   mutate(
      `Excel sheet` = purrr::pmap_dbl(list(Condition, Font, Attribute),
                                      \(cond, fnt, attr) get_stat(dff, cond, fnt, attr)),
      Recreation = purrr::pmap_dbl(list(Condition, Font, Attribute),
                                   \(cond, fnt, attr) get_stat(synth, cond, fnt, attr))) |> 
   mutate(across(c(`Excel sheet`, Recreation), ~scales::percent(., accuracy = 0.1))) |> 
   d()
Condition     Font      Attribute           Excel sheet   Recreation
Sign Top      Cambria   Divisible by 1000   0.0%          0.1%
Sign Top      Cambria   Divisible by 100    1.1%          1.1%
Sign Top      Cambria   Divisible by 10     9.8%          9.7%
Sign Top      Cambria   Equal to 0          0.0%          0.0%
Sign Top      Calibri   Divisible by 1000   35.1%         33.2%
Sign Top      Calibri   Divisible by 100    44.6%         42.2%
Sign Top      Calibri   Divisible by 10     52.9%         50.8%
Sign Top      Calibri   Equal to 0          3.2%          3.1%
Sign Bottom   Cambria   Divisible by 1000   0.1%          0.1%
Sign Bottom   Cambria   Divisible by 100    1.3%          1.1%
Sign Bottom   Cambria   Divisible by 10     10.2%         10.1%
Sign Bottom   Cambria   Equal to 0          0.0%          0.0%
Sign Bottom   Calibri   Divisible by 1000   5.7%          5.4%
Sign Bottom   Calibri   Divisible by 100    10.8%         7.4%
Sign Bottom   Calibri   Divisible by 10     22.5%         15.5%
Sign Bottom   Calibri   Equal to 0          0.2%          0.5%
Sign Bottom   Both      Divisible by 1000   2.9%          2.8%
Sign Bottom   Both      Divisible by 100    6.0%          4.3%
Sign Bottom   Both      Divisible by 10     16.3%         12.8%
Sign Bottom   Both      Equal to 0          0.1%          0.2%
Sign Top      Both      Divisible by 1000   17.6%         16.7%
Sign Top      Both      Divisible by 100    22.9%         21.7%
Sign Top      Both      Divisible by 10     31.4%         30.3%
Sign Top      Both      Equal to 0          1.6%          1.6%


We see that these numbers are all similar in the original data set and the recreated data set. (Note that the values are slightly smaller in the recreated data set. This makes sense since the recreation data set is based on the Sign Top Calibri rows in the original data set, which are slightly diluted due to the label rearrangement step.)

As mentioned, in the original Excel sheets the duplicated rows sometimes had different labels. It is hard to say exactly how often, since it’s not possible to identify all duplicated rows. If we take subsets of the data set where the duplicated rows are easily identifiable, and check how many of them have mismatched labels, it seems to be something like 5-15%. If we check for mismatched labels in the recreation data set, we find a similar value:

Code
# Compare the label of each original (Calibri) row with its duplicate (Cambria).
inner_join(oo |> select(condition_o = condition, id_v2), 
           dd |> select(condition_d = condition, id_v2), by = "id_v2") |> 
   mutate(identical_labels = condition_o == condition_d) |> 
   count(identical_labels) |> 
   d()
identical_labels   n
FALSE              800
TRUE               6040


Who committed the fraud?

Dan Ariely was the one who was sent the data from the car insurance company, and he is the creator of the Excel document containing the fraudulent data. So it can’t have been any of the co-authors.

Is it possible that someone at the car insurance company faked the data, and Dan Ariely simply received this fake data? I would say that it is not.

It could be imagined that some person at the car insurance company would perform Fraud Step #2 and Fraud Step #3. Perhaps they were too lazy to gather the data, so they just generated some fake data instead.

But it is inconceivable that they would perform Fraud Step #1, and even more so Fraud Step #4. These steps have the specific purpose of making the research hypothesis appear true, and the car insurance company would have no incentive to do this.


Perspective

This is a case of fraud that was completely bungled by ineptitude. As a result, it had signs of fraud that were obvious from just looking at the most basic summary statistics of the data. And still, it was only discovered after 9 years, when someone attempted a replication. As I went through above, the fraudster had multiple obvious opportunities to manipulate the data in a way that would likely never have been discovered. In fact, it seems that the only reason it was discovered was the traits the data set acquired through puzzlingly unnecessary fraudulent manipulations.

This makes it seem likely that there is a lot more fraud than most people expect.

I would suggest that no study should be trusted if it doesn’t release its data. I am not under any illusions about the good will of journals here. I am saying that we as a scientific community should not trust any study without open data, regardless of which journal it was published in.

Also, I think we should look at all old Dan Ariely studies. Fraudulent people probably commit fraud more than once, and given the level of mathematical competency shown here, it should not be too hard to uncover.

Dan Ariely made a reply to the accusations. (Which also uses two different fonts!)



Footnotes

Throughout, I analyze the numbers for car1 only, since the same was done for the other cars. Also, I mostly show data for the Calibri rows, since they are the original values.