A deep dive into a fraudulent study from Dan Ariely.
Inside Science
Author
Jonatan Pallesen
Published
August 21, 2021
Introduction
In 2012, Shu, Mazar, Gino, Ariely, and Bazerman published a three-study paper reporting that dishonesty can be reduced by asking people to sign a statement of honest intent before providing information (i.e., at the top of a document) rather than after providing information (i.e., at the bottom of a document). This study is quite well-known, and has gathered many citations.
Recently, the excellent blog Datacolada found that this study was fraudulent; they performed a thorough analysis here. One of the columns was shown not to be genuine: it was created by adding a random number to another column. The data was also duplicated, and the duplicated rows were amusingly written in a different font. These things are obvious from looking at the data. What is less obvious is how exactly the data was manipulated in order to produce the intended effect of signing at the top. I found this a rather fascinating question, and have been puzzling over it for a while.
The data is such that the manipulations must have been performed in a particular way and order, and it is inconceivable that the data would come to look like this if it had not been purposefully manipulated to make the hypothesis true. Only one person had both access to the data and a motive to make the hypothesis true: Dan Ariely.
Also, the data manipulations were hilariously inept.
The baseline is the distance reading when the driver receives the car, and the update is the reading when the car is returned. These values are self-reported, so it is possible to report a lower distance than the true one and thereby save some money. The distance driven is the difference between the updated value and the baseline. If signing at the top produces more honesty, the self-reported distance driven should be higher in that group, and it is indeed ~2,300 higher. So far so good.
But the weird thing is that the baseline values are actually a lot higher for the Sign Bottom group. This is the initial reading when they receive the car, and should be roughly equal for the two groups.
If you are fabricating data, it makes no sense to make it look this way. So why did the fraudster do it?
Looking at various attributes of the data, I believe there is only one plausible route, which includes a series of bungled steps. I will go through these in the following. (Note that all the fraudulent aspects are already documented in the Datacolada post and the appendix. This post is about figuring out how the fraud was performed, and about recreating the steps and their effects in a synthetic data set.)
Fraud Step 1
Adding a random value to (most of the) Sign Bottom baseline values.
Fraudster logic: The data starts with roughly equal baseline and update values in the two conditions. If we add something to the baseline values for only the Sign Bottom group, that group will end up with a lower mean distance driven, since distance is update - baseline.
We can see that this was done from two attributes in the data:
1. Sign Bottom baseline is 15,000 higher on average than Sign Top.
This is shown in the table in the previous section.
2. The Sign Bottom baseline values show signs of having had a random number added to them.
There is a thorough explanation of this in the Datacolada post. The short version is that humans tend to like reporting round values, such as those divisible by 1000. So these numbers will be more common in the actual data. But once you add a random number to them, this attribute will disappear.
We can see that the Sign Top group has a characteristically human share of values divisible by 1000, whereas such values have mostly disappeared from Sign Bottom.
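As a quick illustration (a toy simulation of my own, not the study data), adding a uniform random number to round, human-style readings destroys the divisible-by-1000 signature:

Code
# Toy simulation: readings reported in round thousands are all divisible
# by 1000, but adding a uniform random number removes that signature.
set.seed(1)
readings <- sample(seq(10000, 90000, by = 1000), 1000, replace = TRUE)
mean(readings %% 1000 == 0)   # 100% are divisible by 1000
tampered <- readings + sample(0:33000, 1000, replace = TRUE)
mean(tampered %% 1000 == 0)   # drops to roughly 0.1%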
Throughout these steps I will use a recreation dataset that starts out looking like the original untampered data set. I will then apply the predicted fraudulent steps to this data set, and confirm that it looks like what we see in the fraudulent data set.
Here I add random(0, 33000) to 90% of the baseline values for Sign Bottom:
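A minimal sketch of how this step can be applied to the recreation data frame `o` used later in this post (the column names follow the later code; the 90% share and the implementation details are my assumptions):

Code
library(dplyr)

# Sketch of Fraud Step 1: add random(0, 33000) to ~90% of the Sign Bottom
# baseline readings in the recreation data set `o`.
set.seed(42)
o <- o |>
  mutate(hit = condition == "Sign Bottom" & runif(n()) < 0.9,
         baseline_car1 = if_else(hit,
                                 baseline_car1 + sample(0:33000, n(), replace = TRUE),
                                 baseline_car1)) |>
  select(-hit)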
The resulting distributions look near-identical, confirming that this step would lead to data with the observed attributes.
The reasoning for adding this to the Sign Bottom baseline makes some sense: if you keep the updated values as they are, then increasing the Sign Bottom baseline makes the distances driven shorter for Sign Bottom than for Sign Top.
It would have made more sense to increase the updated distances for Sign Top, though. Then the fraud would work in the sense that the driven distances for Sign Top would be higher, as intended, and there wouldn't be the weird attribute that the Sign Bottom baselines were already higher, which is what made people suspicious.
Fraud Step 2
Make duplicates of all the entries, and add a small amount to each copy
Fraud logic: To get a higher sample size. Adding the small amount hides the duplication.
This is where he amusingly used a different font in the other spreadsheet, so that the copies are easily distinguishable by their font. The Datacolada post shows that he added random(1, 1000) to the baseline of each copy.
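In the recreation this can be sketched roughly as follows (the `font` and `id` columns are my own bookkeeping; the details are assumptions):

Code
library(dplyr)

# Sketch of Fraud Step 2: duplicate every row, tag the copies with the
# second font, and add random(1, 1000) to the baseline of each copy.
# The `id` column links each original row to its copy.
o <- o |> mutate(id = row_number(), font = "Calibri")
copies <- o |>
  mutate(font = "Cambria",
         baseline_car1 = baseline_car1 + sample(1:1000, n(), replace = TRUE))
o <- bind_rows(o, copies)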
We know that this step must have been performed between Fraud Step 1 and Fraud Step 3, because the manipulation in Fraud Step 1 is copied in the duplicates, but that of Fraud Step 3 is not.
Fraud Step 3
Creating the updated distance values from scratch by adding random(0, 50,000) to the baseline value
Fraud logic: This is the most hilariously inept step performed. The fraudster seems to have a bad sense for numbers. But social scientists need to publish these kinds of studies to succeed, so he has to try and work with numbers as best he can using Excel.
It was quite puzzling to figure out why this seemingly meaningless step was performed, but I believe I have figured out the explanation.
After adding a random number to the Sign Bottom baseline values, there will sometimes be values where the baseline value is higher than the updated value. This will result in negative values for distance driven, which is obviously nonsensical. Perhaps these values showed up in some reported summary table, which made the fraudster notice it.
And then he panicked. So he created new updated values by adding random(0, 50,000) to the baseline values. This would solve his problem with negative distances driven, since the updated values are then always higher than the baseline values. However, it would also ruin the effect he had created in Fraud Step 1, since the difference in distance driven between Sign Top and Sign Bottom would now be determined by these random numbers, and thus go back to ~0.
So why didn't he just go back to the original data set and add to the updated values for Sign Top instead? It's hard to understand. Perhaps he had already done a lot of work on the sheet without saving intermediate versions, and it would have been hard to start over from scratch. Another possibility is that, since he didn't think of this more obvious solution initially, he still didn't think of it now.
Whatever his reasoning, the data clearly shows that creating the updated distance values from scratch is what he did:
Code
df |>
  ggplot(aes(x = distance_car1)) +
  geom_histogram(boundary = 1, fill = c1, color = NA) +
  scale_x_continuous(labels = comma) +
  theme_minimal()
The values follow a uniform distribution and stop abruptly at 50,000. The only way for this to happen is if the data were generated by… a random uniform distribution going from 0 to 50,000.
I recreate this step in the recreation data set, and verify that it looks similar to the actual fake data:
Code
o <- o |>
  mutate(r2 = sample(0:50000, nrow(o)),
         update_car1 = baseline_car1 + r2,
         distance_car1 = update_car1 - baseline_car1)

p1 <- df |>
  filter(update_car1 < 280000) |>
  ggplot(aes(x = update_car1, fill = condition)) +
  theme_minimal() +
  ggeasy::easy_move_legend(to = "bottom") +
  geom_histogram(alpha = 0.5, position = "identity") +
  scale_x_continuous(labels = comma) +
  labs(title = "Fraudulent data")

p2 <- o |>
  filter(update_car1 < 280000) |>
  ggplot(aes(x = update_car1, fill = condition)) +
  theme_minimal() +
  geom_histogram(alpha = 0.5, position = "identity") +
  ggeasy::easy_move_legend(to = "bottom") +
  scale_x_continuous(labels = comma) +
  labs(title = "Data recreation")

p1 + p2
Fraud Step 4
Reassign labels for a small subset of the data set, so that Sign Top gets higher values of distance driven
Fraud logic: After his amusingly dumb Fraud Step 3, he has solved the problem of the negative distances driven, but has introduced a new one: the desired effect of Sign Top having higher distance driven is now gone. This fraud step reintroduces that effect.
He could perhaps have solved this problem by adding some random value to the updated values for Sign Top. However, we can see that this is not what happened, since none of the distance driven values exceed 50,000. Given that the distance driven values were defined by random(0, 50,000) in Fraud Step 3, adding a further value would have pushed some of them above 50,000.
Also, if we look at the histogram, the difference is not caused solely by a skewing of the Sign Top values; there is an equal skewing of the Sign Bottom values in the opposite direction.
So instead I believe he did something very close to the following (a code sketch follows the list):

1. Look at a small subset of the data set.
2. Rearrange it from low to high distance driven.
3. Assign Sign Bottom to the lower half of this subset, and Sign Top to the upper half.
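A minimal sketch of this reassignment on the recreation data set, applied separately to each font group (as argued below); the 15% subset size is my own guess, since the true fraction is unknown:

Code
# Sketch of Fraud Step 4: within each font group, take a small random
# subset, order it by distance driven, and give the lower half the
# Sign Bottom label and the upper half the Sign Top label.
set.seed(7)
for (f in c("Calibri", "Cambria")) {
  rows <- which(o$font == f)
  idx  <- sample(rows, size = round(0.15 * length(rows)))  # subset size is a guess
  ord  <- idx[order(o$distance_car1[idx])]                  # from low to high distance
  half <- floor(length(ord) / 2)
  o$condition[ord[seq_len(half)]]  <- "Sign Bottom"         # lower half
  o$condition[ord[-seq_len(half)]] <- "Sign Top"            # upper half
}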
The interesting thing about this is that it is actually quite a clever step. If he had done this from the beginning, not only would the study have shown the desired result, it would also have been very difficult to detect the fraud. There would be none of the easily detectable signs, such as numbers divisible by 1000 being rare. It is somewhat surprising that he used such a clever approach after the bungling in the previous steps.
Unfortunately for the fraudster, he for some reason kept the manipulations performed in Fraud Steps 1-3, even though they are not necessary to achieve the desired result and only make the data set look suspicious.
Data detective analysis details
Figuring out this step was the hardest part of the puzzle for me. There are multiple attributes of the data set that have to add up, most importantly:
1. The assignment of the labels has to happen after the distances driven are generated. This can be seen from the fact that the distance driven follows a uniform distribution, which would be hard to arrive at in other ways.
2. Some of the Sign Bottom baseline values have had random values added to them. This can be seen from the fact that they are about 15,000 higher than Sign Top on average.
3. Not all of the Sign Bottom baseline values have had random values added to them. This can be seen from the fact that 5.7% of them are divisible by 1000; if a random number had been added to all of them, only ~0.1% would be divisible by 1000, since a uniform random addition leaves a value divisible by 1000 only when the added number itself is. There are also a small number of 0 values, which would not be present if a positive number had been added to them.
4. The duplicated rows have identical condition labels most of the time, but occasionally they are non-identical.
This leaves only one way to solve the puzzle of how the numbers were generated: first Fraud Steps 1-3 were performed, and then the labels were rearranged separately in the two Excel sheets.
Let's try to recreate these fraud steps using the recreation data set, and see what happens:
The recreated data set has the same traits as the actual fraudulent Excel sheet: the mean baseline is ~15k higher in Sign Bottom, while the mean distance driven is almost 3k higher in Sign Top.
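As a quick check of this claim, the group means in the recreation can be summarised like this (a sketch; the exact numbers depend on the random draws above):

Code
library(dplyr)

# Mean baseline and mean distance driven per condition in the recreation.
o |>
  group_by(condition) |>
  summarise(mean_baseline = mean(baseline_car1),
            mean_distance = mean(distance_car1))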
Code
get_stat <- function(df, condition1, font1, trait){
  if (font1 == "Both"){
    m <- df |> filter(condition == condition1)
  } else {
    m <- df |> filter(font == font1, condition == condition1)
  }
  if (trait == "Divisible by 1000"){ n <- m |> filter(baseline_car1 %% 1000 == 0) |> nrow() }
  if (trait == "Divisible by 100"){ n <- m |> filter(baseline_car1 %% 100 == 0) |> nrow() }
  if (trait == "Divisible by 10"){ n <- m |> filter(baseline_car1 %% 10 == 0) |> nrow() }
  if (trait == "Equal to 0"){ n <- m |> filter(baseline_car1 == 0) |> nrow() }
  n / nrow(m)
}

tribble(
  ~`Condition`, ~Font, ~Attribute, ~`Excel sheet`, ~Recreation,
  "Sign Top", "Cambria", "Divisible by 1000", get_stat(dff, "Sign Top", "Cambria", "Divisible by 1000"), get_stat(synth, "Sign Top", "Cambria", "Divisible by 1000"),
  "Sign Top", "Cambria", "Divisible by 100", get_stat(dff, "Sign Top", "Cambria", "Divisible by 100"), get_stat(synth, "Sign Top", "Cambria", "Divisible by 100"),
  "Sign Top", "Cambria", "Divisible by 10", get_stat(dff, "Sign Top", "Cambria", "Divisible by 10"), get_stat(synth, "Sign Top", "Cambria", "Divisible by 10"),
  "Sign Top", "Cambria", "Equal to 0", get_stat(dff, "Sign Top", "Cambria", "Equal to 0"), get_stat(synth, "Sign Top", "Cambria", "Equal to 0"),
  "Sign Top", "Calibri", "Divisible by 1000", get_stat(dff, "Sign Top", "Calibri", "Divisible by 1000"), get_stat(synth, "Sign Top", "Calibri", "Divisible by 1000"),
  "Sign Top", "Calibri", "Divisible by 100", get_stat(dff, "Sign Top", "Calibri", "Divisible by 100"), get_stat(synth, "Sign Top", "Calibri", "Divisible by 100"),
  "Sign Top", "Calibri", "Divisible by 10", get_stat(dff, "Sign Top", "Calibri", "Divisible by 10"), get_stat(synth, "Sign Top", "Calibri", "Divisible by 10"),
  "Sign Top", "Calibri", "Equal to 0", get_stat(dff, "Sign Top", "Calibri", "Equal to 0"), get_stat(synth, "Sign Top", "Calibri", "Equal to 0"),
  "Sign Bottom", "Cambria", "Divisible by 1000", get_stat(dff, "Sign Bottom", "Cambria", "Divisible by 1000"), get_stat(synth, "Sign Bottom", "Cambria", "Divisible by 1000"),
  "Sign Bottom", "Cambria", "Divisible by 100", get_stat(dff, "Sign Bottom", "Cambria", "Divisible by 100"), get_stat(synth, "Sign Bottom", "Cambria", "Divisible by 100"),
  "Sign Bottom", "Cambria", "Divisible by 10", get_stat(dff, "Sign Bottom", "Cambria", "Divisible by 10"), get_stat(synth, "Sign Bottom", "Cambria", "Divisible by 10"),
  "Sign Bottom", "Cambria", "Equal to 0", get_stat(dff, "Sign Bottom", "Cambria", "Equal to 0"), get_stat(synth, "Sign Bottom", "Cambria", "Equal to 0"),
  "Sign Bottom", "Calibri", "Divisible by 1000", get_stat(dff, "Sign Bottom", "Calibri", "Divisible by 1000"), get_stat(synth, "Sign Bottom", "Calibri", "Divisible by 1000"),
  "Sign Bottom", "Calibri", "Divisible by 100", get_stat(dff, "Sign Bottom", "Calibri", "Divisible by 100"), get_stat(synth, "Sign Bottom", "Calibri", "Divisible by 100"),
  "Sign Bottom", "Calibri", "Divisible by 10", get_stat(dff, "Sign Bottom", "Calibri", "Divisible by 10"), get_stat(synth, "Sign Bottom", "Calibri", "Divisible by 10"),
  "Sign Bottom", "Calibri", "Equal to 0", get_stat(dff, "Sign Bottom", "Calibri", "Equal to 0"), get_stat(synth, "Sign Bottom", "Calibri", "Equal to 0"),
  "Sign Bottom", "Both", "Divisible by 1000", get_stat(dff, "Sign Bottom", "Both", "Divisible by 1000"), get_stat(synth, "Sign Bottom", "Both", "Divisible by 1000"),
  "Sign Bottom", "Both", "Divisible by 100", get_stat(dff, "Sign Bottom", "Both", "Divisible by 100"), get_stat(synth, "Sign Bottom", "Both", "Divisible by 100"),
  "Sign Bottom", "Both", "Divisible by 10", get_stat(dff, "Sign Bottom", "Both", "Divisible by 10"), get_stat(synth, "Sign Bottom", "Both", "Divisible by 10"),
  "Sign Bottom", "Both", "Equal to 0", get_stat(dff, "Sign Bottom", "Both", "Equal to 0"), get_stat(synth, "Sign Bottom", "Both", "Equal to 0"),
  "Sign Top", "Both", "Divisible by 1000", get_stat(dff, "Sign Top", "Both", "Divisible by 1000"), get_stat(synth, "Sign Top", "Both", "Divisible by 1000"),
  "Sign Top", "Both", "Divisible by 100", get_stat(dff, "Sign Top", "Both", "Divisible by 100"), get_stat(synth, "Sign Top", "Both", "Divisible by 100"),
  "Sign Top", "Both", "Divisible by 10", get_stat(dff, "Sign Top", "Both", "Divisible by 10"), get_stat(synth, "Sign Top", "Both", "Divisible by 10"),
  "Sign Top", "Both", "Equal to 0", get_stat(dff, "Sign Top", "Both", "Equal to 0"), get_stat(synth, "Sign Top", "Both", "Equal to 0")
) |>
  mutate(across(c("Excel sheet", "Recreation"), ~ scales::percent(., accuracy = 0.1))) |>
  d()
Condition     Font      Attribute           Excel sheet   Recreation
Sign Top      Cambria   Divisible by 1000   0.0%          0.1%
Sign Top      Cambria   Divisible by 100    1.1%          1.1%
Sign Top      Cambria   Divisible by 10     9.8%          9.7%
Sign Top      Cambria   Equal to 0          0.0%          0.0%
Sign Top      Calibri   Divisible by 1000   35.1%         33.2%
Sign Top      Calibri   Divisible by 100    44.6%         42.2%
Sign Top      Calibri   Divisible by 10     52.9%         50.8%
Sign Top      Calibri   Equal to 0          3.2%          3.1%
Sign Bottom   Cambria   Divisible by 1000   0.1%          0.1%
Sign Bottom   Cambria   Divisible by 100    1.3%          1.1%
Sign Bottom   Cambria   Divisible by 10     10.2%         10.1%
Sign Bottom   Cambria   Equal to 0          0.0%          0.0%
Sign Bottom   Calibri   Divisible by 1000   5.7%          5.4%
Sign Bottom   Calibri   Divisible by 100    10.8%         7.4%
Sign Bottom   Calibri   Divisible by 10     22.5%         15.5%
Sign Bottom   Calibri   Equal to 0          0.2%          0.5%
Sign Bottom   Both      Divisible by 1000   2.9%          2.8%
Sign Bottom   Both      Divisible by 100    6.0%          4.3%
Sign Bottom   Both      Divisible by 10     16.3%         12.8%
Sign Bottom   Both      Equal to 0          0.1%          0.2%
Sign Top      Both      Divisible by 1000   17.6%         16.7%
Sign Top      Both      Divisible by 100    22.9%         21.7%
Sign Top      Both      Divisible by 10     31.4%         30.3%
Sign Top      Both      Equal to 0          1.6%          1.6%
We see that these numbers are all similar in the original data set and the recreated data set. (Note that the values are slightly smaller in the recreated data set. This makes sense since the recreation data set is based on the Sign Top Calibri rows in the original data set, which are slightly diluted due to the label rearrangement step.)
As mentioned, in the original Excel sheets the duplicated rows sometimes had different labels. It is hard to say exactly how often, since it is not possible to identify all the duplicated rows. If we take subsets of the data set where the duplicated rows are easily identifiable and check how many of them have mismatched labels, it seems to be something like 5-15%. If we check for mismatched labels in the recreation data set, we find a similar value:
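A sketch of this check on the recreation data set, using the `id` column from the Fraud Step 2 sketch above to pair each row with its duplicate:

Code
library(dplyr)

# Share of original/copy pairs (linked by `id`) whose condition labels
# no longer match after the label reassignment in Fraud Step 4.
o |>
  group_by(id) |>
  summarise(mismatch = n_distinct(condition) > 1) |>
  summarise(share_mismatched = mean(mismatch))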
Could it be someone other than Dan Ariely?

Dan Ariely was the one who was sent the data from the car insurance company, and he is the creator of the Excel document containing the fraudulent data. So it can't be any of the co-authors.
Is it possible that someone at the car insurance company faked the data, and Dan Ariely simply received this fake data? I would say that it is not.
One could imagine someone at the car insurance company performing Fraud Step 2 and Fraud Step 3; perhaps they were too lazy to gather the data and just generated some fake data instead.
But it is inconceivable that they would perform Fraud Step 1, and even more so Fraud Step 4. These steps serve specifically to make the research hypothesis appear true, and the car insurance company would have no incentive to do that.
Perspective
This is a case of fraud completely bungled by ineptitude. As a result, it had signs of fraud that were obvious from the most basic summary statistics of the data. And still, it was only discovered after 9 years, when someone attempted a replication. As I went through above, the fraudster had multiple obvious opportunities to manipulate the data in ways that would likely never have been discovered. In fact, it seems the only reason it was discovered is the traits the data set acquired from puzzlingly unnecessary manipulations.
This makes it seem likely that there is a lot more fraud than most people expect.
I would suggest that no study should be trusted if it doesn't release its data. I have no illusions about the goodwill of journals here; I am saying that we as a scientific community should not trust any study without open data, regardless of which journal it was published in.
Also, I think we should look at all of Dan Ariely's old studies. People who commit fraud probably do so more than once, and given the level of mathematical competence shown here, it should not be too hard to uncover.
Throughout, I analyze the numbers for car1 only, since the same was done for the other cars. I also mostly show data for the Calibri rows only, since those are the original values.