The Internet’s hidden science factory
Illustration by Edel Rodriguez
In a small apartment in a small town in northeastern Mississippi, Sarah Marshall sits at her computer, clicking bubbles for an online survey, as her 1-year-old son plays nearby. She hasn’t done this exact survey before, but the questions are familiar, and she works fast. That’s because Marshall is what you might call a professional survey-taker. In the past five years, she has completed roughly 20,000 academic surveys. This is her 21st so far this week. And it’s only Tuesday.
Marshall is a worker for Amazon’s Mechanical Turk, an online job forum where “requesters” post jobs, and an army of crowdsourced workers complete them, earning fantastically small fees for each task. The work has been called microlabor, and the jobs, known as Human Intelligence Tasks, or HITs, range wildly. Some are tedious: transcribing interviews or cropping photos. Some are funny: prank calling someone’s buddy (that’s worth $1) or writing the title to a pornographic movie based on a collection of dirty screen grabs (6 cents). And others are downright bizarre. One task, for example, asked workers to strap live fish to their chests and upload the photos. That paid $5 — a lot by Mechanical Turk standards.
Mostly, Marshall is a sort of cyber guinea pig, providing a steady stream of data to academic and scientific research. This places her squarely inside a growing culture of super-savvy, highly experienced study participants.
As she works, she hears a rustling noise. “Grayson, are you in my garbage can?”
In the kitchen, the trash can’s on its side. Her son has liberated an empty box of cinnamon rolls and dumped the remaining contents on the floor. She goes to him, scoops him up and carries him back to the living room, where he circles the carpet, chattering happily as she resumes typing.
“I’m never going to be absolutely undistracted, ever,” Marshall says, and smiles.
Her employers don’t know that Marshall works while negotiating her toddler’s milk bottles and giving him hugs. They don’t know that she has seen studies similar to theirs maybe hundreds, possibly thousands, of times.
These factors are such a draw for researchers that, in certain academic fields, crowdsourced workers are outpacing psychology students — the traditional go-to study subjects. And the studies are a huge draw for many workers, who tend to participate again and again and again.
These aren’t obscure studies that Turkers are feeding. They span dozens of fields of research, including social, cognitive and clinical psychology, economics, political science and medicine. They teach us about human behavior. They deal in subjects like energy conservation, adolescent alcohol use, managing money and developing effective teaching methods.
“Most of what’s happening in these studies involves trying to understand human behavior,” said Yale University’s David Rand. “Understanding bias and prejudice, and how you make financial decisions, and how you make decisions generally that involve taking risks, that kind of thing. And there are often very clear policy implications.”
As the use of online crowdsourcing in research continues to grow, some are asking the question: How reliable are the data that these modern-day research subjects generate?
The early adopter
In 2010, the researcher Joseph Henrich and his team published a paper showing that an American undergraduate was about 4,000 times more likely than a random person outside the Western world to be the subject of a research study.
But that output pales in comparison to Mechanical Turk workers. The typical “Turker” completes more studies in a week than the typical undergraduate completes in a lifetime. That’s according to research by Rand, who surveyed both groups. Among those he surveyed, he found that the median traditional lab subject had completed 15 total academic studies — an average of one per week. The median Turker, on the other hand, had completed 300 total academic studies — an average of 20 per week.
“Which is just crazy,” Rand said. “And for a lot of experiments, that’s a big problem.”
Rand, a young, energetic behavioral economist who accessorizes his suit jacket with gray Converse shoes and orange-striped socks, works on the second floor of a beautiful cathedral-like building on Yale’s main campus. Behind his desk are shelves of robot toys, a nod to his pre-professor days, when he fronted an electro-punk band called Robot Goes Here. “I actually had a record deal,” he said.
Rand was an early proselytizer for Mechanical Turk. In fact, he authored the first study in his field that encouraged scientists to tap Turkers for surveys. At the time, he gave talks to fellow researchers, telling them recruiting via Mechanical Turk could be done more quickly, cheaply and easily and could be “just as valid as other types of experiments.”
That was in 2010. Since then, his early enthusiasm has been tempered with caution. He’s been following the forum for nearly a decade and has come to believe that it has some serious limitations. First, there’s the question of dropout rates. Turkers are more likely to drop out mid-study, and that can skew the results. Then there’s the question of environmental control. Read: There is none. In the lab, it’s easy to monitor survey takers; not so online. Who’s to say they’re not watching reality television while working, or drinking a few beers on the job? To guard against this, researchers test a worker’s focus by planting “attention checks” in their surveys. “Have you ever eaten a sandwich on Mars?” a question might read. Or “Have you ever had a fatal heart attack?” But the attention check questions are often recycled, and experienced workers spot them immediately. (“Whenever I see the word vacuum, I know it’s an attention check,” Marshall has said.)
But it’s the absence of gut intuition from experienced workers that concerns Rand the most.
A person’s gut response to a question is an important measurement in many social psychology studies. It’s common to compare the automatic, intuitive part of the decision-making brain with the part that’s rational and deliberate. But a psychologist testing for this among professional survey takers may very well be on a fool’s errand.
Katie Hays, 28, makes about $200 a week Turking in Biloxi, Mississippi, and has observed a change in her own survey performance. “It’s hard to reproduce a gut response when you’ve answered a survey that’s basically the same 200 times,” she said. “You kind of just lose that freshness.”
Recently, Rand recruited a group of college students and Turkers — 5,831 total subjects — to perform a series of experiments testing whether cooperative behavior that’s successful in daily life will spill over into what’s known as a Public Goods game. In the game, participants were given a small amount of money and asked to make a choice: how much cash to keep and how much to contribute to a common pool that benefits the other players. He then made the players more or less inclined to rely on their gut responses by forcing them to answer quickly or to stop and consider their choices. Among college students, the pattern was clear. When forced to answer quickly, they were more likely to contribute to the common good; the more time they had to deliberate, the more they hoarded for themselves. But experienced Turkers behaved differently. Among them, the impulse to share had largely disappeared, likely because they knew the game, and had learned that sharing was not a good strategy. The study was published in the journal Nature Communications in April 2014.
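The incentive structure of a Public Goods game can be sketched in a few lines of Python. The endowment, multiplier, and group size below are illustrative assumptions, not the parameters Rand actually used:

```python
def public_goods_payoffs(contributions, endowment=40, multiplier=2.0):
    """Each player keeps (endowment - contribution), in cents; the pooled
    contributions are multiplied and split evenly among all players."""
    pool = sum(contributions) * multiplier
    share = pool / len(contributions)
    return [endowment - c + share for c in contributions]

# Four players each start with 40 cents; three contribute fully, one free-rides.
print(public_goods_payoffs([40, 40, 40, 0]))  # [60.0, 60.0, 60.0, 100.0]
```

The free rider walks away with 100 cents to the contributors’ 60 — exactly the lesson an experienced player internalizes — even though a group that all contributes (80 cents each) beats a group that all hoards (40 cents each).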
There are two critical points here as they relate to Mechanical Turk. The first is that frequent Mechanical Turk workers are fluent in these experiments on arrival. They know how to play the game. But also, perhaps more importantly, their natural human impulses from daily life, as they apply to the game, no longer exist.
“If you’re running social psychology studies on Turk, watch out, because [the subjects] have gotten experienced, and that can change effects,” Rand said. “So if you run my experiment on Turk right now, you won’t get any effect. Which sucks for me.”
To be clear, extreme experience isn’t always a problem, Rand said. There are some psychological tests that are so robust that no amount of experience will override the effect, said Jesse Chandler, a researcher at the University of Michigan’s Institute for Social Research. Take the Stroop effect, which involves naming the ink color of a word when it doesn’t match the color the word spells out. When the word “red,” for example, is printed in green, it takes longer to override the automatic reading of the text and answer green.
“It’s a really powerful effect,” Chandler said. “No amount of practice, no amount of awareness will completely override that happening.”
And some puzzles with tricky instructions can benefit from a professional survey taker — someone who’s fluent in the game, said Rand, who notes that he still uses Mechanical Turk often and considers it a great tool, provided that you understand its limitations.
“But if a puzzle has a trick to figure out, once people have seen it a few times, they get it,” Rand said. “If they’ve done the task a lot of times before, their intuitions are corrupted, basically. That original impulse just isn’t there anymore.”
It could be argued that the qualities that make these subjects natural and fallible, the very things that make a human human, get swallowed up by experience.
Mechanical Turk takes its name from a late-18th century automaton chess player that wowed crowds, outthinking humans and defeating nearly every opponent it faced, including Napoleon Bonaparte and Benjamin Franklin. But it had a secret. After some 50 years of winning, it was revealed that a hidden chess master was concealed inside, working the pieces. There was a human behind the machine.
Amazon’s modern-day Mechanical Turk has humans too, gobs of them, powering its virtual machine. But is it possible it has its own dirty secret? If so, it’s this: The humans behind the machine have essentially become machines.
You can’t trick the Turk
It was in 1675 that physicist Sir Isaac Newton, in a letter, famously scrawled the words, “If I have seen further it is by standing on the shoulders of giants.” Scientists are forever teetering on the shoulders of other scientists, and nowhere is that better represented than by the well-tested paradigms used by researchers who deal in surveys. Games, puzzles or individual questions that have proven effective at measuring something are reused constantly with minor modifications. When something has been proven to work, there’s value in not reinventing the wheel.
Back in Booneville, Mississippi, Marshall pulls up a screen with one of the most commonly used study questions: “A bat and a ball cost $1.10. The bat costs one dollar more than the ball. How much does the ball cost?”
Think about it for a second. The answer is 10 cents, right? Nope. It’s a nickel.
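The algebra behind the trick takes only two lines. A quick sketch in Python, working in cents so the arithmetic stays exact:

```python
# bat + ball = 110 cents, and bat - ball = 100 cents,
# so 2 * ball = 110 - 100, meaning the ball costs 5 cents.
total, difference = 110, 100
ball = (total - difference) // 2
bat = ball + difference
assert bat + ball == total and bat - ball == difference
print(ball, bat)  # 5 105: a nickel for the ball, $1.05 for the bat
```

The intuitive answer fails the check: a 10-cent ball would force a $1.10 bat, for a total of $1.20.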
“The first time I saw this question, it would have tripped me up,” Marshall said. “It’s one of those things — it’s supposed to challenge your critical thinking. It doesn’t do that for me anymore.”
Nobel laureate Daniel Kahneman, a behavioral economist at Princeton University, wrote about this puzzle in his book, “Thinking, Fast and Slow.” “Many thousands of university students have answered the bat-and-ball puzzle and the results are shocking,” he wrote. “More than 50 percent of students at Harvard, MIT, and Princeton gave the intuitive — incorrect — answer.” At less selective universities, he continued, more than 80 percent got it wrong. “The distinctive mark of this easy puzzle,” he wrote, “is that it evokes an answer that is intuitive, appealing, and wrong.”
Marshall answers it correctly every time. That’s because, on average, it appears in one of the surveys she sees every single day.
She not only knows the answer, but also the two questions that will almost inevitably follow. The first involves using machines to make widgets, and the second is about the rate that lily pads expand in a pond. “They’re always in the exact same order and the exact same questions,” she says.
Marshall admits that in her line of work, she sometimes observes her behavior becoming mechanical. “You just get in a motion,” she said. “And you’ll see particular questions and it’s like, if I see the same block of questions twice on the same day, I even know the pattern for my answers.”
‘I knew I was playing a robot’
To get to Marshall’s house in Booneville, you drive south from Corinth on 45, past the Thrasher Baptist Church, the Ol’ 45 One Stop and acres of empty fields. The town is partly rundown, with trailers overgrown with weeds, a gutted town theater and abandoned stores with scratched out signs.
But the 24-year-old has managed to eke out a living independent from Booneville. Inside the apartment, duct tape holds together an old air conditioning unit; a red blanket pinned to the window is a makeshift curtain. Marshall’s husband, Isaiah, paints cars at a nearby auto-body shop, and she “Turks” to help pay the bills. What began as a hobby turned into a job after Grayson was born. That’s when she realized she could make more money Turking than she was making as a line cook in a nearby deli — and not come home with aching legs, smelling like sandwiches. Plus, she could spend her days at home with her son.
It’s noon, and she’s now working on a UC San Diego survey — a game that pits her against another worker, who lobs nasty comments at her as they play.
“I think they might be testing the psychological impact of trash talk,” Marshall says. She suspects the other worker is actually a “bot” pretending to be a person. To test her theory, she throws out a nonsensical comment. “I enjoy walruses,” Marshall types. The response from the other player: “Also, I’m taking that bonus money. And you’re going to lose.”
“Yeah, it’s a robot,” Marshall says.
Under the desk, Grayson sneezes. It’s a slow day. Marshall has completed five surveys so far, and made $8.35. In a message, she alerts the researcher, “I knew I was playing a robot.”
“I can’t comfortably separate my honesty from money, because at the end of the day, all you really have is your integrity,” she says. “And if you are willing to sell that for a dollar, it says a lot about the kind of person you are.”
Marshall prides herself on being honest. There are jobs she won’t do. She won’t engage in what she calls the “seedy parts of the Internet,” like the HITs that pay workers to write a five-star review for a restaurant they’ve never dined at or an app they’ve never used. She has other limits too: For example, no more jobs involving cartoon characters having sex: “I’ve seen Inspector Gadget doing very dirty things to Penny,” she says.
As for surveys, when she thinks her experience might be compromising the data, she sends the research team an alert: “I realized there was some deception,” she’ll write. Or: “I’ve seen these questions before.”
And as for becoming too robotic in her answers?
“When I start to feel like I’m becoming really mechanical or like I am a machine who takes surveys, I’ll say, ‘I’m going to stop now and regain my humanity,’” Marshall says. “And I’ll start again tomorrow when I start to feel like a person again.”
Sometimes Rand wishes he and other early adopters had never let the Mechanical Turk genie out of the bottle.
“Everything that I study is public goods, right?” he said. “Exploitable resources and how you get people to not overexploit resources. But in some sense, Mechanical Turk is just that. Because it got so popular, it’s overexploited, and now it doesn’t work for the things that I was originally wanting to use it for. There’s a little bit of ‘Man, couldn’t we have just kept our mouth shut and kept it as this nice, clean thing for ourselves?’”
But, as Rand acknowledges, if it wasn’t him, some other researcher would have discovered Mechanical Turk eventually and let it loose on the masses. There was no keeping it contained.
Flawed by design?
Not long ago, Chandler and other researchers tackled head-on the question of extreme experience, which they refer to as “nonnaïveté.” Early in their article, published in March 2014 in the journal Behavior Research Methods, they sum up their concerns about experience: Turkers repeat the same or similar studies, they share information through discussion boards, and plug-ins allow them to complete the tasks of favored requesters.
“While participant nonnaïveté can also be an issue in traditional participant pools, M. Turk workers might share information in a more systematic, permanent, and searchable manner, with more dramatic consequences for data validity,” the article reads.
The team took a sample of 16,408 Mechanical Turk HITs, completed by 7,498 workers, and found that a fraction of the workers had completed a disproportionate number of the HITs: 10 percent of workers had completed 41 percent of them. (The study’s authors refer to this group as “super Turkers.”) In another sample of 132 published papers, however, the team found that only 5 percent of researchers had addressed worker experience, or nonnaïveté, as a possible limitation of the study.
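The concentration statistic behind the “super Turker” finding is simple to compute. A minimal sketch on a hypothetical sample (the worker IDs and counts here are invented, not the study’s data):

```python
from collections import Counter

def top_decile_share(worker_per_hit):
    """Fraction of all HITs completed by the busiest 10% of workers."""
    counts = Counter(worker_per_hit)
    ranked = sorted(counts.values(), reverse=True)
    top_n = max(1, len(ranked) // 10)
    return sum(ranked[:top_n]) / len(worker_per_hit)

# Hypothetical sample: two prolific workers and 18 occasional ones.
hits = ["w1"] * 30 + ["w2"] * 25 + [f"w{i}" for i in range(3, 21) for _ in range(5)]
print(round(top_decile_share(hits), 2))  # 0.38
```

Even in this toy sample, a tenth of the workers account for well over a third of the completed tasks.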
“We found that researchers basically ignore this,” Chandler said. “They’ll run multiple studies, and there will be no mention of whether they excluded duplicate workers.”
Why aren’t researchers more forthcoming about this information?
The sheer number of experienced respondents is an issue that researchers do not yet appreciate, the study concludes, and it offers a recommendation: that researchers avoid commonly used paradigms and, “at minimum, make an effort to measure whether participants have participated in similar experiments before.”
Researchers need to think carefully about how they’re using Mechanical Turk and whether it’s appropriate for their studies at all, said Gabriele Paolacci, one of the study authors, and an assistant professor of marketing at Erasmus University Rotterdam in the Netherlands. If researchers receive poor data after using the forum mindlessly, relying on common questions or underpaying workers, “they shouldn’t blame the market, they should blame themselves,” he said.
Early results from the team suggest another potentially interesting finding: Turkers seem more likely to produce false negatives (failing to observe a phenomenon that exists) than false positives (observing an effect that doesn’t exist). A study showing a relationship between vaccines and autism that doesn’t really exist would be a false positive; a test that fails to show the effectiveness of a successful drug would be a false negative.
“It’s debatable which one’s worse,” Chandler said. “But we could imagine an alternate world where workers follow requesters, and they want to be helpful, so they learn their hypotheses and create fictional data to support that. That’s the more troubling outcome.”
“Historically, science has treated false positives as more worrisome,” Paolacci said.
The humans inside the machine
But consider this.
Numerous studies in various academic fields — social psychology, cognitive psychology and clinical psychology, among others — have shown that Turkers provide data of equal or better quality than more traditional participant pools. They’re also more diverse.
Susan Fiske, a professor of psychology at Princeton University who has used Mechanical Turk “dozens of times” in her lab, said, compared to undergraduates, “it’s way more representative of American people.”
“From my point of view as a social psychologist, this is so much better than running college students,” Fiske said. “And hands down so much better than college sophomores.”
A study slated for publication in the journal Field Methods found that Turkers, compared with people recruited via Google ads, provided better data: fewer “don’t know” responses and more disclosure. They were also more likely to provide their cellphone numbers and report their household incomes, said Christopher Antoun, a doctoral candidate in survey methodology at the University of Michigan and the study’s lead author.
“We were surprised by it,” Antoun said. “My intuition going into it was that these workers might provide low-quality data. They might rush through to get to the next task. But there were fewer ‘don’t know’ answers provided by M Turk recruits.”
In other words, they weren’t phoning it in.
Independent reporting by the NewsHour supports that. To report this story, we decided to go “meta” and post our own HITs to Mechanical Turk. First, we sought experienced workers to interview about their own experiences in this line of work. We offered them no money, though Amazon did charge us 5 cents to post.
Since we were particularly interested in those with experience, we asked for workers with a 98 percent approval rating and an Amazon-awarded “Master’s degree.” (The Master’s degree, we would later learn, is like the unicorn of Mechanical Turk. No one knows why or when Amazon awards it. But once acquired, it’s a window into a whole new dimension of HITs, and that means more earnings opportunities.)
In our second HIT, we asked workers with a 95 percent approval rating who had completed more than 5,000 HITs to rank the questions they most commonly see repeated. No Master’s required this time. For this one, we paid each worker $1. The responses flooded in — 100 answers in 28 minutes.
Like Marshall, the workers we interviewed said they recognized many repeated questions and were aware of attempts to manipulate them. But many also reported a degree of pride in their work, despite the tedium and lousy pay. And, notably, despite the lack of oversight, they said they weren’t even tempted to game the system.
“I think most of us who do this with any form of seriousness realize that we’re representing not only all of the Turkers, but M. Turk as well,” said Clay Hamilton of Chenango Forks, New York, who Turks to pay for heating fuel and family travel. He estimates he’s done about 40,000 academic surveys. “If we start providing garbage information and garbage data, the work is going to slowly dry up.”
In fact, there’s a whole underground community made up of watchdogs for this sort of thing. William Little is a moderator on Turker Nation, one of the many online subcommunities that have grown out of Mechanical Turk. Among Turker Nation’s official rules, he said: “No disclosure or discussion of attention or memory checks. No discussion of survey content, period. That can affect the results.”
There’s no escaping the fact that doing thousands of experiments could lead to behavior that’s more automated than spontaneous — driven more by rote memory than gut instinct. But the ultimate effect on the workers, and on the research, may be more nuanced.
Many Turkers report an increasing degree of self-awareness that they attribute directly to their jobs.
Marshall says the surveys she has completed have inspired her to become more educated and more grounded in her own beliefs. She has been asked to think deeply about things she might not otherwise consider. And what could be more fundamentally human?
Hamilton summed it up this way: “I’d always considered myself a conservative, but the more of these surveys I’ve taken, I’ve realized I’m more of a moderate liberal. I’d always considered myself a Christian, but now I consider myself more of an agnostic. You just can’t fill out 40,000 surveys and not get a better sense of what you do and don’t think.”
Scientists undoubtedly will continue to wrestle with questions about Mechanical Turk and its use in academic research. Is the machine duping researchers the way its chess-playing namesake once duped 18th-century audiences, with Turkers misleading scientists and corrupting their science? Or, in a universe of imperfect study participants, is Mechanical Turk the best option available?
Both are certainly possible.
Then again, maybe Mechanical Turk isn’t a corrupt model at all. Isn’t it possible that, with the help of its savvy group of workers, the research it feeds is actually as strong, or stronger, than ever? How ironic it would be to find out the “robots” in the machine have become better than anyone at telling us about the human condition.