
The secret things you give away through your phone metadata

Science Jun 2, 2016 11:12 AM EDT

The word "metadata" achieved buzzword status in 2013. That's when whistleblower Edward Snowden leaked documents exposing a National Security Agency program that collected telephone metadata in bulk -- along with other surveillance schemes deemed unsavory by electronic rights watchdogs. Since then, metadata collection has been invoked in court proceedings, innumerable opinion pieces and an Oscar-winning documentary as one of the most egregious violations of personal privacy. On Monday, former U.S. Attorney General Eric Holder said Snowden "performed a public service" -- albeit an "inappropriate and illegal" one -- by sharing the secrets.

Yet, most people couldn't describe, step-by-step, how metadata are used to piece together personal secrets.

You're in luck. A new study from Stanford University charts exactly what can be learned from telephone metadata. The researchers used rudimentary techniques to show that not only your name and relationship status but also countless other personal details are readily apparent from telephone metadata.

Don't want your parents to know that you're pregnant? Hope that they don't hack your smartphone's metadata.

The results clarify a longstanding debate. Metadata have historically received fewer legal protections than actual communications content, such as audio from a phone conversation or text message transcripts, to the disdain of privacy advocates. This study shows that sensitive information, like health services or lifestyle choices, is easily discernible from metadata with little digging.

"People have testified in Congress, saying that metadata definitely carries sensitive information, but there hadn't been a lot of science done," Patrick Mutchler, study co-author and member of Stanford's Computer Security Laboratory, told NewsHour. "What our study does is confirm a lot of the suspicions that people held about metadata."

Even though the NSA shuttered its bulk collection program six months ago, the researchers' findings remain pertinent. The NSA and Federal Bureau of Investigation can still obtain telephone metadata on individual suspects via the U.S. foreign intelligence surveillance court, which didn't deny any of the 1,457 requests made last year. (In fact, the FISA court hasn't refused an application since 2009.) Plus, the NSA is holding on to boatloads of metadata collected over the last five years due to ongoing legal cases with privacy advocates.

The controversy also crosses borders. After last November's Paris attacks, France enhanced its surveillance powers to monitor phone calls without a warrant. Meanwhile, the U.K. government is debating similar legislation nicknamed the "snooper's charter." Regardless of what governments decide, companies continue to collect phone and internet metadata on customers, whether it's to sell ads or build better apps -- and they've done so for decades.

"With these data, people are able to make more informed decisions about whether or not they approve or disapprove of these policies," Mutchler said. Tech innovators can also use the research to devise shields against the practice of metadata collection.

But let's start at the beginning with MetaPhone.

"Wait, you're pregnant!?"

MetaPhone is an Android app designed by Mutchler and his labmates to collect telephone metadata. Over an eight-month window, the smartphones of 823 adult volunteers beamed call and text logs to the team's secure server. This data comprised when a call or text was made, whether it was an incoming or outgoing transmission, and the duration of the call or the text message's length (in characters). The app also noted the phone numbers of the senders and recipients, but no identifiable information, audio recordings or textual content.

"The same stuff is available to the NSA, but they'd have more of it," Mutchler said of his May 17 report in the Proceedings of the National Academy of Sciences. "The volunteers hailed from 45 states, D.C. and Puerto Rico."

This small pool yielded 62,229 unique phone numbers, 251,788 calls and 1,234,231 texts. Basic machine-learning algorithms did the rest of the heavy lifting. The team relied on these quasi-intuitive programs to make inferences about people's identities or lifestyles.

The team started with child's play. They had the algorithms skim public information from Facebook, Yelp or Google Places in order to match 30,000 randomly selected phone numbers to individuals or businesses. Using these three sources, the researchers matched identities for 32 percent of the phone numbers. When the hunt expanded to include a public records service -- a $19.95 investment -- and 70 minutes of Google searches, the algorithms caught 82 percent of the identities.

The researchers could also pinpoint the identity of romantic partners -- as verified by Facebook relationship statuses -- with 80 percent accuracy using call volume and 76 percent accuracy using how often the couple texted each day.
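As a rough illustration of the call-volume idea, the sketch below guesses a partner by picking the number a person exchanges the most calls with. The record layout and the function name guess_partner are assumptions made for this example; it is not the Stanford team's method, which is not detailed here and was likely more involved.

```python
from collections import Counter

def guess_partner(call_log):
    """Return the contact with the most calls, or None for an empty log.

    call_log is a list of (other_number, duration_seconds) tuples. This
    layout is hypothetical, not the MetaPhone format.
    """
    counts = Counter(number for number, _duration in call_log)
    if not counts:
        return None
    # most_common(1) returns a one-item list of (number, count)
    return counts.most_common(1)[0][0]

# Made-up call log for one volunteer
calls = [("555-0101", 120), ("555-0199", 30), ("555-0101", 600), ("555-0101", 45)]
print(guess_partner(calls))
```

Even a heuristic this crude needs nothing beyond who called whom and how often, which is exactly what metadata records.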

The shocks came when the researchers looked for sensitive connections. In the report, they presented five typical examples.

A Washington Metro bus is seen with an Edward Snowden sign on its side panel December 20, 2013. Photo by Gary Cameron/REUTERS

"I would have guessed general inferences -- like religious affiliation. But at least in one case we were able to identify a person with a cardiac arrhythmia," Mutchler said.

This participant received a long phone call from a cardiology group at a regional medical center, according to the paper, talked briefly with a medical laboratory and answered several short calls from a local drugstore. But the key giveaway may have been brief calls to a self-reporting hotline for a cardiac arrhythmia monitoring device. The team followed up and confirmed the cardiac arrhythmia, as well as a case where the analysis accurately concluded a person had purchased an automatic rifle.

Another volunteer called a pharmaceutical hotline for a drug prescribed only for multiple sclerosis, while a third vignette involved a person who spoke with her sister early one morning for an extended period of time. Two days later, she made multiple calls to a nearby Planned Parenthood clinic. She repeated the pattern two weeks later...and then again, a month after the first call.

Communications with health services were the most common form of sensitive information caught by MetaPhone's surveillance, accounting for 57 percent of calls among participants. Financial services accounted for 40 percent.

Another case involved a person who "placed calls to a hardware outlet, locksmiths, a hydroponics store, and a head shop in under three weeks," the report stated.

"The call patterns are indicative of starting to grow marijuana," Mutchler said.

Overall, the analysis found metadata from an NSA request involving a single suspect could uncover information on approximately 25,000 individuals. Extend the search by one degree of separation -- you, your friend and their contacts -- and an agent could recover personal information on 20 million people. Kevin Bacon, eat your heart out. This latter scenario, known as three-hop surveillance, was the NSA's legal standard until recently.
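To see why each extra hop balloons the number of people swept in, here is a minimal sketch of hop-limited search over a contact graph. The toy graph, the labels in it and the function name numbers_within_hops are invented for illustration; real call graphs are vastly larger, and this is not the researchers' code.

```python
from collections import deque

def numbers_within_hops(contacts, start, max_hops):
    """Breadth-first search over a contact graph.

    contacts maps each phone number to the set of numbers it has called
    or texted. Returns every number reachable from start in at most
    max_hops steps, excluding start itself.
    """
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        number, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in contacts.get(number, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    seen.discard(start)
    return seen

# Hypothetical five-person contact graph
toy = {
    "suspect": {"a", "b"},
    "a": {"suspect", "c"},
    "b": {"suspect"},
    "c": {"a", "d"},
    "d": {"c"},
}
for hops in (1, 2, 3):
    print(hops, sorted(numbers_within_hops(toy, "suspect", hops)))
```

In a real network, a single widely called number, such as a pizza place or a voicemail service, can link thousands of otherwise unrelated people within a hop or two.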

"Maybe these [metadata] separately are innocuous, but there is a more meaningful picture that doesn't appear until you look at the data."

Metadata collection for the masses

Not long after Snowden outed the NSA, President Obama asked the National Academy of Sciences to convene a panel of 13 computer security experts. Over the course of five months in late 2014, they tackled whether there were currently technological alternatives to bulk collection of metadata that could still let intelligence agencies do their work.

"In a sense, the short answer was not really," said Michael Kearns, a computer scientist at the University of Pennsylvania who served on the committee.

The reason is a sensible one, he said. The whole premise of intelligence work is that some individuals may not have the right to privacy. It's difficult to know in advance who you should and shouldn't be collecting data on, for the very reason that if you knew already, then you wouldn't need any data in the first place.

The National Security Agency has tapped directly into communications links used by Google and Yahoo to move huge amounts of email and other user information among overseas data centers, according to secret NSA documents leaked by former contractor Edward Snowden. Photo by Pawel Kopczynski/REUTERS

Kearns believes future technology can strike a balance for surveillance agencies. In January, his team published a set of algorithms that can take a social network -- like Facebook or a database of phone contacts -- and filter perps from the innocent. Here's how it works.

Suppose I tell you the average salary of the PBS NewsHour editorial staff immediately before and after a reporter resigns. If you know those two values, you can easily figure out how much the reporter was making. Kearns' algorithms rely on differential privacy -- a statistical masking that adds a bit of noise or randomness to the data. You can still make the salary calculation, but you can't identify the reporter.
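The salary example can be worked through in a few lines. The sketch below first recovers a departing reporter's pay from two exact averages, then repeats the attempt after Laplace noise has been added to each average, a standard way to achieve differential privacy. All salaries, the 100,000 cap and the epsilon value are made up, and the sensitivity calculation is simplified; this is not Kearns' published algorithm.

```python
import random

def average(values):
    return sum(values) / len(values)

def recover_salary(avg_before, n_before, avg_after):
    """Differencing attack: payroll total before minus total after."""
    return avg_before * n_before - avg_after * (n_before - 1)

def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def noisy_average(values, epsilon, max_salary, rng):
    """Average plus Laplace noise; assumes salaries lie in [0, max_salary]."""
    sensitivity = max_salary / len(values)  # most one person can shift the mean
    return average(values) + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical newsroom; the last reporter (49,000) resigns
before = [52000, 61000, 58000, 70000, 49000]
after = before[:-1]

exact = recover_salary(average(before), len(before), average(after))
print("exact averages reveal:", exact)

rng = random.Random(0)  # seeded only to make the demo repeatable
noisy = recover_salary(
    noisy_average(before, 0.5, 100000, rng),
    len(before),
    noisy_average(after, 0.5, 100000, rng),
)
print("noisy averages suggest:", round(noisy))
```

With exact averages the attack returns the salary to the dollar. With noise, each published average is still roughly right for a large staff, but the difference between two of them no longer pins down one person's pay.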

"You want to limit the amount of information that's passing through the barrier between the place where all the data is held and the people who can act on the data," said Adam Smith, a security and privacy data scientist at Pennsylvania State University who wasn't involved with the research. "Differential privacy gives you a way to publish approximate stats to guarantee that there isn't too much information about one person."

Smith said companies like Google employ similar techniques to gather stats on how people use apps on their phones but maintain privacy.

"Certain types of info, they'd rather not collect," Smith said. "They don't want to be on the hook for subpoena."

This info could be as simple as a person's homepage on their browser. The companies monitor these browser settings because some types of viruses and malware create false default homepages that take a user to another webpage. By using differential privacy, the company can track webpage traffic that raises a red flag and see if a piece of malware is responsible.

"Differential privacy allows them to collect approximate statistics about how people are setting their homepage without knowing the precise details of how you and I set our homepage," Smith said.
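One classic technique that fits Smith's description is randomized response, in which each phone randomly scrambles its own answer before reporting it. The sketch below is an assumption-laden illustration: the 75 percent truth probability, the simulated population and the function names are invented, and nothing here is confirmed to be what any particular company runs.

```python
import random

def randomized_report(truth, rng, p_truth=0.75):
    """Report the true bit with probability p_truth, else a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5

def estimate_rate(reports, p_truth=0.75):
    """Invert the noise: observed = p_truth * rate + (1 - p_truth) * 0.5."""
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth

rng = random.Random(42)  # seeded only to make the demo repeatable
# Simulated users: 20 percent have a homepage that was changed
true_bits = [i < 2000 for i in range(10000)]
reports = [randomized_report(bit, rng) for bit in true_bits]
print(round(estimate_rate(reports), 3))
```

Any single report could be the product of a coin flip, so it says little about one user, yet the aggregate estimate lands close to the true 20 percent.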

One team at MIT wants to take the differential privacy approach, which primarily suits centralized databases, and apply it to individual smartphones. They've developed an app called SafeAnswers that allows a downloader to share parts of their metadata without forking over personally identifying content. The idea isn't completely novel. A handful of startups have created personal data storage platforms, so people can charge third parties for access to metadata.

Yet, both Kearns and Smith said differential privacy works as a solution only if surveillance agencies or communications companies buy into it.

Individuals can end-to-end encrypt their phone calls, texts and WhatsApp messages with a service like OpenWhisperSystems. However, it requires an internet connection to create a secure channel. Mutchler couldn't think of an app that automatically anonymizes or creates false metadata to throw off possible snoops.

"As an individual, you don't have a lot of control over how your data is used or manipulated once it's left you and gone to the telecommunication companies. As it stands now, we would need to make a bunch of changes," Mutchler said. "From a public policy perspective, the next step is a continued discussion about metadata privacy and whether metadata should be considered separate or not" from content communications.
