If you’re planning a cyber-attack on the federal government, don’t discuss it on social media. Seems like common sense, but experienced hackers keen on exploiting software vulnerabilities often use online forums to exchange strategies and define targets.
Despite advances in modern technology, analysts today still rely on cumbersome manual searches to sift through Internet content and pinpoint cyber threats. In a paper recently published in the Laboratory’s own journal, researchers at the Lincoln Laboratory outlined a simple, automated method to streamline this process, with “simple” being nearly as significant as “automated.”
Although automated techniques and cyber security have been well-studied in their respective research spheres, this paper is among the first works to marry the two, said Yuheng Hu, assistant professor of Information and Decision Sciences at the University of Illinois at Chicago.
The Lincoln Lab team devised and tested an automated computer program, called a classifier, which separates content into pre-defined categories. They trained the classifier to recognize key words within sample social media texts and then asked it to use this knowledge to categorize the documents as either suspicious or benign. The authors drew their example texts from Twitter, Reddit, and Stack Exchange, because these forums represent the three most common types of Internet discourse.
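The article does not say which learning algorithm the team used, so the following is only a minimal sketch of the general idea: a classifier that learns key words from labeled samples and then sorts new text into "suspicious" or "benign." It assumes a naive Bayes model with add-one smoothing, and the sample texts, labels, and function names are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical training samples; the real study drew texts from
# Twitter, Reddit, and Stack Exchange.
SAMPLES = [
    ("new exploit for the buffer overflow vulnerability", "suspicious"),
    ("malware payload targets unpatched servers", "suspicious"),
    ("great recipe for dinner tonight", "benign"),
    ("watching the football game with friends", "benign"),
]

def tokenize(text):
    """Lowercase a text and split it on whitespace."""
    return text.lower().split()

def train(samples):
    """Count words and documents per label from (text, label) pairs."""
    word_counts = {}        # label -> Counter of words seen under that label
    doc_counts = Counter()  # label -> number of training documents
    for text, label in samples:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the highest naive Bayes log-probability."""
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

word_counts, doc_counts = train(SAMPLES)
print(classify("exploit targets unpatched vulnerability", word_counts, doc_counts))
print(classify("dinner with friends", word_counts, doc_counts))
```

With so few samples the model is only a toy, but it shows how the key words are learned from examples rather than listed by hand.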
On its own, a classifier doesn’t understand the content of the conversations. That’s why the Lincoln Lab researchers chose to create a tool that incorporated a human language technology (HLT) component in addition to a traditional classifier. HLT takes advantage of contextual content that a classifier alone ignores, breaking down sentences into simpler pieces (for instance, reducing “hacks,” “hacker,” or “hacking” to “hack”), which are then fed to the classifier.
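Reducing word variants to a common root is usually called stemming. The sketch below is a deliberately crude suffix stripper, not the Lincoln Lab team's HLT component and not an established algorithm such as the Porter stemmer; the suffix list and the three-letter minimum are assumptions chosen only to reproduce the "hack" example.

```python
# Hypothetical suffix list, checked in order so longer endings win.
SUFFIXES = ("ing", "ers", "er", "ed", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def normalize(text):
    """Split a text into tokens, drop edge punctuation, and stem each token."""
    tokens = (tok.strip('.,!?;:"\'') for tok in text.split())
    return [stem(tok) for tok in tokens if tok]

print([stem(w) for w in ("hacks", "hacker", "hacking")])
print(normalize("Hackers are hacking."))
```

A production stemmer handles many more cases (doubled consonants, irregular forms), which is one reason real systems rely on tested language libraries rather than a short rule list like this one.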
HLT comes into play once again after the documents have been grouped by the classifier to help interpret the results, and it assigns each a numerical value according to its potential threat. For instance, “infection” is far less suspicious in the context of bacterial infections than it is in terms of software vulnerabilities. If a ranking above 0.7 is considered problematic, then this medical reference would fall somewhere below that benchmark.
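The ranking step can be pictured as a simple cutoff applied to scored documents. In this sketch the 0.7 threshold comes from the article's own illustration, while the example texts and their scores are invented for demonstration; how the real system computes a score is not described here.

```python
THRESHOLD = 0.7  # the article's illustrative cutoff, not a published setting

def flag(scored_docs, threshold=THRESHOLD):
    """Return (text, score) pairs scoring above the threshold, highest first."""
    flagged = [(text, score) for text, score in scored_docs if score > threshold]
    return sorted(flagged, key=lambda pair: pair[1], reverse=True)

# Hypothetical documents with made-up threat scores.
scored = [
    ("bacterial infection treated with antibiotics", 0.12),
    ("infection spreading through unpatched routers", 0.91),
    ("patch notes for the latest browser release", 0.35),
]

for text, score in flag(scored):
    print(f"{score:.2f}  {text}")
```

Only the router post clears the cutoff; the medical use of "infection" stays below it, as in the article's example.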
“Once the classifier has been trained, it knows the key words that are representative of a specific type of document,” said David Weller-Fahy, a member of the research team. The classifier essentially learns what social media discussions will be of interest to an analyst without being told exactly what to look for.
Hu recommended expanding the classifier’s dictionary, however, since Twitter lexicon is ever-changing and oftentimes unconventional. “Twitter is very dynamic, so that makes it hard to maintain a static dictionary,” he said.
Weller-Fahy explained he’s received similar suggestions to increase the classifier’s complexity, because in academia more is often better. But Weller-Fahy is catering to a different client; he’s targeting industry and the objectives of his sponsor.
“These are people looking to solve a particular set of problems,” he said. “If I provide them with something that’s simple, fast, and solves their problems, then they’re going to be happy.” Simple technologies can do impressive things with very little computing power if you start with the right data, he added.
In this case, he said, less is more because the classifier is not limited to just one type of data. It purposefully ignores certain aspects of social media posts, such as usernames and up and down votes, and scours only the text. It’s also possible to condense the classifier and email it directly to analysts, although Weller-Fahy hopes that once it’s officially released it will be automatically incorporated into software packages.
That said, the classifier is still far from polished. Eduard Hovy, professor at the Language Technologies Institute of Carnegie Mellon, urges the authors to report the accuracy of their classifier, a figure that was absent from their paper. “No system works 100%,” he said. “But if the classifier read, say, 400 social media messages and found three bad ones, we don’t know if it missed two or found them all.” Hu echoed this concern.
Weller-Fahy affirmed that, while the paper did report a “miss rate,” it didn’t specifically disclose accuracy. “We only focused on those performance metrics that were important to our sponsor,” he said. That sponsor, of course, is highly classified.
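Hovy's hypothetical shows why the two metrics are not interchangeable. The sketch below works through his numbers, assuming "miss rate" means the share of truly bad messages that went unflagged (the false negative rate) and assuming no false alarms; the paper's exact definitions may differ.

```python
def miss_rate(found_bad, missed_bad):
    """Share of truly bad messages the classifier failed to flag."""
    return missed_bad / (found_bad + missed_bad)

def accuracy(total, missed_bad, false_alarms=0):
    """Share of all messages that were labeled correctly."""
    return (total - missed_bad - false_alarms) / total

TOTAL = 400  # messages read, from Hovy's example
FOUND = 3    # bad messages the classifier flagged

for missed in (0, 2):
    print(
        f"missed={missed}: "
        f"miss rate={miss_rate(FOUND, missed):.1%}, "
        f"accuracy={accuracy(TOTAL, missed):.1%}"
    )
```

Under these assumptions, missing two of five bad messages is a 40% miss rate, yet accuracy still comes out at 99.5% because benign messages dominate the sample.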