Patrick Ball

Dr. Patrick Ball is Benetech's chief scientist and director of their human rights program. His software and statistical models have been used in large-scale human rights projects in El Salvador, Ethiopia, Guatemala, Haiti, South Africa, Kosovo, Sierra Leone, Sri Lanka and elsewhere to show patterns of genocide.

He tells FRONTLINE/World correspondent Clark Boyd how technology can help human rights workers preserve and analyze records like those found in Guatemala. “The point of all human rights work,” says Ball, “is to understand the past, so we can build a future that doesn’t repeat it.”

This interview took place at Benetech’s headquarters in Palo Alto, California, on April 7, 2008. It has been edited for clarity.

Q: Tell us a bit about your background and past human rights work.

Ball: I’ve been working with Guatemalan projects since about 1993. Our first round of work was [collecting] interviews with the Communities of Population and Resistance. These were communities that had been displaced, who were living off the grid, essentially hiding from the state in undeveloped areas of Guatemala. We captured thousands of statements from these folks [who] had been attacked by the army and prepared them for eventual delivery to the truth commission.

The second project was supporting the truth commission in Guatemala, designing their database, then making projections of total mortality that resulted from the conflict. It was from that estimate that we got the calculation that approximately 200,000 people had been killed and disappeared between 1960 and 1996. It was long thought that the perpetrator of most of these disappearances was the national police. So the question then is, How do we get to this part of the story? How do we get to the disappeared? When the archive was discovered several years ago, all of us who’ve been studying human rights in Guatemala for a long time were very excited [and] thought, Oh my gosh, this is our opportunity to flesh out the story of the disappeared. It’s fascinating to have an opportunity to look at disappearances from the perpetrator’s side, from the side of the people who very likely committed most of these disappearances.

Q: What was your initial reaction when the archive was found?

[When] we looked at the archives it became obvious to us very quickly that nobody is going to read 80 million pages in any meaningful timeframe. It's going to take a long, long time even to preserve this material, much less analyze or understand it.

Q: Eighty million documents. How do you start to analyze that? How do you start to get all of that into some form that you can actually get useful data from?

You take a sample, and the sample can be pretty small. Statisticians often talk about when you’re making soup and you want to taste it: You stir the soup, stir it as well as you can, but your ability to taste the soup? Really, you use the same size spoon no matter how big the pot is. Sampling is a way of getting a small number of documents that will tell you about the entire universe of 80 million documents. For us, a small number is probably in the tens of thousands, but it's a smaller number relative to 80 million.

[Our] approach to the problem [was] asking, If we looked at all the documents, what would be the story that would come out of it? We can’t look at all the documents, so what’s an approximation to looking at all the documents? We conducted a statistical sample. What we did is create a topographic map in three dimensions that shows how things are piled up: where there are drawers, where there are sacks of paper, where there are other shelves. From that assessment of the volume of paper, we developed a method to choose points from which samples will be drawn.

The advantage of a random sample is first that you cannot be accused of cherry picking. In fact, you are giving every document in the archives a chance to be part of your analysis. You’re not wandering through and choosing things that support your case. [You are] choosing documents at random, so that you can get a sense of the entire picture. You’re really letting the documents tell their own story. The second advantage of a big statistical sample like this is that even with 10,000 or 20,000 documents of the 80 million, we can make estimates of how many people, for example, are mentioned in these documents.
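The soup-spoon point can be checked with a little arithmetic. The sketch below is only an illustration, not the project's actual estimator: it assumes a simple random sample drawn without replacement, whereas the archive sample described here was drawn from physical locations chosen by paper volume. The function name `margin_of_error`, the sample size of 10,000 and the input proportion of 0.15 are hypothetical values chosen for the example.

```python
import math

def margin_of_error(p_hat, n, population, z=1.96):
    """Approximate 95% margin of error for a proportion p_hat estimated
    from a simple random sample of n items drawn without replacement
    from a population of the given size (all inputs are illustrative)."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    # Finite population correction: close to 1 when n is tiny relative
    # to the population, so the population size barely matters.
    fpc = math.sqrt((population - n) / (population - 1))
    return z * standard_error * fpc

# Same "spoon" (10,000 documents), two very different "pots".
for population in (1_000_000, 80_000_000):
    moe = margin_of_error(0.15, 10_000, population)
    print(f"population={population:>11,}  margin=+/-{moe:.4f}")
```

Under these assumptions both margins come out at roughly 0.7 percentage points: the precision is set almost entirely by the size of the sample, not by the size of the archive.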

Q: You’re trying to find, as you said, the story that the entire archive is telling. What can you say about the story that you’re finding?

One fascinating process that we have been a bit surprised by is that the police talk about violence a lot in their documents. They don’t say we’re going out to commit a disappearance, of course, but there is quite a bit of talk about discovering bodies, of releasing people, of capturing other people. There is a much greater discussion about things that are potential human rights violations than we expected. Many people said when you get to the documents, you’ll find that the police have removed everything that might be of interest to you. That’s certainly not the case. Indeed, there’s a great deal of information -- perhaps 15 percent -- that speaks to events which are potentially human rights violations. Now, they may not be human rights violations; that really depends on a legal analysis of the context of the violence. We identify the violent acts, and it’s then up to a careful legal analysis to determine which of those are legal and which are illegal.

Q: And the legal analysis -- that’s the Guatemalans’ job?

Absolutely. That’s not something we touch at all. Our work is to identify an event that involves violence. And then there’s a much more complex analysis about whether that violence is, in fact, legal or illegal.

Q: But you are committed to giving them information and evidence from the archive that will be scientifically sound and could stand up as part of a court case?

Yes, our work is always focused on building arguments that will withstand any adversarial criticism, whether in the court of law or in the court of public opinion. I prefer the court of law because the debate is much more controlled and rational than in the court of public opinion.

Q: How does the relationship between the Guatemalans and Benetech work?

We provide scientific and technical support [so] the ombudsman’s office can do substantive intellectual, historical interpretation of the documents. We provide a database for them. It’s called Martus. They key the data into this database, and the data is stored in an encrypted way on their local machines. Their local machines then copy the data in its encrypted form to a network of servers around the world. Even if the computers were to crash or to be stolen from the archives, the data that has been coded is safe. The data is immediately secure, and that’s a key part of our work. [We] give them the technical tools to preserve their information in a way that can be used both for analysis [and] backed up securely to a network of servers around the world.

We do the statistical work on the data that the archives code. We’re the ones who look for the patterns, the relationships, the magnitude of particular roles. We can extract from the data a version that you can imagine fitting in a spreadsheet. Then, we can conduct statistical calculations with it.

Q: Have you encountered any surprises in the initial statistical analysis that you’re doing on the archive?

We’re generating an enormous number of names; this is another thing that was not anticipated. It was not anticipated that the police would be so specific about people who they were interacting with. So we’re generating thousands and thousands of individual names. Some of these people are targets of surveillance, some of them are people who’ve been arrested; some of them are people who are collaborating with the police. And some of them are police officers, of course. We had not anticipated such a rich directory of people.

Q: Do the Guatemalans working at the archives have structures in place that will enable them to follow up on those names and investigate further into particular cases? Is that the idea?

It’s more that they’re looking for particular names that are famous cases; now, we’ve given them a pool that they can look in. There are a number of high-profile cases in Guatemala, disappearances that have been on the human rights community’s mind for quite some time, for 20-plus years in many cases. And the ombudsman’s job is to receive complaints from people who are looking for disappeared relatives or friends. So we have a number of cases that people are specifically looking for. Now we can compare the names that are in the documents to the names of interest. We then have an immediate linkage to a particular piece of evidence.

We did not expect so many names [and] so much detail on each one. It presents a challenge. How do we work with so many names in a useful way? How do we link those names back to the people of interest? I mean, this is hard because you know there are a lot of people named Juan Perez. It's hard then to “disambiguate” – a computer term – which records fit with which records, as you look across information from the families’ and victims’ point of view, to the information from the police’s point of view.
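The matching problem described above can be sketched in a few lines. This is a minimal illustration using only Python's standard library, not the method actually used at the archive: the names are invented, the helper functions `normalize` and `candidate_matches` are hypothetical, and the 0.85 similarity threshold is an arbitrary assumption. Because many different people share a name, anything this produces is a candidate for human review, not a confirmed link.

```python
import unicodedata
from difflib import SequenceMatcher

def normalize(name):
    """Lowercase, strip accents and collapse whitespace so that
    spelling variants of the same name compare more closely."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(stripped.lower().split())

def candidate_matches(names_of_interest, archive_names, threshold=0.85):
    """Return (wanted, found, score) triples whose similarity is at or
    above the threshold. The threshold is an illustrative assumption."""
    matches = []
    for wanted in names_of_interest:
        for found in archive_names:
            score = SequenceMatcher(
                None, normalize(wanted), normalize(found)
            ).ratio()
            if score >= threshold:
                matches.append((wanted, found, round(score, 2)))
    return matches

# Invented example data: one name from a family's complaint,
# three names as they might appear in police documents.
names_of_interest = ["Juan Pérez"]
archive_names = ["JUAN PEREZ", "Juan Peres", "Maria Lopez"]
for match in candidate_matches(names_of_interest, archive_names):
    print(match)
```

With this invented data, the two spelling variants of the name are flagged and the unrelated name is not. Telling apart two different people who are both named Juan Perez would need additional fields such as dates or places, which this sketch does not attempt.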

Q: You had another example you wanted to give of the big-picture point of view that you’re finding?

The final piece that’s of great interest is the information flow within the institution of the police. Where are the police focusing their attention over time? Are they focusing their resources and attention on fighting crime? Are they focusing their resources and attention on political monitoring, surveillance and the repression of dissidents? One way to measure that in a historical sense is by looking at the amount of information that flows between different parts of the institution, [as well as] what is the content of that information.

We can now map the flow of information that governs the interaction of the police and the army. Who’s sending orders? Who’s receiving orders? Who’s sending reports back? By mapping that flow, we get a good sense of not only what the informal power relationships are within the organization, but also simply the resources and attention of the institution. Where are people focusing? What’s important to the institution?
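As a rough sketch of what mapping that flow could look like once documents have been coded, the example below counts how many documents pass between each pair of units. The record layout (`sender`, `recipient`, `year`), the unit names and the function names are all hypothetical; the interview does not describe the fields actually coded at the archive.

```python
from collections import Counter

def flow_counts(documents):
    """Count coded documents for each (sender, recipient) pair."""
    return Counter((doc["sender"], doc["recipient"]) for doc in documents)

def outgoing_volume(documents):
    """Count how many coded documents each unit sent."""
    return Counter(doc["sender"] for doc in documents)

# Invented records standing in for coded documents.
documents = [
    {"sender": "Unit A", "recipient": "Unit B", "year": 1980},
    {"sender": "Unit A", "recipient": "Unit B", "year": 1981},
    {"sender": "Unit B", "recipient": "Unit A", "year": 1981},
    {"sender": "Unit C", "recipient": "Unit A", "year": 1982},
]

for (sender, recipient), count in flow_counts(documents).most_common():
    print(f"{sender} -> {recipient}: {count}")
```

Counts like these, broken down by year or by document type, are one simple way to see where an institution's attention was concentrated. In a sample-based study they would also need to be weighted to reflect how the sample was drawn, which this sketch leaves out.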

Q: What is the importance of that information flow? At the end of the day, what is that telling you that may be useful for legal cases or historians?

The point of all human rights work is to understand the past, so we can build a future that doesn’t repeat it. One of the things that’s of fundamental interest to Guatemalans is how do we build institutions that provide security to the population, that are not institutions that are immediately re-purposed for political repression. This is the key question. One of the things we want to understand is how did the police become an institution that at least, in part, committed acts of political repression? Were they actually trying to fight crime, but there were a few bad detectives who committed this violence? Or was the violence a systematic and structural part of the institution? Was it in fact their purpose, as they understood it, to commit repression? Those are very different views of the police in the past, and [those] views will inform what we understand the police [might] do in the future.

Q: Is there any frustration concerning the pace of the Guatemalan authorities? We hear that the reports keep getting delayed for political and safety reasons.

I think the Guatemalan National Police Archive Project is moving at a reasonable pace. One needs to keep in mind the scale of work. There really isn’t any other work in the world that happens at this scale. We’re talking about more than one hundred people reading documents. We’re talking about dozens and dozens of different pieces of information that have to be drawn together and synthesized. There are very, very few research projects in the world that have anything like this scale, and fewer people in this world who have experience running them. So it's really hard to figure out when you’re going to be done, or when you’re going to reach an intermediate point that allows you to publish some sort of early report.

We’ve worked with nine truth commissions, and [they all] have a lot of political pressure to conduct thousands and thousands of interviews and then synthesize a report very quickly within a matter of a few years. I’m neither surprised nor frustrated or even impatient for them to finish this first round – I know how hard it is. It’s much more complicated than people think. You get onto the ground and you realize, Wow there are 25 individual reports, each of which brought five people’s perspectives together, and we have to synthesize those. How are we going to produce statistics, which help pull this whole thing together? What are the big questions, and what have we learned?

Q: When you are finally finished with all of these documents, what will the final result look like?

In my experience, years and years of quantitative work, careful database work, lots of on-the-ground documentation – all the technical things we do to get the job right – often comes down to a single [para]graph when we present it to the public. All the work I did with the truth commission [in Guatemala] came down to a single bar graph, which showed the disproportionate impact of mortality due to political violence on indigenous and non-indigenous groups in six places. And this [ultimately indicated] acts of genocide [took place]. It comes down to a very small number of comparisons, but those comparisons often are the piece that tells the story. I have several other graphs that, for me, capture the reality of a particular county. You can learn so much from one line or a set of bars.

I suspect that our work at the archives may come down to a single graph, but I have no idea what that graph is right now. Right now, it’s a whole bunch of graphs. It’s dozens, if not hundreds. We’re trying to figure out, Where’s the pattern? What are we really trying to find here? And that’s a combination of asking the right question [and] asking it of the right sub-piece of data.