Interview: Audrey Qualls

Audrey Qualls is an associate professor of education at the University of Iowa and co-author of the Iowa Test of Basic Skills (ITBS), one of the most widely administered tests in the country. Qualls tells FRONTLINE that the tests she has developed were never intended to be used for high-stakes purposes. She also believes that the test-publishing companies will not be able to handle the increased demand created by President Bush's new mandatory testing policy. This interview was conducted by producer John Tulenko on April 25, 2001.

... [C]ompared to 15 years ago, how important are tests in the world of K-12 public education?

They play a huge role. It's a role of importance. It's probably out of line with the type of information they provide.

What do you mean?

Within the last 15 years, tests have been used to ... show that educational systems are doing what they're charged with doing. And an easy way of trying to show that is through the results of standardized tests. That's what a lot of policymakers have believed, and politicians. So you're seeing an increase in the mandated use of tests for accountability purposes. Most of the tests that are being used in that way were not built to serve that purpose in isolation. ...

What were they built for?

Most of the standardized achievement [tests] that are being used were originally built to provide information to the classroom teacher on a student's strengths and weaknesses for instructional purposes. ... [I]t's an external check of how the kid is growing, how they're changing, where they're weak, where they're strong. ...

You said the tests point out strengths and weaknesses for the student. But if the test is showing strengths and weaknesses throughout the whole school system, what's wrong with that as a measure of school quality?

A test can only measure a sample of the curriculum, a sample of the things that schools think are important, that we want children to learn and be able to do. If I take test results in isolation and try to evaluate the worth of a school, I'm taking a very narrow picture that could be distorted, that may not be representative of the whole. If I balance it strictly on test results and I'm in a district where I have very bright students that are very wealthy, well off, the system is providing lots of resources, that school is going to automatically perform at a different level than a school that's got students that are struggling. [And it's] not because of things that you can really control for. ...

But many people might have [said] that the tests sample the most important parts.

Not necessarily the most important parts. For example, national achievement tests, standardized tests, are going to measure those things that are most common within a particular grade level. So it's going to measure the most common things that are taught to fourth graders and fifth graders in certain areas. It may be very, very important to teach students how to actually set up an experiment. That's a major goal of science. ... I can't measure that very well with a multiple-choice type test. Most of the national achievement tests that are being used are of multiple-choice formats, so they narrowly have to go in and measure those things that are best suited to a multiple-choice format.

And what's best suited for a multiple-choice format?

Well, it's going to depend. A lot of people have an argument that multiple-choice tests only measure basic skills, that we cannot measure the student's ability to evaluate. When you think of a taxonomy of behaviors, that's not true. You can measure evaluation; higher-order skills can also be measured with multiple choice. But I can't measure per se the student's ability to produce something. I can measure underlying skills that may be associated with that, but not their ability to actually produce a product. ...

Why do we use multiple choice so much?

... The best [argument for using] multiple choice would be [that] it's an efficient method to measure certain skills, certain knowledge. There are other ways of measuring different skills, different knowledge. So you could use a performance assessment to pick up other types of skills [and] combine the information with the information you gather from a multiple-choice [test]. The reason we tend to go first to a multiple choice is practical: I can measure a lot of things very efficiently, very quickly, and fairly cheap. I can't do that with a performance assessment.

So what important thing can you not measure in the multiple-choice format?

Writing. Direct samples of writing. ... When I look at young children, there are other types of behaviors that are related to learning that I can't measure with a paper-pencil type instrument.

Such as?

Well, staying on task. Whether or not a student actually asks questions. Whether they use background information appropriately. Whether they're part of a conversation. Those are all very important things to know about ... young learners. But I can't pick it up with a paper-pencil test. So you want multiple pieces of information from multiple sources. ...

At the beginning, you said that the national achievement tests ask what's in common. Is what's in common universal?

Oh, no. Absolutely not. You're going to have some pieces of the curriculum that are pretty standard at particular grade levels. There are certain things that fourth graders will learn in mathematics. When I think about social studies, the first thing most students learn in social studies is the notion of family. Then you move from family to neighborhood. You move from neighborhood, broader, to city. Those pieces are common. But when I think about the different ways that a textbook might present that, or curriculum material might present that, that's going to vary. When you look at the national tests, what they've tried to do is to find those pieces that are most common, recognizing it's not going to be a perfect match for any given school district.

How much do they miss?

Depends on the school. Any state that's looking at a particular achievement test [to] use has to have a responsibility. They need to start with their curriculum, because those educators in that system have decided [that] these are the most important things for [their] students to learn. ... [T]hey should compare [the curriculum] to what's being measured by each of the possible competing national achievement tests. And you pick the best match.

But if you just pick a test off the shelf, what's the risk you're taking?

You're taking a fairly big risk of mismatch, without looking at the test. I would think you're being very irresponsible.

How many choices do you have?

There are [five] major achievement tests. There are lots of different ways to create customized tests, depending upon which vendor you go to, that might better measure your local curriculum. ...

Doesn't sound like a whole lot of choices.

Probably not. When I look at choice within educational testing, you're looking at a market. Like anything else, [it's] a private market. Probably 10 years ago when tests were used slightly differently, there wasn't a heavy emphasis on a single test result. Those ... tests, used in combination with other sources of information, probably did a very, very good job of providing external information to the schools.

If I'm now in a situation where I'm saying, "I've got to choose a test that's going to be used to make very high-stakes decisions either about the school or the student," sure, you want to find the best possible match. And [five] tests are going to leave you wanting. ... I think you're going to see new players coming into the market of developing standardized tests. Right now, with Bush's proposal, with the state mandates on accountability, there's a very, very high demand for tests. And the major publishers in the market are not going to be able to handle the demand. It's too much.

... So when I pick up my newspaper at home and I look at the chart that ranks all the schools according to test scores and I see a school at the bottom, I think, "Gee, that must be a bad school." What do you say to me?

I say you could have ranked those schools without test results and gotten the same thing. I would tell you that ranking of schools is really a very unfair comparison of schools.

So I'm not right to think that?

No, you're not necessarily right. You may not be wrong. If I look at the factors that impact achievement, it is not as if all schools start off equal. No question. Most of the communities in the U.S., you have some degree of segregation, at least by socioeconomic standards. So you're going to have your very affluent schools, you're going to have more of your middle-class schools, and this is based upon neighborhood. And you're going to have lower-income schools. We know there's a very high correlation between income and achievement.

So if I think about just the median income level of the families in the school, I know some schools at the bottom are starting with some of the weakest students. And when I look at schools starting with weak students, in order to help those students get up to now this notion of one common standard, it's going to take more than a school that's starting at the top. ...

[I]f you have a kindergarten child that enters school that doesn't know their alphabet, they can't count [to] 10, you're going to first have to teach that student how to do that. ... [You] have to establish some pretty basic skills that aren't in place, and in other students, they are there. And at the end of kindergarten, [if] I want to get this student up to where maybe they have beginning reading skills in place, it's going to take more time. ...

And at the end of that first grade, if that student learns the alphabet, learns some reading skills but still scores poorly on the test, is that a reflection of poor teaching?

No, not necessarily. ... If you look at the growth of the student, from where they started to where they finish at the end of the school year, that could be an outstanding school.

Do the tests tell you that information?

No piece of information tells you that in isolation. None. So what you're looking at is teacher grades or test scores. ... We need to take all the pieces. I need some idea of what the kid is bringing to me when they come in the door, so I have to have some early informal assessment or formal assessment of where the student is. ...

When I think about what a school should do or what makes a school good or bad, it's far more than how well they teach a few sets of skills. When we think about what we've [asked] schools and teachers to do, it's the developmental growth of these little five-year-olds and six-year-olds that come in a door, all the way to being productive citizens, able to go to college or able to go on to work. And if I think about, what does that mean, it truly has to be more than just achievement in a few basic areas.

One, I want them to develop an appreciation and value of societal norms. Being a good citizen. Caring about learning. Contributing to your community. I can't measure that with a test. When I think about what makes a good school, I want to think about, do teachers have the commitment to help a student who may not be the traditional student, who needs a little bit more? ... A test score doesn't capture that. ...

So if I were to try to say, "OK, all of that sounds wonderful. And I want the school to do it," but all that I can easily measure is your achievement over some basic skills, I've cheated the schools. ...

[S]ay that student scores in the 5th percentile [of all the students who take a test]. How fair is that to the teacher who's being rated according to where the kids stand?

My opinion [is that] it would never be fair to rate a teacher whether the kid's at the 5th percentile or the 95th percentile. My rating on that teacher should be based upon more information than an achievement test [can] score. So I don't think it's fair. ...

And yet [students' scores are used to measure teacher performance]. We make this huge leap of faith.

We sure do. And it's because we're comfortable with this objective test scoring. It's safe. We think we know what it means. And we think it can stand in a proxy to the value of the whole, which is wrong. The inference is just wrong. Possibly, it's our fault as educators because, in the past, we haven't wanted to be accountable. We've avoided this notion of testing or accountability, or what do I do with my day. University professors have it, too. We've given the public a sense we're hiding.

So I think a lot of it is lack of communication, an unwillingness to accept when the public says they're not happy with us. They want more. They want schools to do something different. A lot of times, we run in fear, rather than trying to sit down and communicate: "This is what we do. We don't have a good way to show you that we're doing it. What would be better?" ...

What is a "norm-referenced" test?

... [If] I want a national norm reference test, I'm going to get a representative sample of students in the nation and I'm going to give them this one test that represents common curriculum. ... The scores that are going to be yielded are scores that give me information of how children are doing relative to each other. So relative to third graders, how [does this particular third grader] look? ...

[Percentiles, that's one way to score a norm-referenced test. That is, if you're a student who scored in the 5th percentile, you scored better than 5 percent of the other students who took the test. And 95 percent of the students who took the test outperformed you]. ... What does a norm-referenced percentile score tell you about whether or not the child has met standards for learning?

If you were in a situation that you were standards-based -- and your main focus was, has the student learned the standard? You wouldn't use a percentile rank, or a percentile. ...

A lot of educators in the public, we've gotten comfortable with the [percentile scores and the] comparison to others. [We can say], "I don't have a firm way to defend any given standard, but I do know that compared to other third graders, you're performing well." ...
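
As a rough illustration of the percentile scoring discussed here, the sketch below computes a percentile rank as the share of a norm group that scored strictly below a given student. The function name, the made-up norm group, and the strict "below" convention are assumptions for illustration only; a test publisher may treat tied scores differently.

```python
def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring strictly below `score`."""
    below = sum(1 for s in norm_scores if s < score)
    return 100.0 * below / len(norm_scores)

# Hypothetical norm group: 100 students with raw scores 1 through 100.
norm_group = list(range(1, 101))

# A raw score of 6 beats five students out of 100.
print(percentile_rank(6, norm_group))  # 5.0
```

Note that the result says only where the student stands relative to the norm group, not whether any particular learning standard has been met.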

[So it's not appropriate to use norm-referenced tests that are scored using percentiles. But what if a norm-referenced test is scored differently? Is it the right test to use for measuring whether a student has met the learning standards?]

It may or may not be. ... The best test would be a test that's built specifically to measure the curriculum of your school, a test that has enough items measuring any given objective that you can make a trustworthy decision that the child has either achieved the standard or not. That, in all probability, would lean more towards what we think of as "criterion reference," as opposed to norm reference, because the emphasis in norm reference is, how do you perform relative to someone else? ...

[With criterion-referenced tests], it's not how your third grader does relative to other third graders. It's, "Does your third grader know what we've deemed to be enough? Do they know enough to be called excellent, or basic, or needs improvement?" But how do you decide?

Real tough. With the criterion reference ... you can define a set of material that students should learn. So within reading -- which has got to be the hardest one in the world to even think about setting a standard -- I have a set of objectives and content and types of material students should be able to read in fifth grade. Now, at some point, I have to decide how much is enough. And in order for me to say that you are a good reader, let alone an excellent reader, someone has to come in and set a [score], and decide in order for you to earn this label "Good," you have to answer X number of questions in the reading section correctly. The standard is set subjectively. There is no other way to set a standard. ... Somebody or some groups of bodies -- educators, politicians, businessmen -- have to decide how much is enough. ...

Another way that you can actually set a standard is to take groups of students that you have labeled as excellent, good, poor, what have you, and look at their actual performance on that test and find the breaks and the distribution and set the cuts there. That's one way you can do it.
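
One way to picture this kind of cut-setting is the sketch below, which takes the scores of two pre-labeled groups and picks the cut that misclassifies the fewest students. It is a simplified, two-group version of the idea; the function name, the scores, and the "fewest misclassifications" rule are assumptions for illustration, not a description of how any particular test sets its cuts.

```python
def contrasting_groups_cut(lower_scores, upper_scores):
    """Pick the cut score that misclassifies the fewest students.

    A student "passes" when score >= cut. An error is a lower-group
    student who passes or an upper-group student who fails.
    """
    candidates = sorted(set(lower_scores) | set(upper_scores))
    best_cut, best_errors = None, None
    for cut in candidates:
        errors = (sum(1 for s in lower_scores if s >= cut)
                  + sum(1 for s in upper_scores if s < cut))
        if best_errors is None or errors < best_errors:
            best_cut, best_errors = cut, errors
    return best_cut, best_errors

# Hypothetical raw scores for students that teachers labeled in advance.
labeled_poor = [10, 12, 14, 15, 17, 22]
labeled_good = [18, 20, 21, 23, 24, 26]

print(contrasting_groups_cut(labeled_poor, labeled_good))  # (18, 1)
```

The labels themselves are still judgments, so the resulting cut remains subjective in the sense described above.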

Or if you want to raise the bar, which is what we're doing now, it's not so much what students actually can do, but it's what we believe they should do. And I can set a standard based upon my beliefs of what students should do. That's what NAEP does.

At the state level, expectations of what students should be able to do ... How often are [those expectations] pie in the sky versus realistic?

They're most often, in my opinion, pie in the sky. The reality ... come[s] down to the decision, how many students can you afford to fail? How many students can you afford to remediate? Because if they don't reach the standard, what do you do with them? Do we just throw away generations of students? So when you're setting this standard ... you have to look at the outcome too. ...

Well, if they ask that question, isn't there some incentive to [say], "I don't want to fail a lot, it's going to make me unpopular?" ...

There are certain states where projected failure rates right now in their new standards are 50 percent. And we have commissioners and politicians who are saying, "That's OK. I'm going to live with that 50 percent now, until people understand it's a new day and we have to get the students up to speed." I don't know how you live with failing 50 percent of your students. I don't know how you really can say that 50 percent of your students are not performing to a standard that's been subjectively set. ...

States may set the bar high. Does everyone have an equal chance of getting over the bar today?

No, absolutely not. ... But what do you do with the bar [in states with high failing rates]? The only way you're going to change that is to drop the bar.

You could say, "We'll get a qualified teacher in every classroom."

Oh, absolutely. I wish we would say that. I wish we would say we're going to make sure all students have a qualified teacher, [that] all students have the resources necessary to learn, all students have access to a computer. School's open, you can go to school, it's a safe zone. ... But all students don't have those opportunities. ...

[With norm-referenced tests], we're comparing one third grader to another third grader. You have 40 questions roughly, and 40 minutes in which to create [statistical variation between the students -- the "bell curve."] How do you do that? ...

Before I pick those 40 questions, I have a curriculum. And let's say I write 120 to 150 questions. I write far more than what I'm going to use. I know that I have to have questions to challenge the brightest students, I have to have questions to challenge those students in the middle, and I have to have easy questions to challenge those students on the bottom. ...

Someone said to me, "On norm-reference tests, national achievement tests, if a teacher does a stellar job teaching content, it's very likely that that won't appear on the test, because too many students will get that question right." Say a whole school system focuses on punctuation, so that [the students all] know their punctuation. Can you ask a question about that?


And what if they all got it right? ... Would you throw out that question?

No. ... When I'm building [a] test of basic skills, I'm going to draw a representative sample of schools and classrooms from across the country. School district X in the Northwest, teacher Jones may in fact have done an excellent job of teaching that particular objective of punctuation. But the rest of the country may not be doing that. So that when I look at the average difficulty level, it's not going to reflect that everyone's doing the same thing. ...

How much material on a norm-reference test is at grade level?

That's tough. The majority would be what you would think of as grade level. But if you go into a fourth-grade classroom, all students are not performing at one point. Some fourth graders are working on material a little bit above, some fourth graders are working on material a little bit below. Typically, if you want to think about a test, some of the easier questions for fourth grade may in fact be material that the average or above average third grader can do. So that would be off level for fourth grade. But it would challenge those least able fourth-grade students. ...

So what is the difference if you were, say, writing a norm-reference test for fourth grade [math] and a criterion-reference test of fourth-grade math? Would you take a different approach to the questions you write in some way? How would the test writer approach it differently?

I wouldn't take a different approach to the question I write. But I would write more questions first, to cover a particular objective, and the manner in which I selected the questions to include on the test would be different.

How would it be different?

... I wouldn't have to worry about having a set of items that challenges the very bottom, or the very top, or the majority in the middle. My focus should purely be on covering the material. And if all students can get it right, so be it. [But] if I have absolutely no variability, if every student got every single question right, I don't have any information. So what I would do on the criterion-reference test is really try to have questions that spread students out from those who are good, versus not good, or on level, proficient, not proficient.

... All the public really wants to know is, "Has my kid learned third-grade material?" ...

What about those third graders who are already beyond third grade? They're still in a third-grade classroom. A good teacher has to be able to structure instruction in such a way that she can challenge the range of ability. If she really has a third grader who's very, very bright, who learns very, very quickly, she has to have a tool that allows her to challenge that student, to push them on. We don't want her to just stop and [say], "You're done for the year." You want to keep them learning. The most common way to do that is to reach to the grade level above. ...

So the test then would help the teacher do a better job in that sense?

That's why we originally thought we were building these things, and not [for] accountability. Absolutely. ... I build tests to provide instructional information to help a student learn.

... If you built it to help the teacher, how has it been twisted? ...

It has been viewed by policymakers as a cheap, efficient, external tool that I can mandate onto a school or a system, get a magical number, to see how well you're doing. And I can hold you, as a district, accountable. When the public became dissatisfied with what students knew, what they were learning, what our schools were doing, everyone in a good faith effort reached for a fix. No one really knows how to fix public education. But one very easy way of at least indicating a problem, or strength, is through a number. It's objective. ...

There are a lot of people who look at the test scores and they say, "Oh, my child got a 75." ... Is a 75 always a 75?

Absolutely not. That score of 75 contains some amount of error. What I want to do is try to make sure it doesn't contain a lot of error. You happen to take the test at 10 a.m. Well, maybe you would have done differently at 12 p.m. ...

What is the standard measurement error in the history portion [of the eighth-grade ITBS]? ...

I can't give you a standard error off the top of my head. ... It's going to be based on a different test and a different score scale and what have you. For example, hypothetically, let's assume the standard error for scores near the center (because [standard error] depends upon where the score is) might be five score points. So if you have a score of 75 ... I know that 75 is not an accurate pinpoint estimate. There's some amount of error in there. And what I'm saying with this notion of a standard error [is that] ... if I could measure your true ability with no error, your score could be captured within the range of 70 to 80. ... So it's the notion of providing a range within which I believe your true ability lies.
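
The hypothetical numbers in this answer can be worked through in a few lines. The sketch below builds the band of plus or minus one standard error around an observed score and, assuming measurement error is normally distributed (an assumption added here, not stated in the interview), estimates how often such a band would contain the true score. The function names are illustrative.

```python
import math

def score_band(observed, sem, k=1.0):
    """Return the (low, high) range observed -/+ k standard errors."""
    return observed - k * sem, observed + k * sem

def coverage(k):
    """Chance that a normally distributed error falls within k standard errors."""
    return math.erf(k / math.sqrt(2))

# Hypothetical values from the answer: score of 75, standard error of 5.
low, high = score_band(75, 5)
print(low, high)                # 70.0 80.0
print(round(coverage(1.0), 2))  # 0.68
```

Under the normal-error assumption, a one-standard-error band such as 70 to 80 would capture the true score roughly two times out of three; a wider band is needed for more confidence.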

Parents and policymakers. How much do they know about that [range when they] look at tests?

Very little. They believe the score you get is accurate and perfect. They have very, very little sense of understanding of error, which is a problem. ... [I]f I'm one point below a cut score, and I have a standard error of measurement that's plus or minus three points, it's very likely that my true ability is above the cut. And if you have absolutely no knowledge of what a standard error means, you're going to assume you have perfect measurement and fail the student. It's a problem.
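
To put a rough number on the borderline case in this answer, the sketch below treats the student's true score as normally distributed around the observed score, with a spread equal to the standard error. That model, the function name, and the cut score of 70 are assumptions for illustration; only the one-point gap and the three-point standard error come from the example.

```python
import math

def prob_true_above_cut(observed, cut, sem):
    """Chance the true score is above the cut under a normal error model."""
    z = (cut - observed) / sem
    return 0.5 * (1.0 - math.erf(z / math.sqrt(2)))

# Hypothetical student: observed score one point below a cut of 70.
p = prob_true_above_cut(observed=69, cut=70, sem=3)
print(round(p, 2))  # 0.37
```

Under these assumptions the chance that the true score clears the cut is a little over one in three: short of a majority, but large enough that a fail decision based on the single score is far from certain.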

Why do you think people are so ignorant about tests?

Think about just the label of people who [devise] tests. The professional label is a psychometrician. ... It sounds nice, magical. You don't have a clue what it means. It's a field where very few people know about it. ... And it's a field where we haven't done a good job of educating the public.

We build all sorts of interpretive material to support our tests. They cost a lot of money. So rather than going out and doing a better job ... of educating people -- this is what you should use the tests for, this is the type of information or decision it best supports -- I think we have to blame ourselves for not finding a way to communicate.

Part of the problem could be that, as a parent, I can't see the test.

That hasn't always been the case. There were times when the teachers could see the tests. The parents could see the tests. The tests are now a mystery. They're a mystery because we have high stakes attached with the outcome. We're holding folks accountable. You put high stakes on it [and] the pressures to cheat, the pressures to gain an advantage are so high that we're now just putting the test behind this shield. So I agree -- part of the problem is, you don't have a clue what we're testing.

So now will this high stakes direction that we're going in shroud tests further, do you think?

I'm sure, to some degree, yes. And to others, no. If you actually see movement on the federal level to produce a national test, it's going to be under a deep layer of secrecy until it's released, until the students take it. The minute they take it and it's over, release the test. Let everyone know: This is what was out there. This is federal money.

There are only two states now, as far as I understand it, who actually release complete tests. Why don't more states do that?

The cost of building these tests is enormous. The amount of time that's committed is enormous. If you look at something like the ACT or the SAT, where you're building one test on one level ... you're only measuring four or five areas. And you're charging students $25, $30 each to take it. With that source of income, I can replace the test completely every year.

If you look at an achievement battery that's kindergarten through the eighth grade, there are only two forms. They're developed every seven years, on average. And you have 11 to 13 different content areas. The average student in the country may be paying, I don't know what it is now, $6, $7 per kid. It could be $10, $12. You can't afford to release and start over. So from a commercial standpoint, the tests that are being used in the current movement are too expensive to redo. The resources aren't there.

Or we're not just willing. They may not be too expensive. It may be, we're just too cheap.

Oh, I don't think we're too cheap. I think there are lots of other things that our educational funds should go to rather than paying to build a new test. I'd rather put the money on teacher training, put the money into computers. The emphasis on testing is far too high. If we have extra resources, additional tests aren't going to make the difference. The difference is going to come from what the teacher does in the classroom. So give the money to the classroom teachers, to help them.

How appropriate is it to base a graduation requirement on a test score [alone]? ...

Inappropriate. Completely inappropriate. ...

The notion of [a test's] validity is tied to a specific test use. Is the test valid for supporting the types of inferences you're trying to make from a given test use? ... Do I have information to support that my test score or my test can yield a set of scores that you can use to make graduation decisions? That's a notion of validity. ... Most of the tests that are being used now, I don't have information that would support that I could make a graduation decision, based upon a test score. ...

Now, the community of test makers seems to [be] in fairly strong agreement on that. ... Why don't people listen to you?

If you now challenge any of the policymakers, they will tell you they're not using a test as the sole source of information. They're using multiple measures, they're taking into consideration teacher grades, other types of pieces of information. I can't tell you what they all are. ... I have no reason to believe they're lying to us. So I'm assuming they're starting to use other pieces of information.

Well, in Massachusetts, you have to pass a test to get a high school diploma.

Right. But that's not the only thing you have to do. The problem is, the test is the final gatekeeper. So regardless of all the other pieces of information, this test is still serving as the final single source. We would say that's inappropriate. I can't tell you why they're not listening. I wish we knew. Maybe we haven't communicated strongly enough. Maybe we don't know or have the tools necessary to enforce appropriate use of tests. I don't know.

You make these tests and you turn them over and you have no say over [how] they're used?

Which is pretty sad. ...

You hear a lot of politicians these days and business leaders [who say] ... "I'm for tough accountability. Let's test them and make sure they know [the material]. Let's expand testing in more grades." It's as if you could pull these tests somehow out of thin air. I'm wondering, do these people have the realistic notion of how long it takes and what's involved in writing a test?

No. Absolutely not. ... We actually spend seven years developing new forms [for the ITBS]. ... We're constantly making changes. We're trying out items all the time. It's extremely expensive and difficult to get the cooperation from schools across the country to try out the test questions.

So when you hear President Bush say, "Next year, I would like to mandate testing in grades [3] through 8 in ... reading and math, and then we'll expand it," ... what goes through your head?

It's nuts. ... It's not going to happen. It's an impossible task. That can happen if it's planned in time. The biggest piece is, what is the information going to be used for? Bush's plan actually allows the district or state to choose or design any assessment of their choice. ... We do not have the capability of scoring, [of] producing reliable, good reports in a timely manner, if that plan goes through. The resources aren't there. ... For me, it makes me believe it's going to lead to a federal national testing plan where I'm going to mandate a particular test for all kids. And that's a scary thought. ...

some photographs ©2002 getty images all rights reserved
web site copyright 1995-2014 WGBH educational foundation