How does that change things for teachers -- change the job of teaching, the
experience of being in the classroom?
There was a time when teachers worried chiefly about the extent to which they
could transmit important knowledge and skills to youngsters. Now, that
situation has been altered, because they're being held accountable to produce
high scores on tests. As a consequence, the preoccupation with raising test
scores has become dominant throughout most parts of the country. ...
I've spent a lot of time working with teachers in the past several years, and
many of them will recount instances where they or their colleagues had devoted
inordinate amounts of attention simply to raising test scores. The
preoccupation was with test-score raising, not necessarily with teaching kids the
things that children ought to be learning.
You said "preoccupation with raising test scores." How do they do that?
The pressures on teachers are so immense to raise test scores, the pressures
that we see from their administrators, from board members, policy makers, that
teachers are sometimes driven to boost those scores using techniques that are
not all that defensible. For example, they may employ test items that are very
similar to the actual test items on the test and, in some instances, teachers
have actually used the real test items on the test preparation activities. So
that kind of preparation is so enormous, so relentless, that the quality of
schooling, frankly, sometimes is reduced.
How does that change the experience of school for kids? Does it send them
any kind of message if your teacher is spending all of his or her time, or much
of his or her time, doing practice exercises?
To me, one of the most frightening things about the preoccupation of raising
test scores is the message it sends to children about what's important in
school. Rather than trying to make the classroom a learning environment where
exciting new things are required, the classroom becomes a drill factory, where
relentless pressure, practice on test items, may raise test scores -- but may
end up having children hate school.
Describe the climate today. You've been around a while; has it always been
this way with test scores?
Eons ago, I was a high-school teacher, and we had standardized achievement
tests way back then. They were given and they didn't make a lot of difference.
We sometimes used them to make judgements about children -- who is better or
worse than whom -- but they did not influence our conduct. Then these tests
began to be used as part of an accountability drive to make sure that educators
were doing their job well. And the minute those tests became the indicator of
educational quality, all of a sudden they became terribly important.
Is it true that educators were not doing their jobs particularly
There is the belief on the part of the public that schools are not as effective
as they should be. I would share that view. When youngsters are getting
high-school diplomas without being able to read, write and compute, that's not
a good thing. So taxpayers wanted to make sure that their schools were actually
functioning properly and the accountability movement was initiated. It was
enacted by state legislators so that we could have evidence that the schools
were, in fact, running properly. And with that evidence, the role of tests
became very dominant.
At that time, how much thought was given to how you measure educational
quality or educational success? I mean, conceivably there could be many
At the very beginning of the accountability movement, I don't believe the
policy makers really understood what kinds of measures should be used to judge
schools; the policy makers stipulated that student test scores would be the
prime determiner of educational quality. They were nationally standardized
tests. They were produced by reputable companies. So the belief was these will
be the appropriate tests to use. The fact is, however, these are not the
right kinds of tests to use to judge the quality of schooling.
There's an equation out there in the public view: high test scores equals
good quality school; low test scores equals poor performing, bad school. Do you
believe that equation holds true?
The common belief that schools that score high on a standardized achievement
test are effective and that schools that score low are ineffective is simply
misguided. It reflects ignorance about the nature of the tests being used,
because those tests, frankly, in many instances measure the kind of content,
knowledge and skills that children bring to school -- not necessarily what they
learn at school. What you want to judge the quality of schooling is a test
that measures how well children were taught, not whether they come from a ritzy
How could the test be measuring children's backgrounds? It's supposed to be
asking questions about what's being taught in school. What's going on
Traditionally constructed standardized achievement tests, the kinds that we've
used in this country for a long while, are intended chiefly to discriminate among
students ... to say that someone was in the 83rd percentile and someone is at
the 43rd percentile. And the reason you do that is so you can make judgements among
these kids. But in order to do so, you have to make sure that the test has in
fact a spread of scores. One of the ways to have that test create a spread of
scores is to link items in the test to socioeconomic variables, because
socioeconomic status is a nicely spread out distribution, and that distribution
does in fact spread kids' scores out on a test.
What would a question look like that fit into that category?
An example I often use is a question that involved a child's familiarity with
fresh celery. There are actually questions on one of the currently used
standardized achievement tests where you have to know what fresh celery looks
like. But kids from upper-class homes, middle-class homes, where they buy fresh
celery all the time, have a much better shot at that question than do kids from
families where they're getting by on food stamps.
Now, there are many such questions in a test. You wouldn't think there would
be. Why would they have them? But those tests spread out examinee performances
Can you think of some other examples?
I'm thinking of one that I saw in a standardized achievement test recently.
This is one that's currently used right now, where the emphasis was on the
youngster's being able to tell what the word "field" meant. "In which field do
you plan to work after you graduate?" Well, children from families where a
mother or father has a professional field, like a lawyer or a dentist or a
physician, they're going to be more familiar with the word "field" in that
connection than would be a child from a family where a mom is a grocery store
clerk or a dad who works in a car wash. So the kids from the middle- and
upper-class families, where they have fields of occupation, will clearly have a
better shot at that item than will kids from disadvantaged families.
So on these national standardized achievement tests, how much of what's
taught in school shows up on the test itself?
A nationally standardized achievement test is given in about an hour. In about
an hour, you can't test all that much, so you have to sample from larger
domains of knowledge and skills. And what you end up with sometimes does not
match at all well with what's being taught in school or what's supposed to be
taught in school. Some studies suggest that fully 75 percent of what is on a
test is not even supposed to be covered in a particular school. Clearly, it's
unfair to judge the quality of schooling based on a test that's largely
covering things that ought not be taught.
How does that situation occur? How's the test written?
... The tests are created by companies that are in the business of selling
tests, and so they want their tests to be as attractive, as marketable as
possible. So, they try to isolate the content they want to include in the test
by looking at national curricula preferences, like preferences of the National
Council of Teachers of English or Mathematics; they look at state curricula.
They look at textbooks. And they try to create a test that does the best job in
being acceptable to many people.
But you really can't create a one-size-fits-all test because, in attempting to
do so, you still may have some real gaps between what your test measures and
what is taught in a given situation. ...
If one compares the content of textbooks used in mathematics with standardized
achievement tests in mathematics, you will frequently find that fully half of
the content in the test is not addressed in those textbooks -- simply not
addressed in those textbooks.
Sounds patently unfair.
Remember, we live in a country where we're trying to allow local curricula
choice, where the states determine the curriculum. Beyond that, the districts
determine the curriculum. That being the case, there's a lot of variability. We
do not have a national curriculum in this nation. And as a consequence, the
gaps between what is on a test and what is actually taught sometimes are
The tests were designed to spread out the scores. How do they go about doing that?
There are two kinds of items that are very effective in spreading out
youngsters' scores. One kind is an item that is dominantly influenced by the
youngster's socioeconomic background. If you have a middle-class or upper-class
background, you'll do better on the item, because it deals with content more
likely to be encountered by youngsters from that background.
The second kind of item is one that is linked to the inherited academic
aptitudes with which kids are born. Some kids vary in the way they are born
with more verbal aptitude or more quantitative or more spatial aptitude. And
those variations can be used in the test. You can build the test item to
capitalize on what kids are born with, not what they learn in school. ...
What's most disturbing to me is in traditionally constructed standardized
achievement tests, many of the items, such as those that are linked to
inherited academic aptitudes or socioeconomic status, do not measure at all
what is supposed to be taught in classrooms. ... They measure things that
children bring to school. They measure how smart a kid is when he walked
through the door, and not what he was supposed to learn in that school. ...
If the test writers prefer to write questions that may tap an innate ability
or discriminate according to socioeconomic status, what can they ask about?
What are they not asking about? ...
Another problem with standardized achievement tests, traditionally constructed
ones, is that you want to have a very substantial spread of scores. And one of
the best ways to do that is to have questions that are answered correctly by
about 50 percent of the kids; 50 percent get it right, 50 percent get it wrong.
You don't want items in there that are answered correctly by large numbers of youngsters:
80 percent, 90 percent. Unfortunately, those items typically cover the content
the teachers thought important enough to stress.
So the more significant the content, the more the teacher bangs at it, the
better the kids do. And as soon as the kids do very well in that item, 80
percent, 90 percent getting it right, the item will be removed from the test.
... So you miss items covering the most important things that teachers teach.
It may seem strange that these tests are designed not to measure the most
important things that teachers teach. But these tests were not designed to
judge the quality of schooling. The tests were designed to spread out
examinees. ... You don't want items in there that most of the kids get right,
because those items don't spread out examinees. ... So you don't include those.
Unfortunately, it turns out that those items often cover the very most
important things teachers should be teaching.
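
As a rough illustration of the item-selection logic described above, the following Python sketch computes the share of students who answer each item correctly and flags items that most students get right. The response data, the function names and the 0.8 cutoff are all made up for illustration; the interview does not describe any particular vendor's procedure. The sketch also shows why a 50 percent item spreads examinees the most: a right/wrong item's variance, p(1 - p), peaks at p = 0.5.

```python
def item_difficulty(responses):
    """Return the proportion of students answering each item correctly.

    responses: list of per-student lists of 0/1 item scores (hypothetical data).
    """
    n_students = len(responses)
    n_items = len(responses[0])
    return [sum(student[i] for student in responses) / n_students
            for i in range(n_items)]


def item_variance(p):
    # Variance of a right/wrong (0/1) item; it is largest when p = 0.5.
    return p * (1 - p)


def flag_easy_items(p_values, cutoff=0.8):
    # Indices of items answered correctly by at least `cutoff` of students.
    # The cutoff is an assumed value, echoing the "80 percent" in the interview.
    return [i for i, p in enumerate(p_values) if p >= cutoff]


# Made-up responses: 5 students, 3 items (1 = correct, 0 = incorrect).
responses = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 0],
]

p_values = item_difficulty(responses)
easy_items = flag_easy_items(p_values)
print(p_values, easy_items)
```

In this toy data the first item is the one most students answer correctly, so it is the one a spread-maximizing test builder would be tempted to drop.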
California uses the SAT-9 to measure standards. What can you conclude from a
student's test scores on the SAT-9 as to whether or not he or she has learned
enough or met the standards?
A number of states now use standardized achievement tests to measure the
content standards, that is, the knowledge or skills that the state wants
taught. And sometimes the off-the-shelf test is said to be sufficiently aligned
with the standards to serve as a reflection of those standards.
This is simply not the case. If you look at the degree of match between any
commercialized standardized achievement test and a state's content standard,
it's not good enough to make a judgement about whether those standards have
been achieved, and you certainly don't know which standards have been achieved.
So this is simply a pretend assessment. It's not useful for helping teachers
judge or parents judge whether their kids are really learning what they're
supposed to learn.
So, then, this is a misuse of tests. ... Not only as a measure of standards,
but it's also a misuse as a fundamental measure of school quality?
The most profound misuse of educational tests these days is to employ a
traditionally constructed standardized achievement test and base the student's
scores, use those scores, as a reflection of school quality. These tests should
not be used to evaluate school quality. And many citizens think that
should be done and many educators can't disabuse them of that notion, because
they don't know better.
If the tests aren't measuring what's being taught in school, what are
Traditionally constructed standardized achievement tests measure a bit of
what's taught in school. But, by and large, they measure what children bring to
school, not what they learn there. They measure the kinds of native smarts that
kids walk through the door with. They measure the kinds of experiences the kids
have had with their parents. They do not measure, in the main, what is taught
Do you think the politicians know this? They're the ones who sign off on
Most educational policy makers, state board members, members of legislatures,
are well intentioned, and install accountability measures involving these kinds
of tests in the belief that good things will happen to children. But most of
these policy makers are dirt-ignorant regarding what these tests should and
should not be used for. And the tragedy is that they set up a system in which
the primary indicator of educational quality is simply wrong. ...
Because of the misuse of traditionally constructed standardized achievement
tests to judge the quality of schooling, there's some really terrible things
happening to our children in schools these days. One of those is important
curriculum content is being driven out, because it isn't measured by the test.
Another is that kids are being drilled relentlessly on the content of these
high-stakes tests and, as a consequence, are beginning to hate school. And a
third is that, in many instances, teachers are engaging in test preparation,
which hovers very close to cheating, because they're boosting kids' scores
without boosting kids' mastery of whatever the test was supposed to measure.
What's the message to teachers?
Today's accountability framework sends a message to teachers that raising test
scores is all-important. And, as a consequence, teachers frequently don't worry
about the whole education they're providing. They're worried about only what
happens to be covered on that particular high-stakes test that will indicate
how well they're performing. So it's test boosting -- at all costs. And it's
really unfortunate, because the quality of schooling is being lowered as a
A lot of states are moving toward writing their own tests, so-called
criterion-referenced tests, and the feeling is that these tests will reflect
more of what's going on in classrooms. Is that how you see it?
Many states are currently abandoning off-the-shelf standardized achievement
tests and developing customized versions of those tests that supposedly relate
better to the state's curriculum content and what's taught in schools. But the
reality is these tests are typically created by the very same companies that
generated the original traditional standardized achievement test. And in many
instances, there's no reason to believe they function any differently than a
standardized achievement test. Just because a state says it has a so-called
criterion-referenced customized test does not automatically mean that that is a
What could be wrong with that test?
The customized tests that are being built for many states now have the same
kinds of items in them that you'll find in a traditionally standardized
achievement test. They're created by the same companies who have the same item
developers who create the same kinds of items, and they simply try to make it a
little more related to the state's curriculum. The fact is they function
identically to traditionally constructed standardized achievement tests. ...
The people in state departments of education frequently do not know how to
demand the creation of an alternative kind of test. You can have a test that
simply indicates what a student knows and doesn't know. But when these
customized tests are developed, there has to be a new vision of a different
kind of test, and many times that new vision simply doesn't sit there in the
state capital. ...
Now, you have all these standards. You've got 49 states that have adopted
academic standards, and many of them in core subjects. I'm wondering how
helpful are these standards in terms of directing test writers, helping them
know what kinds of questions to ask?
Content standards describe the knowledge and the skills you want kids to learn.
And that's very sensible, to lay out in advance what it is you want children to
learn. Unfortunately, the standards movement in this country is not working as
well as it should, because the people who put together the content standards
are usually curriculum specialists who want children to learn all sorts of
great things. And so the content standards become wishlists of the many things
that you would like children to master. So when you present the content
standards to teachers in that state, there's way too much to cover, there's way
too much to test. And, as a consequence, the standards movement is not having the
positive impact we hoped it would.
On the other hand, you have standards that are incredibly vague. I read one,
"Students will understand historical events in the twentieth century." What do
you do with that one?
There are many standards that are far more vague than they ought to be. My
favorite was that "The student will relish literature." I kept looking for one
that would have mayonnaise mathematics. But those are of no utility to
educators; they have no utility to item writers. They are simply pie-in-the-sky
kinds of aspirations. And so, although it is helpful to identify in advance
what you want children to be able to do after instruction is over, if you
describe this with a litany of vague, ambiguous statements, you haven't
So in some states, the standards movement is more pretense than reality.
But if the standards were very detailed, I would think that might help the
test writers. They would know exactly what they could be asking about. Is that
possible, or am I wrong there?
The virtue of detail is that it would help item writers and it would help
teachers, because they would have a more specific notion about what is to be
accomplished. The downside of that kind of specificity is that it usually ends
up with so many instructional targets the teachers have to cover, they're
simply overwhelmed, as are the test writers. So the trick is to isolate a small
number of really high-powered standards, standards that embrace lesser
sub-skills and focus your instructional energy on that modest number. In
general, the content standards we see in states across the land have not been
isolated in that fashion. ...
How do you suggest they go about writing standards? How do you do it
If I were standards czar, here's exactly what I'd do. I'd go to a specialist
and I'd say, "Isolate the things that you want children to be able to do and
put them in three piles: the absolutely essential, the highly desirable and the
desirable." And having done that, then I get those two piles away and just go
with the absolutely essential. And then I would say, "Now rank them from top to
bottom; the most important, the next most important," and so on.
And then I would have the assessment people come in and say, "These four can be
assessed in the time we have available, and can be assessed in such a way that
teachers will know how to promote children's mastery of them." And then we'd
have a reasonable standards-based assessment system.
You might have to bring in some outsiders; business people or lawyers,
doctors, people in the community.
It's perfectly reasonable to involve people other than educators in the
isolation of what ought to be taught in our schools. Citizens have a stake in
this game, business people, moms and dads. I'd get everyone involved in the
enterprise, just as long as they weren't cowed by the subject matter. I would
not have it decided only by subject-matter specialists, but I would most
assuredly rank in order of import what should be promoted, and then only assess
that which can legitimately be assessed in the time available. ...
You've got all these tests out there. They only seem to ask one or two kinds
of questions. Why is that?
As a practical matter, you can divide the kinds of test questions into all
sorts of categories. But there are really three: there are selected-response
tests, like multiple-choice or true-false tests, where the kid chooses from
choices you present to them; short-answer response, where the kid writes a
phrase or a sentence or two; and then performance-task, where the kid may write
an essay or do something more elaborate. Those are the three kinds.
Now clearly, the first kind, the selected-response test, is much less expensive
to score. The others take scorer time; they're somewhat less precise. The
consequence is most of the tests across the country tend to be dominantly
selected-response in nature. Sometimes a little constructed-response. A little
short-answer, and maybe an essay or two. But the reason you don't have more of
the latter kinds of tests is they cost too much to score.
You're saying we go for cheap tests -- tests that we know may not be as good
as the tests that we could have if we spent more? Is that what you're saying?
Yes. The distressing reality is that the amount of money available for this
kind of assessment operation is usually insufficient to provide for many
students' constructed responses. In some states where they went vigorously for
lots of performance-test items and lots of short-answer items, such as
Kentucky, they've been forced to reduce their attention now to a dominantly
multiple-choice kind of test. It just costs a lot of money. Should we be
spending the money? Of course we should. But that, of course, is a social
"Reliability." On the surface, it seems like reliability is usually described
as consistency in the scores. Is a machine-scored test always a reliable test?
Reliability is a technical characteristic of tests. It's very important. And
unfortunately, it comes in three flavors. When you think about consistency of
the test, it might be a test that's administered one time, and then a week
later, you get the same consistent scores. Or there are two forms of the test:
form A and form B, and you get consistent score reports on both. Or it might be
all the items in the test functioning in about the same way. So there are
different ways to look at reliability.
And once you see that a test is reliable, you always must ask, "What kind of
reliability?" So consistency in items is good. But one kind of reliability is
not identical to the other kinds. ...
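
The three kinds of consistency described above are usually summarized with different statistics. The Python sketch below is one possible illustration, not something the interview specifies: it uses a Pearson correlation for the test-retest case (the same function would apply to form A versus form B), and Cronbach's alpha, one common internal-consistency statistic, for the items-functioning-alike case. All scores are invented.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def cronbach_alpha(item_scores):
    """Internal consistency from per-student lists of item scores."""
    k = len(item_scores[0])
    item_vars = [pvariance([s[i] for s in item_scores]) for i in range(k)]
    total_var = pvariance([sum(s) for s in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Made-up scores for five students on the same test, one week apart.
week1 = [40, 35, 28, 45, 31]
week2 = [42, 33, 30, 44, 29]
test_retest = pearson_r(week1, week2)

# Made-up right/wrong responses: 5 students, 4 items.
items = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(items)
print(round(test_retest, 3), round(alpha, 3))
```

The two numbers answer different questions, which is the point made above: a test can look consistent by one measure without being shown consistent by another.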
Why is reliability important? Why do people make such a big deal out of it?
In general, if a test is reliable, it's more likely you'll make valid
interpretations from it. If a test is absolutely unreliable, that is, if scores
bounced around all the time, you wouldn't even know what the kids came up with;
how could you ever make a valid interpretation? Because this might be Molly's
high day, as opposed to Molly's low day. So tests that are unreliable cannot
yield valid score interpretations. ...
We talked about measuring schools. Let's talk about accuracy for individual
students now. The scores that you get back on these tests, whether they're
criterion state tests, or tests like the SAT-9, how accurate are they? Is a 62
always a 62?
When I first started teaching, I had a girl named Sally Palmer who had an IQ
test score of 126. I believed Sally Palmer had an IQ of 126 and not 127 or 125.
I believed in the accuracy of numbers. And many parents still do. They believe
that those numbers are so darn precise. Psychometricians, experts in testing,
have a term they call the "standard error of measurement," which indicates how
likely it is the kid's score will be off by a certain amount every time the kid
takes a test. It's a lot. These tests are not as precise as is widely believed.
How much is it?
The standard error of measurement depends on the particular test, and how much
variability there is on the scores. And sometimes it can be fairly modest, but
more often than not it's quite substantial. ... These tests are far, far more
approximations than most people believe. ...
Are we talking a few points? Are we talking 20 points? Thirty points?
The standard error of measurement for many tests is such that, let's say on a
50-item test, you might find three or four points as a standard error. ...
I get my score back and it says 62. How should I look at that number? As a
parent, what should I be thinking when I look at that number?
Fortunately, most testing firms these days are beginning to report results as a
score in a particular area. ... They don't give you a single score. Rather,
they'll give you a little chart that has a score in it with a little graph that
says how much higher or how much lower the kid might actually have scored. So
as a parent, you look at that and you say, "Ah, my youngster scored
somewhere in that range; not necessarily that precise point." ...
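
To make the idea of a score range concrete, here is a small Python sketch. It relies on the standard textbook relationship SEM = SD x sqrt(1 - reliability), which the interview does not spell out, and the standard deviation of 10 and reliability of 0.91 are assumed values chosen only so that the result lands near the "three or four points" mentioned earlier.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    # Textbook formula: SEM = SD * sqrt(1 - reliability coefficient).
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed, sem, n_sems=1.0):
    # Range of plausible scores: observed score plus or minus n_sems * SEM.
    return (observed - n_sems * sem, observed + n_sems * sem)


# Assumed, illustrative values; not taken from any real test.
sem = standard_error_of_measurement(score_sd=10.0, reliability=0.91)
low, high = score_band(62, sem)
print(round(sem, 2), round(low, 1), round(high, 1))
```

With these assumed numbers, a reported 62 is better read as "somewhere from about 59 to about 65" than as a precise point, and a wider band (for example `n_sems=2`) would be needed for more confidence.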
In many places, states are making high-stakes decisions -- who graduates,
who doesn't -- using a precise test score -- 220 you're in, 219 you're out. Is
The measurement community is universal in its condemnation of a single
criterion, like a single test score, to be used in making an important decision
like denial of a diploma. But to use a test score as one contributor to a
variety of evidence to make that decision, that's acceptable.
Now the question is, is this kind of test accurate enough to be a contributor?
In most instances, the states allow the youngster to take and retake the test
several times, to make sure that the kid didn't just have a bad outing the
first time and so on. And so if you are allowed retakes, and to have that test
be used as a contributor to the decision, I think that's acceptable. ...
If the scores are in that plus-or-minus five-point range, what makes the
The chief reasons that you have variation in the kids' scores are the kids
themselves. They may literally be approaching the test today with less sleep
than they had the night before; some kind of emotional disturbance, parental
and so on. It may be the way they're responding to the particular items that
preceded certain items on the test. There may be something that offends a given
child. A minority youngster finds minority children depicted in a way that
bothers them. It may be the way they were taught in a particular classroom
allows them to be confused by an item, which itself, had it not been taught
that way, would have been very clear. And so on.
There are all sorts of little things that go into making a kid's performance
less than totally accurate. ... It might be the temperature of the room. It
might be the day of the week. It might be so many things. And so the score,
even though it is a number, and even though it is earned on a test that comes
from a technical firm, may be inaccurate. ...
I'm going to guess that many people out there hear the president speak about
tests, and they hear everyone saying, "I want tough accountability in schools,"
and they think, "Oh, what's the big deal? We write a test." How simple is it to
do this and come up with a good test?
It's hard to write a test that does an accurate job in reflecting what students
have learned, and simultaneously give teachers and students guidance as to what
they should be promoting instructionally. It's very difficult to do that, and
there's an underestimate of that difficulty.
How long does that take? You're constructing a valid, reliable test of
fourth-grade social studies. Start to finish, how long is it going to take?
At one time, I was involved in the development of tests for states. And if you
were developing a test, let's say, of math and English and reading for a
given grade level, you were looking at, at least, a year and a half to two
years for the development of the items, for the field testing of the items, for
the review of the items, to make sure they weren't biased, to measure the right
kind of content. It's not an overnight enterprise, and my guess is, in general,
you're looking at somewhere between a year and three years to develop a
So what do you think when you hear President Bush calling for expanded
testing, every year in grades 3 through 8? Are we ready for that?
My concern about the president's call for more testing is that he and his
advisors may not recognize that if we have more of the same kinds of tests
we're currently using, good things will not happen in American education. I'm
not opposed to high-stakes testing. I think the proper kinds of high-stakes
test could be very useful for not only accountability, but for instruction. But
if we have same old same old, in this instance, we'll be harming the kids, not
And we'll be in fact measuring what we think we're measuring?
We will not be measuring what we think we're measuring, because we'll create
these tests that are designed to spread people out, and not necessarily assess
precisely the knowledge and the skills that our children should be learning. If
you have the right kind of test in there, you can do good things for education.
The wrong kinds of tests can stultify and corrupt education in our country. ...
How long will it take to create the right kind of tests?
It will probably take anywhere from two to three, four years to develop
crackerjack tests across the board. These tests will call for a different way
of thinking about educational assessment, for a different way of thinking about
how you measure content standards. But that thinking would be worth it. ...
What is the proper role of tests in schools?
Educational tests, if properly developed, can be a marvelous tool, not only to
tell the world how well schools are doing, but to help teachers and children
promote the kinds of knowledge and skills that children should be mastering.
You have to think of tests differently than the traditional kinds of tests. My
criticism is not of high-stakes tests, but of traditionally constructed
standardized achievement tests. ...
I met this teacher in California who said, "I don't believe in any testing.
I don't want any testing in my classroom. I don't believe that there's a single
thing that tests can do that can help me do my job better."
There's a resistance emerging in our country to high-stakes tests of any sort.
I think that's unsound. I believe that properly constructed high-stakes tests,
tests that can help teachers teach more effectively, should be used. I think
the public has a right to know how well their schools are doing. So to resist
any kind of testing, I think is disadvantaging the children. You have to create
the right kinds of tests. But they can be a powerful force for instructional
design, for getting kids to learn what they ought to learn.
How can we do that?
You have to build tests in a different way. You build tests with instruction in
mind. You don't build tests to spread out examinees, and stop the action there.
You build tests where you're always thinking, "How could this be promoted by a
reasonably effective instructor? How could this really be taught?" And you
build tests in such a way that they capture worthwhile skills, but capture them
in a way that lets teachers know how they should be teaching. ...
Before the interview, we were talking about my kids and their public school,
and I think you said something like, "Let's do something about this use of
tests, so that when my kids are in school, there will still be public
schools." Do you really think that way? Is this what's at stake here?
The public disenchantment with American schooling is profound. And many people
are looking for alternative solutions, whether they're charter schools or
vouchers or something else. I believe in the public schools. And I believe
those public schools can be made effective if they are not judged with the
wrong assessment tools, and they're given assessment tools that help them do a
better job. I want to see our public schools persist. But I think you have to
start focusing on a different way of measuring their performance.
Are we setting up schools for failure with these tests that we currently
We're making it impossible in many instances for teachers to do any better
unless they cheat. If we build tests that fundamentally measure what children
bring to school, not what they learn there, then quite clearly those children
are never going to get better than what they brought to school. We have to
create tests that really do reflect how well teachers have been teaching. Those
kinds of tests will allow, I think, public education to survive. The kind of
test that we're using now is setting up public educators for absolute failure.
Care to predict where we'll be three years from now?
I think the only way that we're really going to make progress in this arena is
for more people to learn about the subject matter. For more policy makers, for
more citizens, for more parents, to learn about assessment, what tests should
be used, what tests shouldn't be used. Because if they don't learn this, we'll
continue to use the wrong assessment tools.
some photographs ©2002 getty images all rights reserved
web site copyright 1995-2013 WGBH educational foundation