First Person

A Principled Defense Of Standardized Testing

This month, anxiety is high as students across New York State take the latest round of state tests, the first to be tied to the Common Core Learning Standards. There has been an incredible amount of energy invested in public criticism of the testing program, culminating in parents telling their children to refuse to take the state’s tests because they disagree with either how the test results are used or the impact they are having on schools. Additionally, a public hue and cry has gone up about details of the test that some members of the public deem unfair. But some of the criticism that has been directed toward the tests has been misplaced. Understanding basic principles of test design makes it possible to see that the tests represent a good-faith attempt at a daunting, socially important task.

There is something painfully Benedict-Arnoldian about writing this post for me. I am passionately, openly, and sometimes foolishly in love with authentic assessment and portfolio assessment. Working as a teacher, and now working alongside teachers, I have seen the power that relevant and meaningful work holds for students. I found my way into my current position after leaving the classroom for a professional development team funded by the New York State Department of Education. Working on that team, together with personnel from SED, gave me a chance to get to know the people behind the names on mastheads and SED edicts. Those experiences and names didn’t leave me once the funds ended. From my work on the school support team, I was lucky enough to join an organization that devotes itself to learner-centered practices and supporting schools to design curriculum, instruction, and assessments that are rigorous, meaningful, and relevant. One of the amazing things I get to do for a living is help schools design performance-based assessments that ask students to do something with what they have learned, not just recall what they’ve learned. My job does not depend on the success or failure of state tests. I have no stake in testing itself, beyond that of a taxpayer and an educator privileged to work with teachers and schools. So my passionate belief in the craft of the teaching profession comes from my professional experience in classrooms and schools. I believe, adamantly, in using both the science of learning and the art of instruction to provide a quality public education to all of New York’s students.

I’ve attempted to pull out five things that parents and the GothamSchools community may find interesting, or should know, about psychometrics, the science of test design. As you read through this, I invite you to think about policy makers and test designers sitting in side-by-side boxes. Test designers design tests given certain parameters. Policy makers, and politicians, attempt to make decisions about the results of those tests. The wall between them may feel thin, but things like Race to the Top, teacher evaluations, etc. are separate from test design. It can feel easier to see into the policy maker box — which houses State Education Commissioner John King and Mayor Bloomberg, among others — than into the “test designer” box. It might seem like the test designer box is plastered with Pearson stickers, and while in some ways that’s accurate — the state did award Pearson the latest contract for test design — Pearson is following a protocol established by New York State and the field of psychometrics to inform the creation of the New York State Assessments. The test designers, the people moving the tests from conception to the tests seen by students this month, are not sitting in their box conspiring to make students fail. They are highly educated members of a field committed to honoring the art of teaching through the science of assessment. (Most have doctorates in psychometrics or statistics. That’s why I’m a groupie and not a card-carrying psychometrician myself: My degrees, certification, and teaching experience are in special education.)

So without further ado, here are five important things every New Yorker should know about test design:

  1. Learning, like health, is a construct. Your doctor can’t directly measure how healthy you are, but he or she can directly measure variables that reflect health. For example, your heart rate at this moment doesn’t describe whether you are healthy. It’s just a number that reflects an attribute of health. Your doctor can take your blood pressure, take your temperature, and ask you how you’re feeling, and combine those data points to ascertain whether you’re healthy. Test design is similar. When we assess (strategically collect evidence of students’ learning), we can only assess a proxy or an attribute of that learning. We can’t pull out a child’s brain, slap it on a scale and say, yup, they’ve learned this much (and for the record, I didn’t just reveal some grand conspiracy. No one wants to weigh children’s brains). No standardized test that a child will experience can capture those amazing traits and attributes that make that child such a beautiful little person. Most importantly — they don’t claim to. A test designer’s job is to create a tool that can measure particular skills and particular standards in a particular way. Designers will tell us what they are measuring in two ways. First, they will establish the purpose for an assessment. Second, they will release a table of specifications (also known as a test map). New York State shares both of these details from its tests (here and here). The history of assessment design has some parallels in the evolution of the medical field. A hundred years ago, doctors were boring into patients’ brains to relieve migraines. The profession got better as the science got better. Much of the science behind medicine is inaccessible to the public, but SED, and psychometricians, put the science of their field right out for the public to access. (It was this transparency that uncovered the serious problems with the way Pearson scored New York City’s gifted and talented exams, discussed in more detail later on.)
The technical reports describe step by step how NYSED ensures the tests are measuring the right constructs, and the design consideration documents clarify what the designers need to attend to. This transparency is important yet often overlooked.
  2. Security does not mean secrecy. Many are clamoring to see the items so they can judge them, and they view the state’s decision to keep the items secure as a way to keep parents from seeing and critiquing any bad items. Yet that’s not really what’s happening. The grade 3-8 tests were not secure for years, and I could not find any evidence of parents or the public successfully challenging or critiquing an item. When looking at an item, it’s important to remember that being an adult means having years and years of background knowledge. When adults look over the shoulder of a child or sit down to take a test designed for children, their response is influenced by their own background experiences. Even the most experienced third-grade teacher cannot see through the eyes of an 8-year-old. This does not mean adults can’t empathize or critique some aspects of item quality, but their assessment is complicated by the fact that what makes an item high-quality may vary from parent to parent, teacher to teacher. And the most useful evidence of the actual difficulty of an item is the feedback from the children taking a test — a child saying, “Question 7 was hard” is a powerful perceptual data point — but children might well report an item as easy or hard when their responses indicate the opposite. Knowing all of this, when New York State educators helped review the items students are seeing on the tests (the state requires that Pearson include teachers in item review), we have to assume they did their very best to ensure the items students see are fair, rigorous, and aligned to the Common Core standards. Moving forward, field testing gives the test makers a real sense of an item’s quality and usefulness, and the students’ answers this week give them the final components needed to determine how high-quality a test is or isn’t.
There are lots of checks and balances in test design to ensure that items are of high quality, including ways to handle students guessing, leaving items blank, and being confused by a particular wrong choice. Technical reports provide multiple examples of these balances. Keeping items secure after administration is a tough pill to swallow for those of us who want to dig into the item response data and technical reports, but it’s a way to keep costs to taxpayers down and increase the strength of the tests from year to year.
  3. Public accountability is a part of the social contract. Ever since President George W. Bush asked the question “[are] our children learning?” we’ve been thinking differently about how we capture evidence of success in public schools. Can success be determined by looking at some students in some schools? Or do we need to look at all students in all schools? New York, and most other states, concluded that we have an obligation to generate a data point for each child. (Again, no conspiracy theory. There is no goal to view a child as just a number or a data point. It’s akin to taking the temperature of all of the children in the state. At the same time. In the same way. Your opinion of testing may influence what type of thermometer you picture in that metaphor.) Assessing every public school child in the state is an awesome undertaking that might or might not be warranted (there is lots of conversation among psychometricians about population sampling on large-scale assessments). Asking every teacher to report on student learning in a way that is accessible to taxpayers is lovely in theory, but impossible to design on a very large scale. In New York, we’re asking a slightly different question this year: “Can New York State public school students successfully respond to questions and tasks aligned to our new standards?” The tests we’re seeing now, I believe, come from a sincere effort on King’s part to answer that question. While some argue that student performance is a conversation that should be limited to parents and teachers, the social contract of public education is about all Americans, all taxpayers. Alternative assessments have been successful on a smaller scale through performance and portfolio consortiums, but the time those schools invest in assessment design, administration, and scoring is considerable. Where this public contract runs into problems is what happens with the data once they’re generated — and policy makers start making policy.
How we communicate about student learning with the public is an important conversation. It’s a different fight, though, from the fight over the tests themselves.
  4. NYSED wants assessments that are worthy of the state’s students. Our state takes pride in having one of the longest-standing departments of education. Look at national or federal panels around practically any educational issue, and you’ll likely see a New Yorker’s name. Our state has a reputation to uphold when it comes to innovation and commitment to students. (Case in point: the state added a bunch of standards to the Common Core before adoption. The most frequently occurring verbs in the New York State-added standards? Create, engage, and seek to understand.) Linda Darling-Hammond at Stanford University is doing very exciting work about the next generation of assessments, and the two national test consortiums are spending a lot of time figuring out what that means. Test designers write, research and think about what it means to measure learning, and a considerable amount of thinking is done about making sure the tools we use are the best possible. Ten years ago, the fourth-grade state assessments almost always included a fable. Fourth-grade teachers across the state taught fables, regardless of their curriculum. Our system will respond to the measurement tools we use, which is why we have an obligation to make those tools the best they can be. There is a lot riding on these Pearson contracts. The company dropped the ball last year by skipping a final review of the Spanish-language tests, and King rightly reminded Pearson officials of the terms of its contract with the state. Again last week, much to the chagrin of those advocating for getting this right, Pearson announced a serious scoring issue with New York City’s gifted and talented tests. The gaffe is inexcusable. The only bright spot in the G&T debacle is how the problem was uncovered. It was the transparency mentioned in point 2 that allowed parents to review their child’s score. Pearson is big, but not so big as to withstand getting dumped by New York.
This is not a defense of Pearson. What happened was indefensible and will likely be traced to simple, boring human error (i.e., the wrong scale was used when converting scores). However, it is important to note that Pearson worked to right the wrong once it was brought to the company’s attention, and knows that the spotlight is brighter than ever on its test design departments — as it should be. As justified as the anger is around the G&T test, though, other complaints may not rest on such solid ground. As we move toward worthier tests, our state is making small moves. One of the details the state put in the Pearson contract is the inclusion of authentic texts. Since textbook publishers, of which Pearson is one, also use authentic texts, people noticed overlap between textbooks and the tests. As provocative as the overlap sounds, it’s a coincidence. For each passage that appears in a Pearson textbook, we could likely find a passage in a McGraw Hill or Harcourt reading series. Authentic texts also talk about the world around us — which at times includes trademarked names. Including a trademarked name isn’t a way for psychometricians to get kickbacks; instead, it is a way to use literature or informational texts that reflect students’ worlds. So while these details — Pearson textbooks containing passages and the inclusion of trademarked names in text — feel like design errors or additional reasons to fault Pearson, they are the natural consequence of increasing the texts’ authenticity.
  5. Test designers don’t control what happens in the classroom. As many have reported, students are spending 540 minutes taking the state’s math and reading tests this year. If students go to school for 180 days and spend, let’s say, five hours a day learning, that’s 1 percent of their school year devoted to taking part in an assessment of public education across the state. Yet schools have reported weeks spent on test prep and hours spent taking practice tests. Test designers and Pearson make for a great focus of our anger because they seem faceless and powerful. But they have no direct say in what happens in the state’s classrooms the other 99 percent of the time. Since the state’s tests are not yet worthy of our students (but getting better), it is critical that the assessments students see on a daily basis — teacher- and school-designed assessments — are rich, complex, challenging and authentic. Pearson doesn’t dictate how to talk about the tests with students or what happens at the local level with the results. Parents reported children cried because they were told they would fail if they didn’t pass. I wish I could sit down with teachers who believe that and walk them through the 1999 APA Testing Standards and the Code of Fair Testing Practices in Education to show them what testing is intended to do and the foundations that test designers operate on. Part of the challenge of establishing the science of our profession is figuring out how we talk about the science, and how we balance the construct of learning with the concept of quality public education. Test designers are not evil. They don’t want to trick children. They don’t want students to fail. They want to measure proxies of learning in order to provide evidence to answer questions about the health of our public schools. At this point, I would want to point to how the tests are supposed to be used. The first item in the 1999 APA Testing Standards concerns purpose.
That is, test results should only be used for the purpose the test was designed for. This item is usually the first thing that appears in a test technical report. It’s on the first page of the state’s 2011 technical report and appears on the list of required items for the required technical report in the state’s contract with Pearson. Going right to the source — how does the state intend for the results of these assessments to be used? — would be a great way for me to demonstrate the difference between test designers and policy makers. Except I can’t do that, because the 2012 technical report hasn’t been posted yet and the 2013 report won’t be out until at least 2014. That’s a problem that makes it harder for us to separate design and policy. Despite the report’s absence, we can still push back against bad policy. We can continue to raise questions about “value-added” measures of teacher quality, as many test designers and psychometricians are doing. We can demand that King honor the statements he made about no school being penalized based on this year’s scores. We can work to convince Bloomberg to change his mind about basing summer school enrollment and retention on state assessment scores.
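Point 2 above mentions checks and balances for students guessing. In psychometrics, a standard tool for this is item response theory; the three-parameter logistic (3PL) model, for example, includes a “pseudo-guessing” parameter that puts a floor under the probability of a correct answer. Here is a generic sketch of that model — it is a textbook illustration, not necessarily the specific model New York State’s contractors use:

```python
import math

def p_correct(theta, a, b, c):
    """3PL item response model: probability that a student of ability
    theta answers an item correctly.
    a = discrimination, b = difficulty,
    c = pseudo-guessing floor (e.g., 0.25 for a 4-choice item)."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

# Even a very low-ability student stays near the guessing floor:
print(round(p_correct(theta=-3.0, a=1.2, b=0.0, c=0.25), 2))  # 0.27
# A high-ability student approaches certainty:
print(round(p_correct(theta=3.0, a=1.2, b=0.0, c=0.25), 2))   # 0.98
```

Fitting models like this to field-test data is part of how designers flag items where guessing, blanks, or a seductive wrong choice distort the picture of what students know.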

There are, to be sure, issues around this year’s assessments, such as their timing and length. Some of these issues will be resolved in the post-test reliability procedures. Some will be resolved by the movement to computer-based testing in the future. But regardless of how they are resolved, please be assured that there is a science to test design. What students saw last week and will see this week was reviewed by local teachers, vetted for quality and worthiness, and represents a sincere effort to answer the question, “Are we providing New York State public education students with a quality education?”

First Person

I’m a principal who thinks personalized learning shouldn’t be a debate.

PHOTO: Lisa Epstein
Lisa Epstein, principal of Richard H. Lee Elementary, supports personalized learning

This is the first in what we hope will be a tradition of thoughtful opinion pieces—of all viewpoints—published by Chalkbeat Chicago. Have an idea? Send it to

As personalized learning takes hold throughout the city, Chicago teachers are wondering why a term so appealing has drawn so much criticism.

Until a few years ago, the school that I lead, Richard H. Lee Elementary on the Southwest Side, was on a path toward failing far too many of our students. We crafted curriculum and identified interventions to address gaps in achievement and the shifting sands of accountability. Our teachers were hardworking and committed. But our work seemed woefully disconnected from the demands we knew our students would face once they made the leap to postsecondary education.

We worried that our students were ill-equipped for today’s world of work and tomorrow’s jobs. Yet, we taught using the same model through which we’d been taught: textbook-based direct instruction.

How could we expect our learners to apply new knowledge to evolving facts, without creating opportunities for exploration? Where would they learn to chart their own paths, if we didn’t allow for agency at school? Why should our students engage with content that was disconnected from their experiences, values, and community?

We’ve read articles about a debate over personalized learning centered on Silicon Valley’s “takeover” of our schools. We hear that Trojan Horse technologies are coming for our jobs. But in our school, personalized learning has meant developing lessons informed by the cultural heritage and interests of our students. It has meant providing opportunities to pursue independent projects, and differentiating curriculum, instruction, and assessment to enable our students to progress at their own pace. It has reflected a paradigm shift that is bottom-up and teacher led.

And in a move that might have once seemed incomprehensible, it has meant getting rid of textbooks altogether. We’re not alone.

We are among hundreds of Chicago educators who would welcome critics to visit one of the 120 city schools implementing new models for learning – with and without technology. Because, as it turns out, Chicago is fast becoming a hub for personalized learning. And, it is no coincidence that our academic growth rates are also among the highest in the nation.

Before personalized learning, we designed our classrooms around the educator. Decisions were made based on how educators preferred to teach, where they wanted students to sit, and what subjects they wanted to cover.

Personalized learning looks different in every classroom, but the common thread is that we now make decisions looking at the student. We ask them how they learn best and what subjects strike their passions. We use small group instruction and individual coaching sessions to provide each student with lesson plans tailored to their needs and strengths. We’re reimagining how we use physical space, and the layout of our classrooms. We worry less about students talking with their friends; instead, we ask whether collaboration and socialization will help them learn.

Our emphasis on growth shows in the way students approach each school day. I have, for example, developed a mentorship relationship with one of our middle school students who, despite being diligent and bright, always ended the year with average grades. Last year, when she entered our personalized learning program for eighth grade, I saw her outlook change. She was determined to finish the year with all As.

More than that, she was determined to show that she could master anything her teachers put in front of her. She started coming to me with graded assignments. We’d talk about where she could improve and what skills she should focus on. She was pragmatic about challenges and so proud of her successes. At the end of the year she finished with straight As—and she still wanted more. She wanted to get A-pluses next year. Her outlook had changed from one of complacence to one oriented towards growth.

Rather than undermining the potential of great teachers, personalized learning is creating opportunities for collaboration as teachers band together to leverage team-teaching and capitalize on their strengths and passions. For some classrooms, this means offering units and lessons based on the interests and backgrounds of the class. For a couple of classrooms, it meant literally knocking down walls to combine classes from multiple grade-levels into a single room that offers each student maximum choice over how they learn. For every classroom, it means allowing students to work at their own pace, because teaching to the middle will always fail to push some while leaving others behind.

For many teachers, this change sounded daunting at first. For years, I watched one of my teachers – a woman who thrives on structure and runs a tight ship – become less and less engaged in her profession. By the time we made the switch to personalized learning, I thought she might be done. We were both worried about whether she would be able to adjust to the flexibility of the new model. But she devised a way to maintain order in her classroom while still providing autonomy. She’s found that trusting students with the responsibility to be engaged and efficient is both more effective and far more rewarding than trying to force them into their roles. She now says that she would never go back to the traditional classroom structure, and has rediscovered her love for teaching. The difference is night and day.

The biggest change, though, is in the relationships between students and teachers. Gone is the traditional, authority-to-subordinate dynamic; instead, students see their teachers as mentors with whom they have a unique and individual connection, separate from the rest of the class. Students are actively involved in designing their learning plans, and are constantly challenged to articulate the skills they want to build and the steps that they must take to get there. They look up to their teachers, they respect their teachers, and, perhaps most important, they know their teachers respect them.

Along the way, we’ve found that students respond favorably when adults treat them as individuals. When teachers make important decisions for them, they see learning as a passive exercise. But, when you make it clear that their needs and opinions will shape each school day, they become invested in the outcome.

As our students take ownership over their learning, they earn autonomy, which means they know their teachers trust them. They see growth as the goal, so they no longer finish assignments just to be done; they finish assignments to get better. And it shows in their attendance rates – and test scores.

Lisa Epstein is the principal of Richard H. Lee Elementary School, a public school in Chicago’s West Lawn neighborhood serving 860 students from pre-kindergarten through eighth grade.

Editor’s note: This story has been updated to reflect that Richard H. Lee Elementary School serves 860 students, not 760 students.

First Person

I’ve spent years studying the link between SHSAT scores and student success. The test doesn’t tell you as much as you might think.

PHOTO: Robert Nickelsberg/Getty Images

Proponents of New York City’s specialized high school exam, the test the mayor wants to scrap in favor of a new admissions system, defend it as meritocratic. Opponents contend that when used without consideration of school grades or other factors, it’s an inappropriate metric.

One thing that’s been clear for decades about the exam, now used to admit students to eight top high schools, is that it matters a great deal.

Students admitted may receive not only a superior education, but also access to elite colleges and, eventually, better employment. That system has also led to an under-representation of Hispanic students, black students, and girls.

Beginning as a doctoral student at The Graduate Center of the City University of New York in 2015, and continuing in the years since I received my Ph.D., I have tried to understand how meritocratic the process really is.

First, that requires defining merit. Only New York City defines it as the score on a single test — other cities’ selective high schools use multiple measures, as do top colleges. There are certainly other potential criteria, such as artistic achievement or citizenship.

However, when merit is defined as achievement in school, the question of whether the test is meritocratic is an empirical question that can be answered with data.

To do that, I used SHSAT scores for nearly 28,000 students and school grades for all public school students in the city. (To be clear, the city changed the SHSAT itself somewhat last year; my analysis used scores on the earlier version.)

My analysis makes clear that the SHSAT does measure an ability that contributes to some extent to success in high school. Specifically, an SHSAT score explains 20 percent of the variability in freshman grade-point average among all public school students who took the exam. Students with extremely high SHSAT scores (greater than 650) generally also had high grades when they reached a specialized school.

However, for the vast majority of students who were admitted with lower SHSAT scores, from 486 to 600, freshman grade point averages ranged widely — from around 50 to 100. That indicates that the SHSAT was a very imprecise predictor of future success for students who scored near the cutoffs.

Course grades earned in the seventh grade, in contrast, explained 44 percent of the variability in freshman year grades, making them a far better admissions criterion than SHSAT score, at least for students near the score cutoffs.

It’s not surprising that a standardized test does not predict as well as past school performance. The SHSAT represents a two and a half hour sample of a limited range of skills and knowledge. In contrast, middle-school grades reflect a full year of student performance across the full range of academic subjects.
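The “percent of variability” figures in this piece are R-squared values: the squared correlation between a predictor and freshman GPA. A toy simulation with invented numbers (not the author’s data) illustrates the mechanic described above, in which a year-long average of a noisy signal predicts an outcome better than a single noisy sample of the same signal:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Invented model: a latent "ability" drives all three measures.
ability = rng.normal(size=n)
test_score = ability + rng.normal(scale=2.0, size=n)   # one-day, noisy sample
seventh_gpa = ability + rng.normal(scale=1.0, size=n)  # year-long average
freshman_gpa = ability + rng.normal(scale=1.0, size=n)

def variance_explained(x, y):
    """R^2 of a simple linear regression of y on x (squared correlation)."""
    r = np.corrcoef(x, y)[0, 1]
    return r ** 2

print(variance_explained(test_score, freshman_gpa))   # lower
print(variance_explained(seventh_gpa, freshman_gpa))  # higher
```

The specific numbers depend entirely on the invented noise levels; the point is only that the noisier one-shot measure explains less of the outcome’s variance.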

Furthermore, an exam which relies almost exclusively on one method of assessment, multiple choice questions, may fail to measure abilities that are revealed by the variety of assessment methods that go into course grades. Additionally, middle school grades may capture something important that the SHSAT fails to capture: long-term motivation.

Based on his current plan, Mayor de Blasio seems to be pointed in the right direction. His focus on middle school grades and the Discovery Program, which admits students with scores below the cutoff, is well supported by the data.

In the cohort I looked at, five of the eight schools admitted some students with scores below the cutoff. The sample sizes were too small at four of them to make meaningful comparisons with regularly admitted students. But at Brooklyn Technical High School, the performance of the 35 Discovery Program students was equal to that of other students. Freshman year grade point averages for the two groups were essentially identical: 86.6 versus 86.7.

My research leads me to believe that it might be reasonable to admit a certain percentage of the students with extremely high SHSAT scores — over 600, where the exam is a good predictor — and admit the remainder using a combined index of seventh grade GPA and SHSAT scores.

When I used that formula to simulate admissions, diversity increased somewhat. An additional 40 black students, 209 Hispanic students, and 205 white students would have been admitted, as well as an additional 716 girls. It’s worth pointing out that in my simulation, Asian students would still constitute the largest segment of students (49 percent) and would be admitted in numbers far exceeding their proportion of applicants.
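The two-stage rule described above can be sketched in a few lines. Everything in this sketch is invented for illustration: the applicant pool, the seat count, and the equal weighting of the two z-scores (the author’s exact index and data are not specified here).

```python
import numpy as np

rng = np.random.default_rng(1)
n_applicants, n_seats = 5_000, 800

# Invented applicant pool.
shsat = rng.normal(500, 60, n_applicants)
gpa = rng.normal(85, 8, n_applicants)

# Stage 1: automatically admit very high scorers, where the exam predicts well.
auto = np.flatnonzero(shsat > 600)

# Stage 2: fill the remaining seats by a combined z-score index (equal
# weights here, a hypothetical choice).
def z(x):
    return (x - x.mean()) / x.std()

index = z(shsat) + z(gpa)
admitted = set(int(i) for i in auto)
for i in np.argsort(-index):        # best combined index first
    if len(admitted) >= n_seats:
        break
    admitted.add(int(i))

print(len(admitted))  # exactly n_seats
```

A real analysis would also need tie-breaking rules, per-school seat counts, and applicants’ school preferences; this only shows the shape of the rule.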

Because middle school grades are better than test scores at predicting high school achievement, their use in the admissions process should not in any way dilute the quality of the admitted class, and could not be seen as discriminating against Asian students.

The success of the Discovery students should allay some of the concerns about the ability of students with SHSAT scores below the cutoffs. There is no guarantee that similar results would be achieved in an expanded Discovery Program. But this finding certainly warrants larger-scale trials.

With consideration of additional criteria, it may be possible to select a group of students who will be more representative of the community the school system serves — and the pool of students who apply — without sacrificing the quality for which New York City’s specialized high schools are so justifiably famous.

Jon Taylor is a research analyst at Hunter College analyzing student success and retention.