The study measured teachers against the criteria in Charlotte Danielson's Framework for Effective Teaching rubric, which is used in New York as a tool for observing teachers. Teachers scored better at classroom management than they did on measures of higher-order instructional challenges, such as asking productive questions.

A historic look inside the nation’s classrooms, including some in New York City, painted a bleak picture, according to a report released by the Bill and Melinda Gates Foundation today.

The second installment of the foundation’s ambitious Measures of Effective Teaching study, the report focuses on the picture of teaching yielded by five different classroom observation tools. It also scrutinizes those tools themselves, concluding that they are valuable as a way to help teachers improve but only useful as evaluation tools when combined with measures of student learning known as value-added scores.

The conclusion is a strong endorsement of the Obama administration’s approach to improving teaching by implementing new evaluations of teachers that draw on both observations and value-added measures. New York State took this approach to overhauling its evaluation system when it applied for federal Race to the Top funding.

Among the five observation tools the foundation studied is Charlotte Danielson’s Framework for Effective Teaching, the rubric now being piloted in New York City classrooms as part of stalled efforts to change teacher evaluation.

Through all five lenses, instruction looked mediocre in an overwhelming majority of more than 1,000 classrooms studied, the report concludes. There were some bright spots. Many teachers scored relatively well on the aspect of teaching known as “classroom management,” which covers keeping students well-behaved and making sure they are engaged.

But teachers often fell short when it came to other elements of teaching, such as facilitating discussions, speaking precisely about concepts, and carefully modeling skills that students need to master. These higher-order skill sets, the report notes, are crucial in order for students to meet the raised standards outlined in the Common Core.

The study is the most expansive known examination of instruction in the U.S., reviewing more than 1,000 teachers for this report and nearly 3,000 for the study. Its lead authors are the economists Thomas Kane, of Harvard, and Douglas Staiger, of Dartmouth, although more than a dozen researchers contributed to the study.

The evaluations were conducted by trained evaluators, who watched clips from videotape of more than 1,000 teachers around the country and then judged whether the teaching exhibited certain traits outlined in the observation tools.

One complicated aspect of the study is that it doesn’t just ask what the observation tools have to say about teaching; it also asks whether those observation tools are good ways to measure teaching at all. The question is crucial to the contentious teacher quality debate.

Motivated by the Obama administration’s focus on improving teaching by improving the way teachers are evaluated, the teacher quality debate has been dominated by a search for a better evaluation tool. The idea is that if school districts could have a better way to sort teachers, then they could increase quality by rewarding those who are most effective and improving or removing those who are less effective.

The study offers a qualified endorsement of the five observation tools it studied, saying that they should be one of multiple evaluation measures but that no one observation tool should be a sole measure. While the study found that all five observation tools had a positive association with student achievement as measured by value-added scores, the associations were not perfect.

And the tools’ reliability was relatively low — lower, in some cases, than the famously volatile judgments of value-added measures. When different observers used the same tool to evaluate the same teacher, they sometimes gave very different scores.

But the report does endorse using the observation tools in combination with value-added measures, as New York’s new evaluation system does. When researchers combined multiple observation tools’ judgments of teachers and then combined those with the teachers’ value-added scores, the result was a view of a teacher that was better able to predict future student achievement, the report says.

A final complication worth noting is that the study’s ultimate arbiter of what makes a good evaluation tool is itself under heavy scrutiny. That arbiter is a teacher’s value-added score, an estimate that attempts to extrapolate the amount of student learning for which a teacher can be held responsible, excluding other factors such as a student’s family income level.

A study that was the subject of a story in today’s New York Times found that value-added scores are indeed useful predictors not only of student achievement but also of other measures of life success. But other researchers have cast doubt on value-added measures’ validity, citing a host of concerns, from the measures’ volatility to whether a high value-added score reflects true student learning or simply effective test prep.

Though an overhaul of teacher evaluation in New York has been stalled by the failure of teachers unions and school districts to agree on how to conduct it, both the New York City teachers union and the Department of Education agreed to participate in the Gates Foundation study when it launched in 2009. The union helped recruit teachers to join, and ultimately, teachers from about 100 schools signed up to have their lessons videotaped and analyzed.

“It takes the politics out of what’s being measured,” UFT president Michael Mulgrew said when the union first agreed to participate. “Teachers are very frustrated with the political debate. They are always saying, ‘why don’t you just come into the classroom?’ That’s what this is doing.”

Since then, the politics of teacher quality have grown even more heated.

Last summer, a GothamSchools reader who had worked in a school piloting the Danielson evaluation said it was very hard for teachers to be rated “effective.”