Although value-added assessments are defensible for evaluating teacher effectiveness, student test scores need not be the only measure of teacher quality. Principal and vice principal evaluations can also help pinpoint good teaching, and policymakers who face resistance to value-added assessment may want to consider offering to include supervisor evaluations as well. As a practical matter, however, many of the same groups that unremittingly point out flaws of value-added measurements also argue that supervisor evaluations are biased and capricious.
Yet principal or vice principal evaluations are superior to peer evaluations or parent evaluations, which are more likely to suffer from subjectivity.[*] Research findings also suggest that principals are capable of measuring teacher effectiveness.[†]
A recent RAND Corp. working paper on merit pay by Richard Buddin and colleagues lists some potential limitations to supervisor evaluations of worker effectiveness. The researchers explain that it can be difficult to correct for the inherent subjectivity of any performance evaluation that involves individual supervisor judgment. They add that problems can also arise when workers perceive favoritism and that a subordinate’s personality or demographics can interfere with supervisor objectivity. They also note that supervisors may be hesitant to judge performance accurately out of fear of reprisals from disgruntled workers. Finally, they write, "Compression of scores or rankings towards the upper end of the distribution is likely to occur when evaluations are used as part of a pay setting." Buddin et al. also refer to a recent study by Brian Jacob of the University of Michigan and Lars Lefgren of Brigham Young University on principals’ ability to evaluate teacher performance.
Jacob and Lefgren asked principals in an unidentified Midwestern school district to rate 202 teachers of core subjects in grades two through six during the 2002-2003 school year. The principals rated each teacher on a scale from one to 10 on a number of traits traditionally seen as related to teacher effectiveness, such as classroom management skills. Jacob and Lefgren also calculated the student achievement test score gains for each teacher. Then they compared principals’ ratings of effectiveness to actual effectiveness as measured by student achievement gains. They found that principal ratings and value-added calculations were roughly equal in identifying the most and least effective teachers, but that principals were less able to differentiate effectiveness in the middle of the teacher quality distribution. They also examined the extent to which a teacher’s education and experience, which are the basis of the single salary schedule, are good predictors of student achievement growth. On this question, they found that education and experience were inferior predictive measures of teacher quality.
Interestingly, Jacob and Lefgren found that principal evaluations were better predictors of parent preferences for specific teachers than were the teachers’ value-added achievement measures, years of experience, education or compensation. While this finding could be taken as a sign that principals and parents are equally "wrong," the finding probably indicates that principals perceive teacher characteristics that parents tend to value, even though these characteristics may not be measured by standardized tests.
Although principal ratings are good indicators of teacher effectiveness in the classroom, Jacob and Lefgren are careful about recommending the use of this rating mechanism. They note that their experiment was carried out in a setting in which principals did not face job pressure to identify effective teachers and in which principals’ evaluations were kept confidential and not made available to the teachers themselves. They explain that the effect of a higher-stakes environment is unclear: While the increased importance of the evaluation might motivate principals to be even more accurate, it might also make them reluctant to assess teachers honestly for fear of reprisals.
Jacob and Lefgren also found that principals, regardless of their own sex, routinely discriminated against male and untenured faculty. They wrote: "Specifically, principals rate both male and untenured teachers roughly 0.3 to [0.5] standard deviations lower than their female and tenured colleagues with the same actual proficiency." They offered a lengthy set of possible explanations for this discrimination without any firm conclusion, but stated, "Regardless of the cause, however, this discrimination may place male and untenured teachers at a disadvantage in a system that relies more heavily on principal assessment." Ultimately, this and the study’s other findings indicate that although principal evaluations may have drawbacks, they can help identify good teachers.
Recent research findings by Douglas Harris and Florida State University’s Tim Sass also suggest that principal evaluations can help identify teacher quality. In a 2007 study, Harris and Sass compared principals’ private ratings of teachers in an anonymous Florida school district to value-added calculations of teacher effectiveness. The 30 principals included in the study spanned elementary, middle and high school grades. Harris and Sass wrote, "We find a positive and significant correlation between teacher value-added and principals’ subjective ratings and that principals’ evaluations are generally, though not always, better predictors of a teacher’s value-added than traditional approaches to teacher compensation that focus on experience and formal education." Like Jacob and Lefgren, Harris and Sass advised caution in using principal evaluations in teacher accountability or reward systems; they did not dismiss this possibility, however.
As this research suggests, principals are generally capable of evaluating teacher effectiveness. Principals’ input can be used as a supplement to value-added assessment and to help address concerns over value-added measures of teacher effectiveness.
[*] For an argument on site-based management reform, see Angus McBeath, “The Edmonton Public Schools Story: Internationally Renowned Superintendent Angus McBeath Chronicles His District's Successes and Failures” (Mackinac Center for Public Policy, 2007), www.mackinac.org/archives/2007/s2007-13.pdf (accessed May 18, 2008).
[†] In a recent report on teacher evaluation systems, Thomas Toch and Robert Rothman of Education Sector, an education policy think tank in Washington, D.C., raise concerns about the current methods of measuring teacher quality (see Thomas Toch and Robert Rothman, “Rush to Judgment: Teacher Evaluation in Public Education” (Education Sector, 2008), www.educationsector.org/usr_doc/RushToJudgment_ES_Jan08.pdf (accessed June 26, 2008)). In particular, Toch and Rothman criticize the common practice of having a single supervisor assess teacher performance through a single classroom observation.
It is valid to criticize the practice of principals’ making uninformed personnel evaluations, and it is reasonable to encourage principals to supplement the information gathered through their own observations of teachers with input from lead teachers, parents and students through formal and informal methods as appropriate. However, not all of Toch and Rothman’s recommendations for fixing the problems inherent in conventional rating systems are likely to bring about meaningful changes.
Toch and Rothman call for the use of multiple measures and multiple evaluators. Regarding multiple measures, they write: “The experiences of the leading comprehensive evaluation systems suggest that samples of student work, teachers’ assignments, and other ‘artifacts’ of teaching are valuable complements to classroom observations and should be included in evaluations” (Page 19). Moreover, they write, “To get a fuller and fairer sense of teachers’ performance, evaluations should focus on teachers’ instruction — the way they plan, teach, test, manage, and motivate” (Page 18). As I argue throughout this primer, teacher performance is best measured by student outcomes. Including these varied measures of teacher inputs sounds compelling, but it confuses the central focus of teaching. Planning, teaching, testing, managing and motivating can help a teacher to be successful, but success on these tasks does not guarantee the desired outcome. Thus, teacher evaluation should stay focused on the outcome, student achievement, and not on the means of achieving that outcome.
Although they do not completely disregard the use of standardized test scores for teacher evaluation, Toch and Rothman argue that “test scores should have a minor role, accounting for under 50 percent of a teacher’s evaluation” (Page 18). They refine this recommendation by stating that test scores should not be used to measure individual teacher progress, only schoolwide progress. Toch and Rothman support this claim by writing, “That’s because many teachers don’t teach tested subjects, the small number of students that many teachers teach skews the results, and using schoolwide scores encourages school staffs to collaborate rather than compete” (Page 18).
The goal of value-added measurement is to improve upon teacher evaluation by centering on the outcomes that matter most. The fact that not all teachers teach currently tested subjects or large classes does not preclude the use of test scores to measure the performance of teachers for whom we do have sufficient relevant data. Even so, some teachers will need to be measured by schoolwide gains. Under a bonus system, teachers measured by schoolwide gains could have lower potential rewards than teachers who are under a higher level of scrutiny. Alternatively, schools can introduce new assessments in a wider variety of subjects. The data from these additional tests could be helpful for diagnosing student progress and for measuring teacher performance. The common complaint that teachers will compete, rather than collaborate, under evaluation systems that use test scores to measure individual teacher performance can also be addressed. Including a schoolwide performance measure for all teachers, including those who will be measured individually, will give teachers a continued incentive to collaborate. In fact, it may drive them to collaborate more than before.
Concerning the use of multiple evaluators, Toch and Rothman argue that principals often fail to differentiate levels of performance when evaluating teachers. Toch and Rothman suggest that this phenomenon may be due both to principals’ unwillingness and to their inability to measure teachers accurately. To address these problems and principals’ subjectivity, Toch and Rothman recommend the use of carefully trained peer evaluators (typically senior teachers) whose perspectives can broaden the pool of viewpoints.
Unfortunately, allowing teachers to evaluate one another simply replaces one type of subjectivity with another. Teachers can use evaluations of peers as a way to settle petty grievances and pursue vendettas. The work of the University of Michigan’s Brian Jacob and Brigham Young University’s Lars Lefgren and of Douglas Harris and Florida State University’s Tim Sass indicates that principals are capable of evaluating teachers accurately. The problems with principal evaluations arise under the current system of teacher tenure, in which the outcome of an attempt to remove a low-performing teacher is uncertain and the process can take several years. Principals thus face real disincentives to giving negative performance evaluations and thereby alienating teachers.