Understanding Cowen Institute’s Botched VAM Effort
We are in an age of so-called “data-driven” education reform. Numbers and analyses are being worshiped as the end-all, be-all evidence of education quality. Student standardized test scores are at the center of the majority of such analyses. Moreover, in order to “drive” the privatization of public education using quantitative data analysis, corporate-reform-bent philanthropies and businesses are dumping money into “institutes,” groups of often questionably-credentialed individuals who promote attractive reports full of impressive numbers and analyses meant to wow the public into believing that test-driven “reform” is working.
The edge that these institutes (and other corporate-reform-promoting “nonprofits”) have in wielding statistical analyses is that neither the public nor the media is able to critically examine the quality of the work. Therefore, both are susceptible to swallowing whole the institute’s summation of its findings.
After all, if the physical appearance of the report is attractive, and if the report comes from *An Institute*, it must be trustworthy.
The public often does not critically consider the agendas of those financially supporting an institute; it often does not know the qualifications of those producing the reports, and it cannot discern whether the report outcomes amount to little more than a propaganda brochure “finding” in favor of the “reforms” favored by institute donors.
All of this I had in mind as I read the retracted, October 1, 2014, Cowen Institute report, *Beating the Odds*. On October 10, 2014, Cowen Institute removed the report from its website due to “flawed methodology.”
I wrote about the retraction in this October 10, 2014, post. I did not go into great detail on Cowen’s error because I needed to think of how to communicate it to readers in a way that is not too technical.
I will try to do so in this post.
What Cowen Institute did wrong was a major blunder– the kind that skilled researchers do not make. Cowen Institute’s researchers apparently thought they were conducting a value-added modeling (VAM) analysis. Instead, they conducted a more basic analysis known as multivariate linear regression (MLR)– and even that, they botched.
Not only did Cowen Institute conduct the wrong statistical analysis, and not only did it misuse and misinterpret the more basic analysis it conducted, but Cowen Institute also did not even realize its gross error until a week after publication.
I am baffled at how this happened. The incompetence astounds me. When I first read in the Times-Picayune that Cowen Institute had withdrawn the report due to “flawed methodology,” I expected a more sophisticated error. In fact, when I realized that Cowen had conducted (and interpreted, and published) the wrong analysis, I doubted my own senses.
It made me wonder just how crappy the rest of Cowen’s research actually is.
In its flawed study, Cowen Institute stated that it had produced predicted values on three outcome measures (EOC passing rates, an ACT index, and cohort graduation rate) for all Louisiana high schools. The stated goal of the study was to compare predicted outcomes with actual outcomes to determine which schools performed better, worse, or as predicted.
The focus was on actual school performance as compared to a predicted performance. Though many think of VAM in terms of evaluating the teacher based upon students’ test scores, Cowen was attempting to “VAM” the schools based upon student test scores and grad rates.
But the Cowen analysis was not VAM.
Before I proceed, let me note that I am convinced VAM cannot work. In December 2012, I analyzed Louisiana’s 2011 VAM pilot results and explained how erratic (and therefore useless) VAM is. Student test scores cannot measure teacher quality; neither can student test scores measure school quality, and attempting to hold teachers and schools hostage to statistical predictions on their students is lunacy.
VAM should not be connected to any high-stakes evaluation, period.
That noted, allow me to offer a brief word on VAM and MLR.
Both VAM and MLR (the basic analysis that Cowen actually conducted, albeit poorly) assume that lines can be used to capture the relationship among the variables. Thus, both rely upon the basic equation for a line. (Perhaps you remember it from algebra days gone by: y = mx + b.)
One can think of VAM as a more sophisticated version of MLR. VAM has levels of lines to it because it considers layers: students are in classes; classes are in schools, and schools are in districts. One might think of VAM as having equations within equations. In contrast, MLR operates only on one level (i.e., no equations within equations).
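To make the distinction concrete, here is a minimal sketch in Python with made-up student-level data (the column names, school labels, and numbers are all invented; this is not Cowen’s data or model). The first fit is a single-level MLR; the second is a simple multilevel model with students nested in schools, the “equations within equations” idea:

```python
# A sketch only: invented student-level data for four hypothetical schools.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "school": rng.choice(["A", "B", "C", "D"], size=n),
    "pct_poverty": rng.uniform(0, 100, size=n),
})
school_effect = df["school"].map({"A": 5.0, "B": 0.0, "C": -3.0, "D": 2.0})
df["outcome"] = 60 - 0.2 * df["pct_poverty"] + school_effect + rng.normal(0, 5, n)

# MLR: one level, one equation (the kind of analysis Cowen actually ran)
mlr = smf.ols("outcome ~ pct_poverty", data=df).fit()

# Multilevel model: students nested in schools, "equations within equations"
mlm = smf.mixedlm("outcome ~ pct_poverty", data=df, groups=df["school"]).fit()

print(mlr.params)
print(mlm.params)
```

A real VAM layers on more than this (additional levels, prior test scores), but the point of the sketch is only the structural difference: one equation versus equations nested within equations.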
The Cowen researchers conducted their analyses on one level, and they used MLR. Their report includes three separate MLRs, one for each of the three outcomes of interest: EOC, ACT, and grad rates. In an attempt to predict these three outcomes, the researchers used five measures: 1) percentage of students who failed LEAP tests, 2) percentage of students who are over-age for their grade level, 3) percentage of students on free/reduced lunch, 4) percentage of students in special education, and 5) whether the school is a selective admissions school.
Two Cowen researchers thought these three MLRs were VAM.
In order to conduct school-evaluating VAM on the outcomes of EOC, ACT, and grad rates, the researchers should have incorporated previous measures of EOC, ACT, and grad rates into their analysis. Makes sense, doesn’t it? For example, in order to predict future EOC scores for a given school, one must consider previous EOC scores for that school. Yet no such incorporation of previous scores is present in the Cowen study.
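For illustration only, here is what that difference looks like in code, using a fabricated school-level data set (every column name and number below is invented). The first regression mirrors Cowen’s demographics-only setup; the second adds the school’s own prior-year ACT index, which is one necessary ingredient of a growth or VAM-style analysis, though by itself it still would not make the model a VAM:

```python
# Invented school-level data; for illustration only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 80  # hypothetical high schools
schools = pd.DataFrame({
    "pct_failed_leap": rng.uniform(0, 60, n),
    "pct_overage": rng.uniform(0, 40, n),
    "pct_free_lunch": rng.uniform(20, 100, n),
    "pct_sped": rng.uniform(0, 20, n),
    "selective_admissions": rng.integers(0, 2, n),
    "act_prior_year": rng.normal(18, 3, n),
})
schools["act_index"] = (
    0.7 * schools["act_prior_year"]
    - 0.03 * schools["pct_free_lunch"]
    + 2.0 * schools["selective_admissions"]
    + rng.normal(0, 1.5, n)
)

# Demographics only: the Cowen-style specification, with no prior performance
demographics_only = smf.ols(
    "act_index ~ pct_failed_leap + pct_overage + pct_free_lunch"
    " + pct_sped + selective_admissions",
    data=schools,
).fit()

# Conditioning on the school's own prior outcome, as a growth analysis requires
with_prior_scores = smf.ols(
    "act_index ~ act_prior_year + pct_failed_leap + pct_overage"
    " + pct_free_lunch + pct_sped + selective_admissions",
    data=schools,
).fit()

print(demographics_only.rsquared, with_prior_scores.rsquared)
```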
Really, really not good.
It gets worse.
Not only did the two Cowen researchers use the wrong analysis; they didn’t even use MLR well. That’s what gets me more than any true-yet-botched attempt at actual VAM. MLR is an analysis with which anyone with a stats and research background should be familiar, and these Cowen folks botched even that.
When used appropriately, MLR serves either of two purposes: to explain or to predict an outcome. If I have a theory about what factors contribute to a certain outcome, I can use MLR to test my theory and determine the degree to which my theory explains that outcome.
MLR for explanation does not evaluate individuals. It actually evaluates the researcher’s theory about what factors contribute to some outcome.
The more common usage of MLR is to predict. I have a friend from college, a fellow stats major, who tried to come up with an MLR equation to predict winners of horse races. (It seems that many stats people dabble in gambling since gambling is all about probability, as is stats.) In coming up with his equation, my friend tried to determine which predictor measures could help him determine which horses would win future races. As such, he wanted a useful equation, one that could predict a future outcome: the winner of a future horse race.
Now this is important: Even if one of the predictors of future wins is some tabulation of past wins, the purpose of the MLR was not to evaluate jockey “effectiveness” based upon the horse’s performance. That would have been a VAM goal (and a futile one, as previously noted). Instead, my friend’s focus was on the usefulness of his equation at predicting the winner. That’s an MLR goal.
Cowen tried to evaluate individual schools based upon an MLR prediction equation. This was wrong to do.
Had the Cowen Institute researchers properly conducted an MLR for prediction, here’s how their study generally might have looked:
First, the research question would have focused on the utility of the MLR equation in predicting future outcomes, not on evaluating the schools.
Second, in order to produce an MLR prediction equation, the researchers should have had at least two random samples: one to develop the equation, and at least one other to test the usefulness of the equation (see the sketch following these steps).
Finally, the researchers could have then decided whether to make (or suggest, if not possible to make) adjustments to the equation in an effort to improve prediction, or they could have decided that the equation is satisfactory.
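Here is a minimal sketch of that two-sample logic with fabricated data (the five predictors and their coefficients are invented, and scikit-learn is simply one convenient tool for the split): develop the equation on one random half of the data, then see how well it predicts the half it never saw.

```python
# Invented data: 300 "schools," five predictors, one outcome.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(0, 100, size=(300, 5))
y = X @ np.array([0.3, -0.1, -0.2, 0.05, 1.5]) + rng.normal(0, 10, 300)

# One random sample to develop the equation, another to test it
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

equation = LinearRegression().fit(X_dev, y_dev)           # develop the equation
r2_development = equation.score(X_dev, y_dev)             # fit on the sample it was built from
r2_holdout = r2_score(y_test, equation.predict(X_test))   # honest test on unseen data

print(round(r2_development, 3), round(r2_holdout, 3))
```

The development R-squared typically looks better than the holdout R-squared, which is exactly the problem with judging an equation against the same data used to build it.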
There you have it.
One should not test a prediction equation using the same sample data one uses to arrive at the equation, because the equation has been tailored to fit that sample as closely as possible. Nevertheless, in the case of the Cowen study, it seems that the “actual” outcomes were the very same ones used to arrive at the predictions.
And the three MLR prediction equations had error issues of their own.
A major factor in determining the usefulness of the MLR prediction equation is the proportion of the differences in outcome scores accounted for by the equation. This value is called R-squared. A perfect prediction equation would have an R-squared of 1.0. This utopian result would mean that in development, the prediction equation accounted for all differences in the outcome, and all actual data points fell perfectly on the line of prediction. Such does not happen in reality. However, it is possible for an MLR equation to have an R-squared close to 1.0, such as .98.
The lower the R-squared value, the more unaccounted-for “noise” in the prediction equation, and the less likely the equation will be useful for predicting future outcomes.
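For readers who want to see the arithmetic, here is R-squared computed by hand on a handful of made-up numbers:

```python
import numpy as np

actual = np.array([72.0, 65.0, 80.0, 58.0, 90.0])     # made-up actual outcomes
predicted = np.array([70.0, 68.0, 77.0, 60.0, 88.0])  # made-up predictions from some MLR equation

ss_residual = np.sum((actual - predicted) ** 2)   # variation the equation misses ("noise")
ss_total = np.sum((actual - actual.mean()) ** 2)  # total variation in the outcome
r_squared = 1 - ss_residual / ss_total

print(round(r_squared, 3))  # the closer to 1.0, the less noise left over
```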
In their analysis, Cowen Institute reported three values of R-squared, one for each of its three MLR lines: .684 (EOC), .768 (ACT), and .412 (grad rates).
The highest R-squared, .768, means that for the Louisiana high schools in the year that this analysis was conducted, approximately 77% of the differences in ACT scores can be accounted for by the predictor variables that the researchers included in the analysis (mentioned previously: 1) percentage of students who failed LEAP tests, 2) percentage of students who are over-age for their grade level, 3) percentage of students on free/reduced lunch, 4) percentage of students in special education, and 5) whether the school is a selective admissions school).
An R-squared of .768 indicates that approximately 23% of schools’ differences in ACT scores remain unaccounted for by the Cowen MLR prediction equation. Generally speaking, this R-squared is modest. The research focus should have been on reconsidering the five predictor variables in order to increase R-squared and improve the equation.
Improve the equation, not evaluate the individuals in the sample.
The remaining two R-squared values (.684 and .412) are not as impressive, with .412 being, in fact, useless. (An R-squared of .412 means that nearly 60% of the differences in graduation rates are left unexplained; the MLR equation is mostly “noise.” A waste.)
Researcher discussion should have been on the utility of their equations at accurately predicting future EOC scores, or ACT index results, or cohort graduation rates– not on evaluating actual results, present or future.
I shake my head.
A word to “choice” promoters wishing to showcase their product using VAM: Do not do what Cowen did.
The clouds of individual data points around the three MLR lines in the Cowen report (pages 15–17) are not to be used to evaluate the points above the line as “better” and those below as “worse.” No, no. Those clouds are to be used to evaluate the MLR lines themselves as modestly- to poorly-fitting.
In developing an MLR prediction equation, the better the line, the fewer the points that fall off of it, and the closer those off-line points sit to it.
Again, this has nothing to do with evaluating individual data points (remember, the data points here represent Louisiana high schools).
This Cowen report could be a case study in bad research on several levels, not the least of which is that the researchers conducted the wrong analysis– research dysfunction at its finest.
I think I have written enough.
Schneider is also author of the ed reform whistleblower, *A Chronicle of Echoes: Who’s Who In the Implosion of American Public Education*.