# Understanding Cowen Institute’s Botched VAM Effort

We are in an age of so-called “data-driven” education reform. Numbers and analyses are being worshiped as the end-all, be-all evidence of education quality. Student standardized test scores are at the center of the majority of such analyses. Moreover, in order to “drive” the privatization of public education using quantitative data analysis, corporate-reform-bent philanthropies and businesses are dumping money into “institutes,” groups of often questionably-credentialed individuals who promote attractive reports full of impressive numbers and analyses meant to wow the public into believing that test-driven “reform” is working.

The edge that these institutes (and other corporate-reform-promoting “nonprofits”) have in wielding statistical analyses is that neither the public nor the media is able to critically examine the quality of the work. Therefore, both are susceptible to swallowing whole the institute’s summation of its findings.

After all, if the physical appearance of the report is attractive, and if the report comes from *An Institute*, it must be trustworthy.

The public often does not critically consider the agendas of those financially supporting an institute; it often does not know the qualifications of those producing the reports, and it cannot discern whether the report outcomes amount to little more than a propaganda brochure “finding” in favor of the “reforms” favored by institute donors.

All of this I had in mind as I read the retracted October 1, 2014, Cowen Institute report, *Beating the Odds*. On October 10, 2014, Cowen Institute removed the report from its website due to “flawed methodology.”

I wrote about the retraction in this October 10, 2014, post. I did not go into great detail on Cowen’s error because I needed to think of how to communicate it to readers in a way that is not too technical.

I will try to do so in this post.

What Cowen Institute did wrong was a major blunder– the kind that skilled researchers do not make. Cowen Institute’s researchers apparently thought they were conducting a value-added modeling (VAM) analysis. Instead, they conducted a more basic analysis known as multivariate linear regression (MLR)– and even that, they botched.

Not only did Cowen Institute conduct the wrong statistical analysis, and not only did it misuse and misinterpret the more basic analysis it conducted, but Cowen Institute also did not even realize its gross error *until more than a week after publication.*

I am baffled at how this happened. The incompetence astounds me. When I first read in the *Times-Picayune* that Cowen Institute had withdrawn the report due to “flawed methodology,” I expected a more sophisticated error. In fact, when I realized that Cowen had conducted (*and* interpreted, *and* *published*) the *wrong analysis*, I doubted my own senses.

It made me wonder just how crappy the rest of Cowen’s research actually is.

In their flawed study, Cowen Institute stated that it had produced predicted values on three outcome measures (EOC passing rates, an ACT index, and cohort graduation rate) for all Louisiana high schools. The stated goal of the study was to compare predicted outcomes with actual outcomes to determine which schools performed better, worse, or as predicted.

The focus was on actual school performance as compared to a predicted performance. Though many think of VAM in terms of evaluating the teacher based upon students’ test scores, Cowen was attempting to “VAM” the schools based upon student test scores and grad rates.

But the Cowen analysis was not VAM.

Before I proceed, let me note that I am convinced VAM cannot work. In December 2012, I analyzed Louisiana’s 2011 VAM pilot results and explained how erratic (and therefore useless) VAM is. Student test scores cannot measure teacher quality; neither can student test scores measure school quality, and attempting to hold teachers and schools hostage to statistical predictions on their students is lunacy.

VAM should not be connected to any high-stakes evaluation, period.

That noted, allow me to offer a brief word on VAM and MLR.

Both VAM and MLR (the basic analysis that Cowen actually conducted, albeit poorly) assume that lines can be used to capture the relationship among the variables. Thus, both rely upon the basic equation for a line. (Perhaps you remember it from algebra days gone by: *y = mx + b*.)

One can think of VAM as a more sophisticated version of MLR. VAM has levels of lines to it because it considers layers, such as those evidenced when one considers that students are in classes; classes are in schools, and schools are in districts. One might think of VAM as having equations within equations. In contrast, MLR operates only on one level (i.e., no equations within equations).
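For readers who like to see the notation, the contrast can be sketched as follows. The symbols are generic textbook ones and are my own assumption, not taken from the Cowen report, and the two-level form shown is only the simplest random-intercept version of a hierarchical model; actual VAM specifications vary.

```latex
% One-level MLR: a single equation for school i with p predictors
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi} + \varepsilon_i

% Two-level sketch: student i nested in school j
y_{ij} = \beta_{0j} + \beta_1 x_{ij} + \varepsilon_{ij}   % level 1 (students)
\beta_{0j} = \gamma_{00} + \gamma_{01} w_j + u_j          % level 2 (schools)
```

In the second form, the school-level equation sits inside the student-level one, which is the “equations within equations” idea.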

The Cowen researchers conducted their analyses on one level, and they used MLR. Their report includes three separate MLRs, one for each of their three outcomes of interest: EOC, ACT, and grad rates. In an attempt to predict these three outcomes, the researchers used five measures: 1) percentage of students who failed LEAP tests, 2) percentage of students who are over-age for their grade level, 3) percentage of students on free/reduced lunch, 4) percentage of students in special education, and 5) whether the school is a selective admissions school.
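To make “one-level MLR” concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the function names `fit_mlr` and `predict` are hypothetical, only two predictors are used instead of Cowen's five, and the school numbers are invented (the outcomes are built from a known equation so the fit can be checked). It is not Cowen's code or data.

```python
def fit_mlr(rows, outcomes):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: list of predictor lists, one per school.
    outcomes: list of outcome values, one per school.
    Returns [intercept, slope_1, ..., slope_p].
    """
    # Prepend a constant 1.0 so the first coefficient is the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * y for r, y in zip(X, outcomes))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def predict(coefs, row):
    """Apply the single prediction equation to one school's predictors."""
    return coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))


# Invented schools: [% free/reduced lunch, % over-age].
rows = [[80, 20], [60, 10], [90, 30], [40, 5], [70, 25], [50, 15]]
# Invented outcomes generated from y = 30 - 0.1*x1 - 0.2*x2 with no noise.
outcomes = [18.0, 22.0, 15.0, 25.0, 18.0, 22.0]

coefs = fit_mlr(rows, outcomes)
print([round(c, 4) for c in coefs])
print(round(predict(coefs, [80, 20]), 4))
```

Note that there is exactly one equation here, applied to every school alike. Nothing in it is nested.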

*Two* Cowen researchers thought these three MLRs were VAM.

Nope.

In order to conduct school-evaluating VAM on the outcomes of EOC, and ACT, and grad rates, the researchers should have incorporated previous measures of EOC, and ACT, and grad rates, into their analysis. Makes sense, doesn’t it? For example, in order to predict future EOC scores for a given school, one must consider previous EOC scores for that school. Yet no such incorporation of previous scores is present in the Cowen study.
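As a rough sketch of what “incorporating previous measures” means, the prior year's outcome would appear on the right-hand side of the equation. The notation below is my own assumption (school *j*, year *t*) and is only the simplest possible form, not the specification of any particular VAM.

```latex
% Hypothetical notation: y_{j,t} is school j's outcome (e.g., EOC) in year t
y_{j,t} = \beta_0 + \beta_1 \, y_{j,t-1} + \beta_2 x_{2j} + \dots + \beta_p x_{pj} + \varepsilon_{j,t}
```

No term like the prior-year outcome appears among the five Cowen predictors listed above.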

Really, really not good.

It gets worse.

Not only did the *two* Cowen researchers use the wrong analysis; they didn’t even use MLR *well*. That’s what gets me more than any true-yet-botched attempt at actual VAM. MLR is an analysis with which one with a stats and research background should be familiar, and these Cowen folks botched even it.

When used appropriately, MLR can be used for either of two purposes: to explain or to predict an outcome. If I have a theory about what factors contribute to a certain outcome, I can use MLR to test my theory and determine the degree to which my theory *explains* some outcome.

MLR for explanation *does not evaluate individuals*. It actually evaluates the researcher’s theory about what factors contribute to some outcome.

The more common usage of MLR is to *predict*. I have a friend from college, a fellow stats major, who tried to come up with an MLR equation to predict winners of horse races. (It seems that many stats people dabble in gambling since gambling is all about probability, as is stats.) In coming up with his equation, my friend tried to determine which predictor measures could help him determine which horses would win future races. As such, he wanted a useful equation, one that could *predict a future outcome*: the winner of a future horse race.

Now this is important: Even if one of the predictors of future wins is some tabulation of past wins, the purpose of the MLR *was not to evaluate jockey “effectiveness” based upon the horse’s performance*. That would have been a VAM goal (and a futile one, as previously noted). Instead, my friend’s focus was on *the usefulness of his equation* at predicting the winner. That’s an MLR goal.

Cowen tried to evaluate individual schools based upon an MLR prediction equation. This was wrong to do.

Had the Cowen Institute researchers properly conducted an MLR for prediction, here’s how their study generally might have looked:

First, the research question would have focused on *the utility of the MLR equation in predicting future outcomes*, not on evaluating the schools.

Second, in order to produce an MLR prediction equation, the researchers should have at least two random samples, one to use to develop the equation, and at least one other to test the usefulness of the equation.

Finally, the researchers could have then decided whether to make (or suggest, if not possible to make) adjustments to the equation in an effort to improve prediction, or they could have decided that the equation is satisfactory.

There you have it.

One should not test a prediction equation using the same sample data one uses to arrive at the equation because the equation has been tailored to fit the sample data as best as is possible. Nevertheless, in the case of the Cowen study, it seems that the “actual” outcomes were the very same ones used to arrive at the predictions.
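The develop-then-test idea can be sketched in a few lines of Python. This is a toy with one predictor and invented numbers, and the names `fit_line` and `mean_squared_error` are hypothetical helpers of my own; it only illustrates why an equation looks better on the sample that produced it.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(slope, intercept, xs, ys):
    """Average squared gap between actual and predicted outcomes."""
    return sum((y - (intercept + slope * x)) ** 2
               for x, y in zip(xs, ys)) / len(xs)


# Invented development sample: used to arrive at the equation.
dev_x, dev_y = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 5.0, 8.0]
# Invented second sample: never seen while fitting.
new_x, new_y = [1.0, 2.0, 3.0, 4.0], [3.0, 3.0, 6.0, 7.0]

slope, intercept = fit_line(dev_x, dev_y)
error_dev = mean_squared_error(slope, intercept, dev_x, dev_y)
error_new = mean_squared_error(slope, intercept, new_x, new_y)

print(round(error_dev, 3))  # about 0.175 on the data the line was tailored to
print(round(error_new, 3))  # about 0.575 on the fresh sample
```

The error on the development sample flatters the equation. Only the error on the second sample says anything about usefulness for prediction.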

And the three MLR prediction equations had error issues of their own.

A major factor in determining the usefulness of the MLR prediction equation is the amount of difference in outcome scores accounted for by the equation. This value is called R-squared. A perfect prediction equation would have an R-squared of 1.0. This utopian result would mean that in development, the prediction equation accounted for all differences in the outcome, and all actual data points fell perfectly on the line of prediction. Such does not happen in reality. However, it is possible for an MLR equation to have an R-squared close to 1.0, such as .98.

The lower the R-squared value, the more unaccounted-for “noise” in the prediction equation, and the less likely the equation will be useful for predicting future outcomes.
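For the curious, R-squared can be computed as one minus the ratio of leftover (residual) variation to total variation. A minimal sketch, assuming that standard definition and using invented numbers; the function name `r_squared` is my own:

```python
def r_squared(actual, predicted):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Invented outcomes for four schools.
actual = [10.0, 20.0, 30.0, 40.0]

perfect = [10.0, 20.0, 30.0, 40.0]     # every point exactly on the line
mean_only = [25.0, 25.0, 25.0, 25.0]   # predicting the average for everyone
close = [12.0, 18.0, 33.0, 37.0]       # near the line, but not on it

print(r_squared(actual, perfect))            # 1.0
print(r_squared(actual, mean_only))          # 0.0
print(round(r_squared(actual, close), 3))    # 0.948
```

The first case is the utopian one described above; the second is an equation that accounts for none of the differences in the outcome.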

In their analysis, Cowen Institute reported three values of R-squared, one for each of its three MLR lines: .684 (EOC), .768 (ACT), and .412 (grad rates).

The highest R-squared, .768, means that for the Louisiana high schools in the year that this analysis was conducted, approximately 77% of the differences in ACT scores can be accounted for by the predictor variables that the researchers included in the analysis (mentioned previously: 1) percentage of students who failed LEAP tests, 2) percentage of students who are over-age for their grade level, 3) percentage of students on free/reduced lunch, 4) percentage of students in special education, and 5) whether the school is a selective admissions school).

An R-squared of .768 indicates that approximately 23% of schools’ differences in ACT scores remains unaccounted for by the Cowen MLR prediction equation. Generally speaking, this R-squared is modest. The research focus should have been on reconsidering the five predictor variables in order to increase R-squared and *improve the equation.*

*Improve the equation, not evaluate the individuals in the sample.*

The remaining two R-squared values (of .684 and .412) are not as impressive, with .412 being, in fact, useless. (An R-squared of .412 means that the MLR equation is mostly “noise.” A waste.)
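The arithmetic behind these statements is simply one minus R-squared. A short sketch using the three values reported by Cowen (the variable names are my own):

```python
# R-squared values as reported in the Cowen study.
reported_r2 = {"EOC": 0.684, "ACT": 0.768, "grad rate": 0.412}

# Share of differences in each outcome NOT accounted for by the equation.
unexplained = {name: round(1.0 - r2, 3) for name, r2 in reported_r2.items()}

for name, share in unexplained.items():
    print(f"{name}: {share:.1%} unaccounted for")
# EOC: 31.6% unaccounted for
# ACT: 23.2% unaccounted for
# grad rate: 58.8% unaccounted for
```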

Researcher discussion should have been on the utility of their equations at *accurately predicting* future EOC scores, or ACT index results, or cohort graduation rates– *not on evaluating* actual results, present or future.

I shake my head.

A word to “choice” promoters wishing to showcase their product using VAM: Do not do what Cowen did.

The clouds of individual data points around the three MLR lines in the Cowen report (pages 15 – 17) are *not* to be used to evaluate the points above the line as “better” and those below as “worse.” No, no. Those clouds are to be used to evaluate the MLR lines themselves as modestly- to poorly-fitting.

In developing an MLR prediction equation, the better the MLR lines, the fewer the points off of the line, and the closer to the line the points that are not directly on the line.

Again, *this has nothing to do with evaluating individual data points* (remember, the data points here represent Louisiana high schools).
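One way to see this is to look at the residuals, the vertical gaps between each point and the line. A minimal sketch with invented numbers and a hypothetical helper `fit_line`: for a least-squares line with an intercept, the residuals always sum to zero, so whenever the fit is imperfect some points must land above the line and some below. What the gaps measure, taken together, is how well the line fits.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Invented predictor and outcome values for four schools.
predictor = [1.0, 2.0, 3.0, 4.0]
outcome = [2.0, 4.0, 5.0, 8.0]

slope, intercept = fit_line(predictor, outcome)
residuals = [y - (intercept + slope * x) for x, y in zip(predictor, outcome)]

# Positive and negative gaps cancel out by construction of the line.
residual_sum = sum(residuals)
# Typical size of a gap: a statement about the line, not about any one school.
rmse = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

print(round(residual_sum, 9) == 0)
print(round(rmse, 3))  # about 0.418
```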

This Cowen report could be a case study in bad research on several levels, not the least of which is *the researchers conducted the wrong analysis*– research dysfunction at its finest.

I think I have written enough.

__________________________________________________________

*Schneider is also author of the ed reform whistleblower, A Chronicle of Echoes: Who’s Who In the Implosion of American Public Education*

Just for the record, what were the actual, as opposed to predicted, employment fates of the two ‘researchers’? lolol. Nice piece of explication!

Hope you keep this in mind as a version for educators of the classic How to lie with statistics, adding…about education. I think one of the unifying themes is making jockey effectiveness the test of claims about teacher effectiveness, race tracks as schools, let it go anywhere your creative mind wants to take it to ridicule the absurd inferential leaps.

Jockey effectiveness has made my day.

How utterly embarrassing and unprofessional. When the best part of Cowen Institute’s report is its retraction of it, it is time for those who rely on such unreliable sources to look at the big picture — one gets what one pays for: damn lies and statistics.

As one who admits to no understanding at all of the statistical explanations, COMMON SENSE combined with even minimal critical thinking skills told us long ago that using VAM, let alone standardized test scores, was not an accurate measure of teacher effectiveness. Mercedes, I am going to suggest that the Accountability Commission invite you to present at their next meeting. Ms. Hannah Dietsch with her dog and pony show will be shut down post haste.

Hannah Dietsch

The Broad Residency Class of 2008-2010

Current Organization: Louisiana Department of Education, Assistant Superintendent- Talent

Placement Organization: New York City Department of Education, Director of Strategy and Achievement

Pre-Residency: Teach for America

M.Ed., Harvard University

B.A., Tulane University

You blow me out of the water everyday! Thank you, Mercedes, for my free online mini course in stats and juking them!

I almost understood it, but I estimate that is my problem, not yours. Thanks again for truly detailed analyses.

Thank you.

Your characterization of VAM and statistics is misleading and lacks the nuance you purport. It is wrong to say the report should have chased a higher R2. R2 says NOTHING about whether your model is unbiased. It’s a simple measure used to gauge how much variation is explained in your outcome of interest. For example, assume that you run an experiment where the treatment is smoking cigarettes and the outcome is lung cancer. Assume that after conducting this experiment you find that cigarettes cause cancer, and in a regression framework you get an R2 of .4. According to your logic, this is not important since 60% of the variance in lung cancer is “noise”.

For those in the know, they will see that you are focused on the precision of the model (how well the data fit around the line), where the more important thing is whether you have the right line. If you have the right line, then on average you will make the right prediction. And once you have this criterion, you can worry about precision.

As an FYI, VAMs are not equations within equations. In most cases they are well thought out multiple linear regressions. It’s the well thought out part that is lacking in the Cowen report, but not the estimation strategy. Or, said differently, you want to use a statistical model that mimics the real life situation you’re assessing. This is where you and I agree that the Cowen report did not come close to meeting this criterion.

“The right line” in VAM as applied to testing teachers is a myth.

In the case of this practical application, R-squared does matter. The “noise” costs teachers their jobs. But this Cowen regression effort is not even VAM, and yes, VAM can be defined as equations inside of equations (hierarchical linear modeling with levels of equations).

The “right line” alludes to the variables in the model, for they produce the line. Wrong predictors; wrong line.

In educational research, the “lines” of VAM equations can never be “right” because the practically unquantifiable accounts for much that contributes to student outcomes. So, there never will be “right lines.”

As to my regression examples: the R-squared does matter if one wishes to use the equation for practical– and especially high-stakes– predictive purposes. A low R-squared indicates that the predictor set is deficient as a set. There can certainly be predictor variables that are important, but applying the set of predictors is not useful because of the “noise” still in the prediction.

Yes, I am for precision in regression, especially when regression studies are used to drive public policy.

In your smoking example, 40% of variance in cancer outcomes is explained by smoking, but 60% is not– which means putting too much weight on smoking as causing cancer obscures the fact that other issues (that unexplained 60%) come into play.

Note also that the connection between smoking and cancer was not settled with a single, SLR study. Talk about bias unexplored.

Emphasis on that smoking 40% in public reports could mislead the public into thinking that smoking is “the” cause of cancer, to the detriment of ignoring other lifestyle issues that make the prediction more precise. There is also the legal team for tobacco companies, who I am sure will focus on that unexplained 60% in litigation.

As background, I have *not read the initial study* so I’m not going to comment directly on your critiques of it. If they really didn’t use past scores in their models then it’s absolutely ludicrous to call it VAM. Value added in comparison to what?!!? So you have me convinced this is a really bad study, so I’m not taking issue with your underlying claims. I am a quantitative ed. researcher and I am very, very skeptical of VAMs to evaluate specific schools or school districts or teachers. They’re just not precise enough. I’d never be able to sleep at night knowing that some teacher may have gotten fired, or some school shut down, because my models’ standard errors were biased. *Shudder.*

But I’m not entirely against them for testing theory or evaluating programs, when used responsibly. It’s just very hard to do the research and say “we found some evidence of growth beyond expectation, and this shows promise in a program” (or the reverse) and not have people with an agenda use that to support their pre-determined beliefs one way or another, especially in high-stakes environments. The “sexiness” of value-added only makes these misappropriations of research even worse. But that doesn’t mean VAM itself is terrible. Sigh– the quandaries of education research.

In any event, I do take issue with one point you make. Saying that an R2 of 0.4 is meaningless is just plain wrong, especially in social research.

Very, very few social research studies find R2s that high. The social world is complicated and noisy. But explaining 40% of variation is no small feat in any social research study. In fact, it’s remarkable.

For example, if students’ parents’ income explained 10% of variation in test scores, that tells us that family income is an important factor for understanding academic performance, but only one, and hardly a determining one. But an important one. Saying that an R2 of 0.10 in that case is “meaningless” is wrong. It’s extremely meaningful: it tells us about how structural forces affect educational outcomes, and it even gives us some purchase on “how much.”

Consider the gambling example to drive the point home. Let’s say I can model the results of a horse race using a set of easily observed variables and I can explain 5% of the variation. Furthermore, this model behaves well and meets the assumptions of a linear regression model (this, as Andrew notes, is what really matters — not the size of the R2). And the conditions of horse races actually stay the same over time. (None of these things are true for education, usually, hence why making decisions that affect people’s lives with them is so risky.) But my point is, let’s say the model could actually predict an outcome 5% better than chance. If this model with a quite small R2 could really do this, I could make a fortune — a stone-cold fortune! — exploiting that 5% edge (so long as I had the bankroll to handle the random variation of the other 95%). It might be that horse racing is just so random there’s no way to get a model that does better than 5%. It would be the case that the results of a horse race are, in your words, “mostly noise.” But by no means would such a model be a waste.

Real life example: Recently a gambler found a way to get — I believe — a 7% edge in Baccarat by observing the way cards were marked. He made 10 million exploiting that edge. He didn’t win every hand in the process because of random variation — noise (i.e., shuffling) — but he was able to improve his choices by enough to take the casino to the cleaners exploiting this 7% edge. He got sued.

If an educational intervention could explain even 5% of the variance of a meaningful outcome, and we could easily implement that intervention and find something like 5% gains, that could potentially be a huge, huge deal. But it depends on the outcome. Standardized test scores are not a good outcome. They’re nice and reliable but do badly on validity: they don’t predict what we need them to.

To wrap up, let’s take a real life example. Much has been made of early childhood interventions on non-cognitive skills. Heckman and others have done famous research showing that these early interventions result in differences that keep a significant number of people out of the criminal justice system and jail, resulting in huge benefits to society and cost-savings. The thing is, the size of the effect may not look like much. (I’m not sure of the actual figures, it’s been a while since I read the research). Say an EC intervention on NonCog skills results in 3 or 4 people out of 100 not in jail that otherwise would be. If such an early childhood intervention really worked, multiplied over millions it could mean thousands fewer in jail every year. Crime wouldn’t go away but we’d save millions of dollars and many, many people would be better off. It would be plausible that such an intervention only had an R2 of 5% (I don’t know the real numbers) but that could be quite socially meaningful. The size of an R2 in itself, absent context of other information telling us how much variation we “should” be able to explain, tells us little about whether a model is good or bad.

I have a young son who is almost two and I have just begun to research Orleans Parish public schools on my own. Frankly, I cannot make heads or tails of anything out there. I am finding that the student populations change widely from year to year in the RSD, so comparing them that way would seem not to work anyway. I was so excited to find the Cowen Institute and was disappointed to learn I shouldn’t really use any of it. I can’t understand the fly-by-night charters here either. Are they given a real chance? Is a year or two enough time to really prove anything? How scary was my last statement? It is almost like an experiment. Also, if a district (like the RSD) has a higher number of students with disabilities, is that factored in to SPS? If I were a parent looking over schools for my child to attend next year, I would be at a total loss. What on Earth should I really be looking for? Could I miss out on a really great school because of these reports and statistics that I am reading? I see bright little kids everywhere in New Orleans. What are they and my son (unless we move) really up against? Thank you for your blog.

Beth, the RSD charters are pushing a test-driven model that focuses on testing outcomes at the expense of genuine learning. So, it will be difficult to find a school that is not doing so in RSD.

Having stated as much, I would look for stable schools– ones that have been around for several years, without high teacher turnover. I would ask what percentage of the faculty is from Teach for America (TFA– temp teachers). I would read the parent and student handbooks. I would talk to parents whose kids attend different RSD schools.

If your child is only two, you have time to investigate. I teach in St. Tammany, across the lake. The school system is highly stable. We have no charters to my knowledge, nor any TFA. If moving is an option, again, you have time to plan.

Thanks so much for your reply. You may laugh, we actually live in Mandeville! We moved to the parish one year ago from the Carolinas. My husband works in N.O. and I attend university there part time. We are looking for a house in the city, which was going great until I began researching schools. I have been determined to find a fit for us in New Orleans against the warnings from everyone I know here. I believe in public schools but certainly would never short change my son for a personal belief. Looks like remaining on the North Shore or a transfer would be best for our family.

Sounds like you have your answer. 🙂

When I began teaching in 1991, St. Tammany had a great reputation among teachers seeking employment in southern Louisiana. That reputation continues despite corporate reform pressures.