Since 2008, students enrolled in English II courses in Louisiana have completed the End of Course (EOC) test developed by Pacific Metrics (PM). (Louisiana uses six EOC tests; two are in English, for courses English II and III).
The English II and III EOC tests consist of three sections, one of which is a constructed-response item (an essay). At the same time, PM prides itself on its ability to score its tests in only two days.
It does so by using a combination of human and computer scoring. Apparently a human being reads first, and then the computer checks the human:
June 11 2013
Pacific Metrics Corporation (www.pacificmetrics.com), a leading provider of online assessment solutions, announces the successful online delivery of over 230,000 tests for the Louisiana Department of Education’s 2012–2013 End-of-Course program. Using the automated scoring solution CRASE® in conjunction with human raters, Pacific Metrics was able to deliver accurate student results two days after test completion, which is a significant accomplishment for a high-stakes assessment delivered to a large body of students.
The successful online administration and scoring of the End-of-Course tests is a direct result of effective collaboration between Pacific Metrics and the Louisiana Department of Education. “We designed and built this technologically advanced assessment system specifically to meet Louisiana’s needs and to ensure that districts and schools were well-positioned to successfully execute this administration. We are pleased that the automated scoring component of the system was able to aid in this effort,” states Bob Guerin, Acting President & CEO of Pacific Metrics.
The End-of-Course tests are scored through a combination of handscoring and the use of CRASE, which delivers a second read to constructed-response and essay items. In this blended approach, the scoring engine improves the reliability of the score by recognizing rater bias or rater drift.
The English II and III essay components of Louisiana’s EOC are scored on content, style, and conventions. The 2015 EOC Interpretive Guide lists the essay as worth a total of 18 possible points on page 10 but a total of 12 possible points on page 18. The referenced scoring rubric (which is not available for 2015-16 but is from 2014-15) covers content (4 possible points), style (4 possible points), and conventions (4 possible points), plus “additional scoring criteria” with no point values specified.
And to add to the confusion, page 18 notes that the Eng II constructed response is scored on two dimensions (content and style) for a possible score range of 2 to 8 points.
Based upon reading the item analyses from this year’s Eng II EOC administration, my students’ scores on the EOC Eng II essay appear to be graded using the 18 possible points referenced on page 10.
But who is doing the grading?
Well, the Louisiana Department of Education (LDOE) dodges specifically mentioning the human-computer hybrid scoring by using passive voice. The excerpt below is from the 2015 Interpretive Guide (complete with scoring inconsistency):
Multiple-Choice Items Multiple-choice items, which assess knowledge, conceptual understanding, and application of skills, are scored correct or incorrect. Student responses are automatically scored by the EOC Tests system (computer-scored).
Constructed-Response Items Constructed-response items require students to compose an answer and use higher-order thinking skills. A typical constructed-response item may direct students to develop an idea, demonstrate a problem-solving strategy, or justify an answer based on reasoning or evidence. …
Writing Prompt (English II)
A typical writing prompt may require students to defend a position or explain a concept. Students are asked to use at least one example from literature they have read in English II, prior courses, or outside of class to support their ideas.
Students are expected to type the final draft of their response in the online testing environment using standard typing skills. Compositions that are incoherent, too brief, not written in English, a restatement of the prompt, a refusal to respond, blank, or off topic are deemed nonscorable and receive 0 points.
Each scorable composition is read at least twice. During both reads, the composition is scored for two dimensions (Composing and Style/Audience Awareness) using a scoring scale of 1 to 4 points for each dimension. The total score is the sum of the dimension scores and ranges from 2 to 8 points.
“Each scorable composition is read at least twice.”
Let us revisit that 2013 PM press release:
The End-of-Course tests are scored through a combination of handscoring and the use of CRASE, which delivers a second read to constructed-response and essay items. [Emphasis added.]
It seems that the LDOE’s passive statement about EOC student essays being “read at least twice” translates into “read once by a human and a second time by a machine.”
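To make the mechanics concrete, here is a rough sketch, in Python, of what such a human-first, machine-second-read workflow might look like. To be clear, this is my own illustration, not PM’s actual system; every function name and score in it is hypothetical. Only the scale comes from the Interpretive Guide: 1 to 4 points per dimension, for a 2-to-8-point total.

```python
# Hypothetical sketch of a human-first, machine-second-read scoring flow.
# Dimension scores follow the Interpretive Guide: 1-4 points each for
# Composing and Style/Audience Awareness, for a total of 2-8 points.

def human_read(essay):
    # Placeholder for a trained rater's dimension scores.
    return {"composing": 3, "style_audience": 2}

def machine_read(essay):
    # Placeholder for an automated engine's second read (a la CRASE).
    return {"composing": 3, "style_audience": 3}

def total(scores):
    return scores["composing"] + scores["style_audience"]

def score_essay(essay):
    human = human_read(essay)
    machine = machine_read(essay)
    # The machine read serves as a check on the human read; a large
    # disagreement would flag the response for further review.
    flagged = abs(total(human) - total(machine)) > 1
    return total(human), flagged

print(score_essay("sample essay text"))  # (5, False)
```

Note that in this sketch the reported score is still the human’s; the machine’s read only decides whether the response gets flagged.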
PM has a 2013 study about its computerized read-behind, but the study is published by Routledge and costs $55 to access. So, I won’t be reading it. But here is a press release about the study and about the machines guiding the humans in their reading “in real time”:
Monterey, CA (PRWEB) May 02, 2013
Pacific Metrics Corporation (http://www.pacificmetrics.com), a leading provider of education technology solutions, is pleased to announce a significant milestone in automated scoring research efforts with the recent publication of “The Handbook of Automated Essay Evaluation.” Pacific Metrics psychometricians Drs. Susan Lottridge, Matt Schulz, and Howard Mitzel outlined their recent automated scoring research in Chapter 14, “Using Automated Scoring to Monitor Reader Performance and Detect Reader Drift in Essay Scoring.” As new demands for writing instruction and evaluation emerge, such as with the Common Core State Standards, advanced scoring methods are also being developed.
At a time when automated scoring is being advocated as a replacement for human scoring, the central idea behind Pacific Metrics’ research was to see if two imperfect scoring methods, human and machine, could be mutually leveraged to improve the overall accuracy of essay scoring. Using Pacific Metrics’ automated scoring software CRASE® (http://www.automatedscoring.com), the research team conducted studies to identify human rater bias or scoring drift, and considered what types of interventions could be applied within the live scoring window in order to improve accuracy. The analysis suggests that automated scoring can quickly detect trends in group and individual reader performance, alerting scoring operations in real time, to correct potential bias or drift.
“The use of automated scoring as a read-behind tool has proven to be very effective and acceptable in high-stakes testing. Scores are assigned by humans, while automated scoring allows 100% read-behinds in real time. Drift and bias can be detected and corrected in real time. There is no longer a need to seed validity papers into the scoring stream or to introduce parameters into our measurement models to correct for reader drift. It can simply be monitored out of existence very quickly,” says the Pacific Metrics team authoring the study.
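For readers curious what monitoring drift “out of existence” could even mean in code, here is a hypothetical sketch of the idea: compare each human rater’s scores against the machine’s read-behind scores and flag raters whose recent scores consistently run high or low. The window size and threshold below are invented for illustration; PM does not publish its actual parameters.

```python
# Hypothetical sketch of real-time drift monitoring via machine read-behinds.
# For each rater we track the running difference between human and machine
# scores; a sustained nonzero mean suggests bias or drift.

from collections import defaultdict, deque

WINDOW = 5             # how many recent reads to consider per rater
DRIFT_THRESHOLD = 0.5  # mean human-minus-machine gap that triggers an alert

recent_diffs = defaultdict(lambda: deque(maxlen=WINDOW))

def record_read(rater_id, human_score, machine_score):
    recent_diffs[rater_id].append(human_score - machine_score)

def drifting(rater_id):
    diffs = recent_diffs[rater_id]
    if len(diffs) < WINDOW:
        return False  # not enough data yet
    return abs(sum(diffs) / len(diffs)) > DRIFT_THRESHOLD

# A rater who consistently scores one point above the machine gets flagged.
for human, machine in [(4, 3), (4, 3), (3, 2), (4, 3), (4, 3)]:
    record_read("rater_17", human, machine)

print(drifting("rater_17"))  # True
```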
This research falls into the general field of man-machine interaction, in which machines are used to try to improve human performance. The authors believe that their research is an important development in our understanding of how automated scoring can assist human scoring organizations to more accurately score student writing, and they hope it will be a valuable resource for future studies and research in the advancement of essay evaluation methods.
How can a machine assess writing style or assess whether a human is aware of the nuances of writing style?
As for PM’s CRASE, well, it does not need to take over for human scoring, but according to PM, it can:
September 10 2013
The Common Core State Standards promote the use of open-ended questions in instruction and on summative and formative assessments. Automated scoring software that can faithfully replicate how trained educators evaluate a student’s written response offers a new approach for states to meet the challenge of providing timely and accurate scores without requiring a large investment of (teacher) time or money. Pacific Metrics researchers have developed a free resource, the Automated Scoring (AS) Decision Matrix, to assist states and organizations in understanding the range of scoring options available, and how to combine these options for the most effective scoring solution for specific needs.
The AS Decision Matrix is organized around two criteria: the level of stakes associated with the assessment (high, medium, or low) and commonly used item response types. Educators considering using automated scoring select the item type and the associated stakes, and the matrix provides a recommendation for the scoring model to be implemented and the associated preparation and monitoring processes. The matrix shows that depending on the item type, both human and automated scoring can be mutually leveraged to ensure high quality scoring while reducing the costs and time involved.
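Conceptually, such a matrix is just a lookup on those two criteria. A hypothetical sketch (the recommendations below are my inventions, not PM’s actual matrix):

```python
# Hypothetical sketch of a decision-matrix lookup, keyed on the two
# criteria the white paper names: stakes level and item response type.
# The recommendations shown are invented placeholders.

AS_DECISION_MATRIX = {
    ("high", "essay"): "human scoring with 100% automated read-behind",
    ("high", "constructed_response"): "engine-first with handscoring fallback",
    ("low", "essay"): "automated scoring with periodic human audits",
}

def recommend(stakes, item_type):
    return AS_DECISION_MATRIX.get((stakes, item_type), "consult the full matrix")

print(recommend("high", "essay"))
```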
The white paper is available on Pacific Metrics’ website at: /white-papers/ASDecisionMatrix_WhitePaper_Final.pdf
“Through years of research and practice, Pacific Metrics has gained experience with the real-world scenarios in the decision matrix. Our experience has taught us that users want to understand what to expect of automated scoring so that they can select what is most practical and advisable. The decision matrix will help state and district educators apply machine scoring more surely and successfully in their assessment programs” says Matt Schulz, Ph.D., Director of Research for Pacific Metrics.
In addition to reliably and accurately scoring essays and constructed-response items in tests, Pacific Metrics’ CRASE® supports the increased assignment of written tasks to students in the classroom. The software does not replace teacher evaluation of student work but supports it by providing fast and accurate feedback to students and teachers. CRASE can interoperate with digital learning and assessment platforms, allowing easy integration of its capabilities with a variety of testing programs and products.
The white paper referenced above includes a case study that reads like it could be based upon PM’s experience with LDOE:
Case study 2 is a high-stakes online assessment program that satisfies No Child Left Behind requirements. The program includes essays and unconstrained constructed-response items as a portion of the predominantly multiple-choice tests. The scoring models for these item types differ from Case 1 due to the high-stakes nature of the assessment.
The essays are scored by humans with a 100% automated scoring read-behind. Adjudication by an expert read occurs when the human and machine disagree by more than one point. Monitoring in this program is daily and the state agency is involved in reviewing the monitoring reports on a daily basis as well. Double-scored data were used to train the engine.
The unconstrained constructed-response items are scored using an automated scoring or handscoring scoring model. Responses are first submitted to the AS engine and the engine assigns scores only to responses that it can score with no error; the remaining responses are submitted for handscoring. The engine was trained on single-scored student responses, and the state agency approved all details of scoring. As with the essays, the monitoring occurs daily and the state agency is involved in review of monitoring reports on a daily basis.
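Stripped of the marketing language, the two scoring models in this case study reduce to a pair of simple rules. Here is a hypothetical sketch (again, my illustration, not PM’s code):

```python
# Hypothetical sketch of the case-study scoring models: essays get a human
# score plus a machine read-behind, with expert adjudication when the two
# disagree by more than one point; constructed-response items go to the
# engine first and fall back to handscoring when the engine is unsure.

def score_essay(human_score, machine_score, expert_score=None):
    if abs(human_score - machine_score) > 1:
        # Disagreement beyond one point triggers an expert read.
        return expert_score
    return human_score  # otherwise the human-assigned score stands

def score_constructed_response(response, engine, hand_scorer):
    machine = engine(response)  # returns a score, or None if unsure
    if machine is not None:
        return machine
    return hand_scorer(response)  # remaining responses are handscored

# Human 3 and machine 6 disagree by 3 points, so the expert's 4 stands.
print(score_essay(3, 6, expert_score=4))  # 4
# Human 5 and machine 5 agree, so the human score stands.
print(score_essay(5, 5))  # 5
```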
I am hard pressed to believe that a machine can be trained to replicate human understanding of the nuances of language. Consider one such nuance: LDOE’s passive-voice dodge, which avoids clarifying for the public that EOC constructed-response items are read first by humans and then by machines that guide those humans.
I searched for the humans behind PM essay grading and discovered that ACT acquired PM in July 2014. The press release noted that PM would continue to be its own “stand alone” company. I also found this February 2016 review on GlassDoor:
Ever since the founding CEO passed away, they (PM) threw away research and QA, then hopped on the bandwagon of pushing out flawed, unfinished products to get into the next “hot” market. Poor vision and nearsightedness in top management. They got sold to ACT and laid off a bunch of people. Barely hanging on the string of Louisiana Dept of Education contract cause of director connections. No more expansion for the future for anybody who works in there.
It turns out that PM is a small operation. In its press release on the acquisition, ACT describes PM as “employing approximately 80 team members.” And that is pre-acquisition. A photo of PM employees on GlassDoor shows only 40 employees.
PM is likely subcontracting for all of that EOC constructed response reading. There is no telling who the humans are behind the essay scoring.
At least we know who, or rather what, is helping them with that second read.