While numerous articles on Criterion® have been published and its validity evidence has accumulated, test users need to obtain relevant validity evidence for their local context and develop their own validity argument.
This paper aims to provide validity evidence for the interpretation and use of Criterion® for assessing second language (L2) writing proficiency at a university in Japan.
This is because we are interested in comparing scores derived from different prompts, interpreting scores as indicators of L2 writing proficiency, and examining score changes before and after instruction.
While these three perspectives cover only some of the many types of validity evidence, providing them would be a step toward a convincing validity argument (see Bachman & Palmer, ). We focused on three perspectives: (a) differences in the difficulty of prompts in terms of Criterion® holistic scores, (b) relationships between Criterion® holistic scores and indicators of L2 proficiency, and (c) changes in Criterion® holistic and writing quality scores at three time points over 28 weeks. We used Rasch analysis to examine (a), Pearson product–moment correlations to examine (b), and multilevel modeling to examine (c). First, we found statistically significant but minor differences in prompt difficulty. Second, Criterion® holistic scores were relatively weakly but positively correlated with indicators of L2 proficiency.

Along with the increasingly wider applications of Criterion®, numerous studies have been conducted from various perspectives; this work is well summarized in Enright and Quinlan (). One study examined the effects of Test of English as a Foreign Language (TOEFL) Internet-based test (iBT®) independent task prompts and rater types (human scoring vs. machine scoring by the same e-rater® engine as Criterion®) on holistic scores. It found nonsignificant and negligible effects of prompts and of the prompt-by-rater-type interaction, and a significant but small effect of rater types (partial η² = 0.003, 0.001, and 0.030, respectively). Another study reported a significant increase in the number of words students wrote and an improvement in overall organization. Such studies have provided valuable insights into the capability of Criterion® to detect changes in writing, but they share two limitations. First, all previous studies collected data at only two time points. It is preferable to measure writing three or more times, which would reveal clearer patterns of score change over time and yield stronger evidence for the utility of Criterion® as a measurement tool sensitive to long-term changes in L2 writing proficiency. Second, all previous research used repeated measures analyses that did not consider the nested structure of the data, in which students belong to different classes. Data are nested when data at lower levels are situated within data at higher levels.

Previous work has also emphasized the importance of developing one's own localized validity argument in light of one's test purposes and uses. To this end, we examine the validity of the interpretation and use of Criterion® scores for assessing L2 writing proficiency at a university in Japan.
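To make the nesting point concrete, the following is a minimal sketch (not the authors' actual analysis) of a multilevel growth model for holistic scores measured at three time points, with students nested in classes. The data are simulated and all variable names (`class_id`, `student_id`, `time`, `score`) are illustrative assumptions; the model uses the `statsmodels` mixed-effects API with a random intercept for class.

```python
# Illustrative only: simulated holistic scores for students nested in classes,
# measured at three time points, fit with a random-intercept multilevel model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_classes, n_students, n_times = 6, 10, 3  # e.g., 3 time points over 28 weeks

rows = []
for c in range(n_classes):
    class_effect = rng.normal(0, 0.5)        # class-level variation
    for s in range(n_students):
        student_effect = rng.normal(0, 0.8)  # student-level variation
        for t in range(n_times):
            # True growth of 0.4 score points per time point (assumed value)
            score = 3.0 + 0.4 * t + class_effect + student_effect + rng.normal(0, 0.5)
            rows.append({"class_id": c, "student_id": f"{c}-{s}",
                         "time": t, "score": score})
df = pd.DataFrame(rows)

# Random intercepts for classes; a fuller model could add students as a
# nested random effect (e.g., via re_formula / vc_formula).
model = smf.mixedlm("score ~ time", df, groups=df["class_id"])
result = model.fit()
print(result.params["time"])  # estimated growth per time point
```

Unlike a repeated measures ANOVA on pooled scores, this specification lets class-level variation be separated from the growth estimate, which is the substantive reason the paper favors multilevel modeling for nested data.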