Machine Comprehension Test (MCTest)

Results (accuracy on the MC500 Test and MC160 Test sets; score files are listed where available):

A Parallel-Hierarchical Model for Machine Comprehension on Sparse Data. by Adam Trischler, Zheng Ye, Xingdi Yuan, Jing He, Phillip Bachman, and Kaheer Suleman. In ACL 2016. MC500 Test: 71.00%, MC160 Test: 74.58%.
Machine comprehension with syntax, frames, and semantics. by Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. In ACL 2015. MC500 Test: 69.94%, MC160 Test: 75.27%.
Learning answer entailing structures for machine comprehension. by Mrinmaya Sachan, Avinava Dubey, Eric P. Xing, and Matthew Richardson. In ACL 2015. MC500 Test: 67.83%, MC160 Test: ---.
A Strong Lexical Matching Method for the Machine Comprehension Test. by Ellery Smith, Nicola Greco, Matko Bošnjak, and Andreas Vlachos. In EMNLP 2015. MC500 Test: 65.96%, MC160 Test: 74.27%.
Machine comprehension with discourse relations. by Karthik Narasimhan and Regina Barzilay. In ACL 2015. MC500 Test: 63.75%, MC160 Test: 73.23%.
Baseline: SW+D + RTE. MC500 Test: 63.33%, MC160 Test: 69.27%. Score file: RTE_Plus_Ba...
Baseline: Sliding Window plus Distance (SW+D). MC500 Test: 59.93%, MC160 Test: 68.02%. Score file: Baseline_SW_D
Baseline: RTE (using BIUTEE). MC500 Test: 55.01%, MC160 Test: 57.92%. Score file: RTE
Baseline: Sliding Window (SW). MC500 Test: 54.28%, MC160 Test: 58.26%. Score file: Baseline_SW
Attention-Based Convolutional Neural Network for Machine Comprehension. by Wenpeng Yin, Sebastian Ebert, Hinrich Schütze. In NAACL 2016. MC500 Test: 52.9%, MC160 Test: 63.1%.

Add your results: This is the set of results we are aware of, but it is hard to track everything that has been published on the data set. If you have results on MCTest that you would like to appear above (or if you spot an error), email us with your results.

Score Files: The score file format is specified in the README.txt contained in the MCTest data. A score file provides the given algorithm's score for each answer. We hope this will enable more rapid progress on MCTest: each new algorithm can build on top of previous results, pairwise statistical significance testing becomes straightforward, and anyone can investigate what kinds of errors previous approaches make.
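As one illustration of the kind of pairwise significance testing that score files make possible, below is a minimal sketch of a paired permutation test on per-question credits for two systems. The input representation (aligned lists of per-question credits) and all names are ours, not part of the released score file format; deriving the credits from a score file and the answer key is left to the reader.

```python
import random

def paired_permutation_test(credits_a, credits_b, trials=10000, seed=0):
    """Two-sided paired permutation test on per-question credits.

    credits_a, credits_b -- per-question credits for two systems, aligned so
                            that index i refers to the same question in both.
    Returns an estimated p-value for the observed difference in total credit.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(credits_a, credits_b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(trials):
        # Randomly flip the sign of each per-question difference, i.e.
        # randomly swap which system each question's credit is assigned to.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) >= observed:
            extreme += 1
    return extreme / trials
```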

Note: The accuracies reported here differ from those reported in the paper. The primary reason is how ties are handled. Here, we report accuracies using partial credit: if three answers tie for the highest score and one of them is correct, the algorithm receives 1/3 of a point for that question. This differs from the published results, which were computed with random tie-breaking. Partial credit is deterministic, less noisy, and more easily reproduced by others. For these reasons, we suggest that any results published using this data also use partial credit in the case of ties.
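As a concrete illustration, here is a minimal sketch of partial-credit scoring; the function and variable names are ours and not part of the MCTest distribution.

```python
def partial_credit(scores, correct_idx):
    """Credit for one question under partial-credit tie-breaking.

    scores      -- the algorithm's scores, one per candidate answer
    correct_idx -- index of the correct answer

    If k answers tie for the highest score, the question is worth 1/k
    if the correct answer is among them, and 0 otherwise.
    """
    best = max(scores)
    tied = [i for i, s in enumerate(scores) if s == best]
    return 1.0 / len(tied) if correct_idx in tied else 0.0


def accuracy(all_scores, all_correct):
    """Mean partial credit over a set of questions."""
    credits = [partial_credit(s, c) for s, c in zip(all_scores, all_correct)]
    return sum(credits) / len(credits)
```

For example, with per-answer scores [0.5, 0.2, 0.5, 0.5] and the first answer correct, three answers tie for the highest score and the question is worth 1/3.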

Further, once we were using partial credit for ties, we could see an improvement in development-set accuracy from using a weight other than 1 when combining the sliding window and distance baselines (previously, this improvement was lost in the noise induced by random tie-breaking). The best weight for MC500 was 11, and for MC160 it was 10 (on MC160, weights 1 and 10, among others, actually tie for accuracy; we selected 10 on the assumption that the best weight should be similar across datasets, and MC500, which has more data, indicated an ideal weight of 11). This weight tuning was done purely on the development sets.
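Below is a sketch of that development-set weight sweep. It assumes the combination used by the SW+D baseline, in which the distance score is subtracted from the sliding-window score, with the weight scaling the distance term; the data layout and names are illustrative.

```python
def combined_score(sw_score, dist_score, weight):
    """SW+D score for one candidate answer: sliding-window score minus a
    weighted distance penalty (weight = 1 recovers the original combination)."""
    return sw_score - weight * dist_score


def dev_accuracy(dev_questions, weight):
    """Partial-credit accuracy on the development set for a given weight.

    dev_questions -- list of (sw_scores, dist_scores, correct_idx) tuples,
                     where the score lists have one entry per candidate answer.
    """
    total = 0.0
    for sw_scores, dist_scores, correct_idx in dev_questions:
        scores = [combined_score(s, d, weight)
                  for s, d in zip(sw_scores, dist_scores)]
        best = max(scores)
        tied = [i for i, s in enumerate(scores) if s == best]
        total += (1.0 / len(tied)) if correct_idx in tied else 0.0
    return total / len(dev_questions)


def best_weight(dev_questions, weights=range(1, 21)):
    """Sweep candidate weights and keep the one with the best dev accuracy."""
    return max(weights, key=lambda w: dev_accuracy(dev_questions, w))
```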

Finally, there is a bug in the paper. The correct implementation of Algorithm 2 (the distance-based baseline) is to take an average, across all question words, of the minimum distance between that question word and one of the answer words. The paper accidentally says to take the minimum across the question words.
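A sketch of the corrected computation follows. It assumes tokenized word lists, uses the |P|-1 normalization mentioned in the update below, and omits details such as stop-word filtering, so treat it as an illustration of the averaging fix rather than a drop-in reimplementation of Algorithm 2.

```python
def distance_score(passage, question_words, answer_words):
    """Corrected distance-based score for one candidate answer.

    passage        -- list of passage tokens
    question_words -- collection of question words (after any filtering)
    answer_words   -- collection of words in the candidate answer

    For each question word that occurs in the passage, take the minimum
    token distance to any occurrence of an answer word, then AVERAGE these
    minima over the question words (the paper mistakenly says to take the
    minimum over question words). The result is normalized by |P| - 1.
    """
    positions = {}
    for i, tok in enumerate(passage):
        positions.setdefault(tok, []).append(i)

    q_in_p = [w for w in set(question_words) if w in positions]
    a_in_p = [w for w in set(answer_words) if w in positions]
    if not q_in_p or not a_in_p:
        return 1.0  # maximum penalty when nothing matches the passage

    answer_positions = [i for w in a_in_p for i in positions[w]]
    min_dists = [
        min(abs(i - j) for i in positions[w] for j in answer_positions)
        for w in q_in_p
    ]
    return (sum(min_dists) / len(min_dists)) / (len(passage) - 1)
```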

The original score files for the SW+D baseline, with a combination weight of 1 and taking the minimum instead of the average when computing the distance-based baseline, can be downloaded here: BaselineInPaper_SW_D

UPDATE 11/5/2014: Fixed Baseline_SW_D.scores.zip, which had accidentally been created using an MC500 baseline combination weight of 10 instead of 11 and was normalizing the distance-based scores by |P| rather than |P|-1.

© Microsoft Corporation. All rights reserved.