MCTest: Machine comprehension test
http://research.microsoft.com/mct

---------------------------------------
Welcome
---------------------------------------

In this archive, you'll find the 660 reading comprehension tests (split into MC160 and MC500) described in this paper:

Matthew Richardson, Christopher J.C. Burges, and Erin Renshaw (2013). "MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text," in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP 2013), pp. 193-203, Seattle, WA, USA, October 2013.
http://research.microsoft.com/en-us/um/redmond/projects/mctest/MCTest_EMNLP2013.pdf

If you use this data, we ask that you reference the above paper so that others may also easily find and use the data themselves.

If you have any questions or find the data useful, please don't hesitate to contact us:

Matt Richardson: mattri@microsoft.com
Chris Burges: cburges@microsoft.com
Erin Renshaw: erinren@microsoft.com

---------------------------------------
Train, Development, and Test Sets
---------------------------------------

There are two primary datasets: MC500 and MC160. See our paper (link above) for a description of the differences between the two. MC500 contains 500 story sets (a story set is a story and its associated questions); MC160 contains 160.

Each dataset is split into train, development, and test sets, with each story set assigned to exactly one. As one would expect, the train set can be used to train your algorithm, to examine closely what kinds of phenomena exist in the stories, or for any other purpose you like. The development set is intended to be a held-out set that you use to evaluate your algorithm as you develop it. The test set is intended to be a final test set that you use only once, to get the results you publish.

The answers for the test set are distributed in a separate archive. We encourage you to download that archive only when you are ready to do your final evaluation. For both the TSV and TXT formats, downloading the answers archive will allow you to use the same code as was used when evaluating on the development set.

The files are named:

    mc[500|160].[train|dev|test].[txt|tsv|ans]

---------------------------------------
License
---------------------------------------

Please see the included license.pdf for the license.

---------------------------------------
Data Format
---------------------------------------

We provide the data in two formats: .txt for ease of readability, and .tsv/.ans for ease of coding. The two formats contain the same information, just presented differently.

In both formats, questions are prefixed with "one:" or "multiple:", indicating whether the author marked the question as requiring one or multiple sentences from the story in order to find the right answer.

The text in the stories has been processed as little as possible to remain true to the original author's submission. Special non-ASCII characters such as curly quotation marks, em-dashes, and ellipses have been converted to their ASCII equivalents.

---- TXT Format ----

The .txt files present the stories and questions in a convenient text format, for ease of reading. The correct answer for each question is marked with a "*". The format is consistent, so it can be read programmatically if desired, but it may be harder to parse than the TSV format. Note that the test story sets do not indicate the correct answer (see the note above on the train/dev/test split). To get the test sets in text format that do contain the answers, you must download the test answers archive.
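For convenience, here is a small Python sketch (ours, not part of the distribution) showing one way to turn the naming convention above into file paths. The data directory argument and the helper name are illustrative assumptions only:

    import os

    def mctest_paths(data_dir, dataset="mc160", split="dev"):
        # Build the expected paths for one dataset/split, following the
        # naming convention mc[500|160].[train|dev|test].[txt|tsv|ans].
        # Note: the test-set answers ship in the separate answers archive.
        assert dataset in ("mc160", "mc500")
        assert split in ("train", "dev", "test")
        base = os.path.join(data_dir, dataset + "." + split)
        return {ext: base + "." + ext for ext in ("txt", "tsv", "ans")}

For example, mctest_paths(".", "mc500", "train") would point at mc500.train.txt, mc500.train.tsv, and mc500.train.ans in the current directory.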
---- TSV/ANS Format ----

These are tab-delimited files, with one story set per line. The .tsv file contains the story, questions, and answers. The .ans file contains the correct answer for each question.

The format of a line in the TSV file is:

    Id  properties  story  q1  q2  q3  q4

where

    qN = questionText  answerA  answerB  answerC  answerD

and properties is a semicolon-delimited list of property:value pairs, including:

    Author:           anonymized author id, consistent across all files
    Work Time(s):     seconds between the author accepting and submitting the task
    Qual. score:      the author's grammar qualification test score (% correct)
    Creativity Words: words the author was given to encourage creativity

(There are no creativity words or qualification scores for MC160; see the paper.)

The format of a line in the ANS file is:

    answer1  answer2  answer3  answer4

where answerN is the correct answer (A, B, C, or D) for question N.

Finally, because some authors used newlines and/or tabs to indicate paragraph separation, and this would break the TSV format, we have replaced any newline or tab in the stories with "\newline" or "\tab", respectively. No questions or answers required this escaping.

---------------------------------------
An Ongoing Resource
---------------------------------------

As we stated in our paper, we will maintain the website with links to the latest published results using this data. If you publish a paper using this data, let us know and we will add a link to it. Also, if you publish a paper and are willing to share your scoring files with others, send them to us and we will post them as well. We hope that by providing the scoring files from previous work, we will enable more rapid progress on this problem -- by enabling each new algorithm to build on top of previous algorithmic results, by allowing pairwise statistical significance testing, and by allowing anyone to investigate what kinds of errors were made by previous work.

---------------------------------------
Score Files
---------------------------------------

To send us your scores, please format them in the following tab-delimited format, one line per story set (similar to the ANS file):

    scores1  scores2  scores3  scores4

where scoresN is the group of scores for question N, with the format:

    scoresN = scoreN_A, scoreN_B, scoreN_C, scoreN_D

and scoreN_A is the score your algorithm assigns to answer A for question N. The scores may be probabilities, or may simply be unnormalized real values. It is assumed that the highest score is the one your algorithm would select, and that the higher the score, the more confident it is in that selection.

Example:

    3.2, 1.1, 0.9, 3.1    -0.3, 1.1, -4.3, 0.4
    ...

This would mean the algorithm selects "A" for question 1 and "B" for question 2.

We will provide these score files for public download, attributed to you and with a link to your publication if possible.
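To make the TSV, ANS, and score-file formats above concrete, here is a minimal Python sketch (ours, not part of the distribution) that reads a .tsv/.ans pair and writes a score file in the requested format. The output file name and the trivial "always answer A" scores are placeholder assumptions; substitute your own algorithm's scores:

    def load_tsv(path):
        # One story set per line: id, properties, story, then 4 questions,
        # each question being its text followed by answers A-D (23 fields).
        story_sets = []
        with open(path) as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                sid, properties, story = fields[0], fields[1], fields[2]
                # Undo the escaping used to keep each story on one line.
                story = story.replace("\\newline", "\n").replace("\\tab", "\t")
                questions = []
                for i in range(4):
                    q = fields[3 + 5 * i : 8 + 5 * i]
                    questions.append({"text": q[0], "answers": q[1:]})
                story_sets.append({"id": sid, "properties": properties,
                                   "story": story, "questions": questions})
        return story_sets

    def load_ans(path):
        # One line per story set: the correct letters (A-D) for questions 1-4.
        with open(path) as f:
            return [line.split() for line in f]

    def write_scores(path, all_scores):
        # One line per story set: four tab-separated groups, each group being
        # comma-separated scores for answers A-D of one question.
        with open(path, "w") as out:
            for story_scores in all_scores:
                groups = [", ".join("%g" % s for s in question_scores)
                          for question_scores in story_scores]
                out.write("\t".join(groups) + "\n")

    # Placeholder example: an "always answer A" baseline on the MC160 dev set.
    stories = load_tsv("mc160.dev.tsv")
    answers = load_ans("mc160.dev.ans")
    scores = [[[1.0, 0.0, 0.0, 0.0] for _ in s["questions"]] for s in stories]
    write_scores("my_algorithm.mc160.dev.scores", scores)
    accuracy = sum(a[i] == "A" for a in answers for i in range(4)) / (4.0 * len(answers))
    print("dev accuracy: %.3f" % accuracy)

The same loaders work for any split; for the test split, the .ans file comes from the separate answers archive, as noted above.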