This page provides the data for the competition.
For each problem, you can download the train and test sets that were available during the competition. The idea was to use the train set (or both sets, if you thought that helped) to train your algorithm and learn a model. Participants could then assign a probability to each element of the test set and submit their answers on the compete page. More details are given on the participate page. The solution file (which contains the probabilities that the target assigns to the elements of the test set) and the model file (with a description of the target machine) are now available.
During this phase, the machine used to generate each data set was not given. In addition, several parameters that were fixed in the previous phase of the competition took different values: the sizes of the samples and of the alphabet vary, the number of states of the target is not the same across problems, and the symbol and transition sparsities take different values. More details are given in the paper containing the results of the competition. All these files can be found in this archive.
|Problem number||Files||Problem number||Files||Problem number||Files|
|Problem 1||Problem 2||Problem 3|
|Problem 4||Problem 5||Problem 6|
|Problem 7||Problem 8||Problem 9|
|Problem 10||Problem 11||Problem 12|
|Problem 13||Problem 14||Problem 15|
|Problem 16||Problem 17||Problem 18|
|Problem 19||Problem 20||Problem 21|
|Problem 22||Problem 23||Problem 24|
|Problem 25||Problem 26||Problem 27|
|Problem 28||Problem 29||Problem 30|
|Problem 31||Problem 32||Problem 33|
|Problem 34||Problem 35||Problem 36|
|Problem 37||Problem 38||Problem 39|
|Problem 40||Problem 41||Problem 42|
|Problem 43||Problem 44||Problem 45|
|Problem 46||Problem 47||Problem 48|
The first real-data problem corresponds to part-of-speech (POS) tagging. The train and test sets are randomly selected sentences in which words have been automatically replaced by their POS tags. The evaluation is done by comparing the submitted probabilities with those obtained with the 3-gram baseline trained on the whole corpus (which is 10 times bigger than the available train set).
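To make the notion of a 3-gram model over POS sequences concrete, here is a minimal sketch that assigns a probability to a whole sentence. It is not the baseline used for the evaluation: the add-one smoothing, the boundary markers `<s>` and `</s>`, and the class name `TrigramModel` are all assumptions made for illustration.

```python
from collections import Counter

BOS, EOS = "<s>", "</s>"  # hypothetical boundary markers, not part of the data


class TrigramModel:
    """Additively smoothed 3-gram model over tag sequences (illustrative only)."""

    def __init__(self, sentences, alpha=1.0):
        self.alpha = alpha          # smoothing constant (an assumption)
        self.trigrams = Counter()   # counts of (t1, t2, t3)
        self.contexts = Counter()   # counts of (t1, t2) used as a context
        self.vocab = {EOS}          # symbols that can be predicted
        for sentence in sentences:
            self.vocab.update(sentence)
            padded = [BOS, BOS] + list(sentence) + [EOS]
            for i in range(2, len(padded)):
                context = (padded[i - 2], padded[i - 1])
                self.trigrams[context + (padded[i],)] += 1
                self.contexts[context] += 1

    def prob(self, sentence):
        """Probability of the whole sentence, including the end marker."""
        padded = [BOS, BOS] + list(sentence) + [EOS]
        size = len(self.vocab)
        p = 1.0
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            count = self.trigrams[context + (padded[i],)]
            p *= (count + self.alpha) / (self.contexts[context] + self.alpha * size)
        return p


# Toy usage with made-up tag sequences.
model = TrigramModel([["DT", "NN", "VB"], ["DT", "NN"]])
p_example = model.prob(["DT", "NN"])
```

Because the end marker is predicted like any other symbol, the model defines final probabilities implicitly. Tags never seen in training still receive a non-zero probability here, but the distribution is then no longer exactly normalized.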
The second real-data problem comes from sensor information. Note that all strings are of length 20, as they correspond to sliding windows over a discretized sensor signal. It may thus be worthwhile to include a stop after 20 steps in your models instead of final probabilities. The evaluation is done in the same way as for the first real-data problem.
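One possible reading of the "stop after 20 steps" suggestion is a model that has no end-of-string probability at all and simply assigns probability zero to any string whose length is not the window length. The sketch below does this with a smoothed bigram model; the class name `FixedLengthBigram`, the integer encoding of symbols, and the add-one smoothing are assumptions for illustration, not part of the competition material.

```python
from collections import Counter

WINDOW = 20  # string length stated for the second real-data problem


class FixedLengthBigram:
    """Bigram model over strings of one fixed length, with no final probabilities."""

    def __init__(self, strings, alphabet_size, length=WINDOW, alpha=1.0):
        self.k = alphabet_size      # symbols are assumed to be 0 .. alphabet_size - 1
        self.length = length
        self.alpha = alpha          # smoothing constant (an assumption)
        self.n = 0                  # number of training strings
        self.first = Counter()      # counts of the first symbol
        self.pairs = Counter()      # counts of (previous, next)
        self.outgoing = Counter()   # counts of transitions leaving a symbol
        for s in strings:
            if not s:
                continue
            self.n += 1
            self.first[s[0]] += 1
            for a, b in zip(s, s[1:]):
                self.pairs[(a, b)] += 1
                self.outgoing[a] += 1

    def prob(self, s):
        """Probability of s; zero unless len(s) equals the fixed length."""
        if len(s) != self.length:
            return 0.0
        p = (self.first[s[0]] + self.alpha) / (self.n + self.alpha * self.k)
        for a, b in zip(s, s[1:]):
            p *= (self.pairs[(a, b)] + self.alpha) / (self.outgoing[a] + self.alpha * self.k)
        return p


# Toy check on a binary alphabet with length 3 instead of 20.
toy = FixedLengthBigram([(0, 1, 1), (1, 1, 0)], alphabet_size=2, length=3)
total = sum(toy.prob((a, b, c)) for a in range(2) for b in range(2) for c in range(2))
```

With this design all probability mass is spread over strings of exactly the fixed length, so none is lost to shorter or longer strings that cannot occur in the data.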
|Train set for real problem 1||Test set for real problem 1|
|Train set for real problem 2||Test set for real problem 2|
These are the data that were used during the training phase of the conference, which ended on May 20th 2012. The files containing the real probabilities of the test sets are now available. They correspond to the probabilities that the target machine assigns to the elements of the test sets.
Hint: the name of each file (except for the first problem) contains information about the way it was generated.