Artificial Data
Details about the way the data are generated can be found in this article.
How the targets are generated
Four kinds of machines are generated: Markov Chains, Determinist Probabilistic Finite Automata, Hidden Markov Models and Probabilistic Finite Automata. They are all built followwing the same procedure:
They are constructed completely at random, i.e., not biased towards a certain structure such as in the recent Stamina competition that was geared towards learning software models. The construction proceedure takes the following parameters as inputs:- The number of states N
- The size of the alphabet A
- The sparsity of symbols S (over all possible state-symbol pairs, only S percent of them are generated, chosen uniformly at random)
- The sparsity of transitions T (over all possible state-state-symbol triples, only T percent of them are generated, chosen uniformly at random)
The number of initial states is then T*N and the number of final states is S*N.
All initial, symbol, and transition probabilisties are drawn from a Dirichlet distribution. The final probabilities are drawn together with the symbol probabilities.
For the real competition phase, the complexity of the generated targets is tested using the baselines, which allows a selection of targets of different complexity.
How the data sets are generated
Given a generated target, the training set is generated using this model in the obvious way, by following paths in the PFA.
The test set is generated in the same way but duplicates are removed. If the resulting set does not contain enough examples, we re-generate strings in the same way.
If the average length of the generated strings is outside of a predefined range, a new automaton and new data sets are generated.
How the success of the learning is evaluated
Knowing the PFA, the probability of a string is the sum of the probabilities on all the possible paths of that string. Given a path q1 a q2 b q3 c ... d qf, the path probability is I[q1](1.0-F[q1])S[q1,a]T[q1,a,q2](1.0-F[q2])...F[qf]. We compute it using dynamic programming over state-remaining sequence pairs (as a complete backtrack is not efficient).
We thus are able to compute the probabilities of each string of the test test. We normalize these probabilities in order they sum to 1.
The evaluation measure is a perplexity one, given by the formula:
where PrT(x) is the probability of
Notice that this formula is independent of a specific model and thus participants do not have to learn PFAs.
Real world data
During the second phase of the competition, there will be some real data from the real world. We are working on that at the moment, but all ideas are welcome. In particular, if you have a dataset that can be used, please feel free to contact us.


