This page provides some baseline algorithms. An overview of the different approaches that learn distributions via automata can be found in the PAutomaC article.

Simple algorithms

Two simple algorithms are available in this Python code. Usage: python file.train file.test. The script generates two files:

  • file.3gr that contains the probabilities on the test set obtained by a 3-gram trained using both file.train and file.test.
  • file.sam that contains the probabilities on the test set obtained by computing the frequency of each string in both file.train and file.test.
Both output files are in the format expected by the compete page: they start with the number of elements in file.test, followed by exactly that many probabilities, normalized to sum to 1.
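The two baselines and the output format described above can be sketched as follows. This is a minimal illustration, not the distributed script: the function names, the START and END markers, the representation of strings as lists of integer symbols, and the add-one smoothing are all assumptions made for the example.

```python
from collections import Counter

START, END = -1, -2  # hypothetical padding and end-of-string markers


def train_trigram(strings):
    """Count trigrams and their two-symbol contexts over padded strings."""
    tri, ctx = Counter(), Counter()
    for s in strings:
        padded = [START, START] + list(s) + [END]
        for i in range(2, len(padded)):
            tri[tuple(padded[i - 2:i + 1])] += 1
            ctx[tuple(padded[i - 2:i])] += 1
    return tri, ctx


def trigram_prob(s, tri, ctx, vocab_size):
    """Probability of string s under the 3-gram, with add-one smoothing
    (the smoothing scheme is an assumption of this sketch)."""
    padded = [START, START] + list(s) + [END]
    p = 1.0
    for i in range(2, len(padded)):
        num = tri[tuple(padded[i - 2:i + 1])] + 1
        den = ctx[tuple(padded[i - 2:i])] + vocab_size
        p *= num / den
    return p


def frequency_probs(train, test):
    """Relative frequency of each test string in train and test combined."""
    counts = Counter(tuple(s) for s in train + test)
    total = sum(counts.values())
    return [counts[tuple(s)] / total for s in test]


def normalize(probs):
    """Rescale the probabilities so that they sum to 1 over the test set."""
    total = sum(probs)
    return [p / total for p in probs]


def format_output(probs):
    """First line: number of test strings; then one probability per line."""
    return "\n".join([str(len(probs))] + [repr(p) for p in probs])
```

Here vocab_size should count the distinct symbols plus the END marker. A call such as format_output(normalize(frequency_probs(train, test))) would produce the text of a file in the layout described above, assuming one probability per line.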


An implementation of the well-known Baum-Welch algorithm is available. We implemented it in Python and participants can of course use and modify the code to optimize their results.


Though it is clearly not the state of the art, most of today's best algorithms are derived from the simple yet effective approach of ALERGIA [Carrasco & Oncina, 1994]. As it requires tuning only one parameter, it is a relevant baseline. The code, which uses OpenFST and Visual Studio, can be downloaded here. The scores of this baseline in the database all correspond to an alpha parameter set to 0.05, so it can certainly be tuned to get better results.
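To illustrate what the alpha parameter controls, here is a minimal sketch of the Hoeffding-bound compatibility test that ALERGIA uses to decide whether two states may be merged. It is written in Python for illustration and is not taken from the downloadable code; the function name and the handling of empty counts are assumptions.

```python
import math


def hoeffding_compatible(f1, n1, f2, n2, alpha=0.05):
    """Return True if the observed frequencies f1/n1 and f2/n2 are close
    enough to be treated as estimates of the same probability.

    f1, f2: number of times an event (a transition or a termination)
            was observed in each of the two states.
    n1, n2: number of strings that reached each state.
    """
    if n1 == 0 or n2 == 0:
        # Assumption of this sketch: no evidence means no reason to reject.
        return True
    bound = math.sqrt(0.5 * math.log(2.0 / alpha)) * (
        1.0 / math.sqrt(n1) + 1.0 / math.sqrt(n2)
    )
    return abs(f1 / n1 - f2 / n2) < bound
```

A smaller alpha widens the bound, so more pairs of states are judged compatible and the resulting automaton is smaller; a larger alpha makes the test stricter and leads to fewer merges.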