Text Normalisation Datasets

Text datasets in three languages (English, Spanish, and Romanian) for testing and validating a statistical machine translation method for text normalisation.

Several text datasets are provided in three languages: English, Spanish, and Romanian. They can be used to test and validate a method, and its associated software, for language-independent number transcription as a component of the text normalisation module in text-to-speech synthesis. Instead of hand-written transcription rules, the system trains a model on a small set of transcribed numbers and then uses statistical machine translation to expand unseen numbers into the corresponding text.

To create translation models, you will need the following tools: GIZA++ 1.0.7, IRSTLM 5.80.0, and Moses. Alternatively, download the customised solution provided by the “Simple4All” project in the “Norma” toolkit.

Each dataset consists of pairs of files: a list of numbers (named *.so, source) and the corresponding text transcription (*.ta, target). Files are available for training, tuning, and testing.
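
The source/target pairing can be sketched in Python as follows. This is only an illustration, not part of the distributed scripts: it assumes the two files of a pair are line-aligned (line N of the *.so file corresponds to line N of the *.ta file) and UTF-8 encoded, neither of which is stated above. The function name `read_parallel` and the file stem `train` are hypothetical.

```python
from pathlib import Path


def read_parallel(stem, encoding="utf-8"):
    """Return (number, transcription) pairs from <stem>.so and <stem>.ta.

    Assumption: both files are line-aligned and share the given encoding.
    """
    source_lines = Path(f"{stem}.so").read_text(encoding=encoding).splitlines()
    target_lines = Path(f"{stem}.ta").read_text(encoding=encoding).splitlines()

    # A length mismatch means the files are not a valid parallel pair.
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"{stem}.so has {len(source_lines)} lines, "
            f"{stem}.ta has {len(target_lines)}"
        )

    # Strip surrounding whitespace so stray spaces do not affect comparison.
    return [(s.strip(), t.strip()) for s, t in zip(source_lines, target_lines)]
```

For example, `read_parallel("train")` would read `train.so` and `train.ta` from the current directory, if files with those names exist, and return a list of tuples. Checking the line counts before training is a cheap way to catch a truncated or mismatched pair of files.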

The datasets are licensed under a Creative Commons Attribution 3.0 Unported License. The associated scripts are licensed as described in the “Simple4All” “Norma” tool.

If you use any part of the datasets in your work, please cite the
following paper:
R. San-Segundo, J.M. Montero, M. Giurgiu, I. Muresan, S. King,
“Multilingual Number Transcription for Text-to-Speech Conversion”,
In Proc. of The 8th Speech Synthesis Workshop, Barcelona, September, 2013.