Text Normalisation Datasets

Text datasets in three languages (English, Spanish, and Romanian) for testing and validating the proposed statistical machine translation method for text normalisation.

Several text datasets are provided in three languages: English, Spanish, and Romanian. They can be used to test and validate a method, and its associated software, for language-independent number transcription as a component of the text normalisation module in text-to-speech synthesis. Instead of using transcription rules, the system trains a model on a small set of transcribed numbers and then uses statistical machine translation to expand unseen numbers into the corresponding text.

To create translation models you will need the following tools: GIZA++ 1.0.7, IRSTLM 5.80.0, and Moses. Alternatively, you can directly download the customized solution proposed by the “Simple4All” project in the “Norma” toolkit.

CONTENTS:
Each dataset consists of pairs of files: one containing the list of numbers (named *.so, source) and one containing the corresponding text transcription (*.ta, target). Files are available for training, tuning, and testing.
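
The paired layout described above can be read with a few lines of Python. This is only a minimal sketch: it assumes that each *.so file and its *.ta counterpart are UTF-8 text with one entry per line, aligned line by line, which is not stated explicitly here. The file names `train.so` and `train.ta` and the helper names `load_pairs` and `exact_match_rate` are hypothetical and chosen for illustration.

```python
from pathlib import Path
import tempfile


def load_pairs(source_path, target_path, encoding="utf-8"):
    """Return a list of (number, transcription) tuples from a *.so / *.ta pair.

    Assumption: both files hold one entry per line and are line-aligned.
    """
    source_lines = Path(source_path).read_text(encoding=encoding).splitlines()
    target_lines = Path(target_path).read_text(encoding=encoding).splitlines()
    if len(source_lines) != len(target_lines):
        raise ValueError(
            f"line count mismatch: {len(source_lines)} source lines, "
            f"{len(target_lines)} target lines"
        )
    return [(s.strip(), t.strip()) for s, t in zip(source_lines, target_lines)]


def exact_match_rate(hypotheses, references):
    """Fraction of system outputs that equal the reference transcription."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references differ in length")
    if not references:
        return 0.0
    matches = sum(h.strip() == r.strip() for h, r in zip(hypotheses, references))
    return matches / len(references)


if __name__ == "__main__":
    # Tiny stand-in files (hypothetical names and contents, not real dataset entries).
    with tempfile.TemporaryDirectory() as tmp:
        so_file = Path(tmp) / "train.so"
        ta_file = Path(tmp) / "train.ta"
        so_file.write_text("21\n105\n", encoding="utf-8")
        ta_file.write_text("twenty one\none hundred and five\n", encoding="utf-8")
        for number, words in load_pairs(so_file, ta_file):
            print(number, "->", words)
```

The same `load_pairs` call could be applied to the tuning and testing files, and `exact_match_rate` offers one simple way to compare a system's output on the test numbers against the *.ta references; other evaluation measures may be more appropriate depending on the experiment.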

LICENSE:
The dataset is licensed under a Creative Commons Attribution 3.0 Unported License. The associated scripts are licensed under the terms described in the “Simple4All” “Norma” toolkit.

CITATION:
If you use any part of the datasets in your work, please cite the
following paper:
R. San-Segundo, J.M. Montero, M. Giurgiu, I. Muresan, S. King,
“Multilingual Number Transcription for Text-to-Speech Conversion”,
In Proc. of The 8th Speech Synthesis Workshop, Barcelona, September, 2013.
