Adriana Stan, Peter Bell, Simon King (2012): A Grapheme-based Method for Automatic Alignment of Speech and Text Data. In: Proc. Spoken Language Technology Workshop (SLT), 2012 IEEE, pp. 286-290, 2012.

@inproceedings{stanSLT2012,
title = {A Grapheme-based Method for Automatic Alignment of Speech and Text Data},
author = {Adriana Stan and Peter Bell and Simon King},
url = {http://dx.doi.org/10.1109/SLT.2012.6424237},
year = {2012},
date = {2012-12-05},
booktitle = {Proc. Spoken Language Technology Workshop (SLT), 2012 IEEE},
pages = {286-290},
abstract = {This paper introduces a method for automatic alignment of speech data with unsynchronised, imperfect transcripts, for a domain where no initial acoustic models are available. Using grapheme-based acoustic models, word skip networks and orthographic speech transcripts, we are able to harvest 55% of the speech with a 93% utterance-level accuracy and 99% word accuracy for the produced transcriptions. The work is based on the assumption that there is a high degree of correspondence between the speech and text, and that a full transcription of all of the speech is not required. The method is language independent and the only prior knowledge and resources required are the speech and text transcripts, and a few minor user interventions.},
keywords = {grapheme-based models, imperfect transcripts, speech alignment, word networks}
}
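To make the two components named in the abstract concrete, here is a minimal Python sketch, not the authors' implementation: it builds a grapheme-based lexicon (each word "pronounced" as its letter sequence, so no phonetic dictionary is needed) and a linear word network with skip arcs so an aligner can bypass transcript words that were never actually spoken. The function names, the <skip> label and the arc-list representation are assumptions made purely for this illustration.

# Minimal sketch (illustrative only, not the paper's pipeline) of a
# grapheme-based lexicon and a word skip network.
import re
from typing import Dict, List, Tuple

def grapheme_lexicon(words: List[str]) -> Dict[str, List[str]]:
    """Map each word to its grapheme (letter) sequence, so acoustic units
    can be built directly from orthography without a phonetic dictionary."""
    lexicon = {}
    for word in words:
        letters = re.sub(r"[^a-z']", "", word.lower())
        if letters:
            lexicon[word] = list(letters)
    return lexicon

def word_skip_network(transcript: List[str]) -> List[Tuple[int, int, str]]:
    """Build a linear word network as (from_state, to_state, label) arcs:
    each transcript word spans states i -> i+1, with a parallel arc labelled
    "<skip>" so a decoder may bypass words that were not actually spoken."""
    arcs = []
    for i, word in enumerate(transcript):
        arcs.append((i, i + 1, word))      # the transcript word itself
        arcs.append((i, i + 1, "<skip>"))  # optional bypass of this word
    return arcs

if __name__ == "__main__":
    transcript = "the cat sat on the mat".split()
    print(grapheme_lexicon(transcript)["cat"])    # ['c', 'a', 't']
    print(word_skip_network(transcript)[:4])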