LILDA

LILDA is a language identification tool based on Latent Dirichlet Allocation (LDA) and built from open source components.

It is designed to purify text corpora by keeping only the sentences written entirely in the corpus's predominant language.

For technical details, see Zhang, Clark & Wang (2014), "Unsupervised Language Filtering using the Latent Dirichlet Allocation", in Proc. Interspeech 2014, Singapore (included in the download).
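
To make the filtering idea concrete, here is a minimal Python sketch of the final selection step only. It is not LILDA's actual interface or implementation. It assumes that some LDA model has already produced a topic distribution for each sentence and that topics roughly correspond to languages. The function name `filter_dominant_language` and the example probabilities are hypothetical and chosen purely for illustration.

```python
from collections import Counter


def filter_dominant_language(sentences, topic_probs):
    """Keep sentences whose most probable topic is the corpus-wide dominant one.

    sentences   -- list of str
    topic_probs -- one topic distribution (list of floats) per sentence,
                   ASSUMED to come from an already trained LDA model
    Returns (dominant_topic_index, kept_sentences).
    """
    if len(sentences) != len(topic_probs):
        raise ValueError("need exactly one topic distribution per sentence")
    if not sentences:
        return None, []

    # Hard-assign each sentence to its most probable topic.
    assigned = [max(range(len(p)), key=p.__getitem__) for p in topic_probs]

    # Treat the topic assigned to the most sentences as the dominant language.
    dominant, _ = Counter(assigned).most_common(1)[0]

    kept = [s for s, t in zip(sentences, assigned) if t == dominant]
    return dominant, kept


# Hypothetical example: made-up posteriors, not output from a real model.
sentences = [
    "the cat sat on the mat",
    "bonjour tout le monde",
    "speech synthesis learns from data",
    "it improves by being used",
]
topic_probs = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.85, 0.15],
    [0.70, 0.30],
]

dominant, kept = filter_dominant_language(sentences, topic_probs)
print(dominant, kept)
```

A hard assignment per sentence keeps the sketch short. A real filter might instead apply a confidence threshold to the dominant topic's probability, so that mixed-language sentences are dropped as well.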

[Download]

The SIMPLE⁴ALL project created speech synthesis technology that learns from data with little or no expert supervision and continually improves itself, simply by being used.