LILDA

LILDA is a language identification tool based on Latent Dirichlet Allocation (LDA) and built from open source components.

It is designed to purify text corpora by selecting only the sentences written entirely in the predominant language of the corpus.
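The filtering idea can be illustrated with a small, self-contained sketch. This is not LILDA's code and does not use its components: the function names (char_ngrams, lda_gibbs, filter_dominant), the use of character trigrams as "words", the collapsed Gibbs sampler, and all parameter values are assumptions chosen for illustration only. Each sentence is treated as a document, each LDA topic is treated as a candidate language, and only the sentences assigned to the most common topic are kept.

```python
import random
from collections import Counter


def char_ngrams(sentence, n=3):
    """Split a sentence into overlapping character n-grams (assumed features)."""
    text = " " + sentence.lower() + " "
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def lda_gibbs(docs, num_topics=2, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; returns per-document topic counts."""
    rng = random.Random(seed)
    vocab_size = len({tok for doc in docs for tok in doc})
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for tok in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][tok] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, tok in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][tok] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][tok] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][tok] += 1
                topic_total[z] += 1
    return doc_topic


def filter_dominant(sentences, num_topics=2, seed=0):
    """Keep the sentences whose strongest topic is the corpus-wide most common one."""
    if not sentences:
        return []
    docs = [char_ngrams(s) for s in sentences]
    doc_topic = lda_gibbs(docs, num_topics=num_topics, seed=seed)
    labels = [max(range(num_topics), key=lambda k: counts[k]) for counts in doc_topic]
    dominant = Counter(labels).most_common(1)[0][0]
    return [s for s, lab in zip(sentences, labels) if lab == dominant]


if __name__ == "__main__":
    corpus = [
        "the weather is nice today",
        "this is the other sentence",
        "there is nothing in the house",
        "das wetter ist heute sehr schön",
    ]
    for kept in filter_dominant(corpus):
        print(kept)
```

On a corpus this small the sampler may not separate the languages cleanly, and the result depends on the seed. The sketch is only meant to show the shape of the approach: topics stand in for languages, and the filter keeps the majority topic.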

For technical details, see: Zhang, Clark & Wang (2014), Unsupervised Language Filtering using the Latent Dirichlet Allocation. In Proc. Interspeech 2014, Singapore (included in the download).

[Download]