The module performs automatic segmentation and clustering of an input audio stream according to speaker identity, using acoustic cues.
Multimedia document indexing and archiving services.
Speaker diarization is the process of partitioning an input audio stream into homogeneous segments according to their speaker identity. This partitioning is a useful preprocessing step for an automatic speech transcription system, but it can also improve the readability of the transcription by structuring the audio stream into speaker turns. One of the major issues is that the number of speakers in the audio stream is generally unknown a priori and needs to be automatically determined.
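As an illustration of how a diarization result can structure a stream into speaker turns, the following minimal sketch merges consecutive segments that carry the same speaker label. The `(start, end, speaker)` tuple layout, the function name `to_speaker_turns` and the `max_gap` parameter are assumptions made for this example; they do not describe the module's actual output format.

```python
def to_speaker_turns(segments, max_gap=0.0):
    """Merge consecutive same-speaker segments into speaker turns.

    segments: iterable of (start_sec, end_sec, speaker_label) tuples
              (hypothetical format, chosen for illustration).
    max_gap:  largest silence, in seconds, allowed inside a single turn.
    """
    turns = []
    for start, end, speaker in sorted(segments):
        if turns and turns[-1][2] == speaker and start - turns[-1][1] <= max_gap:
            # Same speaker continues: extend the current turn.
            prev_start, prev_end, _ = turns[-1]
            turns[-1] = (prev_start, max(prev_end, end), speaker)
        else:
            # Speaker change (or too long a gap): open a new turn.
            turns.append((start, end, speaker))
    return turns


example = [
    (0.0, 2.5, "spk1"),
    (2.5, 4.0, "spk1"),
    (4.0, 7.0, "spk2"),
    (7.0, 9.0, "spk1"),
]
turns = to_speaker_turns(example)
```

Here the two initial `spk1` segments collapse into one turn, while the later `spk1` segment stays separate because another speaker intervenes.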
Given voice samples from known speakers, speaker verification techniques can then be applied to produce clusters labeled with identified speakers.
The LIMSI multi-stage speaker diarization system combines an agglomerative clustering based on Bayesian information criterion (BIC) with a second clustering stage using speaker identification (SID) techniques with more complex models.
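The BIC-based agglomerative stage can be sketched as follows. This is a simplified illustration and not the LIMSI implementation: it assumes each cluster is modeled by a single Gaussian with diagonal covariance, uses a local BIC penalty, and leaves out the second SID-based stage entirely. The names `delta_bic`, `bic_cluster` and the penalty weight `lam` are hypothetical. Two clusters are merged while the BIC difference is negative, so the number of speakers is not fixed in advance but results from the stopping criterion.

```python
import math


def log_det_diag(frames):
    """Log-determinant of the diagonal covariance of a list of feature vectors."""
    n, d = len(frames), len(frames[0])
    total = 0.0
    for k in range(d):
        mean = sum(f[k] for f in frames) / n
        var = sum((f[k] - mean) ** 2 for f in frames) / n
        total += math.log(var + 1e-6)  # small floor avoids log(0)
    return total


def delta_bic(a, b, lam=1.0):
    """BIC difference between modeling a and b separately or as one cluster.

    A negative value favors merging (same speaker).
    Assumption: one diagonal-covariance Gaussian per cluster.
    """
    n_a, n_b = len(a), len(b)
    n = n_a + n_b
    d = len(a[0])
    penalty = 0.5 * (2 * d) * math.log(n)  # d means + d variances
    return (0.5 * n * log_det_diag(a + b)
            - 0.5 * n_a * log_det_diag(a)
            - 0.5 * n_b * log_det_diag(b)
            - lam * penalty)


def bic_cluster(segments, lam=1.0):
    """Greedy agglomerative clustering; returns lists of segment indices."""
    clusters = [list(s) for s in segments]       # copy, inputs stay untouched
    members = [[i] for i in range(len(segments))]
    while len(clusters) > 1:
        # Find the pair of clusters with the lowest BIC difference.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = delta_bic(clusters[i], clusters[j], lam)
                if best is None or score < best[0]:
                    best = (score, i, j)
        score, i, j = best
        if score >= 0:
            break  # no remaining pair looks like the same speaker
        clusters[i] = clusters[i] + clusters[j]
        members[i] = members[i] + members[j]
        del clusters[j], members[j]
    return members
```

Each element of `segments` is a list of feature vectors (for example cepstral frames) for one segment; the function returns groups of segment indices, one group per hypothesized speaker. Raising `lam` makes merging more likely and therefore yields fewer clusters.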
This system has participated in several acoustic speaker diarization evaluations: on US English broadcast news for the NIST Rich Transcription 2004 Fall evaluation (NIST RT’04F), and on French radio and TV broadcast news and conversations for the ESTER-1 and ESTER-2 evaluation campaigns, achieving state-of-the-art performance. Within the QUAERO program, LIMSI is developing improved speaker diarization and speaker tracking systems for broadcast news and for more interactive data such as talk shows.
It is a building block of the system presented by QUAERO partners to the REPERE challenge on multimodal person identification.
A standard PC running the Linux operating system.
The technology developed at LIMSI-CNRS is available for licensing on a case-by-case basis.