2012 ALBAYZIN EVALUATIONS: AUDIO SEGMENTATION


Audio segmentation


In some applications of speech technology, such as Automatic Speech Recognition (ASR) for broadcast shows or Spoken Document Retrieval (SDR) in very large multimedia repositories, audio segmentation is a very important task. Speech usually occurs alongside music or environmental noise, and the presence of each of these acoustic classes must be labeled accurately, because the success of the subsequent systems depends on these labels. Developing accurate audio segmentation systems is therefore essential for applications like ASR or SDR to perform adequately in real-world environments.


Evaluation Description

The proposed evaluation consists of segmenting a broadcast audio document and assigning a label to each segment that indicates the presence of speech, music and/or noise. Two or more classes can occur simultaneously in an audio segment, so the goal is to indicate, for any given time instant, whether one, two or all three of these classes are present. For example, music can overlap with speech, or noise can be present in the background while someone is speaking.

In this evaluation, Speech is considered present whenever a person is speaking, except when that speech is in the background. Music is understood in a general sense. Noise is considered present whenever there is acoustic content other than speech or music; this includes speech in the background.
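As an illustration of the multi-label annotation described above, the following minimal Python sketch merges one activity track per class into a single segmentation in which each segment carries the set of classes present. The function name `merge_class_tracks`, the `(start, end)` interval format in seconds and the example times are assumptions made for this illustration; they are not part of the evaluation's file formats.

```python
from typing import Dict, FrozenSet, List, Tuple

Interval = Tuple[float, float]                   # (start, end) in seconds
Segment = Tuple[float, float, FrozenSet[str]]    # (start, end, classes present)


def merge_class_tracks(tracks: Dict[str, List[Interval]]) -> List[Segment]:
    """Combine per-class activity intervals into multi-label segments."""
    # Every interval boundary is a point where the set of active classes may change.
    bounds = sorted({t for ivs in tracks.values() for iv in ivs for t in iv})
    merged: List[Segment] = []
    for t0, t1 in zip(bounds, bounds[1:]):
        # A class is active on [t0, t1) if one of its intervals covers it fully.
        active = frozenset(
            name for name, ivs in tracks.items()
            if any(s <= t0 and t1 <= e for s, e in ivs)
        )
        if merged and merged[-1][2] == active and merged[-1][1] == t0:
            # Same label set as the previous segment: extend it.
            merged[-1] = (merged[-1][0], t1, active)
        else:
            merged.append((t0, t1, active))
    return merged


# Hypothetical example: speech from 0 s to 20 s, music from 10 s to 30 s, no noise.
tracks = {"speech": [(0.0, 20.0)], "music": [(10.0, 30.0)], "noise": []}
for start, end, classes in merge_class_tracks(tracks):
    print(start, end, sorted(classes))
# 0.0 10.0 ['speech']
# 10.0 20.0 ['music', 'speech']
# 20.0 30.0 ['music']
```

Stretches between the first and last boundary where no class is active are kept as segments with an empty label set in this sketch.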

Data

The Catalan broadcast news database from the 3/24 TV channel proposed for the 2010 Albayzin Audio Segmentation Evaluation will be used for training segmentation systems.

Around four hours of the Aragón Radio database will be used for development and another sixteen hours of the Aragón Radio database will be used for testing.

Metrics

As in the NIST RT Diarization evaluations, system performance will be measured with the segmentation error score (SER) [1], computed as the fraction of class time that is not correctly attributed to that specific class (speech, noise or music). The score will be computed over the entire file to be processed, including regions where more than one class is present (overlap regions). It is the ratio of the overall segmentation error time to the sum of the durations of the segments that are assigned to each class in the file. The segmentation error time includes the time that is assigned to the wrong class, missed class time and false alarm class time. Note that in overlap regions the duration of the segment is attributed to every reference class present in that segment, so this time is counted more than once.
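The score can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the official scoring tool: it applies no forgiveness collar around reference boundaries, it assumes segments given as `(start, end, classes)` tuples in seconds, and it counts the error in each elementary interval in the diarization-error style, as the interval duration times max(N_ref, N_hyp) minus the number of correctly detected classes. The names `classes_at` and `segmentation_error` are hypothetical.

```python
from typing import List, Set, Tuple

Segment = Tuple[float, float, Set[str]]  # (start, end, classes present)


def classes_at(segments: List[Segment], t0: float, t1: float) -> Set[str]:
    """Union of the classes active over the elementary interval [t0, t1)."""
    active: Set[str] = set()
    for start, end, classes in segments:
        if start < t1 and end > t0:
            active |= classes
    return active


def segmentation_error(reference: List[Segment], hypothesis: List[Segment]) -> float:
    """Ratio of segmentation error time to total reference class time."""
    # Split the file at every reference and hypothesis boundary.
    bounds = sorted({t for s, e, _ in reference + hypothesis for t in (s, e)})
    error_time = 0.0
    reference_time = 0.0
    for t0, t1 in zip(bounds, bounds[1:]):
        dur = t1 - t0
        ref = classes_at(reference, t0, t1)
        hyp = classes_at(hypothesis, t0, t1)
        correct = len(ref & hyp)
        # Covers missed classes, false alarms and wrong-class time in one term.
        error_time += dur * (max(len(ref), len(hyp)) - correct)
        # Overlap regions count once per reference class present.
        reference_time += dur * len(ref)
    if reference_time == 0.0:
        raise ValueError("reference contains no class time")
    return error_time / reference_time


# Hypothetical example (times in seconds).
reference = [(0.0, 10.0, {"speech"}), (10.0, 20.0, {"speech", "music"}), (20.0, 30.0, {"noise"})]
hypothesis = [(0.0, 12.0, {"speech"}), (12.0, 20.0, {"speech", "music"}), (20.0, 30.0, {"music"})]
print(segmentation_error(reference, hypothesis))  # 0.3
```

In this example the system misses music for 2 s and labels 10 s of noise as music, giving 12 s of error time. The reference class time is 10 + 2 × 10 + 10 = 40 s, because the 10 s speech-plus-music overlap is counted once per class, so the score is 12 / 40 = 0.3.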

2012 Albayzin Evaluation Plan


Registration:

All research groups interested in participating in this evaluation must send an e-mail to

[e-mail addresses not available]

with CC to the Chairs of the Albayzin 2012 Evaluations:

[e-mail addresses not available]

indicating the following information:

RESEARCH GROUP:
INSTITUTION:
CONTACT PERSON:
E-MAIL:

before July 15, 2012





[1] The Segmentation Error Rate will be computed in the same way as the Diarization Error Rate in the NIST RT Evaluations, specifically as described in the evaluation plan of the 2009 NIST RT Diarization Evaluation: http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf
