The Second DIHARD Diarization Challenge: Dataset, task, and baselines

This paper introduces the second DIHARD challenge, the second in a series of speaker diarization challenges intended to improve the robustness of diarization systems to variation in recording equipment, noise conditions, and conversational domain. The challenge comprises four tracks evaluating diarization performance under two input conditions (single channel vs. multi-channel) and two segmentation conditions (diarization from a reference speech segmentation vs. diarization from scratch). In order to prevent participants from overtuning to a particular combination of recording conditions and conversational domain, recordings are drawn from a variety of sources ranging from read audiobooks to meeting speech, to child language acquisition recordings, to dinner parties, to web video. We describe the task and metrics, challenge design, datasets, and baseline systems for speech enhancement, speech activity detection, and diarization.


Introduction
Speaker diarization, often referred to as "who spoke when", is the task of determining how many speakers are present in a conversation and correctly identifying all segments for each speaker. In addition to being an interesting technical challenge, it forms an important part of the pre-processing pipeline for speech-to-text and is essential for making objective measurements of turn-taking behavior. Early work in this area was driven by the NIST Rich Transcription (RT) evaluations [1], which ran between 2002 and 2009. In addition to driving substantial performance improvements, especially for meeting speech, the RT evaluations introduced the diarization error rate (DER) metric, which remains the principal evaluation metric in this area. Since the RT evaluation series ended in 2009, diarization performance has continued to improve, though the lack of a common task has resulted in fragmentation with individual research groups focusing on different datasets or domains (e.g., conversational telephone speech [2,3,4,5,6], broadcast [7,8], or meeting [9,10]). At best, this has made comparing performance difficult, while at worst it may have engendered overfitting to individual domains/datasets resulting in systems that do not generalize. Moreover, the majority of this work has evaluated systems using a modified version of DER in which speech within 250 ms of reference boundaries and overlapped speech are excluded from scoring. As short segments such as backchannels and overlapping speech are both common in conversation, this may have resulted in an over-optimistic assessment of performance even within these domains 1 [11].
It is against this backdrop that the JSALT-2017 workshop [12] and DIHARD challenges 2 emerged. The DIHARD series of challenges introduces a new common task for diarization that is intended both to facilitate comparison of current and future systems through standardized data, tasks, and metrics and to promote work on robust diarization systems; that is, systems that are able to accurately handle highly interactive and overlapping speech from a range of conversational domains, while being resilient to variation in recording equipment, recording environment, reverberation, ambient noise, number of speakers, and speaker demographics. As with the NIST RT evaluations, DER is adopted as the primary evaluation metric, but without use of collars or exclusion of overlapping speech. There are no constraints on training data, with participants allowed to use any combination of public/proprietary data for system development.
The initial DIHARD challenge (DIHARD I) [13] ran during the spring of 2018 and attracted registrations from 20 teams, of which 13 submitted systems. As expected, state-of-the-art systems performed poorly, with final DER on the evaluation set for the top systems ranging from 23.73% [14] when provided with reference speech activity detection (SAD) marks to 35.51% [15] when forced to perform diarization from scratch. These error rates are more than double the state-of-the-art for CALLHOME [16] at the time [4,5]. For some domains, error rates for the best systems exceeded 49% when using reference SAD and 75% when performing diarization from scratch! The second DIHARD Challenge (DIHARD II) [17], like its predecessor, examines diarization system performance under two SAD conditions: diarization from a supplied reference SAD and diarization from scratch. As with DIHARD I, it includes a single channel input condition utilizing wideband speech sampled from 11 demanding domains, ranging from clean, nearfield recordings of read audiobooks to extremely noisy, highly interactive, farfield recordings of speech in restaurants to child language data recorded in the home using LENA vests. Unlike DIHARD I, it additionally offers a multichannel input condition requiring participants to perform diarization from farfield microphone arrays of dinner party speech drawn from the CHiME-5 corpus [18]. For the first time, we also provide participants with baseline systems for speech enhancement, SAD, and diarization, as well as results obtained with these systems for all tracks.

Tracks
The challenge features two audio input conditions:
• Single channel - Systems are provided with a single channel of audio for each recording. Depending on the recording source, this channel may be taken from a single distant microphone, a single channel from a distant microphone array, a mix of head-mounted or array microphones, or a mix of binaural microphones.
• Multichannel - Each recording session contains output from one or more distant microphone arrays, each containing multiple channels. Participants are instructed to treat the arrays separately, producing one output per array. They are free to use as few or as many of the channels on each array as they wish to perform diarization.
As system performance is strongly tied to the quality of the SAD component, we also include two SAD conditions:
• Reference SAD - Systems are provided with a reference speech segmentation that is generated by merging speaker turns in the reference diarization.
• System SAD - Systems are provided with just the raw audio input for each recording session and are responsible for producing their own speech segmentation.
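As an illustration of how a reference speech segmentation can be derived from a reference diarization, the following minimal sketch merges overlapping or abutting speaker turns into speech segments. The function name and the example turns are hypothetical, and the exact merging rules used to produce the official reference SAD are not specified here.

```python
def merge_turns(turns):
    """Merge (onset, offset) speaker turns into non-overlapping speech segments."""
    merged = []
    for onset, offset in sorted(turns):
        if merged and onset <= merged[-1][1]:
            # Turn overlaps or abuts the previous segment: extend that segment.
            merged[-1][1] = max(merged[-1][1], offset)
        else:
            merged.append([onset, offset])
    return [tuple(segment) for segment in merged]


# Hypothetical turns (in seconds) pooled over all speakers in one recording.
turns = [(0.0, 2.5), (2.0, 4.0), (6.0, 7.5), (7.5, 8.0), (10.0, 11.0)]
speech = merge_turns(turns)
print(speech)  # [(0.0, 4.0), (6.0, 8.0), (10.0, 11.0)]
```

Speaker identity is discarded in the process, so the result marks only where speech occurs, not who produced it.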
Together, this yields the following four evaluation tracks:
• Track 1 - single channel audio using reference SAD
• Track 2 - single channel audio using system SAD
• Track 3 - multichannel audio using reference SAD
• Track 4 - multichannel audio using system SAD
All teams are required to register for at least one of track 1 or track 3.

Performance Metrics
As in DIHARD I, the primary metric is DER [1], which is the sum of missed speech, false alarm speech, and speaker misclassification error rates. Because systems are provided with the reference speech segmentation for tracks 1 and 3, DER for these tracks exclusively measures speaker misclassification error. DER is the metric used to rank systems on the leaderboard.
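The arithmetic behind DER can be sketched as follows, assuming the usual convention that each error component is a duration normalized by the total reference speaker time. The function name and the example durations are hypothetical; this is not the scoring tool's implementation.

```python
def diarization_error_rate(missed, false_alarm, speaker_error, total_reference):
    """DER in percent, from error durations and total reference speaker time (seconds)."""
    return 100.0 * (missed + false_alarm + speaker_error) / total_reference


# Hypothetical durations, in seconds, for one evaluation set.
der = diarization_error_rate(missed=30.0, false_alarm=20.0,
                             speaker_error=50.0, total_reference=1000.0)
print(f"DER = {der:.2f}%")  # DER = 10.00%
```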
For each system we also compute a secondary metric, Jaccard error rate (JER), which is newly developed for DIHARD II. JER is based on the Jaccard similarity index [19,20], a metric commonly used to evaluate the output of image segmentation systems, which is defined as the ratio between the sizes of the intersections and unions of two sets of segments. An optimal mapping between speakers in the reference diarization and speakers in the system diarization is determined and for each pair the Jaccard index of their segmentations is computed. JER is defined as 1 minus the average of these scores, expressed as a percentage. That is, it is the mean of Eq. 1 across all reference speakers ref:

$$\mathrm{JER}_{\mathrm{ref}} = \frac{\mathrm{FA} + \mathrm{MISS}}{\mathrm{TOTAL}} \qquad (1)$$

where TOTAL is the duration of the union of reference and system speaker segments, FA is the total system speaker time not attributed to the reference speaker, and MISS is the total reference speaker time not attributed to the system speaker. It ranges from 0% in the case where each reference speaker is paired with a system speaker with an identical segmentation to 100% in the case where none of the system speakers overlap any of the reference speakers.
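The following minimal sketch illustrates the JER computation described above. It assumes that each speaker's segmentation is represented as a set of frame indices, and it finds the optimal mapping by brute force over permutations, which is only practical for a handful of speakers. All names are hypothetical, and the scoring tool's actual implementation may differ.

```python
from itertools import permutations


def speaker_error(ref_frames, hyp_frames):
    """(FA + MISS) / TOTAL for one reference speaker; 1.0 if left unpaired."""
    if hyp_frames is None:
        return 1.0
    total = len(ref_frames | hyp_frames)  # union of the two segmentations
    fa = len(hyp_frames - ref_frames)     # system time not in the reference
    miss = len(ref_frames - hyp_frames)   # reference time not in the system
    return (fa + miss) / total


def jaccard_error_rate(ref, hyp):
    """JER in percent; ref and hyp map speaker ids to non-empty sets of frames."""
    ref_ids = list(ref)
    # Padding with None lets any reference speaker remain unpaired.
    candidates = list(hyp) + [None] * len(ref_ids)
    best = min(
        sum(speaker_error(ref[r], hyp.get(h)) for r, h in zip(ref_ids, mapping))
        for mapping in permutations(candidates, len(ref_ids))
    )
    return 100.0 * best / len(ref_ids)


# Hypothetical example with 10 ms frames: the system boundary is 10 frames early.
ref = {"A": set(range(0, 100)), "B": set(range(100, 200))}
hyp = {"X": set(range(0, 90)), "Y": set(range(90, 200))}
print(f"JER = {jaccard_error_rate(ref, hyp):.2f}%")
```

Because every reference speaker contributes equally to the mean, errors on speakers with little speech weigh more heavily in JER than they do in DER.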
All metrics are computed using version 1.0.1 of the dscore tool 3 without the use of forgiveness collars and with scoring of overlapped speech.

Overview
The DIHARD II development and evaluation sets draw from a diverse set of sources exhibiting wide variation in recording equipment, recording environment, ambient noise, number of speakers, and speaker demographics. The single channel input condition (tracks 1 and 2) dataset is a superset of that used in DIHARD I, though 6 hours of additional material have been added to ensure that all domains are represented in both the development and evaluation set. Additionally, two domains where the DIHARD I annotation was deemed suspect (child language and web video) have been entirely resegmented. For the multichannel input condition (tracks 3 and 4) we use the multi-party dinner recordings originally collected for and exposed during the CHiME-5 challenge [18]. The development and evaluation sets are summarized in Table 1.
The development set includes reference diarization and speech segmentation and may be used for any purpose including system development or training. As with DIHARD I, there is no training set, with participants free to train their systems on any proprietary and/or public data. Both the development and evaluation sets will be submitted for publication via LDC at the end of the evaluation.

Single channel data (tracks 1 and 2)
The single channel input condition development and evaluation sets consist of selections of 5-10 minute duration samples drawn from 11 conversational domains, each including approximately 2 hours of audio. The full set of domains is described below with LDC Catalog numbers where appropriate. Unless otherwise specified, all speech is English, though not necessarily by native or even fluent speakers. All audio is distributed via LDC as 16 kHz, monochannel FLAC files.
• audiobooks - amateur recordings of public domain English works drawn from LibriVox; care was taken to avoid overlap with LibriSpeech [21] (unpublished)
• broadcast interview - student-produced interviews with newsmakers of the day taken from a late 1970s college

Multichannel data (tracks 3 and 4)
The multichannel input condition development and evaluation sets are drawn from the CHiME-5 dinner party corpus [18], a corpus of conversational speech collected during dinner parties held in real homes. The development set combines the CHiME-5 training and development sets and encompasses 45 hours of dinner parties from 18 homes. The evaluation set is identical to the CHiME-5 evaluation set and consists of 5 hours of dinner parties from 2 homes. Each party was recorded using 6 Microsoft Kinect devices (4 channel linear arrays) distributed throughout the home in such a way that the conversation was always present on each array. Due to a combination of clock drift and random frame dropping, the Kinects within each recording session exhibit massive desynchronization, both with each other and with the binaural recording devices worn by participants. For this reason, each Kinect device is treated separately with the resulting development and evaluation sets having durations of 262.4 hours and 31.2 hours respectively. All audio is distributed via the University of Sheffield as 16 kHz WAV files.

Processing
A limited number of recordings contained regions carrying personal identifying information (PII), which were removed prior to publication. For the clinical and restaurant domains, this was done at LDC by low-pass filtering using a 10th-order Butterworth filter with a passband of 0 to 400 Hz. To avoid abrupt transitions in the resulting waveform, the effect of the filter was gradually faded in and out at the beginning and end of the regions using a ramp of 40 ms. In the case of the sociolinguistic field recordings domain and the CHiME-5 data, PII was removed by the original creators of the corpora. In the former case, PII was replaced by tones of matched duration, while in the latter case it was zeroed out. PII-containing regions are ignored during scoring.
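The fade described above can be sketched as a crossfade between the original and the filtered signal. This sketch assumes a linear ramp placed just outside the PII region (the text specifies only the ramp length) and takes the low-pass filtered signal as a precomputed input. All function and parameter names are hypothetical.

```python
def crossfade_weights(n_samples, start, end, ramp):
    """Per-sample weight of the filtered signal: 1 inside the region, linear ramps outside."""
    weights = []
    for i in range(n_samples):
        rise = (i - (start - ramp)) / ramp  # reaches 1 at the region start
        fall = ((end + ramp) - i) / ramp    # falls to 0 one ramp after the region end
        weights.append(max(0.0, min(1.0, rise, fall)))
    return weights


def apply_masked_filter(original, filtered, start, end, ramp):
    """Blend the original and filtered signals using the crossfade weights."""
    weights = crossfade_weights(len(original), start, end, ramp)
    return [(1.0 - w) * x + w * y for x, y, w in zip(original, filtered, weights)]


# Toy example: at 16 kHz a 40 ms ramp would be 640 samples; 2 samples are used here.
original = [1.0] * 10
filtered = [0.0] * 10  # stand-in for the low-pass filtered audio
print(apply_masked_filter(original, filtered, start=4, end=6, ramp=2))
```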

Annotation
Reference segmentation and speaker labeling were produced by annotators at LDC using a tool equipped with playback, waveform and spectrogram display. Annotators were instructed to split on pauses > 200 ms, where a pause was defined as any stretch of time during which the speaker was not producing vocalization (e.g., backchannels, filled pauses, singing, speech errors and disfluencies, infant babbling or vocalizations, laughter, coughs, breaths, lipsmacks, and humming) of any kind. Boundaries were placed within 10 ms of the true boundary, taking care not to truncate sounds at edges of words (e.g., utterance-final fricatives). Where individual close-talking microphones were available for speakers, annotation was performed separately for each speaker using their individual microphone. Due to time constraints, this manual segmentation process could not be implemented for the multichannel development data; for this data, segmentation was taken from the turn boundaries established during the original CHiME-5 transcription. An additional post-processing step was necessary for the CHiME-5 annotation to correct for the lack of synchronization between binaural recording devices and Kinects. For each Kinect, the lag between that array and the binaural recording devices was estimated at regular intervals using normalized cross-correlation. The speech boundaries established by annotation on the binaural devices were then corrected for each Kinect using these estimated lags.
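A minimal sketch of lag estimation by normalized cross-correlation, followed by the corresponding boundary correction, is given below. It assumes a single short analysis window, integer sample lags, and the sign convention that a positive lag means the second signal is delayed relative to the first. The function names, search range, and toy signals are hypothetical and do not reflect the actual processing scripts.

```python
import math


def estimate_lag(ref, sig, max_lag):
    """Return (lag, score): the lag in samples maximizing normalized cross-correlation."""
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Pair ref[i] with sig[i + lag] wherever both indices are valid.
        pairs = [(ref[i], sig[i + lag]) for i in range(len(ref))
                 if 0 <= i + lag < len(sig)]
        num = sum(a * b for a, b in pairs)
        den = math.sqrt(sum(a * a for a, _ in pairs) * sum(b * b for _, b in pairs))
        score = num / den if den > 0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


def shift_boundaries(segments, lag_samples, sample_rate=16000):
    """Shift (onset, offset) boundaries in seconds by the estimated lag."""
    offset = lag_samples / sample_rate
    return [(on + offset, off + offset) for on, off in segments]


# Toy signals: `array` is `binaural` delayed by 3 samples.
binaural = [0, 0, 1, 3, -2, 1, 0, 0, 0, 0, 0, 0]
array = [0, 0, 0] + binaural[:-3]
lag, score = estimate_lag(binaural, array, max_lag=5)
print(lag, round(score, 3))  # 3 1.0
```

Repeating the estimate at regular intervals, as described above, would yield a time-varying lag, which is needed when clock drift and dropped frames make the offset change over a session.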

Speech enhancement
For speech enhancement we use a densely-connected LSTM architecture [24,25,26] trained to predict the ideal ratio masks (IRM) [27] of speech from log-power spectra (LPS) features. The model is trained via progressive multi-target learning [24,28] using 400 hours of noisy speech produced by corrupting clean utterances from WSJ0 [29] and a 50 hour Chinese speech corpus from the 863 Program [30]. Utterances were corrupted using 115 noise types [24] at 3 SNR levels (-5dB, 0dB, and 5dB). The trained models, as well as scripts for applying them, are distributed through GitHub 4 .
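To illustrate what corrupting an utterance at a given SNR involves, the following minimal sketch scales a noise signal so that the clean-to-noise power ratio matches a target value before adding it. The function names and the synthetic signals are hypothetical; the actual data simulation pipeline is not described here and may differ (for example in how noise segments are selected).

```python
import math


def mean_power(signal):
    """Average power of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Add noise to clean speech, scaled so the mixture has the requested SNR in dB."""
    target_ratio = 10.0 ** (snr_db / 10.0)  # desired clean power / noise power
    scale = math.sqrt(mean_power(clean) / (mean_power(noise) * target_ratio))
    return [c + scale * n for c, n in zip(clean, noise)]


# Synthetic stand-ins for a clean utterance and a noise recording of equal length.
clean = [math.sin(0.1 * i) for i in range(1000)]
noise = [math.cos(1.7 * i) for i in range(1000)]
for snr_db in (-5, 0, 5):  # the three SNR levels mentioned above
    noisy = mix_at_snr(clean, noise, snr_db)
    residual = [m - c for m, c in zip(noisy, clean)]
    measured = 10.0 * math.log10(mean_power(clean) / mean_power(residual))
    print(snr_db, round(measured, 3))
```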

Beamforming
For the multichannel tracks, we use weighted delay-and-sum beamforming as implemented in BeamformIt [31]. Beamforming is applied independently for each Kinect in each session using all four channels following the CHiME-5 recipe [18].
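The core idea of weighted delay-and-sum beamforming can be sketched as follows. The sketch assumes that integer per-channel delays and channel weights are already known. BeamformIt estimates these quantities from the audio itself, which is not reproduced here. The function name and the toy signals are hypothetical.

```python
def delay_and_sum(channels, delays, weights=None):
    """Align each channel by its delay (in samples) and return the weighted sum."""
    n_samples = len(channels[0])
    if weights is None:
        weights = [1.0 / len(channels)] * len(channels)  # plain average
    output = [0.0] * n_samples
    for channel, delay, weight in zip(channels, delays, weights):
        for t in range(n_samples):
            source = t + delay  # a channel lagging by `delay` is read ahead
            if 0 <= source < n_samples:
                output[t] += weight * channel[source]
    return output


# Toy example: the same pulse reaches three microphones 0, 1 and 2 samples late.
pulse = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 0.0]
channels = [pulse, [0.0] + pulse[:-1], [0.0, 0.0] + pulse[:-2]]
aligned = delay_and_sum(channels, delays=[0, 1, 2])
print([round(x, 6) for x in aligned])
```

After alignment the speech adds coherently across channels, whereas uncorrelated noise does not, which is the source of the enhancement.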

Speech activity detection
The baselines for tracks 2 and 4 use WebRTC's 5 SAD as implemented in the py-webrtcvad Python package 6 . Scripts for performing SAD using the same settings used to obtain the baseline results are distributed through GitHub 4 .

Diarization
The diarization baseline is based on the previously published Kaldi [32] recipe 7 for JHU's submission to DIHARD I [14]. At a high level, the system performs diarization by dividing each recording into short overlapping segments, extracting x-vectors [33,34], scoring with probabilistic linear discriminant analysis (PLDA) [35], and clustering using agglomerative hierarchical clustering (AHC) [36]. In contrast to the original JHU system, we omit the Variational Bayes resegmentation step [37]. The trained models are distributed through GitHub 8 .
The x-vector extractor configuration is identical to that used in previous speaker recognition and diarization systems [34,14] with two exceptions: i) 30-dimensional mel frequency cepstral coefficient (MFCC) features are used instead of mel filterbank features; ii) the embedding layer uses 512 dimensions. MFCCs are extracted every 10 ms using a 25 ms window and mean-normalized using a 3 second sliding window. For training we use a combination of VoxCeleb 1 and VoxCeleb 2 [38,39] augmented with additive noise and reverberation according to the recipe from [33]. Segments under 4 seconds duration are discarded, resulting in a training set with 7,323 speakers. Reverberation is added by convolution with room responses from the RIR dataset [40], while additive noises are drawn from the MUSAN dataset [41]. At test time, x-vectors are extracted from 1.5 second segments with 0.75 second overlap.
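The test-time segmentation can be illustrated with a minimal sketch that cuts a speech segment into 1.5 second windows with a 0.75 second shift. Truncating the final window at the segment end is an assumption of this sketch; the Kaldi recipe may treat short trailing windows differently. The function name is hypothetical.

```python
def sliding_windows(start, end, window=1.5, shift=0.75):
    """Split the speech segment [start, end) into overlapping windows (seconds)."""
    windows = []
    k = 0
    while True:
        onset = start + k * shift
        offset = min(onset + window, end)  # last window is truncated at the end
        windows.append((onset, offset))
        if offset >= end:
            break
        k += 1
    return windows


print(sliding_windows(0.0, 4.0))
# [(0.0, 1.5), (0.75, 2.25), (1.5, 3.0), (2.25, 3.75), (3.0, 4.0)]
```

One x-vector would then be extracted per window, so that each stretch of speech away from the segment edges is covered by two embeddings.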
Following extraction, x-vectors are pre-processed to perform domain adaptation to the DIHARD II dataset. This is done by normalizing with a global mean and whitening transform learned from the DIHARD II development set. The whitened x-vectors are then length normalized [42] and used to train a Gaussian PLDA model [35] using a subset of VoxCeleb consisting of segments of at least 3 seconds duration. Following PLDA scoring, clustering is performed using AHC with the threshold set by minimizing DER on the development data.
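The clustering stage can be sketched as follows. To stay self-contained, this sketch substitutes cosine similarity between length-normalized vectors for PLDA scoring and uses average linkage; both choices are assumptions made for illustration, as are the function names, the toy two-dimensional "embeddings", and the threshold value.

```python
import math


def length_normalize(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def ahc(vectors, threshold):
    """Average-linkage AHC; stop when the best merge score falls below threshold."""
    vecs = [length_normalize(v) for v in vectors]

    def similarity(i, j):
        # Cosine similarity stands in for the PLDA score in this sketch.
        return sum(x * y for x, y in zip(vecs[i], vecs[j]))

    clusters = [[i] for i in range(len(vecs))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(similarity(i, j) for i in clusters[a] for j in clusters[b])
                score = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or score > best[0]:
                    best = (score, a, b)
        if best[0] < threshold:
            break
        _, a, b = best
        clusters[a] += clusters.pop(b)  # merge the most similar pair
    labels = [0] * len(vecs)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Toy embeddings: segments 0, 1 and 4 point one way, segments 2 and 3 another.
embeddings = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.8], [1.0, -0.1]]
print(ahc(embeddings, threshold=0.5))  # [0, 0, 1, 1, 0]
```

The stopping threshold determines the number of speakers, which is why it is tuned on the development data: a lower threshold permits more merges and yields fewer clusters.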

Baseline results
DER and JER of the baseline system on both the development and evaluation sets for each track are presented in Table 2. The speech enhancement module is used only for tracks 2 and 4 as a pre-processing front-end for the SAD pipeline as the diarization system did not show improvements using the enhanced audio. The scores obtained by the challenge baseline are quite high, with track 1 DER roughly in line with the performance of the best DIHARD I systems [14,15,25] and track 2 DER 5% higher than for DIHARD I (15% without enhancement), which we suspect reflects a combination of superior SAD components in those systems and the more careful segmentation for the child language and web video domains in DIHARD II. Error rates are noticeably higher for tracks 3 and 4, reaching 50.85% and 77.34% respectively, though, again, these rates are roughly in line with those observed for the best DIHARD I systems on the two most difficult domains in that challenge: restaurant and child language.

Conclusion
The field of speaker diarization has changed drastically in the two short years we have been running this challenge. In the lead up to DIHARD I, the research community was fragmented and most research concentrated on relatively easy datasets using forgiving evaluation metrics. This both made comparison of systems difficult and led some to believe that diarization was relatively solved and uninteresting. However, we were pleased by the response to DIHARD I, both during the evaluation and after, demonstrating that there is interest in robust diarization. This renewed energy is on display in DIHARD II, which attracted 48 registered teams from 17 countries, more than doubling the number of teams registered for DIHARD I. It is also evident in the recent announcement of the Fearless Steps challenge, which includes diarization among its tasks. We hope that this year's contributions lead to marked progress toward the goal of truly robust diarization.

Acknowledgements
We would like to thank Harshah Vardhan MA, Prachi Singh, and Lei Sun for their help in preparing the baseline systems and results. We would also like to acknowledge the generous support of Agence Nationale de la Recherche (ANR-16-DATA-0004 ACLEW, ANR-14-CE30-0003 MechELex, ANR-17-EURE-0017), the J. S. McDonnell Foundation, and the Linguistic Data Consortium as well as the CHiME-5 challenge for allowing us use of their data.