JISE




Journal of Information Science and Engineering, Vol. 40 No. 2, pp. 303-316


Speaker Verification System Based on Time Delay Neural Network with Pre-activated CNN Stem and Deep Layer Aggregation


WEI-TING LIN, TING-WEI CHEN AND CHIA-PING CHEN+
Department of Computer Science and Engineering
National Sun Yat-sen University
Kaohsiung, 804 Taiwan
E-mail: {m093040020; m103040017}@student.nsysu.edu.tw; cpchen@mail.cse.nsysu.edu.tw


In this paper, we improve the state-of-the-art ECAPA-TDNN model for speaker verification with a CNN stem, a self-calibration (SC) block, and deep layer aggregation. The proposed architecture is called Emphasized Channel Attention Propagation and Deep Layer Aggregation with Pre-activated CNN Stem in Time Delay Neural Network, abbreviated as ECAPDLA CNNv2-TDNN. First, we add a pre-activated convolutional stem layer in front of the main ECAPA-TDNN architecture, so that the input to the main model is a stable feature representation. Next, we change the multi-layer aggregation of ECAPA-TDNN to deep layer aggregation and replace the SE-Res2block in ECAPA-TDNN with an SC block. The proposed implementation thus enhances feature extraction on multiple time scales and spectral channels and improves the overall training efficacy. On the VoxCeleb1-O dataset, the proposed model achieves an equal error rate (EER) of 0.95%, a relative reduction of about 23% from the EER of 1.23% achieved by the ECAPA-TDNN baseline.
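
For readers unfamiliar with the reported metric, the following is a minimal Python sketch of how an equal error rate and a relative EER reduction can be computed from verification scores. It is an illustration only, not the authors' evaluation code: the function names `equal_error_rate` and `relative_reduction` are hypothetical, the threshold sweep is a simple approximation (standard toolkits typically interpolate along the ROC curve), and a real evaluation would use the scores of the VoxCeleb1-O trial list.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    target_scores: scores of same-speaker trials (higher means more similar).
    nontarget_scores: scores of different-speaker trials.
    Returns (eer, threshold) at the point where the false-accept rate
    and the false-reject rate are closest to each other.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # A trial is accepted when its score is >= t.
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Report the midpoint of the two rates as the EER estimate.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


def relative_reduction(baseline, proposed):
    """Relative reduction of an error rate with respect to a baseline."""
    return (baseline - proposed) / baseline


# EER values quoted in the abstract, in percent.
print(f"relative EER reduction: {relative_reduction(1.23, 0.95):.1%}")
```

The last line applies `relative_reduction` to the two EER values quoted in the abstract; `equal_error_rate` is meant to be called with lists of trial scores produced by a verification model.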


Keywords: speaker verification, time delay neural network, CNN stem, self-calibration block, deep layer aggregation
