Publications
Publications, grouped by year in reverse chronological order.
2025
- arXiv. LenslessMic: Audio Encryption and Authentication via Lensless Computational Imaging. Petr Grinberg, Eric Bezzam, Paolo Prandoni, and Martin Vetterli. arXiv preprint arXiv:2509.16418, 2025.
With society’s increasing reliance on digital data sharing, the protection of sensitive information has become critical. Encryption is one such privacy-preserving method; however, its realization in the audio domain predominantly relies on signal processing or software methods embedded into hardware. In this paper, we introduce LenslessMic, a hybrid optical hardware-based encryption method that uses a lensless camera as a physical layer of security applicable to multiple types of audio. We show that LenslessMic enables (1) robust authentication of audio recordings and (2) encryption strength that can rival the search space of 256-bit digital standards, while maintaining high-quality signals and minimal loss of content information. The approach is validated with a low-cost Raspberry Pi prototype and is open-sourced together with datasets to facilitate research in the area.
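The physical-key idea can be pictured with a toy simulation: the lensless camera multiplexes an audio-derived image through its point spread function (PSF), and only a party holding the PSF can invert the measurement. A minimal sketch follows, assuming a circular-convolution camera model and Wiener-style inversion; it illustrates the concept, not the actual LenslessMic optics or reconstruction.

```python
# Toy sketch: lensless imaging modeled as convolution with a PSF that acts as
# a physical key. Illustrative assumptions throughout; not the paper's pipeline.
import numpy as np

rng = np.random.default_rng(0)

def encrypt(audio_image, psf):
    """Simulate a lensless measurement: circular convolution with the PSF."""
    return np.real(np.fft.ifft2(np.fft.fft2(audio_image) * np.fft.fft2(psf)))

def decrypt(measurement, psf, eps=1e-3):
    """Wiener-style inversion; only feasible when the PSF (the key) is known."""
    H = np.fft.fft2(psf)
    wiener = np.conj(H) / (np.abs(H) ** 2 + eps)
    return np.real(np.fft.ifft2(np.fft.fft2(measurement) * wiener))

image = rng.standard_normal((64, 64))   # stand-in for an audio-derived image
psf = rng.standard_normal((64, 64))     # the optical "key"
recovered = decrypt(encrypt(image, psf), psf)
print(np.max(np.abs(recovered - image)))  # small error with the correct key
```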
@article{grinberg2025lenslessmic,
  title   = {LenslessMic: Audio Encryption and Authentication via Lensless Computational Imaging},
  author  = {Grinberg, Petr and Bezzam, Eric and Prandoni, Paolo and Vetterli, Martin},
  journal = {arXiv preprint arXiv:2509.16418},
  year    = {2025},
}

- Interspeech. A Data-Driven Diffusion-based Approach for Audio Deepfake Explanations. Petr Grinberg, Ankur Kumar, Surya Koppisetti, and Gaurav Bharaj. In Interspeech 2025, 2025.
Evaluating explainability techniques, such as SHAP and LRP, in the context of audio deepfake detection is challenging due to the lack of clear ground-truth annotations. In the cases where we are able to obtain the ground truth, we find that these methods struggle to provide accurate explanations. In this work, we propose a novel data-driven approach to identify artifact regions in deepfake audio. We consider paired real and vocoded audio, and use the difference in their time-frequency representations as the ground-truth explanation. The difference signal then serves as supervision to train a diffusion model to expose the deepfake artifacts in a given vocoded audio. Experimental results on the VocV4 and LibriSeVoc datasets demonstrate that our method outperforms traditional explainability techniques, both qualitatively and quantitatively.
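The supervision target described here is simple to construct; below is a minimal sketch assuming paired, time-aligned waveforms of equal length (the STFT settings and function names are illustrative, not the paper's exact configuration).

```python
# Minimal sketch: ground-truth "artifact map" as the difference of
# log-magnitude spectrograms of paired real and vocoded recordings.
import numpy as np
from scipy.signal import stft

def artifact_map(real, vocoded, fs=16000, nperseg=512):
    """Assumes `real` and `vocoded` are time-aligned, equal-length waveforms."""
    _, _, Z_real = stft(real, fs=fs, nperseg=nperseg)
    _, _, Z_voc = stft(vocoded, fs=fs, nperseg=nperseg)
    diff = np.abs(np.log1p(np.abs(Z_voc)) - np.log1p(np.abs(Z_real)))
    return diff  # large values mark time-frequency regions with vocoder artifacts

# This map is the supervision the diffusion model is trained to reproduce for a
# given vocoded input; the diffusion training itself is beyond this sketch.
```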
@inproceedings{grinberg2025data,
  title     = {{A Data-Driven Diffusion-based Approach for Audio Deepfake Explanations}},
  author    = {Grinberg, Petr and Kumar, Ankur and Koppisetti, Surya and Bharaj, Gaurav},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {5348--5352},
  doi       = {10.21437/Interspeech.2025-2105},
  issn      = {2958-1796},
}

- ICASSP. What Does an Audio Deepfake Detector Focus on? A Study in the Time Domain. Petr Grinberg, Ankur Kumar, Surya Koppisetti, and Gaurav Bharaj. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
Adding explanations to audio deepfake detection (ADD) models will boost their real-world adoption by providing insight into the decision-making process. In this paper, we propose a relevancy-based explainable AI (XAI) method to analyze the predictions of transformer-based ADD models. We compare against standard Grad-CAM and SHAP-based methods, using quantitative faithfulness metrics as well as a partial spoof test, to comprehensively analyze the relative importance of different temporal regions in an audio signal. We consider large datasets, unlike previous works where only limited utterances are studied, and find that the XAI methods differ in their explanations. The proposed relevancy-based XAI method performs best overall on a variety of metrics. Further investigation of the relative importance of speech/non-speech, phonetic content, and voice onsets/offsets suggests that XAI results obtained from analyzing limited utterances do not necessarily hold when evaluated on large datasets.
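For intuition on relevancy-style aggregation over a transformer's layers, here is a sketch of attention rollout (Abnar & Zuidema, 2020), a simpler relative of the relevancy propagation used in the paper; it only illustrates how importance can be traced back to input time frames.

```python
# Sketch of attention rollout: propagate head-averaged attention maps through
# the layers to score input time frames. The paper's relevancy method is more
# involved; this shows only the general layer-to-layer aggregation idea.
import numpy as np

def attention_rollout(attn_per_layer):
    """attn_per_layer: list of head-averaged attention maps, each (T, T)."""
    T = attn_per_layer[0].shape[0]
    rollout = np.eye(T)
    for A in attn_per_layer:
        A_hat = A + np.eye(T)  # account for residual connections
        A_hat = A_hat / A_hat.sum(axis=-1, keepdims=True)
        rollout = A_hat @ rollout
    # Averaging over output positions yields one relevancy per input frame.
    return rollout.mean(axis=0)
```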
@inproceedings{grinberg2025does,
  title        = {{What Does an Audio Deepfake Detector Focus on? A Study in the Time Domain}},
  author       = {Grinberg, Petr and Kumar, Ankur and Koppisetti, Surya and Bharaj, Gaurav},
  booktitle    = {IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year         = {2025},
  organization = {IEEE},
}
2023
- IEEE Access. RawSpectrogram: On the Way to Effective Streaming Speech Anti-spoofing. Petr Grinberg and Vladislav Shikhov. IEEE Access, 2023.
Traditional anti-spoofing systems cannot be used straightforwardly with streaming audio because they are designed for finite utterances. Such offline models can be applied in streaming with the help of buffering; however, they are inefficient in terms of memory and computation. We propose a novel approach called RawSpectrogram that makes offline models streaming-friendly without a significant drop in quality. The method was tested on RawNet2 and AASIST, resulting in new architectures called RawRNN (RawLSTM and RawGRU), RS-AASIST, and TAASIST. The RawRNN-type models are much smaller and achieve a lower Equal Error Rate than their base architecture, RawNet2. RS-AASIST and TAASIST have fewer parameters than AASIST and achieve similar quality. We also validated our concept for models with time-frequency transform front-ends and for automatic speaker verification systems by proposing RECAPA-TDNN, based on ECAPA-TDNN. RS-AASIST and RECAPA-TDNN were combined into the first streaming-friendly spoofing-aware speaker verification system reported in the literature. This joint system achieves significantly better quality than the corresponding offline solutions. All our models require far fewer floating-point operations per score update. RawSpectrogram significantly reduces prediction latency and allows the system to update the probability with each new chunk from the stream, preserving all information from the past. To the best of our knowledge, TAASIST is the most successful voice anti-spoofing system that employs a vanilla Transformer trained using supervised learning.
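The chunk-wise update at the heart of the streaming setup can be sketched with a toy recurrent scorer: the hidden state carries everything heard so far, and the spoofing probability is refreshed after every chunk. The model and layer sizes below are made up for illustration and are not the actual RawRNN or RS-AASIST architectures.

```python
# Toy chunk-wise streaming scorer: a GRU consumes feature frames chunk by
# chunk; its hidden state preserves the past, so the score is updated per
# chunk without re-processing the whole utterance. Not the paper's model.
import torch
import torch.nn as nn

class StreamingScorer(nn.Module):
    def __init__(self, feat_dim=40, hidden=64):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def step(self, chunk, state=None):
        """chunk: (1, frames, feat_dim). Returns updated score and state."""
        out, state = self.gru(chunk, state)
        score = torch.sigmoid(self.head(out[:, -1]))  # prob after this chunk
        return score, state

model = StreamingScorer()
state = None
for chunk in torch.randn(5, 1, 10, 40):  # five 10-frame chunks from a stream
    score, state = model.step(chunk, state)
    print(float(score))                   # the score is refreshed per chunk
```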
@article{grinberg2023rawspectrogram,
  title     = {{RawSpectrogram: On the Way to Effective Streaming Speech Anti-spoofing}},
  author    = {Grinberg, Petr and Shikhov, Vladislav},
  journal   = {IEEE Access},
  year      = {2023},
  publisher = {IEEE},
  doi       = {10.1109/ACCESS.2023.3321919},
}
2022
- arXiv. A Comparative Study of Fusion Methods for SASV Challenge 2022. Petr Grinberg and Vladislav Shikhov. arXiv preprint arXiv:2203.16970, 2022.
An Automatic Speaker Verification (ASV) system is a form of biometric authentication. It can be attacked by an intruder who falsifies data to gain access to protected information. Countermeasures (CM) are special algorithms that detect these spoofing attacks. While the ASVspoof Challenge series focused on the development of CM for a fixed ASV system, the organizers of the new Spoofing-Aware Speaker Verification (SASV) Challenge believe that the best results can be achieved if the CM and ASV systems are optimized jointly. One approach to joint optimization is fusion over embeddings or scores obtained from the ASV and CM models. The baselines of SASV Challenge 2022 present two types of fusion: score-sum and a back-end ensemble with a 3-layer MLP. This paper describes our study of other fusion methods, including boosting over embeddings, which has not been used in anti-spoofing studies before.
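Both fusion flavors are easy to picture. The sketch below shows the score-sum baseline and boosting over concatenated ASV/CM embeddings; the embedding dimensions, data, and the specific boosted classifier are illustrative assumptions, not the challenge code.

```python
# Minimal sketch of two SASV fusion strategies: (1) score-sum, which simply
# adds the ASV and CM scores, and (2) boosting over concatenated embeddings.
# Shapes, data, and the classifier choice are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def score_sum(asv_scores, cm_scores):
    return asv_scores + cm_scores            # baseline fusion: plain addition

rng = np.random.default_rng(0)
n = 200
asv_emb = rng.standard_normal((n, 192))      # e.g. speaker embeddings
cm_emb = rng.standard_normal((n, 160))       # e.g. countermeasure embeddings
labels = rng.integers(0, 2, size=n)          # 1 = target bona fide trial

X = np.concatenate([asv_emb, cm_emb], axis=1)
fusion = GradientBoostingClassifier().fit(X, labels)
sasv_scores = fusion.predict_proba(X)[:, 1]  # fused SASV score per trial
```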
@article{grinberg2022comparative,
  title   = {{A Comparative Study of Fusion Methods for SASV Challenge 2022}},
  author  = {Grinberg, Petr and Shikhov, Vladislav},
  journal = {arXiv preprint arXiv:2203.16970},
  year    = {2022},
  doi     = {10.48550/arXiv.2203.16970},
}