Unified Voice Biometric System Against Spoofing Attacks

The project was carried out by my third-year student, Avgustyonok Alina, as coursework at HSE in the 2023/24 academic year.

Voice biometric systems are split into Automatic Speaker Verification (ASV) and Countermeasure (CM) systems, which handle different types of attacks. An ASV system aims to distinguish between the owner of the system and another real person (imposter attack), whereas a CM system has to discriminate between real and synthesized speech (i.e., text-to-speech, TTS, and voice conversion, VC) or between real and recorded speech (a.k.a. replay attack). Usually, a separate system exists for each of the three attack types (imposter, synthesized, recorded), but a complete authentication system must handle all of them. Hence, the development of Spoofing-Aware Speaker Verification (SASV) systems has recently gained popularity.

However, in the literature, such SASV systems mostly focus only on the joint detection of imposter attacks and synthesized speech. A unified system that also considers recorded-speech attacks is necessary. The project aims to develop such a system and provides some initial insights by building an SASV system that combines an ASV system and a replay-attack CM system.

LFCC-LCNN is used as the CM system and ECAPA-TDNN as the ASV system. Following the methodology of imposter/synthesized-speech SASV systems, four versions of an imposter/replay-attack SASV system are created:

Simple sum of scores (with/without pre-normalization), as in the baseline of SASV Challenge 2022.
Probabilistic fusion.
FiLM-based fusion.
Cascade approach.
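
The four fusion strategies above can be illustrated with a minimal sketch. It is not the project's implementation: the function names, the z-score normalization, the mapping of a cosine ASV score to a probability, the sigmoid on the CM score, the cascade order (CM first) and its threshold, and the choice of which embedding conditions the FiLM layer are all assumptions made for illustration only.

```python
import numpy as np

def zscore(scores):
    """Zero-mean, unit-variance normalization over a batch of scores (assumed scheme)."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std()
    return (scores - scores.mean()) / (std if std > 0 else 1.0)

def sum_fusion(asv_scores, cm_scores, normalize=False):
    """Simple sum of ASV and CM scores, optionally normalizing each stream first."""
    asv = np.asarray(asv_scores, dtype=float)
    cm = np.asarray(cm_scores, dtype=float)
    if normalize:
        asv, cm = zscore(asv), zscore(cm)
    return asv + cm

def probabilistic_fusion(asv_cosine, cm_logit):
    """Product of P(target speaker) and P(bona fide speech).

    Assumes the ASV score is a cosine similarity in [-1, 1] and the CM
    score is a logit; both mappings are illustrative choices.
    """
    p_target = (np.asarray(asv_cosine, dtype=float) + 1.0) / 2.0
    p_bonafide = 1.0 / (1.0 + np.exp(-np.asarray(cm_logit, dtype=float)))
    return p_target * p_bonafide

def cascade(asv_scores, cm_scores, cm_threshold=0.0):
    """Reject trials the CM flags as spoofed; otherwise pass the ASV score through."""
    asv = np.asarray(asv_scores, dtype=float)
    cm = np.asarray(cm_scores, dtype=float)
    return np.where(cm >= cm_threshold, asv, -np.inf)

def film_modulate(features, condition, w_gamma, b_gamma, w_beta, b_beta):
    """FiLM: scale and shift features with gamma and beta predicted from a condition.

    In a real system the weights are learned and a classifier follows; here
    they are plain arrays passed in by the caller.
    """
    gamma = w_gamma @ condition + b_gamma
    beta = w_beta @ condition + b_beta
    return gamma * features + beta

# Toy trials: genuine target, replayed target, bona fide imposter.
asv = [0.8, 0.7, -0.1]   # hypothetical cosine similarities
cm = [3.0, -4.0, 2.5]    # hypothetical CM logits (higher = more bona fide)

summed = sum_fusion(asv, cm)
summed_norm = sum_fusion(asv, cm, normalize=True)
prob = probabilistic_fusion(asv, cm)
casc = cascade(asv, cm)
```

With these toy scores, only the genuine target trial scores high under every strategy, while the replayed trial is pushed down by the CM stream or rejected outright by the cascade. The score-level methods need no training, whereas FiLM-based fusion requires learning the modulation weights.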

The experimental results show that the fusion approaches behave similarly in the synthesized-speech and recorded-speech cases: the cascade approach performs best, FiLM-based fusion is second-best, and probabilistic fusion outperforms the vanilla sum.