LenslessMic | Petr Grinberg

Petr Grinberg, Eric Bezzam, Paolo Prandoni, Martin Vetterli

Audiovisual Communications Laboratory, EPFL, Switzerland

With society’s increasing reliance on digital data sharing, the protection of sensitive information has become critical. Encryption serves as one of the privacy-preserving methods; however, its realization in the audio domain predominantly relies on signal processing or software methods embedded into hardware. In this paper, we introduce LenslessMic, a hybrid optical hardware-based encryption method that utilizes a lensless camera as a physical layer of security applicable to multiple types of audio. We show that LenslessMic enables robust authentication of audio recordings and encryption strength greater than or equal to that of AES-256 with high-quality signals and minimal loss of content information. The approach is validated with a low-cost Raspberry Pi prototype and is open-sourced together with the datasets to facilitate research in the area.

Demo Samples

Below we show example audio from the collected LibriSpeech and SongDescriber datasets for different models: Learned (with different \(g, r\) variations), R-Learned, No-PSF, and ADMM-100.

All models and train/test datasets are published in a Hugging Face collection.

Audio Reconstruction

Below we provide recordings from the test-clean set of LibriSpeech. We recommend listening to the reconstructions first and to the ground-truth and codec versions only afterwards, to avoid the phonetic-restoration effect: unintelligible audio may become meaningful once you know the actual content of the speech. Methods with \(g \ge 2\) increase LenslessMic robustness so that this effect does not reveal speech content from No-PSF reconstructions.

[Audio table: one row per method; columns are utterances 237-126133-0018, 4446-2273-0026, 5105-28241-0013, 5142-33396-0062, 7127-75947-0032, 8455-210777-0042.]

Learned and Codec recordings sound identical, showcasing the high-quality reconstruction abilities of LenslessMic. Since it may be possible to understand part of the speech content from No-PSF reconstructions, we provide LenslessMic variants with improved robustness that operate on groups of frames. The examples are provided below. The \(g=3\) case provides the best balance between security robustness and reconstruction quality.

[Audio table: one row per method; columns are utterances 237-126133-0018, 4446-2273-0026, 5105-28241-0013, 5142-33396-0062, 7127-75947-0032, 8455-210777-0042.]

We test whether LenslessMic is applicable to other neural audio codecs by collecting a dataset with X-codec instead of DAC. The results are presented below. While some utterances sound almost as good as the DAC ones, several recordings have a severe reverberation-like or filtering-like effect that decreases intelligibility. In general, LenslessMic is applicable to other codecs and even in a cross-codec scenario; however, training on the actual data is required for high-quality reconstruction. R-Learned achieves better quality because it is trained on test-random data and is not overfitted to any single codec.

[Audio table: one row per method; columns are utterances 237-126133-0018, 4446-2273-0026, 5105-28241-0013, 5142-33396-0062, 7127-75947-0032, 8455-210777-0042.]

LenslessMic also works on music data (we downsampled the dataset to 16 kHz, so some instruments are slightly distorted):

Method	27.599227.6s	33.1336433.6s	43.175043.6s	46.5346.6s	56.76656.6s

Audio Encryption and Authentication

LenslessMic reconstruction results in noise if the PSF is wrong. This is the core property behind the robustness and accuracy of LenslessMic authentication. Below you can find examples:

The same audio with wrong (left) and correct (right) PSF.

LenslessMic encryption strength is greater than or equal to that of AES-256. Below we provide audio samples for different \(W\), i.e., the ratio of correctly determined pixels in the PSF. \(W=7\%\) and \(W=4\%\) correspond to AES-256 and AES-128 strength, respectively.

Results for \(g=3\):

[Audio table: one row per method; columns are utterances 237-126133-0018, 4446-2273-0026, 5105-28241-0013, 5142-33396-0062, 7127-75947-0032, 8455-210777-0042.]

Results for \(g=2\):

[Audio table: one row per method; columns are utterances 237-126133-0018, 4446-2273-0026, 5105-28241-0013, 5142-33396-0062, 7127-75947-0032, 8455-210777-0042.]

\(g=3\) is more robust but leads to slightly worse reconstruction quality.
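The role of \(W\) can be illustrated with a toy simulation. The sketch below is not the LenslessMic pipeline: it assumes a random 32×32 PSF, models the lensless measurement as a circular convolution, and swaps the Learned and ADMM reconstructions for a simple regularized inverse filter. The helper names (`measure`, `reconstruct`, `partial_psf`) and all parameter values are hypothetical and chosen only for illustration. A guess of the PSF is built in which a fraction \(W\) of the pixels is correct and the rest are random, and the reconstruction error is reported for several values of \(W\).

```python
import numpy as np

def measure(scene, psf):
    """Toy forward model (assumption): circular convolution of scene and PSF."""
    return np.real(np.fft.ifft2(np.fft.fft2(scene) * np.fft.fft2(psf)))

def reconstruct(measurement, psf, eps=1e-3):
    """Regularized inverse filter, a stand-in for the Learned/ADMM models."""
    H = np.fft.fft2(psf)
    Y = np.fft.fft2(measurement)
    return np.real(np.fft.ifft2(np.conj(H) * Y / (np.abs(H) ** 2 + eps)))

def partial_psf(true_psf, w, rng):
    """Guess in which a fraction w of the pixels is correct, the rest random."""
    guess = rng.random(true_psf.shape)
    n_known = round(w * true_psf.size)
    # Boolean mask with exactly n_known True entries at random positions.
    known = (rng.permutation(true_psf.size) < n_known).reshape(true_psf.shape)
    guess[known] = true_psf[known]
    return guess

rng = np.random.default_rng(0)
scene = rng.random((32, 32))     # stands in for one audio-derived video frame
true_psf = rng.random((32, 32))  # stands in for the secret PSF
y = measure(scene, true_psf)

errors = {}
for w in (0.0, 0.07, 0.5, 1.0):
    guess = partial_psf(true_psf, w, rng)
    rec = reconstruct(y, guess)
    errors[w] = np.linalg.norm(rec - scene) / np.linalg.norm(scene)
    print(f"W = {w:4.0%}: relative error = {errors[w]:.3f}")
```

In this toy setting, \(W=0\%\) corresponds to reconstructing with a wrong PSF and \(W=100\%\) to reconstructing with the correct one. The numbers it prints say nothing about the actual AES-equivalent strength, which depends on the real PSF and reconstruction models.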

Captured Video Representation

Lensed

Lensless Measurement

Reconstruction (Learned)

Reconstruction (No-PSF)

Reconstruction (ADMM-100)

Example of a lensed video (audio-to-video conversion), the corresponding lensless measurement, and its reconstructions.

The video below shows how the reconstruction of the same frame improves as \(W\) increases from \(0\%\) to \(100\%\):

Reconstruction (Learned)
