LenslessMic
Audio Encryption and Authentication via Lensless Computational Imaging -- Official Demo Page.
With society’s increasing reliance on digital data sharing, the protection of sensitive information has become critical. Encryption serves as one of the privacy-preserving methods; however, its realization in the audio domain predominantly relies on signal processing or software methods embedded into hardware. In this paper, we introduce LenslessMic, a hybrid optical hardware-based encryption method that utilizes a lensless camera as a physical layer of security applicable to multiple types of audio. We show that LenslessMic enables robust authentication of audio recordings and encryption strength greater than or equal to that of AES-256 with high-quality signals and minimal loss of content information. The approach is validated with a low-cost Raspberry Pi prototype and is open sourced together with datasets to facilitate research in the area.
Demo Samples
Below we show example audio from the collected Librispeech and SongDescriber datasets for different models: Learned (with different \(g, r\) variations), R-Learned, No-PSF, and ADMM-100.
All models and train/test datasets are published in a Hugging Face Collection.
Audio Reconstruction
Below we provide recordings from the test-clean set of Librispeech. We recommend listening to the ground-truth and codec versions only after listening to the reconstructions, to avoid the phonetic restoration effect: unintelligible audio may become meaningful once you know the actual content of the speech. Methods with \(g \ge 2\) increase the robustness of LenslessMic so that this effect does not allow speech content to be heard in No-PSF reconstructions.
Method | 237-126133-0018 | 4446-2273-0026 | 5105-28241-0013 | 5142-33396-0062 | 7127-75947-0032 | 8455-210777-0042 |
---|
The Learned and Codec recordings sound identical, showcasing the high-quality reconstruction ability of LenslessMic. Since it may be possible to understand part of the speech content in the No-PSF reconstructions, we provide LenslessMic variants with improved robustness that operate on groups of frames. The examples are provided below. The \(g=3\) case provides the best balance between security robustness and reconstruction quality.
Method | 237-126133-0018 | 4446-2273-0026 | 5105-28241-0013 | 5142-33396-0062 | 7127-75947-0032 | 8455-210777-0042 |
---|
We test whether LenslessMic is applicable to other neural audio codecs by collecting a dataset with X-codec instead of DAC. The results are presented below. While some utterances sound almost as good as the DAC ones, several recordings have a severe reverberation-like or filtering-like effect that decreases intelligibility. In general, LenslessMic is applicable to other codecs and even to the cross-codec scenario; however, training on the actual data is required for high-quality reconstruction. R-Learned achieves better quality because it is trained on test-random data and is not overfitted to any single codec.
Method | 237-126133-0018 | 4446-2273-0026 | 5105-28241-0013 | 5142-33396-0062 | 7127-75947-0032 | 8455-210777-0042 |
---|
LenslessMic also works on music data (we downsampled the dataset to 16 kHz, so some instruments are slightly distorted):
Method | 27.599227.6s | 33.1336433.6s | 43.175043.6s | 46.5346.6s | 56.76656.6s |
---|
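The distortion mentioned for the music samples is what one would expect from downsampling: at 16 kHz nothing above 8 kHz can be represented, so high-frequency instrument content is removed. The sketch below illustrates this with a windowed-sinc low-pass filter followed by decimation. It is only an illustration: the 48 kHz source rate, the factor of 3, the filter length, and the function name `downsample_by_3` are assumptions, and the page does not say which resampler was actually used.

```python
import numpy as np

def downsample_by_3(x, num_taps=101):
    """Low-pass filter, then keep every 3rd sample (e.g. 48 kHz -> 16 kHz)."""
    factor = 3
    # Windowed-sinc FIR low-pass with cutoff at the new Nyquist frequency,
    # i.e. 1 / (2 * factor) cycles per input sample.
    n = np.arange(num_taps) - (num_taps - 1) / 2
    h = np.sinc(n / factor) / factor * np.hamming(num_taps)
    h /= h.sum()  # unity gain at DC
    filtered = np.convolve(x, h, mode="same")
    return filtered[::factor]

def rms(v):
    return float(np.sqrt(np.mean(v ** 2)))

fs_in = 48_000                      # assumed source rate (hypothetical)
t = np.arange(fs_in) / fs_in        # one second of audio
low_tone = np.sin(2 * np.pi * 1_000 * t)    # below the new 8 kHz Nyquist limit
high_tone = np.sin(2 * np.pi * 10_000 * t)  # above it

low_out = downsample_by_3(low_tone)
high_out = downsample_by_3(high_tone)
print(f"1 kHz tone RMS after downsampling:  {rms(low_out):.3f}")
print(f"10 kHz tone RMS after downsampling: {rms(high_out):.3f}")
```

The 1 kHz tone passes through essentially unchanged, while the 10 kHz tone is strongly attenuated, which is the kind of loss that makes bright instruments sound duller at 16 kHz.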
Audio Encryption and Authentication
LenslessMic reconstruction results in noise if the PSF is wrong. This is a core property behind the robustness and accuracy of LenslessMic authentication. Below you can find examples:
LenslessMic encryption strength is greater than or equal to that of AES-256. Below we provide audio samples for different \(W\), i.e., the ratio of correctly determined pixels in the PSF. \(W=7\%\) and \(W=4\%\) correspond to AES-256 and AES-128 strength, respectively.
Results for \(g=3\):
Method | 237-126133-0018 | 4446-2273-0026 | 5105-28241-0013 | 5142-33396-0062 | 7127-75947-0032 | 8455-210777-0042 |
---|
Results for \(g=2\):
Method | 237-126133-0018 | 4446-2273-0026 | 5105-28241-0013 | 5142-33396-0062 | 7127-75947-0032 | 8455-210777-0042 |
---|
\(g=3\) is more robust but leads to slightly worse reconstruction quality.
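The dependence of reconstruction quality on \(W\) can be illustrated with a toy model. The sketch below assumes the usual lensless forward model (the measurement is the scene convolved with the PSF) and inverts it with a simple regularized inverse filter. This is not the Learned or ADMM-100 reconstruction used on this page. The random scene, the random PSF, the helper names, and the choice to fill unknown PSF pixels with random values are all assumptions made for illustration, and the resulting numbers say nothing about the AES-equivalent thresholds quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 32
scene = rng.random((N, N))   # stand-in for the image shown to the camera
psf = rng.random((N, N))     # hypothetical random PSF acting as the key
psf /= psf.sum()

def measure(scene, psf):
    """Toy lensless measurement: circular convolution of scene and PSF."""
    return np.real(np.fft.ifft2(np.fft.fft2(scene) * np.fft.fft2(psf)))

def reconstruct(meas, psf_guess, reg=1e-9):
    """Regularized inverse filter using a (possibly wrong) PSF guess."""
    H = np.fft.fft2(psf_guess)
    X = np.fft.fft2(meas) * np.conj(H) / (np.abs(H) ** 2 + reg)
    return np.real(np.fft.ifft2(X))

def guess_psf(psf, w, rng):
    """Keep a fraction w of the true PSF pixels; randomize the rest."""
    guess = rng.random(psf.shape)
    guess /= guess.sum()
    known = rng.random(psf.shape) < w
    guess[known] = psf[known]
    return guess

def correlation(a, b):
    """Pearson correlation between two images."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

meas = measure(scene, psf)
for w in (0.0, 0.25, 0.5, 0.75, 1.0):
    recon = reconstruct(meas, guess_psf(psf, w, rng))
    print(f"W = {w:4.0%}: correlation with ground truth = "
          f"{correlation(scene, recon):+.3f}")

corr_full = correlation(scene, reconstruct(meas, psf))
corr_none = correlation(scene, reconstruct(meas, guess_psf(psf, 0.0, rng)))
```

With the exact PSF the toy reconstruction matches the scene almost perfectly, while a fully wrong PSF yields an output that is essentially uncorrelated with it, mirroring the noise-like audio in the samples.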
Captured Video Representation
The video below shows how the reconstruction of the same frame improves from \(W=0\%\) to \(W=100\%\):