doi: 10.3934/jimo.2021131
## Two-level optimization approach with accelerated proximal gradient for objective measures in sparse speech reconstruction

1 School of Electrical Engineering, Computing and Mathematical Sciences, Curtin University of Technology, Perth, Australia

2 Faculty of Engineering and Physical Sciences, University of Southampton Malaysia (UoSM), Iskandar Puteri, Johor, Malaysia

Received November 2020; Revised May 2021; Early access August 2021

Compressive speech enhancement exploits the sparseness of speech and the non-sparseness of noise in the time-frequency representation to perform speech enhancement. However, reconstructing the sparsest output does not necessarily yield a good enhanced speech signal, because it risks distorting the speech. This paper proposes a two-level optimization approach that incorporates objective quality measures into compressive speech enhancement. The proposed method combines the accelerated proximal gradient approach with a global one-dimensional optimization method to solve the sparse reconstruction. By incorporating objective quality measures in the optimization process, the reconstructed output is not only sparse but also maintains the highest objective quality score possible. In other words, the sparse speech reconstruction process becomes a quality sparse speech reconstruction. Experimental results for compressive speech enhancement consistently show improved scores on objective measures in different noisy environments compared to the non-optimized method. Additionally, the proposed optimization yields a higher convergence rate with a lower computational complexity than the existing methods.
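The two-level structure described in the abstract can be illustrated with a short sketch. This is not the paper's implementation: it assumes the inner problem is an $\ell_1$-regularized least-squares fit solved with an accelerated proximal gradient (FISTA-style) iteration, and it stands in a plain grid search for the global one-dimensional optimization. The names `fista`, `quality_score` and `two_level_search`, the matrix `A`, the weight `lam`, and the negative-squared-error score used in place of PESQ or STOI are all assumptions made for the example.

```python
import numpy as np


def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)


def fista(A, y, lam, n_iter=200):
    """Inner level (assumed form): minimize 0.5*||A x - y||^2 + lam*||x||_1
    with an accelerated proximal gradient iteration."""
    # Step size 1/L, where L = ||A||_2^2 is the Lipschitz constant
    # of the gradient of the smooth least-squares term.
    step = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    z = x.copy()  # extrapolated point
    t = 1.0       # momentum parameter
    for _ in range(n_iter):
        grad = A.T @ (A @ z - y)
        x_new = soft_threshold(z - step * grad, step * lam)
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)
        x, t = x_new, t_new
    return x


def quality_score(x_hat, x_ref):
    """Hypothetical stand-in for an objective measure such as PESQ or STOI:
    higher is better. Here it is simply the negative squared error."""
    return -float(np.sum((x_hat - x_ref) ** 2))


def two_level_search(A, y, x_ref, lam_grid):
    """Outer level (assumed form): choose the lam whose sparse
    reconstruction scores highest. A grid search replaces the global
    one-dimensional optimizer purely for illustration."""
    best_lam, best_score, best_x = None, -np.inf, None
    for lam in lam_grid:
        x_hat = fista(A, y, lam)
        score = quality_score(x_hat, x_ref)
        if score > best_score:
            best_lam, best_score, best_x = lam, score, x_hat
    return best_lam, best_score, best_x
```

Calling `two_level_search(A, y, x_ref, np.linspace(0.01, 0.5, 8))` on a small synthetic problem returns the selected weight, its score, and the corresponding reconstruction. By construction, the selected weight scores at least as well as any fixed weight in the grid, which mirrors the comparison between optimized and fixed values of $\lambda$ in the tables of results.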


Convergence for the accelerated proximal gradient, the proximal gradient, and the interior point methods for babble noise at 0 dB SNR and $L = 256$

Convergence for the proximal gradient, the accelerated proximal gradient, and the interior point methods for babble noise at 0 dB SNR and $L = 512$

Convergence for the proximal gradient, the accelerated proximal gradient, and the interior point methods for destroyer noise at 0 dB SNR and $L = 512$

Complexity comparison between the proximal gradient, the accelerated proximal gradient, and the interior point methods for babble noise, destroyer noise and white noise with window length $L = 256$

| Noise type | SNR | Accelerated Proximal Gradient (s) | Proximal Gradient (s) | Interior Point Method (s) |
|---|---|---|---|---|
| Babble noise | 0 dB | 3.1307 | 3.3139 | 7.2640 |
| Babble noise | 5 dB | 2.7209 | 2.8722 | 7.0281 |
| Babble noise | 10 dB | 2.3887 | 2.4476 | 6.8677 |
| Babble noise | 15 dB | 2.2449 | 2.4057 | 6.8233 |
| Babble noise | 20 dB | 2.0481 | 2.1050 | 6.6695 |
| Destroyer noise | 0 dB | 2.8322 | 2.9187 | 6.9363 |
| Destroyer noise | 5 dB | 2.4799 | 2.5386 | 6.8301 |
| Destroyer noise | 10 dB | 2.2675 | 2.4413 | 6.7417 |
| Destroyer noise | 15 dB | 2.1390 | 2.2119 | 6.7070 |
| Destroyer noise | 20 dB | 1.8859 | 1.9688 | 6.4216 |
| White noise | 0 dB | 3.4491 | 3.5234 | 6.6548 |
| White noise | 5 dB | 2.8229 | 2.9723 | 6.9340 |
| White noise | 10 dB | 2.5765 | 2.6288 | 7.2333 |
| White noise | 15 dB | 2.3393 | 2.4726 | 7.0217 |
| White noise | 20 dB | 1.9912 | 2.0732 | 6.5130 |

Complexity comparison between the accelerated proximal gradient, the proximal gradient and the interior point methods for babble noise, destroyer noise and white noise with window length $L = 512$

| Noise type | SNR | Accelerated Proximal Gradient (s) | Proximal Gradient (s) | Interior Point Method (s) |
|---|---|---|---|---|
| Babble noise | 0 dB | 0.8681 | 0.9342 | 12.7778 |
| Babble noise | 5 dB | 0.7779 | 0.8346 | 12.5931 |
| Babble noise | 10 dB | 0.7119 | 0.7730 | 12.2826 |
| Babble noise | 15 dB | 0.6637 | 0.7199 | 12.0663 |
| Babble noise | 20 dB | 0.6138 | 0.6703 | 11.7910 |
| Destroyer noise | 0 dB | 0.8096 | 0.8760 | 12.4143 |
| Destroyer noise | 5 dB | 0.7330 | 0.7863 | 12.4028 |
| Destroyer noise | 10 dB | 0.6709 | 0.7329 | 12.0540 |
| Destroyer noise | 15 dB | 0.6263 | 0.6908 | 11.9282 |
| Destroyer noise | 20 dB | 0.5950 | 0.6550 | 11.8206 |
| White noise | 0 dB | 0.9592 | 1.0401 | 11.9704 |
| White noise | 5 dB | 0.8137 | 0.8761 | 12.5119 |
| White noise | 10 dB | 0.7049 | 0.7656 | 12.8533 |
| White noise | 15 dB | 0.6503 | 0.7136 | 12.3004 |
| White noise | 20 dB | 0.6193 | 0.6818 | 11.9545 |

PESQ and STOI performance for different SNR with babble noise and $L = 256$

| SNR | Methods | PESQ | STOI |
|---|---|---|---|
| 0 dB | Optimized $\lambda=0.9549$ | 2.0328 | 0.7147 |
| 0 dB | Fixed value $\lambda=0.8$ | 2.0073 | 0.7032 |
| 0 dB | Fixed value $\lambda=0.9$ | 2.0241 | 0.7103 |
| 0 dB | Unprocessed | 1.8938 | 0.7145 |
| 5 dB | Optimized $\lambda=0.9449$ | 2.4100 | 0.8200 |
| 5 dB | Fixed value $\lambda=0.8$ | 2.3896 | 0.8107 |
| 5 dB | Fixed value $\lambda=0.9$ | 2.3996 | 0.8170 |
| 5 dB | Unprocessed | 2.2203 | 0.8130 |
| 10 dB | Optimized $\lambda=0.9549$ | 2.7702 | 0.8999 |
| 10 dB | Fixed value $\lambda=0.8$ | 2.7522 | 0.8918 |
| 10 dB | Fixed value $\lambda=0.9$ | 2.7639 | 0.8974 |
| 10 dB | Unprocessed | 2.5434 | 0.8899 |
| 15 dB | Optimized $\lambda=0.9525$ | 3.1247 | 0.9504 |
| 15 dB | Fixed value $\lambda=0.8$ | 3.0937 | 0.9455 |
| 15 dB | Fixed value $\lambda=0.9$ | 3.1144 | 0.9489 |
| 15 dB | Unprocessed | 2.8556 | 0.9423 |
| 20 dB | Optimized $\lambda=0.9549$ | 3.4425 | 0.9767 |
| 20 dB | Fixed value $\lambda=0.8$ | 3.3898 | 0.9731 |
| 20 dB | Fixed value $\lambda=0.9$ | 3.4317 | 0.9757 |
| 20 dB | Unprocessed | 3.1674 | 0.9734 |

PESQ and STOI performance for different SNR with destroyer noise and $L = 256$

| SNR | Methods | PESQ | STOI |
|---|---|---|---|
| 0 dB | Optimized $\lambda=0.8949$ | 2.1629 | 0.7532 |
| 0 dB | Fixed value $\lambda=0.8$ | 2.1543 | 0.7448 |
| 0 dB | Fixed value $\lambda=0.9$ | 2.1456 | 0.7497 |
| 0 dB | Unprocessed | 1.9271 | 0.7524 |
| 5 dB | Optimized $\lambda=0.8951$ | 2.5370 | 0.8337 |
| 5 dB | Fixed value $\lambda=0.8$ | 2.5186 | 0.8267 |
| 5 dB | Fixed value $\lambda=0.9$ | 2.5283 | 0.8325 |
| 5 dB | Unprocessed | 2.2955 | 0.8281 |
| 10 dB | Optimized $\lambda=0.8749$ | 2.8704 | 0.9001 |
| 10 dB | Fixed value $\lambda=0.8$ | 2.8543 | 0.8933 |
| 10 dB | Fixed value $\lambda=0.9$ | 2.8677 | 0.8985 |
| 10 dB | Unprocessed | 2.6132 | 0.8902 |
| 15 dB | Optimized $\lambda=0.8949$ | 3.1914 | 0.9468 |
| 15 dB | Fixed value $\lambda=0.8$ | 3.1611 | 0.9412 |
| 15 dB | Fixed value $\lambda=0.9$ | 3.1876 | 0.9455 |
| 15 dB | Unprocessed | 2.9256 | 0.9382 |
| 20 dB | Optimized $\lambda=0.9451$ | 3.4868 | 0.9737 |
| 20 dB | Fixed value $\lambda=0.8$ | 3.4427 | 0.9696 |
| 20 dB | Fixed value $\lambda=0.9$ | 3.4722 | 0.9726 |
| 20 dB | Unprocessed | 3.2468 | 0.9697 |

PESQ and STOI performance for different SNR with white noise and $L = 256$

| SNR | Methods | PESQ | STOI |
|---|---|---|---|
| 0 dB | Optimized $\lambda=0.9331$ | 2.0119 | 0.7661 |
| 0 dB | Fixed value $\lambda=0.8$ | 1.9895 | 0.7519 |
| 0 dB | Fixed value $\lambda=0.9$ | 2.0042 | 0.7619 |
| 0 dB | Unprocessed | 1.6665 | 0.7377 |
| 5 dB | Optimized $\lambda=0.9451$ | 2.3972 | 0.8615 |
| 5 dB | Fixed value $\lambda=0.8$ | 2.3716 | 0.8492 |
| 5 dB | Fixed value $\lambda=0.9$ | 2.3913 | 0.8580 |
| 5 dB | Unprocessed | 1.9615 | 0.8387 |
| 10 dB | Optimized $\lambda=0.9451$ | 2.8102 | 0.9275 |
| 10 dB | Fixed value $\lambda=0.8$ | 2.7735 | 0.9183 |
| 10 dB | Fixed value $\lambda=0.9$ | 2.7976 | 0.9246 |
| 10 dB | Unprocessed | 2.2989 | 0.9146 |
| 15 dB | Optimized $\lambda=0.9349$ | 3.1973 | 0.9652 |
| 15 dB | Fixed value $\lambda=0.8$ | 3.1472 | 0.9594 |
| 15 dB | Fixed value $\lambda=0.9$ | 3.1844 | 0.9636 |
| 15 dB | Unprocessed | 2.6442 | 0.9613 |
| 20 dB | Optimized $\lambda=0.9501$ | 3.5007 | 0.9858 |
| 20 dB | Fixed value $\lambda=0.8$ | 3.4286 | 0.9797 |
| 20 dB | Fixed value $\lambda=0.9$ | 3.4796 | 0.9826 |
| 20 dB | Unprocessed | 2.9839 | 0.9845 |

PESQ and STOI performance for different SNR with babble noise and $L = 512$

| SNR | Methods | PESQ | STOI |
|---|---|---|---|
| 0 dB | Optimized $\lambda=0.9601$ | 2.0699 | 0.7234 |
| 0 dB | Fixed value $\lambda=0.8$ | 2.0525 | 0.7129 |
| 0 dB | Fixed value $\lambda=0.9$ | 2.0634 | 0.7212 |
| 0 dB | Unprocessed | 1.8938 | 0.7145 |
| 5 dB | Optimized $\lambda=0.9601$ | 2.4185 | 0.8282 |
| 5 dB | Fixed value $\lambda=0.8$ | 2.4084 | 0.8195 |
| 5 dB | Fixed value $\lambda=0.9$ | 2.4150 | 0.8258 |
| 5 dB | Unprocessed | 2.2203 | 0.8130 |
| 10 dB | Optimized $\lambda=0.9079$ | 2.7672 | 0.9064 |
| 10 dB | Fixed value $\lambda=0.8$ | 2.7529 | 0.8996 |
| 10 dB | Fixed value $\lambda=0.9$ | 2.7586 | 0.9045 |
| 10 dB | Unprocessed | 2.5434 | 0.8899 |
| 15 dB | Optimized $\lambda=0.9077$ | 3.1187 | 0.9540 |
| 15 dB | Fixed value $\lambda=0.8$ | 3.0736 | 0.9507 |
| 15 dB | Fixed value $\lambda=0.9$ | 3.0790 | 0.9530 |
| 15 dB | Unprocessed | 2.8556 | 0.9423 |
| 20 dB | Optimized $\lambda=0.9601$ | 3.3898 | 0.9785 |
| 20 dB | Fixed value $\lambda=0.8$ | 3.3703 | 0.9760 |
| 20 dB | Fixed value $\lambda=0.9$ | 3.3822 | 0.9775 |
| 20 dB | Unprocessed | 3.1674 | 0.9734 |

PESQ and STOI performance for different SNR with destroyer noise and 512 subbands

| SNR | Methods | PESQ | STOI |
|---|---|---|---|
| 0 dB | Optimized $\lambda=0.7601$ | 2.2328 | 0.7629 |
| 0 dB | Fixed value $\lambda=0.8$ | 2.2256 | 0.7602 |
| 0 dB | Fixed value $\lambda=0.9$ | 2.2078 | 0.7622 |
| 0 dB | Unprocessed | 1.9271 | 0.7524 |
| 5 dB | Optimized $\lambda=0.7700$ | 2.5651 | 0.8441 |
| 5 dB | Fixed value $\lambda=0.8$ | 2.5589 | 0.8414 |
| 5 dB | Fixed value $\lambda=0.9$ | 2.5569 | 0.8436 |
| 5 dB | Unprocessed | 2.2955 | 0.8281 |
| 10 dB | Optimized $\lambda=0.7700$ | 2.8773 | 0.9084 |
| 10 dB | Fixed value $\lambda=0.8$ | 2.8699 | 0.9056 |
| 10 dB | Fixed value $\lambda=0.9$ | 2.8742 | 0.9081 |
| 10 dB | Unprocessed | 2.6132 | 0.8902 |
| 15 dB | Optimized $\lambda=0.8301$ | 3.1775 | 0.9530 |
| 15 dB | Fixed value $\lambda=0.8$ | 3.1710 | 0.9509 |
| 15 dB | Fixed value $\lambda=0.9$ | 3.1732 | 0.9529 |
| 15 dB | Unprocessed | 2.9256 | 0.9382 |
| 20 dB | Optimized $\lambda=0.9270$ | 3.4819 | 0.9768 |
| 20 dB | Fixed value $\lambda=0.8$ | 3.4375 | 0.9750 |
| 20 dB | Fixed value $\lambda=0.9$ | 3.4469 | 0.9766 |
| 20 dB | Unprocessed | 3.2468 | 0.9697 |

PESQ and STOI performance for different SNR with white noise and 512 subbands

| SNR | Methods | PESQ | STOI |
|---|---|---|---|
| 0 dB | Optimized $\lambda=0.8599$ | 2.0454 | 0.7829 |
| 0 dB | Fixed value $\lambda=0.8$ | 2.0395 | 0.7721 |
| 0 dB | Fixed value $\lambda=0.9$ | 2.0403 | 0.7775 |
| 0 dB | Unprocessed | 1.6665 | 0.7377 |
| 5 dB | Optimized $\lambda=0.8801$ | 2.4148 | 0.8735 |
| 5 dB | Fixed value $\lambda=0.8$ | 2.4129 | 0.8638 |
| 5 dB | Fixed value $\lambda=0.9$ | 2.4101 | 0.8695 |
| 5 dB | Unprocessed | 1.9615 | 0.8387 |
| 10 dB | Optimized $\lambda=0.9496$ | 2.8009 | 0.9338 |
| 10 dB | Fixed value $\lambda=0.8$ | 2.7937 | 0.9261 |
| 10 dB | Fixed value $\lambda=0.9$ | 2.7948 | 0.9308 |
| 10 dB | Unprocessed | 2.2989 | 0.9146 |
| 15 dB | Optimized $\lambda=0.9550$ | 3.1967 | 0.9673 |
| 15 dB | Fixed value $\lambda=0.8$ | 3.1421 | 0.9623 |
| 15 dB | Fixed value $\lambda=0.9$ | 3.1514 | 0.9652 |
| 15 dB | Unprocessed | 2.6442 | 0.9613 |
| 20 dB | Optimized $\lambda=0.9601$ | 3.4393 | 0.9864 |
| 20 dB | Fixed value $\lambda=0.8$ | 3.3930 | 0.9807 |
| 20 dB | Fixed value $\lambda=0.9$ | 3.4175 | 0.9823 |
| 20 dB | Unprocessed | 2.9839 | 0.9845 |
