
# Evaluation of parallel and sequential deep learning models for music subgenre classification

The second author is supported by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC)

In this paper, we evaluate two deep learning models that integrate convolutional and recurrent neural networks, implementing both sequential and parallel architectures for fine-grained musical subgenre classification. Because of the exceptionally low signal-to-noise ratio (SNR) of our low-level mel-spectrogram dataset, sensitive yet robust learning models are required to produce meaningful results. We investigate the effects of three commonly applied optimizers, dropout, batch normalization, and sensitivity to varying initialization distributions. The results demonstrate that the sequential model specifically requires the RMSprop optimizer, while the parallel model implemented with the Adam optimizer yielded encouraging and stable results, achieving an average F1 score of $0.63$. When all factors are considered, the optimized hybrid parallel model outperformed the sequential model in both classification accuracy and system stability.
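The optimizer finding above (RMSprop for the sequential model, Adam for the parallel one) refers to two standard adaptive-gradient methods. As a reference point, here is a minimal numpy sketch of their textbook update rules on a toy quadratic; this illustrates the standard formulations, not the paper's implementation, and all names and hyperparameters are illustrative defaults.

```python
import numpy as np

def rmsprop_step(w, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    """One RMSprop update: scale the step by a running average of squared gradients."""
    cache = decay * cache + (1 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

def adam_step(w, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: bias-corrected first- and second-moment estimates."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)  # bias correction for the mean estimate
    v_hat = v / (1 - b2 ** t)  # bias correction for the variance estimate
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy demo: minimize f(w) = w^2 (gradient 2w) with each optimizer.
w_r, cache = 2.0, 0.0
w_a, m, v = 2.0, 0.0, 0.0
for t in range(1, 501):
    w_r, cache = rmsprop_step(w_r, 2 * w_r, cache)
    w_a, m, v = adam_step(w_a, 2 * w_a, m, v, t)
```

Both methods normalize the step by an estimate of the gradient's second moment; Adam additionally smooths the gradient itself, which is one common explanation for its stability on noisy inputs such as low-SNR spectrograms.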

Mathematics Subject Classification: Primary: 68T07; Secondary: 68T10.


Figure 1.  Baseline CNN model

Figure 2.  CRNN sequential architecture

Figure 3.  Parallel CNN-RNN architecture

Figure 4.  Visualization of one song from our dataset

Figure 5.  RMSprop learning process on two axes

Figure 6.  Classification accuracy across 50 epochs

Table 1.  F1 scores for optimizer evaluation

Table 2.  Optimal classification accuracy

Table 3.  Macro F1 scores for effect of regularization

| Model   | Data       | Dropout | Batch Normalization | Dropout + Batch Normalization |
|---------|------------|---------|---------------------|-------------------------------|
| CRNN    | Train      | 0.67    | 1.00                | 0.98                          |
|         | Validation | 0.65    | 0.58                | 0.60                          |
|         | Test       | 0.62    | 0.57                | 0.41                          |
| CNN-RNN | Train      | 0.65    | 1.00                | 0.90                          |
|         | Validation | 0.65    | 0.58                | 0.60                          |
|         | Test       | 0.63    | 0.61                | 0.63                          |
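Table 3 compares dropout and batch normalization as regularizers. For readers unfamiliar with the two mechanisms, here is a generic numpy sketch of their standard training-time forward passes (a textbook illustration, not the paper's code; the dropout rate and batch size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p, scale survivors by 1/(1-p)."""
    if not training:
        return x  # identity at inference time
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Training-time batch normalization: standardize each feature over the batch axis."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = rng.normal(5.0, 3.0, size=(256, 8))  # a batch of 256 eight-dimensional activations
y = batch_norm(x)       # per-feature mean ~0, std ~1
z = dropout(y, p=0.5)   # roughly half the units zeroed, rest rescaled
```

Dropout injects noise that discourages co-adaptation, while batch normalization stabilizes layer statistics; as Table 3 suggests, combining them does not necessarily help.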

Table 4.  Average F1 scores for effects of initialization methods

| Initialization | CNN  | CRNN | CNN-RNN |
|----------------|------|------|---------|
| Glorot Normal  | 0.31 | 0.63 | 0.63    |
| Glorot Uniform | 0.34 | 0.60 | 0.59    |
| Random Normal  | 0.33 | 0.45 | 0.53    |
| Random Uniform | 0.33 | 0.37 | 0.57    |
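The Glorot (Xavier) schemes in Table 4 scale the weight variance by the layer's fan-in and fan-out, whereas plain random initialization uses a fixed scale regardless of layer size. A minimal numpy sketch of the standard formulations (the layer dimensions below are illustrative, not taken from the paper's models):

```python
import numpy as np

rng = np.random.default_rng(42)

def glorot_normal(fan_in, fan_out):
    # std = sqrt(2 / (fan_in + fan_out)), per Glorot & Bengio (2010)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def glorot_uniform(fan_in, fan_out):
    # limit = sqrt(6 / (fan_in + fan_out)); a uniform on [-limit, limit]
    # has the same variance as the normal variant above
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def random_normal(fan_in, fan_out, std=0.05):
    # fixed-scale baseline: variance does not adapt to layer size
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = glorot_normal(512, 256)
U = glorot_uniform(512, 256)
```

Matching the variance to layer width keeps activation and gradient magnitudes roughly constant across depth, which is consistent with the Glorot schemes outperforming fixed-scale random initialization for both hybrid models in Table 4.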

