What to Do with The Data? Current Methods and Applications of Music Information Retrieval / by Xiao Quan

            The definition of Music Information Retrieval (MIR) is quite straightforward: extracting meaningful information from a piece of music. What to do with the extracted information, however, is not as easy to summarize in one sentence. In this essay, I will outline some popular MIR tasks and research approaches that pertain to digital audio content, as well as some of their current commercial implementations.

            Ranked on a scale from low to high subjectivity, some popular MIR tasks include: Tempo Estimation, Key Detection, Note Onset Detection, Beat Tracking, Melody Extraction, Chord Estimation, Structural Segmentation, Music Auto-tagging, and Mood Recognition (Choi et al., 2017). Some of these features, such as pitch, tempo, and note onsets, can be articulated logically and identified with a large pool of domain knowledge, while others, such as Structural Segmentation, Music Auto-tagging, and Mood Recognition, are highly subjective, leaving little strict logic for conventional algorithms to follow (McFee, Nieto & Bello, 2015; Lamere, 2008). However, with greater computing power and the development of machine learning algorithms, especially deep learning, even these highly subjective tasks are becoming feasible, and increasingly accurate, for computers to perform (Choi et al., 2017).
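            To make the lower-subjectivity end of this spectrum concrete, the snippet below estimates tempo, tracks beats, and detects note onsets with the librosa library (introduced in the next paragraph). It is a minimal sketch: "example.wav" is a placeholder file name, and the default parameters are assumptions rather than tuned settings.

```python
# Minimal sketch of tempo estimation, beat tracking, and onset detection
# with librosa; "example.wav" is a placeholder audio file.
import numpy as np
import librosa

y, sr = librosa.load("example.wav", mono=True)

# Tempo estimation and beat tracking from the onset-strength envelope.
tempo, beat_times = librosa.beat.beat_track(y=y, sr=sr, units="time")

# Note-onset detection, reported directly in seconds.
onset_times = librosa.onset.onset_detect(y=y, sr=sr, units="time")

print(f"Estimated tempo: {np.atleast_1d(tempo)[0]:.1f} BPM")
print(f"First beat times (s): {beat_times[:4]}")
print(f"First onset times (s): {onset_times[:4]}")
```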

            Whatever the task, the first step in most MIR research is to transform a one-dimensional discrete-time signal into a two-dimensional time-frequency representation, e.g. a spectrogram (Choi et al., 2017). Because many popular machine learning libraries are written in Python, a common tool for this conversion is the librosa library by McFee et al. (2015). Depending on the task, different spectrograms are created for subsequent feature extraction; common ones include the Short-Time Fourier Transform (STFT), the Mel-spectrogram, the Constant-Q Transform (CQT), and the Chromagram (McFee et al., 2015). Of the four, the STFT is the fastest and most efficient to compute, but it is often less useful for frequency- or pitch-related tasks because its center frequencies are spaced linearly. The Mel-spectrogram and the CQT give better results on more subjective tasks, such as boundary detection (Ullrich, Schlüter & Grill, 2014) or learning latent features for music recommendation (Van den Oord, Dieleman & Schrauwen, 2013, as cited in Choi et al., 2017), because their center frequencies are spaced logarithmically: the Mel scale approximates human pitch perception, and the CQT aligns its bins with musical pitch classes. Lastly, the Chromagram can be seen as an extension of the CQT in which all octaves of each pitch class are folded together, so the y-axis becomes the set of notes of the chromatic scale.
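            As a concrete illustration, the snippet below computes these four representations with librosa. This is a minimal sketch: the file name and the parameter values (FFT size, hop length, number of Mel bands and CQT bins) are placeholder assumptions, not recommendations drawn from the cited work.

```python
# Minimal sketch of the four common time-frequency representations in librosa.
import numpy as np
import librosa

y, sr = librosa.load("example.wav", sr=22050, mono=True)

# STFT: magnitude spectrogram with linearly spaced center frequencies.
stft = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Mel-spectrogram: energies warped onto the perceptual Mel scale.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=512)

# CQT: logarithmically spaced bins aligned with musical pitch (12 per octave).
cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=84, bins_per_octave=12))

# Chromagram: CQT energy folded into 12 pitch-class (chroma) bins.
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)

print(stft.shape, mel.shape, cqt.shape, chroma.shape)
```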

            The recent research trend for extracting features from the aforementioned spectrograms is to use deep learning, a form of machine learning that, instead of following strict, prescribed logic, learns rules from large amounts of example data to guide its behavior. This method has proven effective for tasks that are complex and subjective and whose ground truths are hard to define, such as music auto-tagging and genre classification (Choi et al., 2017, 2018). In the past decade, the success of online music services such as Spotify, Shazam, and Tidal has increased both commercial and academic attention on MIR research and applications. Though MIR applications are extensive and interdisciplinary, ranging from music information recognition plugins, score following, and hit-song prediction to new interfaces for music interaction and browsing (Schedl, Gómez, & Urbano, 2014), the predominant application of MIR research is playlist generation and recommendation, as it is one of the highest-level problems in MIR. The current trend is to use deep learning both to model similarities in music content and to learn from pre-existing, human-selected sequences in order to recommend the next track, i.e. to realize automatic playlist continuation (Schedl, 2019).
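            To illustrate what such a model can look like, the sketch below defines a toy convolutional network that maps a Mel-spectrogram to independent tag probabilities for auto-tagging. It is an assumed architecture for illustration only, not the network used in the cited papers, and it assumes PyTorch with placeholder dimensions (128 Mel bands, 50 tags).

```python
# Toy convolutional auto-tagger over Mel-spectrogram input (assumed architecture).
import torch
import torch.nn as nn

class AutoTagger(nn.Module):
    def __init__(self, n_tags: int = 50):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),  # global pooling over frequency and time
        )
        self.classifier = nn.Linear(64, n_tags)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel has shape (batch, 1, n_mels, n_frames)
        x = self.features(mel).flatten(1)
        return torch.sigmoid(self.classifier(x))  # independent tag probabilities

model = AutoTagger()
dummy = torch.randn(4, 1, 128, 256)  # a batch of four 256-frame Mel-spectrograms
print(model(dummy).shape)  # torch.Size([4, 50])
```

            In practice such a network would typically be trained with a binary cross-entropy loss against multi-hot tag annotations, which is one common setup for the auto-tagging task described above.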

            In conclusion, the merging of MIR and data science techniques will continue to drive the development of MIR applications. As we step into the age of 5G connectivity, exponentially growing volumes of multi-dimensional data will inevitably become available to us. Exactly how this will guide future MIR research topics is hard to predict, yet it seems certain that the accuracy and quality of current MIR and deep learning tasks will continue to improve as more data becomes available to learn from.

Citations

Choi, K., Fazekas, G., Cho, K., & Sandler, M. (2017). A tutorial on deep learning for music information retrieval. arXiv preprint arXiv:1709.04396.

Choi, K., Fazekas, G., Cho, K., & Sandler, M. (2017). The effects of noisy labels on deep convolutional neural networks for music classification. arXiv preprint arXiv:1706.02361.

Choi, K., Fazekas, G., Cho, K., & Sandler, M. (2018). The effects of noisy labels on deep convolutional neural networks for music tagging. IEEE Transactions on Emerging Topics in Computational Intelligence, 2(2), 139-149.

Lamere, P. (2008). Social tagging and music information retrieval. Journal of New Music Research, 37(2), 101-114.

McFee, B., Nieto, O., & Bello, J. P. (2015, October). Hierarchical Evaluation of Segment Boundary Detection. In ISMIR (pp. 406-412).

McFee, B., Raffel, C., Liang, D., Ellis, D. P. W., McVicar, M., Battenberg, E., & Nieto, O. (2015). librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference (pp. 18-25).

Schedl, M. (2019). Deep Learning in Music Recommendation Systems. Frontiers in Applied Mathematics and Statistics, 5, 44.

Schedl, M., Gómez, E., & Urbano, J. (2014). Music information retrieval: Recent developments and applications. Foundations and Trends® in Information Retrieval, 8(2-3), 127-261.

Ullrich, K., Schlüter, J., & Grill, T. (2014, October). Boundary Detection in Music Structure Analysis using Convolutional Neural Networks. In ISMIR (pp. 417-422).

Van den Oord, A., Dieleman, S., & Schrauwen, B. (2013). Deep content-based music recommendation. In Advances in neural information processing systems (pp. 2643-2651).