Effect of musical complexity

We expect the success of our method to be correlated with the complexity of the music it is trained on. We tested this hypothesis with listening tests, training the method on datasets of increasing complexity.

Each song was corrupted at a random position with a 750 ms gap and then reconstructed with our method. The reconstructed song was cropped to a randomly varying 2 to 4 seconds of context before and after the gap.
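For concreteness, a minimal sketch of this stimulus preparation is given below; the sampling rate, the mono signal array and the zeroing of the gap are our assumptions rather than details reported here.

```python
import numpy as np

def make_listening_stimulus(signal, fs=22050, gap_ms=750, seed=None):
    """Cut a 750 ms gap at a random position and keep a randomly varying
    2-4 s of context on each side, as in the listening tests described above."""
    rng = np.random.default_rng(seed)
    gap_len = int(gap_ms / 1000 * fs)
    pre = int(rng.uniform(2.0, 4.0) * fs)    # context kept before the gap
    post = int(rng.uniform(2.0, 4.0) * fs)   # context kept after the gap

    # Pick the gap position so that enough context remains on both sides.
    start = int(rng.integers(pre, len(signal) - gap_len - post))

    corrupted = signal.copy()
    corrupted[start:start + gap_len] = 0.0   # the 750 ms gap to be inpainted

    # Excerpt presented to the listeners (context + gap + context).
    return corrupted[start - pre:start + gap_len + post], start
```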

1) Simple MIDI

The simplest case we handled was 'hand-written' MIDI data. Here, the MIDI annotations have little variation since they are written down by humans on a quantized grid. For this case, we used the Lakh MIDI dataset, a collection of 176,581 unique MIDI files, 45,129 of which have been matched and aligned to entries in the Million Song Dataset. The Lakh MIDI dataset was created to facilitate large-scale music information retrieval, both symbolic (using the MIDI files alone) and audio content-based (using information extracted from the MIDI files as annotations for the matched audio files).
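Since the Lakh files are symbolic, they have to be rendered to audio before a time-frequency representation can be computed. Below is a minimal sketch using pretty_midi; the sampling rate and the choice of synthesizer are assumptions on our part, not settings reported in this section.

```python
import pretty_midi
import soundfile as sf

def render_lakh_midi(midi_path, out_path, fs=22050):
    """Render a hand-written (quantized) Lakh MIDI file to an audio file."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    # Simple additive synthesis; pm.fluidsynth(fs=fs) with a SoundFont
    # would give more realistic timbres.
    audio = pm.synthesize(fs=fs)
    sf.write(out_path, audio, fs)
    return audio
```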

  • Audio examples a)-d): original examples to be inpainted.
  • Audio examples a)-d): inpainted 750 ms with GACELA.
  • Audio examples a)-d): original examples.

2) MIDI recorded from human performances

For the second complexity level, we used MIDI data extracted from piano performances. Here, the added complexity is the lack of a strict musical structure, such as the precise tempo present at level 1). For this case, we used the MAESTRO dataset, which contains over 200 hours of paired audio and MIDI recordings from ten years of the International Piano-e-Competition. In these competitions, virtuoso pianists perform on Yamaha Disklaviers which, in addition to being concert-quality acoustic grand pianos, feature an integrated high-precision MIDI capture and playback system. The MIDI data includes key strike velocities and sustain/sostenuto/una corda pedal positions. The repertoire is mostly classical, including composers from the 17th to the early 20th century.
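To illustrate the expressive information this level adds, the following sketch reads note velocities and the three pedal controllers from one MAESTRO MIDI file with pretty_midi; the file path is hypothetical.

```python
import pretty_midi

# MIDI controller numbers of the three piano pedals captured by the Disklavier.
PEDALS = {64: "sustain", 66: "sostenuto", 67: "una corda"}

pm = pretty_midi.PrettyMIDI("maestro/2017/some_performance.midi")  # hypothetical path
piano = pm.instruments[0]

# Key strike velocities vary continuously, unlike quantized hand-written MIDI.
velocities = [note.velocity for note in piano.notes]
print(f"{len(piano.notes)} notes, velocity range {min(velocities)}-{max(velocities)}")

# Pedal positions are stored as control-change events.
for cc in piano.control_changes:
    if cc.number in PEDALS:
        print(f"{cc.time:8.3f} s  {PEDALS[cc.number]:10s} -> {cc.value}")
```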

  • Audio examples a)-d): original examples to be inpainted.
  • Audio examples a)-d): inpainted 750 ms with GACELA.
  • Audio examples a)-d): original examples.

3) Recordings of piano performances

For the third complexity level, we used real recorded performances on grand pianos: the same pieces as in the second complexity level. This level adds the acoustic complexity of a real instrument compared to a simple MIDI-synthesized sound.
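MAESTRO pairs every MIDI file with the audio recording of the same performance, so this level mainly amounts to training on the recorded waveform instead of a synthesized one. A minimal sketch of loading such a paired recording through the dataset's metadata CSV follows; the root path is hypothetical and the column names are those of MAESTRO v2.0.0.

```python
import pandas as pd
import soundfile as sf

root = "maestro-v2.0.0"                        # hypothetical dataset location
meta = pd.read_csv(f"{root}/maestro-v2.0.0.csv")

# Each row pairs a MIDI file with the recording of the same performance.
row = meta.iloc[0]
audio, fs = sf.read(f"{root}/{row['audio_filename']}")
print(row["canonical_title"], audio.shape, fs)
```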

  • Audio examples a)-d): original examples to be inpainted.
  • Audio examples a)-d): inpainted 750 ms with GACELA.
  • Audio examples a)-d): original examples.

4) Free music - single genre

The fourth level of complexity is the last one we used for the listening tests. Here we wanted to test the system in a broader scenario covering a more general definition of music; the added complexity is the interaction between several real instruments. To remove some variation from the dataset, we trained the network on a single genre at a time, in this case either rock or electronic music (for the listening test we only used rock samples). For this complexity level, we used the Free Music Archive (FMA) dataset, specifically a subset we generated by splitting the 'small' version by genre. FMA is an open and easily accessible dataset, commonly used for evaluating music information retrieval tasks. The small version of FMA comprises 8,000 30-second song segments from eight balanced genres, sampled at 44.1 kHz.
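A minimal sketch of deriving such a per-genre subset from the FMA metadata is shown below; the tracks.csv layout with its two-level header follows the FMA repository, while the paths and the choice of rock as genre are our assumptions.

```python
import pandas as pd

# FMA ships its metadata as a CSV with a two-level column header.
tracks = pd.read_csv("fma_metadata/tracks.csv", index_col=0, header=[0, 1])

# Keep the 8,000 clips of the 'small' subset, then split by top-level genre.
small = tracks[tracks["set", "subset"] == "small"]
rock_ids = small[small["track", "genre_top"] == "Rock"].index

# Track IDs map to the 30 s clips stored as fma_small/<first 3 digits>/<id>.mp3.
paths = [f"fma_small/{tid // 1000:03d}/{tid:06d}.mp3" for tid in rock_ids]
print(len(paths), "rock clips")
```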

  • Audio examples a)-d): original examples to be inpainted.
  • Audio examples a)-d): inpainted 750 ms with GACELA.
  • Audio examples a)-d): original examples.