A metal song consists of a list of riffs (short melodies that are repeated). In a metal song, each riff has its distinct identity and the beginnings and ends of riffs are usually clear, whereas some classical music has long passages of free flowing melodies that blend into each other.
Problem: There is no good model to generate instrumentals for metal.
Existing projects such as DeepSlayer (TransformerXL) or DADABOTS (RNN) generate music that does not have a clear time signature, let alone clearly demarcated riffs.
General MIDI generators (such as Staccato) do not produce remotely convincing metal music, perhaps because there is not enough metal in their training sets.
Metal guitar playing involves techniques such as palm muting and pinch harmonics, which are key to getting the distinctive metal sound. These techniques are usually ignored or denoted inconsistently in MIDI files, so they are missing from all the MIDI-based models such as DeepSlayer.
My Approach:
To get well-defined riffs: Create a data set where the beginnings and ends of riffs are annotated. Create tokens that indicate where riffs begin. The model then learns what the beginnings (and ends) of riffs look like.
To get palm muting correct: Instead of MIDI files, extract training data from Guitar Pro files (.gp), which indicate palm muting consistently.
Focus on generating guitar: There is more guitar data available than all-instruments data. Will train an encoder-decoder model to generate drum beats and bass lines based on the guitar (future project).
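To make the riff-token idea concrete, here is a minimal sketch of what such a tokenization could look like. The token names (`<riff>`, `<bar>`), the `Note` class, and the note-token layout are all hypothetical choices made for illustration; the actual tokenization used in this project may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Note:
    pitch: int                 # MIDI-style pitch number (assumed encoding)
    duration: int              # length in 16th notes (assumed unit)
    palm_muted: bool = False   # the flag that Guitar Pro files carry


RIFF_START = "<riff>"  # hypothetical marker: a new riff begins here
BAR = "<bar>"          # hypothetical marker: a new measure begins here


def tokenize(riffs: List[List[List[Note]]]) -> List[str]:
    """Flatten riffs -> measures -> notes into one token sequence."""
    tokens: List[str] = []
    for riff in riffs:
        tokens.append(RIFF_START)
        for measure in riff:
            tokens.append(BAR)
            for note in measure:
                mute = "pm" if note.palm_muted else "open"
                tokens.append(f"p{note.pitch}_d{note.duration}_{mute}")
    return tokens


song = [[[Note(40, 2, palm_muted=True), Note(47, 2)]]]
print(tokenize(song))
```

Because every riff and every measure opens with its own marker, a model trained on such sequences sees riff boundaries and bar lines explicitly rather than having to infer them from note timing.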
Data and Model.
Data: Scraped the Guitar Pro files of over 70k songs (with a high concentration of metal, hard rock, and punk songs). Developed a simple algorithm to annotate the beginnings and ends of individual riffs and a straightforward tokenization that takes measures and riffs into account. After transposing, left with a data set of 2 billion tokens.
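The transposing step can be read as data augmentation: each riff is shifted up or down by a few semitones to produce extra training examples. The sketch below assumes that reading. The function names, the shift range, and the pitch limits are illustrative assumptions, not the values used to build the data set.

```python
from typing import Iterable, List


def transpose(pitches: List[int], shift: int) -> List[int]:
    """Shift every pitch by the same number of semitones."""
    return [p + shift for p in pitches]


def augment(
    pitches: List[int],
    shifts: Iterable[int] = range(-3, 4),  # assumed range of semitone shifts
    low: int = 28,                         # assumed lowest playable pitch
    high: int = 88,                        # assumed highest playable pitch
) -> List[List[int]]:
    """Return every transposed copy that stays inside the playable range."""
    copies = []
    for shift in shifts:
        shifted = transpose(pitches, shift)
        if all(low <= p <= high for p in shifted):
            copies.append(shifted)
    return copies


riff = [40, 47, 52]
print(augment(riff, shifts=range(-1, 2)))
```

Transposition leaves rhythm and interval structure untouched, so each copy is still the same riff in a different key.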
Model: Karpathy's minGPT. Used the "GPT2" option (98 million parameters). The model will be released soon (the loss is still dropping).
Using the same six-note prompt as the previous sample but a higher temperature, the model produced something dissonant that reminds me of modern technical death metal:
Note that all of the samples obey a strict 4/4 time signature, and it is clear where individual riffs start and end, as desired.
If you pay attention, you can hear the model's choice of palm muting.
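For readers unfamiliar with the temperature setting mentioned for the dissonant sample, here is a minimal, library-free sketch of temperature sampling. The function names are illustrative and this is not minGPT's own code; it only shows the standard idea of dividing the logits by the temperature before the softmax.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after scaling by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits: List[float], temperature: float, rng=random) -> int:
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, 0.5))  # sharper distribution
print(softmax_with_temperature(logits, 2.0))  # flatter distribution
```

A higher temperature flattens the distribution, so less likely notes get picked more often, which is consistent with the more dissonant result described above.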