Dragonforce is a relatively popular power metal band whose songs mostly feature fantasy-themed lyrics. I’ve recently seen many comments online saying something to the effect of:

It wouldn’t be a Dragonforce song without mentioning

This got me thinking: is there any truth to this? Are there words that come up again and again across all of Dragonforce’s lyrics?

Obviously, I could manually go through and read the lyrics of all the Dragonforce songs, but since the band has a lot of songs (and this is a programming blog), I wrote code to do it for me.


The basic approach is to construct a word histogram. Once I have all the lyrics, I read them in and build a map where the “key” is a word and the “value” is the number of times that word has appeared.
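
As a quick illustration of the idea, here is a minimal sketch using Python's collections.Counter, a dict subclass built for this kind of tallying. The count_words name and the sample lines are made up for this example; they are not the analysis code or real lyrics.

```python
from collections import Counter


def count_words(lines):
    """Return a Counter mapping each lowercase word to how often it appears."""
    counts = Counter()
    for line in lines:
        # split() with no arguments splits on any run of whitespace
        counts.update(line.lower().split())
    return counts


# Made-up sample lines, for illustration only
sample = ["through the fire we ride", "the fire and the steel"]
histogram = count_words(sample)
print(histogram.most_common(2))  # [('the', 3), ('fire', 2)]
```

A plain dict works just as well; Counter only saves writing the lookup-and-increment step by hand.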


Counting Words

After some searching for lyric services and Python libraries to help me out, I settled on azapi, which pulls lyrics from AZLyrics. Once I had all the lyric files, the analysis code was actually pretty simple. The first step is to add code to read each individual lyric file (I saved a *.txt file for each song) and read the content line by line:

import glob

def analyze_lyrics(artist: str):
    # get lyric files for the specific artist.
    # lyric files are named <song title> - <artist title>.txt
    text_files = glob.glob('*{}.txt'.format(artist))

    # word histogram
    word_dict = dict()
    # loop over all files
    for file_path in text_files:
        with open(file_path, 'r') as file:
            # read the whole file into a single string
            text_content = file.read()
            # process the content
            for line in text_content.splitlines():
                pass  # process content...

Now that I have each line ready for processing, I’ll need to do some string sanitizing. While looking through the lyric files to double check things were working correctly, I noticed some lines like this:

[Music: ... list of artist (omitted here)...]

These lines need to be skipped by our processing code. I also want to remove most punctuation so that I only count the words. Finally, I remove extraneous spaces and make everything lowercase. Taking all these steps into account, I came up with the following code:

# ... other code
if '[' in line.strip()[0:3]:
    print('Ignoring line: {}'.format(line))
    # skip line
    continue
# remove punctuation
cleaned_up_line = remove_characters(line, ';:,.?*!()&%@#$-_/\\')
cleaned_up_line = cleaned_up_line.lower()
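
The body of remove_characters is not shown in this post. As a rough sketch, assuming it simply strips every listed character from the line, it could look like the following. Note that this version uses str.translate rather than an explicit loop, so it is one possible implementation and not necessarily the one used here.

```python
def remove_characters(line, unwanted):
    """Return `line` with every character that appears in `unwanted` removed."""
    # The third argument to str.maketrans lists characters to delete
    table = str.maketrans('', '', unwanted)
    return line.translate(table)


# Made-up sample line, for illustration only
print(remove_characters('Far away, we ride!', ';:,.?*!()&%@#$-_/\\'))  # Far away we ride
```

Apostrophes are not in the removal list, so contractions such as "we're" survive as single words.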

remove_characters is a utility function I wrote. It iterates through the characters of its second argument and removes each of them from the input line. That way I was able to remove all the punctuation I didn’t want in one call. Finally, I update the word histogram:

# split by whitespace
words = cleaned_up_line.split()
for word in words:
    word_dict[word] = word_dict.get(word, 0) + 1

word_dict is a Python dict which basically acts as a map. The keys are unique for that dict instance. In the above code I update the count by looking up the current value in the dict (if present, otherwise the default value is 0) and then increment the count. Now that I have a dict that holds words and their word counts, on to the best part of data analysis: data visualization!

Visualizing Results

From the get-go, I knew I wanted to create a word cloud image from the data I had accumulated. To do this I used the great Python library wordcloud:

from wordcloud import WordCloud, STOPWORDS

wordcloud = WordCloud(width=3840, height=2160, background_color='white',
                      stopwords=stopwords, max_words=300).generate_from_frequencies(word_dict)

wordcloud includes the functionality to exclude certain words from the counting process (referred to as stopwords). Originally, I tried using this to remove certain words from the word histogram, but they still showed up in the final image. It turns out wordcloud only applies stopwords when it does the word counting itself from raw text; the option is ignored when you pass in precomputed counts through generate_from_frequencies, as I was doing. So I removed the stop words myself:

stopwords = set(STOPWORDS)
stopwords.update(['to', 'the', 'and', 'in', 'for', 'a', 'of', 'my', 'our', 'we', "we're", 'i', "i'm"])

print('Removing stopwords...')
for word in stopwords:
    word_dict.pop(word, None)

The nice thing is that wordcloud provides a default list of stop words which you can import. It’s appropriately named STOPWORDS. In the above code, I create a set from the variable that wordcloud provides and then add some other words I wanted to exclude.

The final resulting wordcloud looked like this:


Finally, I also wanted to make a traditional bar graph of the top 40 words across all Dragonforce lyrics. To do this, I used the popular matplotlib library.

import matplotlib.pyplot as plt

# sort by count (highest at the top)
sorted_words = dict(sorted(word_dict.items(), key=lambda item: item[1], reverse=True))

word_count = 40
# keep only the top words and their counts
names = list(sorted_words.keys())[0:word_count]
values = list(sorted_words.values())[0:word_count]
# draw one bar per word, labeled with the word itself
plt.bar(range(len(values)), values, tick_label=names)
plt.title('Top {} Words of {} Lyrics'.format(word_count, artist))
plt.tick_params(axis='x', which='major', labelsize=7)
plt.savefig('{}-word-hist.png'.format(artist.lower()), dpi=300, bbox_inches='tight')

The result looked like this:



Overall, I was surprised that very few of the words mentioned in online comments (flames, far away, through, fire…) showed up in the top 40 words across all Dragonforce lyrics. The clear winner was will. You could argue that this word should also be excluded from the count for usages like I will go on... or something similar, but since I didn’t have a way to understand the context of the use of the word, I left it in.
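
To check specific words against the histogram rather than eyeballing the chart, a small helper can look up where each one ranks. This is only a sketch: word_ranks is a hypothetical function that is not part of the code described above, and the counts below are invented for illustration, not the real Dragonforce numbers.

```python
def word_ranks(word_counts, words_of_interest):
    """Map each word of interest to its 1-based rank by count, or None if absent."""
    # Order the words from most to least frequent
    ordered = sorted(word_counts, key=word_counts.get, reverse=True)
    rank_of = {word: position for position, word in enumerate(ordered, start=1)}
    return {word: rank_of.get(word) for word in words_of_interest}


# Invented counts, for illustration only
sample_counts = {'will': 50, 'fire': 12, 'through': 9, 'flames': 4}
print(word_ranks(sample_counts, ['flames', 'fire', 'dragon']))
# {'flames': 4, 'fire': 2, 'dragon': None}
```

A two-word phrase such as far away can never appear as a key, because the histogram only counts single words; checking phrases would require counting word pairs instead.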

Are you curious about another artist’s lyrics? Then I invite you to do your own analysis! The code presented above is available on my Github. Be sure to give it a star and follow me for future projects. Happy coding!
