Song Generator

  1. Introduction to Artificial Intelligence
  2. What is Machine Learning?
  3. What is Deep Learning?
  4. What is NLP?
  5. Relation between AI, ML, DL and NLP
  6. Idea behind Lyrics and Music generator
  7. Algorithms used
  8. Python Walkthrough

Introduction to Artificial Intelligence

Could anyone in the 19th century have imagined that we would have cellphones to talk with our near and dear ones? Yet we do.

Could anyone in the 20th century have imagined that we would be video chatting and building robots? Yet we are.

And now we are on the path to making our machines intelligent using Artificial Intelligence.

In the past decade, with the evolution of fields such as Machine Learning, Deep Learning, NLP and Computer Vision, Artificial Intelligence has advanced rapidly. But what is Artificial Intelligence?
It is the process of making a machine or computer act and react like a human. Several techniques and algorithms are applied to achieve this.

Ray Kurzweil, the American inventor and futurist, has said, “Artificial intelligence will reach human levels by around 2029. Follow that out further to, say, 2045, and we will have multiplied the intelligence – the human biological machine intelligence of our civilization – a billion-fold.”

There are three different levels of artificial intelligence:

  1. General AI: AI is said to be general when it can perform any intellectual task with the same accuracy as a human.
  2. Active AI: AI is called active when it can beat humans at any task.
  3. Narrow AI: AI is called narrow when it performs a specific task better than a human. Narrow AI is where current research stands.

Artificial Intelligence is an amalgamation of several fields such as Machine Learning, Deep Learning, Computer Vision, Natural Language Processing and Bioinformatics. Nowadays almost every company wants to apply Artificial Intelligence and Data Science to its existing technologies. Although AI is growing rapidly, it has limitations, and one needs to understand when to apply AI techniques and when not to. Moreover, with all the automation and innovative projects, there is a misconception that Artificial Intelligence is here to take all the jobs.

However, that is not true. AI applications are here to make our lives easier. For example, code generator applications help programmers write code more efficiently. At the same time, AI has to be designed with care. As Elon Musk, technology entrepreneur and investor, has put it, “AI doesn’t have to be evil to destroy humanity – if AI has a goal and humanity just happens to come in the way, it will destroy humanity as a matter of course without even thinking about it, no hard feelings”.

What is Machine Learning?

We live in a digital era with data in abundance, both structured and unstructured. The two types are far from balanced: unstructured data is commonly estimated to make up about 80 percent of all available data, and structured data the remaining 20 percent. Various techniques exist to extract and analyse the patterns in this data and to make useful predictions from it.

Machine Learning is one of the most widely used tools for analysing patterns in data. In layman's terms, machine learning is the process of training a machine to perform tasks automatically. The process can be broken down into the following four steps.

  1. Collecting the data: The data is collected from public or private sources according to the use case.
  2. Selecting a classifier: An appropriate classifier is selected for the use case.
  3. Training the classifier: The classifier is trained on the collected data.
  4. Making predictions: The trained classifier makes predictions on new data.

After reading the steps above, you may ask what a classifier is. A classifier uses features extracted from the data to make predictions.

Now imagine your friend’s hobby is listening to music, and he picks his favorites based on the genre of the song, the intensity of the song and the gender of the singer. He mostly likes high-intensity rock songs sung by a male singer, and he will most probably dislike rock songs sung by female singers. If a new high-intensity rock song by a male singer is released, your friend will most likely enjoy it. But what if a new medium-intensity song sung by a male singer is released? Will your friend like it?

This is where machine learning kicks in. ML algorithms such as Naive Bayes and K-Nearest Neighbours find the pattern in your friend’s likes and dislikes and then predict the probability that he will like the song.
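
As a rough illustration of the friend example, here is a minimal k-nearest-neighbour sketch in plain Python. The feature encoding (rock = 1, intensity on a 0 to 1 scale, male singer = 1), the function name knn_predict and the handful of songs in the listening history are all invented for illustration; they are not real data.

```python
from collections import Counter
import math

# Hypothetical listening history: (is_rock, intensity, is_male_singer) -> liked?
# Every value here is made up for illustration.
history = [
    ((1, 0.9, 1), True),
    ((1, 0.8, 1), True),
    ((1, 0.9, 0), False),
    ((0, 0.3, 1), False),
    ((1, 0.7, 1), True),
    ((0, 0.2, 0), False),
]

def knn_predict(song, data, k=3):
    """Return (majority label, share of votes) among the k closest songs."""
    by_distance = sorted(data, key=lambda item: math.dist(item[0], song))
    votes = Counter(label for _, label in by_distance[:k])
    label, count = votes.most_common(1)[0]
    return label, count / k

# A new medium-intensity rock song sung by a male singer.
new_song = (1, 0.5, 1)
liked, share = knn_predict(new_song, history)
print(liked, share)
```

With this toy history the three closest songs are all ones the friend liked, so the sketch predicts that he will like the new song. A real system would use many more songs and features.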

What is Deep Learning?

Now one may ask: if the machine was already learning with algorithms and making predictions, why do we need deep learning? And what is it?

The answer is simple. Deep learning is a subset of Machine Learning in which the learning phase of the model is carried out by neural networks.

Consider the same example as above. The steps remain the same; the difference is in training. Each input is fed into a neuron, where it is multiplied by a weight. The neuron sums its weighted inputs, adds a bias and passes the result through an activation function. That output becomes an input to the next layer, and the same process is repeated for each layer of the neural network. The output layer predicts either the probability of each class (for a classification problem) or a value (for a regression problem).
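
The layer-by-layer computation can be sketched in a few lines of plain Python. The network size, the weights and the biases below are arbitrary, untrained numbers chosen only to make the arithmetic visible, and the input encoding is the same hypothetical one as in the friend example.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(sum(w * x) + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

# Hypothetical input: (is_rock, intensity, is_male_singer)
song = [1.0, 0.5, 1.0]

# Arbitrary, untrained parameters (assumption, for illustration only).
hidden_w = [[0.5, 1.0, 0.5], [-1.0, 0.5, 0.0]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0]]
output_b = [-0.5]

hidden = dense(song, hidden_w, hidden_b, relu)                # hidden layer
like_probability = dense(hidden, output_w, output_b, sigmoid)[0]  # output layer
print(hidden, like_probability)
```

Training a network means adjusting these weights and biases until the predicted probabilities match the known answers as closely as possible.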

A well-trained neural network can predict your friend’s choice with higher accuracy. In the words of Andrew Ng, former chief scientist of China’s major search engine Baidu and one of the leaders of the Google Brain project, “The analogy to deep learning is that the rocket engine is the deep learning models and the fuel is the huge amounts of data we can feed to these algorithms.”

The answer to “why deep learning?” lies in how deep learning models learn: they extract high-level features from the data incrementally, on their own. This reduces the need for domain expertise and hand-crafted feature extraction.

What is NLP?

Imagine you are having a conversation with your friend in some language, be it English, Spanish, German or Hindi. Your friend understands the language you are speaking and replies accordingly. Now imagine you want to have the same conversation with your computer. The big question is: will the machine understand you?

The answer is a big NO. Machines only understand numbers, not human languages. So if you want a machine to understand a language, you need to convert the text into numbers and give the machine rules from which it can make sense of them.

You may be wondering why a section titled NLP opens with a conversation example. The reason is that, in layman's terms, NLP (Natural Language Processing) is a way to make a machine understand human language. Various Machine Learning and Deep Learning algorithms and concepts are used to do so. The applications of NLP keep growing. The most popular one is the chatbot, which reportedly raised revenue for the companies that first adopted it. Other applications include recommendation systems, sentiment analysis (used by brands and companies to analyse customer satisfaction), resume screening using NER (Named Entity Recognition) and many more.

Relation between AI, ML, DL and NLP

The relation between them is depicted in the following diagram.

[Figure: diagram of the relation between AI, ML, DL and NLP]

The diagram shows that AI is the superset containing all the existing and emerging techniques such as Machine Learning, Deep Learning and Natural Language Processing. As discussed earlier, Deep Learning is a subset of Machine Learning. NLP is a part of AI whose applications are built using both Machine Learning and Deep Learning, so it is placed inside AI, overlapping both ML and DL.

Idea behind Lyrics and Music generator

AI is an emerging field in which innovative applications are being developed every day. A song generator is one such idea. A song consists of lyrics and background music, and AI techniques are being used both to compose music and to generate song lyrics. This article focuses mainly on the Deep Learning algorithms that can generate song lyrics when trained well.


Algorithms used

You have read what deep learning is and why it came into the picture. The two most commonly used deep neural network architectures are CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network).

A CNN is a deep learning algorithm that takes an input, assigns importance to its various features through learnable weights and biases, and uses those features to tell one input from another. CNNs are mainly applied to use cases involving images, such as image processing, image recognition and image classification.

An RNN is a deep learning algorithm generally applied to use cases where the data to be predicted depends on the data that came before it (this is called sequential data). It suits such data because it has an internal memory that lets it remember earlier inputs.

Let’s now focus on generating lyrics for a song using deep neural networks. If you look carefully at our use case, the data is sequential, because the words of a song depend on the words before them. Thus we will be using an RNN variant, the LSTM.

Let’s dive into LSTM (Long Short-Term Memory). LSTM is an extension of RNN that helps the network remember over longer sequences; in other words, LSTM is the memory part of the RNN. The diagram below depicts the architecture of LSTM.

[Figure: LSTM architecture]

The components of LSTM are:

  1. Cell State
  2. Forget gate
  3. Input gate
  4. Output gate 
  5. Current cell memory
  6. Current cell output

Let’s understand each component through a story. Imagine going through your college album and remembering the great times you spent with your friends. This is like the Cell State of an LSTM, which carries the memory of the past. Now your doorbell rings, and when you open the door one of your friends is standing in front of you. Your brain instantly focuses on the memories related to that friend and lets the rest fade. This is the Forget gate, which decides what to keep from the past memory and what to discard. You invite your friend in and start a conversation. As your friend talks, you decide which parts of what you hear are worth remembering. This is the Input gate, which decides how much of the new information is written into memory. Your memory is now updated: this is the current cell memory. Finally, you choose what to say in reply based on what you now remember. This is the Output gate, which decides which part of the memory is exposed as the current cell output. The current cell memory and the current cell output are then passed on to the next cell as its inputs.

Hopefully this example helps you understand how LSTM works and the role each component plays in prediction.
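
For readers who prefer code to stories, here is a minimal sketch of one time step of an LSTM cell with a single unit, written in plain Python using the standard LSTM equations. The parameter names (wf_x, wi_h and so on) and their values are arbitrary, untrained assumptions; real layers use weight matrices rather than single numbers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell (scalar input and state)."""
    # Forget gate: how much of the old cell memory to keep.
    forget = sigmoid(p["wf_x"] * x + p["wf_h"] * h_prev + p["bf"])
    # Input gate: how much of the new candidate memory to admit.
    inp = sigmoid(p["wi_x"] * x + p["wi_h"] * h_prev + p["bi"])
    # Candidate memory proposed by the current input.
    candidate = math.tanh(p["wc_x"] * x + p["wc_h"] * h_prev + p["bc"])
    # Current cell memory: kept old memory plus admitted new memory.
    c = forget * c_prev + inp * candidate
    # Output gate: how much of the memory to expose.
    out = sigmoid(p["wo_x"] * x + p["wo_h"] * h_prev + p["bo"])
    # Current cell output.
    h = out * math.tanh(c)
    return h, c

# Arbitrary, untrained parameters (assumption, for illustration only).
names = ["wf_x", "wf_h", "bf", "wi_x", "wi_h", "bi",
         "wc_x", "wc_h", "bc", "wo_x", "wo_h", "bo"]
params = {name: 0.5 for name in names}

h, c = 0.0, 0.0
for x in [1.0, -1.0, 0.5]:          # a short made-up input sequence
    h, c = lstm_step(x, h, c, params)
    print(round(h, 4), round(c, 4))
```

Each call returns the cell output and cell memory that are fed into the next step, which is how the network carries context from one word to the next.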

Python Walkthrough

Let’s now implement LSTM-based lyrics generation in Python. The dataset used for this use case is Taylor Swift song lyrics from all her albums. The objective is to generate meaningful lyrics for a song.

Let’s start by importing the required packages and loading the dataset.

#Importing the required packages
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
import tensorflow.keras.utils as ku 
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf

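#Loading the dataset (the file name is a placeholder; the original path is not given)
data = pd.read_csv('taylor_swift_lyrics.csv')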


Everyone writes differently, so the data may contain words in all capital letters, with only the first letter capitalised, or all in lowercase. Computers only understand numbers, and when we convert this data, the same word written in different ways gets different numeric representations. For example, CAT and Cat would be treated as two different words. Thus we convert all the data to lowercase.

#taking lyrics column in dataframe and converting text to lowercase
data['lyric'] = data['lyric'].str.lower()
#printing the data
print(data.head())

Punctuation removal

In any language, punctuation plays a very important role, as it can change the flow of a text and the way it is interpreted. For this model, however, punctuation is treated as noise. Thus we remove the punctuation from the lyrics column.

import re
import string
#escaping the punctuation characters so they are safe inside a regex character class
data['lyric'] = data['lyric'].str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True)


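To convert the words into numbers, the first step is tokenization. In Keras, the Tokenizer class is used to tokenize the text.

corpus_song = data['lyric'].astype(str).tolist()  # list of lyric lines (assumed; this line is missing from the listing)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus_song)  # builds the word index; without it word_index is empty
total_no_words = len(tokenizer.word_index) + 1  # +1 because index 0 is reserved for padding

The total number of distinct words in the data (the vocabulary size) is 2415.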
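
To make the idea of a word index concrete, the sketch below builds one by hand in plain Python. It only roughly mimics what a tokenizer does (most frequent word first, index 0 left free for padding). The two sample lines and the helper to_sequence are invented for illustration.

```python
from collections import Counter

# Two made-up lines standing in for the lyrics corpus.
lines = ["i am in love", "love is in the air"]

counts = Counter(word for line in lines for word in line.split())
# Most frequent word gets index 1; ties keep their order of first appearance.
ordered = sorted(counts, key=lambda w: -counts[w])
word_index = {word: i for i, word in enumerate(ordered, start=1)}
vocab_size = len(word_index) + 1  # +1 accounts for the reserved index 0

def to_sequence(line):
    """Replace every known word in the line with its integer index."""
    return [word_index[w] for w in line.split() if w in word_index]

print(word_index)
print(to_sequence("i am in love"), vocab_size)
```

Here "in" and "love" appear twice, so they receive the smallest indices, and the line "i am in love" becomes a short list of integers the model can work with.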

Input Sequence

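After the text is tokenized, each lyric line is converted into its sequence of token ids and then expanded into n-gram prefixes: the first two words, the first three words, and so on. Each prefix later becomes one training example in which the last word is the one to predict.

#input sequence
input_song_sequences = []
for line in corpus_song:
    tokenlist = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(tokenlist)):
        ngram_sequence = tokenlist[:i+1]
        input_song_sequences.append(ngram_sequence)  # keep every prefix as a training sequence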
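
To see what the prefix slicing in the loop above produces, here is a tiny standalone sketch. The helper name ngram_prefixes and the token ids are hypothetical.

```python
def ngram_prefixes(tokens):
    """Return every prefix of length >= 2: [t0, t1], [t0, t1, t2], ..."""
    return [tokens[:i + 1] for i in range(1, len(tokens))]

# Hypothetical token ids for a four-word line such as "i am in love".
prefixes = ngram_prefixes([3, 4, 1, 2])
print(prefixes)
# [[3, 4], [3, 4, 1], [3, 4, 1, 2]]
```

A line of n words therefore yields n - 1 training sequences, and a one-word line yields none.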


Padding the sequence

The input sequences are then padded. Padding means making all sequences the same length by filling the shorter ones with zeros. It is needed because the n-gram sequences contain different numbers of words, while the model expects inputs of a fixed length.

# Input Sequence Padding

max_input_sequence_len = max([len(x) for x in input_song_sequences])
input_song_sequences = np.array(pad_sequences(input_song_sequences, maxlen=max_input_sequence_len, padding='pre'))
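
The effect of padding='pre' can be illustrated with a small pure-Python stand-in. The helper pad_pre is hypothetical and only imitates the behaviour for sequences no longer than the maximum length.

```python
def pad_pre(sequences, pad_value=0):
    """Left-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [[pad_value] * (max_len - len(seq)) + seq for seq in sequences]

padded = pad_pre([[3, 4], [3, 4, 1], [3, 4, 1, 2]])
for row in padded:
    print(row)
# [0, 0, 3, 4]
# [0, 3, 4, 1]
# [3, 4, 1, 2]
```

Padding on the left keeps the most recent word in the last column of every row, which is convenient because that last column is later split off as the label.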


The maximum input sequence length is 18.


After preprocessing and normalizing the text data, let’s build a model. For modeling we are using a Bidirectional LSTM model. LSTM, as discussed earlier, stores information from the inputs. There are two types of LSTM models: Unidirectional and Bidirectional.

A Unidirectional model reads the input sequence in one direction only, so at any point it holds information from the past alone. A Bidirectional model runs two ways, one from past to future and one from future to past, so at any point in time it preserves information from both directions. Bidirectional LSTMs are also known as BiLSTMs, and they are generally used in use cases where context is involved.

Apart from BiLSTM, an embedding layer is also used. The basic idea behind embedding is that each word is mapped to a dense vector, so that related words end up close together in a multi-dimensional vector space.

The Dropout layer is applied after the BiLSTM layer. It randomly sets a fraction of that layer's outputs to zero during training, which helps prevent the model from overfitting.

In the model we apply two Dense layers. A Dense layer is a fully connected layer in which every output of the previous layer is connected to every unit of this layer. The first Dense layer has half as many units as the total number of words (total_no_words/2) and uses the ReLU activation function. The Rectified Linear Unit (ReLU) returns 0 if its input x is negative; otherwise it returns x unchanged.

The second Dense layer has one unit per word (total_no_words) and uses the softmax activation function. Softmax converts the layer's raw scores into a probability for every target class, with the probabilities summing to 1.
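
Both activation functions are short enough to write out by hand. The sketch below is a plain-Python illustration with made-up scores; it is not how Keras implements them internally.

```python
import math

def relu(x):
    """Return 0 for negative inputs and the input itself otherwise."""
    return x if x > 0 else 0.0

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)                       # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

activations = [relu(v) for v in [-2.0, 0.0, 3.5]]
probs = softmax([2.0, 1.0, 0.1])              # made-up scores for three classes
print(activations)
print(probs, sum(probs))
```

The largest score receives the largest probability, and in the lyrics model each class corresponds to one word of the vocabulary.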

The performance metric we use for the model is accuracy: the fraction of predictions for which the predicted class (here, the next word) matches the actual one.

The loss function is categorical cross-entropy, also called softmax loss. It is used together with the softmax activation to train the model to output a probability distribution over a fixed number of classes, and it is mainly used in multi-class classification.
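
As a small worked example of the loss, the sketch below computes categorical cross-entropy for a one-hot target in plain Python. The probability vectors are invented for illustration.

```python
import math

def categorical_cross_entropy(one_hot_target, predicted_probs):
    """-sum(t * log(p)); with a one-hot target this is -log(p of the true class)."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, predicted_probs) if t > 0)

target = [0, 1, 0]                  # the true next word is class 1
confident = [0.05, 0.90, 0.05]      # made-up prediction that favours the true class
unsure = [0.40, 0.30, 0.30]         # made-up prediction that does not

loss_confident = categorical_cross_entropy(target, confident)
loss_unsure = categorical_cross_entropy(target, unsure)
print(loss_confident, loss_unsure)
```

The loss is small when the model puts high probability on the correct word and grows as that probability shrinks, which is exactly the signal training tries to minimise.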

Adam is an optimization algorithm widely used for training deep neural networks. Its name is derived from adaptive moment estimation: it uses estimates of the first and second moments of the gradient to adapt the learning rate for each weight of the neural network.
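
The update rule can be sketched for a single parameter in plain Python. The function adam_step and the toy objective f(w) = (w - 3)^2 are illustrative assumptions; the default values of beta1, beta2 and eps are the commonly quoted ones.

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter at time step t (starting at 1)."""
    m = beta1 * m + (1 - beta1) * grad           # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy problem: minimise f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 2001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
print(w)
```

Because the step is divided by the square root of the second-moment estimate, each weight effectively gets its own step size. In Keras the same optimizer is selected simply by passing optimizer='adam' when compiling a model.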

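model = Sequential()
model.add(Embedding(total_no_words, 160, input_length=max_input_sequence_len-1))
model.add(Bidirectional(LSTM(200, return_sequences=True)))
# Assumed from the description above (these lines are missing from the listing); the rate and size are illustrative
model.add(Dropout(0.2))
model.add(LSTM(100))  # collapses the sequence to one vector so there is one prediction per sample
model.add(Dense(total_no_words // 2, activation='relu', kernel_regularizer=regularizers.l2(0.001)))  # units must be an integer
model.add(Dense(total_no_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])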

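Next, the model predictors and output labels are generated. In every padded sequence, all tokens except the last form the predictors and the last token is the label, which is then one-hot encoded.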

model_predictors, output_label = input_song_sequences[:,:-1],input_song_sequences[:,-1]
output_label = ku.to_categorical(output_label, num_classes=total_no_words)
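
The slicing and one-hot encoding can be checked on a tiny example in plain Python. The helper split_and_one_hot, the padded rows and the vocabulary size of 5 are made up for illustration.

```python
def split_and_one_hot(padded_rows, vocab_size):
    """Split each row into (all but last token, one-hot of last token)."""
    predictors = [row[:-1] for row in padded_rows]
    labels = []
    for row in padded_rows:
        one_hot = [0] * vocab_size
        one_hot[row[-1]] = 1        # mark the position of the word to predict
        labels.append(one_hot)
    return predictors, labels

rows = [[0, 0, 3, 4], [0, 3, 4, 1], [3, 4, 1, 2]]
X, y = split_and_one_hot(rows, vocab_size=5)
print(X)
print(y)
```

Each label row has a single 1 at the index of the word to predict, which is the format categorical cross-entropy expects.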

#In this step we are now fitting the model.
history = model.fit(model_predictors, output_label, epochs=100, verbose=1)

The function Song_Generate defined below converts the seed text to a sequence of token ids, pre-pads it to the length the model expects, predicts the most likely next word and appends that word to the text. This process is repeated once for every word to be generated.

def Song_Generate(text, next_words):
    for _ in range(next_words):
        tokenlist = tokenizer.texts_to_sequences([text])[0]
        tokenlist = pad_sequences([tokenlist], maxlen=max_input_sequence_len-1, padding='pre')
        # predict_classes was removed from recent TensorFlow releases; argmax over predict() is equivalent
        predicted = np.argmax(model.predict(tokenlist, verbose=0), axis=-1)[0]
        output = ""
        # look up the word that belongs to the predicted index
        for word, index in tokenizer.word_index.items():
            if index == predicted:
                output = word
                break
        text += " " + output
    print(text)
    return text

Let’s predict the next 10 words after ‘I am in love’

next_words = 10 
text = "i am in love"
Song_Generate(text, next_words)

Let’s predict the next 100 words after ‘I am in love’

next_words = 100
text = "i am in love"
Song_Generate(text, next_words)


The results are not very satisfying, because the data is limited to a single singer’s lyrics in a single language. To obtain better results, we can combine the lyrics of more singers in the same way. Fusion songs, which mix more than one language, are also an upcoming trend, so we can feed the model lyrics in the languages of our choice and then check the accuracy of the generated lyrics.
