Sentiment Analysis using LSTM

Abishek PSS
11 min read · May 4, 2021

In an era where data is available in abundance, businesses have started to leverage it as an opportunity to grow. Today’s marketers are rightfully obsessed with metrics, but customers are more than just data points. It is easy to overlook customers’ feelings and emotions, which can be difficult to quantify. With the help of technology, however, companies can capture them. Sentiment essentially relates to feelings: attitudes, emotions, and opinions. Sentiment analysis refers to the practice of applying Natural Language Processing and text analysis techniques to identify and extract subjective information from a piece of text. With the sentiment analysis tool that I built, businesses can understand their customers and target audience better and make wise decisions in developing products that will sell at scale. Since the marketing team is a primary pillar of any sustainable business, I feel this tool adds value for them.


The goal is to predict the sentiment of a given user review with the help of a Long Short Term Memory (LSTM) model trained on the dataset. The sentiment of reviews is binary: an IMDB rating < 5 results in a sentiment score of 0, and a rating >= 7 results in a sentiment score of 1.

Data Fields

  • id — Unique ID of each review
  • sentiment — Sentiment of the review; 1 for positive reviews and 0 for negative reviews
  • review — Text of the review

Now let us understand the technologies used to build this application.

Long Short Term Memory (LSTM):

LSTM Network

The core idea of an LSTM is the cell state and its various gates. The cell state acts as a transport highway that carries relevant information all the way down the sequence chain. You can think of it as the “memory” of the network. The cell state, in theory, can carry relevant information throughout the processing of the sequence. So even information from earlier time steps can make its way to later time steps, reducing the effects of short-term memory. As the cell state goes on its journey, information gets added to or removed from it via gates. The gates are small neural networks that decide which information is allowed on the cell state. During training, the gates learn which information is relevant to keep and which to forget.
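To make the role of the gates concrete, here is a minimal sketch of a single LSTM time step, reduced to one input and one hidden unit so that plain Python is enough. It follows the standard textbook LSTM equations. The function name `lstm_step` and all weight values are hypothetical and chosen only for illustration; they are not taken from the models trained later in this article.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell (one input, one hidden unit)."""
    # Every gate looks at the current input and the previous hidden state.
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate values
    c = f * c_prev + i * g   # keep part of the old memory, add part of the candidate
    h = o * math.tanh(c)     # expose a gated view of the cell state
    return h, c


# Hypothetical weights: the forget gate is almost fully open (bias 10)
# and the input gate is almost fully closed (bias -10).
params = dict(wf=0.0, uf=0.0, bf=10.0,
              wi=0.0, ui=0.0, bi=-10.0,
              wo=0.0, uo=0.0, bo=0.0,
              wg=1.0, ug=0.0, bg=0.0)

h, c = 0.0, 1.0  # start with a cell state of 1.0
for x in [0.5, -0.3, 0.8]:
    h, c = lstm_step(x, h, c, params)

print(c)  # stays very close to the initial 1.0
```

With these weights the cell state barely changes across the three steps, which is the "transport highway" behaviour described above. A trained network learns gate weights that open and close depending on the input.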

Technically, LSTM inputs can only be real numbers. One way to convert symbols to numbers is to assign a unique integer to each symbol based on its frequency of occurrence. For example, for a text with 112 unique symbols, this produces a dictionary with entries such as [ “,” : 0 ] [ “the” : 1 ], …, [ “council” : 37 ],…,[ “spoke” : 111 ]. The reverse dictionary is also generated since it is used to decode the output of the LSTM.
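A minimal sketch of such a frequency-based mapping, using only the standard library. The function name `build_dataset` and the sample sentence are assumptions made for illustration; the most frequent symbol receives index 0.

```python
from collections import Counter


def build_dataset(words):
    """Map each symbol to an integer, most frequent first, plus the reverse map."""
    counts = Counter(words).most_common()
    dictionary = {word: idx for idx, (word, _) in enumerate(counts)}
    reverse_dictionary = {idx: word for word, idx in dictionary.items()}
    return dictionary, reverse_dictionary


# Hypothetical sample text, already split into symbols.
sample = "the council spoke , the mice listened , the cat slept".split()

dictionary, reverse_dictionary = build_dataset(sample)
encoded = [dictionary[w] for w in sample]           # symbols -> integers
decoded = [reverse_dictionary[i] for i in encoded]  # integers -> symbols

print(dictionary["the"], dictionary[","])
print(decoded == sample)
```

The Keras `Tokenizer` used later in this article plays the same role for the review texts.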

LSTMs are primarily used in NLP and time series forecasting. An LSTM is itself a type of recurrent neural network (RNN), so the results achieved using plain RNNs can also be achieved using LSTMs.


Tokenization is essentially splitting a phrase, sentence, paragraph, or an entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.


There are multiple ways to tokenize sentences in Python. To name a few:

  • split()
  • RegEx
  • NLTK
  • spaCy
  • Keras
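As a small illustration of the first two options in the list above, the sketch below tokenizes one sentence with `str.split()` and with a regular expression. The sample sentence and the regex pattern are assumptions chosen for this example.

```python
import re

# Hypothetical sample sentence.
text = "The movie wasn't great, but the acting was superb!"

# 1. split(): splits on whitespace only, so punctuation stays attached.
split_tokens = text.split()

# 2. RegEx: keep runs of letters, digits and apostrophes, after lower-casing.
regex_tokens = re.findall(r"[a-z0-9']+", text.lower())

print(split_tokens)
print(regex_tokens)
```

Note how `split()` leaves `great,` and `superb!` with their punctuation, while the regular expression returns clean word tokens.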

Early Stopping:

A problem with training neural networks is in the choice of the number of training epochs to use.

Too many epochs can lead to overfitting of the training dataset, whereas too few may result in an underfit model. Early stopping is a method that allows you to specify an arbitrarily large number of training epochs and stop training once the model performance stops improving on a hold-out validation dataset.

After multiple trials, the accuracy threshold for early stopping was set to 95%. Whenever the training accuracy reaches this value during model training, the weights are saved and training stops.

Keras can monitor several metrics for early stopping: loss and accuracy on the training data, and val_loss and val_accuracy on the validation data.

Early stopping with Loss as metric
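The rule behind loss-based early stopping can be sketched in plain Python: remember the best validation loss seen so far and stop once it has not improved for a given number of epochs. The function name `early_stopping_epoch`, the `patience` value and the loss values are hypothetical. In Keras the same behaviour is provided by `tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=...)`.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a list of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # stop training here
    return len(val_losses) - 1, best_epoch  # ran through all epochs


# Hypothetical validation losses: improving at first, then overfitting.
val_losses = [0.69, 0.52, 0.41, 0.38, 0.40, 0.43, 0.47]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # → 5 3
```

Training stops at epoch 5, two epochs after the best loss at epoch 3, instead of running through all the requested epochs.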

Save model checkpoints:

Often, model training takes hours or even days; in such cases we prefer to save model checkpoints. After you have trained a neural network, you will also want to save it for future use and for deployment to production. So, what is a saved neural network model? It primarily contains the network design (the graph) and the values of the network parameters that we have trained. TensorFlow offers two ways to save and restore your progress.

Checkpoints capture the exact value of all parameters (tf.Variable objects) used by a model. Checkpoints do not contain any description of the computation defined by the model and thus are typically only useful when source code that will use the saved parameter values is available.

The SavedModel format, on the other hand, includes a serialized description of the computation defined by the model in addition to the parameter values (checkpoint). Models in this format are independent of the source code that created the model. They are thus suitable for deployment via TensorFlow Serving, TensorFlow Lite, TensorFlow.js, or programs in other programming languages (C, C++, Java, Go, Rust, C#, etc.).

Is accuracy the right metric to validate your model?

The answer is no. Let’s take an example.

You will quickly notice that the accuracy for this model is very, very high, at 99.9%!! Wow!

But… (well, you know this is coming, right?) what if I mentioned that the positive over here is actually someone who is sick and carrying a virus that can spread very quickly? Or that the positive here represents a fraud case? Or that the positive here represents a terrorist whom the model labels as a non-terrorist? Well, you get the idea. The cost of a misclassified actual positive (a false negative) is very high in the three circumstances that I posed.
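A minimal sketch of this effect with made-up numbers: a hypothetical test set of 10,000 samples with only 10 positives, and a "model" that always predicts the negative class. The data is an assumption for illustration, not taken from this project.

```python
# Hypothetical imbalanced test set: 10 positives among 10,000 samples.
y_true = [1] * 10 + [0] * 9990
# A "model" that always predicts the negative class.
y_pred = [0] * 10000

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(y_true)
denominator = tp + 0.5 * (fp + fn)
f1 = tp / denominator if denominator else 0.0

print(accuracy)  # → 0.999
print(f1)        # → 0.0
```

The accuracy is 99.9% although every single positive case is missed, while the F1 score for the positive class is 0.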

OK, so now you realized that accuracy is not the be-all and end-all model metric to use when selecting the best model…now what?

The F1 score gives a much better view of the performance of the model.

I calculated it using the formula

F1 Score
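For reference, the standard definition of the F1 score, written in terms of true positives (TP), false positives (FP) and false negatives (FN), is shown below. The second form matches the expression used in `predict_func` later in this article; the symbols P (precision) and R (recall) are introduced here for illustration.

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \, P \, R}{P + R} = \frac{TP}{TP + \tfrac{1}{2}\,(FP + FN)}
```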

Let’s get started with how to build this classifier.

1. Importing the necessary libraries

import numpy as np
import pandas as pd
import re
import string
import os
import warnings
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from wordcloud import WordCloud, STOPWORDS
# Use a separate name so that nltk's stopwords corpus is not overwritten
wordcloud_stopwords = set(STOPWORDS)
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
from sklearn.metrics import confusion_matrix

2. Read the train and test files using pandas

df_train = pd.read_csv("./labeledTrainData.tsv", header=0, delimiter="\t", quoting=3)
df_test=pd.read_csv("./testData.tsv", header=0, delimiter="\t", quoting=3)

3. Data Preprocessing:

For any NLP problem it is essential that the raw data is cleaned and processed into the desired format before it is fed into the model. Here I have converted the data to lower case and removed punctuation and digits using the string library, which proved faster than the approach I had seen in the reference material. Additionally, the stopwords (such as “the”, “a”, “an”, “in”) were removed using the NLTK library, as these words don't matter when indexing.

def data_cleaning(raw_data):
    # Remove punctuation and digits in a single pass
    raw_data = raw_data.translate(str.maketrans('', '', string.punctuation + string.digits))
    # Lower-case and split into words
    words = raw_data.lower().split()
    # Drop English stopwords (a set makes the lookup fast)
    stops = set(stopwords.words("english"))
    useful_words = [w for w in words if w not in stops]
    return " ".join(useful_words)

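4. Now let's visualize the most frequent words in the dataset using the wordcloud library.

def generate_wordcloud(data, title=None):
    # The arguments below are assumed (the call is cut off here); data is taken to be an iterable of strings
    wordcloud = WordCloud(stopwords=STOPWORDS).generate(" ".join(data))
    fig = plt.figure(1, figsize=(15, 15))
    plt.imshow(wordcloud)
    plt.title(title or "")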

5. Import the TensorFlow APIs required to build the LSTM model

import tensorflow as tf
# from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense , Input , LSTM , Embedding, Dropout , Activation, Flatten
from tensorflow.keras.layers import Bidirectional, GlobalMaxPool1D, SpatialDropout1D
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras import initializers, regularizers, constraints, optimizers, layers

6. Tokenize words

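y = df_train["sentiment"].values
train_reviews = df_train["review"]
test_reviews = df_test["review"]
max_features = 6000
max_length = 370  # matches the input shape (None, 370) in the model summary
tokenizer = Tokenizer(num_words=max_features)
# The tokenizer has to be fitted before texts_to_sequences can map words to indices
tokenizer.fit_on_texts(train_reviews)
list_tokenized_train = tokenizer.texts_to_sequences(train_reviews)
list_tokenized_test = tokenizer.texts_to_sequences(test_reviews)
# Pad or truncate every review to max_length; X_train and X_test are used by the models
X_train = pad_sequences(list_tokenized_train, maxlen=max_length)
X_test = pad_sequences(list_tokenized_test, maxlen=max_length)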
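The models in the experiments take inputs of a fixed length (`max_length`), while tokenized reviews have varying lengths, so the sequences need to be padded or truncated. The sketch below imitates in plain Python what `pad_sequences` does with its default settings, which pad and truncate at the front. The helper name `pad_sequence` and the sample sequences are hypothetical, and `maxlen` is assumed to be at least 1.

```python
def pad_sequence(seq, maxlen, value=0):
    """Pad or truncate one integer sequence to exactly maxlen items (maxlen >= 1)."""
    seq = list(seq)[-maxlen:]                    # truncate: keep the last maxlen tokens
    return [value] * (maxlen - len(seq)) + seq   # pad: fill values go in front


# Hypothetical tokenized reviews of different lengths.
sequences = [[5, 12, 7], [3, 9, 4, 8, 2, 6]]
padded = [pad_sequence(s, maxlen=4) for s in sequences]

print(padded)  # → [[0, 5, 12, 7], [4, 8, 2, 6]]
```

The short review is filled with zeros at the front and the long one keeps only its last four tokens, so every row has the same length.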

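7. Now let's define the callbacks that are necessary to perform early stopping and save model checkpoints.

class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Stop as soon as the training accuracy exceeds 95%
        accuracy = (logs or {}).get('accuracy')
        if accuracy is not None and accuracy > 0.95:
            print('\n Stopped Training!\n')
            self.model.stop_training = True

def train_model(model, model_name, n_epochs, batch_size, X_data, y_data, validation_split):
    # Save the weights after every epoch; the epoch number becomes part of the file name
    checkpoint_path = model_name + "_cp-{epoch:04d}.ckpt"
    checkpoint_dir = os.path.dirname(checkpoint_path)
    cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=checkpoint_path, save_weights_only=True, verbose=1)
    callbacks = myCallback()
    # Assumed fit call, built from this function's arguments and the two callbacks above
    history = model.fit(X_data, y_data, epochs=n_epochs, batch_size=batch_size,
                        validation_split=validation_split, callbacks=[callbacks, cp_callback])
    return history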

8. Define generate graph function

To keep the code clean, let's create a function to draw graphs, as we will need one after every experiment to view the change in model performance.

def generate_graph(history):
    plt.plot(history.history['accuracy'], 'b')
    plt.plot(history.history['val_accuracy'], 'r')
    plt.title('Model Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.legend(['Train', 'Validation'], loc='upper left')
    plt.show()

9. Define prediction function

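This function handles the calculation of the F1 score.

def predict_func(model):
    prediction = model.predict(X_test)
    # Threshold the output at 0.5 and flatten to a 1-D array of 0/1 labels
    y_pred = (prediction > 0.5).astype(int).ravel()
    # The part of the test id after the underscore is the rating, which gives the true label
    df_test["sentiment"] = df_test["id"].map(lambda x: 1 if int(x.strip('"').split("_")[1]) >= 5 else 0)
    y_test = df_test["sentiment"]
    # Called as (y_pred, y_test): rows are predictions, columns are true labels
    cf_matrix = confusion_matrix(y_pred, y_test)
    # cf_matrix[0][0] counts reviews both predicted and labelled 0,
    # so this is the F1 score with the negative class (label 0) as reference
    f1_score_calc = cf_matrix[0][0] / (cf_matrix[0][0] + 0.5 * (cf_matrix[0][1] + cf_matrix[1][0]))
    print('F1-score: %.3f' % f1_score_calc)
    print("Confusion Matrix : ", cf_matrix)
    return f1_score_calc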

10. Experiments

10.1 Model A: This is the base model, used as a benchmark for the models built afterwards.

class Model_A():
    def __new__(cls):
        inp = Input(shape=(max_length, ))
        embed_size = 128
        x = Embedding(max_features, embed_size)(inp)
        x = LSTM(60, return_sequences=True, name='lstm_layer')(x)
        # Reduce the per-time-step outputs to one vector per review
        x = GlobalMaxPool1D()(x)
        x = Dropout(0.1)(x)
        x = Dense(50, activation="relu")(x)
        x = Dropout(0.1)(x)
        x = Dense(1, activation="sigmoid")(x)
        model = Model(inputs=inp, outputs=x)
        model.compile(loss='binary_crossentropy', optimizer='SGD', metrics=['accuracy'])
        return model

model_a = Model_A()
history_a = train_model(model_a, "model_a", 10, 64, X_train, y, 0.2)
model_a_score = predict_func(model_a)
F1-score: 0.663
Confusion Matrix : [[12106 11916] [ 394 584]]

Model Summary

Model: "model"
Layer (type)                   Output Shape         Param #
input_1 (InputLayer)           [(None, 370)]        0
embedding (Embedding)          (None, 370, 128)     768000
lstm_layer (LSTM)              (None, 370, 60)      45360
global_max_pooling1d (Global   (None, 60)           0
dropout (Dropout)              (None, 60)           0
dense (Dense)                  (None, 50)           3050
dropout_1 (Dropout)            (None, 50)           0
dense_1 (Dense)                (None, 1)            51
Total params: 816,461
Trainable params: 816,461
Non-trainable params: 0
Model A

10.2 Model B: The changes in this model are the Adam optimizer and a SpatialDropout1D layer, which performs the same function as Dropout but drops entire 1D feature maps instead of individual elements.

class Model_B():
    def __new__(cls):
        inp = Input(shape=(max_length, ))
        x = Embedding(max_features, 128)(inp)
        x = SpatialDropout1D(0.25)(x)
        x = LSTM(100, dropout=0.5)(x)
        x = Dropout(0.5)(x)
        x = Dense(1, activation='sigmoid')(x)
        model = Model(inputs=inp, outputs=x)
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

model_b = Model_B()
history_b = train_model(model_b, "model_b", 10, 64, X_train, y, 0.2)
model_b_score = predict_func(model_b)
F1-score: 0.861
Confusion Matrix : [[10974 2012] [ 1526 10488]]
Model B
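The difference between Dropout and SpatialDropout1D in Model B can be sketched by looking at the masks they apply to a sequence of shape (time steps × channels). This is a plain-Python illustration rather than the Keras implementation: the function names, the shape and the seed are hypothetical. The real layers additionally scale the kept values by 1 / (1 - rate) during training.

```python
import random


def dropout_mask(timesteps, channels, rate, rng):
    """Element-wise mask: every (time step, channel) entry is dropped independently."""
    return [[0 if rng.random() < rate else 1 for _ in range(channels)]
            for _ in range(timesteps)]


def spatial_dropout_mask(timesteps, channels, rate, rng):
    """Channel-wise mask: one keep/drop decision per channel, shared by all time steps."""
    keep = [0 if rng.random() < rate else 1 for _ in range(channels)]
    return [list(keep) for _ in range(timesteps)]


rng = random.Random(0)  # fixed seed so the run is repeatable
elementwise = dropout_mask(4, 6, rate=0.25, rng=rng)
channelwise = spatial_dropout_mask(4, 6, rate=0.25, rng=rng)

for row in elementwise:
    print(row)
print()
for row in channelwise:
    print(row)
```

In the channel-wise mask every row is identical, so a dropped embedding dimension is removed for the whole review, whereas the element-wise mask differs from one time step to the next.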

10.3 Model C: I used a smaller batch size while training and increased the number of training epochs to check the stability of the model over a long run, assuming that early stopping will prevent the model from overfitting. Additionally, I’ve used a Bidirectional LSTM, which helps the model perform better because it reads the sequence in both directions, whereas a plain LSTM is unidirectional.

class Model_C():
    def __new__(cls):
        embed_size = 128
        model = Sequential()
        model.add(Embedding(max_features, embed_size))
        # return_sequences=True makes the LSTM emit one vector per time step,
        # so the Dense layers below are applied at every step
        model.add(Bidirectional(LSTM(75, return_sequences=True)))
        model.add(Dense(16, activation="relu"))
        model.add(Dense(8, activation="relu"))
        model.add(Dense(1, activation="sigmoid"))
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

model_c = Model_C()
history_c = train_model(model_c, "model_c", 10, 128, X_train, y, 0.2)
model_c_score = predict_func(model_c)
F1-score: 0.863
Confusion Matrix : [[10993 1993] [ 1507 10507]]
Model C

10.4 Model D: Defining a less complex model to achieve the same accuracy. This comes in handy in production, where we require high model throughput at low computational cost.

class Model_D():
    def __new__(cls):
        embed_size = 64
        model = Sequential()
        model.add(Embedding(max_features, embed_size))
        model.add(LSTM(50, return_sequences=True))
        model.add(Dense(16, activation="relu"))
        model.add(Dense(1, activation="sigmoid"))
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model

model_d = Model_D()
history_d = train_model(model_d, "model_d", 20, 16, X_train, y, 0.2)
model_d_score = predict_func(model_d)
F1-score: 0.837
Confusion Matrix : [[10683 2335] [ 1817 10165]]
Model D

10.5 Random Forest Classifier: I wanted to understand how far traditional machine learning models fall behind when it comes to sentiment analysis on such a large dataset. It turns out the result is very poor, far worse than our base model.

from sklearn.ensemble import RandomForestClassifier

# A non-empty string such as "False" is truthy, so the boolean True is used here to keep bootstrap sampling on
model_random_forest = RandomForestClassifier(n_estimators=150, random_state=45, bootstrap=True, criterion="gini", min_samples_split=10, min_samples_leaf=1)
model_random_forest.fit(X_train, y)
random_forest_score = predict_func(model_random_forest)
F1-score: 0.547
Confusion Matrix : [[6995 6074] [5505 6426]]

11. Model Comparison — Visualization

results_fine_tuned = {"Model_A": model_a_score,
                      "Model_B": model_b_score,
                      "Model_C": model_c_score,
                      "Model_D": model_d_score,
                      "Random Forest": random_forest_score}

plt.figure(figsize=(7, 7))
plt.title('Comparison of models')
plt.ylabel('Model Score')
# Pass the data as keyword arguments; newer seaborn versions no longer accept them positionally
plots = sns.barplot(x=list(results_fine_tuned.keys()), y=list(results_fine_tuned.values()))
# Write the score above each bar
for p in plots.patches:
    plots.annotate(format(p.get_height(), '.3f'),
                   (p.get_x() + p.get_width() / 2., p.get_height()),
                   ha='center', va='center',
                   xytext=(0, 9),
                   textcoords='offset points')
plt.show()

Best model — Model C 86.3%
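As a quick arithmetic check, the reported F1 scores can be recomputed from the confusion matrices printed in the experiments above, using the same expression as `predict_func`. The helper name `f1_from_matrix` is an assumption for this sketch; the matrices and scores are copied from the outputs shown earlier.

```python
def f1_from_matrix(m):
    """Same expression as in predict_func: m[0][0] / (m[0][0] + 0.5 * (m[0][1] + m[1][0]))."""
    return m[0][0] / (m[0][0] + 0.5 * (m[0][1] + m[1][0]))


# Confusion matrices and F1 scores as printed in the experiments.
reported = {
    "Model A": ([[12106, 11916], [394, 584]], 0.663),
    "Model B": ([[10974, 2012], [1526, 10488]], 0.861),
    "Model C": ([[10993, 1993], [1507, 10507]], 0.863),
    "Model D": ([[10683, 2335], [1817, 10165]], 0.837),
    "Random Forest": ([[6995, 6074], [5505, 6426]], 0.547),
}

recomputed = {name: round(f1_from_matrix(m), 3) for name, (m, _) in reported.items()}
best = max(recomputed, key=recomputed.get)

for name, (_, score) in reported.items():
    print(name, recomputed[name], score)
print("Best:", best)  # → Best: Model C
```

All five recomputed values agree with the printed scores, and Model C comes out on top at 0.863.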


  1. With the help of the knowledge I gained from the TensorFlow in Practice certification on Coursera that I did last year, I was able to implement the model using LSTM and improve its performance.
  2. Used my own method of calculating the F1 score from the generated confusion matrix.
  3. Implemented early stopping to prevent the model from overfitting.
  4. Saved model checkpoints after every epoch, which allows us to retrain the model later by loading the weights. This also makes it possible to host the model in the cloud, ready to predict once it is integrated with an API.

Conclusion and Challenges Faced

Before starting this project, I sat back to see what could be done to build a well-performing model that predicts the sentiment of a given text. Going over the learnings from my previous projects and assignments, I understood that traditional machine learning models do not perform well on such a large dataset. Hence I decided to go with LSTM models.

Since I was dealing with a larger dataset this time, training took longer and I would run into errors while training. With the help of model checkpoints, I was able to load the model from previous checkpoints and resume training.

Tuning the batch size was a key factor in improving the model. Keeping the model simple and increasing the batch size helped achieve a higher F1 score. This way the model doesn't overfit either.