Sentiment analysis is the process of mining text such as opinions, sentences, and comments to predict the emotion or feeling it expresses, using Natural Language Processing (NLP). The analysis classifies text into three classes: positive, negative, and neutral. Sentiment analysis is widely applied to reviews and survey responses, online social media, and healthcare materials.
In this project, I perform sentiment analysis on a database of movie reviews. The data used in this project is available here: http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
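As a minimal sketch of how the reviews could be read into memory: it assumes the archive has been extracted locally and keeps its usual layout of one text file per review under train/pos, train/neg, test/pos, and test/neg. The function name load_reviews and the 1/0 label encoding are illustrative choices, not something the project prescribes.

```python
from pathlib import Path


def load_reviews(root, split):
    """Read reviews from <root>/<split>/{neg,pos}/*.txt.

    Returns (texts, labels), where label 1 = positive and 0 = negative.
    The directory layout is an assumption about the extracted archive.
    """
    texts, labels = [], []
    for label_name, label in (("neg", 0), ("pos", 1)):
        folder = Path(root) / split / label_name
        # Sort for a reproducible order across runs and platforms.
        for path in sorted(folder.glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8"))
            labels.append(label)
    return texts, labels
```

For example, load_reviews("aclImdb", "train") would return the training reviews, assuming the archive was extracted into a folder named aclImdb in the working directory.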
We will be using several Python libraries and frameworks specific to text analytics, NLP, and ML. Dependencies: Pandas, NumPy, Scikit-learn, SciPy, and NLTK. The NLTK version should be >= 3.2.4; otherwise, ToktokTokenizer may not be present.
The first step is cleaning, pre-processing, and normalizing the text so that unwanted phrases and words are brought to a standard format. Raw reviews contain irrelevant symbols and characters that create noise, so standardization is needed to build meaningful features.
Here, standardization means removing special characters and stopwords, expanding contractions, and reducing words to their root forms through stemming and lemmatization.
This description is short, but the process consists of many steps.
So, let us take a look at the key terms used here:
1. Text Pre-processing: The processing applied to the data before it is used for analysis. This also includes feature engineering, where each review string is split into tokens.
2. Cleaning Text: Removing unnecessary content, such as HTML markup, before feature extraction. The Beautiful Soup library does this job.
3. Expanding Contractions: In English, contractions are shortened versions of words or phrases created by removing specific letters and sounds, e.g., I’ll, You’ve. Expanding them restores the full form (I will, You have).
4. Stemming and Lemmatization: A stem is the root form from which the different forms of a word are built, and stemming is the process of extracting that stem. E.g., watch, watches, and watching all have the stem watch. Lemmatization is similar but returns a valid dictionary word (the lemma).
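The four steps above can be sketched with the standard library alone. This is a simplified illustration, not the project's actual pipeline: the regular expression stands in for Beautiful Soup, the contraction and stopword tables are tiny hypothetical samples, and crude_stem is a naive suffix stripper rather than a real stemmer such as the ones NLTK provides.

```python
import html
import re

# Tiny illustrative tables; a real pipeline would use much larger ones.
CONTRACTIONS = {"i'll": "i will", "you've": "you have",
                "don't": "do not", "it's": "it is"}
STOPWORDS = {"a", "an", "the", "is", "it", "i", "you",
             "and", "of", "to", "this", "was"}


def strip_html(text):
    """Drop tags such as <br />; a stand-in for Beautiful Soup."""
    return html.unescape(re.sub(r"<[^>]+>", " ", text))


def expand_contractions(text):
    """Replace known contractions (lowercase input) with their full form."""
    text = text.replace("\u2019", "'")  # normalize curly apostrophes
    pattern = re.compile(
        r"\b(" + "|".join(map(re.escape, CONTRACTIONS)) + r")\b")
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0)], text)


def crude_stem(word):
    """Naive suffix stripping; NOT the Porter algorithm."""
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Clean, expand, filter, and stem a review; returns a token list."""
    text = strip_html(text).lower()
    text = expand_contractions(text)
    text = re.sub(r"[^a-z\s]", " ", text)  # remove special characters, digits
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]


print(normalize("I\u2019ll keep watching.<br />She watches it!"))
# prints ['will', 'keep', 'watch', 'she', 'watch']
```

The order matters here: contractions are expanded before special characters are removed, because stripping the apostrophe first would leave fragments such as "i" and "ll" that no longer match the contraction table.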
Sentiment Analysis Using Supervised Learning
- Prepare train and test datasets
- Pre-process and normalize text documents
- Feature engineering
- Model training
- Model prediction and evaluation
The following four models are trained and compared:
- Logistic Regression with bag of words
- Logistic Regression with TFIDF
- Support Vector Machines (SVM) with bag of words
- Support Vector Machines (SVM) with TFIDF
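The workflow and the four model combinations listed above can be sketched with Scikit-learn as follows. This is a hedged illustration under several assumptions: the toy sentences stand in for the normalized IMDB reviews, LinearSVC is used as the SVM (the exact SVM variant and hyperparameters used in the project are not stated), and the function name evaluate_models is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data standing in for the real reviews (1 = positive, 0 = negative).
train_texts = ["great movie loved it", "wonderful acting great story",
               "loved the wonderful cast", "terrible movie hated it",
               "boring story terrible acting", "hated the boring plot"]
train_labels = [1, 1, 1, 0, 0, 0]
test_texts = ["great wonderful story", "boring terrible movie"]
test_labels = [1, 0]


def evaluate_models(train_texts, train_labels, test_texts, test_labels):
    """Fit each feature/classifier pair and return {name: (accuracy, f1)}."""
    models = {
        "LR + BOW": make_pipeline(
            CountVectorizer(), LogisticRegression(max_iter=1000)),
        "LR + TFIDF": make_pipeline(
            TfidfVectorizer(), LogisticRegression(max_iter=1000)),
        "SVM + BOW": make_pipeline(CountVectorizer(), LinearSVC()),
        "SVM + TFIDF": make_pipeline(TfidfVectorizer(), LinearSVC()),
    }
    results = {}
    for name, model in models.items():
        # The vectorizer is fitted on the training texts only, then the
        # same vocabulary is reused to transform the test texts.
        model.fit(train_texts, train_labels)
        predicted = model.predict(test_texts)
        results[name] = (accuracy_score(test_labels, predicted),
                         f1_score(test_labels, predicted))
    return results


results = evaluate_models(train_texts, train_labels, test_texts, test_labels)
for name, (acc, f1) in results.items():
    print(f"{name}: accuracy={acc:.2f}, f1={f1:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline keeps feature engineering and model training together, so the test set never influences the vocabulary or the IDF weights.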
Logistic Regression with bag of words (BOW)
Logistic Regression with TFIDF
Support Vector Machines (SVM) with bag of words (BOW)
Support Vector Machines (SVM) with TFIDF
Of the four supervised models, Logistic Regression with bag of words (BOW) performed best, with an accuracy of 90.59% and an F1 score of 90.62%.
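For reference, the two reported metrics can be computed from true and predicted labels as in the sketch below. It assumes binary labels with 1 as the positive class; the function name accuracy_and_f1 is illustrative, and the example labels are made up rather than taken from the project's results.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)


acc, f1 = accuracy_and_f1([1, 1, 0, 0], [1, 0, 0, 0])
print(acc, round(f1, 4))  # prints 0.75 0.6667
```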
Sentiment Analysis Using Unsupervised Learning (Lexicon-Based)
AFINN: A list of words rated for valence with an integer between -5 (most negative) and +5 (most positive).
SentiWordNet: A lexicon built on WordNet that assigns positivity, negativity, and objectivity scores to word senses, which are used to judge positive and negative sentiment.
VADER: A lexicon- and rule-based sentiment tool whose word ratings were evaluated and validated by human raters.
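The general idea behind lexicon-based scoring can be sketched as follows. The word scores below are hypothetical values in the AFINN style (integers from -5 to +5) and are not taken from the actual AFINN list; the zero threshold and the function names are also assumptions made for illustration.

```python
import re

# Hypothetical AFINN-style scores; NOT the real AFINN values.
TOY_LEXICON = {"good": 3, "great": 3, "love": 3,
               "bad": -3, "awful": -3, "boring": -2}


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Sum the valence of every known word; unknown words count as 0."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(lexicon.get(token, 0) for token in tokens)


def lexicon_label(text, lexicon=TOY_LEXICON):
    """Map the summed score to positive, negative, or neutral."""
    score = lexicon_score(text, lexicon)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_label("A great film, I love it"))  # prints positive
print(lexicon_label("Boring and awful."))        # prints negative
```

No training data is needed, which is what makes the approach unsupervised: the labels come entirely from the pre-built word list.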
Of the three unsupervised models, AFINN performed best, with an accuracy of 71.18% and an F1 score of 74.68%.
Comparing the accuracy and F1 scores of the two approaches shows that supervised learning performs better than unsupervised learning on this dataset: the best supervised model reached an accuracy of 90.59% and an F1 score of 90.62%, while the best unsupervised model reached an accuracy of 71.18% and an F1 score of 74.68%.