By patrickwalsh6079

Sentiment Analysis of Movie Reviews using Naive Bayes & Support Vector Machines

INTRODUCTION


In today's digital age, where opinions are readily shared and influential, understanding customer sentiment has become paramount for businesses striving to stay ahead in the competitive market landscape. Sentiment analysis, a technique that involves extracting insights from text to determine the emotional tone behind it, offers invaluable insights into customer perceptions and preferences. One area where sentiment analysis holds significant potential is in the movie industry.


Movie reviews, whether posted on social media, review platforms, or blogs, are a treasure trove of opinions that reflect audience reactions to films. Analyzing these reviews can provide movie studios, producers, and distributors with critical insights into audience sentiment, helping them gauge the success of their productions, identify areas for improvement, and make informed decisions regarding marketing and distribution strategies.


In this report, we delve into the realm of sentiment analysis of movie reviews using advanced machine learning techniques. By harnessing the power of Naive Bayes and Support Vector Machines (SVM), we aim to decipher the sentiment embedded within movie reviews and shed light on the business value derived from such analysis. Through our exploration, we seek to uncover the most impactful words and features that drive audience perceptions, ultimately empowering businesses to make data-driven decisions that resonate with their target audience and drive success in the dynamic world of cinema.





ANALYSIS


Dataset


The Movie Reviews dataset consists of two corpora: a positive review corpus and a negative review corpus. Each corpus contains 12,500 movie review documents.


The corpora can be found here:




First, we will load the filenames of the two corpora using the os module and display the first 10 filenames of each corpus, as seen below.




After each corpus is loaded as a list of filenames, each entry needs to be formatted to contain the full filepath for its document. This step is necessary before the corpora can be loaded into a dataframe using CountVectorizer() from the SciKit-Learn package. Once the corpora contain the full filepath information, we will examine the first 10 files of each corpus to confirm this step was completed successfully.
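The report shows this step as screenshots; a minimal sketch of the same idea follows, assuming the two corpora live in local folders (the directory names movie_reviews/pos and movie_reviews/neg are hypothetical placeholders, not the report's actual paths).

import os

# Hypothetical folder names -- adjust to wherever the corpora are stored locally.
pos_dir = 'movie_reviews/pos'
neg_dir = 'movie_reviews/neg'

# Load each corpus as a list of filenames and peek at the first 10.
pos_files = os.listdir(pos_dir)
neg_files = os.listdir(neg_dir)
print(pos_files[:10])
print(neg_files[:10])

# Prepend the directory so each entry becomes a full filepath, which is
# what CountVectorizer(input='filename') will expect in the next step.
pos_paths = [os.path.join(pos_dir, f) for f in pos_files]
neg_paths = [os.path.join(neg_dir, f) for f in neg_files]
print(pos_paths[:10])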




Data Preparation


Next, the corpora must be tokenized and vectorized so that they are in a numeric format for further processing and training with machine learning models. We import CountVectorizer() from SciKit-Learn and instantiate a CountVectorizer() object. This step tokenizes the documents and creates document-term matrices (DTMs), which will then be converted to vectorized dataframes. This is what transforms the unstructured text data into a structured, numeric format for supervised modeling. The screenshot below captures the tokenization and creation of DTMs for the corpora.



Once the two corpora are in DTM form, the final step is to vectorize the DTMs using the .toarray() method and Pandas. The final output, shown below, is a structured dataframe with rows and columns.
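A sketch of this vectorization step, continuing from the filepath lists built above (pos_paths and neg_paths); the negative corpus is handled the same way as the positive corpus shown here.

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# input='filename' tells the vectorizer to open and tokenize each file on disk.
vectorizer = CountVectorizer(input='filename')

# Tokenize the documents and build a sparse document-term matrix (DTM).
pos_dtm = vectorizer.fit_transform(pos_paths)

# Convert the sparse DTM to a dense array and wrap it in a dataframe:
# one row per document, one column per unique token.
pos_df = pd.DataFrame(pos_dtm.toarray(),
                      columns=vectorizer.get_feature_names_out())
print(pos_df.shape)  # (12500, number of unique tokens)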



Each row represents a document, and each column represents a unique token (essentially a word) from that document. As seen in the screenshot above, there are 12,500 documents in the positive reviews corpus, with 55,428 unique tokens across all the documents.


We are not done yet. Additional data cleaning must be performed to remove unwanted or unnecessary tokens from the dataframes. The first cleaning step is to drop columns whose tokens contain numbers, as we only want tokens that potentially represent words. The next cleaning step is to drop any columns with special characters. After that, we will drop columns whose tokens are fewer than three letters long or more than 15 letters long. This helps cut out short words like ‘a’, ‘to’, and ‘the’ and also drops exceptionally long tokens, such as those with repeated letters like ‘zzzzzzzzzzzzzzzz’. Finally, we want only dictionary-like words as tokens, so we will drop tokens that repeat the same letter three or more times in a row, such as ‘aaab’ or ‘azzzzaa’.
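One possible implementation of these cleaning rules is sketched below; the exact code in the report appears only as screenshots, so treat this as an approximation of the rules described above.

import re

def clean_columns(df):
    """Drop token columns that are unlikely to be real dictionary words."""
    drop = []
    for col in df.columns:
        if re.search(r'\d', col):              # contains a number
            drop.append(col)
        elif re.search(r'[^a-zA-Z]', col):     # contains a special character
            drop.append(col)
        elif len(col) < 3 or len(col) > 15:    # too short or too long
            drop.append(col)
        elif re.search(r'(.)\1\1', col):       # same letter three or more times in a row
            drop.append(col)
    return df.drop(columns=drop)

pos_df = clean_columns(pos_df)
neg_df = clean_columns(neg_df)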




This data cleaning should accomplish two things: first, cut out unhelpful tokens that add no meaning to the text, which should help the models be more accurate; and second, reduce the dimensionality of the dataframes to decrease training and processing times.


Positive reviews dataframe (cleaned)



Negative reviews dataframe (cleaned)




The cleaned dataframes are seen above. While some non-dictionary words remain in these cleaned dataframes, the cleaning steps have helped to reduce the dimensionality and noise. Most of the columns are now dictionary words, as seen in the screenshot below.



The last data preparation step before moving on to model training is to combine the positive and negative review dataframes into one dataframe with a label column for supervised learning. As seen in the screenshot below, there are now 25,000 rows (documents) and 71,457 columns (words/tokens) in the combined dataframe. However, some rows now contain null values as a result of the concatenation step.



To eliminate the null values, we will replace them with zero. The finalized, combined, vectorized, and cleaned dataframe is seen below.
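A sketch of this combine-and-fill step, continuing from the cleaned pos_df and neg_df above; the label column name (LABEL) and the label values are assumptions for illustration, not necessarily the ones used in the report.

import pandas as pd

# Label each corpus, then stack the two dataframes for supervised learning.
pos_df['LABEL'] = 'positive'
neg_df['LABEL'] = 'negative'
combined_df = pd.concat([pos_df, neg_df], axis=0, ignore_index=True)

# Tokens that appear in only one corpus become NaN after concatenation;
# a missing count simply means the token never occurred, so fill with zero.
combined_df = combined_df.fillna(0)
print(combined_df.shape)  # (25000, number of token columns + 1 label column)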








Models


Now that the data is ready, it’s time to build and train the models. This exercise will create five distinct models for comparison:


  1. Multinomial Naive Bayes,

  2. Linear SVC Support Vector Machine (SVM),

  3. SVM with polynomial kernel,

  4. SVM with Radial Basis Function (RBF) kernel,

  5. SVM with Stochastic Gradient Descent (SGD) classifier.



Each model will be instantiated with specific hyperparameters to tune the model for optimal performance and training speed. The dataframe will be split into training and testing sets with a 70/30 split. This means that 70% of the data will be used for training and 30% will be set aside for testing.
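A minimal sketch of the split, continuing from combined_df above (the random_state value is an arbitrary choice for reproducibility, not taken from the report):

from sklearn.model_selection import train_test_split

# Separate the features from the label column (column name assumed above).
X = combined_df.drop(columns=['LABEL'])
y = combined_df['LABEL']

# 70/30 split: 70% of the documents for training, 30% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)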





Scaling the data


Additionally, the SVM models (models 2-5) will benefit from scaled data to increase training speed. Due to their complexity, SVM models may require several hours to train on a large dataset and may fail to converge or reach an acceptable level of accuracy within a reasonable amount of training time. When I tried running the Linear SVC model (model 2) on unscaled data, I received this warning:





Training took almost 6 hours, and I got a convergence warning after 10,000 iterations. The warning recommends scaling the data beforehand with either StandardScaler or MinMaxScaler. When tested, this model achieved an accuracy of 83.2%, which is pretty good, but we will try scaling the data to see if we get better accuracy and a faster training time.


While discussed in more detail in the RESULTS section, this model did perform better with scaled data using the MinMaxScaler with a range of -1 to 1 than with unscaled data. The training time was cut down from 6 hours to 4.5 hours, and the accuracy increased from 83.2% to 86.7%. See the screenshot below for how scaling was accomplished on both the training and testing sets.
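A sketch of the scaling step described above, fitting the scaler on the training set only and reusing it on the test set to avoid data leakage:

from sklearn.preprocessing import MinMaxScaler

# Scale every token-count column into the range [-1, 1].
scaler = MinMaxScaler(feature_range=(-1, 1))
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)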




The instantiation and hyperparameters for the 5 models are detailed below.
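The screenshot with the exact settings did not survive extraction, so the hyperparameter values below are illustrative placeholders only; the model types match the list above. Per the summary table, Naive Bayes trains on the unscaled counts, models 2-4 on the MinMax-scaled data, and model 5 on StandardScaler-scaled data.

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import SGDClassifier

# Hyperparameter values are placeholders, not the report's exact settings.
model_1 = MultinomialNB()                             # 1. Multinomial Naive Bayes
model_2 = LinearSVC(C=1.0, max_iter=10000)            # 2. Linear SVC SVM
model_3 = SVC(kernel='poly', degree=3, C=1.0)         # 3. SVM, polynomial kernel
model_4 = SVC(kernel='rbf', C=1.0, gamma='scale')     # 4. SVM, RBF kernel
model_5 = SGDClassifier(loss='hinge', max_iter=1000)  # 5. SVM via SGD

# Naive Bayes needs non-negative counts, so it trains on the unscaled data;
# the SVM models train on the scaled data.
model_1.fit(X_train, y_train)
model_2.fit(X_train_scaled, y_train)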






RESULTS


The next few screenshots capture the performance of the 5 models. Performance is measured using confusion matrices and accuracy scores. A summary of the results follows.
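Each model is evaluated the same way; a sketch for model 2 is shown below (the other models follow the same pattern on their respective inputs).

from sklearn.metrics import accuracy_score, confusion_matrix

y_pred = model_2.predict(X_test_scaled)

print(confusion_matrix(y_test, y_pred))
print('Accuracy:', accuracy_score(y_test, y_pred))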











Summary Results


Model | Algorithm                  | Data scaling   | Training time | Accuracy
1     | Naive Bayes                | None           | 11 seconds    | 85.5%
2     | Linear SVC SVM             | MinMaxScaler   | 4.58 hours    | 86.7%
3     | SVM with polynomial kernel | MinMaxScaler   | 3.58 hours    | 84.5%
4     | SVM with RBF kernel        | MinMaxScaler   | 9.75 hours    | 81.6%
5     | SVM with SGD classifier    | StandardScaler | 2.5 minutes   | 80.6%



Overall, the model that performed the best was model 2, with 86.7% accuracy. Linear SVC models are well suited to large and complex datasets; however, much more work could be done to try to improve the model's accuracy. For example, the hyperparameters could be further tuned to optimize training time and performance, such as adjusting the cost, the maximum number of iterations, and the type of scaling to use (MinMaxScaler, StandardScaler, or preprocessing.scale(), for example). We could also return to the data preprocessing and cleaning steps to further reduce dimensionality by dropping unnecessary or unhelpful columns from the dataset. This step alone could drastically improve model training time and accuracy. However, due to the excessive training times of several models, I did not have enough time to explore these tuning steps in greater detail. As such, much more work could be done in this area to find the optimal model.




Feature Importance


Feature importance identifies the most important words or features that each model uses to make decisions. What follows is the feature importance for the first two models.
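The report computes feature importance with permutation calculations (shown only as screenshots). As a lighter-weight sketch of the same idea for the first two models, the learned coefficients and per-class log-probabilities can be inspected directly:

import numpy as np

feature_names = X.columns.to_numpy()

# Linear SVC: coefficient sign and magnitude show how strongly a token
# pushes a review toward one class or the other.
svc_coefs = model_2.coef_.ravel()
print('Positive indicators:', feature_names[np.argsort(svc_coefs)[-10:]])
print('Negative indicators:', feature_names[np.argsort(svc_coefs)[:10]])

# Multinomial Naive Bayes: compare per-class log-probabilities of each token.
# Index 1 corresponds to model_1.classes_[1] ('positive' here, alphabetically).
log_prob_diff = model_1.feature_log_prob_[1] - model_1.feature_log_prob_[0]
print('Tokens favoring positive reviews:',
      feature_names[np.argsort(log_prob_diff)[-10:]])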







I was unable to create feature importance lists for the remaining models because the permutation calculations for the non-linear SVM models required too much processing power; as a result, my kernel froze when I tried to run the last few cells. However, the first two models give us a pretty good sense of feature importance.



CONCLUSIONS


Our analysis has demonstrated the power of machine learning in accurately gauging audience sentiment. By leveraging models such as Naive Bayes and Support Vector Machines, we have achieved high levels of accuracy in classifying movie reviews as positive or negative. This capability enables businesses to swiftly identify trends and patterns in audience reactions, allowing them to respond promptly to emerging issues or capitalize on positive feedback.


Furthermore, our exploration into feature importance has revealed the most influential words and features driving audience perceptions of films. Understanding which words carry the most weight in shaping sentiment empowers businesses to tailor their marketing campaigns, promotional materials, and even storyline elements to resonate more deeply with their target audience. By aligning messaging and content with the preferences and emotions of moviegoers, businesses can enhance engagement, build stronger connections with audiences, and ultimately drive box office success.


Moreover, the scalability and efficiency of machine learning models offer significant advantages for businesses looking to streamline their operations and optimize resource allocation. By automating the process of sentiment analysis, movie studios and distributors can analyze vast volumes of reviews in real-time, enabling them to stay abreast of audience reactions and market trends. This agility allows businesses to make data-driven decisions swiftly, seize opportunities, and mitigate risks in an ever-evolving industry landscape.


In conclusion, sentiment analysis of movie reviews using machine learning holds immense potential for businesses seeking to thrive in the competitive film industry. By harnessing the insights gleaned from our analysis, businesses can refine their strategies, enhance audience engagement, and drive success in the dynamic world of cinema. Embracing the power of data-driven decision-making, businesses can unlock new opportunities, captivate audiences, and chart a course towards sustained growth and prosperity in the digital age.


Source code:


