  • patrickwalsh6079

Identifying the Speaker from the Script of The Office: a Machine Learning & Text Mining Use-case




INTRODUCTION


The Office, an American sitcom, has gained immense popularity for its unique humor and relatable characters. In this text-mining analysis, we delve into the vast dialogue dataset of The Office to uncover patterns and insights that shed light on the dynamics between characters and the underlying themes of the show.


This project will utilize The Office dataset from Kaggle, consisting of every line of dialogue from every character in the TV show The Office. The goal will be to text-mine this dataset, focusing on the speaker and the line of dialogue columns to train models that can read a line of dialogue and predict who the speaker is.




ANALYSIS



Dataset


The dataset consists of dialogue lines from The Office TV series, along with the corresponding speaker's name. It was initially loaded from a CSV file and then cleaned and preprocessed for analysis.






Data Preparation


Before modelling, we cleaned the dataset. We kept only the 'speaker' and 'line' columns, counted how often each speaker appears to identify the main characters, and then filtered out infrequent speakers and very short lines, so that the models train on lines with enough text to carry a signal.


The data preparation and data cleaning process consisted of the following steps:


  • Loaded the raw dataset from a CSV file.

  • Kept only the 'speaker' and 'line' columns.

  • Calculated the frequency of each speaker to identify prominent speakers.

  • Filtered the dataset to keep only rows where the speaker appears at least 1000 times.

  • Removed rows with very short lines (three words or fewer) to focus on meaningful text data.

  • Saved the cleaned dataset to a new CSV file for further processing.
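
The steps above can be sketched with pandas roughly as follows. This is a minimal illustration, not the project's actual code: the helper name `clean_dialogue`, its parameters, the toy data frame, and the output file name are all assumptions, and the speaker threshold is lowered from 1000 so that the toy data survives the filter.

```python
import pandas as pd


def clean_dialogue(df, min_speaker_lines=1000, max_short_words=3):
    """Hypothetical helper: keep longer lines spoken by frequent speakers."""
    # Keep only the two columns used for modelling.
    df = df[["speaker", "line"]].dropna()

    # Count lines per speaker and keep only the prominent speakers.
    counts = df["speaker"].value_counts()
    frequent = counts[counts >= min_speaker_lines].index
    df = df[df["speaker"].isin(frequent)]

    # Drop very short lines (max_short_words words or fewer).
    word_counts = df["line"].str.split().str.len()
    return df[word_counts > max_short_words].reset_index(drop=True)


# Tiny made-up frame standing in for the Kaggle CSV.
raw = pd.DataFrame({
    "season": [1, 1, 1, 1, 1],
    "speaker": ["Michael", "Michael", "Dwight", "Michael", "Jim"],
    "line": [
        "That is what she said to me",
        "No",
        "Bears eat beets every single day",
        "I declare bankruptcy right now",
        "Okay then",
    ],
})

# Threshold lowered for the toy data; the post used 1000.
cleaned = clean_dialogue(raw, min_speaker_lines=2)
# cleaned.to_csv("cleaned_lines.csv", index=False)  # hypothetical output path
```

On the real dataset, the same function would be called with the default thresholds after loading the CSV with `pd.read_csv`.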








Models


We compared five models, each of which approaches the text differently:


Naive Bayes Model: A simple probabilistic baseline that predicts the speaker from the word frequencies in a line.


Support Vector Machines (SVM): Finds maximum-margin boundaries that separate the speakers in feature space.


Random Forest Classifier: An ensemble of decision trees whose votes are combined to make each prediction.


Neural Network (MLPClassifier): A feed-forward network that can learn non-linear relationships between words and speakers.


LSTM Model (Word Embeddings): A recurrent network that reads each line as a sequence of word embeddings, so word order is taken into account.



Naive Bayes Model


  • Trained a Multinomial Naive Bayes model on the cleaned text data.

  • Evaluated the model's performance using accuracy and confusion matrix.
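
A minimal sketch of this step with scikit-learn might look like the following. The post does not say how the text was vectorized, so the bag-of-words `CountVectorizer` is an assumption, and the lines and labels are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Made-up examples; the real project used the cleaned dataset.
train_lines = [
    "I declare bankruptcy right now",
    "I am the world's best boss",
    "Bears eat beets every day",
    "I grow beets on my farm",
]
train_speakers = ["Michael", "Michael", "Dwight", "Dwight"]
test_lines = ["I am the best boss ever", "My farm has many beets"]
test_speakers = ["Michael", "Dwight"]

# Assumption: word counts as features (required to be non-negative for MultinomialNB).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_lines)
X_test = vectorizer.transform(test_lines)

nb = MultinomialNB()
nb.fit(X_train, train_speakers)
nb_pred = nb.predict(X_test)

# Evaluate with accuracy and a confusion matrix (rows: true, columns: predicted).
nb_accuracy = accuracy_score(test_speakers, nb_pred)
nb_cm = confusion_matrix(test_speakers, nb_pred, labels=nb.classes_)
print(nb_accuracy)
print(nb_cm)
```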



Support Vector Machines (SVM)


  • Scaled the training and test feature matrices with MinMaxScaler before fitting the SVM.

  • Created an SVM classifier with a linear kernel.

  • Fit the SVM model with sample weights calculated based on class imbalance.

  • Predicted speaker labels using the trained SVM model.
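
The SVM steps can be sketched as below, again on made-up data. The use of `CountVectorizer` features and of `compute_sample_weight` with the "balanced" setting are assumptions about how the weights were derived. Note that `MinMaxScaler` does not accept sparse input, so the count matrix is converted to a dense array here, which is only practical for small vocabularies.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.utils.class_weight import compute_sample_weight

# Made-up, deliberately imbalanced training data (4 vs 2 lines).
train_lines = [
    "I declare bankruptcy right now",
    "I am the world's best boss",
    "That is what she said",
    "Everybody in the conference room now",
    "Bears eat beets every day",
    "I grow beets on my farm",
]
train_speakers = ["Michael"] * 4 + ["Dwight"] * 2
test_lines = ["Conference room in five minutes", "Beets are the best crop"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_lines).toarray()  # dense for MinMaxScaler
X_test = vec.transform(test_lines).toarray()

# Fit the scaler on the training set only, then apply it to both sets.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Assumption: "balanced" weights, so the rarer speaker counts for more.
weights = compute_sample_weight(class_weight="balanced", y=train_speakers)

svm = SVC(kernel="linear")
svm.fit(X_train_scaled, train_speakers, sample_weight=weights)
svm_pred = svm.predict(X_test_scaled)
print(svm_pred)
```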



Random Forest Classifier


  • Trained a Random Forest Classifier on the text data.

  • Made predictions using the trained Random Forest model.
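
As a rough sketch, the Random Forest step could look like this. The features, the number of trees, and the random seed are illustrative assumptions, since the post does not list the settings used.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Made-up examples standing in for the cleaned dataset.
train_lines = [
    "I declare bankruptcy right now",
    "I am the world's best boss",
    "Bears eat beets every day",
    "I grow beets on my farm",
]
train_speakers = ["Michael", "Michael", "Dwight", "Dwight"]
test_lines = ["I am the best boss ever", "My farm has many beets"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_lines)
X_test = vec.transform(test_lines)

# Assumed settings: 100 trees and a fixed seed for repeatable runs.
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, train_speakers)
rf_pred = rf.predict(X_test)
print(rf_pred)
```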



Neural Network (MLPClassifier)


  • Implemented a Multi-Layer Perceptron (MLP) classifier with specified parameters.

  • Trained the MLP classifier on the text data and evaluated its predictions.
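
A possible shape for the MLP step is shown below. The "specified parameters" are not given in the post, so the hidden layer size, iteration limit, and seed here are placeholders rather than the values actually used.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neural_network import MLPClassifier

# Made-up examples standing in for the cleaned dataset.
train_lines = [
    "I declare bankruptcy right now",
    "I am the world's best boss",
    "Bears eat beets every day",
    "I grow beets on my farm",
]
train_speakers = ["Michael", "Michael", "Dwight", "Dwight"]
test_lines = ["I am the best boss ever", "My farm has many beets"]

vec = CountVectorizer()
X_train = vec.fit_transform(train_lines)
X_test = vec.transform(test_lines)

# Placeholder hyperparameters: one hidden layer of 32 units.
mlp = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)
mlp.fit(X_train, train_speakers)

mlp_pred = mlp.predict(X_test)
mlp_proba = mlp.predict_proba(X_test)  # one column per speaker in mlp.classes_
print(mlp_pred)
```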



LSTM Model (Word Embeddings)


  • Loaded the cleaned lines dataset.

  • Tokenized the text and trained a Word2Vec model to create word embeddings.

  • Preprocessed the text data for the LSTM model by tokenizing, padding sequences, and encoding labels. 

  • Built an LSTM model using Keras with an embedding layer, LSTM layer, and dense layer. 

  • Compiled and trained the LSTM model on the training data.
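
The preprocessing step (tokenizing, padding sequences, and encoding labels) is the least obvious part of this pipeline, so here is a library-free sketch of the idea. It is not the Keras or Word2Vec code from the project: the helper names are hypothetical, the tokenizer is a plain whitespace split, and sequences are padded at the end, whereas Keras pads at the start by default.

```python
def tokenize(line):
    """Lowercase and split on whitespace (a simplification)."""
    return line.lower().split()


def build_vocab(tokenized_lines):
    """Map each word to an integer id; id 0 is reserved for padding."""
    vocab = {}
    for tokens in tokenized_lines:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1
    return vocab


def encode_and_pad(tokenized_lines, vocab, max_len):
    """Turn words into ids, truncate to max_len, and pad with zeros at the end."""
    padded = []
    for tokens in tokenized_lines:
        # Unknown words also map to 0 in this sketch.
        ids = [vocab.get(tok, 0) for tok in tokens][:max_len]
        padded.append(ids + [0] * (max_len - len(ids)))
    return padded


# Made-up examples standing in for the cleaned dataset.
lines = ["Bears eat beets", "I declare bankruptcy right now"]
speakers = ["Dwight", "Michael"]

tokenized = [tokenize(line) for line in lines]
vocab = build_vocab(tokenized)
padded = encode_and_pad(tokenized, vocab, max_len=4)

# Encode speaker names as integer class ids.
label_to_id = {name: i for i, name in enumerate(sorted(set(speakers)))}
label_ids = [label_to_id[name] for name in speakers]

print(padded)     # fixed-length id sequences, ready for an embedding layer
print(label_ids)  # integer targets for the final dense layer
```

Every line ends up as a fixed-length list of word ids, which is the input shape an embedding layer followed by an LSTM layer expects.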




RESULTS


The results of each model, including accuracy scores, confusion matrices, and predictions, were analyzed to assess the performance of speaker classification on The Office dataset.






















CONCLUSIONS


The text mining analysis of The Office dataset has provided valuable insights into various aspects of the show's dialogue and character interactions. One key finding is the distinct dialogue styles of each character, which contribute to their unique personalities and roles within the show.


However, it proved very difficult to build a robust classifier that identifies the speaker from dialogue alone. More holistic methods that draw on other data may be necessary. For example, using additional information such as the season and episode number may yield better results in the task of speaker identification.


Overall, this analysis highlights the richness of The Office's dialogue and its contribution to the show's success. It showcases how text mining techniques can be used to extract valuable information from large datasets, offering new perspectives and enhancing our appreciation of beloved TV shows like The Office.


Source code:

