Data Integration

Objective

We aim to use the features already available in the dataset, along with some extracted features, to identify and develop the machine learning model that predicts incoming hate on YouTube with the highest accuracy. To do this, we use data integration techniques, sentiment analysis, TF-IDF scoring, and machine learning modeling.

Sentiment Implementation

We ran an NLTK sentiment analyzer over the video comments column of our dataset. It gave each comment a score between -1 and 1; we averaged these scores per video and stored the result in a separate ‘average_sentiment’ column. We then dropped the comments column and, for each video, removed every row except the one with the highest view count. This ensured we kept each YouTube video’s latest row together with its average sentiment. We also created a ‘sentiment_label’ feature, a binary value indicating whether or not a video received hate.
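The steps above can be sketched roughly as follows. This is a minimal illustration, not the exact pipeline: it works on plain Python rows instead of a data frame, assumes the NLTK analyzer is VADER (whose compound score lies between -1 and 1), and assumes a video is labeled as receiving hate when its average sentiment falls below 0. The ‘views’ and ‘comment’ key names, the function names, and the threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def make_vader_scorer():
    """Return a function mapping text to VADER's compound score in [-1, 1].

    Assumption: the NLTK analyzer is VADER. Requires nltk and a one-time
    nltk.download("vader_lexicon").
    """
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return lambda text: analyzer.polarity_scores(text)["compound"]


def summarize_videos(rows, score, hate_threshold=0.0):
    """Collapse comment-level rows into one row per video.

    rows: dicts with 'video_id', 'views' and 'comment' keys (assumed names).
    score: function mapping a comment string to a value in [-1, 1].
    hate_threshold: assumed cut-off; averages below it are labeled 1 (hate).
    """
    scores = defaultdict(list)
    latest = {}
    for row in rows:
        vid = row["video_id"]
        scores[vid].append(score(row["comment"]))
        # Keep only the row with the highest view count for each video.
        if vid not in latest or row["views"] > latest[vid]["views"]:
            latest[vid] = row

    result = []
    for vid, row in latest.items():
        avg = mean(scores[vid])
        # Drop the comment text; only the per-video summary is kept.
        kept = {k: v for k, v in row.items() if k != "comment"}
        kept["average_sentiment"] = avg
        kept["sentiment_label"] = 1 if avg < hate_threshold else 0
        result.append(kept)
    return result
```

With NLTK installed, a call would look like summarize_videos(rows, make_vader_scorer()). Passing the scorer in as an argument keeps the aggregation logic independent of the particular analyzer.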

We then took the sub-table we had created for the ‘tags’ column, which also contained the video_id for each tag, and ran a TF-IDF scoring model on it. Each tag used in a video was assigned a score based on the number of times it occurred in the tags column, and the scores of all tags in a video were added up into a final ‘tfidf_sum’ column. We did this so we could identify which videos used more important, or more heavily weighted, tags.
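A rough sketch of how a per-video ‘tfidf_sum’ could be computed is shown below. It assumes the textbook weighting (term frequency within a video's tags multiplied by the log of the inverse document frequency across videos). Library implementations such as scikit-learn's TfidfVectorizer apply smoothing and normalization by default, so their values would differ. The function name and the input layout are hypothetical.

```python
import math
from collections import Counter


def tfidf_sum_per_video(tags_by_video):
    """Return {video_id: sum of TF-IDF scores over that video's tags}.

    tags_by_video: {video_id: list of tag strings}, built from the tags
    sub-table (assumed layout).
    Assumed weighting: tf = count / number of tags in the video,
    idf = log(number of videos / number of videos using the tag).
    """
    n_videos = len(tags_by_video)

    # Document frequency: in how many videos does each tag appear?
    doc_freq = Counter()
    for tags in tags_by_video.values():
        doc_freq.update(set(tags))

    sums = {}
    for vid, tags in tags_by_video.items():
        counts = Counter(tags)
        total = 0.0
        for tag, count in counts.items():
            tf = count / len(tags)
            idf = math.log(n_videos / doc_freq[tag])
            total += tf * idf
        sums[vid] = total
    return sums
```

Under this weighting, a tag that appears in every video contributes 0, so videos that use only very common tags end up with a low ‘tfidf_sum’, while videos with rarer tags score higher.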

Correlation Matrix

We implemented a correlation matrix to find potential correlations and patterns between the features. Although most features are not strongly correlated with one another, the sentiment and TF-IDF features do show a good correlation compared to the other features. ChatGPT explained that this is a good result, since our datasets were extremely large.
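As an illustration of the idea, a correlation matrix over numeric feature columns can be sketched as follows. This assumes the Pearson coefficient, which is the default in common tools such as pandas' DataFrame.corr; the report does not state which coefficient was used. The function names and the column layout are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column.
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of feature columns.

    columns: {feature name: list of numeric values}, one value per video.
    """
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}
```

Each entry lies between -1 and 1. Values near 0 indicate little linear relationship, which corresponds to the weak correlations seen between most features.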

[Figure: correlation matrix]

Our analysis also revealed that News and Politics, Shows, and Education were the three categories receiving the most hate speech. This finding helps identify the areas that require more attention to prevent online hate. We aim to use these findings to guide our efforts in creating a safer and more positive environment on YouTube.