The data set for this blog post got lost and the code no longer runs.
This code has been slightly revised to make sure it works as of 2018-12-16.
In this post, we will look at a handful of English movie reviews from IMDb, which I have scraped and placed in the movie-reviews repository. I took the best and worst rated movies along with their best and worst reviews. From that, we will try to see how positive reviews of good movies differ from positive reviews of bad movies, and so on.
We will use fairly standard packages with the inclusion of paletteer for the sole reason of self-promotion. (yay!!!)
library(tidyverse)
library(tidytext)
library(plotly)
library(paletteer)
We will read in the data using readr.
reviews_raw <- read_csv("https://raw.githubusercontent.com/EmilHvitfeldt/movie-reviews/master/reviews_v1.csv")
Let’s take a look at the data I prepared for us:
glimpse(reviews_raw)
It includes 7 different variables. There is some redundancy: the url variable contains the URL of the movie, and id and title are just extracts from the url variable. The rating variable is the average rating of the movie and will not be used in this analysis. Lastly, we have review_rating and movie_rating, which denote whether the review is positive or negative and whether the movie being reviewed is good or bad, respectively.
Let’s start by unnesting the words and getting the counts. We also don’t want to look at stop words or words that contain numbers. The latter are likely not a great number of words, but we will exclude them for now anyway.
counted_words <- unnest_tokens(reviews_raw, word, text) %>%
  count(word, movie_rating, review_rating) %>%
  anti_join(stop_words, by = "word") %>%
  filter(!str_detect(word, "\\d"))
And let’s have a quick look at the result:
counted_words %>% arrange(desc(n)) %>% head(n = 15)
And we notice that the word movie has been used quite a lot more in reviews of bad movies than in good movies.
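To see what the tokenising and filtering steps do in isolation, here is a minimal sketch on a made-up data frame. The `toy_reviews` and `toy_counts` names and the two review texts are hypothetical; only the column names mirror the real data.

```r
library(dplyr)
library(stringr)
library(tidytext)

# Hypothetical two-review data set with the same rating columns as reviews_raw
toy_reviews <- tibble(
  movie_rating  = c("good", "bad"),
  review_rating = c("good", "bad"),
  text = c("A stunning film from 1994", "The plot was dull and unfunny")
)

toy_counts <- toy_reviews %>%
  unnest_tokens(word, text) %>%                   # one lowercased word per row
  count(word, movie_rating, review_rating) %>%    # count per word and group
  anti_join(stop_words, by = "word") %>%          # drop stop words such as "the"
  filter(!str_detect(word, "\\d"))                # drop tokens with digits ("1994")

toy_counts
```

Note that `unnest_tokens()` lowercases the text by default, which is why the join against `stop_words` catches "The" as well as "the".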
Log odds
We have a bunch of counts here and we would like to find a worthwhile transformation of them. Since we have the word counts for good movies and bad movies, we can find, for each word, the proportion of its occurrences that appear in reviews of good movies. This gives us a number between 0 and 1, where the interesting words are those with a proportion close to 0 or 1, as that shows the word is used more in one group than the other.
Applying this transformation to both the review scores and the movie scores gives us the following plot:
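Written out, the two proportions look as follows. The notation is my own shorthand, not something from the data: n_{mr}(w) is the count of word w in reviews where the movie rating is m and the review rating is r, with g for good and b for bad.

```latex
N(w) = n_{bb}(w) + n_{bg}(w) + n_{gb}(w) + n_{gg}(w)

p_{\text{movie}}(w)  = \frac{n_{gb}(w) + n_{gg}(w)}{N(w)}
\qquad
p_{\text{review}}(w) = \frac{n_{bg}(w) + n_{gg}(w)}{N(w)}
```

So a word with p_movie close to 1 shows up almost only in reviews of good movies, whatever the tone of the review.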
counted_words %>%
  mutate(rating = str_c(movie_rating, "_", review_rating)) %>%
  select(-movie_rating, -review_rating) %>%
  spread(rating, n) %>%
  drop_na() %>%
  mutate(review_lo = (bad_good + good_good) / (bad_bad + good_bad + bad_good + good_good),
         movie_lo = (good_bad + good_good) / (bad_bad + bad_good + good_bad + good_good)) %>%
  ggplot() +
  aes(movie_lo, review_lo) +
  geom_text(aes(label = word))
Another way to do this is to take the log of the odds of one event happening over the other event. We will create this little helper function for us.
log_odds <- function(x, y) {
  total <- x + y
  p <- x / total
  log(p / (1 - p))
}
Applying this transformation instead expands the range from the interval between 0 and 1 to the whole real line, with the midpoint at 0. This has some nice properties from a visualization perspective. It will also compact the center points a little more, allowing outliers to be more prominent.
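A quick sketch of how the helper behaves, using made-up counts (the `good` and `bad` vectors are hypothetical, not taken from the data). The helper is repeated so the snippet runs on its own.

```r
# Same helper as in the post: log odds of x relative to y
log_odds <- function(x, y) {
  total <- x + y
  p <- x / total
  log(p / (1 - p))
}

# Hypothetical counts of three words in good-movie and bad-movie reviews
good <- c(50, 30, 10)
bad  <- c(50, 10, 30)

lo <- log_odds(good, bad)
lo
# Equal counts give 0; a 3:1 split and a 1:3 split give +log(3) and -log(3)

# Algebraically, the helper is just the log of the ratio of the two counts
all.equal(lo, log(good / bad))
```

The symmetry around 0 is the useful part: a word used three times as often in one group sits exactly as far from the middle as a word used three times as often in the other.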
plot_data <- counted_words %>%
  mutate(rating = str_c(movie_rating, "_", review_rating)) %>%
  select(-movie_rating, -review_rating) %>%
  spread(rating, n) %>%
  drop_na() %>%
  mutate(review_lo = log_odds(bad_good + good_good, bad_bad + good_bad),
         movie_lo = log_odds(good_bad + good_good, bad_bad + bad_good))
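The reshaping step in the pipeline above (combine the two ratings into one key, spread to wide, drop incomplete rows) is easy to miss, so here is a small sketch of it on hypothetical counts. The `toy_counts` table and its values are invented for illustration.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Hypothetical counts in the same long format as counted_words
toy_counts <- tribble(
  ~word,       ~movie_rating, ~review_rating, ~n,
  "overrated", "bad",         "bad",          2,
  "overrated", "bad",         "good",         1,
  "overrated", "good",        "bad",          9,
  "overrated", "good",        "good",         1,
  "sharknado", "bad",         "bad",          4
)

toy_wide <- toy_counts %>%
  mutate(rating = str_c(movie_rating, "_", review_rating)) %>%  # e.g. "good_bad"
  select(-movie_rating, -review_rating) %>%
  spread(rating, n) %>%   # one column per movie/review combination
  drop_na()               # keep only words seen in all four groups

toy_wide
```

Here "sharknado" appears in only one of the four groups, so `drop_na()` removes it. That is a deliberate trade-off: words missing from a group would otherwise produce missing or infinite log odds.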
plot_data %>%
  ggplot() +
  aes(movie_lo, review_lo, label = word) +
  geom_text()
We have a good degree of overplotting in this plot. Part of that might be because of the text, but a quick look at the scatterplot still reveals a good deal of overplotting. We will try to counter that later on.
plot_data %>%
  ggplot() +
  aes(movie_lo, review_lo) +
  geom_point(alpha = 0.5)
Let us stay with the scatterplot. Let’s tighten up the theme and include guidelines at y = 0 and x = 0. We will also find the range of the data to make sure we include all the points.
plot_data %>%
  select(movie_lo, review_lo) %>%
  range()
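Instead of reading the limits off the printed range by hand, they could also be computed. This is only a sketch with made-up values; `toy` and `lim` are hypothetical names, and rounding up to one decimal is my own choice.

```r
library(dplyr)

# Hypothetical log-odds values standing in for plot_data
toy <- tibble(
  movie_lo  = c(-1.2, 0.3, 4.1),
  review_lo = c(-4.55, 0.8, 2.0)
)

# Largest absolute value across both columns, rounded up to one decimal
lim <- toy %>%
  select(movie_lo, review_lo) %>%
  range() %>%
  abs() %>%
  max()
lim <- ceiling(lim * 10) / 10

lim
```

The result can then be passed to `coord_cartesian(xlim = c(-lim, lim), ylim = c(-lim, lim))`, which keeps both axes symmetric around 0.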
plot_data %>%
  ggplot() +
  aes(movie_lo, review_lo) +
  geom_vline(xintercept = 0, color = "grey") +
  geom_hline(yintercept = 0, color = "grey") +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  coord_cartesian(ylim = c(-4.6, 4.6),
                  xlim = c(-4.6, 4.6)) +
  labs(x = "← Bad Movies - Good Movies →", y = "← Bad Reviews - Good Reviews →")
We still have quite a bit of overplotting, so I’m going to sample the points based on importance. The importance metric I’m going to work with is the squared distance from the middle. In addition, we are going to display the number of times a word is used by the size of the points.
set.seed(13)
plot_data_v2 <- plot_data %>%
  mutate(distance = review_lo ^ 2 + movie_lo ^ 2,
         n = bad_bad + bad_good + good_bad + good_good) %>%
  sample_frac(0.1, weight = distance)
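To get a feel for what the weighted sampling does, here is a small sketch on simulated points. The `toy` and `sampled` names are hypothetical, and the standard normal coordinates are only a stand-in for the real log odds.

```r
library(dplyr)

set.seed(13)

# Hypothetical cloud of 1000 points centred on the origin
toy <- tibble(
  movie_lo  = rnorm(1000),
  review_lo = rnorm(1000)
) %>%
  mutate(distance = review_lo ^ 2 + movie_lo ^ 2)

# Keep 10% of the rows, favouring points far from the middle
sampled <- toy %>%
  sample_frac(0.1, weight = distance)

nrow(sampled)

# Compare the average distance before and after sampling
mean(toy$distance)
mean(sampled$distance)
```

The sampled points should typically sit noticeably further from the origin than the full set, which thins out the crowded center. In newer versions of dplyr the same idea is written as `slice_sample(prop = 0.1, weight_by = distance)`.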
plot_data_v2 %>%
  ggplot() +
  aes(movie_lo, review_lo, size = n) +
  geom_vline(xintercept = 0, color = "grey") +
  geom_hline(yintercept = 0, color = "grey") +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  coord_cartesian(ylim = c(-4.6, 4.6),
                  xlim = c(-4.6, 4.6)) +
  labs(x = "← Bad Movies - Good Movies →", y = "← Bad Reviews - Good Reviews →")
Lastly, we will make the whole thing interactive with plotly to allow hover text. We include some colors to indicate the distance to the center.
p <- plot_data_v2 %>%
  ggplot() +
  aes(movie_lo, review_lo, size = n, color = distance, text = word) +
  geom_vline(xintercept = 0, color = "grey") +
  geom_hline(yintercept = 0, color = "grey") +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  coord_cartesian(ylim = c(-4.6, 4.6),
                  xlim = c(-4.6, 4.6)) +
  labs(x = "← Bad Movies - Good Movies →",
       y = "← Bad Reviews - Good Reviews →",
       title = "What are people saying about the best and worst movies on IMDB?") +
  scale_color_paletteer_c("viridis::viridis") +
  guides(color = "none", size = "none")
ggplotly(p, width = 700, height = 700, tooltip = "text") %>%
  config(displayModeBar = FALSE)
And we are done and it looks amazing! With this dataviz, we can see that the word overrated is mainly used in negative reviews about good movies. Likewise, unfunny is used in bad reviews about bad movies. There are many more examples that I’ll let you explore by yourself.
Thanks for tagging along!