I’ve just finished R for Data Science by Hadley Wickham and just started Text Mining with R by Julia Silge. So I figured it’s about time I do some data analysis to apply the skills I learned. I decided to do sentiment analysis after reading this post by Julia Silge.
After skimming through some interesting datasets on the internet, I decided to use the “A Million Headlines” dataset, which can be found on Kaggle. It’s a dataset of news headlines published over a period of 14 years, from 2003 to 2017, taken from the Australian news source ABC (Australian Broadcasting Corporation).
First, let’s import all the packages needed:
library(tidyverse)
library(here)
library(tidytext)
library(viridis)
library(widyr)
library(ggraph)
library(igraph)
library(scales)
library(knitr)
library(wordcloud)
library(reshape2)
Now let’s import the data:
# import data (read_csv already returns a tibble)
news <- read_csv(here("abcnews-date-text.csv"))
news
## # A tibble: 1,103,665 x 2
## publish_date headline_text
## <dbl> <chr>
## 1 20030219 aba decides against community broadcasting licence
## 2 20030219 act fire witnesses must be aware of defamation
## 3 20030219 a g calls for infrastructure protection summit
## 4 20030219 air nz staff in aust strike for pay rise
## 5 20030219 air nz strike to affect australian travellers
## 6 20030219 ambitious olsson wins triple jump
## 7 20030219 antic delighted with record breaking barca
## 8 20030219 aussie qualifier stosur wastes four memphis match
## 9 20030219 aust addresses un security council over iraq
## 10 20030219 australia is locked into war timetable opp
## # ... with 1,103,655 more rows
Term Frequency
One of the common tasks in text mining is to look at word frequencies. Let’s analyze word frequencies across all of the headlines. First we add a year column and a line number for each headline:
news <- news %>%
# create year column
mutate(year = substr(publish_date,
start = 1, stop = 4),
linenumber = row_number())
news
## # A tibble: 1,103,665 x 4
## publish_date headline_text year linenumber
## <dbl> <chr> <chr> <int>
## 1 20030219 aba decides against community broadcasti… 2003 1
## 2 20030219 act fire witnesses must be aware of defa… 2003 2
## 3 20030219 a g calls for infrastructure protection … 2003 3
## 4 20030219 air nz staff in aust strike for pay rise 2003 4
## 5 20030219 air nz strike to affect australian trave… 2003 5
## 6 20030219 ambitious olsson wins triple jump 2003 6
## 7 20030219 antic delighted with record breaking bar… 2003 7
## 8 20030219 aussie qualifier stosur wastes four memp… 2003 8
## 9 20030219 aust addresses un security council over … 2003 9
## 10 20030219 australia is locked into war timetable o… 2003 10
## # ... with 1,103,655 more rows
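As an aside, the year can also be obtained by parsing publish_date as a real date, which makes finer groupings (by month, for example) available later. The following is a minimal sketch, assuming the lubridate package is installed; the three-row `dates` tibble is made up for illustration and is not read from the dataset.

```r
library(dplyr)
library(lubridate)

# Hypothetical sample in the same yyyymmdd format as publish_date
dates <- tibble(publish_date = c(20030219, 20101215, 20171231))

dates_parsed <- dates %>%
  mutate(date  = ymd(publish_date),  # parse the yyyymmdd number as a Date
         year  = year(date),         # numeric year
         month = month(date))        # numeric month, for finer grouping

dates_parsed
```

Note that this yields a numeric year, whereas the substr() approach used in this post yields a character year, so filters such as year == "2017" would need adjusting.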
We can use unnest_tokens to separate each headline into words. The default tokenizer splits on words, but other options include characters, sentences, lines, paragraphs, or separation around a regex pattern.
tidy_news <- news %>%
unnest_tokens(word, headline_text)
tidy_news
## # A tibble: 7,070,525 x 4
## publish_date year linenumber word
## <dbl> <chr> <int> <chr>
## 1 20030219 2003 1 aba
## 2 20030219 2003 1 decides
## 3 20030219 2003 1 against
## 4 20030219 2003 1 community
## 5 20030219 2003 1 broadcasting
## 6 20030219 2003 1 licence
## 7 20030219 2003 2 act
## 8 20030219 2003 2 fire
## 9 20030219 2003 2 witnesses
## 10 20030219 2003 2 must
## # ... with 7,070,515 more rows
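To illustrate one of the other tokenizing options mentioned above, here is a small sketch that splits headlines into bigrams (overlapping two-word tokens) instead of single words. The two headlines are copied from the sample output earlier; the object names are only for illustration.

```r
library(dplyr)
library(tidytext)

# Two headlines taken from the first rows of the dataset
sample_headlines <- tibble(
  linenumber = 1:2,
  headline_text = c("air nz strike to affect australian travellers",
                    "ambitious olsson wins triple jump")
)

# token = "ngrams" with n = 2 produces overlapping two-word tokens
sample_bigrams <- sample_headlines %>%
  unnest_tokens(bigram, headline_text, token = "ngrams", n = 2)

sample_bigrams
```

A headline with k words produces k - 1 bigrams, so the seven-word and five-word headlines here give ten rows in total.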
Now we can manipulate the data and do term frequency analysis. First, let’s remove stop words, which can be obtained from the stop_words dataset, using the anti_join function. Stop words are very common words, such as “the” and “of”, that carry little meaning on their own. We filter them out because they would otherwise dominate the word counts.
# remove stopwords
data("stop_words")
tidy_news <- tidy_news %>%
anti_join(stop_words)
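To make the effect of the anti_join concrete, here is a tiny sketch on a made-up word list (the `words` tibble is hypothetical). It also passes by = "word" explicitly, which avoids the message dplyr prints when it has to guess the join column.

```r
library(dplyr)
library(tidytext)

# Hypothetical mix of content words and stop words
words <- tibble(word = c("police", "to", "the", "court", "of"))

# Keep only the rows whose word does NOT appear in stop_words
kept <- words %>%
  anti_join(stop_words, by = "word")

kept$word
```

Only “police” and “court” survive, since “to”, “the”, and “of” are all in the stop word list.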
Let’s see the most frequent words used in the news headlines since 2003:
# most common words
tidy_news %>%
count(word, sort = TRUE) %>%
head(20) %>%
mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
geom_bar(stat = "identity") +
coord_flip() +
ylab("Number of occurrences") +
xlab("Word")
We can see here that the most frequent words in the headlines include “police”, “court”, and “council”.
Network of Words
Let’s count the words that occur together in the headlines from 2017. Using the pairwise_count function from the widyr package, we can count how often each pair of words appears in the same headline.
headlines_2017 <- tidy_news %>%
filter(year == "2017") %>%
pairwise_count(word, linenumber, sort = TRUE)
headlines_2017
## # A tibble: 1,004,826 x 3
## item1 item2 n
## <chr> <chr> <dbl>
## 1 trump donald 612
## 2 donald trump 612
## 3 korea north 301
## 4 north korea 301
## 5 marriage sex 285
## 6 sex marriage 285
## 7 turnbull malcolm 197
## 8 malcolm turnbull 197
## 9 election wa 149
## 10 wa election 149
## # ... with 1,004,816 more rows
“Donald Trump” is the most frequent word pair in 2017, followed by “North Korea”, which is unsurprising given that the feud between the two raised fears of nuclear war around the world. Also in 2017, Australia voted in favour of legalising same-sex marriage, which was big news across the country. That explains why “sex marriage” sits just below “Donald Trump” and “North Korea” in the co-occurrence counts for 2017 headlines.
Let’s plot the network of word co-occurrences:
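Notice that every pair in the output above appears twice, once in each order. A small sketch on made-up data (the `toy` tibble is hypothetical) shows how the upper argument of pairwise_count controls this:

```r
library(dplyr)
library(widyr)

# Hypothetical tokens: headline 1 has two words, headline 2 has three
toy <- tibble(
  linenumber = c(1, 1, 2, 2, 2),
  word = c("donald", "trump", "donald", "trump", "visit")
)

# Default (upper = TRUE): each pair is listed in both orders
pairs_both <- toy %>%
  pairwise_count(word, linenumber, sort = TRUE)

# upper = FALSE: each unordered pair is listed only once
pairs_once <- toy %>%
  pairwise_count(word, linenumber, sort = TRUE, upper = FALSE)

pairs_both
pairs_once
```

Listing each pair once is usually what you want when feeding the result into a graph, because otherwise every edge would be drawn twice.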
#pairwise count
word_pairs <- tidy_news %>%
group_by(word) %>%
filter(n() > 5) %>%
ungroup() %>%
pairwise_count(item=word,
linenumber, sort = TRUE,
upper = FALSE) %>%
filter(n > 10)
#create plot
word_pairs %>%
top_n(100) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(edge_alpha = n,
edge_width = n)) +
geom_node_point(color = "darkslategray4",
size = 5) +
geom_node_text(aes(label = name),
vjust = 1.8) +
ggtitle("Word Network in ABC Headlines From 2003-2017") +
theme_void()
Next, we’ll look at the sentiment of these words so we can understand what kinds of sentiment appear in most of these headlines.
Sentiment Analysis
Now let’s turn to sentiment analysis. When we read a text, we use our understanding of the emotional intent of words to infer whether a passage is positive or negative, and perhaps to characterize it by an emotion such as anger or joy. Let’s use the bing lexicon from the sentiments dataset to categorize our words into positive or negative sentiment.
# create dataframe of words from bing lexicon
library(tidyr)
bing <- sentiments %>%
filter(lexicon == "bing") %>%
select(-score)
bing
## # A tibble: 6,788 x 3
## word sentiment lexicon
## <chr> <chr> <chr>
## 1 2-faced negative bing
## 2 2-faces negative bing
## 3 a+ positive bing
## 4 abnormal negative bing
## 5 abolish negative bing
## 6 abominable negative bing
## 7 abominably negative bing
## 8 abominate negative bing
## 9 abomination negative bing
## 10 abort negative bing
## # ... with 6,778 more rows
Using the inner_join function, we can categorize the words as positive or negative by joining them with the bing dataset.
# classify words as positive or negative
# based on the bing lexicon
news_sentiment <- tidy_news %>%
inner_join(bing) %>%
count(year,sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(sentiment = positive - negative)
news_sentiment
## # A tibble: 15 x 4
## year negative positive sentiment
## <chr> <dbl> <dbl> <dbl>
## 1 2003 26303 12604 -13699
## 2 2004 31093 14291 -16802
## 3 2005 31705 13522 -18183
## 4 2006 28471 12123 -16348
## 5 2007 32875 13454 -19421
## 6 2008 34001 14123 -19878
## 7 2009 32679 13069 -19610
## 8 2010 31273 12589 -18684
## 9 2011 30169 11997 -18172
## 10 2012 30152 13555 -16597
## 11 2013 31884 14523 -17361
## 12 2014 28363 14290 -14073
## 13 2015 30389 14673 -15716
## 14 2016 24249 11487 -12762
## 15 2017 19247 9208 -10039
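The net sentiment column is simply the positive count minus the negative count for each year. The sketch below reproduces the first two rows of the table from their counts. It uses pivot_wider(), the newer tidyr replacement for spread(), which is an assumption about your tidyr version (1.0.0 or later) and not what the code above uses.

```r
library(dplyr)
library(tidyr)

# Counts for 2003 and 2004, copied from the output above
counts <- tibble(
  year = c("2003", "2003", "2004", "2004"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n = c(26303, 12604, 31093, 14291)
)

net <- counts %>%
  # one column per sentiment, filling absent combinations with 0
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  # net score: positive words minus negative words
  mutate(sentiment = positive - negative)

net
```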
Most common positive and negative words
Now that we have a data frame of positive and negative sentiments, we can analyze which words are most common in each category. We filter out words with NA sentiment, that is, words the lexicon does not cover.
word_count <- tidy_news %>%
left_join(get_sentiments("bing"), by = "word") %>%
filter(!is.na(sentiment)) %>%
count(word, sentiment, sort = TRUE) %>%
ungroup()
top_sentiments_bing <- word_count %>%
filter(word != "wins") %>%
group_by(sentiment) %>%
top_n(5, n) %>%
mutate(num = ifelse(sentiment == "negative",
-n, n)) %>%
mutate(word = reorder(word, num)) %>%
ungroup()
top_sentiments_bing
## # A tibble: 10 x 4
## word sentiment n num
## <chr> <chr> <int> <int>
## 1 crash negative 11208 -11208
## 2 death negative 11173 -11173
## 3 murder negative 9217 -9217
## 4 win positive 8315 8315
## 5 killed negative 8129 -8129
## 6 attack negative 7166 -7166
## 7 boost positive 6997 6997
## 8 gold positive 6211 6211
## 9 top positive 5687 5687
## 10 support positive 5399 5399
Let’s plot the top 5 words for positive and negative sentiment.
ggplot(top_sentiments_bing, aes(reorder(word, num), num,
fill = sentiment)) +
geom_bar(stat = 'identity', alpha = 0.75) +
scale_fill_manual(guide = "none", values = c("black",
"darkgreen")) +
scale_y_continuous(breaks = pretty_breaks(7)) +
labs(x = '', y = "Number of Occurrences",
title = 'News Headlines Sentiments',
subtitle = 'Most Common Positive and Negative Words') +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1,
size = 14, face = "bold"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
panel.grid.major.y = element_line(size = 1.1))
Word Cloud: Most Common Positive and Negative Words in News Headlines
library(wordcloud) # to create wordcloud
library(reshape2) # for acast() function
tidy_news %>%
inner_join(bing) %>%
count(word, sentiment, sort = TRUE) %>%
acast(word ~ sentiment, value.var = "n", fill = 0) %>%
comparison.cloud(colors = c("black", "darkgreen"),
title.size = 1.5)
Proportions of Positive and Negative Words
Now let’s look at the proportions of negative and positive words in each year. After filtering out words with no sentiment (NA), we group by year and sentiment and count the rows in each group. Finally, we compute each share by dividing a group’s count by the total count for its year.
sentiment_bing <- tidy_news %>%
left_join(get_sentiments("bing"), by = "word") %>%
filter(!is.na(sentiment)) %>%
group_by(year, sentiment) %>%
summarise(n = n()) %>%
mutate(percent = n / sum(n)) %>%
ungroup()
sentiment_bing
## # A tibble: 30 x 4
## year sentiment n percent
## <chr> <chr> <int> <dbl>
## 1 2003 negative 26303 0.676
## 2 2003 positive 12604 0.324
## 3 2004 negative 31093 0.685
## 4 2004 positive 14291 0.315
## 5 2005 negative 31705 0.701
## 6 2005 positive 13522 0.299
## 7 2006 negative 28471 0.701
## 8 2006 positive 12123 0.299
## 9 2007 negative 32875 0.710
## 10 2007 positive 13454 0.290
## # ... with 20 more rows
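The percent column relies on a detail of dplyr grouping: after summarise(), the last grouping level (sentiment) is dropped, so sum(n) in the following mutate() is the total per year and not the total for the whole dataset. The sketch below makes that grouping explicit and checks it against the first rows of the output, using counts copied from the table above.

```r
library(dplyr)

# Counts for 2003 and 2004, copied from the output above
counts <- tibble(
  year = c("2003", "2003", "2004", "2004"),
  sentiment = c("negative", "positive", "negative", "positive"),
  n = c(26303, 12604, 31093, 14291)
)

shares <- counts %>%
  group_by(year) %>%                # shares are computed within each year
  mutate(percent = n / sum(n)) %>%
  ungroup()

round(shares$percent, 3)
```

Within each year the two shares add up to 1, which is a quick sanity check worth running on the full table as well.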
sentiment_bing %>%
ggplot(aes(x = year, y = percent, color = sentiment,
group = sentiment)) +
geom_line(size = 1) +
geom_point(size = 3) +
scale_y_continuous(breaks = pretty_breaks(5),
labels = percent_format()) +
labs(x = "Year", y = "Emotion Words Count (as %)") +
scale_color_manual(values = c(positive = "darkgreen",
negative = "black")) +
ggtitle("Proportion of Positive and Negative Words",
subtitle = "Bing lexicon") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1,
size = 11, face = "bold"),
axis.title.x = element_blank(),
axis.text.y = element_text(size = 11, face = "bold"))
The proportion of negative sentiment words has been consistently much higher than the proportion of positive sentiment words since 2003.
Let’s use the NRC lexicon for sentiment analysis. The NRC lexicon goes beyond positive and negative: it also assigns words to eight emotions, namely anger, anticipation, disgust, fear, joy, sadness, surprise, and trust.
library(RColorBrewer)
cols <- colorRampPalette(brewer.pal(n = 8, name = "Set1"))(8)
cols
## [1] "#E41A1C" "#377EB8" "#4DAF4A" "#984EA3" "#FF7F00" "#FFFF33" "#A65628"
## [8] "#F781BF"
Now, let’s plot the distribution of emotion terms as a boxplot:
cols <- c("anger" = "#E41A1C", "sadness" = "#377EB8",
"disgust" = "#4DAF4A", "fear" = "#984EA3",
"surprise" = "#FF7F00", "joy" = "#FFFF33",
"anticipation" = "#A65628", "trust" = "#F781BF")
news_nrc <- tidy_news %>%
left_join(get_sentiments("nrc"), by = "word") %>%
filter(!(sentiment == "negative" |
sentiment == "positive")) %>%
mutate(sentiment = as.factor(sentiment)) %>%
group_by(index = linenumber %/% 100,
sentiment) %>%
summarize(n = n()) %>%
mutate(percent = n / sum(n)) %>%
select(-n) %>%
ungroup()
library(hrbrthemes)
news_nrc %>%
ggplot() +
geom_boxplot(aes(x = reorder(sentiment, percent),
y = percent, fill = sentiment)) +
scale_y_continuous(breaks = pretty_breaks(5),
labels = percent_format()) +
scale_fill_manual(values = cols) +
ggtitle("Distribution of Emotion Terms") +
labs(x = "Emotion Term", y = "Percentage") +
theme_bw() +
theme(legend.position = "none",
axis.text.x = element_text(size = 11,
face = "bold"),
axis.text.y = element_text(size = 11,
face = "bold"))
Fear has the highest percentage in the distribution. Next, we can see how the emotions in the headlines change over time by creating a line chart that plots each emotion’s share by year:
news_nrc2 <- tidy_news %>%
left_join(get_sentiments("nrc"), by = "word") %>%
filter(!(sentiment == "negative" |
sentiment == "positive")) %>%
mutate(sentiment = as.factor(sentiment)) %>%
group_by(year, sentiment) %>%
summarize(n = n()) %>%
mutate(percent = n / sum(n)) %>%
select(-n) %>%
ungroup()
news_nrc2 %>%
group_by(year) %>%
ggplot(aes(year, percent, color = sentiment,
group = sentiment)) +
geom_line(size = 1) +
geom_point(size = 3.5) +
scale_y_continuous(breaks = pretty_breaks(5),
labels = percent_format()) +
xlab("Year") + ylab("Proportion of Emotion Words") +
ggtitle("News Headlines Sentiments Across Years",
subtitle = "From 2003-2017") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45,
hjust = 1, size = 11,
face = "bold"),
axis.title.x = element_blank(),
axis.text.y = element_text(size = 11,
face = "bold")) +
scale_color_brewer(palette = "Set1")
We can see that the mix of emotions is quite consistent over time, with fear already at a high level in 2003.
Let’s see which words associated with fear are used most often:
nrc_fear <- get_sentiments("nrc") %>%
filter(sentiment == "fear")
tidy_news %>%
inner_join(nrc_fear) %>%
count(word, sort = TRUE)
## # A tibble: 1,300 x 2
## word n
## <chr> <int>
## 1 police 35985
## 2 court 16380
## 3 fire 13910
## 4 crash 11208
## 5 death 11173
## 6 murder 9217
## 7 hospital 8815
## 8 accused 8094
## 9 government 7905
## 10 missing 7582
## # ... with 1,290 more rows
“police”, “court”, “fire”, and “crash” are among the words most often associated with fear, with “police” having the highest count by a wide margin.
Comparing how sentiments differ across the sentiment libraries
There are three options for sentiment lexicons; let’s see how the three lexicons differ when applied to these headlines.
First, let’s see how many positive and negative words each lexicon contains.
Bing
get_sentiments("bing") %>%
count(sentiment)
## # A tibble: 2 x 2
## sentiment n
## <chr> <int>
## 1 negative 4782
## 2 positive 2006
NRC
get_sentiments("nrc") %>%
count(sentiment)
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 anger 1247
## 2 anticipation 839
## 3 disgust 1058
## 4 fear 1476
## 5 joy 689
## 6 negative 3324
## 7 positive 2312
## 8 sadness 1191
## 9 surprise 534
## 10 trust 1231
- Bing: there are 4,782 words categorized as negative, and 2,006 as positive.
- NRC: there are 3,324 words categorized as negative, and 2,312 as positive.
The proportion of negative words in the Bing lexicon is much higher than the proportion of negative words in the NRC lexicon.
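To put a number on that comparison, the short sketch below computes the negative share of each lexicon from the counts shown above (considering only the positive and negative categories of NRC).

```r
library(dplyr)

# Word counts per lexicon, copied from the two outputs above
lexicon_sizes <- tibble(
  lexicon  = c("Bing", "NRC"),
  negative = c(4782, 3324),
  positive = c(2006, 2312)
) %>%
  # share of negative words among all positive/negative words
  mutate(share_negative = negative / (negative + positive))

lexicon_sizes
```

This gives roughly 70% negative words for Bing against roughly 59% for NRC.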
Let’s count how many words in our text are categorized for each sentiment:
# Bing lexicon
tidy_news %>%
left_join(get_sentiments("bing"), by = "word") %>%
group_by(sentiment) %>%
summarize(sum = n())
## # A tibble: 3 x 2
## sentiment sum
## <chr> <int>
## 1 negative 442853
## 2 positive 195508
## 3 <NA> 4706155
# nrc lexicon
tidy_news %>%
left_join(get_sentiments("nrc"), by = "word") %>%
group_by(sentiment) %>%
summarize(sum = n())
## # A tibble: 11 x 2
## sentiment sum
## <chr> <int>
## 1 anger 306856
## 2 anticipation 309873
## 3 disgust 127805
## 4 fear 438814
## 5 joy 168191
## 6 negative 555330
## 7 positive 518200
## 8 sadness 270419
## 9 surprise 150105
## 10 trust 363695
## 11 <NA> 4027259
- For Bing: 442,853 words are categorized as negative and 195,508 as positive.
- For NRC: 555,330 words are categorized as negative and 518,200 as positive.
In summary, the NRC lexicon categorized considerably more of the words than the Bing lexicon did.
Let’s now see how the AFINN lexicon categorizes the words, as it’s the only lexicon in the tidytext package we haven’t touched yet! The AFINN lexicon gives each word a score from -5 (most negative sentiment) to +5 (most positive sentiment).
AFINN
headlines_afinn <- tidy_news %>%
left_join(get_sentiments("afinn"), by = "word") %>%
filter(!grepl('[0-9]', word))
# count NA category
headlines_afinn %>%
summarize(NAs= sum(is.na(score)))
## # A tibble: 1 x 1
## NAs
## <int>
## 1 4584704
headlines_afinn %>%
select(score) %>%
mutate(sentiment = if_else(score > 0,
"positive", "negative",
"NA")) %>%
group_by(sentiment) %>%
summarize(sum = n())
## # A tibble: 3 x 2
## sentiment sum
## <chr> <int>
## 1 NA 4584704
## 2 negative 469070
## 3 positive 204720
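Because AFINN assigns numeric scores and not labels, a headline’s sentiment can be computed as the sum of its word scores, with uncovered words counting as zero. The sketch below shows this on made-up data: `mini_afinn` is a hypothetical four-word lexicon in AFINN’s format with illustrative scores, and `toy_words` is a hypothetical set of tokens. Note also that in recent tidytext releases the AFINN score column is, as far as I know, named value and not score, so the code in this post may need that rename.

```r
library(dplyr)

# Hypothetical mini lexicon in AFINN's format (scores are illustrative)
mini_afinn <- tibble(
  word  = c("win", "boost", "crash", "murder"),
  score = c(4, 1, -2, -2)
)

# Hypothetical tokens from three headlines
toy_words <- tibble(
  linenumber = c(1, 1, 2, 2, 3),
  word = c("police", "crash", "gold", "win", "council")
)

toy_scores <- toy_words %>%
  left_join(mini_afinn, by = "word") %>%
  mutate(score = coalesce(score, 0)) %>%   # words not in the lexicon score 0
  group_by(linenumber) %>%
  summarize(sentiment = sum(score))        # one net score per headline

toy_scores
```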
AFINN left 4,584,704 of the 5,258,494 words uncategorized (after dropping tokens that contain digits). Let’s visualize how each lexicon scores the headlines.
afinn_scores <- headlines_afinn %>%
replace_na(replace = list(score = 0)) %>%
group_by(index = linenumber %/% 10000) %>%
summarize(sentiment = sum(score)) %>%
mutate(lexicon = "AFINN")
# combine the Bing and NRC lexicons into one data frame:
bing_nrc_scores <- bind_rows(
tidy_news %>%
inner_join(get_sentiments("bing")) %>%
mutate(lexicon = "Bing"),
tidy_news %>%
inner_join(get_sentiments("nrc") %>%
filter(sentiment %in% c("positive",
"negative"))) %>%
mutate(lexicon = "NRC")) %>%
# from here we count the sentiments,
## then spread on positive/negative, then create the score:
count(lexicon, index = linenumber %/% 10000,
sentiment) %>%
spread(sentiment, n, fill = 0) %>%
mutate(lexicon = as.factor(lexicon),
sentiment = positive - negative)
# combine all lexicons into one data frame
all_lexicons <- bind_rows(afinn_scores, bing_nrc_scores)
lexicon_cols <- c("AFINN" = "#E41A1C",
"NRC" = "#377EB8", "Bing" = "#4DAF4A")
all_lexicons
## # A tibble: 333 x 5
## index sentiment lexicon negative positive
## <dbl> <dbl> <chr> <dbl> <dbl>
## 1 0 -5364 AFINN NA NA
## 2 1 -3761 AFINN NA NA
## 3 2 -3952 AFINN NA NA
## 4 3 -4294 AFINN NA NA
## 5 4 -4481 AFINN NA NA
## 6 5 -4077 AFINN NA NA
## 7 6 -4979 AFINN NA NA
## 8 7 -4659 AFINN NA NA
## 9 8 -5018 AFINN NA NA
## 10 9 -5087 AFINN NA NA
## # ... with 323 more rows
all_lexicons %>%
ggplot(aes(index, sentiment, fill = lexicon)) +
geom_col() +
facet_wrap(~lexicon, ncol = 1, scales = "free_y") +
scale_fill_manual(values = lexicon_cols) +
ggtitle("Comparison of Sentiments") +
labs(x = "Index of All Headlines From 2003-2017" ,
y = "Sentiment Score") +
theme_bw() +
theme(axis.text.x = element_blank())
We can see that the AFINN and Bing sentiment scores have been negative across the whole period; there’s really no positive sentiment at all! In the latest index the negative score is much smaller. Is the trend changing? Probably not yet: the final bucket holds fewer than 4,000 headlines instead of 10,000, which by itself shrinks the total, so we need more data to confirm any change.
Generally, across all lexicons, the sentiment of the headlines has been negative.
Summary
We can see from the analysis that negative sentiment has been dominating media headlines in Australia since 2003, with fear being the dominant emotion. Most of this negative sentiment came from reporting on crime, car crashes, and similar events. These types of headlines are presumably the most appealing to readers, which may be why their term frequencies are high. However, the three lexicons we used in this analysis failed to categorize a large share of the words in the headlines. To be exact, there are 4,706,155, 4,027,259, and 4,584,704 words that are not categorized by the Bing, NRC, and AFINN lexicons respectively.
It is important to note that the lexicons in the tidytext package are not the be-all and end-all for text and sentiment analysis. One can even create their own lexicons through crowd-sourcing (such as Amazon Mechanical Turk, which is how some of the lexicons shown here were created), or from word lists accrued by your own company over years of dealing with customer or employee feedback. It would be interesting to compare this dataset with headlines from another country. For example, we could compare the most distinctive terms used in headlines from different countries using the tf-idf statistic.
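As a rough sketch of what such a comparison could look like, the code below applies tidytext’s bind_tf_idf to made-up word counts for two sources. The `source_words` tibble, the country labels, and all counts are hypothetical; a real comparison would need a second headline dataset.

```r
library(dplyr)
library(tidytext)

# Hypothetical word counts for two made-up headline sources
source_words <- tibble(
  country = c("australia", "australia", "australia", "uk", "uk", "uk"),
  word    = c("police", "council", "bushfire", "police", "council", "brexit"),
  n       = c(5, 3, 2, 4, 4, 2)
)

source_tf_idf <- source_words %>%
  bind_tf_idf(word, country, n) %>%  # adds tf, idf and tf_idf columns
  arrange(desc(tf_idf))

source_tf_idf
```

Words that appear in both sources, such as “police”, get an idf of zero and therefore a tf-idf of zero, while words unique to one source rise to the top. That is what makes the statistic useful for finding the terms that characterize each country’s headlines.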