Save this notebook's workspace
save.image(file = "c:/users/juand/desktop/r/hello.RData")
In this notebook, we examine the differences between the authors Brackett and Cummings in three ways. First, we analyse their books by pairing them with the NRC sentiment lexicon, which gives a possible ordering of the emotions conveyed by the works and shows how they vary over the course of each book. Second, we use the AFINN sentiment lexicon to give words a score from -5 to 5, and obtain a "plot profile" of each book by aggregating these scores over chunks of the story. Lastly, we perform a principal components analysis to look for possible stylistic differences between the authors.
Adding essential libraries
install.packages(c("tidytext","textdata","gutenbergr","ggplot2","tidyr","janeaustenr","stringr","devtools","curl"))
# ggplotly() is a function provided by the plotly package, not a package of its own, so there is nothing to install under that name.
install.packages(c("scales","plotly"))
Downloading the books from PG
library("gutenbergr")
cummings1 <- gutenberg_download(19066)
Determining mirror for Project Gutenberg from http://www.gutenberg.org/robot/harvest
Using mirror http://aleph.gutenberg.org
cummings2 <- gutenberg_download(61884)
brackett1 <- gutenberg_download(32664)
brackett2 <- gutenberg_download(64043)
Activating necessary libraries for text cleaning
library(dplyr)
library(stringr)
library(tidytext)
library(textdata)
Transforming the books into tidy format
# For each book: number the lines, split the text into one word per row, and drop stop words.
tidy_cummings1 <- cummings1 %>% mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text) %>% anti_join(stop_words, by = "word")
tidy_brackett1 <- brackett1 %>% mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text) %>% anti_join(stop_words, by = "word")
tidy_cummings2 <- cummings2 %>% mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text) %>% anti_join(stop_words, by = "word")
tidy_brackett2 <- brackett2 %>% mutate(linenumber = row_number()) %>%
  unnest_tokens(word, text) %>% anti_join(stop_words, by = "word")
all_books <- rbind(tidy_cummings1, tidy_cummings2, tidy_brackett1, tidy_brackett2)
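To make the tidy format concrete, here is a minimal sketch on a made-up two-line text rather than on the downloaded books. The `toy` tibble and its contents are invented for illustration; `unnest_tokens()`, `anti_join()` and `stop_words` are the same tidytext and dplyr pieces used above.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for a gutenberg_download() result: one row per line of text.
toy <- tibble(
  gutenberg_id = 1L,
  text = c("The ship fell toward the red planet.",
           "She was not afraid of the dark.")
)

toy_tidy <- toy %>%
  mutate(linenumber = row_number()) %>%   # remember which line each word came from
  unnest_tokens(word, text) %>%           # one lowercase word per row, punctuation dropped
  anti_join(stop_words, by = "word")      # remove common words such as "the" and "of"

toy_tidy
```

Each remaining row is a single word together with the book ID and the line it came from, which is the shape the later joins against the sentiment lexicons rely on.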
1. Sentiment analysis with NRC
For reference, the eight NRC emotions used here are anger, anticipation, disgust, fear, joy, sadness, surprise and trust.
nrc <- get_sentiments("nrc")
sentiments <- c("anger","anticipation", "disgust", "fear", "joy", "sadness", "surprise", "trust")
Make a table with one row per line of each book (the per-book line counts are hard-coded), then count the words of each sentiment on every line.
lines <- c(1:9801,1:2298,1:3157,1:3137)
books <- c(rep(19066, 9801),rep(61884, 2298),rep(32664, 3157),rep(64043, 3137))
sentiment_table <- data.frame(books, lines, anger = integer(18393), anticipation = integer(18393), disgust = integer(18393), fear = integer(18393), joy = integer(18393), sadness = integer(18393), surprise = integer(18393), trust = integer(18393))
Add the per-line counts of each sentiment to sentiment_table.
for (i in seq_along(sentiments)) {
  # Count the words of sentiment i on each line of each book
  new_table <- all_books %>%
    inner_join(nrc %>% filter(sentiment == sentiments[i]), by = "word") %>%
    count(gutenberg_id, index = linenumber)
  # Copy each count into the matching (book, line) row of sentiment_table
  for (j in seq_len(nrow(new_table))) {
    row <- sentiment_table$books == new_table$gutenberg_id[j] &
      sentiment_table$lines == new_table$index[j]
    sentiment_table[row, sentiments[i]] <- new_table$n[j]
  }
}
Next, we count the number of words per sentiment in every 100-line chunk. This is done with explicit loops, as we didn't find a ready-made function to group the chunks.
sectors <- c(0:122, 0:28, 0:39, 0:39)
books <- c(rep(19066, 123), rep(61884, 29), rep(32664, 40), rep(64043, 40))
sentiment_summary <- data.frame(sectors, books, anger = integer(232), anticipation = integer(232), disgust = integer(232), fear = integer(232), joy = integer(232), sadness = integer(232), surprise = integer(232), trust = integer(232))
for (i in seq_along(sentiments)) {
  s <- sentiments[i]
  col <- sentiment_table[[s]]
  for (j in seq_len(nrow(sentiment_table))) {
    # Lines 0-99 fall in sector 0, lines 100-199 in sector 1, and so on
    index <- sentiment_table$lines[j] %/% 100
    row <- sentiment_summary$books == sentiment_table$books[j] & sentiment_summary$sectors == index
    sentiment_summary[row, s] <- sentiment_summary[row, s] + col[j]
  }
}
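The nested loops above can probably be replaced by a single grouped count, since `linenumber %/% 100` can be computed directly inside `count()`. The sketch below shows the idea on invented data: `toy_words` and `toy_lexicon` are hypothetical stand-ins for `all_books` and the NRC lexicon, and the result is already in long format (one row per book, chunk and sentiment).

```r
library(dplyr)

# Hypothetical stand-in for all_books: one row per word, with its source line.
toy_words <- tibble(
  gutenberg_id = c(1, 1, 1, 1, 1, 2, 2),
  linenumber   = c(5, 7, 99, 100, 250, 10, 120),
  word         = c("fear", "fear", "happy", "fear", "happy", "fear", "fear")
)

# Hypothetical two-word lexicon in the same shape as get_sentiments("nrc").
toy_lexicon <- tibble(
  word      = c("fear", "happy"),
  sentiment = c("fear", "joy")
)

chunk_counts <- toy_words %>%
  inner_join(toy_lexicon, by = "word") %>%
  # Integer division maps lines 0-99 to chunk 0, lines 100-199 to chunk 1, and so on.
  count(gutenberg_id, sector = linenumber %/% 100, sentiment, name = "value")

chunk_counts
```

Chunks that contain no word of a given sentiment are simply absent from this result; `tidyr::complete()` could fill them in with zeros if the plot needs every chunk to be present.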
We pivot the sentiment columns into long format to tidy the data, and then plot each book as a streamgraph. The streamgraph function appears to expect date values by default, so we convert the sectors to "dates":
devtools::install_github("hrbrmstr/streamgraph")
library(streamgraph)
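library(tidyr)  # provides pivot_longer()
sentiment_summary <- pivot_longer(sentiment_summary, anger:trust, names_to = "sentiment")
# Use each sector number as the year of a date, since streamgraph() is given dates here
sentiment_summary$sectors <- as.Date(ISOdate(sentiment_summary$sectors, 1, 1))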
sg1 <- sentiment_summary %>% filter(books==19066) %>% streamgraph("sentiment", "value", date="sectors", interpolate="step")
sg2 <- sentiment_summary %>% filter(books==61884) %>% streamgraph("sentiment", "value", date="sectors", interpolate="step")
sg3 <- sentiment_summary %>% filter(books==32664) %>% streamgraph("sentiment", "value", date="sectors", interpolate="step")
sg4 <- sentiment_summary %>% filter(books==64043) %>% streamgraph("sentiment", "value", date="sectors", interpolate="step")
sg1
sg2
sg3
sg4
2. Plot profile using AFINN
We split each book into chunks of 100 lines and use the AFINN lexicon to give every word a value. We then sum these values within each chunk to get a tentative sentiment score.
library(tidyverse)
# Note: despite its name, sentiments_bing holds AFINN scores, not Bing labels.
# Summing the word values per chunk gives the same score as weighting the count of
# each value from -5 to 5, and does not fail when some value never occurs.
sentiments_bing <- all_books %>%
  inner_join(get_sentiments("afinn"), by = "word") %>%
  group_by(gutenberg_id, index = linenumber %/% 100) %>%
  summarise(val = sum(value), .groups = "drop")
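Whichever way it is computed, the score of a chunk is the sum of the AFINN values of the words that fall in it. A small worked example in base R, with made-up scores (the `scored` data frame is hypothetical and not taken from the books):

```r
# Hypothetical scored words: each row is one word occurrence with its AFINN value.
scored <- data.frame(
  linenumber = c(12, 40, 87, 130, 199),
  value      = c(-3, 2, -1, 4, -2)
)

# Chunk index: lines 0-99 go to chunk 0, lines 100-199 to chunk 1.
scored$index <- scored$linenumber %/% 100

# Sum the word values within each chunk.
chunk_score <- tapply(scored$value, scored$index, sum)
chunk_score
# chunk 0: -3 + 2 - 1 = -2; chunk 1: 4 - 2 = 2
```

A negative total marks a chunk dominated by negatively scored words, which is what gives the plot profile its dips.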
We then plot them:
library(plotly)
sentiments_bing$gutenberg_id <- sapply(sentiments_bing$gutenberg_id, toString)
sentiments_bing[sentiments_bing == 19066] <- "Brigands - Cummings"
sentiments_bing[sentiments_bing == 61884] <- "War Nymphs - Cummings"
sentiments_bing[sentiments_bing == 32664] <- "Black Amazon - Brackett"
sentiments_bing[sentiments_bing == 64043] <- "Enchantress - Brackett"
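An alternative way to attach readable titles is a named lookup vector, which avoids comparing every cell of the data frame against each ID. This is only a sketch: `titles` and `toy_scores` are names introduced here for illustration, and the labels are copied from the assignments above.

```r
# Lookup from Project Gutenberg ID (as character) to the label used in the plot.
titles <- c(
  "19066" = "Brigands - Cummings",
  "61884" = "War Nymphs - Cummings",
  "32664" = "Black Amazon - Brackett",
  "64043" = "Enchantress - Brackett"
)

# Hypothetical miniature of sentiments_bing with numeric IDs.
toy_scores <- data.frame(
  gutenberg_id = c(19066, 32664, 64043),
  index        = c(0, 0, 1),
  val          = c(-4, 7, -1)
)

# Index the lookup by the ID converted to character; unname() drops the ID names.
toy_scores$gutenberg_id <- unname(titles[as.character(toy_scores$gutenberg_id)])
toy_scores
```

Only the ID column is touched, so a score that happened to equal one of the IDs could not be overwritten by mistake.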
library(ggplot2)
p <- ggplot(sentiments_bing, aes(index, val, fill = gutenberg_id)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~gutenberg_id, ncol = 2, scales = "free_x") +
  labs(x = "novel segment", y = "sentiment score")
ggplotly(p)
3. Principal Components Analysis
We are interested in the differences in vocabulary between the two authors. To pool each author's books, we replace the book IDs with the author's name.
all_books$gutenberg_id <- sapply(all_books$gutenberg_id, toString)
all_books[all_books == 19066] <- "Cummings"
all_books[all_books == 61884] <- "Cummings"
all_books[all_books == 32664] <- "Brackett"
all_books[all_books == 64043] <- "Brackett"
frequency <- all_books %>%
  count(gutenberg_id, word) %>%
  group_by(gutenberg_id) %>%
  mutate(proportion = n / sum(n)) %>%
  select(-n) %>%
  spread(gutenberg_id, proportion)
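To see what `frequency` contains, here is the same kind of computation on a tiny invented word list, written in base R so that it runs without the tidyverse. `toy` is hypothetical; the point is only that each author's counts are divided by that author's total, and that a word one author never uses ends up as `NA` for that author.

```r
# Hypothetical word occurrences, one row per word token.
toy <- data.frame(
  author = c("Brackett", "Brackett", "Brackett", "Cummings", "Cummings"),
  word   = c("mars", "mars", "sword", "mars", "ray")
)

# Word counts per author: rows are words, columns are authors.
counts <- table(toy$word, toy$author)

# Divide each column by its total to get within-author proportions.
props <- prop.table(counts, margin = 2)

# Zero counts become NA, mirroring what spread() produces for missing combinations.
props[counts == 0] <- NA
props
```

Here "mars" makes up two thirds of the Brackett tokens and half of the Cummings tokens, while "ray" and "sword" each have an `NA` for the author who never uses them. Rows like these are the ones a plot drops with a missing-values warning.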
# expect a warning about rows with missing values being removed
t <- ggplot(frequency, aes(x = `Brackett`, y = `Cummings`, color = abs(`Cummings` - `Brackett`))) +
  geom_abline(color = "gray40", lty = 2) +
  geom_jitter(alpha = 0.1, size = 2.5, width = 0.3, height = 0.3) +
  geom_text(aes(label = word), check_overlap = TRUE, vjust = 1.5) +
  scale_color_gradient(limits = c(0, 0.001), low = "darkslategray4", high = "gray75") +
  scale_x_log10() +
  scale_y_log10() +
  theme(legend.position = "none") +
  labs(y = "Cummings", x = NULL)
ggplotly(t)