After hearing one of Scott Morrison's recent speeches I began to wonder if transcripts of all his speeches and interviews were publicly available. It turns out they are! His speeches and interviews are part of the public record and available on his official website. Little did I realise how fun this task would be.
This article is broken into three parts:
- Building the prime ministerial dataset
- Sentiment analysis
- Identifying words that are most important in each record
Part 1: Building the Dataset Using the RVEST Package in R
Here is the code I wrote to scrape and organise the dataset. You can copy and paste it directly into RStudio and try it out for yourself! First we need to load some essential libraries and set up a tibble to iterate over the pages. Finding the maximum page number of 108 was simply trial and error, and it will change over time.
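A minimal setup might look like the following. The URL pattern is an assumption for illustration; check the site's actual pagination scheme before running it.

```r
# Load rvest for scraping and the tidyverse for iteration and tidying
library(rvest)
library(tidyverse)

# The media archive paginates; 108 was found by trial and error and
# will grow over time. NOTE: the URL pattern below is illustrative.
media_pages <- tibble(
  page_no  = 1:108,
  page_url = str_c("https://www.pm.gov.au/media?page=", page_no)
)
```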
Now we need a function that collects the links to the articles we want to scrape. The CSS tags were identified using the SelectorGadget Chrome extension. We take the opportunity in this function to save some of the metadata, such as the date, which will be used later for a time series analysis.
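A sketch of such a function is below. The CSS selectors (".media-title a", ".date-display-single") are placeholders, not the site's real ones; identify the real selectors with SelectorGadget before using this.

```r
# Harvest article titles, links, and dates from one listing page.
# Selectors are placeholders -- replace with the ones SelectorGadget reports.
get_link_df <- function(page_url) {
  page <- read_html(page_url)

  tibble(
    title = page %>% html_elements(".media-title a") %>% html_text2(),
    link  = page %>% html_elements(".media-title a") %>% html_attr("href"),
    date  = page %>% html_elements(".date-display-single") %>% html_text2()
  )
}

# Test on a single page before scaling up:
# get_link_df(media_pages$page_url[1])
```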
Once the function is loaded into the global environment, we can test that it works on a single page and make any desired changes to get_link_df before moving forward.
Once we are happy with the data returned by the function, we can map it over the media_pages tibble we prepared earlier. This will take a little time because we are scraping and collecting data from 108 different web pages. purrr lets you iterate without for loops and returns the data in a tidy list-column format, which is so much simpler to manage than for loops, don't you think?
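The mapping step could look like this. Wrapping the scraper in possibly() is a defensive choice (an assumption on my part, not necessarily the original code) so that one failed page doesn't abort the whole run.

```r
# Map the link-harvesting function over every listing page.
# possibly() returns NULL for a failing page instead of stopping the run.
link_data <- media_pages %>%
  mutate(links = map(page_url, possibly(get_link_df, otherwise = NULL)))
```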
Next we use the unnest function to create a data frame of links and metadata, which we save as a CSV.
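That step is short; the file name here is just an example:

```r
# Flatten the list column into one row per article and save it
article_links <- link_data %>%
  unnest(links)

write_csv(article_links, "pm_article_links.csv")
```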
So far we have a data frame with the links to each record along with some metadata. Now we need a function to extract the actual text from each record. In this function we use rvest to extract the HTML and then some dplyr to annotate the text with the speaker's name using the separate function. It's not perfect, because it splits the text wherever there is a colon (":"), but it does the job. We can clean up text that is obviously not a speaker name later.
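A sketch of that extraction function follows. The content selector and column names are assumptions; the separate() call splits each paragraph at the first colon so the speaker lands in its own column.

```r
# Pull the body text of one article and split out the speaker name.
# ".media-release-content p" is a placeholder selector.
get_content <- function(link) {
  read_html(link) %>%
    html_elements(".media-release-content p") %>%
    html_text2() %>%
    tibble(text = .) %>%
    # Split at the first colon only; lines without a colon keep NA as speaker
    separate(text, into = c("speaker", "text"),
             sep = ":", extra = "merge", fill = "left")
}
```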
Now it's time to scrape the text from every article. Again we use the power of the purrr package to iterate over all the links and extract the content. Finally, we save the full content as a CSV file.
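Under the same assumptions as above, the final scraping pass might be:

```r
# Scrape every article; again possibly() shields the run from bad links
full_content <- article_links %>%
  mutate(content = map(link, possibly(get_content, otherwise = NULL))) %>%
  unnest(content)

write_csv(full_content, "pm_full_content.csv")
```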
Part 2: Turning Words Into Numbers
Measuring the changes in the PM's sentiment over the months since the Covid outbreak was one of the first things I wanted to explore. Many words can carry either a positive or negative sentiment. Take the word "good", for example: by itself it would generally be positive, unless it is preceded by a word such as "not". Before getting into the nitty gritty of word embeddings, it is a simple process to plot the raw sentiment using the get_sentiments() function in the tidytext package. Words are classified as either positive or negative using a sentiment lexicon from Bing Liu and collaborators, and sentiment is estimated by subtracting the number of negative words from the number of positive words.
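A sketch of that calculation, assuming a full_content data frame with text and date columns and a day-month-year date format (both assumptions):

```r
library(tidytext)
library(lubridate)

# Tokenise, score each word against the Bing lexicon, and net out by month
monthly_sentiment <- full_content %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  mutate(month = floor_date(dmy(date), "month")) %>%
  count(month, sentiment) %>%
  pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(sentiment = positive - negative)   # net sentiment per month
```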
Plotting the sentiment over time shows how Prime Minister Scott Morrison's language scored very negatively during the early months of the Covid outbreak in 2020 before trending back up. Now, with difficulties in the vaccine rollout and lockdowns in Melbourne and Sydney, the PM's sentiment has been trending down toward July 2021, as shown by the blue line. Maybe winning the bid to host the 2032 Olympics in Brisbane will help him reverse this trend.
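A plot along those lines can be sketched with ggplot2; geom_smooth() is my stand-in for the blue trend line described above:

```r
# Monthly net sentiment with a smoothed trend line in blue
ggplot(monthly_sentiment, aes(month, sentiment)) +
  geom_col() +
  geom_smooth(colour = "blue", se = FALSE) +
  labs(x = NULL, y = "Net sentiment (positive - negative words)")
```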
Now let's dig into how to pick out words that uniquely characterised Scott Morrison's public speeches and interviews since 2020.
Part 3: Finding Words That Mean the Most (TF-IDF)
After web scraping the full text using the rvest package and exploring trends in sentiment, it was time to understand which words mattered most to the PM each month. The most uniquely important words for each month were found using Term Frequency - Inverse Document Frequency (TF-IDF). This technique down-weights very common words that appear every month, such as "people" and "Australians", so that we can focus on the words most specific to the PM's dialogue in each month.
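With tidytext, TF-IDF is a one-liner once the words are counted per month (treating each month as a "document"). The date parsing again assumes a day-month-year format:

```r
# Count words per month and weight them by TF-IDF
monthly_tf_idf <- full_content %>%
  unnest_tokens(word, text) %>%
  mutate(month = floor_date(dmy(date), "month")) %>%
  count(month, word) %>%
  bind_tf_idf(word, month, n)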
The plots below show the top 10 most important words used in Scott Morrison's speeches and interviews from 2020 to now. For example, "bushfires", "disaster" and "assistance" were most important in Jan 2020 and in May 2020 the PM was focused on the "covidsafe app". Fast forward to July 2021 and it is "lockdown", "vaccination" and "doses" that dominate the PM's dialogue.
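A faceted plot of this kind can be sketched as follows, using tidytext's reorder_within() so the bars sort correctly inside each month's panel:

```r
# Top 10 TF-IDF words per month, one panel per month
monthly_tf_idf %>%
  group_by(month) %>%
  slice_max(tf_idf, n = 10) %>%
  ungroup() %>%
  ggplot(aes(tf_idf, reorder_within(word, tf_idf, month))) +
  geom_col() +
  scale_y_reordered() +
  facet_wrap(vars(month), scales = "free_y") +
  labs(x = "TF-IDF", y = NULL)
```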
- In this article, we demonstrated how to scrape a large text dataset using rvest, leveraging purrr to collect data from multiple page links.
- Once the data were collected and tidied, we analysed the text using the tidytext package.
- We measured sentiment over time by grouping the words by month since the Covid pandemic began in 2020.
- We estimated word importance for each record using TF-IDF.
Now it's your turn. What dataset would you like to create and analyse? There are many other methods to try. What are your favourites?