Twitter Scraping Using R

“Big Reputation, Big Reputation, Ooh you and me we’ve got Big Reputations”

Taylor Swift has just released her 6th studio album ‘Reputation’. The old Taylor is dead, and is her place is a new edgier Taylor, toughened from the years of media scrutiny, turbulent relationships and high profile celebrity feuds. As the title suggests, this is an album all about the contrast in how the world sees you to compared to who you really are and how a negative portrayal can affect your relationships. Whether you like the album or not (personally I love it), this post is not really about Taylor swift. This is about my first experience delving into the world of twitter scraping.

Twitter was created in 2006 and since then has rapidly grown into a world wide social network with around 330 million active users in 2017. Twitter along with other social media giants such as Facebook, youtube, linkedIn, Instragram etc., have revolutionised the way we share news and interact with both friends and celebrities. Every day we leave behind a trail of our virtual movements including comments, likes, networks of friends and even just pages we view. With so much data being produced every day, companies are eager to collect this data and use it to provide customer insight and improve products and services. For example, a company can search for tweets mentioning their name to see how they are being perceived by the public or they can analyse their followers to look for patterns and expand their presence online.  Accessing data in this way is often called social media mining.

However social media mining is not just for big corporations or academics. With open software such as R or python, we can all access social media data and use it in our own projects.  Each social media has a slightly different way of accessing data so for this case study I am focusing on one of the most straight forward ways using twitter.  Twitter is a great place to start as unlike other platforms, almost all user’s tweets are public and accessible. You can pick any topic to scrap data on, and as an example I decided to focus on a celebrity with a huge twitter presence - Taylor Swift. Since her whole album focusses on her supposedly bad reputation, I wondered whether there was evidence to back this up and twitter provides an ocean of data to work with.

In order to extract data from twitter, you can use their API. This stands for application program interface and in simple terms it is a software intermediary that allows two applications to talk to each other. In our case, the API allows us to access twitter data and bring it directly into our program in the format we want. There are packages available in both R (‘twitteR’) and python (‘tweepy’) that make accessing the API and analysing specific tweets straightforward. Before you start you need to sign up for your own twitter access keys from the apps.twitter webpage. This is free but you need your own twitter account and a little bit of time to go through the steps. Extracting tweets can then be done in a few lines of code. All you need is a key word (or more than 1 if you like), the number of tweets you want to pull and the type of tweets (most recent or most popular for example).

consumer_key <- 'xxxx'
consumer_secret <- 'xxxxxx'
access_token <- 'xxxxx'
access_secret <- 'xxxxx'

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)
TS <-searchTwitter("@taylorswift13", n=10, lang="en", resultType = "mixed")
TS_text <- sapply(TS, function(x) x$getText())

After extracting tweets, the next step is to perform some simple text cleaning. In R this can be done using the ‘tm’ package. Example steps might include removing numbers, punctuation, emoticons and blank space; changing all letters to lowercase and removing non-interesting words for example the key words you searched with and small words such as ‘and’ or ‘of’ known as stopwords.

TS_corpus <- Corpus(VectorSource(TS_clean2))
inspect(TS_corpus[1])
TS_clean <- tm_map(TS_corpus, function(x) iconv(x, to='UTF-8-MAC', sub='byte'))
TS_clean <- tm_map(TS_clean, removePunctuation)
TS_clean <- tm_map(TS_clean, content_transformer(tolower))
TS_clean <- tm_map(TS_clean, removeWords, stopwords("english"))
TS_clean <- tm_map(TS_clean, removeNumbers)
TS_clean <- tm_map(TS_clean, stripWhitespace)
TS_clean <- tm_map(TS_clean, removeWords, c("taylor swift"))
TS_clean <- tm_map(TS_clean, removeWords, c("taylorswift"))
TS_clean <- tm_map(TS_clean, removeWords, c("@taylorswift13"))
TS_clean <- tm_map(TS_clean, removeWords, c("taylor"))
TS_clean <- tm_map(TS_clean, removeWords, c("swifts"))
TS_clean <- tm_map(TS_clean, removeWords, c("taylorswifts"))
From this point it’s up to you what you want to do with all your, now cleaned, twitter data. For my project, I started by having a look at the most common words using a wordcloud. This creates an attractive looking visualisation of the most common words. The wordcloud created from 10,000 tweets about Taylor swift is shown below. Most common words include the name of her new album ‘Reputation’, words relating to her music and sales: ‘sales’, ‘billboard’, ‘chartdata’ and other miscellaneous words ‘week’ ,’total’ ‘pure’. This gives some insight into what sorts of things people have been saying about Taylor but it does not reveal that much about her reputation.

Therefore to answer my original question of whether Taylor Swift really has a big reputation I need to do some more complex analysis. What precisely are people saying about Taylor? Are the comments nasty or nice? Do they talk about her love life and celebrity feuds as much as her music would have us believe? With just a few tweets, you could easily go through each individually and categorise them, deciding whether the author was slating her or raving about her. However with a list of thousands of tweets this becomes more difficult.

One solution for this is to use text sentiment analysis. The idea is that you split a sentence into words and look to see how many positive and negative words are in the sentence. A positive word is given + 1 and a negative word is given -1. The sum of all the words gives an overall score which indicates whether the tweet is overall positive or negative. You can then score all the tweets and calculate summary statistics e.g. mean, median, standard deviation. For my Taylor Swift tweets the mean score was 0.26 with a standard deviation of 0.68. The distribution of scores is shown in the bar-plot below.
So what does this tell us? Well in general the tweets are positive. More people are saying nice things about her than negative - things aren’t as bad as you think Taylor! Of course that is just the average and we can see from the graph there are still negative tweets out there. In addition reputation tends to be relative. Perhaps if everyone else has a much higher mean score and less negative tweets then we might understand why Taylor feels hard done by. I re-ran the same analysis looking at some other celebrities. The table below shows the summary statistics for the following celebs: Best pal Selena Gomez, ex boyfriend Calvin Harris, rivals Katy Perry and Kim Kardashian, queen of pop Beyonce and finally US president with a definite big reputation, Donald Trump.

In comparison to these celebrities Taylor Swift has the third highest mean score so comparatively her tweets are more positive than most! Looking at the standard deviation, Taylor has the fourth highest. This represents how much variation there is between the scores of the tweets and therefore those with a higher standard deviation are likely to divide opinion more. Perhaps then, those celebrities with a larger standard deviation have the bigger reputation. (This is an overly simplified way to decide on a celebrities reputation of course.) Using this method, we can infer that Donald Trump and Katy Perry actually have the biggest reputation, although Taylor is not that far behind.

This process of text sentiment analysis gives us some good indication of positivity or negativity. However what about more specific topics. What are people saying about her love life, her music, her friends? Although more difficult, it is possible to select tweets with words relating to these topics in a similar way. The bar-plot below shows the number of tweets about each topic - note that a tweet might fall into more than 1 category.

So unsurprisingly the biggest talking point is her music, followed by tweets about her friends/enemies. Relatively few people have been talking about her love life! Perhaps since finding love with her currently boyfriend Joe Alwyn there isn’t as much to talk about as there used to be.

Conclusion

So to answer my original question of does Taylor swift have a big reputation? The evidence collected suggests that her reputation is similar to that of other celebrities tested, although perhaps not as much a Donald trump or Katy Perry. In general her reputation is positive and the majority of tweets are about her music, as oppose to gossip about her love life or enemies. Therefore since the new album has come out it looks like her reputation has made a recovery and now, she’s doing Ok.

Of course there is plenty more I could have done with my twitter data. My initial analysis only uses a small sample of 10,000 tweets taken one day in early January. It is likely that the outcome of the analysis would change depending on when I extracted the data and how many tweets I looked at. An extension to this research might look at how variable tweets about Taylor are from one day to the next  and how her ‘reputation’ changes over time.  It may also look at more complex text analysis techniques, for example looking for patterns in the words and forming clusters of the words that often appear together (cluster analysis).

As for twitter scraping, accessing tweets and running your own analysis is relatively simple and  accessible to anyone with a twitter account and R/python. There is much that can be found out by looking at tweets (aside from analysing celebrities) that could be of benefit to many organisations, e.g. monitoring public opinion over time, customer feedback or identifying new customers. It is a fascinating data source and I would highly recommend investigating it for yourself!