When I began immersing myself in data science, I came across Natural Language Processing (NLP). I immediately became fascinated with how a few relatively simple techniques can be used to analyze the most accessible and ubiquitous data there is: the words we write and speak. In this blog post, I will demonstrate how to use Python to retrieve some written text from the internet and do a simple analysis of its contents and the polarity of its sentiments. In other words: Given a topic, what are people saying about it, and is it positive or negative?
These days, if you want to get a quick picture of people’s opinions on a topic, there’s hardly a better source of data than Twitter. Thanks to Twitter’s application programming interface (API), it only takes a few lines of code to follow the conversation in real time or to retrieve Tweets from a specific period in the past.
To demonstrate, I picked a random, somewhat controversial topic, and looked at Tweets about synthetic diamonds, also called artificial, fake, lab-grown, man-made, or manufactured diamonds. Over the course of 30 days in May and June of 2021, the polarity of tweets on this topic developed like this:
How did I get to this graph?
The full code and step-by-step instructions are available in this GitHub repository. Here are the highlights: After I set up my Twitter developer account and created a dev environment called “diamonds30”, I used the tweepy Python library to download Tweets containing the search term “synthetic diamonds”:
tweets = api.search_30_day(environment_name='diamonds30', query='synthetic+diamonds', fromDate=202105250000, toDate=202106250000)
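The fromDate and toDate values in the call above appear to follow the yyyyMMddHHmm pattern, that is year, month, day, hour, and minute written as one number. As a small sketch of how such values could be built from Python datetime objects instead of being typed by hand (the helper name to_search_timestamp is my own invention for illustration, not part of tweepy):

```python
from datetime import datetime


def to_search_timestamp(moment: datetime) -> str:
    """Format a datetime as yyyyMMddHHmm (hypothetical helper, not a tweepy function)."""
    return moment.strftime("%Y%m%d%H%M")


# The start and end of the window used in this post
from_date = to_search_timestamp(datetime(2021, 5, 25))
to_date = to_search_timestamp(datetime(2021, 6, 25))
print(from_date, to_date)
```

The resulting strings can then be passed as fromDate and toDate in place of the hard-coded numbers.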
One randomly selected example of such a Tweet:
The Tweets I downloaded are in json format, which is like a little suitcase containing various kinds of additional information (Tweet text, screen name, time of tweeting, hashtags used, the full user profile, etc.). Here, I am mainly interested in the Tweet’s text and time of creation.
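To give an idea of what unpacking that suitcase can look like, here is a minimal sketch that pulls the text and creation time out of one Tweet dictionary. The field names created_at, text, and extended_tweet follow Twitter's standard Tweet JSON as far as I know; the function name and the sample values are made up for illustration, and the code in the repository may do this differently.

```python
from datetime import datetime

# Timestamp layout used in Twitter's Tweet JSON, e.g. "Tue Jun 01 12:30:00 +0000 2021"
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def extract_text_and_time(tweet_json: dict):
    """Return (text, creation time) for one Tweet dictionary (illustrative helper)."""
    # Longer Tweets may carry their complete text inside "extended_tweet"
    extended = tweet_json.get("extended_tweet") or {}
    text = extended.get("full_text") or tweet_json.get("text", "")
    created = datetime.strptime(tweet_json["created_at"], TWITTER_TIME_FORMAT)
    return text, created


# Made-up sample record, not a real Tweet
sample = {
    "created_at": "Tue Jun 01 12:30:00 +0000 2021",
    "text": "placeholder tweet text",
    "user": {"screen_name": "placeholder_user"},
}
text, created = extract_text_and_time(sample)
print(created.isoformat(), text)
```

When working with tweepy, my understanding is that each returned status object keeps the raw dictionary in its `_json` attribute, so that is what would be passed to a function like this one.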
I calculated the polarity of this statement by using another Python library called textblob. Polarity is a measure of how positive a text is, ranging from -1 (most negative) to +1 (most positive).
In : from textblob import TextBlob
In : TextBlob(tweet).sentiment.polarity
Out: 0.175
So this Tweet got a neutral-to-slightly-positive rating of 0.175. Do this for all Tweets found across the 30-day window, and plot it against time, and you’ll get the figure at the beginning of the post!
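One possible way to go from individual scores to a line over time is to average the polarity of all Tweets per day. The sketch below assumes each Tweet has already been scored (for instance with TextBlob(text).sentiment.polarity, as above) and is available as a (creation time, polarity) pair. The function name and the numbers are made up for illustration, and daily averaging is my assumption about a reasonable aggregation, not necessarily what the repository does.

```python
from collections import defaultdict
from datetime import date, datetime
from statistics import mean


def daily_mean_polarity(scored):
    """Map each calendar day to the mean polarity of that day's Tweets.

    scored: iterable of (datetime, polarity) pairs.
    """
    by_day = defaultdict(list)
    for created, polarity in scored:
        by_day[created.date()].append(polarity)
    # Sort by day so the result can be plotted directly against time
    return {day: mean(values) for day, values in sorted(by_day.items())}


# Made-up scores, for illustration only
scored = [
    (datetime(2021, 6, 2, 12, 0), 0.25),
    (datetime(2021, 6, 1, 9, 0), 0.5),
    (datetime(2021, 6, 1, 18, 0), -0.5),
]
daily = daily_mean_polarity(scored)
for day, value in daily.items():
    print(day, value)
```

The keys and values of the resulting dictionary can be handed to any plotting library as the x and y values.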
Content of the Tweets
Before we can process the data, it needs to be cleaned up. In our case, this means converting the text into the appropriate format and removing all elements that would disturb the analysis:
- @mentions of other usernames, used in retweets and replies
- stop words
Stop words are all of those filler words (e.g. articles, pronouns, prepositions) that make up the majority of natural language but carry the least meaning. Examples of stop words in the English language include “a”, “for”, “the”, or “because”. The cleaned-up text of our example Tweet then reads “synthetic diamonds cheaper easier produce sustainable ethical mined consumers prefer traditional gemstones”.
So what are people who mention “synthetic diamonds” talking about? These are the most common words between May and June 2021:
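The cleaning steps described above can be sketched with nothing but the standard library. Please treat this as an illustration under a few assumptions: the stop-word list here is a tiny hand-picked set (a real analysis would use a much longer list from an NLP library), lowercasing and dropping punctuation are my guesses at what "the appropriate format" involves, and the input sentence is made up.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are far longer)
STOP_WORDS = {
    "a", "an", "and", "are", "because", "but", "for",
    "in", "is", "of", "than", "the", "to",
}

MENTION = re.compile(r"@\w+")               # @mentions of other usernames
WORD = re.compile(r"[a-z]+(?:-[a-z]+)*")    # words, keeping inner hyphens ("lab-grown")


def clean_tweet(text: str) -> str:
    """Lowercase the text, drop @mentions and punctuation, and remove stop words."""
    without_mentions = MENTION.sub(" ", text.lower())
    words = WORD.findall(without_mentions)
    return " ".join(word for word in words if word not in STOP_WORDS)


# Made-up input, not a real Tweet
print(clean_tweet("@some_user The synthetic diamonds are cheaper, and easier to produce!"))
# prints: synthetic diamonds cheaper easier produce
```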
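Counting the most common words in the cleaned Tweets takes only a few lines with collections.Counter. This is a minimal sketch; the function name, the option to exclude the search term itself, and the sample texts are my own additions for illustration.

```python
from collections import Counter


def most_common_words(cleaned_tweets, n=10, exclude=()):
    """Return the n most frequent words across already-cleaned Tweet texts."""
    counts = Counter()
    for tweet in cleaned_tweets:
        counts.update(word for word in tweet.split() if word not in exclude)
    return counts.most_common(n)


# Made-up cleaned texts, for illustration only
cleaned = [
    "synthetic diamonds cheaper",
    "synthetic diamonds ethical",
    "diamonds sustainable ethical",
]
print(most_common_words(cleaned, n=3))
print(most_common_words(cleaned, n=1, exclude={"synthetic", "diamonds"}))
```

Excluding the search term is optional, but it keeps words that appear in every Tweet by construction from dominating the result.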
The search term matters
I noticed that great care must be taken in selecting the right search term, as it strongly affects the outcome of the analysis. Although various terms may, objectively, refer to the same item, people use different terms to talk about different aspects of that item. Particularly for topics of commercial or political interest, the wording used by an advertisement or a press item has substantial influence on the wording used in the public conversation. In the figure below, note the increased mentions of “lab-grown diamonds” (green line) after several news outlets reported that Pandora, the world’s biggest jeweller, would soon sell lab-grown diamonds. Similarly, conversation about “fake diamonds” (red line) increased in the aftermath of a gossip “news” story involving the phrase.
The search term also strongly influences the polarity of sentiments, as seen in the two figures below. They show the same data, but in different types of visualization.
The figures illustrate that the terms “lab-grown” and “synthetic” are used in a much more positive manner than “fake” or “artificial”.
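A comparison like this boils down to grouping the polarity scores by search term and summarizing each group. Here is a minimal sketch, assuming the scores have already been collected into one list per term. The function name is hypothetical and the numbers are made up purely to show the mechanics; they are not the results from my data.

```python
from statistics import mean, median


def summarize_polarity_by_term(scores_by_term):
    """Return count, mean, and median polarity for each search term."""
    summary = {}
    for term, scores in scores_by_term.items():
        if not scores:
            continue  # skip terms without any Tweets
        summary[term] = {
            "count": len(scores),
            "mean": mean(scores),
            "median": median(scores),
        }
    return summary


# Made-up scores, for illustration only
scores_by_term = {
    "term A": [0.5, 0.25, 0.0],
    "term B": [-0.5, 0.0, -0.25],
    "term C": [],
}
for term, stats in summarize_polarity_by_term(scores_by_term).items():
    print(term, stats)
```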
To learn more about how you can reproduce the code, please take a look at the GitHub repository.