A DIY Twitter Tracking Tool Using the Elastic Stack
Preface
Watching Australians come out of a second wave of the COVID-19 pandemic while the rest of the world heads into a new one is a mixed feeling. Despite the disparity of hopes and emotional experiences, a common lesson of the new normal is to keep the learning mentality up and going. The lockdown has inspired a surge in Do-It-Yourself (DIY) projects across the world, with many people turning trash and unused items into wonderful crafts or boosting their skillsets. If you're still looking for your next exciting DIY challenge, this tutorial takes you through the steps to deploy a Twitter tracking tool using the Elasticsearch, Logstash, Kibana (ELK) stack.
There are numerous applications for analysing social media data, from disaster awareness and monitoring national mood to discovering new fungi. The use case I have chosen for this tutorial is the Twitter outage that happened on the morning of October 16, 2020 (the evening of October 15 in the US). It was controversial since the outage came one day after Twitter and Facebook took the unprecedented step of restricting the spread of information in the lead-up to the 2020 US election. Although Twitter was quickly brought back online and later backtracked on its decision, the hashtag #twitterdown was trending. I started collecting real-time tweets to dig into collective stances on the massive outage.
This tutorial will show you how to understand and extract insights from tweets, covering:
- Installing the data capture and visualisation tool – ELK stack
- Steps for streaming Twitter data with Logstash Twitter plugin into the data store
- Indexing/Re-indexing Elasticsearch dataset & Visualisation
- Named Entity Recognition with Open-NLP Ingest Processor
The ELK solution
ELK encompasses three main components. Elasticsearch is a search engine built on top of Apache Lucene, a powerful and probably the most popular full-text search library in use globally; it is used to store, search, and analyse data. Logstash is a data ingestion pipeline for Elasticsearch that can dynamically transform and ship data regardless of format; it comes with a Twitter input plugin, which is why we are using it for our use case. Lastly, Kibana is the graphical user interface for visualising the data.
The aim of this tutorial is to give you high-level insight rather than comprehensive training in the ELK stack, but it is worth getting acquainted with the Elastic lingo for a better experience.
Prerequisites – A Twitter Account
To use the Twitter streaming API, you need to have an existing Twitter account, then apply for developer access to create a client application and obtain your consumer API keys for authentication.
Get the ELK stack up and running
A great benefit of the Elastic stack is that it ships with good defaults and requires very little configuration, particularly if it is installed via a package manager such as Homebrew. To install, first tap the Elastic Homebrew repository and then use brew install as follows:
% brew tap elastic/tap
% brew install elastic/tap/elasticsearch-full
% brew install logstash
% brew install elastic/tap/kibana-full
To start Elasticsearch and Kibana, just type their names at the command line (or use brew services if you prefer launchd to manage them as background services). Note that for this tutorial, the ELK stack runs on macOS Catalina with Elasticsearch and Logstash version 7.9.2 along with Kibana version 7.8.1.
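Both services expose HTTP endpoints once they are up, so a quick sanity check looks like this (default ports assumed):

% elasticsearch
% kibana
% curl http://localhost:9200    # should return a JSON blob with cluster and version details
# Kibana's UI is then reachable at http://localhost:5601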
Configure Twitter input
Before running Logstash, you need a configuration file; otherwise, it complains with a message saying "Pipelines YAML file is empty". Below I prepared a configuration file, twitter_pipeline.conf, to collect tweets on 16 Oct 2020 AEDT (UTC+11) a few hours after the Twitter outage. It filters tweets that are in English and contain the keyword twitterdown. Due to rate limiting on Twitter API requests, I excluded retweets. This also helps avoid the logical fallacy of argumentum ad populum, often attributed to Twitter's trending function, and gives access to a more diverse conversation. You also have the option to filter tweets from certain regions (e.g. Australia), which requires passing bounding-box coordinates to the locations option.
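For reference, here is a minimal sketch of what twitter_pipeline.conf can look like. The credential placeholders are yours to fill in from the Twitter developer portal, and the exact layout is an assumption pieced together from the Logstash Twitter input plugin's documented options:

input {
  twitter {
    consumer_key        => "<consumer_key>"
    consumer_secret     => "<consumer_secret>"
    oauth_token         => "<access_token>"
    oauth_token_secret  => "<access_token_secret>"
    keywords            => ["twitterdown"]
    languages           => ["en"]
    ignore_retweets     => true
    full_tweet          => true
    # locations => "112.9,-43.7,153.6,-10.7"  # optional bounding box, roughly Australia
  }
}
output {
  stdout { codec => dots }
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "tweets"
  }
}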
Stream Tweets into ELK data store
To start streaming tweets, run:
% logstash -f twitter_pipeline.conf
At this point, you should see tweets being written to stdout. Alternatively, open a browser and inspect real-time tweets at http://localhost:9200/tweets/_search/?pretty (the host specified in the elasticsearch output of the configuration file). Below is a snapshot of my stdout. Remember, we have Kibana for searching and visualising Elastic data. Let's launch it now.
Index
Prior to visualisation (or doing anything useful on a human timescale), you need at least one data index and one index pattern. Indices, the largest unit of data in Elasticsearch, are logical partitions grouping similar data together (corresponding to a database in a relational system). If you look back at the output section of twitter_pipeline.conf, I stored Twitter documents in the tweets index. To verify this, visit the Index Management console at http://localhost:5601/app/management/data/index_management/indices. As shown below, the tweets index exists and holds the 32,355 documents that I collected over a 12-hour period.
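If you prefer the terminal over the console, the standard cat indices API reports the same health and document-count information:

% curl 'http://localhost:9200/_cat/indices/tweets?v'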
An index pattern is the glue that connects Kibana to Elasticsearch data; it can match the name of a single index or include a wildcard (*) to match multiple indices. To create an index pattern, visit Stack Management > Kibana > Index Patterns and click on create index pattern:
- Fill in index pattern name with tweets
- Fill in the Time field with @timestamp
Visualise
Next, head to the Kibana Discover tab and change the index pattern to tweets. Make sure the time picker is set properly.
Done ✅ The Twitter tracking tool is firing. You can now go to the Visualise panel, create as many plots as you want, and present them all together in a single dashboard (mine is shown in Fig. 5). When creating plots with Kibana, keep in mind that there are two main types of aggregation: metric and bucket. Metric aggregations compute numerical values, whereas bucket aggregations group documents that share a common criterion (similar to rows/columns in a pivot table).
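Under the hood, every dashboard panel boils down to an aggregation query. As a sketch combining both types, the following Dev Tools request buckets tweets per hour and computes a unique-author metric within each bucket; the user.screen_name.keyword field is an assumption based on Elasticsearch's default dynamic mapping of the tweet object:

GET tweets/_search
{
  "size": 0,
  "aggs": {
    "tweets_per_hour": {
      "date_histogram": { "field": "@timestamp", "calendar_interval": "hour" },
      "aggs": {
        "unique_authors": { "cardinality": { "field": "user.screen_name.keyword" } }
      }
    }
  }
}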
Re-index
If you have followed all the steps up to here, chances are you had difficulties plotting the Map graph and the Tag cloud of named entities (the two bottom plots in Fig. 5). Drawing those plots requires additional fields in the Twitter dataset. But can you change the mapping of data that has already been indexed?
Similar to a schema in a relational database, an Elastic mapping defines the types of the fields that reside within an index. Apart from a handful of supported mapping parameters, changing an existing field could invalidate data that has already been indexed; if you need to change a field's mapping, you must create a new index with the correct mapping and reindex your data into it. The reasons for reindexing vary, from data type changes and analysis changes to the introduction of new fields that need to be populated. I performed reindexing for two main reasons:
- Kibana Maps requires geolocation data in geo_point format, but this data type is not natively produced by the Logstash Twitter plugin.
- Enriching the Twitter data objects with Natural Language Processing (NLP) features, such as named entities, required adding new fields.
Although there have been significant efforts to enhance the Elastic stack for data analytics use cases (such as regression and classification), its reliance on a complex REST API still does not provide an ideal architecture for an NLP ecosystem, in particular for easy integration with state-of-the-art libraries such as Spark NLP or spaCy. While this does not stop you from developing your own text analysis scripts, for a quick demonstration I decided to use the OpenNLP ingest processor plugin, which features Named Entity Recognition (NER). Follow the OpenNLP installation instructions to get started.
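For orientation, installation boils down to two moves: install the plugin into Elasticsearch, then register the NER model files in elasticsearch.yml. The release URL, setting keys, and model file names below are illustrative assumptions, so grab the exact values from the plugin's README for your Elasticsearch version:

% bin/elasticsearch-plugin install https://github.com/spinscale/elasticsearch-ingest-opennlp/releases/download/7.9.2.1/ingest-opennlp-7.9.2.1.zip

# In config/elasticsearch.yml, point the plugin at the downloaded models;
# the setting keys determine the entities.* field names used later on
ingest.opennlp.model.file.person: en-ner-person.bin
ingest.opennlp.model.file.dates: en-ner-date.bin
ingest.opennlp.model.file.location: en-ner-location.bin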
Text analysis with OpenNLP plugin
The default NER models shipped with the OpenNLP plugin identify entities of type Person, Date, and Location. Assuming you have already installed the OpenNLP plugin, head to the Kibana console and execute the following three steps (Fig. 6):
In a nutshell, Step 1 creates a new empty index, called tweets-reindexed, with the correct mapping for the geolocation data. Step 2 configures the OpenNLP ingest pipeline. Step 3 calls the reindex API to populate the new index with data from the previous one. The reindexing process often takes a while for large datasets, so to avoid a timeout error you may need to set the wait_for_completion parameter to false. After a few minutes, go to the Index Management panel and verify that the new index holds the same number of Twitter documents.
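Here is a minimal sketch of those three console commands. The field paths (coordinates.coordinates for the [lon, lat] array, message for the tweet text) and the pipeline name opennlp-pipeline are assumptions, so adjust them to your own mapping:

# Step 1: a new index mapping the tweet coordinates as geo_point
PUT tweets-reindexed
{
  "mappings": {
    "properties": {
      "coordinates": {
        "properties": {
          "coordinates": { "type": "geo_point" }
        }
      }
    }
  }
}

# Step 2: an ingest pipeline that runs OpenNLP NER over the tweet text
PUT _ingest/pipeline/opennlp-pipeline
{
  "description": "Extract named entities from tweets",
  "processors": [
    { "opennlp": { "field": "message" } }
  ]
}

# Step 3: copy the old index into the new one through the pipeline
POST _reindex?wait_for_completion=false
{
  "source": { "index": "tweets" },
  "dest": { "index": "tweets-reindexed", "pipeline": "opennlp-pipeline" }
}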
To verify the NER values generated by the OpenNLP pipeline (i.e. entities.person, entities.location, and entities.dates), you can create a new index pattern and dive into the data from the Kibana Discover tab:
Or, if you are feeling fancy, you can experiment with Elastic's graph analytics features. For my use case, I was curious whether there is any relationship between named persons and locations. To draw a graph, launch Kibana's Graph plugin at http://localhost:5601/app/graph, and simply select your index and the fields in the data to use as vertices:
As depicted in Fig. 7, the graph exploration speaks for itself. In summary, United States users were incensed by the Twitter shutdown, accusing the platform of censorship in favour of Joe Biden to stop the spread of the email scandal about his son, Hunter Biden. In the meantime, Nigerian Twitter users held a positive stance, continuing to reference Jack Dorsey (Twitter CEO) for his support of the #EndSARS movement and the special #EndSARS emoji.
In our local context, user responses were mixed, with the majority bemused that they couldn't post anything and continuing to repost their messages with the hashtag #twitterdown. Australians were also discontented, but not solely over the platform's interference with the US election. Among the top named entities in Fig. 7, Australian Football athletes were at the centre of the #twitterdown footy fan conversation, which was heating up with the AFL Grand Final a week ahead.
This tutorial has only touched the surface of the possibilities the Elastic stack offers for data analytics. If you are already comfortable with Python for data analysis, there is a brand-new package called Eland that abstracts away a lot of Elasticsearch-specific syntax behind a powerful and familiar pandas-compatible API. To explore its main features, check out Eland's documentation.
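As a parting sketch, and assuming Eland is installed (pip install eland) with Elasticsearch still on its default port, the tweets index can be wrapped in a pandas-style DataFrame in a few lines of Python:

import eland as ed

# A lazily evaluated, pandas-compatible view of the index; operations are
# pushed down to Elasticsearch instead of pulling the data into memory
df = ed.DataFrame("localhost:9200", es_index_pattern="tweets")

print(df.shape)              # (document count, mapped field count)
print(df["message"].head())  # assumes the tweet text landed in the "message" field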
Header image courtesy of Unsplash