【337】Text Mining Using Twitter Streaming API and Python

Reference: An Introduction to Text Mining using Twitter Streaming API and Python

Reference: How to Register a Twitter App in 8 Easy Steps

Getting Data from Twitter Streaming API
Reading and Understanding the data
Mining the tweets

Key Methods:

map()
lambda
set()
pandas.DataFrame()
matplotlib

1. Getting Data from Twitter Streaming API

The file twitter_streaming.py is used to collect tweets from Twitter. The listing uses the tweepy 3.x interface; StreamListener was removed in tweepy 4.0.

#Import the necessary classes from the tweepy library
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream

#Variables that contain the user credentials to access the Twitter API
access_token = "ENTER YOUR ACCESS TOKEN"
access_token_secret = "ENTER YOUR ACCESS TOKEN SECRET"
consumer_key = "ENTER YOUR API KEY"
consumer_secret = "ENTER YOUR API SECRET"


#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):

    def on_data(self, data):
        print(data)
        return True

    def on_error(self, status):
        print(status)


if __name__ == '__main__':

    #This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    #This line filters Twitter Streams to capture data by the keywords: 'python', 'javascript', 'ruby'
    stream.filter(track=['python', 'javascript', 'ruby'])

You can use the following command to store the output in a file (run it from the command prompt):

python twitter_streaming.py > twitter_data.txt

Next, we read that text file and parse each line as JSON.

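import json
tweets_data_path = r"..\twitter_data.txt"
tweets_data = []
# Read the file line by line; each non-empty line should hold one JSON object
with open(tweets_data_path, "r") as tweets_file:
	for line in tweets_file:
		try:
			tweet = json.loads(line)
			tweets_data.append(tweet)
		except ValueError:
			# Skip blank or malformed lines
			continue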
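
The loading step can be sketched on an in-memory sample, so it runs without a captured file. The sample lines and the helper name load_tweets are assumptions made for illustration. Because on_error in the listener prints the status code to the same output, the sketch also drops lines that parse as JSON but are not objects.

```python
import io
import json


def load_tweets(lines):
    """Parse one JSON object per line, skipping anything that is not one."""
    tweets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # redirected output can contain empty lines
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # not valid JSON
        if isinstance(obj, dict):
            tweets.append(obj)  # keep JSON objects only (a bare status code parses as an int)
    return tweets


# Hypothetical sample standing in for twitter_data.txt
sample = io.StringIO(
    '{"text": "I love python", "lang": "en"}\n'
    '\n'
    'not json at all\n'
    '420\n'
    '{"text": "ruby on rails", "lang": "en"}\n'
)
tweets_sample = load_tweets(sample)
print(len(tweets_sample))
```

Only the two JSON objects survive; the blank line, the free text and the bare number are skipped.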

The parsed tweets are now in tweets_data, and we can read specific fields with the following scripts.

Reference: python JSON only get keys in first level

# print the text and language of the first 9 tweets
num = 0
for tweet in tweets_data:
	num += 1
	if num == 10:
		break
	else:
		tweet_text = tweet["text"]
		tweet_lang = tweet["lang"]
		print(str(num))
		print(tweet_lang)
		print(tweet_text)
		print()

# get all the keys from json
tweets_data[0].keys()
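
Direct indexing such as tweet["text"] raises KeyError when a key is absent, and the place field is null for many tweets. A stream may also deliver messages that are not tweets (for example rate-limit notices). A minimal sketch of defensive access, using made-up records and a hypothetical helper get_country:

```python
def get_country(tweet):
    """Return the tweet's country name, or None when no place is attached."""
    place = tweet.get("place")  # a missing key and a JSON null both give None
    if not place:
        return None
    return place.get("country")


# Hypothetical records: a geotagged tweet, an untagged tweet, a non-tweet message
records = [
    {"text": "hello", "lang": "en", "place": {"country": "Japan"}},
    {"text": "hi", "lang": "en", "place": None},
    {"limit": {"track": 12}},
]

countries = [get_country(r) for r in records]
texts = [r["text"] for r in records if "text" in r]
print(countries)
print(texts)
```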

2. Reading and Understanding the data

We can also extract a specific field from every tweet by combining list(), map() and a lambda expression, as in the following scripts.

Reference: Combining map and lambda in Python

>>> a = list(map(lambda tweet: tweet['text'], tweets_data))
>>> len(a)
1633
>>> a[0]
'RT @neet_se: 案件数って点だけならJavaがダントツ、つまり仕事に繋がりやすい。https://t.co/rqxp…'

We can also use set() to get the unique values in the list.

Reference: The Python set() function

Reference: How to count the occurrences of duplicate items in a Python list

>>> langs = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> len(langs)
1633
>>> set(langs)
{'zh', 'de', 'es', 'et', 'th', 'cy', 'ru', 'in', 'lt', 'pt', 'tl', 'en', 'it', 'ja', 'ro', 'fa', 'pl', 'fr', 'ht', 'ar', 'tr', 'ca', 'cs', 'und', 'da'}
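
set() shows which values occur but not how often. A small sketch with collections.Counter from the standard library counts each value; the langs_sample list is made-up data, not the captured tweets.

```python
from collections import Counter

langs_sample = ["en", "ja", "en", "es", "en", "ja", "und"]

unique_langs = set(langs_sample)      # distinct values only
lang_counts = Counter(langs_sample)   # value -> number of occurrences

print(sorted(unique_langs))
print(lang_counts.most_common(2))     # the two most frequent values
```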

Next, we will structure the tweets data into a pandas DataFrame to simplify the data manipulation.

>>> import pandas as pd
>>> tweets = pd.DataFrame()
>>> tweets['text'] = list(map(lambda tweet: tweet['text'], tweets_data))
>>> tweets['lang'] = list(map(lambda tweet: tweet['lang'], tweets_data))
>>> tweets['country'] = list(map(lambda tweet: tweet['place']['country'] if tweet['place'] is not None else None, tweets_data))
>>> tweets['lang'].value_counts()
en     1119
ja      278
es      113
pt       36
und      26
...
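
The same three columns can also be built in one step from a list of records. This sketch uses three made-up tweets rather than the captured data and assumes pandas is installed.

```python
import pandas as pd

sample_tweets = [
    {"text": "learning python", "lang": "en", "place": {"country": "India"}},
    {"text": "Rubyの勉強", "lang": "ja", "place": None},
    {"text": "javascript tips", "lang": "en", "place": None},
]

rows = [
    {
        "text": t["text"],
        "lang": t["lang"],
        # place is None for tweets without location data
        "country": t["place"]["country"] if t["place"] is not None else None,
    }
    for t in sample_tweets
]
df = pd.DataFrame(rows, columns=["text", "lang", "country"])

lang_counts = df["lang"].value_counts()  # sorted by count, largest first
print(lang_counts)
```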

Next, we will use matplotlib to create a chart describing the Top 5 languages in which the tweets were written.

>>> tweets_by_lang = tweets['lang'].value_counts()

>>> import matplotlib.pyplot as plt
>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Languages', fontsize=15)
Text(0.5, 0, 'Languages')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 languages', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 languages')
>>> tweets_by_lang[:5].plot(ax=ax, kind='bar', color='red')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189B635D630>
>>> plt.show()

Next, we will create a chart describing the Top 5 countries from which the tweets were sent.

>>> tweets_by_country = tweets['country'].value_counts()

>>> fig, ax = plt.subplots()
>>> ax.tick_params(axis='x', labelsize=15)
>>> ax.tick_params(axis='y', labelsize=10)
>>> ax.set_xlabel('Countries', fontsize=15)
Text(0.5, 0, 'Countries')
>>> ax.set_ylabel('Number of tweets' , fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Top 5 countries', fontsize=15, fontweight='bold')
Text(0.5, 1.0, 'Top 5 countries')
>>> tweets_by_country[:5].plot(ax=ax, kind='bar', color='blue')
<matplotlib.axes._subplots.AxesSubplot object at 0x00000189BA6038D0>
>>> plt.show()

3. Mining the tweets

Our main goals in these text mining tasks are to compare the popularity of the Python, Ruby and Javascript programming languages and to retrieve links to programming tutorials. We will do this in 3 steps:

Add tags to our tweets DataFrame so that the data is easier to manipulate.
Target tweets that have the "programming" or "tutorial" keywords.
Extract links from the relevant tweets.

Adding Python, Ruby, and Javascript tags

First, we will create a function that checks whether a specific keyword is present in a text. We will do this by using regular expressions.

Python provides a library for regular expressions called re. We will start by importing this library.

Next, we will create a function called word_in_text(word, text). This function returns True if word is found in text; otherwise it returns False.

>>> import re
>>> def word_in_text(word, text):
	word = word.lower()
	text = text.lower()
	match = re.search(word, text)
	if match:
		return True
	return False

Next, we will add 3 columns to our tweets DataFrame by calling pandas.Series.apply() on the text column.

>>> tweets['python'] = tweets['text'].apply(lambda tweet: word_in_text('python', tweet))
>>> tweets['ruby'] = tweets['text'].apply(lambda tweet: word_in_text('ruby', tweet))
>>> tweets['javascript'] = tweets['text'].apply(lambda tweet: word_in_text('javascript', tweet))

We can calculate the number of tweets for each programming language by pandas.Series.value_counts as follows:

>>> print(tweets['python'].value_counts()[True])	       
447
>>> print(tweets['ruby'].value_counts()[True])	       
529
>>> print(tweets['javascript'].value_counts()[True])	       
275

We can make a simple comparison chart by executing the following:

>>> prg_langs = ['python', 'ruby', 'javascript']  
>>> tweets_by_prg_lang = [tweets['python'].value_counts()[True], tweets['ruby'].value_counts()[True], tweets['javascript'].value_counts()[True]]     
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8       
>>> fig, ax = plt.subplots()  
>>> plt.bar(x_pos, tweets_by_prg_lang, width, alpha=1, color='g')	       
<BarContainer object of 3 artists>
>>> # Setting axis labels and ticks       
>>> ax.set_ylabel('Number of tweets', fontsize=15)       
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Raw data)', fontsize=10, fontweight='bold')       
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Raw data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])      
[<matplotlib.axis.XTick object at 0x00000189BA5D1F28>, <matplotlib.axis.XTick object at 0x00000189BA603D30>, <matplotlib.axis.XTick object at 0x00000189BA5D15F8>]
>>> ax.set_xticklabels(prg_langs)       
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()       
>>> plt.show()

This shows that the keyword ruby is the most frequent, followed by python and then javascript. However, the tweets DataFrame contains every tweet that matches one of the 3 keywords, whether or not it is about the programming language. For example, many tweets that contain the keyword ruby relate to the political scandal known as Rubygate. In the next section, we will filter the tweets and re-run the analysis to make a more accurate comparison.
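
Part of the overcount comes from substring matching: re.search('ruby', ...) also matches inside "rubygate". One possible refinement, which is not part of the analysis in this post, is to match whole words only. The name whole_word_in_text is hypothetical, and re.escape is used so that the keyword is always treated as literal text.

```python
import re


def whole_word_in_text(word, text):
    """Return True if `word` occurs in `text` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


ruby_release = whole_word_in_text("ruby", "New Ruby release is out")
ruby_scandal = whole_word_in_text("ruby", "Latest on the Rubygate scandal")
python_hashtag = whole_word_in_text("python", "#Python tutorial for beginners")
print(ruby_release, ruby_scandal, python_hashtag)
```

The trade-off is that stricter matching also drops compound hashtags such as #rubyonrails, so it would change the counts in both directions.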

Targeting relevant tweets

We are interested in targeting tweets that are related to programming languages. Such tweets often contain one of 2 keywords: "programming" or "tutorial". We will add 2 columns to our tweets DataFrame to record this information.

>>> tweets['programming'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet))
>>> tweets['tutorial'] = tweets['text'].apply(lambda tweet: word_in_text('tutorial', tweet))

We will add another column called relevant that takes the value True if the tweet has either the "programming" or the "tutorial" keyword; otherwise it takes the value False.

>>> tweets['relevant'] = tweets['text'].apply(lambda tweet: word_in_text('programming', tweet) or word_in_text('tutorial', tweet))

We can print the counts of relevant tweets by executing the commands below.

>>> print(tweets['programming'].value_counts()[True])       
55
>>> print(tweets['tutorial'].value_counts()[True])       
22
>>> print(tweets['relevant'].value_counts()[True])  
74
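
The relevant count (74) is smaller than 55 + 22 = 77 because a tweet that contains both keywords is counted only once. By inclusion-exclusion, 55 + 22 - 74 = 3 tweets contain both. The sketch below shows the same bookkeeping on made-up sets of row indices.

```python
# Made-up row indices, standing in for the tweets that matched each keyword
programming_rows = {0, 1, 2, 3, 4}
tutorial_rows = {3, 4, 5}

relevant_rows = programming_rows | tutorial_rows   # either keyword
both_rows = programming_rows & tutorial_rows       # both keywords

# |A or B| = |A| + |B| - |A and B|
identity_holds = len(relevant_rows) == (
    len(programming_rows) + len(tutorial_rows) - len(both_rows)
)
print(len(relevant_rows), len(both_rows), identity_holds)

# The same identity applied to the counts reported above
both_in_data = 55 + 22 - 74
print(both_in_data)
```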

We can now compare the popularity of the programming languages by executing the commands below.

tweets[tweets['relevant'] == True]['python'] # the 'python' column, restricted to rows where relevant is True

>>> print(tweets[tweets['relevant'] == True]['python'].value_counts()[True])       
31
>>> print(tweets[tweets['relevant'] == True]['ruby'].value_counts()[True])
8
>>> print(tweets[tweets['relevant'] == True]['javascript'].value_counts()[True])   
11

Python is the most popular with a count of 31, followed by javascript with a count of 11 and ruby with a count of 8. We can make a comparison chart by executing the following:

>>> tweets_by_prg_lang = [tweets[tweets['relevant'] == True]['python'].value_counts()[True],
			  tweets[tweets['relevant'] == True]['ruby'].value_counts()[True],
			  tweets[tweets['relevant'] == True]['javascript'].value_counts()[True]] 
>>> x_pos = list(range(len(prg_langs)))
>>> width = 0.8
>>> fig, ax = plt.subplots()
>>> plt.bar(x_pos, tweets_by_prg_lang, width,alpha=1,color='g')
<BarContainer object of 3 artists>
>>> ax.set_ylabel('Number of tweets', fontsize=15)
Text(0, 0.5, 'Number of tweets')
>>> ax.set_title('Ranking: python vs. javascript vs. ruby (Relevant data)', fontsize=10, fontweight='bold')
Text(0.5, 1.0, 'Ranking: python vs. javascript vs. ruby (Relevant data)')
>>> ax.set_xticks([p + 0.4 * width for p in x_pos])
[<matplotlib.axis.XTick object at 0x00000189B6E9E128>, <matplotlib.axis.XTick object at 0x00000189B430F9E8>, <matplotlib.axis.XTick object at 0x00000189B430F5C0>]
>>> ax.set_xticklabels(prg_langs) 
[Text(0, 0, 'python'), Text(0, 0, 'ruby'), Text(0, 0, 'javascript')]
>>> plt.grid()
>>> plt.show()

Extracting links from the relevant tweets

Now that we have identified the relevant tweets, we want to retrieve links to programming tutorials. We will start by creating a function that uses a regular expression to retrieve a link that starts with "http://" or "https://" from a text. The pattern also accepts links that start with "www.". This function returns the URL if one is found; otherwise it returns an empty string.

>>> def extract_link(text):
	regex = r'https?://[^\s<>"]+|www\.[^\s<>"]+'
	match = re.search(regex, text)
	if match:
		return match.group()
	return ''
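
A quick check of the extraction on made-up tweet text. This sketch assumes the character classes are meant to exclude whitespace (\s) and that the dot after www is a literal dot (\.). The name find_first_link is hypothetical and only keeps the example separate from the function above.

```python
import re

# Assumed pattern: stop at whitespace, angle brackets or a double quote
LINK_PATTERN = re.compile(r'https?://[^\s<>"]+|www\.[^\s<>"]+')


def find_first_link(text):
    """Return the first URL-like substring in `text`, or '' if there is none."""
    match = LINK_PATTERN.search(text)
    return match.group() if match else ''


with_https = find_first_link('Great python tutorial https://t.co/abc123 via @someone')
with_www = find_first_link('see www.example.com for details')
without_link = find_first_link('no link in this tweet')
print(with_https)
print(with_www)
print(repr(without_link))
```

Only the first match is returned, so a tweet with several links contributes one URL.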

Next, we will add a column called link to our tweets DataFrame. This column will contain the URL extracted from each tweet.

>>> tweets['link'] = tweets['text'].apply(lambda tweet: extract_link(tweet))

Next, we will create a new DataFrame called tweets_relevant_with_link. This DataFrame is a subset of the tweets DataFrame and contains all relevant tweets that have a link.

In other words, we take a slice of the original DataFrame.

>>> tweets_relevant = tweets[tweets['relevant'] == True]       
>>> tweets_relevant_with_link = tweets_relevant[tweets_relevant['link'] != '']
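
The two filtering steps can also be combined into a single boolean mask. A small sketch on a made-up DataFrame, assuming pandas is installed; the column names follow the ones used above, and the rows are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["python tutorial https://t.co/a", "ruby programming", "lunch time"],
    "relevant": [True, True, False],
    "link": ["https://t.co/a", "", ""],
})

# Each condition is a boolean Series; & combines them row by row.
# The parentheses are needed because & binds tighter than != in Python.
mask = df["relevant"] & (df["link"] != "")
relevant_with_link = df[mask]

print(relevant_with_link["link"].tolist())
```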

We can now print out all links for python, ruby, and javascript by executing the commands below:

>>> print(tweets_relevant_with_link[tweets_relevant_with_link['python'] == True]['link'])       
40      https://t.co/zoAgyQuMAZ
105     https://t.co/ogaPbuIbEW
274     https://t.co/y4sUmovFOn
329     https://t.co/A030fqWeWA
339     https://t.co/LaaVc5T2rQ
391     https://t.co/8bYvlziCZb
413     https://t.co/8bYvlziCZb
436     https://t.co/EByqxT1qyN
444     https://t.co/8bYvlziCZb
445     https://t.co/5Jujg6h31B
462     https://t.co/UrFHlOaJYf
476     https://t.co/5Jujg6h31B
477     https://t.co/EByqxT1qyN
589     https://t.co/UrFHlOaJYf
603     https://t.co/5Jujg6h31B
822     https://t.co/Oc21FrzQc5
1060    https://t.co/qOAIuKfyD0
1097    https://t.co/qOAIuKfyD0
1248    https://t.co/V3ZNKuYsK7
1278    https://t.co/qOAIuKfyD0
1411    https://t.co/szHRHavQKy
1594    https://t.co/X6KWMlzlv6
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['ruby'] == True]['link'])	       
782     https://t.co/JgY40r2NSo
833     https://t.co/JgY40r2NSo
1177    https://t.co/xycOG3ndi9
1254    https://t.co/xycOG3ndi9
1293    https://t.co/LMHW050TGs
1328    https://t.co/SS4DzEnSBZ
1393    https://t.co/NZlUce5Ne8
1619    https://t.co/e4nwrn3N2j
Name: link, dtype: object
>>> print(tweets_relevant_with_link[tweets_relevant_with_link['javascript'] == True]['link'])     
130     https://t.co/AbJFaSI0B8
286     https://t.co/7dNBIsQ5Gq
467     https://t.co/3YIK588j8t
471     https://t.co/vjBJWWzvfv
830     https://t.co/T4mUjwUcgL
1093    https://t.co/wvLZLjuVKF
1180    https://t.co/luxL2qbxte
1526    https://t.co/G3ZTFL0RKv
Name: link, dtype: object
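
The output contains repeated links, most likely because the same tweet was retweeted (an assumption; the listing does not say why). A short sketch of removing duplicates while keeping first-seen order, using a few of the links printed above:

```python
links = [
    "https://t.co/8bYvlziCZb",
    "https://t.co/8bYvlziCZb",
    "https://t.co/EByqxT1qyN",
    "https://t.co/8bYvlziCZb",
    "https://t.co/5Jujg6h31B",
]

# dict keys are unique and keep insertion order (Python 3.7+)
unique_links = list(dict.fromkeys(links))
print(unique_links)
```

On a pandas Series, drop_duplicates() gives the same result without leaving the DataFrame.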