Tweets Analysis with Python and NLP
Introduction
You should already be familiar with the concepts of NLP from our previous post, so today we'll look at a more practical use case: analyzing tweets and classifying them into marketing and non-marketing tweets. We won't get into the details of tweet retrieval; it can be done with various packages, Tweepy being the most popular one.
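For completeness, here's a minimal sketch of what retrieval with Tweepy might look like; the credentials are placeholders you'd obtain from the Twitter developer portal, and the account name is just an example.

# A minimal sketch of tweet retrieval with Tweepy (placeholder credentials);
# not used in the rest of this post.
import tweepy

CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

# fetch the latest tweets of a given account
for status in api.user_timeline(screen_name='twitter', count=10):
    print(status.text)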
Baseline
For the purposes of this discussion we already have two sets of tweets, separated into files and uploaded to a GitHub folder. First we download the datasets, add a target column (1 for marketing tweets, 0 for non-marketing ones) and unite the datasets. Then we check the baseline classification results without any pre-processing, so that later we can tell whether our changes actually improve the metrics. We'll be using Random Forest for classification, since it doesn't expect linear features, or even features that interact linearly, and it handles high-dimensional spaces and large numbers of training examples very well. Plus, it doesn't require much configuration. Have a look at the Random Forest and classifier boosting articles for more details.
# -*- coding: utf-8 -*-
import re

import numpy as np
import pandas as pd
from nltk.tokenize import WordPunctTokenizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score, accuracy_score, roc_auc_score
from sklearn.model_selection import cross_val_predict

# raw.githubusercontent.com serves the file contents; the github.com/blob
# URLs return an HTML page that pandas cannot parse
prefix = 'https://raw.githubusercontent.com/aie0/data/master/tweets/'
badMarketingTweetsURL = prefix + 'bad_marketing_tweets.txt'
goodMarketingTweetsURL = prefix + 'good_non_marketing_tweets.txt'

vectorizer = CountVectorizer(max_features=5000)
tokenizer = WordPunctTokenizer()

# load the datasets
badMarketingTweetsDS = pd.read_csv(badMarketingTweetsURL, sep='\t',
                                   encoding='utf-8', names=['ID', 'Content'])
goodMarketingTweetsDS = pd.read_csv(goodMarketingTweetsURL, sep='\t',
                                    encoding='utf-8', names=['ID', 'Content'])

# mark the target: 1 for marketing tweets, 0 for non-marketing ones
badMarketingTweetsDS['isMarket'] = pd.Series(np.ones(len(badMarketingTweetsDS)),
                                             index=badMarketingTweetsDS.index)
goodMarketingTweetsDS['isMarket'] = pd.Series(np.zeros(len(goodMarketingTweetsDS)),
                                              index=goodMarketingTweetsDS.index)

# unite the datasets
X = pd.concat([badMarketingTweetsDS, goodMarketingTweetsDS], ignore_index=True)
y = X['isMarket']
X = X['Content']

def test_RF(transformator, options={}):
    # pre-process, vectorize, then cross-validate a Random Forest
    tweets = transformator(X, options)
    features = vectorizer.fit_transform(tweets).toarray()
    y_pred = cross_val_predict(RandomForestClassifier(verbose=3, n_estimators=100,
                                                      max_depth=10, n_jobs=-1),
                               features, y)
    acc = accuracy_score(y, y_pred)
    roc = roc_auc_score(y, y_pred)
    f1 = f1_score(y, y_pred)
    return acc, roc, f1

# 1. baseline: no pre-processing at all
def transform_tweets1(tweets, options):
    return tweets

print(test_RF(transform_tweets1))
# (0.83827723901882489, 0.83818275302681977, 0.8342105263157894)
Note the header of the file, # -*- coding: utf-8 -*-. Since the tweets contain non-ASCII characters, PEP 263 requires us to declare the file's encoding this way (in Python 2; Python 3 assumes UTF-8 by default).
Our pipeline consists of three phases: pre-processing the tweets (currently a no-op), vectorizing them, and training the classifier using cross-validation to avoid overfitting. For details about cross-validation, read the appropriate paragraph in the linear regression article.
We need vectorization since a classifier cannot work with raw words and requires numeric vectors. To achieve this, we use the scikit-learn CountVectorizer class, which implements the bag-of-words technique: each tweet becomes a vector of word counts over a fixed vocabulary.
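To make the bag-of-words idea concrete, here's a tiny, self-contained illustration (the sample sentences are made up; get_feature_names_out requires a recent scikit-learn):

from sklearn.feature_extraction.text import CountVectorizer

sample = ["win a free prize today", "free prize prize"]
v = CountVectorizer()
counts = v.fit_transform(sample).toarray()
# single-letter tokens like 'a' are dropped by the default tokenizer
print(v.get_feature_names_out())  # ['free' 'prize' 'today' 'win']
print(counts)
# [[1 1 1 1]
#  [1 2 0 0]]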
With no pre-processing we achieve 0.83 accuracy. Remember to check the F1 and ROC AUC metrics as well to spot skewed datasets; for more details see the machine learning metrics article.
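As a quick illustration of why accuracy alone can be misleading on a skewed dataset (the numbers here are made up):

from sklearn.metrics import accuracy_score, f1_score

y_true = [0] * 90 + [1] * 10           # 90% of the instances are negative
y_pred = [0] * 100                     # a classifier that always predicts 0
print(accuracy_score(y_true, y_pred))  # 0.9 -- looks decent
print(f1_score(y_true, y_pred))        # 0.0 -- exposes the useless classifier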
Pre-processing
Let's try several things and see what effect each change has on the metrics. All of them go through a single helper, transform_tweets2, which applies the selected options:
from nltk.corpus import stopwords  # needed for the stop-word option below

def transform_tweets2(tweets, options):
    results = []
    length = len(tweets)
    i = 0
    for tweet in tweets:
        if i % 100 == 0:
            print("%d of %d\n" % (i, length))
        i += 1
        s = tweet.lower()
        if 'markEmoji' in options:
            # replace every non-ascii character with the emoji marker
            if any(ord(ch) > 128 for ch in s):
                new_str = ''
                for l in s:
                    new_str += (" VGEMOJINAME " if ord(l) > 128 else l)
                s = new_str
        if 'patterns' in options:
            for (pattern, repl) in options['patterns']:
                s = re.sub(pattern, repl, s)
        words = tokenizer.tokenize(s)
        if 'remove_stop_words' in options:
            stops = set(stopwords.words("english"))
            result = " ".join([w for w in words if w not in stops])
        else:
            result = " ".join(words)
        results.append(result)
    return results
Removing stop words
We've all been taught that removing stop words should be the first step of any NLP pipeline, so it's only natural that we start by doing so. The performance, however, not only fails to improve but actually shows a noticeable decline. Remember that we should always take the total number of instances into account when interpreting performance: an F1 decrease of 0.834 - 0.822 = 0.012 over roughly 7000 instances is about 84 cases, which is a lot.
options = {'remove_stop_words': True}
print(test_RF(transform_tweets2, options))
# (0.82943525385054195, 0.8290763420757612, 0.82242424242424241)
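To see what the option actually does, here's a sketch of stop-word filtering on a single made-up tweet (requires nltk.download('stopwords') on first use):

from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer

stops = set(stopwords.words('english'))
words = WordPunctTokenizer().tokenize("win a trip to the bahamas with our new app")
print([w for w in words if w not in stops])
# ['win', 'trip', 'bahamas', 'new', 'app']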
Marking links
Since removing the stop words didn't help, let's try something else. All links on Twitter are wrapped by the t.co shortener, so they don't provide any additional information and may only worsen the performance. Let's replace all links with the hardcoded string VGLINKNAME. The performance increases by nearly one percent - a good start!
replacement_patterns = [
    (r"http:\/\/t.co\/[a-zA-Z0-9]+", " VGLINKNAME ")  # link
]
options = {
    'patterns': [(re.compile(regex), repl) for (regex, repl) in replacement_patterns]
}
print(test_RF(transform_tweets2, options))
# (0.84811751283513981, 0.84782082009277737, 0.85790527018012008)
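Here's the pattern in action on a made-up tweet; note that as written it matches only http:// links, so https:// t.co links would need r"https?:\/\/t.co\/[a-zA-Z0-9]+" instead.

import re

link_pattern = re.compile(r"http:\/\/t.co\/[a-zA-Z0-9]+")
print(link_pattern.sub(" VGLINKNAME ", "check this out http://t.co/Ab3dE9"))
# 'check this out  VGLINKNAME '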
Marking money
Marketing content usually contains monetary strings like "Win $100". Let's try to identify them and mark them with VGMONEYNAME. And the result - another percent up.
replacement_patterns = [
    (r"http:\/\/t.co\/[a-zA-Z0-9]+", " VGLINKNAME "),  # link
    (r'\$\s{0,3}[0-9,]+', ' VGMONEYNAME ')  # money
]
options = {
    'patterns': [(re.compile(regex), repl) for (regex, repl) in replacement_patterns]
}
print(test_RF(transform_tweets2, options))
# (0.85667427267541363, 0.85637385309532066, 0.86601786428476202)
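A quick sanity check of the money pattern on a made-up example: it matches a dollar sign followed by up to three spaces and digits or commas, so "$100" and "$ 1,000" are caught, while a trailing sign as in "100$" is not.

import re

money_pattern = re.compile(r'\$\s{0,3}[0-9,]+')
print(money_pattern.sub(' VGMONEYNAME ', 'Win $100 and $ 1,000 today!'))
# 'Win  VGMONEYNAME  and  VGMONEYNAME  today!'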
Marking Emojis
How about emotions? Do non-marketing tweets express more emotion through the use of emoji? Let's create a feature around this idea by marking all non-ASCII characters as VGEMOJINAME and check the results. Again the performance increases, by around half a percent.
replacement_patterns = [
    (r"http:\/\/t.co\/[a-zA-Z0-9]+", " VGLINKNAME "),  # link
    (r'\$\s{0,3}[0-9,]+', ' VGMONEYNAME ')  # money
]
options = {
    'patterns': [(re.compile(regex), repl) for (regex, repl) in replacement_patterns],
    'markEmoji': True
}
print(test_RF(transform_tweets2, options))
# (0.85853850541928122, 0.85730503637390198, 0.87023249526899159)
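The marking itself is the ord()-based loop inside transform_tweets2; on a made-up string it behaves like this:

s = "great deal 😀"
marked = "".join(" VGEMOJINAME " if ord(ch) > 128 else ch for ch in s)
print(marked)  # 'great deal  VGEMOJINAME '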
Conclusions
We could try many more ways to improve the performance; some will work, others won't. The main point you should take from this article is to always rely on the data, never on what people say. Removing stop words may be a good idea in some domains and worsen the performance in others. Validate every assumption and play with the data as much as possible. Good luck!