Twitter is a rich source of information about people’s personal opinions and life spheres. But it is also a frustrating one if you want to study it: there are so many tweets, and plenty of them you would have preferred in the garbage bin rather than on your screen. In comes Natural Language Processing, our preferred hero of automatization, to save the day.
Using Natural Language Processing on Twitter has its quirks though, especially if you have to prepare a dataset from scratch and want it to work for your specific use case. There are plenty of pitfalls along the way. Where do you start? How do you take it from there? How well does it actually work? In this post we describe how we went about creating a usable tweet collection to train a machine learning model. We decided to do something different from the all-time favorite problem of sentiment analysis and turned our attention to what we consider a socially more relevant topic: comments related to eating disorders. More specifically, we want to see whether a comment shows signs of problematic behavior or encourages it. (Andrea has written a more qualitative post as well, focusing on the use case itself.)
We figured a good starting point would be to identify a set of hashtags that are commonly used within eating disorder communities.

Typical eating-disorder hashtags are #proana, #promia, #proed, #anamia, #anacoach, #anabuddy, #EDtwt, #meanspo, #bonespo, etc.
We used the Twitter API to query for these hashtags at two different points in time, gathering a sizeable dataset of 31,806 tweets. After automatic removal of retweets and of tweets that only contain emoticons or hashtags, about 16,000 remain.
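For illustration, a query like ours could be run with Tweepy against the Twitter API v2 as in the sketch below. The bearer token, the exact query string and the pagination limits are placeholders, not the setup we actually used.

```python
# Hypothetical sketch: querying the Twitter API v2 for hashtags with Tweepy.
# The bearer token, query string and limits are placeholders, not our actual setup.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")

# Exclude retweets at query time; the hashtag list mirrors the ones above.
query = "(#proana OR #promia OR #proed OR #anamia OR #EDtwt) -is:retweet lang:en"

tweets = []
# search_recent_tweets only covers the last seven days, hence querying at
# several points in time to grow the collection.
for response in tweepy.Paginator(client.search_recent_tweets,
                                 query=query, max_results=100, limit=50):
    tweets.extend(response.data or [])

print(f"Collected {len(tweets)} tweets")
```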
To understand the data better, we dove in. We committed to a binary classification scheme (toxic, neutral) and started providing labels using our in-house annotation tool. This showed rather harshly that binary, black-and-white classification schemes come with plenty of grey, and that Twitter is a portal to a rather confusing universe. It not only contains weird, automatically generated nonsensical tweets; the amount of unwanted and ill-advised advertising is staggering. Usable tweets proved rare at first sight.
It was clear that we needed to talk things through. So we sat down and went through the examples we disagreed on. We discussed and decided which cases should be considered toxic and which neutral. The rules were written down and promoted to ‘annotation guidelines’ pinned on the wall of the office.
To weed out the littered collection of tweets, we decided to intervene, strongly, and luckily with the power of automatization backing us:
1) We identified users who generated noise in our collection (generally well-intentioned bots) and filtered them out.
2) Tweets with a cosine similarity higher than 0.97 to any other tweet were removed from the collection. Often these are messages that are identical save for different links to the same shady product that literally promises to reduce you to zero as fast as possible (see the first sketch after this list).
3) We built an additional probabilistic classifier that distinguishes between high-quality and low-quality content. This is effectively our gatekeeper from now on when dealing with Twitter data: it decides what goes in the garbage bin and what seems sensible and human enough. We were tired of nonsense, so we imposed a harsh cut-off criterion to filter out advertisements, calls to action, etc. (see the second sketch after this list).
4) We tried out an additional way to collect tweets. Instead of using a query-based strategy, we manually selected a few interesting accounts that act as hubs in the community. By collecting the tweets of their followers, we ended up with a far more natural collection of tweets discussing everyday life (see the third sketch after this list). We thus ended up with two subcorpora: one query-based, one community-driven.
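Here is a minimal sketch of the near-duplicate filter from step 2, using TF-IDF vectors and scikit-learn. The 0.97 threshold comes from the text above; the vectorizer settings and the helper name are our own illustrative assumptions.

```python
# Sketch of near-duplicate removal (step 2): drop tweets whose TF-IDF cosine
# similarity to an already-kept tweet exceeds 0.97. Vectorizer settings are
# illustrative assumptions, not our exact configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def deduplicate(tweets, threshold=0.97):
    vectors = TfidfVectorizer().fit_transform(tweets)
    # For a large collection, compute similarities in chunks instead of one
    # dense n-by-n matrix.
    sims = cosine_similarity(vectors)
    keep = []
    for i in range(len(tweets)):
        # Keep tweet i only if it is not too similar to any tweet kept so far.
        if all(sims[i, j] <= threshold for j in keep):
            keep.append(i)
    return [tweets[i] for i in keep]

corpus = ["Lose weight FAST, click here http://a.example",
          "Lose weight FAST, click here http://b.example",
          "had a rough day, talked to my sister about it"]
print(deduplicate(corpus))  # the second spam variant is dropped
```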
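The gatekeeper from step 3 could look something like the following sketch. The post does not pin down the exact model, so we use a Naive Bayes pipeline purely for illustration, with an assumed harsh probability cut-off of 0.9.

```python
# Sketch of the quality gatekeeper (step 3): a probabilistic classifier with a
# harsh cut-off. Model choice, toy training data and the 0.9 threshold are
# illustrative, not the actual configuration described in the post.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# A handful of toy labels: 1 = high-quality/human, 0 = ads, bots, spam.
texts  = ["buy our miracle tea, link in bio", "FOLLOW for daily deals!!!",
          "skipped lunch again, feeling awful about my body",
          "talked to my therapist today, small steps"]
labels = [0, 0, 1, 1]

gatekeeper = make_pipeline(TfidfVectorizer(), MultinomialNB())
gatekeeper.fit(texts, labels)

def passes_gate(tweet, cutoff=0.9):
    # Only tweets the model is very sure are human-written get through.
    return gatekeeper.predict_proba([tweet])[0, 1] >= cutoff

print(passes_gate("win a FREE phone, click now"))
```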
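And for step 4, a rough sketch of the community-driven collection with Tweepy: the hub account id and the result limits are hypothetical placeholders, and this kind of lookup requires the appropriate API access level.

```python
# Sketch of the community-driven collection (step 4): pull tweets from the
# followers of a hand-picked hub account. Account id and limits are placeholders.
import tweepy

client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
hub_id = 123456789  # hypothetical hub account in the community

community_tweets = []
followers = client.get_users_followers(id=hub_id, max_results=1000)
for user in followers.data or []:
    timeline = client.get_users_tweets(id=user.id, max_results=20)
    community_tweets.extend(timeline.data or [])
```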
Guided by the annotation scheme on the wall, we annotated 250 examples from each subcorpus. Agreement was reached in approximately 80% of the cases. Discussion clearly showed that the remaining disagreements were now often due to different interpretations of a tweet resulting from contextually induced bias. What I mean by that is that, when you read the tweets, it is very clear that the community-driven collection contains many cases that are problematic given their authors; in a different context, those same tweets might actually be quite upbeat and certainly non-problematic. The community-based tweet collection was chosen for further annotation as it contained the best balance between problematic tweets and normal ones. We ended up with a fairly small collection of 1,500 tweets, tagged either as problematic or not. When in doubt, the tweet was considered problematic.
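As an aside, the raw agreement figure above can be computed in a couple of lines; the sketch below also shows Cohen’s kappa, a chance-corrected variant we do not report here. The label arrays are made up.

```python
# Sketch: computing inter-annotator agreement on a doubly annotated sample.
# Labels are made up; 1 = problematic/toxic, 0 = neutral.
import numpy as np
from sklearn.metrics import cohen_kappa_score

annotator_a = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
annotator_b = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 1])

raw_agreement = (annotator_a == annotator_b).mean()  # share of identical labels
kappa = cohen_kappa_score(annotator_a, annotator_b)  # chance-corrected agreement

print(f"raw agreement: {raw_agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```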
The training set was used to train a Convolutional Neural Network, implemented in TensorFlow. We chose to preprocess the corpus only lightly: we lowercased all words, left in punctuation and emoticons, and, more importantly, did not remove hashtags. In this particular case, hashtags are sometimes part of the syntax and provide important contextual clues, while at the same time their presence is certainly not fully indicative of a tweet’s problematic status. After optimizing several parameters of the machine learning algorithm, we achieved an accuracy of about 81%, comparable to our inter-annotator agreement, which we consider a good point of comparison. Manual inspection of newly tagged tweets shows that explicit cases talking about physical acts, negative (self-)image and unhealthy eating habits are picked up well by the algorithm, which can certainly be used to detect this kind of behaviour in tweets.
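To make the setup concrete, here is a minimal sketch of such a text CNN in TensorFlow/Keras, with the light preprocessing described above (lowercasing only; punctuation, emoticons and hashtags kept). All layer sizes and hyperparameters are illustrative guesses, not our tuned values.

```python
# Sketch of a tweet-level CNN classifier in TensorFlow/Keras. Preprocessing
# matches the description above: lowercase only, keep punctuation, emoticons
# and hashtags. All hyperparameters are illustrative.
import numpy as np
import tensorflow as tf

# Lowercase only; the default standardizer would also strip punctuation and '#'.
vectorize = tf.keras.layers.TextVectorization(
    max_tokens=20000,
    standardize=lambda x: tf.strings.lower(x),
    output_sequence_length=50,
)

# Toy stand-ins for the annotated corpus (1 = problematic, 0 = not).
train_texts = np.array(["i want to #purge again :(",
                        "made soup with mom today"])[:, None]
train_labels = np.array([1, 0])
vectorize.adapt(train_texts.ravel())

inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
x = vectorize(inputs)
x = tf.keras.layers.Embedding(20000, 128)(x)
x = tf.keras.layers.Conv1D(128, 5, activation="relu")(x)
x = tf.keras.layers.GlobalMaxPooling1D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_texts, train_labels, epochs=2)
```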

Here are some examples (slightly modified to preserve anonymity):
I wanted to purge in the school toilets but someone came in
I hate not having gag reflex when I want to purge
me: I want to die – my brain: you can die when you’re skinny
We plan on making the APIs (Twitter Toxic Comment Detection, Twitter Quality Control) available with a free tier in the near future, so that you can process your own tweets. If you are interested, please contact us through the contact form so we can send you information about future developments.