Teaching the machine
Machine Learning might seem something very mysterious, but at its very core what we try to do is teach the machine to mimic what we do and think. If we see a picture of an elephant, we think elephant. We might want the machine to think the same if it sees that very picture. If someone asks us: what body parts do you see in the picture? We think trunk, feet and tusks. We would like the machine to answer the same. But in order for the machine to ‘get’ it, we need to show it examples of what we want it to get. The examples take the form of pictures of trunks, feet and tusks together with their appropriate labels. The machine then associates each picture with its label; this is how, and what, it learns.
Annotation is providing the machine with labelled examples.
NLP: computer and language
The goal of Natural Language Processing is training machines to understand or produce human language. Easier said than done, because the meanings present in texts aren’t actually as straightforward as you might think. Just like the picture of the elephant: you might be interested in a general view of the elephant, or you might want to zoom in on its body parts, but perhaps you want to know where the picture was taken, such as Africa, India, a zoo, …
For language, there are also many layers of meaning that you might want your computer to identify. And as you can imagine by now, the way to do this is by feeding the machine enough examples of what you are focusing on. You provide the piece of text with the right kind of information, a label, which turns the text into an annotated example. In other words, we “explain” meaning to the machine by adding information to the examples used for training. We call those explanations “metadata” (data about the data) and we add them in the form of labels/tags. Eventually, annotation results in the transformation of texts into datasets for the machine to learn from. (If you want to know more, Stubbs and Pustejovsky’s Natural Language Annotation for Machine Learning is a good read.)
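To make this concrete, here is a minimal sketch (the texts and label names are made up for illustration) of how attaching a label turns raw texts into an annotated dataset of text–label pairs:

```python
# Raw texts on their own carry no "explanation" the machine can learn from.
raw_texts = [
    "The patient reports a persistent cough and mild fever.",
    "Quarterly revenue grew 12% year over year.",
]

# Manual annotation: a human adds metadata, here a topic label per text.
labels = ["medical", "finance"]

# The annotated dataset is simply the texts paired with their labels;
# this is the form of "labelled examples" a learning algorithm consumes.
annotated_dataset = list(zip(raw_texts, labels))

for text, label in annotated_dataset:
    print(f"{label}: {text}")
```

In practice the labels come from a predefined tag set agreed on before annotation starts, so that every annotator "explains" the texts in the same vocabulary.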
Annotation is mostly human work
When we speak of manual annotation we mean that a person tags parts of the text with certain information. We do this for many texts so that the machine has enough examples to learn the problem. Sounds tiring, right? Well, you might be doing it already without realizing it. Do the following tasks ring a bell? Adding information to texts every time you upload them to a system, filing an email or a downloaded document into a particular folder, … If we teach the machine what it needs to understand in order to perform those tasks, it can assist you and save you time and brain power.
Specific software to annotate faster
The whole process of training the machine involves a back-and-forth methodology that benefits greatly from working with dedicated software. Very, very often, you need to revisit your annotation sets, redefine, merge or split categories, correct concrete examples, and export the output in a format suitable for Machine Learning. Many linguists and other humanities researchers are still annotating without dedicated software. I have done it for years and didn’t die! However, that would be impossible for Machine Learning purposes. Good software will save you hours of work.
Annotation depends on the task
We add specific metadata depending on what we want to teach the machine in a particular task. For instance, a task might be to teach the machine to identify the linguistic expressions for diseases, symptoms, tests and treatments in a collection of medical texts. For this task, we will annotate a certain amount of those texts with the tags disease, symptom, test and treatment whenever we encounter an expression referring to one of them. Some other well-known tasks are annotating named entities (such as place, date, person, etc.) or labelling texts according to their topic or the sentiment or intention of the writer.
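A hypothetical example of what such task-specific annotation might look like, using the medical tag set described above: each annotation marks a span of the text by character offsets and assigns it one of the agreed tags. (The sentence and the exact storage format are made up for illustration; real projects typically use an annotation tool that records something equivalent.)

```python
# One sentence from a (fictional) medical text.
text = "The MRI confirmed a meniscus tear, so ibuprofen was prescribed for the pain."

# Manual annotations: character offsets into `text` plus a tag from the
# task's tag set (disease, symptom, test, treatment).
annotations = [
    {"start": 4,  "end": 7,  "label": "test"},       # "MRI"
    {"start": 20, "end": 33, "label": "disease"},    # "meniscus tear"
    {"start": 38, "end": 47, "label": "treatment"},  # "ibuprofen"
    {"start": 71, "end": 75, "label": "symptom"},    # "pain"
]

# Recover each tagged expression from its offsets.
for ann in annotations:
    span = text[ann["start"]:ann["end"]]
    print(f'{ann["label"]:>10}: {span}')
```

Note how the offsets decide exactly what counts as the tagged expression: whether you annotate “meniscus tear” or just “tear” changes what the machine learns, which is the kind of small decision that adds up across thousands of examples.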
Annotators: knowledge & expertise
In NLP, linguists and engineers work together in training machines. It is teamwork. Engineers know how to program. Linguists know how to annotate. It is no wonder that in the biggest companies, such as Google, linguists lead the annotation teams to ensure the quality of their datasets.
Annotation is very laborious and far more complicated than it seems. Since it is the pillar of the Machine Learning process, any mistakes or inaccuracies that we make as humans are also taught to the computer. Therefore, annotation benefits greatly from theoretical and empirical expertise in Linguistics, often paired with domain-specific knowledge. More often than not, important nuances are encoded in very small words. What we include in a tagged expression can have consequences for what the machine learns.
So… a word of advice: If you need annotation for an NLP solution, make sure there are linguists with the proper training to annotate your texts. There is more to it than engineering!