Cognitive system testing: Natural language processing unit testing

Part 4 of the Cognitive System Testing series, originally posted in 2016 on IBM Developer.

Introduction

Natural Language Processing (NLP) is the way cognitive systems extract meaningful information from plain text. NLP is part art and part science, and as such may seem difficult to test with automation. In this chapter and the next, I will describe how we test our NLP pipelines. Referring again to the test pyramid, we test at the unit, functional, and system levels. Today’s chapter focuses on unit-level NLP testing.

Definition

A natural language processing pipeline is made up of several components. In several frameworks, including Apache UIMA, these components are called “annotators”, since they annotate a span of text as having meaning. An annotation includes an annotation type, a span (“covered text”), and potentially other attributes: enough to tell you what text was interesting, where it was located, and why it is interesting. An NLP unit test focuses on a single annotator and determines whether it correctly annotates one aspect of a given text.
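To make the definition concrete, here is a minimal sketch of what an annotation might carry. The Annotation class and its field names are my own illustration, not the UIMA API; a real framework stores more than this.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record: what was found and where."""
    type: str   # annotation type, e.g. "Dog"
    begin: int  # offset of the first covered character
    end: int    # offset one past the last covered character

    def covered_text(self, document: str) -> str:
        # The span of the document that this annotation covers.
        return document[self.begin:self.end]


text = "I have a dog"
dog = Annotation(type="Dog", begin=9, end=12)
print(dog.covered_text(text))  # prints "dog"
```

A unit test for an annotator then boils down to comparing the annotations it produces against the ones you expect for a given text.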

How to build an NLP unit test suite

Generally, as you build a series of NLP annotators, you work with example snippets of text and decide how to train your NLP system (either with rules or machine learning) to properly annotate as many of these texts as possible. As you find more text snippets that express a target concept in varying ways, you will continually adapt your NLP to handle them. It is important to capture each target variation in a test case, so that as you add more variations, you can verify that existing behavior has not regressed.
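One way to capture variations is a data-driven suite: a list of text snippets paired with the annotations you expect, re-run on every change. The sketch below assumes an annotator is simply a function from text to a list of covered texts; run_suite and toy_annotator are hypothetical names used only for illustration.

```python
from typing import Callable, List, Tuple

# Each case pairs a text snippet with the covered texts we expect to be annotated.
Case = Tuple[str, List[str]]


def run_suite(annotate: Callable[[str], List[str]], cases: List[Case]) -> List[str]:
    """Return the snippets whose annotations differ from the expectation."""
    return [text for text, expected in cases if annotate(text) != expected]


def toy_annotator(text: str) -> List[str]:
    # Stand-in annotator (an assumption): covered texts of words equal to "dog".
    return [word for word in text.split() if word == "dog"]


cases = [
    ("I have a dog", ["dog"]),
    ("I have a cat", []),
]
failures = run_suite(toy_annotator, cases)
print(failures)  # prints []
```

Each newly discovered variation becomes one more entry in the list, and any entry that starts failing after a change points at a regression.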

Worked example

Let’s pretend we want to write some NLP code to extract instances of dogs from blocks of text. We start with the sentence “I have a dog”. Our first version of the annotator is exceptionally naive:

for word in sentence:  # sentence: a sequence of word tokens
  if word == "dog":
    annotate(word, "Dog")

We record a test case verifying that, for the input “I have a dog”, the annotator produces exactly one Dog annotation covering the word “dog”.
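A runnable sketch of this first annotator and its test case might look like the following. The function name annotate_dogs_v1, the (type, begin, end) tuple layout, and the whitespace tokenization are all assumptions made for the example.

```python
import re
from typing import List, Tuple

Span = Tuple[str, int, int]  # (annotation type, begin offset, end offset)


def annotate_dogs_v1(sentence: str) -> List[Span]:
    """Naive annotator: mark every whitespace-delimited word equal to "dog"."""
    return [
        ("Dog", match.start(), match.end())
        for match in re.finditer(r"\S+", sentence)
        if match.group() == "dog"
    ]


def test_i_have_a_dog() -> None:
    sentence = "I have a dog"
    annotations = annotate_dogs_v1(sentence)
    # Exactly one Dog annotation, covering the word "dog".
    assert annotations == [("Dog", 9, 12)]
    assert sentence[9:12] == "dog"


test_i_have_a_dog()
```

Checking the offsets as well as the annotation type catches the case where the right type lands on the wrong span.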

We find another sentence “Dogs are great”. We add “Dogs are great” to our test suite. We also find a sentence “The corgi played with the ball.” Corgis are dogs too, so we’ll add “The corgi played with the ball.” We create a dictionary of dog-related terms called DogDictionary (this exercise left to the reader), and we update the annotator as follows:

for word in sentence:  # sentence: a sequence of word tokens
  if word in DogDictionary:
    annotate(word, "Dog")
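Here is a hedged sketch of the dictionary version with the three test sentences collected so far. The dictionary contents, the lowercasing step, and the letters-only tokenization are my assumptions; the pseudocode above does not specify them.

```python
import re
from typing import List

# Hypothetical dictionary contents; the real list is left to the reader.
DOG_DICTIONARY = {"dog", "dogs", "corgi", "corgis"}


def annotate_dogs_v2(sentence: str) -> List[str]:
    """Return the covered text of every word found in the dog dictionary."""
    # Letters-only tokens, so trailing punctuation such as "." is dropped.
    words = re.findall(r"[A-Za-z]+", sentence)
    # Lowercase before lookup so that "Dogs" matches the entry "dogs".
    return [word for word in words if word.lower() in DOG_DICTIONARY]


# One test case per variation collected so far.
assert annotate_dogs_v2("I have a dog") == ["dog"]
assert annotate_dogs_v2("Dogs are great") == ["Dogs"]
assert annotate_dogs_v2("The corgi played with the ball.") == ["corgi"]
```

Keeping the original “I have a dog” case in the suite confirms that the move to a dictionary did not break the first behavior.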

Finally, we find an example of a text we do NOT want annotated. In the sentence “I was dog tired after work today”, we do not want to annotate any word in this sentence as there are no literal dogs mentioned. We add “I was dog tired after work today” to our test suite, with an indication that the text should contain zero annotations. Our annotator is now:

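for word in sentence:  # sentence: a sequence of word tokens
  # annotate only dictionary words that are used as nouns
  if word in DogDictionary and word.part_of_speech == "noun":
    annotate(word, "Dog")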

You can imagine that, as you explore more and more sentences, the annotator will become increasingly complex. It is important to keep the test suite up to date as new patterns are discovered.
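The part-of-speech check and its negative test can be sketched as below. To keep the example self-contained, the input is a list of hand-tagged (word, tag) pairs rather than the output of a real tagger; the tags, the tag names, and the dictionary contents are assumptions, and a real tagger may label “dog” in “dog tired” differently.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

DOG_DICTIONARY = {"dog", "dogs", "corgi", "corgis"}  # hypothetical contents


def annotate_dogs_v3(tokens: List[Token]) -> List[str]:
    """Annotate dictionary words only when they are tagged as nouns."""
    return [
        word
        for word, pos in tokens
        if word.lower() in DOG_DICTIONARY and pos == "NOUN"
    ]


# Hand-tagged inputs (an assumption; a real tagger's output may differ).
positive = [("I", "PRON"), ("have", "VERB"), ("a", "DET"), ("dog", "NOUN")]
negative = [
    ("I", "PRON"), ("was", "AUX"), ("dog", "ADV"), ("tired", "ADJ"),
    ("after", "ADP"), ("work", "NOUN"), ("today", "NOUN"),
]

assert annotate_dogs_v3(positive) == ["dog"]
assert annotate_dogs_v3(negative) == []  # negative test: zero annotations expected
```

Feeding the annotator pre-tagged tokens keeps the unit test focused on the annotator's own logic; whether the tagger itself is right is a separate test.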

Importance of positive and negative tests

Natural language processing has two types of errors:

False positives: text was annotated that should not have been (affects precision, the fraction of produced annotations that are correct)

False negatives: text was not annotated but should have been (affects recall, the fraction of true instances that were annotated)

Improving NLP accuracy is a careful dance of reducing these two complementary kinds of errors. An overly-aggressive annotator will have high recall and low precision, while an overly-passive annotator will have high precision and low recall. Get into the habit of collecting representative examples for each error you fix and you will be able to improve both measures.
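The two measures can be computed directly from the predicted and expected spans. This is a minimal sketch; the function name, the offsets, and the choice to return 1.0 for an empty denominator are assumptions for illustration.

```python
from typing import Set, Tuple

Span = Tuple[int, int]  # (begin, end) offsets of an annotation


def precision_recall(predicted: Set[Span], expected: Set[Span]) -> Tuple[float, float]:
    """Compute precision and recall from predicted and expected spans."""
    true_positives = len(predicted & expected)
    # Convention chosen here: an empty denominator yields 1.0.
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(expected) if expected else 1.0
    return precision, recall


expected = {(9, 12), (20, 25)}              # two true dog mentions (made-up offsets)
predicted = {(9, 12), (30, 33), (40, 43)}   # one correct, two false positives
precision, recall = precision_recall(predicted, expected)
print(precision, recall)
```

In this made-up case the annotator is too aggressive: only one of its three annotations is correct (precision of one third), and it finds one of the two true mentions (recall of one half).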

Conclusion

Natural Language Processing is part art, part science, but that does not mean it can’t be tested with automation. NLP can be tested at both the unit and functional levels. When testing at the unit level, collect examples of text that you want (and don’t want) to receive a certain type of annotation. Functional-level NLP testing will be covered next.