Cognitive system testing: Overall system accuracy testing

Part 5 of the Cognitive System Testing series, originally posted in 2016 on IBM Developer.

Introduction

As described in previous chapters, cognitive systems are probabilistic, non-deterministic systems that will never achieve 100% accuracy. (After all, what is the accuracy of a human expert, the best alternative cognitive system?) Even though we don’t expect to achieve 100% accuracy, we still need to measure the accuracy of our system and verify that it is maintained or improved as we continue development. The previous chapters describe how we verify the accuracy of the components. This chapter describes how we verify the accuracy of the whole.

Overall Approach

We test the system against a collection of ground truth. Ground truth is, generically, a set of correct input/output pairs for the system: if we give the system input_A, we expect to receive output_A. For the Jeopardy! Watson system, the inputs were Jeopardy! clues and the outputs were answers (of course, in the form of a question). For optimal testing of the system, a large amount of ground truth is required, and this ground truth should be as varied as the full set of inputs that will be sent to the system.

Let’s once again consider the system in layers. At minimum, we will have an entire-system computational layer and a single-document NLP computational layer. Your solution may have additional layers. For instance, a medical solution may have a patient computational layer, where the complete view of a patient is derived from multiple documents. Once more, let’s consider the testing pyramid. We want to test each layer, but we have a problem: collecting ground truth is very time-consuming and expensive. Thus, we start at the top, with the entire system-level view.

Measuring system-level accuracy

Depending on the problem being solved, there may be an existing database or repository suitable for seeding a system-level ground truth collection. For instance, you may have a few years’ historical Jeopardy! clues and answers, or an electronic medical record database with anonymized patient attributes and outcomes. In other systems this system-level data has to be curated by hand, and it might be the only ground truth you get. In any case, you should have an automation suite that runs your cognitive system over all of the inputs in your ground truth, collects the outputs, and compares them with the ground truth. The automation suite should output a report that lets analysts determine where the system is performing well (and not so well). The report columns should include a ground truth ID, key input variables, the actual output, the correct output, and a true/false “System correct?” flag. Such a report can be sorted and filtered in a variety of ways, so the analysts can determine which variables contribute the most to errors.

Example report:

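| Input ID | Input Variables | System Output | Correct Output | System Correct? |
|----------|-----------------|---------------|----------------|-----------------|
| 1 | Female patient, ailment 1, age 50 | Drug A | Drug A | Yes |
| 2 | Male patient, ailment 2, age 60 | Drug B | Drug C | No |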
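
The automation suite described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `GroundTruthItem` record, the `build_report` helper, and the `toy_system` rule are hypothetical stand-ins, not part of any real system described in this series. Only the report columns come from the text.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, List

REPORT_COLUMNS = [
    "Input ID", "Input Variables", "System Output",
    "Correct Output", "System Correct?",
]


@dataclass
class GroundTruthItem:
    item_id: str             # ground truth ID
    inputs: Dict[str, str]   # key input variables
    expected: str            # correct output


def build_report(items: List[GroundTruthItem],
                 run_system: Callable[[Dict[str, str]], str]) -> List[Dict[str, str]]:
    """Run the system over every ground truth input and compare the outputs."""
    rows = []
    for item in items:
        actual = run_system(item.inputs)
        rows.append({
            "Input ID": item.item_id,
            "Input Variables": ", ".join(f"{k}={v}" for k, v in item.inputs.items()),
            "System Output": actual,
            "Correct Output": item.expected,
            "System Correct?": "Yes" if actual == item.expected else "No",
        })
    return rows


def to_csv(rows: List[Dict[str, str]]) -> str:
    """Serialize the report so analysts can sort and filter it in a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=REPORT_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def toy_system(inputs: Dict[str, str]) -> str:
    """Hypothetical stand-in for the real cognitive system."""
    return "Drug A" if inputs["ailment"] == "1" else "Drug B"


ground_truth = [
    GroundTruthItem("1", {"sex": "female", "ailment": "1", "age": "50"}, "Drug A"),
    GroundTruthItem("2", {"sex": "male", "ailment": "2", "age": "60"}, "Drug C"),
]
report = build_report(ground_truth, toy_system)
print(to_csv(report))
```

Keeping the system behind a plain callable means the same harness can be pointed at each new build without changes.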

Measuring sub-system accuracy

The system-level view of accuracy is great for telling you about errors in the big picture, but it will be insufficient for finding out where those errors are being made. My colleague Robert Nielsen says the system-level view is like taking a quiz, showing your work, and only being told which answers were wrong – it’s more useful to be told which step was wrong. Thus a best practice is to step down a level and repeat our ground truth exercise. If the system-level test uses a collection of documents, you should have an NLP-level test which only considers a single document.

Let’s look at how we would set up an NLP-level test for the system described above. Assume there are five documents for each patient. Set up new ground truth that tests what you expect to get out of each document.

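| Document ID | System Output | Correct Output | System Correct? |
|-------------|---------------|----------------|-----------------|
| 1.1 | Ailment 1 | Ailment 1 | Yes |
| 1.2 | (none) | Ailment 1 | No |
| 1.3 | Ailment 1 | Ailment 2 | No |
| 1.4 | (none) | Ailment 2 | No |
| 1.5 | (none) | Ailment 2 | No |
| 2.1 | Ailment 2 | Ailment 2 | Yes |
| 2.2 | Ailment 2 | Ailment 2 | Yes |
| 2.3 | Ailment 2 | Ailment 2 | Yes |
| 2.4 | Ailment 2 | Ailment 2 | Yes |
| 2.5 | Ailment 3 | (none) | No |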

Now we can learn additional information about the system. For case 1, we may have gotten the right answer for the wrong reason, as the system suggested ailment 1 even though most source documents had ailment 2. For case 2, the NLP is pretty accurate at detecting ailments, so we can look elsewhere for the source of the error.
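
This per-case reading can also be made mechanical. The sketch below rolls the document-level results up by case. It assumes that the part of a document ID before the dot is the case ID, and that a blank cell in the table means "no annotation" (encoded here as `None`). Which column each blank falls in is itself an assumption, chosen to match the counts discussed in this chapter. The `summarize_by_case` helper is hypothetical.

```python
from collections import Counter, defaultdict

# (document ID, system output, correct output); None marks a blank cell.
DOC_RESULTS = [
    ("1.1", "Ailment 1", "Ailment 1"),
    ("1.2", None, "Ailment 1"),
    ("1.3", "Ailment 1", "Ailment 2"),
    ("1.4", None, "Ailment 2"),
    ("1.5", None, "Ailment 2"),
    ("2.1", "Ailment 2", "Ailment 2"),
    ("2.2", "Ailment 2", "Ailment 2"),
    ("2.3", "Ailment 2", "Ailment 2"),
    ("2.4", "Ailment 2", "Ailment 2"),
    ("2.5", "Ailment 3", None),
]


def summarize_by_case(results):
    """Group document-level results by case and summarize NLP behaviour."""
    by_case = defaultdict(list)
    for doc_id, actual, expected in results:
        case_id = doc_id.split(".")[0]
        by_case[case_id].append((actual, expected))

    summary = {}
    for case_id, pairs in by_case.items():
        expected_counts = Counter(e for _, e in pairs if e is not None)
        majority = expected_counts.most_common(1)
        summary[case_id] = {
            "documents": len(pairs),
            # Documents where the annotator output matches the ground truth.
            "nlp_correct": sum(1 for a, e in pairs if a == e),
            # Most frequent ailment in the ground truth for this case.
            "majority_expected": majority[0][0] if majority else None,
        }
    return summary


case_summary = summarize_by_case(DOC_RESULTS)
for case_id, stats in sorted(case_summary.items()):
    print(case_id, stats)
```

A case with a low `nlp_correct` count but a correct system-level answer is a candidate for "right answer, wrong reason".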

Quick note on NLP measurements

The first measurement people think to take for NLP is a simple accuracy measure: the number of right answers divided by the total number of questions. This can be misleading. Consider trying to detect something rare that occurs in only 1% of documents. A simple “no-op” annotator that never creates any annotations will be 99% accurate! This is clearly not what is desired.
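
A small sketch makes the problem concrete. The 100-document collection below is made up for illustration, with exactly one document containing the rare item, and `accuracy` is a hypothetical helper.

```python
def accuracy(predictions, labels):
    """Fraction of documents where the prediction matches the label."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)


# True means "the rare item is present in this document".
labels = [True] + [False] * 99

# A no-op annotator never creates an annotation.
noop_predictions = [False] * len(labels)

noop_accuracy = accuracy(noop_predictions, labels)
print(f"no-op accuracy: {noop_accuracy:.0%}")
```

The annotator scores 99% while finding none of the documents we actually care about.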

Rather, we measure with the F1 score. F1 is the harmonic mean of precision (when the system gives an answer, how often is it right?) and recall (of the correct answers that exist, how many does the system find?). The F1 score of our “no-op” annotator above is 0 due to a recall of 0. Our ailment annotator above has 5/7 = 71% precision and 5/9 = 56% recall, for an F1 score of 62.5% (accuracy was 50%). Thus we now know specifically that the ailment annotator needs to be more aggressive in surfacing ailments, and the patterns in documents 1.2, 1.3, and 1.5 are a good place to start.
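
The arithmetic can be checked with a short sketch. The `AnnotatorCounts` record and its field names are hypothetical. The numbers are the ones quoted above: 5 correct answers out of 7 given, against 9 expected. Returning 0 when a denominator is zero is a common convention, assumed here so that the no-op annotator gets a defined score.

```python
from dataclasses import dataclass


@dataclass
class AnnotatorCounts:
    correct: int   # answers the system gave that match the ground truth
    given: int     # answers the system gave in total
    expected: int  # answers present in the ground truth


def precision(c: AnnotatorCounts) -> float:
    """When the system gives an answer, how often is it right?"""
    return c.correct / c.given if c.given else 0.0


def recall(c: AnnotatorCounts) -> float:
    """Of the answers in the ground truth, how many did the system find?"""
    return c.correct / c.expected if c.expected else 0.0


def f1(c: AnnotatorCounts) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


ailment_annotator = AnnotatorCounts(correct=5, given=7, expected=9)
noop_annotator = AnnotatorCounts(correct=0, given=0, expected=1)

for name, counts in [("ailment", ailment_annotator), ("no-op", noop_annotator)]:
    print(f"{name}: precision={precision(counts):.1%} "
          f"recall={recall(counts):.1%} F1={f1(counts):.1%}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two values, so an annotator cannot score well by doing well on precision or recall alone.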

Iterate, iterate, iterate

Improving the performance of your cognitive system is an iterative process. Look at where the system has the lowest performance, dive into the subsystems and components causing the bad performance, improve those subsystems, and repeat. Continue this process until the system is “good enough”. Remember that 100% accuracy will not be possible. And be sure to measure accuracy after every build, so you can track whether system accuracy is increasing or decreasing, and why.
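
One way to track the “why” between builds is to compare per-item results rather than only the headline number. The sketch below assumes each build’s report has been reduced to a mapping from ground truth ID to a correct/incorrect flag. The build names, the data, and the `compare_builds` helper are all hypothetical.

```python
def compare_builds(previous, current):
    """Compare two builds' results, each a dict of ground truth ID -> bool."""
    shared = previous.keys() & current.keys()

    def accuracy(results):
        return sum(results.values()) / len(results)

    return {
        "previous_accuracy": accuracy(previous),
        "current_accuracy": accuracy(current),
        # Items that were right before and are wrong now.
        "regressions": sorted(i for i in shared if previous[i] and not current[i]),
        # Items that were wrong before and are right now.
        "fixes": sorted(i for i in shared if not previous[i] and current[i]),
    }


# Made-up results for two consecutive builds.
build_41 = {"1": True, "2": False, "3": True, "4": False}
build_42 = {"1": True, "2": True, "3": False, "4": True}

comparison = compare_builds(build_41, build_42)
print(comparison)
```

Here overall accuracy rose, yet item 3 regressed. The headline number alone would have hidden that.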

Conclusion

The accuracy performance of a cognitive system should be carefully measured. The best method is to collect ground truth (inputs with desired outputs) for each level of the system, starting at the top and working down through sub-components. Measure accuracy after each build and improve the system by looking at where the system performs the worst, fixing the worst parts, and iterating until accuracy reaches a desired target (even though 100% accuracy is impossible).
