Balancing precision and recall in your cognitive system

In building a cognitive model it can be difficult to know when it’s accurate enough.  (You are measuring your daily performance, right?)  We know that cognitive performance often involves making a tradeoff between the competing factors of precision and recall.  In this post I’ll explore some strategies for dealing with the tradeoffs.

Strategy 1: Use different F-scores

In most of my posts I describe using F1, but you can use any F-score.  F1 gives equal weight to precision and recall.  Depending on your application certain kinds of mistakes are worse than others and a different F-score is appropriate.  You can plug other constants into the F-score formula to favor precision or recall.

If you are building an application where not surfacing a potential correct answer is a big deal (for instance, a search application), you should favor recall by using F2.  This punishes false negatives twice as much as false positives.  If the worse thing your application can do is produce a false positive, favor precision by using F0.5.  If your users demand an even more extreme balance, ratchet up (or down) that F-score coefficient as needed.

You get what you measure, so it is important to decide the metric that measures success.

Strategy 2: Use multiple layers with different optimizations

This is an approach popularized by the Jeopardy!-playing Watson system.  The first layer in that system favored recall, generating multiple hypotheses that were sent to a second layer that merged and ranked them.  This second layer favored precision, keeping only the highest ranked answer (and only if it cleared a confidence threshold).

This approach is adaptable to other cognitive approaches.  Let’s assume we have a Product Color annotator that runs on product descriptions and has a 60% F1 score.

Example resolution layer with majority voting

Product #Annotation layer resultsVoting layer resultActual color
Product 1green, green, purplegreengreen
Product 2red, red, blue, redredred
Product 3yellow, orange-yellow

We can improve our detection of colors using a resolution layer, in this case simple majority voting.  Many other resolution techniques are possible.  The resolution layer allows the NLP layer to be a little looser, favoring recall, and tightening (favoring precision) can happen later.

Strategy 3: Build a dynamic confidence model

Rather than using a one-size-fits-all method to balance precision and recall, you can adapt the balance between them on a case-by-case basis.  If you are using rules-based NLP, you could measure the precision of each of your rules and let the best rules ‘overrule’ the weakest.  You could examine document metadata and decide to trust (or be suspicious of) results from documents of given type or age.

The model should be tuned from experience, or better yet actual results, to improve the chances of imperfect tools giving quality results.

This strategy is a variation of Strategy 2 using a much fancier layer.  For additional ideas, see my previous post about building an NLP confidence model.


There are multiple strategies you can use to balance precision and recall in your cognitive system.  First decide what types of errors are most harmful to your system, then use these strategies to balance precision and recall for optimal results.