Calculating confidence of natural language processing results using machine learning

I’ve been building a set of natural language processing (NLP) annotators and have frequently been asked about my confidence in the results.  This is an interesting question with a simple answer and a more complex (though probably more interesting) answer.

The simple answer for “what confidence should I have in an annotator” is “the precision of that annotator”.  (You are measuring precision, right?)  Precision is the number of correct answers surfaced by your annotator divided by the total number of answers from your annotator.  If an annotator gave 4 answers and 3 are correct, its precision is 75%.  Thus for any given answer it provides, without knowing anything else, we should have 75% confidence that the answer is correct.

Measuring precision of two sample annotators

| Annotator # | Correct Value |
|-------------|---------------|
| 1           | Yes           |
| 1           | Yes           |
| 1           | Yes           |
| 1           | No            |
| 2           | Yes           |
| 2           | No            |
| 2           | No            |
| 2           | No            |
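
To make the arithmetic concrete, here is a minimal Python sketch that computes precision for the two sample annotators above.  The answer lists are hypothetical and simply mirror the table: annotator 1 got 3 of 4 answers right, annotator 2 got 1 of 4.

```python
# Each annotator's answers, marked True if the answer was correct.
# These lists mirror the sample table above (hypothetical data).
annotator_answers = {
    1: [True, True, True, False],    # 3 of 4 correct -> precision 0.75
    2: [True, False, False, False],  # 1 of 4 correct -> precision 0.25
}

for annotator, answers in annotator_answers.items():
    precision = sum(answers) / len(answers)  # correct answers / total answers
    print(f"Annotator {annotator}: precision = {precision:.2f}")
```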

If you’re in an 80/20 situation, you can stop here.  The precision for an annotator is often the most predictive factor in whether a given answer from that annotator is correct, and it costs very little to generate this confidence.  But let’s suppose you have a burning desire to improve on this.

We will use a machine learning algorithm called “logistic regression” to estimate the confidence.  In short, we come up with a list of variables that we think contribute to whether or not an answer is correct.  In machine learning parlance, these variables are called “features”.  The process of selecting which features to use is called “feature engineering”; it is part art, part science, and you can spend a lot of time on it.  For our example, let us build a very simple model using only two features.

The first feature, as you might have guessed, is the precision of the annotator.  Let’s assume we have a hypothesis that for our problem domain, correct answers are found earlier in the text rather than later in the text.  Thus our second feature is “position within document (relative to size of document)”.  If the annotation occurs on the 5th word of a 100-word document, we score that 0.95 ((100-5)/100).
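
For concreteness, the position feature takes only a couple of lines to compute.  This is just a sketch; the function name is mine:

```python
def position_score(word_index, document_length):
    """Relative position: near 1.0 at the start of the document, near 0.0 at the end."""
    return (document_length - word_index) / document_length

print(position_score(5, 100))   # 0.95 -- the example from the text
print(position_score(50, 100))  # 0.50 -- an annotation at the document midpoint
```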

Quick note on features: several machine learning algorithms work best if your features are all on the same scale.  0-1 is a common scale.  Additionally, I like to have my features produce 0 in the worst case and 1 in the best case.

Build a CSV file with a column for each feature and a column for the outcome.  The outcome column is 0 if the annotator’s answer was wrong, and 1 if it was right.  Logistic regression then predicts, from the feature scores, how likely a given answer is to be correct.  (A sketch of this step follows the sample data below.)

Measuring confidence of annotators using logistic regression - sample data

| Precision | Position score | Correct |
|-----------|----------------|---------|
| 0.75      | 0.95           | 1       |
| 0.75      | 0.4            | 1       |
| 0.75      | 0.5            | 1       |
| 0.75      | 0.35           | 0       |
| 0.25      | 0.7            | 1       |
| 0.25      | 0.5            | 0       |
| 0.25      | 0.9            | 0       |
| 0.25      | 0.25           | 0       |
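
Here is a sketch of that step using scikit-learn.  The file name and variable names are mine, and scikit-learn applies regularization by default, so the coefficients it learns will not exactly match the ones quoted below.

```python
import csv
from sklearn.linear_model import LogisticRegression

# Training rows from the sample table: (precision, position score, correct?).
rows = [
    (0.75, 0.95, 1), (0.75, 0.40, 1), (0.75, 0.50, 1), (0.75, 0.35, 0),
    (0.25, 0.70, 1), (0.25, 0.50, 0), (0.25, 0.90, 0), (0.25, 0.25, 0),
]

# Write the CSV described above: one column per feature plus the outcome.
with open("training.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["precision", "position_score", "correct"])
    writer.writerows(rows)

# Fit a logistic regression on the two features.
X = [[precision, position] for precision, position, _ in rows]
y = [correct for _, _, correct in rows]
model = LogisticRegression()
model.fit(X, y)

print(model.intercept_, model.coef_)        # intercept and feature coefficients
print(model.predict_proba([[0.75, 0.95]]))  # [P(incorrect), P(correct)]
```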

For my sample data, the logistic regression produced a logit function of -3.46 + 4.06*F1 + 2.57*F2, where F1 is the annotator precision and F2 is the position-within-document score.  We then use the sigmoid function (1/(1+e^(-logit))) to determine the confidence.  For instance, where our 75% annotator found an answer in the 5th word, the logit is -3.46 + 4.06*0.75 + 2.57*0.95 = 2.03, with sigmoid (confidence) 1/(1+e^(-2.03)) = 88.3%.  When the answer was found at the midpoint of the document, the confidence drops to 70.4%.
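
The same calculation is easy to reproduce by hand: plug the fitted coefficients into the sigmoid.  The coefficients below are from my run; yours will differ a little depending on your training data and solver.

```python
import math

def confidence(precision, position_score):
    # Logit from the fitted model: -3.46 + 4.06*F1 + 2.57*F2
    logit = -3.46 + 4.06 * precision + 2.57 * position_score
    return 1 / (1 + math.exp(-logit))  # sigmoid

print(confidence(0.75, 0.95))  # ~0.88: answer found near the start of the document
print(confidence(0.75, 0.50))  # ~0.70: answer found at the document midpoint
```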

Thus in this case, the machine learning algorithm validated our assumption.  Even though the annotator’s precision was the most significant predictor (it had the highest coefficient in our equation), a more exacting confidence value can be generated with additional features.

Which features you may need to try is left as an exercise for the reader, but now that you know the method, experiment away!
