Feature engineering lessons learned while calculating natural language processing confidence

In my post on Calculating confidence of natural language processing results using machine learning, I described building a logistic regression model using two simple features.  I learned a few things and made a few mistakes along the way, and I’ll describe both here.

The first lesson was that my logit function (b0 + b1*x1 + … + bn*xn, where bi are the feature coefficients and xi the feature scores) had negative coefficients.  It’s not wrong to have negative coefficients, but it does mean that the associated feature correlates negatively with producing correct outcomes.  With a negative coefficient on a feature that scores 0-1, the best possible score for that feature actually lowers the overall confidence in the result.  In this case you can try to “fix” the feature scoring, or simply remove the feature from consideration.
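As a concrete illustration, here is a minimal sketch of spotting a negative coefficient after training.  It assumes scikit-learn and feature scores scaled 0 to 1; the feature names and training data are placeholders, not the features from my actual model.

    # Minimal sketch, assuming scikit-learn and feature scores scaled 0-1.
    # The feature names and training data below are placeholders.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    X = np.random.rand(500, 2)           # two hypothetical feature scores in [0, 1]
    y = (X[:, 0] > 0.5).astype(int)      # stand-in labels: was the NLP result correct?

    model = LogisticRegression().fit(X, y)

    for name, coef in zip(["feature_1", "feature_2"], model.coef_[0]):
        if coef < 0:
            # A negative coefficient means a "better" score lowers confidence:
            # revisit how this feature is scored, or drop it from the model.
            print(f"{name}: coefficient {coef:.3f} is negative")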

A similar lesson comes with very small coefficients.  If my logit function is -3 + 6*x1 + 0.003*x2, then x2 has very little predictive power.  Depending on the cost of calculating features, I may just want to drop the calculation of x2, as it can only move my prediction by a minuscule fraction.
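To see just how little such a feature can move the prediction, here is a quick back-of-the-envelope check using the illustrative coefficients above (not values from a real model):

    import math

    def sigmoid(z):
        return 1 / (1 + math.exp(-z))

    # Illustrative logit: -3 + 6*x1 + 0.003*x2, with x1 and x2 scored 0-1.
    with_x2 = sigmoid(-3 + 6 * 1.0 + 0.003 * 1.0)     # x2 at its best score
    without_x2 = sigmoid(-3 + 6 * 1.0 + 0.003 * 0.0)  # x2 at its worst score
    print(with_x2 - without_x2)  # roughly 0.0001 -- x2 barely changes the confidence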

The final lesson is to do a sanity check on your logit function.  If you follow my advice of scaling all feature scores from 0 to 1, with 0 always produced by the worst possible input and 1 by the best possible input, you should hope for the constant term (b0) to be approximately -3 and the sum of the constant and all coefficients (b0+b1+…+bn) to be approximately +3.  (These are the logit values you get from passing all 0s, then all 1s, respectively, to your logit function.)  Logit values of -3 and +3 map to roughly 5% and 95% confidence in the classified result (using the sigmoid function to convert the logit to a confidence).
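This check is easy to automate.  Here is a rough sketch (again assuming feature scores scaled 0 to 1) that evaluates the logit at all-zero and all-one feature scores and converts each to a confidence with the sigmoid function:

    import math

    def sigmoid(z):
        return 1 / (1 + math.exp(-z))

    def sanity_check(intercept, coefficients):
        """Confidence at the worst (all 0s) and best (all 1s) feature scores."""
        worst = sigmoid(intercept)                      # all feature scores are 0
        best = sigmoid(intercept + sum(coefficients))   # all feature scores are 1
        return worst, best

    # With an intercept of -3 and coefficients summing to +6,
    # this prints roughly (0.05, 0.95).
    print(sanity_check(-3, [4, 2]))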

This sanity check makes intuitive sense.  If an input you have never seen before generates the worst possible feature scores, you should have very low confidence that the associated answer will be correct; even 5% is optimistic here.  On the other hand, for an input that generates perfect feature scores, your predicted accuracy tops out in the high 90s.  Intuitively it seems wrong to give 100% confidence to almost any prediction.

If your logit function does not range from about -3 to +3, then machine learning is telling you that either the features you have selected are not good predictors or you lack training data.  In my case I was testing up to three features on hundreds of data rows, which should be sufficient for a decent prediction.  My logit ranged from -2 to +0.4, meaning the model would never be more than about 60% confident in its prediction based on my input features.  It turned out I had a bug in my feature scorer and was generating non-predictive data.  I knew the feature scorer bug was fixed when I retrained the model and the logit ranged from -3 to +3.
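For reference, the same sigmoid arithmetic shows where that ceiling came from; the numbers below are just the endpoints of the observed logit range:

    import math

    def sigmoid(z):
        return 1 / (1 + math.exp(-z))

    print(sigmoid(-2.0))  # roughly 0.12 -- confidence at the worst observed logit
    print(sigmoid(0.4))   # roughly 0.60 -- the model could never be more confident than this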

After running through these lessons, I built a model I am pretty happy with.  As is always the case with machine learning, I intend to revisit this model and retrain it once my input data changes significantly.
