Why does machine learning require so much training data?

Everywhere I look I see people trying to solve problems with machine learning. It’s an exciting time to be a cognitive developer! One area I see these people struggling in is gathering sufficient training data to produce a good model. At first, this seems unintuitive. “Machine learning” has the words “machine” and “learning” built right in – why doesn’t the machine just learn?

Machine learning – what is it really?

“Machine learning” has a high mystique factor in its name – kudos to whoever came up with it. “Pattern extraction from input data” does not have nearly the same mystique, but it is more descriptive. This alternative name gives proper weight to the input data, which is what you need in order to extract patterns, that is, to learn.

Machine learning uses complex, high-order math to solve equations with hundreds, thousands, or millions of input variables. The outputs of machine learning can only be textually visualized in long vectors of numbers or graphically visualized in hyper-dimensional space. All that said – gathering sufficient training data is the hardest part of machine learning!

How much training data do you need?

My personal minimum is 50 examples for each combination of input and output variables. For a binary classification (one input, two possible outputs), this implies 2*50 = 100 examples. For a regression model over 10 variables (ten inputs, one output), that means at least 10*50 = 500 examples. For a classification model of 10 variables to 3 targets, that means 10*3*50 = 1500 examples. For machine-learning based NLP, à la Watson Knowledge Studio, I would want at least 50 data points (at least sentences, if not paragraphs or documents) for both positive and negative instances (i.e., when the term should and should not be extracted), for each type I expect NLP to extract. Thus 20 NLP types would take a minimum of 20*50*2 = 2000 data points. Depending on your problem, this means a ton of training data. And of course, 50 is a minimum: you’ll do much better with 100 or more!
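The arithmetic above can be written as a tiny helper. This is only a sketch of my rule of thumb, not a statistical guarantee; the function name `min_examples` and its parameters are made up for illustration.

```python
def min_examples(n_inputs: int, n_outputs: int, per_combination: int = 50) -> int:
    """Rule-of-thumb minimum: `per_combination` examples for every
    input/output combination (hypothetical helper, not a library function)."""
    return n_inputs * n_outputs * per_combination


# The cases discussed in the text.
binary_classification = min_examples(n_inputs=1, n_outputs=2)
regression_10_vars = min_examples(n_inputs=10, n_outputs=1)
classification_10_to_3 = min_examples(n_inputs=10, n_outputs=3)
# 20 NLP types, each with positive and negative instances.
nlp_20_types = min_examples(n_inputs=20, n_outputs=2)

print(binary_classification, regression_10_vars, classification_10_to_3, nlp_20_types)
```

Raising `per_combination` to 100 simply doubles every figure, which is why the totals grow so quickly.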

The volume requirements might be frightening.  But the reason you need such volume is to make sure your training data is representative and that it captures all the natural variations that exist in your data.  If your training data set is not sufficiently large, the noise that naturally appears in any sample will mask the proper signal from the variations that you actually want to capture.
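To get a feel for how noise shrinks as the sample grows, here is a small simulation. It is a sketch under assumptions I picked for illustration: the "signal" is a true rate of 0.3, and each sample just estimates that rate. The function name and all of the numbers are hypothetical.

```python
import random
import statistics


def estimate_spread(sample_size, true_rate=0.3, trials=200, seed=0):
    """Draw `trials` samples of `sample_size` yes/no observations and
    return how much the estimated rate varies from sample to sample."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    estimates = []
    for _ in range(trials):
        hits = sum(1 for _ in range(sample_size) if rng.random() < true_rate)
        estimates.append(hits / sample_size)
    # Population standard deviation of the estimates = sampling noise.
    return statistics.pstdev(estimates)


for n in (25, 50, 100, 1000):
    print(n, round(estimate_spread(n), 3))
```

The spread printed for 25 examples should be several times larger than the spread for 1000, which is the noise that can mask the signal in a small training set.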

I did a simple natural language classification exercise, to lunch or not to lunch with Watson Developer Cloud, using only 25 examples per input variable.  As you can read in the post, the machine learning model starts to make some nice predictions but very quickly breaks down in cases with any ambiguity, and makes several flat-out wrong guesses.  These performance problems can only be resolved by adding more training data.

Training data shortcuts – and why they don’t work

The first shortcut most folks reach for is to artificially generate training data. For a conversational system, people may try to write questions “they think people would ask the system”. This is wishful thinking: no matter how hard you try to think like a user, you cannot match what users would actually say. There are too many possible variations in grammar, capitalization, question length, and so on for you to capture them all. A model trained on proper English sentences is going to perform horribly if the real users are sending SMS messages with “text-speak”.

A variation on this erroneous shortcut for more mathematical problems is “can I generate random training data?” The problem is similar to the text case above. The machine learning model will learn to extract random patterns, since the random data will not have the same natural variations as “real” data.
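A toy experiment illustrates the point. This is a minimal sketch, assuming a made-up "real" pattern (the label is 1 when the input is above 0.5) and a simple nearest-neighbor model; none of it comes from a real project. One model is trained on real examples, the other on randomly labeled examples, and both are scored on real data.

```python
import random


def true_label(x):
    # Assumed "real" pattern: the label depends on the input.
    return 1 if x > 0.5 else 0


def nearest_neighbor_predict(train, x):
    # Predict the label of the closest training input.
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(train, test):
    correct = sum(1 for x, y in test if nearest_neighbor_predict(train, x) == y)
    return correct / len(test)


rng = random.Random(0)  # fixed seed so the run is repeatable
real_train = [(x, true_label(x)) for x in (rng.random() for _ in range(200))]
random_train = [(rng.random(), rng.randint(0, 1)) for _ in range(200)]
test = [(x, true_label(x)) for x in (rng.random() for _ in range(500))]

real_acc = accuracy(real_train, test)
random_acc = accuracy(random_train, test)
print("trained on real data:  ", real_acc)
print("trained on random data:", random_acc)
```

The model trained on random data has nothing real to learn, so its accuracy should hover around a coin flip, while the model trained on real examples should score close to perfect on this easy problem.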

The other shortcut commonly taken is to just use less training data anyway and “see what happens”.  As described previously, this can produce decent results in some cases.  If you are doing a proof of concept, by all means, give it a shot.  If you are expecting expert human-level performance from your model, you will probably be disappointed, especially in cases with any ambiguity.

A final plea for training data

Many tasks where I see machine learning applied are tasks formerly performed by experts. I often think about the institutional knowledge that went into training these experts. This institutional knowledge is built up from thousands, tens or hundreds of thousands, or more interactions. If I want to help these experts out by shifting some workload to a machine learning model, how much data do I need? If the experts’ institutional knowledge is derived from 10,000 data points, should I expect to replicate their performance if I give 100 data points to machine learning? It’s fair to expect machine learning to reach expert performance with fewer than 10,000 data points in this case, but it’s going to take more than 100.

So, gather training data in any way you can, as long as it comes from real users. Gather data from logs, query histories, and chat transcripts. Show users mock-ups of your application and save everything they type into it. Use existing databases or spreadsheets of data that have already been examined. Your machine learning model will only be as good as the data you feed it!