Training a personal chatbot with Watson Developer Cloud APIs

The Watson Developer Cloud APIs give you building blocks for creating your own applications with the cognitive capabilities of Watson.  There’s been a lot of press about combining Watson and Slack, and I’d like to share my own views on how to get started with the training process.

What is classification?

The key technology behind the Conversation and Natural Language Classifier services is classification.  For text classification, the core question boils down to “given a block of text, is it more likely an X or a Y?” or “given a block of text, should I classify it as an X or a Y?”  The classifier is trained by selecting examples of text that mean X and examples that mean Y.  After the classifier is trained, it can make classification decisions on blocks of text it has never seen before.

Finding training data for the classifier behind a chat bot

That’s all well and good in the abstract.  But how do we put it to use?  I’m going to describe how to build a personal chat bot.  I use an instant messaging program at work – it would be great if a chat bot could screen my incoming messages and respond for me, at least for the easy stuff.  To do this, I need to gather a significant amount of training data and train some classifiers.

Fortunately, my instant messaging application logs all chat transcripts.  700MB of chat transcripts, in fact!  I first narrowed down the transcripts by searching for all messages containing a question mark.  This yielded over 66,000 questions.  Scanning the questions gives me a rough idea of what kinds of questions I am being asked, but building a chat bot still feels overwhelming.  Now what?

Searching for training data:
grep -R \? SametimeTranscripts > ../SametimeQuestions.html
cat ../SametimeQuestions.html | cut -d '>' -f 12 | cut -d '<' -f 1 | sort | uniq > ../SametimeQuestions.txt

First, a major thing I did, and another that I did NOT do.  The thing I did was to narrow my aim.  Instead of building an overall chat bot, I narrowed in on a lunch-assisting chat bot.  The bot’s only job is to determine if a question is about lunch (my answer will be “I’d love to go, pick me up at 11:30”) or about anything else (in which case, don’t answer for me).  The thing I did NOT do was give up on my chat transcripts and manually create questions.  This is important, because classifiers work best if they are trained on representative data.  It would take less effort to create ‘synthetic’ questions, but the classifier performance would suffer (more on this later).

Focusing on a simple classification scheme (“lunch” or “other”) made it easier to find representative questions.

For “lunch”, I used simple ‘grep’ commands to find the word ‘lunch’ or restaurants I commonly go to.  Examples:
grep -i lunch SametimeQuestions.txt
grep -i carmens SametimeQuestions.txt

For “other”, I simply grabbed a random sampling of questions (Perl script reference)
perl -ne 'print ((0 == $. % 100) ? $_ : "")' SametimeQuestions.txt

For a starter set, I copied 20+ questions from both command sets into a single Excel file.  Column A for the question, Column B for classification (‘lunch’ or ‘other’).
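If you’d rather skip Excel, the same two-column file can be assembled with shell one-liners.  This is a hedged sketch, not what I actually ran: lunch.txt and other.txt are hypothetical files standing in for the output of the grep and perl commands above.

```shell
# Hypothetical stand-ins for the grep and perl output above.
printf 'lunch @ 1130?\ncarmens?\n' > lunch.txt
printf 'Can I call?\n'             > other.txt

# Strip commas from each question, then append the label as a second column.
sed 's/,//g; s/$/,lunch/' lunch.txt  >  training.csv
sed 's/,//g; s/$/,other/' other.txt >> training.csv
```

Stripping the commas up front is the lazy route; proper CSV quoting (covered in the protip below) preserves the original text.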

Sample training data:
lunch @ 1130?,lunch
carmen’s?,lunch
moes?,lunch
Can I call?,other
Do I need to assign a change set in order to check in?,other
Andrew- u have extra shinguard?,other
let's chat after I grab some lunch .. you have some time this afternoon?,other

Full training data: Lunch vs Other NLC Training Data

The command which produced the question is a strong indicator of the classification I should use, but it’s worth noting this is not 100% reliable.  For instance, one of my questions was “let’s chat after I grab some lunch .. you have some time this afternoon?” – this is clearly not a lunch invitation but rather a desire to have a meeting, so I had to pay attention while classifying it.  After saving this file as a CSV, I was ready to upload it to the Watson APIs as training data.

Protip: Double-check your file contents here.  String-escape your questions (or remove the commas), so that your training file is treated as a two-column CSV.  Beware also of special characters like “smart quotes”, which will be rejected by the training API.  (Fortunately, the API will tell you the line number of any special characters it doesn’t understand.)
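One hedged way to do the escaping with standard tools (questions.txt is a hypothetical one-question-per-line file): double any embedded quotes and wrap each line in quotes, the standard CSV convention, so commas inside a question no longer split the columns.

```shell
printf 'lunch @ 1130, maybe 1145?\n' > questions.txt   # hypothetical input

# Double embedded quotes, then wrap the whole field in quotes so commas survive.
sed 's/"/""/g; s/^/"/; s/$/"/' questions.txt
```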

Sending training data to an NLC classifier service

I used the NLC API instead of going straight to Conversation, since I want to focus on the training techniques and I find NLC slightly easier to get started with.  I followed the NLC Getting Started Reference and created a service instance at https://console.ng.bluemix.net/catalog/services/natural-language-classifier/.  Clicking the Credentials tab gives me the user, password, and URL I need to get started.  (The NLC API Reference is a great cookbook to use once you have these three things.)

Training my first classifier:
curl -u "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx":"xxxxxxxxxxxx" -F training_data=@AndrewChatsNLCTraining.csv -F training_metadata="{\"language\":\"en\",\"name\":\"ARF-Chatbot-NLC\"}" https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers
{
"classifier_id" : "xxxxxxxxxx-nlc-xxxx",
"name" : "ARF-Chatbot-NLC",
"language" : "en",
"created" : "2016-11-07T03:27:26.678Z",
"url" : "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/xxxxxxxxxx-nlc-xxxx",
"status" : "Training",
"status_description" : "The classifier instance is in its training phase, not yet ready to accept classify requests"
}

Protip: Make sure your curl command doesn’t contain smart quotes either, or the command won’t be recognized!  (Several times curl reported that no URL was present in the command, and smart quotes turned out to be the culprit.)

A few minutes later, I checked the status and the classifier was ready to sort text into “lunch” or “other” categories.
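Checking the status is just a GET against the classifier URL from the training response.  Here is a sketch: the live call is shown commented out (same credentials as the training request), with an offline stand-in response so the parsing step can be demonstrated, under the assumption the response is shaped like the JSON above.

```shell
# Live status check (same credentials as the training call):
# curl -u "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx":"xxxxxxxxxxxx" https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/xxxxxxxxxx-nlc-xxxx > status.json

# Offline stand-in for the response, so the parsing step is visible:
printf '"status" : "Available",\n' > status.json

# Pull out the "status" field; training is done once this prints "Available".
sed -n 's/.*"status" : "\([^"]*\)".*/\1/p' status.json
```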

Testing the classifier

My first question was close to the training set:
curl -G -u "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx":"xxxxxxxxxxxx" https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/xxxxxxxxxx-nlc-xxxx/classify --data-urlencode "text=lunch @ 1145?"

with high confidence results:
{
"classifier_id" : "xxxxxxxxxx-nlc-xxxx",
"url" : "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/xxxxxxxxxx-nlc-xxxx",
"text" : "lunch @ 1145?",
"top_class" : "lunch",
"classes" : [ {
"class_name" : "lunch",
"confidence" : 0.9923420767749419
}, {
"class_name" : "other",
"confidence" : 0.007657923225058088
} ]
}

The training process yielded some other interesting results:

Question | Classification | Confidence
sure, when do you need it by? | other | 99.4%
talk after lunch? | lunch | 99.0%
can we talk after your lunch? | lunch | 79.9%
o'charleys? | lunch | 95.1%
production? | lunch | 94.9%
How much wood would a woodchuck chuck if a woodchuck could chuck wood? | other | 80.7%

The classifier gets some questions right with very high confidence, and some very wrong as well.  It has extracted patterns from the training data.  My training data shows that a short question, or a question that uses the word “lunch”, is very likely a lunch question.  This generalized well for O’Charleys (a place I have not gone for lunch) and the 11:45 question (too late for me to go to lunch!), but badly for the quick question about our production environment (who can go to lunch if production is down?).

How to improve the classifier

The solution is to use more training data.  Fifty is the bare minimum number of cases to train a classifier; I really should have tried to collect 100 examples for each classification.  Additionally, I should check that I have accurately sampled my transcripts (do I have questions about lunch with more words? shorter questions not about lunch? non-lunch questions that contain the word “lunch”?).
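Those sampling checks can be approximated with a couple of hedged one-liners, using awk word counts as a rough proxy for question length.  Here sample.txt is a hypothetical two-line stand-in for SametimeQuestions.txt.

```shell
# sample.txt is a hypothetical stand-in for SametimeQuestions.txt.
printf 'where should we go for lunch with the new team members today?\nproduction?\n' > sample.txt

# Lunch questions with more words (8 or more)...
grep -i lunch sample.txt | awk 'NF >= 8'

# ...and short questions (3 words or fewer) that do NOT mention lunch.
grep -iv lunch sample.txt | awk 'NF <= 3'
```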

It’s worth reiterating here the need to use representative training data rather than trying to create it synthetically.  If I were creating training data manually, I would have had a bunch of questions like “Hey Andrew, do you want to eat lunch with us today?”  Not only is that far better punctuation than I see in a typical chat transcript, it would also provide other distracting or misleading signals to the classifier.  In my chat transcripts, the word “eat” appears in only 3 out of 66,000 questions, and none were in invitations to lunch!
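A count like that comes straight from grep: -w matches “eat” only as a whole word (so “heat” doesn’t count), -i ignores case, and -c counts the matching lines.  Shown here against a tiny hypothetical sample.txt rather than the real transcript file.

```shell
# sample.txt is a hypothetical stand-in for SametimeQuestions.txt.
printf 'where do you want to eat?\nlunch @ 1130?\nheat map ready?\n' > sample.txt

# Count questions containing "eat" as a whole word.
grep -icw eat sample.txt
```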

Conclusion

While the use of a lunch concierge chat bot is a bit tongue in cheek, I hope this shows you how to get started with text classification.  It’s important to use training data from reality, not your imagination, and to use a representative sample of that real data.

Click through to part 2 where I explore more complicated classification scenarios, including classifying against multiple dimensions.
