Training a personal chatbot with Watson Developer Cloud APIs – Part 2

In part 1 of Training a personal chatbot with Watson Developer Cloud APIs, I introduced text classification and showed how you could use it to create a chat bot that screens your incoming questions based on whether or not they are related to lunch.

Covering more topics with classification

My first classification scheme was simple – “lunch or not lunch?”  This was a very coarse scheme, but it still presented challenges in gathering useful training data.  The end result is not quite useful enough for a chat bot – I can hardly give an interesting response to a question knowing only that it pertains to lunch.  The two most important aspects of lunch are “when” and “where”.  Let’s consider the possibilities.

Given a question about lunch, it could be a “when should we go” question, “where should we go” question, “when and where should we go” question, or an “other” lunch question.  Thus, I could reasonably sub-divide my “lunch” classification into four sub-classifications.

Alternatively, I could think about classifying in orthogonal dimensions.  My first dimension could be “lunch vs other”, the second dimension could be “when vs not_when”, the third dimension could be “where vs not_where”.  (Or, my second dimension could be “when vs where vs neither”.)  Then, I could classify a piece of text against both dimensions simultaneously and take the intersection.  If the first classifier says it’s a “lunch” question, and the second says it’s a “when” question, then the intersection means it’s a “when should we lunch” question.
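As a rough sketch, the intersection step is just a few lines of code.  The result dictionaries here are hypothetical stand-ins for real classifier output (each classifier returns a top class plus per-class confidences):

```python
# A minimal sketch of the "intersection" idea: combine the top classes
# from two one-dimensional classifiers.  These result dictionaries are
# hypothetical stand-ins for real Natural Language Classifier responses.

def combine(lunch_result, when_result):
    """Intersect the top classes of two orthogonal classifiers."""
    return (lunch_result["top_class"], when_result["top_class"])

lunch_result = {"top_class": "lunch", "classes": [
    {"class_name": "lunch", "confidence": 0.87},
    {"class_name": "other", "confidence": 0.13}]}
when_result = {"top_class": "when", "classes": [
    {"class_name": "when", "confidence": 0.92},
    {"class_name": "not_when", "confidence": 0.08}]}

print(combine(lunch_result, when_result))  # ('lunch', 'when')
```

The intersection of “lunch” and “when” gives us the “when should we lunch” reading.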

General or specific classifiers – which are better?

Should you prefer very specific classifiers on a single dimension, or more general classifiers on multiple dimensions?  A classic “it depends” question!  You should be able to get good results with either approach if you have sufficient training data.  (This is a big if; we’ll come back to it.)  Specific classifiers should, in theory, be extra precise if trained well.  There is a significant benefit to the “general” classifier approach, however.  Consider how many topics you want your chat bot to handle.  Let’s say you want your chat bot to handle not just lunch arrangements but also regular business meetings.  Business meetings also include “where” and “when” aspects.  If you use general classifiers, you have the opportunity to reuse existing classifiers and will have less training to do overall.

What does the data allow?

The philosophical questions sometimes have to take a backseat to what the data allows.  Recall that my initial pool of possible training data is 66,000 questions pulled from my chat transcripts.  Let’s see what the data allows in terms of lunch-related questions.  As a shortcut, I will run a grep for the term “lunch” and the seven restaurants my team visits most:

egrep -i "lunch|carmen|moe|lime|greek|brixx|randys|kabob" SametimeQuestions.txt

This yields approximately 100 questions.  A quick scan shows that only 9 appear to be “when” questions, and a similar number are “where” questions.  (The vast majority are simply “interested in lunch?” questions, or are not related to a lunch invitation at all.)  Nine examples are not going to be enough for classifying – I don’t want to use a classifier that doesn’t have at least 20 examples, and I don’t want to generate synthetic examples, since I know they won’t be representative of the questions I will actually receive.  Thus the data appears to be forcing our hand.

It is worth verifying that we have enough “when” and “where” questions to classify.  I again use a simple grep as a proxy for estimating the available training data.  Most invitations, lunch or otherwise, happen on 15-minute intervals, so matching the “minutes” value of the clock plus the “when” keyword is a good proxy for how many “when” questions I will have.  For “where” questions, I use the obvious keyword plus a handful of common locations.

egrep -i "00|15|30|45|when" SametimeQuestions.txt | wc -l
egrep -i "where|carmen|moe|lime|greek|brixx|randys|kabob" SametimeQuestions.txt | wc -l

These queries yield approximately 800 and 500 questions, respectively. Thus, I should have plenty of training data available to build multiple, generalized classifiers.

Training data for two dimensions

At a high level, the training exercise looks a lot like what we did in the previous post; we just have more work to do since we are covering more dimensions.  Grab questions and manually classify them.  To train against “when”/“not when”, I added 70 questions from my “00|15|30|45|when” question set, being sure to include questions that used times or the word “when” in a way that did NOT indicate a “when” question.

I added these questions to the same spreadsheet as my first training data, simply adding a column to represent when/not-when.  Thus my columns are question, lunch/other, and when/not-when.  Note that this requires additional classification work: adding when/not-when labels to my original lunch/other questions, and adding lunch/other labels to the new when/not-when questions.  However, this is largely an Excel “fill” exercise – most of the original questions were not-when, and most of the new questions are other (not-lunch).

Here’s a sample of the new training data (full data here: Lunch vs Other and When vs Not-When NLC Training Data):
lunch @ 1130?,lunch,when
are you going out for lunch today?,lunch,not_when
Andrew- u have extra shinguard?,other,not_when
let's chat after I grab some lunch .. you have some time this afternoon?,other,when
do we have a room for the 1:30?,other,not_when
does this happen when you build and deploy?,other,not_when
when is your next meeting?,other,when
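Before training, it is worth sanity-checking the label balance in each dimension.  A small sketch, using a hypothetical excerpt of the training data, that counts labels per dimension:

```python
import csv
from collections import Counter
from io import StringIO

# Hypothetical excerpt of the two-dimensional training data;
# columns are: question, lunch/other label, when/not_when label.
sample = """\
lunch @ 1130?,lunch,when
are you going out for lunch today?,lunch,not_when
when is your next meeting?,other,when
does this happen when you build and deploy?,other,not_when
"""

lunch_counts, when_counts = Counter(), Counter()
for question, lunch_label, when_label in csv.reader(StringIO(sample)):
    lunch_counts[lunch_label] += 1
    when_counts[when_label] += 1

print(dict(lunch_counts))  # {'lunch': 2, 'other': 2}
print(dict(when_counts))   # {'when': 2, 'not_when': 2}
```

Badly skewed counts on either dimension would be a warning sign before spending time on training.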

There are a few possibilities for how to proceed, depending on whether we want to use multiple Natural Language Classifier instances (one per dimension) or whether we want to “hack” Natural Language Classifier by forcing it to run multi-dimensional classification in a single instance.

Using two classifiers

The first method is to use two classifiers.  I can take my two-dimensional training file and alternately delete the “lunch” or “when” column to create two training files, each covering one dimension.  See When vs Not-When NLC Training Data.
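Deleting the columns can be scripted rather than done by hand.  The sketch below assumes the two-dimensional file is a three-column CSV (question, lunch label, when label); the file names are illustrative:

```python
import csv

def split_training_data(src="training_2d.csv"):
    """Split a (question, lunch_label, when_label) CSV into two
    single-dimension training files, one per classifier instance.
    File names are illustrative placeholders."""
    with open(src, newline="") as f:
        rows = list(csv.reader(f))
    with open("lunch_training.csv", "w", newline="") as f:
        csv.writer(f).writerows((q, lunch) for q, lunch, _ in rows)
    with open("when_training.csv", "w", newline="") as f:
        csv.writer(f).writerows((q, when) for q, _, when in rows)
```

Each output file can then be uploaded to its own classifier instance.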

After training my “when” classifier, I ask it the same questions I asked in my previous post.  The “lunch @ 1145” question classifies as “when” with 86.5% confidence.  Thus my two classifiers tell me with very high confidence that this is a “when lunch” question.  “sure, when do you need it by?” scores 99.5% confidence for “when” and 99.4% for “other” – clearly a “when, not-lunch” question.  “does it work when you restart the build?” scores 84.5% confidence for “not-when”.

From these results you can see that the “when/not-when” classifier performs very well, even better than the lunch/other classifier described in my previous post.  This is not surprising when you consider it received twice as much training data.

Thus, by using two classifiers and combining their results, you can get high confidence on each dimension and make a very good guess at a question’s true two-dimensional classification.
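A sketch of what that combination might look like in code, using the same classify endpoint shown via curl below.  The base URL matches the service endpoint used in this post, but the classifier IDs and the auth header value are placeholders:

```python
import json
import urllib.parse
import urllib.request

# Base URL of the NLC classify endpoint, as used in the curl example.
NLC_BASE = "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers"

def classify(classifier_id, text, auth_header):
    """Return the top class for `text` from one classifier instance."""
    query = urllib.parse.urlencode({"text": text})
    req = urllib.request.Request(f"{NLC_BASE}/{classifier_id}/classify?{query}")
    req.add_header("Authorization", auth_header)  # e.g. HTTP Basic credentials
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["top_class"]

def classify_2d(text, auth_header):
    # Classifier IDs are hypothetical placeholders for your two instances.
    return (classify("lunch-classifier-id", text, auth_header),
            classify("when-classifier-id", text, auth_header))
```

The two calls are independent, so they can also be issued in parallel.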

Multiple dimensions in one classifier

I can upload my two-dimensional training data directly into a single classifier.  When I run a question against it, I get interesting results:

curl -G -u "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx":"xxxxxxxxxxxx" https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/xxxxxxxxxx-nlc-xxxx/classify --data-urlencode "text=lunch @ 1145?"
{
  "classifier_id" : "xxxxxxxxxx-nlc-xxxx",
  "url" : "https://gateway.watsonplatform.net/natural-language-classifier/api/v1/classifiers/xxxxxxxxxx-nlc-xxxx",
  "text" : "lunch @ 1145?",
  "top_class" : "lunch",
  "classes" : [ {
    "class_name" : "lunch",
    "confidence" : 0.8798495105446424
  }, {
    "class_name" : "when",
    "confidence" : 0.06570656619039483
  }, {
    "class_name" : "not_when",
    "confidence" : 0.04098451924057577
  }, {
    "class_name" : "other",
    "confidence" : 0.013459404024386896
  } ]
}

We can view this question as a “lunch” and “when” question because they are the highest confidence classifications.  But we lose significant fidelity against one of our dimensions, since the classifier tries to pick a primary classification.

Question                                 | Lunch | Other | When  | Not-when
sure, when do you need it by?            | 96.5% | 0.3%  | 2.78% | 0.5%
talk after lunch?                        | 64.0% | 10.6% | 22.0% | 3.5%
o'charleys?                              | 7.5%  | 2.8%  | 88.2% | 1.5%
production?                              | 4.2%  | 4.1%  | 1.3%  | 90.2%
are we going to lunch today?             | 90.4% | 0.8%  | 0.7%  | 8.1%
does it work when you restart the build? | 0.2%  | 97.8% | 0.8%  | 1.1%
what time is lunch?                      | 3.4%  | 6.1%  | 88.7% | 1.7%

We can see this in the “O’Charley’s” (a restaurant we visit) and “production” questions.  The highest lunch/other classification for both is “lunch”.  But how confident can we feel in that result?  The problem is that all of the classifier’s confidences must add up to 100%, and the highest-matching classification takes up so much of the confidence pie that little remains to delineate the second dimension.  If we had tried to force three dimensions of classification into this classifier, the additional dimensions would be even harder to tease out.
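One way to quantify the fidelity loss is to renormalize the single classifier’s confidences within each dimension.  An illustration using the “production?” row from the table above:

```python
# Confidences from the "production?" row of the table above.
confidences = {"lunch": 0.042, "other": 0.041, "when": 0.013, "not_when": 0.902}

def dimension_read(confidences, classes):
    """Renormalize confidences over the classes of one dimension."""
    total = sum(confidences[c] for c in classes)
    return {c: confidences[c] / total for c in classes}

lunch_dim = dimension_read(confidences, ["lunch", "other"])
print(lunch_dim)  # lunch is about 0.506, other about 0.494: nearly a coin flip
```

After renormalizing, the lunch/other read for “production?” is essentially a coin flip, which is exactly the fidelity problem.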

How many classifiers is best?

Simple classifiers can be called in parallel and their results combined in an intuitive way.  Alternatively, one classifier can be overloaded to produce multiple dimensions.  Overloading one classifier may simplify classifier management, and it surely cuts down on our API usage bill, but it makes it harder to get a true read of the confidence on each classification dimension.  Your mileage may vary, but I prefer the higher fidelity of using multiple classifiers.

Conclusion

You can classify text against multiple dimensions, assuming those dimensions are orthogonal.  This can be accomplished using specific multi-dimensional classifications (ex: when_lunch) or multiple, general classifications (ex: when, lunch).  Multiple, general classifications can be achieved with a single or multiple classifiers.  Experimentation will help you decide what classification techniques your data requires.
