{"id":98,"date":"2017-01-20T01:51:51","date_gmt":"2017-01-20T01:51:51","guid":{"rendered":"http:\/\/freedville.com\/blog\/?p=98"},"modified":"2019-04-12T14:13:09","modified_gmt":"2019-04-12T14:13:09","slug":"improving-simple-natural-language-processing-models-with-rules-or-machine-learning","status":"publish","type":"post","link":"https:\/\/freedville.com\/blog\/2017\/01\/20\/improving-simple-natural-language-processing-models-with-rules-or-machine-learning\/","title":{"rendered":"Improving simple natural language processing models with rules or machine learning"},"content":{"rendered":"<p><strong>Introduction<\/strong><\/p>\n<p>In my last post, I introduced a simple Natural Language Processing (NLP) problem: extracting mentions of programming languages from the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Programming_language\">Wikipedia article on programming languages<\/a>, using both a <a href=\"http:\/\/freedville.com\/blog\/2017\/01\/13\/demo-of-natural-language-processing-with-rules-and-machine-learning-based-approaches\/\">rules-based technique and a machine learning technique<\/a>. &nbsp;In this post, I&#8217;ll talk about how I would improve the models used in each technique.<\/p>\n<p><strong>Improving the rules-based model<\/strong><\/p>\n<p>Improving a rules-based model requires an expert in writing rules. &nbsp;The expert designs rules to cover as much of the problem space as possible.<\/p>\n<p>In the last post, I used a single rule. &nbsp;I scanned the text for sentences that contained the word &#8216;language&#8217;, removed the first word, and declared any other capitalized words in those sentences to be programming languages. 
&nbsp;This rule attained approximately 53% accuracy, but had several shortcomings:<\/p>\n<ol>\n<li>The rule returned plenty of &#8216;false positives&#8217; &#8211; mentions that were NOT programming languages, including people and corporation names<\/li>\n<li>Programming languages were missed entirely when the sentence did not contain the word &#8216;language&#8217;<\/li>\n<li>Non-capitalized languages were not detected<\/li>\n<li>The rule did not detect languages with multi-word names<\/li>\n<\/ol>\n<p>I&#8217;ve already discussed in my <a href=\"http:\/\/freedville.com\/blog\/2016\/12\/04\/cognitive-system-testing-from-a-to-z\/\">Cognitive Systems Testing posts<\/a> that 100% accuracy is generally impossible in a cognitive or NLP-based system, but we can definitely do better than 53%. &nbsp;Let&#8217;s first break down the types of errors&nbsp;we had.<\/p>\n<ul>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Precision_and_recall\">Precision<\/a>&nbsp;errors, also called &#8216;false positives&#8217;. Think of these as finding a result you shouldn&#8217;t have found. &nbsp;Problem 1 is a precision error.<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Precision_and_recall\">Recall<\/a> errors, also called &#8216;false negatives&#8217;. Think of these as not finding a result that you should have found. &nbsp;Problems 2 and 3 are recall errors.<\/li>\n<li>Wrong answer errors. Think of these as results that incur both a precision and a recall penalty. &nbsp;Problem 4 is a wrong answer error.<\/li>\n<\/ul>\n<p>Generally speaking, fixing precision errors means being more conservative in your annotations and fixing recall errors means being more liberal. &nbsp;Improving both at once is a delicate dance between competing concerns. &nbsp;With each new rule, the rules developer must maximize the gain in one concern while minimizing the loss in the other. 
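<\/p>\n<p>As a concrete illustration, here is how that single baseline rule might look in code. &nbsp;This is a minimal Python sketch of my own, not the code from the original demo, and the naive sentence splitting and punctuation stripping are simplifying assumptions:<\/p>

```python
import re

def find_language_mentions(text):
    """Baseline rule: in sentences containing the trigger word
    'language', treat every capitalized word -- except the first word
    of the sentence -- as a programming-language mention."""
    mentions = set()
    # Naive sentence split on ., !, and ? (a simplifying assumption).
    for sentence in re.split(r"[.!?]", text):
        words = sentence.split()
        if not any(w.lower().startswith("language") for w in words):
            continue  # recall error: languages in other sentences are missed
        # Skip the first word: it is capitalized only because it
        # begins the sentence.
        for word in words[1:]:
            word = word.strip(",;:()'\"")
            if word[:1].isupper():
                mentions.add(word)  # may be a precision error (e.g. a name)
    return mentions
```

<p>Running this on a sentence like &#8220;The Python language was created by Guido van Rossum&#8221; returns Python but also Guido and Rossum &#8211; exactly the false positives described in Problem 1.<\/p>\n<p>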
&nbsp;And the developer must prioritize the rules with the highest bang for the buck.<\/p>\n<p>Problems 1 and 2 encompass the vast majority of the errors. &nbsp;Problem 3 affects a handful of languages and Problem 4 affects one, so let us focus our efforts on the first two problems.<\/p>\n<p>Here are additional rules I would try:<\/p>\n<p>First new rule: detect capitalized words in sentences with additional &#8216;trigger&#8217; words. &nbsp;A trigger word is a word that indicates nearby word(s) are significant. &nbsp;I would try expanding the list of trigger words to include &#8216;programming&#8217; and &#8216;syntax&#8217;&nbsp;in addition to &#8216;language&#8217;. &nbsp;This should improve recall.<\/p>\n<p>Second new rule: detect capitalized words that are playing the role of people or corporations. &nbsp;Ideally, I could find another annotator that does this, since detecting people and corporations is a common problem. &nbsp;If not, I could build a list of trigger words that mean the next capitalized word is&nbsp;<strong>not<\/strong> significant: &#8220;invented by&#8221;, &#8220;developed by&#8221;, and &#8220;programmed by&#8221; are triggers that mean the next capitalized word is probably not a programming language.<\/p>\n<p>I would test out the results of these new rules before adding any others, to make sure they actually work. &nbsp;Achieving high accuracy with a rules-based system is an iterative process, and at some point I will reach diminishing returns with each incremental rule added.<\/p>\n<p>Aside from these rule improvements, I would now start using a real NLP development workbench like IBM Watson Explorer, which has better primitives and capabilities for developing rules quickly and evaluating their results.<\/p>\n<p><strong>Improving the machine-learning-based model<\/strong><\/p>\n<p>This section is much easier to write. 
&nbsp;With a purely machine learning-based model, you only have one thing you can do: change the training data.<\/p>\n<p>Generally, this means adding new ground truth by bringing in additional documents and annotating them by hand. &nbsp;However, there are cases where you might want to remove some training data if your training data is not a representative sampling of what you will actually run your model against. &nbsp;For instance, if you train on Wikipedia pages but test against Internet blogs, you will be disappointed in your model&#8217;s performance. &nbsp;Thus, when adding new training documents, be sure you are getting a representative mix, and when you are uploading document segments, be sure to use some beginning, middle, and ending segments. &nbsp;Variation is key.<\/p>\n<p>When you add new documents to your training set, you can use your existing machine-learning model to &#8220;pre-annotate&#8221; those documents. &nbsp;This reduces the amount of work you, the human annotator, have to do, since you now only need to correct mistakes the model made. &nbsp;When the model has a precision error (false positive), you simply undo the annotation. &nbsp;You still have to scan the whole document to fix recall errors (false negatives), since the model didn&#8217;t annotate something it should have. 
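<\/p>\n<p>The bookkeeping behind this pre-annotation review boils down to a set comparison. &nbsp;This is an illustrative Python sketch of my own &#8211; Watson Knowledge Studio handles all of this inside its UI &#8211; and representing annotations as simple (start, end, label) tuples is an assumption:<\/p>

```python
def review_preannotations(model_spans, human_spans):
    """Compare the model's pre-annotations with the human-corrected
    annotations.  False positives are pre-annotations the human simply
    undoes; false negatives are mentions the human must find and add
    while scanning the whole document."""
    model_spans, human_spans = set(model_spans), set(human_spans)
    kept = model_spans & human_spans             # correct pre-annotations: no work
    false_positives = model_spans - human_spans  # undo these
    false_negatives = human_spans - model_spans  # add these by hand
    return kept, false_positives, false_negatives
```

<p>The larger the kept set relative to the other two, the more work pre-annotation saves the annotator.<\/p>\n<p>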
&nbsp;And you can do this process iteratively: train a model, collect\/pre-annotate new ground truth, train the model again, collect\/pre-annotate more new ground truth, etc, until your desired accuracy is attained.<\/p>\n<figure id=\"attachment_103\" aria-describedby=\"caption-attachment-103\" style=\"width: 700px\" class=\"wp-caption alignnone\"><img decoding=\"async\" loading=\"lazy\" class=\"size-large wp-image-103\" src=\"http:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Preannotated_Document-700x204.png\" alt=\"Pre-annotations on a new training document including some recall errors that the human annotator must fix.\" width=\"700\" height=\"204\" srcset=\"https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Preannotated_Document-700x204.png 700w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Preannotated_Document-300x87.png 300w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Preannotated_Document-768x224.png 768w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Preannotated_Document.png 900w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/><figcaption id=\"caption-attachment-103\" class=\"wp-caption-text\">A document pre-annotated with a Watson Knowledge Studio model<\/figcaption><\/figure>\n<p>Since I developed a model entirely from one Wikipedia article, and only trained the model against ~70% of that article, I clearly need to add more training data. 
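<\/p>\n<p>The iterative train\/pre-annotate\/correct loop described above can be outlined as follows. &nbsp;This is a hypothetical Python sketch &#8211; the fetch_new_docs, train, pre_annotate, human_correct, and evaluate callables are stand-ins for tooling like Watson Knowledge Studio and for the human in the loop, not a real API:<\/p>

```python
def bootstrap_ground_truth(seed_docs, fetch_new_docs, train, pre_annotate,
                           human_correct, evaluate, target_accuracy=0.85):
    """Train a model, pre-annotate a new batch of documents, let a
    human correct them, fold the corrections into the ground truth,
    and repeat until the target accuracy is reached (or the documents
    run out)."""
    ground_truth = list(seed_docs)
    model = train(ground_truth)
    while evaluate(model) < target_accuracy:
        batch = fetch_new_docs()
        if not batch:
            break  # no more documents to collect
        corrected = [human_correct(doc, pre_annotate(model, doc))
                     for doc in batch]
        ground_truth.extend(corrected)
        model = train(ground_truth)  # retrain on the larger ground truth
    return model, ground_truth
```

<p>Each pass through the loop grows the ground truth, so each retrained model should pre-annotate the next batch better, shrinking the human correction effort over time.<\/p>\n<p>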
&nbsp;I would improve this model by adding more documents about computer programming: blog posts, journal articles, and other technical sources.<\/p>\n<figure id=\"attachment_102\" aria-describedby=\"caption-attachment-102\" style=\"width: 700px\" class=\"wp-caption alignnone\"><img decoding=\"async\" loading=\"lazy\" class=\"size-large wp-image-102\" src=\"http:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Document_Set_Size-700x474.png\" alt=\"\" width=\"700\" height=\"474\" srcset=\"https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Document_Set_Size-700x474.png 700w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Document_Set_Size-300x203.png 300w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Document_Set_Size-768x521.png 768w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Document_Set_Size.png 900w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/><figcaption id=\"caption-attachment-102\" class=\"wp-caption-text\">The Wikipedia article was split into 17 segments and distributed across training, test, and blind sets.<\/figcaption><\/figure>\n<p>Here is a video demonstration of how I added new training documents and used my existing model to pre-annotate them:<\/p>\n<p><iframe loading=\"lazy\" width=\"700\" height=\"525\" src=\"https:\/\/www.youtube.com\/embed\/HtGveRmXUsM?feature=oembed\" frameborder=\"0\" allowfullscreen><\/iframe><\/p>\n<p><strong>Conclusion<\/strong><\/p>\n<p>In this post, I talked about how I would specifically fix the very simple NLP models I produced with both rules-based and machine-learning-based techniques. 
&nbsp;In my <a href=\"http:\/\/freedville.com\/blog\/2017\/01\/25\/comparing-rules-and-machine-learning-natural-language-processing-approaches\/\">next post<\/a> I will discuss the general pros and cons of both approaches.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction In my last post, I introduced a simple Natural Language Processing (NLP) problem: extracting mentions of programming languages from the Wikipedia article on programming languages, using both a rules-based technique and a machine learning&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[9,2,4],"_links":{"self":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/98"}],"collection":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/comments?post=98"}],"version-history":[{"count":12,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/98\/revisions"}],"predecessor-version":[{"id":422,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/98\/revisions\/422"}],"wp:attachment":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/media?parent=98"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/categories?post=98"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/tags?post=98"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}