{"id":85,"date":"2017-01-13T13:39:30","date_gmt":"2017-01-13T13:39:30","guid":{"rendered":"http:\/\/freedville.com\/blog\/?p=85"},"modified":"2019-04-12T14:13:21","modified_gmt":"2019-04-12T14:13:21","slug":"demo-of-natural-language-processing-with-rules-and-machine-learning-based-approaches","status":"publish","type":"post","link":"https:\/\/freedville.com\/blog\/2017\/01\/13\/demo-of-natural-language-processing-with-rules-and-machine-learning-based-approaches\/","title":{"rendered":"Demo of natural language processing with rules and machine-learning based approaches"},"content":{"rendered":"<p><strong>Introduction<\/strong><\/p>\n<p>In this cognitive era, many people are interested in using natural language processing (NLP) to extract insights from their large collections of unstructured text. &nbsp;There are two main approaches to natural language processing: rules-based NLP and machine-learning-based NLP. &nbsp;I decided to put together a brief example of how both techniques work so that you can compare the two.<\/p>\n<p><strong>Problem setup<\/strong><\/p>\n<p>The problem I want to solve is to&nbsp;find all the times computer programming languages are mentioned in a block of text, including languages that I have never heard of before. &nbsp;Rather than using a predefined list of programming languages (such a list is called a &#8216;dictionary&#8217; in NLP terminology), I want the NLP system to use only semantic clues from the text itself to determine when a programming language is being discussed.<\/p>\n<p>Further, I add two&nbsp;constraints to keep the example simple. &nbsp;First, I only process&nbsp;one document: the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Programming_language\">Wikipedia article on programming languages<\/a>. &nbsp;Second, I use only one rule in my rules-based system. 
&nbsp;This low amount of training keeps both techniques on equal footing.<\/p>\n<p><strong>Rules-based NLP<\/strong><\/p>\n<p>Rules-based NLP is performed by expert rules developers. &nbsp;These developers scan source documents and try to discover rules that will help extract key data points, balancing rules that extract too little vs rules that extract too much. &nbsp;For fun, skim <a href=\"https:\/\/en.wikipedia.org\/wiki\/Programming_language\">the article<\/a> and see what rule(s) you might try.<\/p>\n<p>I settled on the following simple rule:<\/p>\n<p>For every sentence containing the word &#8220;language&#8221;, remove the first word, and any remaining capitalized words are programming languages.<\/p>\n<p>The results are interesting. &nbsp;I&#8217;ll select snippets of text and&nbsp;<strong>bold<\/strong> the &#8220;programming languages&#8221; detected by my NLP.<\/p>\n<p>Sometimes this rule is good:<\/p>\n<blockquote>\n<p class=\"p1\">The language above is <strong>Python.<\/strong><\/p>\n<\/blockquote>\n<p class=\"p1\">Sometimes this rule is great:<\/p>\n<blockquote>\n<p class=\"p1\">In 2013 the ten most popular programming languages are (in descending order by overall popularity): <strong>C<\/strong>, <strong>Java<\/strong>, <strong>PHP<\/strong>, <strong>JavaScript<\/strong>, <strong>C++<\/strong>, <strong>Python<\/strong>, <strong>Shell<\/strong>, <strong>Ruby<\/strong>, <strong>Objective-C<\/strong> and <strong>C#<\/strong>.<\/p>\n<\/blockquote>\n<p class=\"p1\">Sometimes the rule is way off (none of the bold words are programming languages):<\/p>\n<blockquote>\n<p class=\"p1\">Edsger <strong>Dijkstra<\/strong>, in a famous 1968 letter published in the <strong>Communications<\/strong> of the <strong>ACM<\/strong>, argued that <strong>GOTO<\/strong> statements should be eliminated from all &#8220;higher level&#8221; programming languages.<\/p>\n<\/blockquote>\n<p class=\"p1\">And the rule also misses some easy ones (Java is a programming 
language):<\/p>\n<blockquote>\n<p class=\"p1\">Java came to be used for server-side programming.<\/p>\n<\/blockquote>\n<p class=\"p1\">Still, the rule was simple to write with only a few lines of code, and it performed reasonably well. &nbsp;I manually counted the number of programming languages mentioned in the article as 103. &nbsp;The rule flagged 106 candidates: 56 were correct and 50 were not, and it missed 47 of the 103 actual mentions, giving a precision of 52.8% (56\/106), a recall of 54.4% (56\/103), and an F1 score of 0.536. &nbsp;(F1, the harmonic mean of precision and recall, is our accuracy metric.) Not bad for one rule.<\/p>\n<p class=\"p1\">See full code listing: <a href=\"http:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/ProgrammingLanguageRulesBasedAnnotator.txt\">ProgrammingLanguageRulesBasedAnnotator.java<\/a>.<\/p>\n<p class=\"p1\">See video demonstration:<\/p>\n<p><iframe loading=\"lazy\" width=\"700\" height=\"525\" src=\"https:\/\/www.youtube.com\/embed\/MH7AOLD1TUE?feature=oembed\" frameborder=\"0\" allowfullscreen><\/iframe><\/p>\n<p class=\"p1\"><strong>Machine-learning-based NLP<\/strong><\/p>\n<p class=\"p1\">Machine-learning-based NLP does not use any rules &#8211; rather, it &#8220;learns&#8221; from, or is &#8220;trained&#8221; on, source documents &#8220;annotated&#8221; by subject matter experts. &nbsp;Think of annotating as using a highlighter on a source document, highlighting every concept you want the machine learning model to learn. &nbsp;(Use a different colored highlighter for every &#8220;type&#8221; of concept you want to learn.) &nbsp;For my machine-learning-based NLP, I created a demo instance of <a href=\"https:\/\/www.ibm.com\/us-en\/marketplace\/supervised-machine-learning\/\">Watson Knowledge Studio<\/a>.<\/p>\n<p class=\"p1\">Watson Knowledge Studio suggests you break training documents into 2,000-word sub-documents for optimal machine-learning performance. &nbsp;I actually broke my document into 20-line segments, averaging 500-600 words. 
&nbsp;I did this mostly because annotating can get tedious and annotating smaller documents at a time gives you more breaks, but also so that Watson Knowledge Studio could better randomize my documents into training sets.<\/p>\n<figure id=\"attachment_88\" aria-describedby=\"caption-attachment-88\" style=\"width: 700px\" class=\"wp-caption alignnone\"><img decoding=\"async\" loading=\"lazy\" class=\"size-large wp-image-88\" src=\"http:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Training_Document-700x323.png\" alt=\"Using Watson Knowledge Studio to annotate a document, highlighting all instances of programming languages in the text.\" width=\"700\" height=\"323\" srcset=\"https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Training_Document.png 700w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Training_Document-300x138.png 300w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/><figcaption id=\"caption-attachment-88\" class=\"wp-caption-text\">A training document in Watson Knowledge Studio editor.<\/figcaption><\/figure>\n<p>All of the documents I annotated became &#8220;ground truth&#8221; for the machine learning model. &nbsp;I submitted my document set to Watson Knowledge Studio for training, and in approximately 10 minutes I was able to review the results. &nbsp;A nice touch in Watson Knowledge Studio is that the same style of interface is used to review the machine learning results.<\/p>\n<figure id=\"attachment_92\" aria-describedby=\"caption-attachment-92\" style=\"width: 700px\" class=\"wp-caption alignnone\"><img decoding=\"async\" loading=\"lazy\" class=\"size-large wp-image-92\" src=\"http:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_TestDocument_Medium-700x204.png\" alt=\"Reviewing machine learning results from Watson Knowledge Studio. 
The machine learning model has identified some programming languages correctly and has missed some instances.\" width=\"700\" height=\"204\" srcset=\"https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_TestDocument_Medium-700x204.png 700w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_TestDocument_Medium-300x87.png 300w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_TestDocument_Medium-768x224.png 768w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_TestDocument_Medium.png 900w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/><figcaption id=\"caption-attachment-92\" class=\"wp-caption-text\">Watson Knowledge Studio training results for a single test document<\/figcaption><\/figure>\n<p>In the example we can see that the machine learning model correctly identified the languages Java and Smalltalk, and missed the programming language Scheme. &nbsp;Incidentally, our single rule&nbsp;would have missed Java and Smalltalk but would have found Scheme! 
&nbsp;Watson Knowledge Studio also gives us overall performance numbers for the model.<\/p>\n<figure id=\"attachment_93\" aria-describedby=\"caption-attachment-93\" style=\"width: 700px\" class=\"wp-caption alignnone\"><img decoding=\"async\" loading=\"lazy\" class=\"size-large wp-image-93\" src=\"http:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Training_Final_Medium-700x182.png\" alt=\"The machine learning model from Watson Knowledge Studio reported 100% precision, 36% recall, and 0.54 F1 score.\" width=\"700\" height=\"182\" srcset=\"https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Training_Final_Medium-700x182.png 700w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Training_Final_Medium-300x78.png 300w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Training_Final_Medium-768x200.png 768w, https:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/01\/WKS_Training_Final_Medium.png 900w\" sizes=\"(max-width: 700px) 100vw, 700px\" \/><figcaption id=\"caption-attachment-93\" class=\"wp-caption-text\">Overall accuracy results from Watson Knowledge Studio NLP model<\/figcaption><\/figure>\n<p>In a shocking turn of events, the machine-learning model has 100% precision, meaning that every time it highlighted text as a programming language, it was correct. &nbsp;However, the recall of 36% means that it highlighted only 36% of the programming language mentions labeled in the ground truth. &nbsp;100% precision is an unheard-of result, almost certainly an artifact of my exceptionally small training set. &nbsp;As the model improves its recall, precision will certainly drop a bit.<\/p>\n<p>Interestingly, the rules-based model and the machine-learning-based model had nearly identical F1 scores of approximately 0.53. 
&nbsp;This is a modest result, not sufficient for a production system but not bad for about an hour of effort on each method.<\/p>\n<p>See video demonstration:<\/p>\n<p><iframe loading=\"lazy\" width=\"700\" height=\"525\" src=\"https:\/\/www.youtube.com\/embed\/iJ1A3i-NQGY?feature=oembed\" frameborder=\"0\" allowfullscreen><\/iframe><\/p>\n<p><strong>Conclusion<\/strong><\/p>\n<p>This post introduced a small problem to solve with natural language processing and demonstrated&nbsp;two different NLP approaches. &nbsp;Each produced&nbsp;modest results in a quick-and-dirty implementation. &nbsp;In a future post I will further discuss the <a href=\"http:\/\/freedville.com\/blog\/2017\/01\/25\/comparing-rules-and-machine-learning-natural-language-processing-approaches\/\">pros and cons of rules-based vs machine-learning-based<\/a> as well as discuss how I would go about <a href=\"http:\/\/freedville.com\/blog\/2017\/01\/20\/improving-simple-natural-language-processing-models-with-rules-or-machine-learning\/\">improving my results with each technique<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Introduction In this cognitive era, many people are interested in using natural language processing (NLP) to extract insights from their large collections of unstructured text. 
&nbsp;There are two main approaches to natural language processing: rules-based&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[9,2,4],"_links":{"self":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/85"}],"collection":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/comments?post=85"}],"version-history":[{"count":10,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/85\/revisions"}],"predecessor-version":[{"id":423,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/85\/revisions\/423"}],"wp:attachment":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/media?parent=85"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/categories?post=85"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/tags?post=85"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}