{"id":452,"date":"2020-03-03T02:50:32","date_gmt":"2020-03-03T02:50:32","guid":{"rendered":"http:\/\/freedville.com\/blog\/?p=452"},"modified":"2020-03-03T02:50:32","modified_gmt":"2020-03-03T02:50:32","slug":"cognitive-system-testing-overall-system-accuracy-testing","status":"publish","type":"post","link":"https:\/\/freedville.com\/blog\/2020\/03\/03\/cognitive-system-testing-overall-system-accuracy-testing\/","title":{"rendered":"Cognitive system testing: Overall system accuracy testing"},"content":{"rendered":"\n<p>Part 5 of the\u00a0<a href=\"http:\/\/freedville.com\/blog\/2016\/12\/04\/cognitive-system-testing-from-a-to-z\/\"><strong>Cognitive System Testing<\/strong>\u00a0<\/a>series, originally posted in 2016 on <a href=\"https:\/\/developer.ibm.com\/\">IBM Developer<\/a>.   <\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Introduction <\/h2>\n\n\n\n<p>\nAs described in previous chapters, cognitive systems are\nprobabilistic, non-deterministic systems, which will never achieve\n100% accuracy. (After all \u2013 what is the accuracy of a human expert,\nthe best alternate cognitive system?) Even though we don\u2019t expect\nto achieve 100% accuracy, we still need to measure the accuracy of\nour system and verify that accuracy is maintained or improved as we\ncontinue development. The previous chapters describe how we verify\naccuracy of the components. This chapter describes how we verify the\naccuracy of the whole.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Overall Approach<\/h2>\n\n\n\n<p>\nWe test the system against a collection of ground truth. Ground truth\nis generically a set of correct input\/output pairs for the system,\ni.e. if we give the system&nbsp;<em>input_A<\/em>, we expect to\nreceive&nbsp;<em>output_A<\/em>. For the Jeopardy! Watson\nsystem, the inputs were Jeopardy! clues and the outputs were answers\n(of course, in the form of a question). 
For optimal testing of the\nsystem, a large amount of ground truth is required, and this ground\ntruth should be as varied as the total possible set of inputs that\nwill be sent to the system.<\/p>\n\n\n\n<p>\nLet\u2019s once again consider the system in layers. At minimum, we will\nhave an entire system computational layer and a single-document NLP\ncomputational layer. Your solution may have additional layers; for\ninstance, a medical solution may have a patient computational layer,\nwhere the complete view of a patient is derived from multiple\ndocuments. Once more let\u2019s consider the&nbsp;<a href=\"http:\/\/martinfowler.com\/bliki\/TestPyramid.html\" target=\"_blank\" rel=\"noreferrer noopener\">testing\npyramid<\/a>. We want to test each layer, but we have a\nproblem \u2013 collecting ground truth is very time-consuming and\nexpensive. Thus, we start at the top, with the entire system-level\nview.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Measuring system-level accuracy<\/h2>\n\n\n\n<p> Depending on the problem being solved, there may be an existing database or repository suitable for seeding a system-level ground truth collection. For instance, you may have a few years\u2019 historical Jeopardy! clues and answers, or an electronic medical record database with anonymized patient attributes and outcomes. In other systems this system-level data may have to be curated by hand, and it might be the only ground truth you get. In any case, you should have an automation suite that runs your cognitive system over all of the inputs in your ground truth, collects the outputs, and compares them with the ground truth. The automation suite should output a report suitable for analysts to determine where the system is performing well (and not so well). The report columns should include a ground truth ID, key input variables, the actual output, the correct output, and a true\/false \u201cSystem correct?\u201d flag. 
This report can be sorted and filtered in a variety of ways, so the analysts can determine which variables contribute the most to errors.<\/p>\n\n\n\n<p>Example report:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"\"><tbody><tr><td><strong>Input ID<\/strong><\/td><td><strong>Input Variables<\/strong><\/td><td><strong>System Output<\/strong><\/td><td><strong>Correct Output<\/strong><\/td><td><strong>System Correct?<\/strong><\/td><\/tr><tr><td>1<\/td><td>Female patient, ailment 1, age 50<\/td><td>Drug A<\/td><td>Drug A<\/td><td>Yes<\/td><\/tr><tr><td>2<\/td><td>Male patient, ailment 2, age 60<\/td><td>Drug B<\/td><td>Drug C<\/td><td>No<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Measuring sub-system accuracy<\/h2>\n\n\n\n<p> The system-level view of accuracy is great for telling you about errors in the big picture, but it is insufficient for finding out where those errors are being made. My colleague Robert Nielsen says the system-level view is like taking a quiz, showing your work, and only being told which answers were wrong \u2013 it\u2019s more useful to be told which step was wrong. Thus, a best practice is to step down a level and repeat our ground truth exercise. If the system-level test uses a collection of documents, you should have an NLP-level test which only considers a single document.<br><br>Let\u2019s look at how we would set up an NLP-level test for the system described above. Assume there are five documents for each patient. 
Set up new ground truth that tests what you expect to get out of each document.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"\"><tbody><tr><td><strong>Document ID<\/strong><\/td><td><strong>System Output<\/strong><\/td><td><strong>Correct Output<\/strong><\/td><td><strong>System Correct?<\/strong><\/td><\/tr><tr><td>1.1<\/td><td>Ailment 1<\/td><td>Ailment 1 <\/td><td>Yes<\/td><\/tr><tr><td>1.2<\/td><td><\/td><td>Ailment 1 <\/td><td>No<\/td><\/tr><tr><td>1.3<\/td><td>Ailment 1<\/td><td>Ailment 2 <\/td><td>No<\/td><\/tr><tr><td>1.4<\/td><td><\/td><td>Ailment 2 <\/td><td>No<\/td><\/tr><tr><td>1.5<\/td><td><\/td><td>Ailment 2 <\/td><td>No<\/td><\/tr><tr><td>2.1<\/td><td>Ailment 2<\/td><td>Ailment 2 <\/td><td>Yes<\/td><\/tr><tr><td>2.2<\/td><td>Ailment 2 <\/td><td>Ailment 2 <\/td><td>Yes<\/td><\/tr><tr><td>2.3<\/td><td>Ailment 2 <\/td><td>Ailment 2 <\/td><td>Yes<\/td><\/tr><tr><td>2.4<\/td><td>Ailment 2 <\/td><td>Ailment 2 <\/td><td>Yes<\/td><\/tr><tr><td>2.5<\/td><td>Ailment 3<\/td><td><\/td><td>No<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>Now we can learn additional information about the system. For case 1, we may have gotten the right answer for the wrong reason, as the system suggested ailment 1 even though most source documents had ailment 2. For case 2, the NLP is pretty accurate at detecting ailments, so we can look elsewhere for the source of the error.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Quick note on NLP measurements<\/h2>\n\n\n\n<p>\nThe first measurement people think to take for NLP is a simple\naccuracy measure: how many right answers divided by how many total\nquestions.&nbsp;&nbsp; This can be misleading. Consider if you are\ntrying to detect something rare, that only occurs in 1% of documents.\nA simple \u201cno-op\u201d annotator will never create any annotations, and\nwill be 99% accurate! This is clearly not what is desired.<\/p>\n\n\n\n<p>\nRather, we measure with F1 score. 
F1 is the harmonic mean of precision\n(when the system gives an answer, how often is it right) and recall\n(of all the correct answers available, how many does the system find),\ncomputed as 2 &times; precision &times; recall \/ (precision + recall). The F1 score of our\n\u201cno-op\u201d annotator above is 0 due to a recall of 0. Our ailment\nannotator above has 5\/7 = 71% precision and 5\/9 = 56% recall for an F1\nscore of 62.5% (accuracy was 50%). Thus we now know specifically that\nthe ailment annotator needs to be more aggressive in surfacing\nailments, and the patterns in documents 1.2, 1.4, and 1.5, where it produced no output, are a good\nplace to start.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Iterate, iterate, iterate<\/h2>\n\n\n\n<p>\nImproving the performance of your cognitive system is an iterative\nprocess. Look at where the system has the lowest performance, dive\ninto the subsystems and components causing the bad performance,\nimprove those subsystems, and repeat. Continue this process until the\nsystem is \u201cgood enough\u201d. Remember that 100% accuracy will not be\npossible. And be sure to measure accuracy after every\nbuild, to monitor whether system accuracy is increasing or\ndecreasing, and why.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Conclusion<\/h2>\n\n\n\n<p> The accuracy of a cognitive system should be carefully measured. The best method is to collect ground truth (inputs with desired outputs) for each level of the system, starting at the top and working down through sub-components. Measure accuracy after each build and improve the system by looking at where it performs the worst, fixing the worst parts, and iterating until accuracy reaches a desired target (even though 100% accuracy is impossible).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Part 5 of the\u00a0Cognitive System Testing\u00a0series, originally posted in 2016 on IBM Developer. Introduction As described in previous chapters, cognitive systems are probabilistic, non-deterministic systems, which will never achieve 100% accuracy. 
(After all \u2013 what&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[],"_links":{"self":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/452"}],"collection":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/comments?post=452"}],"version-history":[{"count":2,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/452\/revisions"}],"predecessor-version":[{"id":454,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/452\/revisions\/454"}],"wp:attachment":[{"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/media?parent=452"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/categories?post=452"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/tags?post=452"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}