{"id":193,"date":"2017-10-12T00:46:17","date_gmt":"2017-10-12T00:46:17","guid":{"rendered":"http:\/\/freedville.com\/blog\/?p=193"},"modified":"2019-04-12T14:08:02","modified_gmt":"2019-04-12T14:08:02","slug":"what-does-your-cognitive-system-know-about-your-documents-and-why-does-it-matter","status":"publish","type":"post","link":"http:\/\/freedville.com\/blog\/2017\/10\/12\/what-does-your-cognitive-system-know-about-your-documents-and-why-does-it-matter\/","title":{"rendered":"What does your cognitive system know about your documents, and why does it matter?"},"content":{"rendered":"<p>I work with clients who want to use cognitive computing to extract insights locked away in their data.&nbsp; This data generally comes from a large volume and wide variety of source documents &#8211; thousands or millions of documents, dozens or hundreds of document types.&nbsp; When source documents come from some sort of structured source (like a database), we can generally answer questions like &#8220;what is the most recent document of type X?&#8221; from document metadata, and we can use those answers to enhance insights from the data.<\/p>\n<p>But what if the documents have no metadata?&nbsp; What if they are scanned documents or, even worse, still in paper form?&nbsp; This post explores how difficult it is to get documents into condition to do fun cognitive things with them.<\/p>\n<p>Let me demonstrate just how difficult this problem is.&nbsp; I have a filing cabinet stuffed with years of papers that I have just never gotten around to organizing.&nbsp; I&#8217;d love to extract some insights like &#8220;what&#8217;s the maintenance history on my car?&#8221; but first I have to find the car repair statements and sort them by date.&nbsp; This same cabinet has all sorts of other receipts, statements, and the like as well.<\/p>\n<p>Let&#8217;s look at a couple of examples.&nbsp; I&#8217;ve included four types of documents, used green boxes to highlight the document dates, and red boxes to highlight dates 
that are NOT the document date.<\/p>\n<figure id=\"attachment_194\" aria-describedby=\"caption-attachment-194\" style=\"width: 150px\" class=\"wp-caption alignleft\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-194 size-thumbnail\" src=\"http:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/10\/invoice_2_blurred_dates-150x150.jpg\" alt=\"\" width=\"150\" height=\"150\"><figcaption id=\"caption-attachment-194\" class=\"wp-caption-text\">Inspection invoice<\/figcaption><\/figure>\n<figure id=\"attachment_198\" aria-describedby=\"caption-attachment-198\" style=\"width: 150px\" class=\"wp-caption alignleft\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-198 size-thumbnail\" src=\"http:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/10\/invoice_3_blurred_dates2-150x150.jpg\" alt=\"\" width=\"150\" height=\"150\"><figcaption id=\"caption-attachment-198\" class=\"wp-caption-text\">Statement<\/figcaption><\/figure>\n<figure id=\"attachment_197\" aria-describedby=\"caption-attachment-197\" style=\"width: 150px\" class=\"wp-caption alignleft\"><img decoding=\"async\" loading=\"lazy\" class=\"wp-image-197 size-thumbnail\" src=\"http:\/\/freedville.com\/blog\/wp-content\/uploads\/2017\/10\/invoice_1_blurred_dates-150x150.jpg\" alt=\"\" width=\"150\" height=\"150\"><figcaption id=\"caption-attachment-197\" class=\"wp-caption-text\">Veterinary invoice<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>Had I used my scanner, these documents would get labels like &#8220;Scanned001.pdf&#8221; with a creation date of today.&nbsp; But to get real metadata (document type and date), I have to work pretty hard.&nbsp; For types, I can roughly classify documents by the logo (for documents with a logo) or by a phrase on the top (or bottom) of the document.&nbsp; None of these rules are surefire, but I can go pretty far with rules of thumb.<\/p>\n<p>For dates &#8211; look again at those boxes of 
&#8220;actual&#8221; and &#8220;not actually&#8221; document dates.&nbsp; Not only is there no quick pattern to the actual document dates (green boxes) &#8211; some are top-left, some are top-right, some are in the middle &#8211; but there are plenty of extraneous dates (red boxes) too.&nbsp; Sorting these documents is a tough nut to crack.<\/p>\n<p>For a personal filing cabinet, it&#8217;s probably easiest for me to use a high-powered scanner and apply my own document metadata after the fact.&nbsp; It&#8217;s a few nights with piles of documents in my office and a trusty <a href=\"https:\/\/en.wikipedia.org\/wiki\/Bucket_sort\">bucket sort<\/a> algorithm.&nbsp; After organizing my documents, I&#8217;ll take some time off before I mine them for insights.<\/p>\n<p>But for an enterprise with thousands or millions of documents, the challenge is larger.&nbsp; Do you manually organize that many documents?&nbsp; Do you train an OCR engine to extract metadata?&nbsp; Or do you try to build your application without metadata?<\/p>\n<p>I don&#8217;t intend to solve the problem here; that is a separate, future blog post.&nbsp; (Update: see <a href=\"http:\/\/freedville.com\/blog\/2017\/10\/27\/how-to-organize-documents-when-digitizing-them\/\">How to organize documents when digitizing them<\/a>.) This post just demonstrates one underappreciated challenge in preparing to build a cognitive application.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>I work with clients who want to use cognitive computing to extract insights locked away in their data.&nbsp; This data generally comes from a large volume and wide variety of source documents &#8211; thousands 
or&#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[1],"tags":[9,2,5],"_links":{"self":[{"href":"http:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/193"}],"collection":[{"href":"http:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"http:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"http:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"http:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/comments?post=193"}],"version-history":[{"count":11,"href":"http:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/193\/revisions"}],"predecessor-version":[{"id":413,"href":"http:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/posts\/193\/revisions\/413"}],"wp:attachment":[{"href":"http:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/media?parent=193"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"http:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/categories?post=193"},{"taxonomy":"post_tag","embeddable":true,"href":"http:\/\/freedville.com\/blog\/wp-json\/wp\/v2\/tags?post=193"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}