I work with clients who want to use cognitive computing to extract insights locked away in their data. This data generally comes from a large volume and wide variety of source documents – thousands or millions of documents, dozens or hundreds of document types. When source documents come from some sort of structured source (like a database), we generally can answer questions like “what is the most recent document of type X?” from document metadata and we can use this to enhance insights from the data.
But what if the documents have no metadata? What if they are scanned documents, or even worse, paper form? This post explores how difficult it is to get documents in condition to do fun cognitive things on them.
Let me demonstrate just how difficult this problem is. I have a filing cabinet stuffed with years of papers that I have just never gotten around to organizing. I’d love to extract some insights like “what’s the maintenance history on my car?” but first I have to find the car repair statements and sort them by date. This same cabinet has all sorts of other receipts, statements, and the like as well.
Let’s look at a couple examples. I’ve included four types of documents, used green boxes to highlight the document dates, and red boxes to highlight dates that are NOT the document date.
Had I used my scanner, these documents would get labels like “Scanned001.pdf” with a creation date of today. But to get real metadata (document type and date), I have to work pretty hard. For types, I can roughly classify documents by the logo (for documents with a logo) or by a phrase on the top (or bottom) of the document. None of these rules are surefire, but I can go pretty far with rules of thumb.
For dates – look again at those boxes of “actual” and “not actually” document dates. Not only is there no quick pattern to the actual document dates (green boxes) – some are top-left, some are top-right, some middle – but there are plenty of extraneous document dates (red boxes) too. Sorting these documents is a tough nut to crack.
For a personal filing cabinet, it’s probably easiest for me to use a high powered scanner and apply my own document metadata after the fact. It’s a few nights with piles of documents in my office and a trusty bucket sort algorithm. After organizing my documents, I’ll take some time off before I mine them for insights.
But for an enterprise with thousands or millions of documents, the challenge is larger. Do you manually organize that many documents? Do you train an OCR engine to extract metadata? Or do you try to build your application without metadata?
I don’t intend to solve the problem here, that is a separate future blog post. (Update: see How to organize documents when digitizing them) This post is just to demonstrate one underappreciated challenge in preparing yourself to build a cognitive application.