Amazon's optical character recognition toy Textract is here but still a bit short-sighted
Auto-detects structured data... some of the time
Amazon Web Services has announced the general availability of Textract, a service for converting scanned documents to text.
Optical character recognition (OCR) is a mature technology built into many applications. Insert a scanned document into Microsoft's OneNote, for example, and you can "copy text from picture" with reasonable results.
Textract has a number of advantages, though. Using the API, you can programmatically convert documents in bulk operations or as part of a workflow. You can use it in conjunction with other AWS services such as Amazon Translate or Amazon Comprehend (a machine learning service to find "insights and relationships in text", according to Amazon). Another relevant service is Amazon Elasticsearch Service, AWS's hosted deployment of Elastic's open-source search engine, which lets you search and analyse text.
You can also use Textract to populate a database, and this ability is highlighted by AWS: "Amazon Textract goes beyond simple optical character recognition (OCR) to identify the contents of fields in forms, information stored in tables, and the context in which the information is presented."
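To make the form-field idea concrete, here is a minimal Python sketch of how key-value pairs might be pulled out of a Textract response ready for a database insert. The helper names (get_text, form_fields, analyze_form) and the sample blocks are my own illustrations, not AWS code; the block layout follows my reading of the documented response format, and the live call assumes boto3 is installed with AWS credentials and a region configured.

```python
def get_text(block, blocks_by_id):
    """Join the text of the WORD blocks listed as CHILD of this block."""
    words = []
    for rel in block.get("Relationships", []):
        if rel["Type"] == "CHILD":
            for child_id in rel["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return " ".join(words)


def form_fields(blocks):
    """Map each KEY block's text to the text of its linked VALUE block(s)."""
    blocks_by_id = {b["Id"]: b for b in blocks}
    fields = {}
    for block in blocks:
        is_key = (block["BlockType"] == "KEY_VALUE_SET"
                  and "KEY" in block.get("EntityTypes", []))
        if not is_key:
            continue
        value_parts = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                for value_id in rel["Ids"]:
                    value_parts.append(get_text(blocks_by_id[value_id], blocks_by_id))
        fields[get_text(block, blocks_by_id)] = " ".join(value_parts)
    return fields


def analyze_form(image_bytes):
    """Hypothetical wrapper: send one image to Textract and return its fields."""
    import boto3  # imported lazily so the parsing helpers work offline
    client = boto3.client("textract")
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["FORMS", "TABLES"],
    )
    return form_fields(response["Blocks"])


# Hand-written sample blocks (illustrative only, not real Textract output).
SAMPLE_BLOCKS = [
    {"Id": "k1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["KEY"],
     "Relationships": [{"Type": "VALUE", "Ids": ["v1"]},
                       {"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "v1", "BlockType": "KEY_VALUE_SET", "EntityTypes": ["VALUE"],
     "Relationships": [{"Type": "CHILD", "Ids": ["w2", "w3"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Name:"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Jane"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Doe"},
]

if __name__ == "__main__":
    print(form_fields(SAMPLE_BLOCKS))
```

The parsing is kept separate from the network call so that it can be tried against a saved response without touching AWS.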
Results from the table recognition capability will likely depend on whether or not your particular document matches one of Textract's "pre-trained learning models". The company said that tens of millions of documents have been fed into its machine learning system "including contracts, tax documents, sales orders, enrolment forms, benefit applications, insurance claims, policy documents and many more". If you upload something familiar to the system, the results should be good. Give it something unknown, though, and you will have more work to do.
Textract is generally used via its API but AWS also has a handy demo page where you can upload a scanned document and see the results. Your correspondent had a quick try using some mildly tricky documents and the results were nothing special.
I picked a page from a book on CSS by Eric Meyer with coloured text, code, a table of examples, and so on. Textract did a fair job on the basic text, though "Fading" became "ading" and the word "so" became upper case for no reason I can see (OneNote got this right), and the table was not obvious enough for Textract to find. Another document I tried had a table of bank fees: Textract failed to identify it as a table. The wrong kind of table perhaps. The documentation advises you to ensure that "the tables in your document are visually separated from surrounding elements on the page".
If you have documents in a standard format, which is the most common use case for Textract, you can code around imperfections. Textract breaks a document down into collections of blocks where each block represents an element such as a word, a line, a table, or a key-value pair. You also get a confidence level for each block, so you can identify documents where manual intervention is needed. There are code samples in an AWS GitHub repository.
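As a rough illustration of using the confidence level to route documents to a human, the Python sketch below picks out the blocks that score under a threshold. The function name needs_review, the threshold of 90 and the sample values are my own assumptions rather than anything AWS recommends; the code assumes each block is a dictionary with BlockType and Confidence (0 to 100) fields, as in the responses I saw.

```python
def needs_review(blocks, threshold=90.0, block_types=("WORD",)):
    """Return blocks of the given types whose confidence is below threshold.

    A block with no Confidence field is treated as 0.0, so it is flagged.
    """
    return [
        block for block in blocks
        if block["BlockType"] in block_types
        and block.get("Confidence", 0.0) < threshold
    ]


# Hand-written sample blocks; the confidence figures are invented for illustration.
SAMPLE_WORDS = [
    {"BlockType": "WORD", "Text": "Textract", "Confidence": 99.4},
    {"BlockType": "WORD", "Text": "ading", "Confidence": 61.3},
    {"BlockType": "LINE", "Text": "Textract ading", "Confidence": 80.0},
]

if __name__ == "__main__":
    for block in needs_review(SAMPLE_WORDS):
        print("check by hand:", block["Text"], block["Confidence"])
```

A document with no flagged blocks could go straight into the workflow, while anything else joins a manual queue. The right threshold will depend on your documents.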
You can call the Textract API using a synchronous or asynchronous approach. There are a few limits, such as 5MB maximum size for bitmap images, 500MB for a PDF, no more than 10 per cent rotation from vertical alignment, and no handwriting detection.
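The size limits and the choice of call can be checked before anything is uploaded. The Python sketch below is a minimal version under some assumptions of mine: that a megabyte means 1,048,576 bytes, that PNG and JPEG images go through the synchronous call, and that PDFs go through the asynchronous one from an S3 bucket, which is how I read the launch documentation. The names plan_call and detect_text_async are hypothetical; the boto3 operations they wrap exist, but running them needs AWS credentials.

```python
import os
import time

MAX_IMAGE_BYTES = 5 * 1024 * 1024    # 5MB limit for bitmap images (assumed binary MB)
MAX_PDF_BYTES = 500 * 1024 * 1024    # 500MB limit for PDFs (assumed binary MB)


def plan_call(filename, size_bytes):
    """Return "sync" or "async" for a file, or raise ValueError if unsupported."""
    ext = os.path.splitext(filename)[1].lower()
    if ext in (".png", ".jpg", ".jpeg"):
        if size_bytes > MAX_IMAGE_BYTES:
            raise ValueError("image exceeds the 5MB limit")
        return "sync"
    if ext == ".pdf":
        if size_bytes > MAX_PDF_BYTES:
            raise ValueError("PDF exceeds the 500MB limit")
        return "async"
    raise ValueError("unsupported file type: " + ext)


def detect_text_async(bucket, key, poll_seconds=5):
    """Start an asynchronous text detection job and poll until it finishes.

    Returns only the first page of results; longer documents would need
    further calls using the NextToken value in the response.
    """
    import boto3  # imported lazily so plan_call works without AWS libraries
    client = boto3.client("textract")
    job = client.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    while True:
        result = client.get_document_text_detection(JobId=job["JobId"])
        if result["JobStatus"] != "IN_PROGRESS":
            return result
        time.sleep(poll_seconds)


if __name__ == "__main__":
    print(plan_call("scan.png", 200_000))
    print(plan_call("contract.pdf", 30 * 1024 * 1024))
```

Polling in a loop is the simplest approach for a sketch. A production workflow would more likely react to a completion notification than sit and wait.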
Amazon Textract has a generous free tier of up to 1,000 pages per month using the Detect Document Text API and up to 100 pages per month using the Analyze Document API. ®