Building an OCR Service With TesseractJS in AWS Lambda

Tue, 21 Nov 2017 00:00:00 +0000

Over the past few days I have been trying to get TesseractJS working in AWS Lambda so that I could run OCR (Optical Character Recognition) on images stored in an S3 bucket. However, I am fairly new to NodeJS, and I ran into some difficulties getting it to work in the Lambda environment. In this post I will go through these issues and how I solved them.

TesseractJS is an OCR library written in pure JavaScript. It can recognize the text in images and also report the location of the paragraphs, lines, and words in the document.
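
To make the word-location data concrete, here is a minimal sketch that pulls each word and its bounding box out of a recognition result. The result shape used here (a `words` array whose entries carry `text`, `confidence`, and a `bbox` with `x0`, `y0`, `x1`, `y1`) is an assumption based on the TesseractJS 1.x output, and `extractWordBoxes` is a hypothetical helper, so check both against the version you install.

```javascript
// Hypothetical helper: flattens a recognition result into simple word records.
// Assumes result.words is an array of { text, confidence, bbox: { x0, y0, x1, y1 } }.
function extractWordBoxes(result) {
  const words = (result && result.words) || [];
  return words.map(function (word) {
    return {
      text: word.text,
      confidence: word.confidence,
      left: word.bbox.x0,
      top: word.bbox.y0,
      width: word.bbox.x1 - word.bbox.x0,
      height: word.bbox.y1 - word.bbox.y0
    };
  });
}

// Stand-in for a real result, so the sketch runs without TesseractJS installed.
const sampleResult = {
  text: 'Hello world',
  words: [
    { text: 'Hello', confidence: 91, bbox: { x0: 10, y0: 20, x1: 60, y1: 40 } },
    { text: 'world', confidence: 88, bbox: { x0: 70, y0: 20, x1: 125, y1: 40 } }
  ]
};

console.log(extractWordBoxes(sampleResult));
```

In a real function you would pass in the object returned by the recognition call instead of `sampleResult`.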

We will be using the NodeJS 6.10 runtime in AWS Lambda, and I will be deploying the service with ClaudiaJS.

Downloading the TesseractJS Files

When you ask TesseractJS to recognize an image, it automatically begins downloading the files it needs in order to run: the Tesseract language files, a core library file, and a worker file.

The problem occurs when it tries to download these files inside AWS Lambda. Lambda only allows writing to the /tmp/ directory, so you will get an error like this in your logs:

Error: EROFS: read-only file system, open 'eng.traineddata'
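
The same restriction can be observed without TesseractJS. As a minimal sketch using only Node's built-in modules, the hypothetical helpers below check whether a directory accepts writes and build a file path under the temp directory instead. It assumes `os.tmpdir()` resolves to `/tmp` inside Lambda, which is worth confirming in your own function.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Returns true if the current process can write to `dir`, false otherwise
// (including when the directory does not exist or the file system is read-only).
function isWritable(dir) {
  try {
    fs.accessSync(dir, fs.constants.W_OK);
    return true;
  } catch (err) {
    return false;
  }
}

// Hypothetical helper: a location under the temp directory where a
// downloaded file such as eng.traineddata could be stored.
function tempPathFor(fileName) {
  return path.join(os.tmpdir(), fileName);
}

console.log('cwd writable:', isWritable(process.cwd()));
console.log('tmp writable:', isWritable(os.tmpdir()));
console.log('language file would go to:', tempPathFor('eng.traineddata'));
```

In Lambda the first check should report false, since the directory holding the function's code is read-only, while the temp directory check should report true.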
