Many note-taking apps nowadays offer to take a picture of a document and turn it into text. I was curious and decided to dig a little deeper to see what exactly was going on.
Having done a little research I came across Optical Character Recognition — a field of research in pattern recognition and AI revolving around precisely what we are interested in, reading text from an image. There is a very promising JavaScript library implementing OCR called tesseract.js, which not only works in Node but also in a browser — no server needed!
I would like to focus on working out how to add tesseract.js to an application and then check how well it does its job by creating a function to mark all of the matched words in an image.
Here’s a link to the repository.
To add tesseract to a project we can simply type this in the terminal:
npm install tesseract.js
After importing it into our codebase everything should work as expected, at least according to the package’s docs. In reality, though, I kept getting an error about a missing worker.js file, and since neither the docs nor some very thorough googling were of much help, I used a workaround. I copied a file called worker.min.js from node_modules/tesseract.js and pasted it into the public folder from which I serve my static files. After that I changed the path to the worker inside tesseract like so:
tesseract.workerOptions.workerPath = 'http://localhost:8080/worker.min.js';
and everything worked correctly.
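The hard-coded localhost address only works on a development machine. As a minimal sketch, assuming the worker file is served from the root of whatever origin hosts the page, the path could be derived instead. The helper name buildWorkerPath is hypothetical and not part of tesseract.js:

```javascript
// Hypothetical helper (not part of tesseract.js): builds an absolute URL
// for the copied worker file relative to a given origin.
function buildWorkerPath(origin, fileName = 'worker.min.js') {
  return new URL(fileName, origin).href;
}

// In the browser this could replace the hard-coded address, e.g.:
// tesseract.workerOptions.workerPath = buildWorkerPath(window.location.origin);
console.log(buildWorkerPath('http://localhost:8080'));
```

This keeps the workaround intact and only removes the dependency on a fixed host and port.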
Let’s create a simple application to recognize text in an image. We would like it to render the image twice: once to show the user their original image of choice and once to highlight the words that were matched. Finally, we would also like the app to show the user its current progress at all times.
<label for="recognition-image-input">Choose image</label>
<input type="file" accept="image/jpeg, image/png" id="recognition-image-input" /><br />
<label for="recognition-confidence-input">Confidence</label>
<input type="number" max="100" min="0" id="recognition-confidence-input" value="70" /><br />
<label for="recognition-progress">File recognition progress:</label>
<progress id="recognition-progress" max="100" value="0">0%</progress>
<div id="recognition-text"></div>
<div id="recognition-images">
  <div id="original-image"></div>
  <div id="labeled-image"></div>
</div>
<input type="file"> lets the user choose an image and <input type="number"> sets the desired confidence, which indicates how certain of the result the user would like the app to be. Matches that do not meet the confidence requirement won’t show up in the result. <progress> informs the user how far along the recognition is, <div id="recognition-text"> shows the recognized text and <div id="recognition-images"> works as a placeholder for the images.
By listening for the change event of the <input type="file" /> we can get the user’s image of choice and render the results.
Before that, however, let’s save the references to the HTML elements in variables for the future code snippets to be more readable:
const recognitionImageInputElement = document.querySelector(
  '#recognition-image-input',
);
const recognitionConfidenceInputElement = document.querySelector(
  '#recognition-confidence-input',
);
const recognitionProgressElement = document.querySelector('#recognition-progress');
const recognitionTextElement = document.querySelector('#recognition-text');
const originalImageElement = document.querySelector('#original-image');
const labeledImageElement = document.querySelector('#labeled-image');
When the user selects an image on their computer the change event is fired. The <input type="file"> element has a property called files which holds all the files the user has selected. We are not accepting multiple files, however, so the selected file will always be at the 0th index.
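recognitionImageInputElement.addEventListener('change', () => {
  // Nothing to do when no file has been selected
  if (!recognitionImageInputElement.files || recognitionImageInputElement.files.length === 0) {
    return null;
  }
  const file = recognitionImageInputElement.files[0];
})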
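The accept attribute is only a hint to the file picker, so it can be worth checking the file type again in code. Here is a minimal sketch of that check; pickImageFile is a hypothetical helper and the list of types simply mirrors the accept attribute from the markup above:

```javascript
// MIME types taken from the accept attribute of the file input above.
const ACCEPTED_TYPES = ['image/jpeg', 'image/png'];

// Hypothetical helper: returns the first file of a FileList-like object,
// or null when nothing usable was selected.
function pickImageFile(files) {
  if (!files || files.length === 0) {
    return null;
  }
  const file = files[0];
  return ACCEPTED_TYPES.includes(file.type) ? file : null;
}

// Plain objects stand in for File instances so the sketch also runs
// outside a browser.
console.log(pickImageFile([{ name: 'scan.png', type: 'image/png' }]));
console.log(pickImageFile([]));
```

Inside the change handler, the result of pickImageFile(recognitionImageInputElement.files) could be used in place of reading files[0] directly.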
Tesseract has a method called recognize which accepts two arguments: an imageLike and options. An imageLike can be many things. In our case, we are going to use a File object that will be available to us once a user chooses an image. options are only used to set the language of the image or (in some advanced cases) to change the defaults of tesseract. We won’t, however, be changing those defaults here.
Every text recognized by tesseract has a confidence value (from 0 to 100) that tells us how sure tesseract is of the result.
Confidence can be tricky because of two things.
First, paragraphs have their own confidence, as do lines, words and symbols. The confidence of a line is equal to the lowest of the confidences of its constituent words. By the same principle, the confidence of a word is equal to the confidence of the symbol tesseract is least confident about.
This means that just because the confidence of a line is low doesn’t necessarily mean that the whole line was misrecognized — it could be just one word that is causing trouble.
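To make the rule concrete, here is a small sketch of the "lowest confidence wins" idea described above. It only illustrates the description in this article and is not taken from tesseract’s code; the words and their values are made up:

```javascript
// Illustration of the rule described above, not tesseract's actual code:
// a group is only as trustworthy as its least certain member.
function lowestConfidence(items) {
  return Math.min(...items.map(({ confidence }) => confidence));
}

// Made-up values for a three-word line.
const lineWords = [
  { text: 'Hello', confidence: 92 },
  { text: 'W0rld', confidence: 41 },
  { text: 'again', confidence: 88 },
];

console.log(lowestConfidence(lineWords)); // 41
```

Two of the three words are recognized with high confidence, yet the line as a whole would be reported at 41.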
Secondly, confidence indicates how much an object resembles a certain character.
If the image is, for instance, somebody’s face then the iris of their eye might be mistaken for the letter ‘O’ with fairly high confidence. This means that filtering out everything below a given confidence level will not always leave us with nothing but good matches.
Now that we have a file let’s extract text from it by calling the .recognize() method. Also, by adding a handler to the .progress() method we can update the <progress> element.
return tesseract
  .recognize(file, {
    lang: 'eng',
  })
  .progress(({ progress, status }) => {
    if (!progress || !status || status !== 'recognizing text') {
      return null;
    }
    const p = (progress * 100).toFixed(2);
    recognitionProgressElement.textContent = `${status}: ${p}%`;
    recognitionProgressElement.value = p;
  })
Inside the .progress() handler we are given two pieces of information: progress (a number ranging from 0 to 1), which tells us how far along the processing is, and status, which is simply a message telling us what’s going on. We multiply progress by a hundred so that the user sees 50 instead of 0.50.
The result of the .recognize() method is confusing, to say the least. It is not well documented, so we have to deduce some things on our own:
{
  blocks: Array[1]
  confidence: 87
  html: "<div class='ocr_page' id='page_1' ..."
  lines: Array[3]
  oem: "DEFAULT"
  paragraphs: Array[1]
  psm: "SINGLE_BLOCK"
  symbols: Array[33]
  text: "Hello World↵from beyond↵the Cosmic Void↵↵"
  version: "3.04.00"
  words: Array[7]
}
html is the extracted text embedded in HTML tags and text is the extracted text. paragraphs, words and symbols (which are the paragraphs, words and characters in the text, respectively) are arrays of objects; the properties of those objects that we rely on in this article are text, confidence and bbox.
We are going to use the paragraphs property to show the extracted text to the user inside <p> elements, and the words property to create black-bordered boxes and place them on the second picture to show the user exactly where the matched words are.
We want to render the paragraphs to the user and the best way to do so is to create a <p> element for each paragraph. A paragraph has a text property that can be set as the <p> element’s textContent.
Inside the previously created <div id="recognition-text"> element we can render the paragraphs with the .append() method:
const paragraphsElements = res.paragraphs.map(({ text }) => {
  const p = document.createElement('p');
  p.textContent = text;
  return p;
});
recognitionTextElement.append(...paragraphsElements);
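The sample result shown earlier ends its text with trailing line breaks, so it may be worth cleaning the paragraph text before rendering it. This is a sketch under the assumption that trimming is acceptable for display; paragraphTexts is a hypothetical helper and the res object here is a made-up stand-in for the real result:

```javascript
// Mock of the relevant part of the result object; the values are made up.
const res = {
  paragraphs: [
    { text: 'Hello World\nfrom beyond\nthe Cosmic Void\n\n' },
    { text: '   ' },
  ],
};

// Hypothetical helper: trims each paragraph and drops the ones that end
// up empty.
function paragraphTexts(result) {
  return result.paragraphs
    .map(({ text }) => text.trim())
    .filter((text) => text.length > 0);
}

console.log(paragraphTexts(res));
```

The returned strings can then be turned into <p> elements exactly as in the snippet above.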
To render the images we have to create them first because so far we only have the <div> elements that work as containers:
const originalImage = document.createElement('img');
const labeledImage = originalImage.cloneNode(true);
There is a little problem, however, with setting their src property: we don’t have a URL that points to the image, only a File object.
To render a File object inside the <img> tag we have to use the FileReader constructor like this:
const setImageSrc = (image: HTMLImageElement, imageFile: File) => {
  return new Promise<void>((resolve, reject) => {
    const fr = new FileReader();
    // readAsDataURL() yields a string; anything else is treated as a failure
    fr.onload = function() {
      if (typeof fr.result !== 'string') {
        return reject(null);
      }
      image.src = fr.result;
      return resolve();
    };
    fr.onerror = reject;
    fr.readAsDataURL(imageFile);
  });
};
We pass the File object to the .readAsDataURL() method and then wait for the handler assigned to the onload property to fire with the result. The result can now be set as the src of the image.
The code will look like this:
const originalImage = document.createElement('img');
await setImageSrc(originalImage, file);
const labeledImage = originalImage.cloneNode(true);
To show a box on every matched word we first have to filter out every word whose confidence is below the value previously set (inside the <input id="recognition-confidence-input"> element):
const wordsElements = res.words
  .filter(({ confidence }) => {
    return confidence > parseInt(recognitionConfidenceInputElement.value, 10);
  })
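One thing to watch for: if the confidence input is empty, parseInt returns NaN and every comparison against it is false, so no word would be shown at all. The sketch below guards against that. Both helper names are hypothetical, and the fallback of 70 is an assumption that simply reuses the default value from the markup:

```javascript
// Hypothetical helper: parses the raw input value, falling back to 70
// (the default in the markup above) when it is not a number, and clamps
// the result to the 0-100 range.
function parseThreshold(rawValue, fallback = 70) {
  const parsed = parseInt(rawValue, 10);
  if (Number.isNaN(parsed)) {
    return fallback;
  }
  return Math.min(100, Math.max(0, parsed));
}

// Keeps only the words whose confidence is above the threshold.
function filterByConfidence(words, rawValue) {
  const threshold = parseThreshold(rawValue);
  return words.filter(({ confidence }) => confidence > threshold);
}

// Made-up words for illustration.
const sampleWords = [
  { text: 'Hello', confidence: 92 },
  { text: 'W0rld', confidence: 41 },
];

// Only 'Hello' passes in both cases; the empty input falls back to 70.
console.log(filterByConfidence(sampleWords, '70').map(({ text }) => text));
console.log(filterByConfidence(sampleWords, '').map(({ text }) => text));
```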
Then, thanks to the bbox property that is available on each word object, we know the coordinates of every matched word. The coordinates are x0, x1, y0 and y1, where:
x0: start of the word on the horizontal axis, it becomes the left CSS property
y0: start of the word on the vertical axis, it becomes the top CSS property
x1: end of the word on the horizontal axis (by subtracting x1 - x0 we get the width property)
y1: end of the word on the vertical axis (by subtracting y1 - y0 we get the height property)
const wordsElements = res.words
  .filter(({ confidence }) => {
    return confidence > parseInt(recognitionConfidenceInputElement.value, 10);
  })
  .map((word) => {
    const div = document.createElement('div');
    const { x0, x1, y0, y1 } = word.bbox;
    div.classList.add('word-element');
    Object.assign(div.style, {
      top: `${y0}px`,
      left: `${x0}px`,
      width: `${x1 - x0}px`,
      height: `${y1 - y0}px`,
      border: '1px solid black',
      position: 'absolute',
    });
    return div;
  });
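The coordinate arithmetic can be isolated in a small function. The sketch below uses a hypothetical helper, bboxToStyle, and adds an optional scale parameter. The scale is an assumption of mine, not something the code above does: it would only matter if the image were displayed at a size other than the one tesseract analyzed, in which case the boxes would otherwise be misplaced.

```javascript
// Hypothetical helper: converts a bbox into the CSS values used above.
// The optional scale (displayed width / natural width) is an assumption
// for the case where the image is not shown at its original size.
function bboxToStyle({ x0, x1, y0, y1 }, scale = 1) {
  return {
    left: `${x0 * scale}px`,
    top: `${y0 * scale}px`,
    width: `${(x1 - x0) * scale}px`,
    height: `${(y1 - y0) * scale}px`,
  };
}

console.log(bboxToStyle({ x0: 10, y0: 20, x1: 110, y1: 50 }));
console.log(bboxToStyle({ x0: 10, y0: 20, x1: 110, y1: 50 }, 0.5));
```

With the default scale of 1 this produces the same left, top, width and height values as the inline version.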
The last thing to do is to append both the images and the words to their respective parents, which are <div id="original-image"> for the original image and <div id="labeled-image"> for the image with the marked matches.
originalImageElement.appendChild(originalImage);
labeledImageElement.appendChild(labeledImage);
labeledImageElement.append(...wordsElements);
To get the boxes with position: absolute; to be displayed on top of the image, let’s add the required CSS:
#labeled-image { position: relative; }
With this out of the way, let’s see the app in action!
I have taken a screenshot of my recent post to see how well it handles well-formatted text on a single-color background.
Original image:
Labeled image:
Here is the extracted text:
Recently on Facebook David Smooke (the CEO of Hackernoon) posted an article in which he listed 2018’s Top Tech Stories. He also mentioned that if someone wished to make a similar list about say JavaScript he would be happy to feature it on the frontpage of Hackernoon.
In a constant struggle to get more people to read my work I could not miss this opportunity, sol immediately started to plan how to approach making such a list.
And there you have it!
The tesseract.js library provides us with a ready-to-use OCR implementation that is efficient and, for the most part, accurate. The additional advantage of the library is its immense flexibility thanks to being compatible with both Node.js and a browser. There is even an option to include custom training data which could make it work better for your specific applications.