Maciej Cieślar A JavaScript developer and a blogger @ https://www.mcieslar.com/

How to extract text from an image using JavaScript



Many note-taking apps nowadays offer to take a picture of a document and turn it into text. I was curious and decided to dig a little deeper to see what exactly was going on.

Having done a little research I came across Optical Character Recognition — a field of research in pattern recognition and AI revolving around precisely what we are interested in, reading text from an image. There is a very promising JavaScript library implementing OCR called tesseract.js, which not only works in Node but also in a browser — no server needed!

I would like to focus on working out how to add tesseract.js to an application and then check how well it does its job by creating a function to mark all of the matched words in an image.

Here’s a link to the repository.

Tesseract.js

To add tesseract to a project we can simply type this in the terminal:

npm install tesseract.js

After importing it into our codebase everything should work as expected, at least according to the package’s docs. In reality, though, I kept getting an error about a missing worker.js file, and since neither the docs nor some very thorough googling were of much help, I used a workaround. I copied a file called worker.min.js from node_modules/tesseract.js and pasted it into the public folder from which I serve my static files. After that I changed the path to the worker inside tesseract like so:

tesseract.workerOptions.workerPath = 'http://localhost:8080/worker.min.js';

and everything worked correctly.

Application

Let’s create a simple application to recognize text in an image. We would like it to render the image twice: once to show the user their original image of choice and once to highlight the words that were matched. Finally, we would also like our app to show the user its progress at all times.

HTML markup

<label for="recognition-image-input">Choose image</label>
<input type="file" accept="image/jpeg, image/png" id="recognition-image-input" /><br />
<label for="recognition-confidence-input">Confidence</label>
<input type="number" max="100" min="0" id="recognition-confidence-input" value="70" /><br />
<label for="recognition-progress">File recognition progress:</label>
<progress id="recognition-progress" max="100" value="0">0%</progress>
<div id="recognition-text"></div>
<div id="recognition-images">
  <div id="original-image"></div>
  <div id="labeled-image"></div>
</div>

<input type="file"> lets the user choose an image, and <input type="number"> sets the desired confidence, which indicates how certain of the result the user would like the app to be. Matches that do not meet the confidence requirement won’t show up in the result. <progress> informs the user how far along the recognition is, <div id="recognition-text"> shows the recognized text, and <div id="recognition-images"> works as a placeholder for the images.

By listening on the change event of the <input type="file" /> we can get the user’s image of choice and render the results.

Before that, however, let’s save the references to the HTML elements in variables for the future code snippets to be more readable:

const recognitionImageInputElement = document.querySelector(
 '#recognition-image-input',
);
const recognitionConfidenceInputElement = document.querySelector(
 '#recognition-confidence-input',
);
const recognitionProgressElement = document.querySelector('#recognition-progress');
const recognitionTextElement = document.querySelector('#recognition-text');
const originalImageElement = document.querySelector('#original-image');
const labeledImageElement = document.querySelector('#labeled-image');

Listening on the change event

When the user selects an image on their computer the change event is fired.

The <input type="file"> element has a property called files which holds all the files the user has selected. We are not accepting multiple files, however, so the selected file will always be at the 0th index.

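// async, because the later snippets inside this handler use await
recognitionImageInputElement.addEventListener('change', async () => {
  // Bail out when no file is selected
  if (!recognitionImageInputElement.files || !recognitionImageInputElement.files.length) {
    return null;
  }
  const file = recognitionImageInputElement.files[0];
})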

How to recognize an image

Tesseract has a method called recognize which accepts two arguments — an imageLike and options. An imageLike can be many things. In our case, we are going to use a File object that will be available to us once a user chooses an image. options are only used to set the language of the image or (in some advanced cases) to change the defaults of tesseract. We won’t, however, be interested in that here.

Every text recognized by tesseract has a confidence value (from 0 to 100) that tells us how sure tesseract is of the result.

A note about confidence

Confidence can be tricky because of two things.

First, paragraphs have their own confidence, as do words and symbols. The confidence of a line is equal to the lowest amongst confidences of its constituent words. By the same principle, the confidence of a word is equal to the confidence of a symbol tesseract is least confident about.

This means that just because the confidence of a line is low doesn’t necessarily mean that the whole line was misrecognized — it could be just one word that is causing trouble.
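The "weakest link" rule described above can be pictured with a small sketch. It is not tesseract's own code, only an illustration of the rule as stated here; the helper name lowestConfidence and the sample words are made up.

```javascript
// Sketch of the "weakest link" rule described above; not tesseract's own code.
// `items` is a hypothetical array of objects with a numeric `confidence` (0-100).
const lowestConfidence = (items) => {
  if (items.length === 0) {
    return 0; // assumption: nothing recognized means no confidence
  }
  return Math.min(...items.map(({ confidence }) => confidence));
};

// Made-up sample line: two well-recognized words and one poor one.
const sampleLineWords = [
  { text: 'Hello', confidence: 93 },
  { text: 'W0rld', confidence: 41 },
  { text: 'from', confidence: 90 },
];

const sampleLineConfidence = lowestConfidence(sampleLineWords);
console.log(sampleLineConfidence); // 41
```

A single badly recognized word drags the whole line down to 41, even though the other two words were read almost perfectly.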

Secondly, confidence indicates how much an object resembles a certain character.

If the image is, for instance, somebody’s face then the iris of their eye might be mistaken for the letter ‘O’ with fairly high confidence. This means that filtering out everything below a given confidence level will often, but not always, leave us with nothing but good matches.

Recognizing an image

Now that we have a file let’s extract text from it by calling the .recognize() method. Also, by adding a handler to the .progress() method we can update the <progress> element.

return tesseract
  .recognize(file, {
    lang: 'eng',
  })
  .progress(({ progress, status }) => {
    if (!progress || !status || status !== 'recognizing text') {
      return null;
    }
    const p = (progress * 100).toFixed(2);
    recognitionProgressElement.textContent = `${status}: ${p}%`;
    recognitionProgressElement.value = p;
  })

Inside the .progress() handler we are given two pieces of information: progress, a number ranging from 0 to 1 that tells us how far along the processing is, and status, a message telling us what is going on.

We multiply progress by a hundred so that the user sees 50 instead of 0.50 and so that the value matches the max of 100 set on the <progress> element.
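To make the arithmetic concrete, here is a minimal sketch of the same calculation pulled out into a standalone function. The name toProgressPercent and the second status string are made up for illustration; only 'recognizing text' comes from the handler above.

```javascript
// Hypothetical helper mirroring the handler above: returns the percentage
// as a string with two decimals, or null for updates we want to skip.
const toProgressPercent = ({ progress, status }) => {
  if (!progress || status !== 'recognizing text') {
    return null;
  }
  return (progress * 100).toFixed(2);
};

console.log(toProgressPercent({ status: 'recognizing text', progress: 0.5 })); // "50.00"
console.log(toProgressPercent({ status: 'another status', progress: 0.5 })); // null
```

Note that .toFixed() returns a string, which is fine both for the text shown to the user and for the value of the <progress> element.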

Dealing with the result

The result of the .recognize() method is confusing, to say the least. It is not well documented, so we have to deduce some things on our own:

{
    blocks: Array[1]
    confidence: 87
    html: "<div class='ocr_page' id='page_1' ..."
    lines: Array[3]
    oem: "DEFAULT"
    paragraphs: Array[1]
    psm: "SINGLE_BLOCK"
    symbols: Array[33]
    text: "Hello World↵from beyond↵the Cosmic Void↵↵"
    version: "3.04.00"
    words: Array[7]
}

html is the extracted text embedded into HTML tags and text is the extracted text as a plain string. paragraphs, words and symbols (which are the paragraphs, words and characters in the text respectively) are arrays of objects. The properties of those objects used in this article are text, confidence and bbox.
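As a rough illustration only, a single entry of the words array can be pictured as below. This is a cut-down sketch limited to the properties this article relies on; all values are invented, and real entries carry more fields than shown.

```javascript
// Illustrative only: a cut-down word entry with just the properties this
// article uses. All values are invented; real entries carry more fields.
const sampleWord = {
  text: 'Hello',
  confidence: 93,
  bbox: { x0: 10, y0: 20, x1: 110, y1: 50 },
};

const { text, confidence, bbox } = sampleWord;
console.log(`${text} (${confidence}%) starts at x=${bbox.x0}, y=${bbox.y0}`);
// Hello (93%) starts at x=10, y=20
```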

We are going to use the paragraphs property to show the extracted text to the user inside the <p> elements, and the words property to create black-bordered boxes and place them on the second picture to show the user exactly what the positions were of the matched words.

Showing extracted text to the user

We want to render the paragraphs to the user and the best way to do so is to create a <p> element for each paragraph. A paragraph has a text property that can be set as the <p> element’s textContent.

Inside the previously created <div id="recognition-text"> element we can render the paragraphs with the .append() method:

const paragraphsElements = res.paragraphs.map(({ text }) => {
  const p = document.createElement('p');
  p.textContent = text;
  return p;
});
recognitionTextElement.append(...paragraphsElements);

Rendering images

To render the images we have to create them first because so far we only have the <div> elements that work as containers:

const originalImage = document.createElement('img');

const labeledImage = originalImage.cloneNode(true);

There is a little problem, however, with setting their src property as we don’t have the URL that points to the image — instead we have a File object.

To render a File object inside the <img> tag we have to use the FileReader constructor like this:

const setImageSrc = (image, imageFile) => {
  return new Promise((resolve, reject) => {
    const fr = new FileReader();
    fr.onload = function() {
      // readAsDataURL should yield a string; anything else is treated as a failure
      if (typeof fr.result !== 'string') {
        return reject(new Error('Unexpected FileReader result'));
      }
      image.src = fr.result;
      return resolve();
    };
    fr.onerror = reject;
    fr.readAsDataURL(imageFile);
  });
};

We pass the File object to the .readAsDataURL() method and then wait for the onload handler to fire with the result. The result can now be set as the src of the image.

The code will look like this:

const originalImage = document.createElement('img');
await setImageSrc(originalImage, file);
const labeledImage = originalImage.cloneNode(true);

Marking the matched words

To show a box on every matched word we first have to filter out every word whose confidence is not above the value previously set (inside the <input id="recognition-confidence-input"> element):

const wordsElements = res.words
  .filter(({ confidence }) => {
    return confidence > parseInt(recognitionConfidenceInputElement.value, 10);
  })
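One edge case is worth a quick sketch: if the confidence input is left empty, parseInt returns NaN and the comparison rejects every word. The standalone version below falls back to a default instead. The function name, the fallback of 70 (taken from the input's default value in the markup) and the sample words are assumptions for illustration.

```javascript
// Sketch: the same filter as a standalone function (names are hypothetical).
// `rawThreshold` is the string read from the number input.
const filterByConfidence = (words, rawThreshold, fallback = 70) => {
  const parsed = parseInt(rawThreshold, 10);
  // parseInt('') is NaN, and `confidence > NaN` is always false,
  // so an empty input would silently drop every word.
  const threshold = Number.isNaN(parsed) ? fallback : parsed;
  return words.filter(({ confidence }) => confidence > threshold);
};

// Made-up sample data.
const sampleWords = [
  { text: 'Hello', confidence: 93 },
  { text: 'W0rld', confidence: 41 },
];

console.log(filterByConfidence(sampleWords, '70').map(({ text }) => text)); // [ 'Hello' ]
console.log(filterByConfidence(sampleWords, '').length); // 1, falls back to 70
```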

Then, thanks to a bbox property that is available on each word object we know the coordinates of every matched word. The coordinates are x0, x1, y0 and y1, where:

x0: start of the word on the horizontal axis, it becomes the left CSS property

y0: start of the word on the vertical axis, it becomes the top CSS property

x1: end of the word on the horizontal axis (x1 - x0 gives us the width property)

y1: end of the word on the vertical axis (y1 - y0 gives us the height property)

const wordsElements = res.words
  .filter(({ confidence }) => {
    return confidence > parseInt(recognitionConfidenceInputElement.value, 10);
  })
  .map((word) => {
    const div = document.createElement('div');
    const { x0, x1, y0, y1 } = word.bbox;
    div.classList.add('word-element');
    Object.assign(div.style, {
      top: `${y0}px`,
      left: `${x0}px`,
      width: `${x1 - x0}px`,
      height: `${y1 - y0}px`,
      border: '1px solid black',
      position: 'absolute',
    });
    return div;
  });
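As a worked example of the coordinate arithmetic, the conversion can be isolated into a pure function and run on a made-up bounding box. The helper name bboxToStyle and the numbers are hypothetical.

```javascript
// Sketch: bbox-to-CSS conversion as a pure function (hypothetical helper name).
const bboxToStyle = ({ x0, y0, x1, y1 }) => ({
  top: `${y0}px`,
  left: `${x0}px`,
  width: `${x1 - x0}px`,
  height: `${y1 - y0}px`,
});

// Made-up bounding box: starts at (10, 20), ends at (110, 50).
const sampleStyle = bboxToStyle({ x0: 10, y0: 20, x1: 110, y1: 50 });
console.log(sampleStyle);
// { top: '20px', left: '10px', width: '100px', height: '30px' }
```

Keep in mind that, as far as I can tell, these values are expressed in pixels of the recognized image, so the boxes should line up only while the <img> is displayed at its original size. If CSS scales the image, the boxes would need the same scaling.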

The last thing to do is to append both the images and the words to their respective parents, which are <div id="original-image"> for the original image and <div id="labeled-image"> for the image with the marked matches.

originalImageElement.appendChild(originalImage);
labeledImageElement.appendChild(labeledImage);
labeledImageElement.append(...wordsElements);

To get the boxes with position: absolute; to be displayed on the image let’s add the required CSS:

#labeled-image {
  position: relative;
}

With this out of the way, let’s see the app in action!

Testing it out

I have taken a screenshot of my recent post to see how well it handles a well-formatted text on a single-color background.

Original image:

Labeled image:

Here is the extracted text:

Recently on Facebook David Smooke (the CEO of Hackernoon) posted an article in which he listed 2018’s Top Tech Stories. He also mentioned that if someone wished to make a similar list about say JavaScript he would be happy to feature it on the frontpage of Hackernoon.

In a constant struggle to get more people to read my work I could not miss this opportunity, sol immediately started to plan how to approach making such a list.

And there you have it!

Conclusion

The tesseract.js library provides us with a ready-to-use OCR implementation that is efficient and, for the most part, accurate. The additional advantage of the library is its immense flexibility thanks to being compatible with both Node.js and a browser. There is even an option to include custom training data which could make it work better for your specific applications.
