Editor’s note: This article was last updated on 4 January 2023 to ensure that all information is compatible with the latest version of Node.js and to add information about other NLP libraries, like NLP.js and Compromise.cool.
The internet produces a never-ending stream of unstructured text. Luckily, modern computer systems can make sense of this kind of data using an underlying technology called natural language processing (NLP).
Python is usually the go-to language when it comes to NLP because of its wealth of language processing packages, like the Natural Language Toolkit. However, JavaScript is growing rapidly and the existence of npm gives its developers access to a large number of packages, including packages to perform NLP for different languages.
In this article, we will focus on getting started with NLP using Node. We will be using a JavaScript library called natural. By adding the natural library to our project, our code will be able to parse, interpret, manipulate, and understand natural languages from user input.
This article will barely scratch the surface of NLP, but it will be useful for developers who already use NLP with Python and want to transition to achieve the same results with Node. Complete newbies will also learn a lot about NLP as a technology and its usage with Node.
Natural language processing technology can process human language as input and perform one or more of the following operations:
NLP is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
Practical applications of NLP are all around us these days, as most of our devices integrate AI, ML, and NLP to enhance human-to-machine communication. Here are some common examples of NLP in action.
One of the most helpful technologies is the Google Search engine. You put in text and receive millions of related results as a response. This is possible because of the NLP technology that can make sense of the input and perform a series of logical operations. This is also what allows Google Search to understand your intent and suggest the proper spelling to you when you spell a search term incorrectly.
Virtual assistants such as Siri, Alexa, and Google Assistant show an advanced level of the implementation of NLP. After receiving verbal input from you, they can identify the intent, perform an operation, and send back a response in a natural language.
Chatbots can analyze large amounts of textual data and give different responses based on large data and their ability to detect intent. This gives the overall feel of a natural conversation and not one with a machine.
Have you noticed that email clients are constantly getting better at filtering spam emails out of your inbox? This is possible because the filter engines can understand the content of emails — mostly using Bayesian spam filtering — and decide if it’s spam or not.
The use cases above show that AI, ML, and NLP are already being used heavily on the web. Because humans interact with websites using natural languages, we should build our websites with NLP capabilities.
To code along with this article, you will need to create an index.js file and paste in the snippet you want to try, then run the file with Node. Let’s begin!
We can install natural by running the following command:
npm install natural
The source code to each of the following usage examples in the next section is available on GitHub. Feel free to clone it, fork it, or submit an issue.
Let’s learn how to perform some basic but important NLP tasks using natural.
Tokenization is the process of dividing/splitting input characters or words into smaller parts known as “tokens.” The tokens could be characters, words, or subwords. Tokenization is the initial step in natural language processing, which entails gathering data and breaking it into parts so that a machine can understand it.
For example, let’s look at the text string: The quick brown fox jumps over the lazy dog
A computer does not implicitly segment this string on spaces the way a natural language speaker would. The raw input of 43 characters must be explicitly split into nine tokens using a space delimiter (i.e., by matching the string " " or the regular expression /\s{1}/).
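To make the splitting step concrete, here is a minimal sketch in plain JavaScript, without any library. The helper name splitOnWhitespace is our own and is not part of natural; it assumes the text is delimited by whitespace only.

```javascript
// Hypothetical helper (not part of natural): split text on runs of whitespace.
function splitOnWhitespace(text) {
  return text.trim().split(/\s+/);
}

const sentence = 'The quick brown fox jumps over the lazy dog';
const tokens = splitOnWhitespace(sentence);

console.log(sentence.length); // 43
console.log(tokens.length);   // 9
console.log(tokens);
```

A plain whitespace split keeps punctuation attached to words (for example, 'dog.' would stay a single token), which is one reason to reach for a dedicated tokenizer.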
natural ships with a number of smart tokenizer algorithms that can break text into arrays of tokens. Here’s a code snippet showing the usage of the Word tokenizer:
// index.js
var natural = require('natural');
var tokenizer = new natural.WordTokenizer();

console.log(tokenizer.tokenize("The quick brown fox jumps over the lazy dog"));
Running this with Node gives the following output:
[ 'The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog' ]
Stemming is the act of reducing a word to its word stem (also known as base or root form). Stemming is a feature of artificial intelligence retrieval and extraction as well as linguistic morphology. It is used by search engines to index words. For example, words such as cats, catlike, and catty will be stemmed down to the root word, cat.
Natural currently supports two stemming algorithms: Porter and Lancaster (Paice/Husk). Here’s a code snippet implementing stemming using the Porter algorithm:
// index.js
const natural = require('natural');

console.log(natural.PorterStemmer.tokenizeAndStem("I can see that we are going to be friends"))
In the code above, we use the tokenizeAndStem() method of the Porter stemmer to break the string into individual words and reduce each word to its base form. The result is an array of stemmed tokens:
[ 'go', 'friend' ]
N.B., in the result above, stop words have been removed by the algorithm. Stop words are words that are filtered out before the processing of natural language (e.g., be, an, and to are all stop words).
The Porter algorithm is one of the oldest and best-known stemming algorithms, and it is the less aggressive of the two. Its word stems are reasonably clear and intelligible. The Porter stemmer is a suffix-stripping algorithm: it reduces words to their base forms using predefined rules. It employs more than 50 rules, organized into five phases and a few substeps, to remove common suffixes.
Some examples of the rules are:
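As a rough illustration, the first substep of the Porter algorithm (step 1a, which handles plural suffixes) can be sketched in plain JavaScript. This is only one substep and not natural's implementation; the names STEP_1A_RULES and porterStep1a are our own.

```javascript
// Step 1a of the Porter algorithm: plural suffixes, longest suffix first.
// Illustrative sketch only; the full algorithm has many more rules.
const STEP_1A_RULES = [
  [/sses$/, 'ss'], // caresses -> caress
  [/ies$/, 'i'],   // ponies   -> poni
  [/ss$/, 'ss'],   // caress   -> caress
  [/s$/, ''],      // cats     -> cat
];

function porterStep1a(word) {
  for (const [suffix, replacement] of STEP_1A_RULES) {
    if (suffix.test(word)) {
      // Only the first matching rule is applied.
      return word.replace(suffix, replacement);
    }
  }
  return word; // no rule matched
}

console.log(['caresses', 'ponies', 'caress', 'cats'].map(porterStep1a));
// [ 'caress', 'poni', 'caress', 'cat' ]
```

Note that a stem such as 'poni' is not a dictionary word. Stemming only needs related words to end up with the same stem.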
On the other hand, Lancaster is quite aggressive: it chops words so heavily that the resulting stems can be hard to recognize. Because the stems lose some of their readability, it is the less commonly used of the two. The Lancaster stemmer is made up of more than 100 rules, roughly twice as many as the Porter stemmer. Its authors defined the rules in a different notation than Porter’s. Each rule consists of five parts, two of which are optional.
Some examples of the rules are:
Natural implements four algorithms for calculating string distance: Hamming distance, Jaro-Winkler, Levenshtein distance, and Dice coefficient. Using these algorithms, we can measure how closely two strings match. In this article, we will use Hamming distance.
Hamming distance measures the distance between two strings of equal length by counting the number of different characters. The third parameter indicates whether the case should be ignored. By default, the algorithm is case sensitive.
Here’s a code snippet showing the usage of the Hamming algorithm for calculating string distance:
// index.js
var natural = require('natural');

console.log(natural.HammingDistance("karolin", "kathrin", false));
console.log(natural.HammingDistance("karolin", "kerstin", false));
console.log(natural.HammingDistance("short string", "longer string", false));
The output:
3
3
-1
The first two comparisons return 3 because three letters differ. The last one returns -1 because the lengths of the strings being compared are different.
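To see what the algorithm is counting, here is a minimal plain JavaScript sketch of the same calculation. It is our own illustration rather than natural's source code; the function name hammingDistance and its default for ignoreCase are assumptions chosen to mirror the behavior described above.

```javascript
// Sketch of Hamming distance: count positions where two equal-length
// strings differ. Returns -1 for strings of different lengths, mirroring
// the behavior shown in the output above.
function hammingDistance(a, b, ignoreCase = false) {
  if (a.length !== b.length) {
    return -1;
  }
  if (ignoreCase) {
    a = a.toLowerCase();
    b = b.toLowerCase();
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) {
      distance++;
    }
  }
  return distance;
}

console.log(hammingDistance('karolin', 'kathrin'));            // 3
console.log(hammingDistance('karolin', 'kerstin'));            // 3
console.log(hammingDistance('short string', 'longer string')); // -1
console.log(hammingDistance('Karolin', 'karolin', true));      // 0
```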
Text classification, also known as text tagging, is the process of classifying text into organized groups. That is, if we have a new unknown statement, our processing system can decide which category it fits into the most based on its content.
Some of the most common use cases for automatic text classification include the following:
natural currently supports two classifiers: Naive Bayes and logistic regression. The following examples use the BayesClassifier class:
// index.js
var natural = require('natural');
var classifier = new natural.BayesClassifier();

classifier.addDocument('i am long qqqq', 'buy');
classifier.addDocument('buy the q\'s', 'buy');
classifier.addDocument('short gold', 'sell');
classifier.addDocument('sell gold', 'sell');

classifier.train();

console.log(classifier.classify('i am short silver'));
console.log(classifier.classify('i am long copper'));
In the code above, we trained the classifier on sample text. It will use reasonable defaults to tokenize and stem the text. Based on the sample text, the console will log the following output:
sell
buy
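For readers curious about what happens under the hood, the sketch below shows a bare-bones multinomial Naive Bayes classifier in plain JavaScript. It is a simplified illustration, not natural's implementation: the class name TinyNaiveBayes, the three-word stop list, and the lack of stemming are all our own assumptions.

```javascript
// Hypothetical stop-word list, just large enough for this example.
const STOP_WORDS = new Set(['i', 'am', 'the']);

function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter((t) => t && !STOP_WORDS.has(t));
}

class TinyNaiveBayes {
  constructor() {
    this.docCounts = new Map();  // label -> number of training documents
    this.wordCounts = new Map(); // label -> Map(word -> count)
    this.totalWords = new Map(); // label -> total number of tokens
    this.vocabulary = new Set();
    this.totalDocs = 0;
  }

  addDocument(text, label) {
    if (!this.wordCounts.has(label)) {
      this.docCounts.set(label, 0);
      this.wordCounts.set(label, new Map());
      this.totalWords.set(label, 0);
    }
    this.docCounts.set(label, this.docCounts.get(label) + 1);
    this.totalDocs++;
    const counts = this.wordCounts.get(label);
    for (const token of tokenize(text)) {
      counts.set(token, (counts.get(token) || 0) + 1);
      this.totalWords.set(label, this.totalWords.get(label) + 1);
      this.vocabulary.add(token);
    }
  }

  classify(text) {
    const tokens = tokenize(text);
    let best = null;
    let bestScore = -Infinity;
    for (const [label, docs] of this.docCounts) {
      // Log prior plus log likelihoods with add-one (Laplace) smoothing.
      let score = Math.log(docs / this.totalDocs);
      const counts = this.wordCounts.get(label);
      const denominator = this.totalWords.get(label) + this.vocabulary.size;
      for (const token of tokens) {
        score += Math.log(((counts.get(token) || 0) + 1) / denominator);
      }
      if (score > bestScore) {
        best = label;
        bestScore = score;
      }
    }
    return best;
  }
}

const sketch = new TinyNaiveBayes();
sketch.addDocument('i am long qqqq', 'buy');
sketch.addDocument("buy the q's", 'buy');
sketch.addDocument('short gold', 'sell');
sketch.addDocument('sell gold', 'sell');

console.log(sketch.classify('i am short silver')); // sell
console.log(sketch.classify('i am long copper'));  // buy
```

Unlike natural, this sketch has no separate train() step: the counts are updated as documents are added, and the probabilities are computed at classification time.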
Sentiment analysis, also known as opinion mining or emotion AI, is one of the most used applications of NLP, which identifies and extracts viewpoints from spoken or written language to ascertain the emotion of a person.
Sentiment analysis is used to assess whether a piece of text is positive, negative, or neutral. Businesses use it to monitor brand awareness and consumer feedback in order to understand how well a product is performing and what is required to increase sales.
Natural supports algorithms that can calculate the sentiment of each piece of text by summing the polarity of each word and normalizing it with the length of the sentence. If a negation occurs, the result is made negative.
Here’s an example of its usage:
// index.js
var natural = require('natural');
var Analyzer = natural.SentimentAnalyzer;
var stemmer = natural.PorterStemmer;
var analyzer = new Analyzer("English", stemmer, "afinn");

// getSentiment expects an array of strings
console.log(analyzer.getSentiment(["I", "don't", "want", "to", "play", "with", "you"]));
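The constructor has three parameters:
Language: the language of the text ("English" in this example)
Stemmer: a stemmer that can be provided to increase the coverage of the analyzer (PorterStemmer in this example)
Vocabulary: the type of vocabulary to use; "afinn", "senticon" or "pattern" are valid values

Running the code above gives the following output: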
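0.42857142857142855
A score above zero indicates positive sentiment, and a score below zero indicates negative sentiment. Here, the analyzer returns a positive score even though the sentence reads as negative, which shows that word-level polarity scoring is only approximate.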
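The scoring scheme described earlier (sum the polarity of each word, normalize by sentence length, and flip the sign on negation) can be sketched in a few lines of plain JavaScript. This is a simplified illustration under our own assumptions: the four-word LEXICON and the NEGATIONS list are made up for the example and are far smaller than a real vocabulary such as AFINN.

```javascript
// Hypothetical mini-lexicon and negation list, for illustration only.
const LEXICON = new Map([['love', 3], ['good', 3], ['bad', -3], ['hate', -3]]);
const NEGATIONS = new Set(['not', 'never', 'no']);

function getSentiment(tokens) {
  if (tokens.length === 0) {
    return 0;
  }
  let sum = 0;
  let negated = false;
  for (const token of tokens) {
    const word = token.toLowerCase();
    if (NEGATIONS.has(word)) {
      negated = true;
    }
    sum += LEXICON.get(word) || 0; // unknown words contribute nothing
  }
  const score = sum / tokens.length; // normalize by sentence length
  return negated ? -score : score;   // a negation flips the sign
}

console.log(getSentiment(['I', 'love', 'cats']));              // 1
console.log(getSentiment(['I', 'do', 'not', 'love', 'cats'])); // -0.6
console.log(getSentiment(['The', 'film', 'was', 'bad']));      // -0.75
```

Because every token counts toward the sentence length, long sentences with few polar words score closer to zero.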
Using natural, we can compare two words that are spelled differently but sound similar using phonetic matching. Here’s an example using the metaphone.compare() method:
// index.js
var natural = require('natural');
var metaphone = natural.Metaphone;
var soundEx = natural.SoundEx;

var wordA = 'phonetics';
var wordB = 'fonetix';

if (metaphone.compare(wordA, wordB))
    console.log('They sound alike!');

// We can also obtain the raw phonetics of a word using process()
console.log(metaphone.process('phonetics'));
We also obtained the raw phonetics of a word using process(). We get the following output when we run the code above:
They sound alike!
FNTKS
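Metaphone is too involved to reproduce here, but the older Soundex algorithm (which the snippet above references as natural.SoundEx) is small enough to sketch in plain JavaScript. The function below is our own illustration of standard American Soundex, not natural's code: it keeps the first letter, maps the remaining consonants to digits, skips repeated codes, and pads the result to four characters.

```javascript
// Sketch of American Soundex (illustrative, not natural's implementation).
const SOUNDEX_CODES = {
  b: 1, f: 1, p: 1, v: 1,
  c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
  d: 3, t: 3,
  l: 4,
  m: 5, n: 5,
  r: 6,
};

function soundex(word) {
  const letters = word.toLowerCase().replace(/[^a-z]/g, '');
  if (letters.length === 0) {
    return '';
  }
  let result = letters[0].toUpperCase();
  let previous = SOUNDEX_CODES[letters[0]] || 0;
  for (let i = 1; i < letters.length && result.length < 4; i++) {
    const letter = letters[i];
    const code = SOUNDEX_CODES[letter] || 0;
    if (code !== 0 && code !== previous) {
      result += code;
    }
    // 'h' and 'w' do not separate letters with the same code; vowels do.
    if (letter !== 'h' && letter !== 'w') {
      previous = code;
    }
  }
  return result.padEnd(4, '0');
}

console.log(soundex('Robert'));    // R163
console.log(soundex('Rupert'));    // R163
console.log(soundex('phonetics')); // P532
console.log(soundex('fonetix'));   // F532
```

Because Soundex always keeps the first letter, 'phonetics' and 'fonetix' get different codes here, which is the kind of case where a more elaborate algorithm such as Metaphone does better.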
Users may make typographical errors when supplying input to a web application through a search bar or an input field. Natural has a probabilistic spellchecker that can suggest corrections for misspelled words using an array of tokens from a text corpus.
Let’s explore an example using an array of two words (also known as a corpus) for simplicity:
// index.js
var natural = require('natural');

var corpus = ['something', 'soothing'];
var spellcheck = new natural.Spellcheck(corpus);

console.log(spellcheck.getCorrections('soemthing', 1));
console.log(spellcheck.getCorrections('soemthing', 2));
It suggests corrections (sorted by probability in descending order) that are up to a maximum edit distance away from the input word. A maximum distance of one will cover 80% to 95% of spelling mistakes. After a distance of two, it becomes very slow.
We get the following output from running the code:
[ 'something' ]
[ 'something', 'soothing' ]
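To get a feel for what "edit distance" means here, the sketch below computes a distance in which an insertion, deletion, substitution, or swap of two adjacent letters each counts as one edit, then filters a corpus by it. This is our own simplified illustration, not how natural is implemented: the names editDistance and suggest are hypothetical, and results are ordered by distance rather than by probability.

```javascript
// Edit distance where insert, delete, substitute, and adjacent transposition
// each cost 1 (the "optimal string alignment" variant).
function editDistance(a, b) {
  const d = Array.from({ length: a.length + 1 }, () => new Array(b.length + 1).fill(0));
  for (let i = 0; i <= a.length; i++) d[i][0] = i;
  for (let j = 0; j <= b.length; j++) d[0][j] = j;

  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      d[i][j] = Math.min(
        d[i - 1][j] + 1,        // deletion
        d[i][j - 1] + 1,        // insertion
        d[i - 1][j - 1] + cost  // substitution (or match)
      );
      // Adjacent transposition, e.g. "em" <-> "me".
      if (i > 1 && j > 1 && a[i - 1] === b[j - 2] && a[i - 2] === b[j - 1]) {
        d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
      }
    }
  }
  return d[a.length][b.length];
}

// Hypothetical helper: corpus words within maxDistance, closest first.
function suggest(word, corpus, maxDistance) {
  return corpus
    .map((candidate) => ({ candidate, distance: editDistance(word, candidate) }))
    .filter((entry) => entry.distance <= maxDistance)
    .sort((x, y) => x.distance - y.distance)
    .map((entry) => entry.candidate);
}

const words = ['something', 'soothing'];
console.log(suggest('soemthing', words, 1)); // [ 'something' ]
console.log(suggest('soemthing', words, 2)); // [ 'something', 'soothing' ]
```

Here 'soemthing' is one swap away from 'something' but two edits away from 'soothing', which matches the two result lists above.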
Created by the AXA group, NLP.js is an NLP package for bot development that supports 40 languages. It offers entity extraction, sentiment analysis, automatic language identification, and other features, which makes it a good fit for building chatbots in Node.js:
const { NlpManager } = require('node-nlp');

const manager = new NlpManager({ languages: ['en'], forceNER: true });

// Adds the utterances and intents for the NLP
manager.addDocument('en', 'bye bye take care', 'greetings.bye');
manager.addDocument('en', 'okay see you later', 'greetings.bye');
manager.addDocument('en', 'hello', 'greetings.hello');
manager.addDocument('en', 'hi', 'greetings.hello');

// Train also the NLG
manager.addAnswer('en', 'greetings.bye', 'Till next time');
manager.addAnswer('en', 'greetings.bye', 'see you soon!');
manager.addAnswer('en', 'greetings.hello', 'Hey there!');
manager.addAnswer('en', 'greetings.hello', 'Greetings!');

// Train and save the model.
(async() => {
    await manager.train();
    manager.save();
    const response = await manager.process('en', 'I should go now');
    console.log(response);
})();
Compromise.cool is a user-friendly, lightweight library. It converts text to data, so you can run NLP in the browser and draw reasonable conclusions from text. Compromise works only with English.
Here is a simple code snippet:
import nlp from 'compromise'

var doc = nlp('Sam is coming')
doc.verbs().toNegative()
// 'Sam is not coming'
Wink offers NLP utilities for a variety of tasks, including handling negations, managing elisions, and generating n-grams, stems, and phonetic codes for tokens. It provides a collection of APIs for working with strings such as names, sentences, and paragraphs, and with tokens, which are represented as arrays of strings. These APIs carry out the preprocessing needed for many ML applications, including classification and semantic search:
// Load wink-nlp-utils
var nlp = require( 'wink-nlp-utils' );

// Extract person's name from a string:
var name = nlp.string.extractPersonsName( 'Dr. Sarah Connor M. Tech., PhD. - AI' );
console.log( name );
// -> 'Sarah Connor'

// Remove stop words:
var t = nlp.tokens.removeWords( [ 'mary', 'had', 'a', 'little', 'lamb' ] );
console.log( t );
// -> [ 'mary', 'little', 'lamb' ]
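Here’s a quick summary of what we’ve learned in this article: we used natural for tokenization, stemming, string distance, text classification, sentiment analysis, phonetic matching, and spell checking, and we took a brief look at NLP.js, Compromise, and Wink.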