Common NLP Techniques

* Before reading through our piece on common NLP techniques, make sure to check out the previous three installments of this series: “What is Natural Language Processing,” “Natural Language Processing Applications,” and “What are the Top NLP Language Models.”

What’s this about: Natural language processing (NLP) sits at the intersection of computer science, artificial intelligence, and linguistics, and it enables computers to process and “understand” natural language. This in turn helps them carry out tasks like language translation and text summarization. NLP is quickly becoming one of the most crucial technologies of our day, especially given the rapid rise of voice interfaces and chatbots. It is extremely impressive how far NLP has come in a short time.




NLP primarily consists of two subfields:

  1. Natural Language Generation (NLG): Subset of NLP that focuses on enabling computers to write and produce a human language text response based on data input.

  2. Natural Language Understanding (NLU): Subset of NLP that uses syntactic and semantic analysis of text and speech to determine the meaning of a sentence. 

Because of the surge of unstructured data from text, videos, and photos over the last few years, NLU is often used to extract valuable information from sources like social media posts and customer surveys.

Here is a look at some of the most widely used techniques for extracting such data:

Sentiment Analysis

Sentiment analysis is one of the most common techniques used in NLP, and it is useful for things like customer surveys, social media comments, reviews, and other areas where customers can provide opinions and feedback. It uses NLP to analyze online pieces of writing and determine the emotional tone of each text.

When it comes to sentiment analysis, the simplest output is a 3-point scale:

  • Positive

  • Negative

  • Neutral

However, things can get more complex when the output is based on numeric scores, which can then be used with various categories. Social media text also poses particularly complex issues due to the symbolic nature of some of its constructs, such as emojis, animojis, acronyms, and memes.

In the case of a customer expressing many different sentiments in various parts of a text, NLP is often used to analyze the sentiment of each sentence. The negative and positive parts are then separated out, and a sentiment score can be given to help identify the most positive and negative aspects of the text. 
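The sentence-by-sentence scoring described above can be sketched roughly as follows. This is a minimal illustration, not a production approach: the word list `LEXICON`, the naive sentence splitter, and the sample review are all hypothetical, and a real system would typically use a trained model or a much larger curated lexicon.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only).
LEXICON = {
    "great": 1, "love": 1, "fast": 1, "helpful": 1,
    "slow": -1, "broken": -1, "rude": -1, "terrible": -1,
}

def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score_sentence(sentence):
    """Sum the lexicon values of the words in one sentence."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return sum(LEXICON.get(word, 0) for word in words)

def analyze(text):
    """Return one (sentence, score) pair per sentence."""
    return [(s, score_sentence(s)) for s in split_sentences(text)]

review = (
    "The delivery was fast and the staff were helpful. "
    "The app is slow and the checkout is broken."
)
results = analyze(review)

# The highest- and lowest-scoring sentences point to the most
# positive and most negative aspects of the review.
most_positive = max(results, key=lambda pair: pair[1])
most_negative = min(results, key=lambda pair: pair[1])

for sentence, score in results:
    print(score, sentence)
```

Here the first sentence scores 2 and the second scores -2, so the delivery and staff surface as the positive aspects and the app and checkout as the negative ones.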

Both supervised and unsupervised techniques can be used for sentiment analysis, with one of the most popular supervised methods being Naive Bayes. [1] This technique requires a training corpus with sentiment labels, which is used to train a model to identify the sentiment. Other techniques used include random forests [2] and gradient boosting. [3]
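To make the Naive Bayes idea concrete, here is a small sketch written with only the Python standard library. The training corpus `TRAINING`, the whitespace tokenizer, and the add-one smoothing are assumptions chosen for illustration; in practice a library implementation and a far larger labeled corpus would normally be used.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training corpus (an assumption for illustration).
TRAINING = [
    ("i love this product", "positive"),
    ("great service and friendly staff", "positive"),
    ("works great love it", "positive"),
    ("terrible quality and rude staff", "negative"),
    ("i hate this product", "negative"),
    ("awful service never again", "negative"),
]

def tokenize(text):
    return text.lower().split()

def train(labeled_docs):
    """Count documents per label and word occurrences per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        label_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab

def predict(text, model):
    """Pick the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # log P(label) + sum of log P(word | label); words are treated
        # as independent given the label (the "naive" assumption).
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(TRAINING)
print(predict("love the great service", model))
print(predict("rude and awful", model))
```

With this toy corpus, the first message is classified as positive and the second as negative.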

Named Entity Recognition

Named entity recognition is one of the most basic and useful techniques in all of NLP, and it involves extracting the entities in a text. This technique automatically identifies these entities and classifies them into predefined categories.

What exactly are entities? They can include names of people, locations, organizations, times, monetary values, quantities, dates, percentages, and much more. By recognizing these entities, NLP can enable you to extract key information to better understand a text, or it can be used to simply collect information that can then be stored in a database. 

Here is an example sentence and the recognized entities: 

“Twitter is one of the biggest social networks worldwide, earning a total revenue of 3.72 billion U.S. dollars in 2020.” 

This sentence has multiple types of entities: 

  • “Company”: Twitter

  • “Monetary Value”: 3.72 billion U.S. dollars

  • “Date”: 2020

  • “Country”: U.S.
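As a rough illustration of what extracting these entities looks like, the sketch below uses hand-written rules instead of a trained model. The name lists `COMPANIES` and `COUNTRIES` and the two regular expressions are hypothetical and only cover this one example sentence; a trained NER model would generalize well beyond such rules.

```python
import re

# Hypothetical lookup lists (assumptions for illustration only).
COMPANIES = {"Twitter"}
COUNTRIES = {"U.S."}

# Amounts such as "3.72 billion U.S. dollars".
MONEY_PATTERN = re.compile(r"\d+(?:\.\d+)?\s(?:million|billion)\sU\.S\.\sdollars")
# Four-digit years starting with 19 or 20.
YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")

def find_entities(text):
    """Return (category, entity) pairs found by the rules above."""
    entities = []
    for name in sorted(COMPANIES):
        if name in text:
            entities.append(("Company", name))
    for match in MONEY_PATTERN.finditer(text):
        entities.append(("Monetary Value", match.group()))
    for match in YEAR_PATTERN.finditer(text):
        entities.append(("Date", match.group()))
    for name in sorted(COUNTRIES):
        if name in text:
            entities.append(("Country", name))
    return entities

SENTENCE = (
    "Twitter is one of the biggest social networks worldwide, "
    "earning a total revenue of 3.72 billion U.S. dollars in 2020."
)

for category, entity in find_entities(SENTENCE):
    print(f"{category}: {entity}")
```

The output lists the same four entities as the bullets above: the company, the monetary value, the date, and the country.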

For a named entity recognition (NER) model to detect the words or strings of words that make up an entity and assign each one to the right category, developers must first define the entity categories. The NER model is then fed relevant training data in which sample words and phrases are tagged with their corresponding entities. This enables the NER model to eventually learn how to detect the entities itself.

NER models can be applied in many ways, especially when a large dataset is involved. For example, NER can be used for:

  • Categorizing tickets for customer support

  • Content recommendations

  • Analyzing customer feedback

  • Hiring processes 

  • Social media analysis

Text Summarization

Another one of the main NLP techniques is text summarization, which helps summarize the information contained in large texts. This technique is often deployed in use cases like news and research articles. 

There are two main approaches to text summarization:

  • Extraction: These methods extract parts from the text to create a summary.

  • Abstraction: These methods generate new text that conveys the idea of the original text in order to create a summary. 

Many algorithms are used in text summarization, such as LexRank, which ranks sentences based on the similarity between them: a sentence that is similar to more sentences receives a higher ranking. Other common algorithms include TextRank and Latent Semantic Analysis.
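The ranking idea behind LexRank can be sketched in a few lines. This is a simplified, LexRank-style illustration and not the full algorithm: it assumes plain word-count vectors rather than TF-IDF weights, and it scores each sentence by counting how many other sentences it resembles instead of running an iterative graph ranking. The sample sentences and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter

def sentence_vectors(sentences):
    """Represent each sentence as a bag of lowercase word counts."""
    return [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def rank_sentences(sentences, threshold=0.2):
    """Score each sentence by how many other sentences it is similar to."""
    vectors = sentence_vectors(sentences)
    scores = []
    for i, vec_i in enumerate(vectors):
        degree = sum(
            1
            for j, vec_j in enumerate(vectors)
            if i != j and cosine(vec_i, vec_j) >= threshold
        )
        scores.append(degree)
    return scores

def summarize(sentences, k=1, threshold=0.2):
    """Return the k highest-ranked sentences in their original order."""
    scores = rank_sentences(sentences, threshold)
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:k]
    return [sentences[i] for i in sorted(top)]

SENTENCES = [
    "The city council approved the new transit budget.",
    "The transit budget funds new bus routes.",
    "New bus routes will open next spring.",
    "Local bakeries reported record sales.",
]

print(rank_sentences(SENTENCES))
print(summarize(SENTENCES, k=1))
```

In this example the second sentence is similar to both the first and the third, so it ranks highest and becomes the one-sentence extractive summary, while the unrelated bakery sentence scores zero.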

Personality Analysis

One of the fastest growing techniques within NLP is personality analysis, which takes place when an algorithm analyzes a subject to provide insight into that subject’s behavior, tone, education, and personality traits. 

For example, an NLP algorithm can analyze a single message and predict if an individual is a Type-A person, if the message is hostile, if it is impulsive, if the person is emotionally distant, and much more. It can go even further, predicting the relationship between the author and recipient, as well as if the writing hints at some type of mental instability. Because these heuristics can be executed quickly, high volumes of data can be put through the algorithm in a short amount of time. 

One of the most popular tools for extracting insight about a subject is IBM Watson Personality Insights. The IBM tool consumes text and returns scores based on the axes, or “traits,” of three separate psychological models.

Those three models are:

  1. Big 5 Model

  2. Needs Model

  3. Values Model

The Big 5 model evaluates the openness, conscientiousness, extraversion, agreeableness, and emotional range of the input.

The Watson tool requires long samples in order to provide valid absolute scores. Therefore, it’s useful to view the tool as an assessment of persona instead of an absolute personality.

Topic Modeling

One of the more complex methods used in NLP is topic modeling, which enables you to identify the natural topics within a text. Topic modeling is an unsupervised technique, meaning it does not require a labeled training dataset.

Here are some of the most common algorithms used for topic modeling:

  • Latent Semantic Analysis (LSA) 

  • Probabilistic Latent Semantic Analysis (PLSA)

  • Latent Dirichlet Allocation (LDA)

  • Correlated Topic Model (CTM) 

Out of these algorithms, one of the most popular is Latent Dirichlet Allocation, or LDA, which is based on the idea that each text contains a mixture of several topics and that each topic is characterized by several words. LDA only requires the text documents and expected number of topics as input.

If LDA is told to expect two topics, for example, it assigns the words in the documents to one topic or the other based on which words tend to appear together. It then establishes common themes for each topic by grouping together these co-occurring words.
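For readers who want to see the mechanics, below is a compact sketch of LDA fitted with collapsed Gibbs sampling, one common way to estimate the model. The toy documents `DOCS`, the hyperparameters `alpha` and `beta`, the iteration count, and the random seed are all assumptions made for illustration; real projects would normally rely on an established library and far more text.

```python
import random
from collections import Counter

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # words per topic, per doc
    topic_word = [Counter() for _ in range(num_topics)]   # word counts per topic
    topic_total = [0] * num_topics                        # total words per topic
    assignments = []

    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                # Weight each topic by how common it is in this document
                # and how strongly it is associated with this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][word] += 1
                topic_total[k] += 1
    return doc_topic, topic_word

def top_words(topic_word, n=3):
    """List the n most frequent words assigned to each topic."""
    return [
        [word for word, count in counts.most_common(n) if count > 0]
        for counts in topic_word
    ]

# Hypothetical, already-tokenized documents.
DOCS = [
    "cat dog pet vet pet".split(),
    "dog pet leash cat".split(),
    "stock market bank trade".split(),
    "bank trade stock profit market".split(),
]

doc_topic, topic_word = lda_gibbs(DOCS, num_topics=2)
for index, words in enumerate(top_words(topic_word)):
    print(f"Topic {index}: {words}")
```

The only inputs are the documents and the number of topics. Because the sampler is random, the exact word groupings depend on the seed and on how long it runs, and the topics come back unlabeled: it is up to the reader to interpret each group of words as a theme.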

NLP and the Rise of Unstructured Data

With the increasing amount of unstructured data available, these NLP techniques are more important than ever. These are just some of the many techniques available for those looking to extract meaning out of language, and they can transform both the way we see technology and perceive human languages. 

NLP techniques can make our lives easier, which is why there is such a strong interest in this subfield of AI. Humans have long desired to create computers that can understand and communicate with them through language, and with the incredible advancements taking place in the field of NLP, this is quickly approaching reality. These various NLP techniques are the building blocks to creating such a future, and each one offers its own potential to revolutionize today’s technology. 

Whether it’s sentiment analysis, named entity recognition, text summarization, personality analysis, or topic modeling, these various NLP techniques provide us with the opportunity to understand and analyze human language like never before. By unlocking the power of human language now, we inch closer to achieving incredible AI systems that could positively impact society in the future.

Make sure to look out for the upcoming final installment of our NLP series next month, which will cover the top Python tools and libraries.

If you want to gain more insight into NLP and other artificial intelligence technologies, make sure to sign up for the Medium blog: https://gcmori.medium.com/membership

----------

[1] Naive Bayes is a collection of classification algorithms based on Bayes’ theorem. They share the common assumption that, given the class, every feature is independent of the value of any other feature.

[2] Random forests are an ensemble learning method for classification, regression and other tasks that works by constructing a multitude of decision trees at training time. 

[3] Gradient boosting is a machine learning technique that builds models in sequence, relying on the intuition that each new model, when combined with the previous ones, should reduce the overall prediction error.

Giancarlo Mori