An NLP Recipe
Introduction
Natural Language Processing (NLP) is an exciting and rapidly growing field of study that combines linguistics, computer science, and artificial intelligence. With the increasing availability of large amounts of text data and the growing need for intelligent communication with machines, NLP has become more important than ever. In this blog post, we will look at several widely used NLP libraries and walk through code examples for common tasks such as text preprocessing, part-of-speech tagging, named entity recognition, summarization, translation, text generation, and sentiment analysis.
Libraries
Developing NLP applications can be a challenging and time-consuming task without the right tools and resources. This is where NLP libraries come in, providing pre-built functions and modules that simplify the development process. In this article, we will explore some of the most popular NLP libraries and their features, highlighting their benefits and use cases.
nltk
NLTK (Natural Language Toolkit) is a powerful open-source Python library that provides easy-to-use interfaces to numerous natural language processing (NLP) tasks, such as tokenization, part-of-speech tagging, parsing, sentiment analysis, and more.
The library was initially developed at the University of Pennsylvania in the early 2000s and has since become one of the most popular and widely used NLP libraries in academia and industry. NLTK has been instrumental in democratizing access to NLP tools and techniques, making it possible for researchers, developers, and hobbyists to experiment with and build NLP applications with ease.
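As a minimal sketch (not from the original post), the snippet below tokenizes a sentence and tags each token with its part of speech using NLTK:
import nltk
nltk.download('punkt')                        # tokenizer data used by word_tokenize
nltk.download('averaged_perceptron_tagger')   # model used by pos_tag
from nltk import word_tokenize, pos_tag
tokens = word_tokenize("NLTK makes it easy to experiment with natural language processing.")
print(pos_tag(tokens))   # e.g. [('NLTK', 'NNP'), ('makes', 'VBZ'), ...]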
spaCy
spaCy is an open-source software library for advanced natural language processing (NLP) in Python. It was developed with the goal of making NLP faster and more efficient while maintaining accuracy and ease of use. spaCy is designed to help developers build NLP applications with pre-trained models for named entity recognition, part-of-speech tagging, dependency parsing, and more. It also allows developers to train their own custom models to suit specific use cases.
One of the standout features of spaCy is its speed. It was designed to process large volumes of text quickly, making it well-suited for tasks like web scraping, data mining, and information extraction. It accomplishes this speed by implementing several optimizations such as Cython integration, hash-based lookups, and multithreading.
spaCy is also highly customizable and extensible. Developers can train their own models using spaCy’s machine learning framework or integrate other third-party libraries into their workflows. Additionally, it offers a wide range of language support, including English, German, Spanish, Portuguese, Italian, French, Dutch, and others.
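As a small, hedged sketch (not from the original post), here is how a pre-trained spaCy pipeline can be used for part-of-speech tagging and named entity recognition:
import spacy
nlp = spacy.load('en_core_web_sm')   # small English pipeline: python -m spacy download en_core_web_sm
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for token in doc:
    print(token.text, token.pos_, token.dep_)   # token, part-of-speech tag, dependency label
for ent in doc.ents:
    print(ent.text, ent.label_)                 # named entities, e.g. Apple / ORG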
genism
Gensim is a popular open-source library for natural language processing (NLP) tasks, including topic modeling, document similarity analysis, and word vector representations. It was developed by Radim Řehůřek in 2008 and is written in Python. Gensim is designed to handle large amounts of text data and provides efficient tools for working with text corpora.
One of the key features of Gensim is its ability to perform topic modeling, which is the process of identifying the main themes or topics in a set of documents. Gensim provides several implementations of topic models, including Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Hierarchical Dirichlet Process (HDP).
In addition to topic modeling, Gensim also provides tools for creating word vector representations, which are numerical representations of words that capture their meaning and context. These word vectors can be used for a variety of NLP tasks, such as text classification, sentiment analysis, and named entity recognition.
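As a minimal sketch (not from the original post), the following builds a bag-of-words corpus from a few toy documents and fits a small LDA topic model with Gensim:
from gensim.corpora import Dictionary
from gensim.models import LdaModel
texts = [["human", "computer", "interaction"],
         ["graph", "trees", "network"],
         ["computer", "system", "interface"]]
dictionary = Dictionary(texts)                          # map each word to an integer id
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
print(lda.print_topics())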
transformers
The Transformers library from Hugging Face is an open-source library for natural language processing (NLP) tasks built around the Transformer model architecture, which was introduced in the paper “Attention Is All You Need” by Vaswani et al. (2017). The Transformer model has become a popular choice for NLP tasks due to its ability to handle long-range dependencies and its parallelizability.
The Transformers library works with both PyTorch and TensorFlow and provides pre-trained models for a wide range of NLP tasks, such as classification, question answering, and text generation. It also allows fine-tuning pre-trained models on specific downstream tasks with just a few lines of code.
In addition to pre-trained models, the Transformers library provides a range of utilities for NLP tasks, such as tokenization, data processing, and evaluation metrics. It also supports multi-GPU training and inference, making it easy to scale up models to handle large datasets and complex tasks.
Overall, the Transformers library is a powerful tool for NLP tasks that provides access to state-of-the-art pre-trained models and utilities, as well as the flexibility to fine-tune models for specific tasks.
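As a quick, hedged sketch (not from the original post), the high-level pipeline API lets you apply a pre-trained model in a couple of lines:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")   # downloads a default pre-trained sentiment model on first use
print(classifier("Transformers makes state-of-the-art NLP remarkably accessible."))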
Code example
# TEXT PREPROCESSING
import spacy
nlp = spacy.load('en_core_web_sm')
raw_text= 'Amazon Alexa, also known simply as Alexa, is a virtual assistant AI technology developed by Amazon, first used in the Amazon Echo smart speakers developed by Amazon Lab126. It is capable of voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, sports, and other real-time information, such as news. Alexa can also control several smart devices using itself as a home automation system. Users are able to extend the Alexa capabilities by installing "skills" additional functionality developed by third-party vendors, in other settings more commonly called apps such as weather programs and audio features.Most devices with Alexa allow users to activate the device using a wake-word (such as Alexa or Amazon); other devices (such as the Amazon mobile app on iOS or Android and Amazon Dash Wand) require the user to push a button to activate Alexa listening mode, although, some phones also allow a user to say a command, such as "Alexa" or "Alexa wake". Currently, interaction and communication with Alexa are available only in English, German, French, Italian, Spanish, Portuguese, Japanese, and Hindi. In Canada, Alexa is available in English and French with the Quebec accent).(truncated...).'
text_doc=nlp(raw_text)
# Tokenization
token_count=0
for token in text_doc:
    print(token.text)
    token_count+=1
print('No of tokens originally',token_count)
# Remove stopwords
stopwords = spacy.lang.en.stop_words.STOP_WORDS
list_stopwords=list(stopwords)
for word in list_stopwords[:7]:
    print(word)
token_count_without_stopwords=0
filtered_text= [token for token in text_doc if not token.is_stop]
for token in filtered_text:
    token_count_without_stopwords+=1
print('No of tokens after removing stopwords', token_count_without_stopwords)
# Remove punctuations
filtered_text=[token for token in filtered_text if not token.is_punct]
token_count_without_stop_and_punct=0
for token in filtered_text:
    print(token)
    token_count_without_stop_and_punct += 1
print('No of tokens after removing stopwords and punctuations', token_count_without_stop_and_punct)
# Lemmatize
lemmatized_list = [token.lemma_ for token in filtered_text]
lemmatized = " ".join(lemmatized_list)
print(lemmatized)
# ANALYSIS
import collections
from collections import Counter
data='It is my birthday today. I could not have a birthday party. I felt sad'
data_doc=nlp(data)
list_of_tokens=[token.text for token in data_doc if not token.is_stop and not token.is_punct]
# Word frequency
token_frequency=Counter(lemmatized_list)
print(token_frequency)
most_frequent_tokens=token_frequency.most_common(6)
print(most_frequent_tokens)
from nltk.stem import PorterStemmer   # assumed stemmer; the stems shown below ('simpli', 'technolog') match Porter stemming
ps = PorterStemmer()
for token in filtered_text:
    print(token.text.ljust(10),'-----',token.pos_, '----', token.lemma_, '---', ps.stem(token.text))
## Part of speech tagging (POS)
all_tags = {token.pos: token.pos_ for token in filtered_text}
print(all_tags)
nouns=[]
verbs=[]
for token in filtered_text:
    if token.pos_ =='NOUN':
        nouns.append(token)
    if token.pos_ =='VERB':
        verbs.append(token)
print('List of Nouns in the text\n',nouns)
print('List of verbs in the text\n',verbs)
## Remove junk POS
numbers=[]
for token in filtered_text:
    if token.pos_=='X':
        print(token.text)
junk_pos=['X','SCONJ']
def remove_pos(word):
    flag=False
    if word.pos_ in junk_pos:
        flag=True
    return flag
revised_robot_doc=[token for token in filtered_text if not remove_pos(token)]
all_tags = {token.pos: token.pos_ for token in revised_robot_doc}
print(all_tags)
Counter({'Alexa': 11, 'Amazon': 7, 'device': 4, 'user': 4, 'develop': 3, 'smart': 2, 'interaction': 2, 'weather': 2, 'app': 2, 'allow': 2, 'activate': 2, 'wake': 2, 'available': 2, 'English': 2, 'know': 1, 'simply': 1, ...})
[('Alexa', 11), ('Amazon', 7), ('device', 4), ('user', 4), ('develop', 3), ('smart', 2)]
Amazon ----- PROPN ---- Amazon --- amazon
Alexa ----- PROPN ---- Alexa --- alexa
known ----- VERB ---- know --- known
simply ----- ADV ---- simply --- simpli
Alexa ----- PROPN ---- Alexa --- alexa
virtual ----- ADJ ---- virtual --- virtual
assistant ----- NOUN ---- assistant --- assist
AI ----- PROPN ---- AI --- ai
technology ----- NOUN ---- technology --- technolog
developed ----- VERB ---- develop --- develop
Amazon ----- PROPN ---- Amazon --- amazon
Amazon ----- PROPN ---- Amazon --- amazon
Echo ----- PROPN ---- Echo --- echo
smart ----- ADJ ---- smart --- smart ...
## Word dependency
for token in filtered_text:
    print(token.text,'---',token.dep_)
Amazon --- compound
Alexa --- nsubj
known --- acl
simply --- advmod
Alexa --- pobj
virtual --- amod
assistant --- compound
AI --- compound
technology --- attr
developed --- acl
Amazon --- pobj
Amazon --- nmod
…
from spacy import displacy
displacy.render(text_doc,style='dep',jupyter=True)
for token in filtered_text:
    print(token.text,'---',token.head.text)
Amazon --- Alexa
Alexa --- is
known --- Alexa
simply --- as
Alexa --- as
virtual --- assistant
assistant --- technology
AI --- technology
technology --- is
developed --- technology
Amazon --- by
Amazon --- Echo
Echo --- speakers
smart --- speakers
speakers --- in
developed --- speakers
…
sentences=list(text_doc.sents)
for sentence in sentences:
    print(sentence.root)
is
is
control
are
require
are
is
## Named entity recognition (NER)
import nltk
from nltk import word_tokenize, pos_tag
nltk.download('punkt')                        # tokenizer data for word_tokenize
nltk.download('averaged_perceptron_tagger')   # model for pos_tag
raw_text= 'Amazon Alexa, also known simply as Alexa, is a virtual assistant AI technology developed by Amazon, first used in the Amazon Echo smart speakers developed by Amazon Lab126.'
tokens=word_tokenize(raw_text)
tokens_pos=pos_tag(tokens)
print(tokens_pos)
[('Amazon', 'NNP'), ('Alexa', 'NNP'), (',', ','), ('also', 'RB'), ('known', 'VBN'), ('simply', 'RB'), ('as', 'IN'), ('Alexa', 'NNP'), (',', ','), ('is', 'VBZ'), ('a', 'DT'), ('virtual', 'JJ'), ('assistant', 'NN'), ('AI', 'NNP'), ('technology', 'NN'), ('developed', 'VBN'), ('by', 'IN'), ('Amazon', 'NNP'), (',', ','), ('first', 'RB'), ('used', 'VBN'), ('in', 'IN'), ('the', 'DT'), ('Amazon', 'NNP'), ('Echo', 'NNP'), ('smart', 'JJ'), ('speakers', 'NNS'), ('developed', 'VBN'), ('by', 'IN'), ('Amazon', 'NNP'), ('Lab126', 'NNP'), ('.', '.')]
from nltk import ne_chunk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
named_entity=ne_chunk(tokens_pos)
print(named_entity)
(S (PERSON Amazon/NNP) (GPE Alexa/NNP) ,/, also/RB known/VBN simply/RB as/IN (PERSON Alexa/NNP) ,/, is/VBZ a/DT virtual/JJ assistant/NN AI/NNP technology/NN developed/VBN by/IN (PERSON Amazon/NNP) ,/, first/RB used/VBN in/IN the/DT (ORGANIZATION Amazon/NNP Echo/NNP) smart/JJ speakers/NNS developed/VBN by/IN (PERSON Amazon/NNP Lab126/NNP) ./.)
### spacy
for entity in text_doc.ents:
    print(entity.text,'--',entity.label_)
Amazon Alexa -- ORG
Alexa -- ORG
AI -- ORG
Amazon -- ORG
Amazon Echo -- ORG
Amazon Lab126 -- ORG
Alexa -- ORG
Alexa -- ORG
third -- ORDINAL
Alexa -- ORG
Alexa -- ORG
from spacy import displacy
displacy.render(text_doc,style='ent',jupyter=True)
# print out list of tagged organizations
list_of_org=[]
for token in filtered_text:
    if token.ent_type_=='ORG':
        list_of_org.append(token.text)
print(list_of_org)
['Amazon', 'Alexa', 'Alexa', 'AI', 'Amazon', 'Amazon', 'Echo', 'Amazon', 'Lab126', 'Alexa', ...]
# TASK
## Extractive summarization reuses sentences and phrases
## from the original text to build the summary
import gensim
# Note: gensim.summarization was removed in Gensim 4.0, so this requires an older release (e.g. gensim 3.8.x)
from gensim.summarization import summarize
article_text='''Artificial Intelligence (AI) is a sub-field of computer science focused on creating intelligent programs that can perform tasks generally done by humans, including perception, learning, reasoning, pattern recognition, and decision-making. The term AI covers machine learning, predictive analytics, natural language processing, and robotics. AI is already part of our everyday lives, but its opportunities come with challenges for society and the law. There are different types of AI, including narrow AI (programmed to be competent in one specific area), artificial general intelligence (tasks across multiple fields), and artificial superintelligence (AI that exceeds human levels of intelligence).
In general, Artificial Intelligence (AI) develops so rapidly and it has the potential to revolutionize the way we live and work. However, along with the tremendous benefits come significant ethical concerns. AI algorithms are trained on large amounts of data, and if that data is biased or flawed, it can lead to biased outcomes that disproportionately affect certain groups of people. Additionally, the increasing use of AI systems raises questions about privacy and security, as these systems collect vast amounts of personal data, with or without consent.
To understand more on this, we will explore the ethical implications of AI, including the impact of bias, privacy, and security concerns. We will also discuss potential solutions to these issues and the importance of ethical considerations in the development and deployment of AI systems. To see a curated list of scary AI, visit aweful-ai.'''
short_summary=summarize(article_text)
print(short_summary)
Artificial Intelligence (AI) is a sub-field of computer science focused on creating intelligent programs that can perform tasks generally done by humans, including perception, learning, reasoning, pattern recognition, and decision-making. To understand more on this, we will explore the ethical implications of AI, including the impact of bias, privacy, and security concerns.
summary_by_ratio=summarize(article_text,ratio=0.1)
print(summary_by_ratio)
Artificial Intelligence (AI) is a sub-field of computer science focused on creating intelligent programs that can perform tasks generally done by humans, including perception, learning, reasoning, pattern recognition, and decision-making.
summary_by_word_count=summarize(article_text,word_count=30)
print(summary_by_word_count)
Artificial Intelligence (AI) is a sub-field of computer science focused on creating intelligent programs that can perform tasks generally done by humans, including perception, learning, reasoning, pattern recognition, and decision-making.
### spacy
## With spaCy, we score keywords, then score each sentence by its keywords,
## and print out the highest-scoring sentences to summarize the text
keywords_list = []
desired_pos = ['PROPN', 'ADJ', 'NOUN', 'VERB']
from string import punctuation
for token in filtered_text:
    if(token.text in nlp.Defaults.stop_words or token.text in punctuation):
        continue
    if(token.pos_ in desired_pos):
        keywords_list.append(token.text)
from collections import Counter
dictionary = Counter(keywords_list)
print(dictionary)
highest_frequency = Counter(keywords_list).most_common(1)[0][1]
for word in dictionary:
    dictionary[word] = (dictionary[word]/highest_frequency)
print(dictionary)
score={}
for sentence in text_doc.sents:
    for token in sentence:
        if token.text in dictionary.keys():
            if sentence in score.keys():
                score[sentence]+=dictionary[token.text]
            else:
                score[sentence]=dictionary[token.text]
print(score)
sorted_score = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
text_summary=[]
no_of_sentences=4
total = 0
for i in range(len(sorted_score)):
    text_summary.append(str(sorted_score[i][0]).capitalize())
    total += 1
    if(total >= no_of_sentences):
        break
print(text_summary)
['Most devices with alexa allow users to activate the device using a wake-word (such as alexa or amazon); other devices (such as the amazon mobile app on ios or android and amazon dash wand) require the user to push a button to activate alexa listening mode, although, some phones also allow a user to say a command, such as "alexa" or "alexa wake".', 'Amazon alexa, also known simply as alexa, is a virtual assistant ai technology developed by amazon, first used in the amazon echo smart speakers developed by amazon lab126.', 'Users are able to extend the alexa capabilities by installing "skills" additional functionality developed by third-party vendors, in other settings more commonly called apps such as weather programs and audio features.', 'Currently, interaction and communication with alexa are available only in english, german, french, italian, spanish, portuguese, japanese, and hindi.']
## Generative (abstractive) summarization generates original sentences
## and phrases to summarize the text
import torch
from transformers import pipeline
raw_text='''You can notice that in the extractive method, the sentences of the summary are all taken from the original text. There is no change in structure of any sentence.
Generative text summarization methods overcome this shortcoming. The concept is based on capturing the meaning of the text and generating entirely new sentences to best represent them in the summary.
These are more advanced methods and are best for summarization. Here, I shall guide you on implementing generative text summarization using Hugging Face.'''
summarizer = pipeline("summarization", model="stevhliu/my_awesome_billsum_model", max_length=50)
summarizer(raw_text)
[{'summary_text': 'Generative text summarization methods overcome this shortcoming. The concept is based on capturing the meaning of the text and generating entirely new sentences to best represent them in the summary.'}]
## Language translation
from transformers import pipeline
translator_model=pipeline(task='translation_en_to_ro')
translator_model
translator_model('Hello, good morning, welcome to this country!')
[{'translation_text': 'Salut, bună dimineaţă, bine aţi venit în această ţară!'}]
## Text generation
from transformers import GPT2Tokenizer
from transformers import GPT2DoubleHeadsModel
tokenizer=GPT2Tokenizer.from_pretrained('gpt2-medium')
model=GPT2DoubleHeadsModel.from_pretrained('gpt2-medium')
my_text='Nowadays people are'
ids=tokenizer.encode(my_text)
ids
my_tensor=torch.tensor([ids])
model.eval()
result=model(my_tensor)
predictions=result[0]
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_index
predicted_text = tokenizer.decode(ids + [predicted_index])
print(predicted_text)
Nowadays people are not
no_of_words_to_generate=30
text = 'Nowadays people are not'
for i in range(no_of_words_to_generate):
    input_ids = tokenizer.encode(text)
    input_tensor = torch.tensor([input_ids])
    result = model(input_tensor)
    predictions = result[0]
    predicted_index = torch.argmax(predictions[0,-1,:]).item()
    text = tokenizer.decode(input_ids + [predicted_index])
print(text)
Nowadays people are not the only ones who have been affected by the recent spate of violence. "I've been in the area for a while now and I
## Chatbot
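## Assumption (the original post does not show which checkpoint was loaded): the helpers
## below need a conversational tokenizer and model with a generate() API; a checkpoint
## such as BlenderBot would produce replies like the sample output further down.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained('facebook/blenderbot-400M-distill')
model = AutoModelForSeq2SeqLM.from_pretrained('facebook/blenderbot-400M-distill')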
def take_last_tokens(inputs, note_history, history):
    """Filter the last 128 tokens"""
    if inputs['input_ids'].shape[1] > 128:
        inputs['input_ids'] = torch.tensor([inputs['input_ids'][0][-128:].tolist()])
        inputs['attention_mask'] = torch.tensor([inputs['attention_mask'][0][-128:].tolist()])
        note_history = ['</s> <s>'.join(note_history[0].split('</s> <s>')[2:])]
        history = history[1:]
    return inputs, note_history, history
def add_note_to_history(note, note_history):
    """Add a note to the historical information"""
    note_history.append(note)
    note_history = '</s> <s>'.join(note_history)
    return [note_history]
def chat(message, history):
    history = history or []
    if history:
        history_useful = ['</s> <s>'.join([str(a[0])+'</s> <s>'+str(a[1]) for a in history])]
    else:
        history_useful = []
    history_useful = add_note_to_history(message, history_useful)
    inputs = tokenizer(history_useful, return_tensors="pt")
    inputs, history_useful, history = take_last_tokens(inputs, history_useful, history)
    reply_ids = model.generate(**inputs)
    response = tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0]
    history_useful = add_note_to_history(response, history_useful)
    list_history = history_useful[0].split('</s> <s>')
    history.append((list_history[-2], list_history[-1]))
    return history, history
chat("hello",['how are you'])
(['how are you', ('hello', ' Hi there, how are you? I just got back from a trip to the beach.')])
## Question answering
## In this task, the model receives a paragraph containing the information,
## then extracts the span of words that answers a given question
from datasets import load_dataset
squad = load_dataset("squad", split="train[:5000]")
squad = squad.train_test_split(test_size=0.2)
squad["train"][0]
{'id': '56ce6f81aab44d1400b8878b', 'title': 'To_Kill_a_Mockingbird', 'context': 'Scholars have characterized To Kill a Mockingbird as both a Southern Gothic and coming-of-age or Bildungsroman novel. The grotesque and near-supernatural qualities of Boo Radley and his house, and the element of racial injustice involving Tom Robinson contribute to the aura of the Gothic in the novel. Lee used the term "Gothic" to describe the architecture of Maycomb\'s courthouse and in regard to Dill\'s exaggeratedly morbid performances as Boo Radley. Outsiders are also an important element of Southern Gothic texts and Scout and Jem\'s questions about the hierarchy in the town cause scholars to compare the novel to Catcher in the Rye and Adventures of Huckleberry Finn. Despite challenging the town\'s systems, Scout reveres Atticus as an authority above all others, because he believes that following one\'s conscience is the highest priority, even when the result is social ostracism. However, scholars debate about the Southern Gothic classification, noting that Boo Radley is in fact human, protective, and benevolent. Furthermore, in addressing themes such as alcoholism, incest, rape, and racial violence, Lee wrote about her small town realistically rather than melodramatically. She portrays the problems of individual characters as universal underlying issues in every society.', 'question': 'What genre of book is To Kill a Mockingbird typically called?', 'answers': {'text': ['Southern Gothic and coming-of-age or Bildungsroman novel'], 'answer_start': [60]}}
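## The walkthrough above stops after loading the data; as an illustrative sketch
## (assumption: using a ready-made SQuAD-fine-tuned checkpoint rather than training one here),
## extractive question answering can be run with a pipeline:
from transformers import pipeline
question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
sample = squad["train"][0]
print(question_answerer(question=sample["question"], context=sample["context"]))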
## Text classification
## In this example, we classify the sentiment of movie reviews
## with a pretrained BERT
from datasets import load_dataset
imdb = load_dataset("imdb")
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
tokenized_imdb = imdb.map(preprocess_function, batched=True)
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
import numpy as np
import evaluate
accuracy = evaluate.load("accuracy")   # assumed metric setup for compute_metrics below
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
from transformers import create_optimizer
import tensorflow as tf
batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)
!pip install datasets
tf_train_set = model.prepare_tf_dataset(
tokenized_imdb["train"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_validation_set = model.prepare_tf_dataset(
tokenized_imdb["test"],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
import tensorflow as tf
model.compile(optimizer=optimizer)
from transformers.keras_callbacks import KerasMetricCallback
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
callbacks = [metric_callback]
# model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model")
classifier(text)
[{'label': 'LABEL_1', 'score': 0.9994940757751465}]
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="tf")
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
logits = model(**inputs).logits
predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]
'LABEL_1'