An NLP Recipe
Introduction
Natural Language Processing (NLP) is an exciting and rapidly growing field of study that combines linguistics, computer science, and artificial intelligence. With the increasing availability of large amounts of text data and the growing need for intelligent communication with machines, NLP has become more important than ever. In this blog post, we will look at several widely used NLP libraries and walk through code examples for common tasks such as text preprocessing, part-of-speech tagging, named entity recognition, summarization, translation, text generation, and sentiment analysis.
Libraries
Developing NLP applications can be a challenging and time-consuming task without the right tools and resources. This is where NLP libraries come in, providing pre-built functions and modules that simplify the development process. In this article, we will explore some of the most popular NLP libraries and their features, highlighting their benefits and use cases.
nltk
NLTK (Natural Language Toolkit) is a powerful open-source Python library that provides easy-to-use interfaces to numerous natural language processing (NLP) tasks, such as tokenization, part-of-speech tagging, parsing, sentiment analysis, and more.
The library was initially developed at the University of Pennsylvania in the early 2000s and has since become one of the most popular and widely used NLP libraries in academia and industry. NLTK has been instrumental in democratizing access to NLP tools and techniques, making it possible for researchers, developers, and hobbyists to experiment with and build NLP applications with ease.
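As a minimal sketch (not from the original post), the snippet below tokenizes a sentence and tags each token with its part of speech using NLTK:
import nltk
nltk.download('punkt')                        # tokenizer data used by word_tokenize
nltk.download('averaged_perceptron_tagger')   # model used by pos_tag
from nltk import word_tokenize, pos_tag
tokens = word_tokenize("NLTK makes it easy to experiment with natural language processing.")
print(pos_tag(tokens))   # e.g. [('NLTK', 'NNP'), ('makes', 'VBZ'), ...]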
spaCy
spaCy is an open-source software library for advanced natural language processing (NLP) in Python. It was developed with the goal of making NLP faster and more efficient while maintaining accuracy and ease of use. spaCy is designed to help developers build NLP applications with pre-trained models for named entity recognition, part-of-speech tagging, dependency parsing, and more. It also allows developers to train their own custom models to suit specific use cases.
One of the standout features of spaCy is its speed. It was designed to process large volumes of text quickly, making it well-suited for tasks like web scraping, data mining, and information extraction. It accomplishes this speed by implementing several optimizations such as Cython integration, hash-based lookups, and multithreading.
spaCy is also highly customizable and extensible. Developers can train their own models using spaCy’s machine learning framework or integrate other third-party libraries into their workflows. Additionally, it offers a wide range of language support, including English, German, Spanish, Portuguese, Italian, French, Dutch, and others.
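As a small, hedged sketch (not from the original post), here is how a pre-trained spaCy pipeline can be used for part-of-speech tagging and named entity recognition:
import spacy
nlp = spacy.load('en_core_web_sm')   # small English pipeline: python -m spacy download en_core_web_sm
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for token in doc:
    print(token.text, token.pos_, token.dep_)   # token, part-of-speech tag, dependency label
for ent in doc.ents:
    print(ent.text, ent.label_)                 # named entities, e.g. Apple / ORG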
genism
Gensim is a popular open-source library for natural language processing (NLP) tasks, including topic modeling, document similarity analysis, and word vector representations. It was developed by Radim Řehůřek in 2008 and is written in Python. Gensim is designed to handle large amounts of text data and provides efficient tools for working with text corpora.
One of the key features of Gensim is its ability to perform topic modeling, which is the process of identifying the main themes or topics in a set of documents. Gensim provides several implementations of topic models, including Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), and Hierarchical Dirichlet Process (HDP).
In addition to topic modeling, Gensim also provides tools for creating word vector representations, which are numerical representations of words that capture their meaning and context. These word vectors can be used for a variety of NLP tasks, such as text classification, sentiment analysis, and named entity recognition.
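As a minimal sketch (not from the original post), the following builds a bag-of-words corpus from a few toy documents and fits a small LDA topic model with Gensim:
from gensim.corpora import Dictionary
from gensim.models import LdaModel
texts = [["human", "computer", "interaction"],
         ["graph", "trees", "network"],
         ["computer", "system", "interface"]]
dictionary = Dictionary(texts)                          # map each word to an integer id
corpus = [dictionary.doc2bow(text) for text in texts]   # bag-of-words vectors
lda = LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
print(lda.print_topics())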
transformers
The Transformers library from Hugging Face is an open-source library for natural language processing (NLP) tasks built around the Transformer model architecture, which was introduced in the paper “Attention Is All You Need” by Vaswani et al. (2017). The Transformer model has become a popular choice for NLP tasks due to its ability to handle long-range dependencies and its parallelizability.
The Transformers library works with both PyTorch and TensorFlow and provides pre-trained models for a wide range of NLP tasks, such as classification, question answering, and text generation. It also allows fine-tuning pre-trained models on specific downstream tasks with just a few lines of code.
In addition to pre-trained models, the Transformers library provides a range of utilities for NLP tasks, such as tokenization, data processing, and evaluation metrics. It also supports multi-GPU training and inference, making it easy to scale up models to handle large datasets and complex tasks.
Overall, the Transformers library is a powerful tool for NLP tasks that provides access to state-of-the-art pre-trained models and utilities, as well as the flexibility to fine-tune models for specific tasks.
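As a quick, hedged sketch (not from the original post), the high-level pipeline API lets you apply a pre-trained model in a couple of lines:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")   # downloads a default pre-trained sentiment model on first use
print(classifier("Transformers makes state-of-the-art NLP remarkably accessible."))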
Code example
# TEXT PREPROCESSING
import spacy
nlp = spacy.load('en_core_web_sm')
raw_text= 'Amazon Alexa, also known simply as Alexa, is a virtual assistant AI technology developed by Amazon, first used in the Amazon Echo smart speakers developed by Amazon Lab126. It is capable of voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, sports, and other real-time information, such as news. Alexa can also control several smart devices using itself as a home automation system. Users are able to extend the Alexa capabilities by installing "skills" additional functionality developed by third-party vendors, in other settings more commonly called apps such as weather programs and audio features.Most devices with Alexa allow users to activate the device using a wake-word (such as Alexa or Amazon); other devices (such as the Amazon mobile app on iOS or Android and Amazon Dash Wand) require the user to push a button to activate Alexa listening mode, although, some phones also allow a user to say a command, such as "Alexa" or "Alexa wake". Currently, interaction and communication with Alexa are available only in English, German, French, Italian, Spanish, Portuguese, Japanese, and Hindi. In Canada, Alexa is available in English and French with the Quebec accent).(truncated...).'
text_doc=nlp(raw_text)
# Tokenization
token_count=0
for token in text_doc:
    print(token.text)
    token_count+=1
print('No of tokens originally',token_count)
# Remove stopwords
stopwords = spacy.lang.en.stop_words.STOP_WORDS
list_stopwords=list(stopwords)
for word in list_stopwords[:7]:
    print(word)
token_count_without_stopwords=0
filtered_text= [token for token in text_doc if not token.is_stop]
for token in filtered_text:
    token_count_without_stopwords+=1
print('No of tokens after removing stopwords', token_count_without_stopwords)
# Remove punctuations
filtered_text=[token for token in filtered_text if not token.is_punct]
token_count_without_stop_and_punct=0
for token in filtered_text:
    print(token)
    token_count_without_stop_and_punct += 1
print('No of tokens after removing stopwords and punctuations', token_count_without_stop_and_punct)
# Lemmatize
lemmatized_list = [token.lemma_ for token in filtered_text]
lemmatized = " ".join(lemmatized_list)
print(lemmatized)
# ANALYSIS
import collections
from collections import Counter
data='It is my birthday today. I could not have a birthday party. I felt sad'
data_doc=nlp(data)
list_of_tokens=[token.text for token in data_doc if not token.is_stop and not token.is_punct]
# Word frequency
token_frequency=Counter(lemmatized_list)
print(token_frequency)
most_frequent_tokens=token_frequency.most_common(6)
print(most_frequent_tokens)
from nltk.stem import PorterStemmer   # assumed stemmer; the stems shown below ('simpli', 'technolog') match Porter stemming
ps = PorterStemmer()
for token in filtered_text:
    print(token.text.ljust(10),'-----',token.pos_, '----', token.lemma_, '---', ps.stem(token.text))
## Part of speech tagging (POS)
all_tags = {token.pos: token.pos_ for token in filtered_text}
print(all_tags)
nouns=[]
verbs=[]
for token in filtered_text:
    if token.pos_ =='NOUN':
        nouns.append(token)
    if token.pos_ =='VERB':
        verbs.append(token)
print('List of Nouns in the text\n',nouns)
print('List of verbs in the text\n',verbs)
## Remove junk POS
numbers=[]
for token in filtered_text:
    if token.pos_=='X':
        print(token.text)
junk_pos=['X','SCONJ']
def remove_pos(word):
    flag=False
    if word.pos_ in junk_pos:
        flag=True
    return flag
revised_robot_doc=[token for token in filtered_text if not remove_pos(token)]
all_tags = {token.pos: token.pos_ for token in revised_robot_doc}
print(all_tags)
Counter({'Alexa': 11, 'Amazon': 7, 'device': 4, 'user': 4, 'develop': 3, 'smart': 2, 'interaction': 2, 'weather': 2, 'app': 2, 'allow': 2, 'activate': 2, 'wake': 2, 'available': 2, 'English': 2, 'know': 1, 'simply': 1, ...})
[('Alexa', 11), ('Amazon', 7), ('device', 4), ('user', 4), ('develop', 3), ('smart', 2)]
Amazon ----- PROPN ---- Amazon --- amazon
Alexa ----- PROPN ---- Alexa --- alexa
known ----- VERB ---- know --- known
simply ----- ADV ---- simply --- simpli
Alexa ----- PROPN ---- Alexa --- alexa
virtual ----- ADJ ---- virtual --- virtual
assistant ----- NOUN ---- assistant --- assist
AI ----- PROPN ---- AI --- ai
technology ----- NOUN ---- technology --- technolog
developed ----- VERB ---- develop --- develop
Amazon ----- PROPN ---- Amazon --- amazon
Amazon ----- PROPN ---- Amazon --- amazon
Echo ----- PROPN ---- Echo --- echo
smart ----- ADJ ---- smart --- smart ...
## Word dependency
for token in filtered_text:
    print(token.text,'---',token.dep_)
Amazon --- compound
Alexa --- nsubj
known --- acl
simply --- advmod
Alexa --- pobj
virtual --- amod
assistant --- compound
AI --- compound
technology --- attr
developed --- acl
Amazon --- pobj
Amazon --- nmod
…
from spacy import displacy
displacy.render(text_doc,style='dep',jupyter=True)
for token in filtered_text:
    print(token.text,'---',token.head.text)
Amazon --- Alexa
Alexa --- is
known --- Alexa
simply --- as
Alexa --- as
virtual --- assistant
assistant --- technology
AI --- technology
technology --- is
developed --- technology
Amazon --- by
Amazon --- Echo
Echo --- speakers
smart --- speakers
speakers --- in
developed --- speakers
…
sentences=list(text_doc.sents)
for sentence in sentences:
    print(sentence.root)
is
is
control
are
require
are
is
## Named entity recognition (NER)
import nltk
from nltk import word_tokenize, pos_tag
nltk.download('punkt')                        # tokenizer data for word_tokenize
nltk.download('averaged_perceptron_tagger')   # model for pos_tag
raw_text= 'Amazon Alexa, also known simply as Alexa, is a virtual assistant AI technology developed by Amazon, first used in the Amazon Echo smart speakers developed by Amazon Lab126.'
tokens=word_tokenize(raw_text)
tokens_pos=pos_tag(tokens)
print(tokens_pos)
[('Amazon', 'NNP'), ('Alexa', 'NNP'), (',', ','), ('also', 'RB'), ('known', 'VBN'), ('simply', 'RB'), ('as', 'IN'), ('Alexa', 'NNP'), (',', ','), ('is', 'VBZ'), ('a', 'DT'), ('virtual', 'JJ'), ('assistant', 'NN'), ('AI', 'NNP'), ('technology', 'NN'), ('developed', 'VBN'), ('by', 'IN'), ('Amazon', 'NNP'), (',', ','), ('first', 'RB'), ('used', 'VBN'), ('in', 'IN'), ('the', 'DT'), ('Amazon', 'NNP'), ('Echo', 'NNP'), ('smart', 'JJ'), ('speakers', 'NNS'), ('developed', 'VBN'), ('by', 'IN'), ('Amazon', 'NNP'), ('Lab126', 'NNP'), ('.', '.')]
from nltk import ne_chunk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
named_entity=ne_chunk(tokens_pos)
print(named_entity)
(S (PERSON Amazon/NNP) (GPE Alexa/NNP) ,/, also/RB known/VBN simply/RB as/IN (PERSON Alexa/NNP) ,/, is/VBZ a/DT virtual/JJ assistant/NN AI/NNP technology/NN developed/VBN by/IN (PERSON Amazon/NNP) ,/, first/RB used/VBN in/IN the/DT (ORGANIZATION Amazon/NNP Echo/NNP) smart/JJ speakers/NNS developed/VBN by/IN (PERSON Amazon/NNP Lab126/NNP) ./.)
### spacy
for entity in text_doc.ents:
    print(entity.text,'--',entity.label_)
Amazon Alexa -- ORG
Alexa -- ORG
AI -- ORG
Amazon -- ORG
Amazon Echo -- ORG
Amazon Lab126 -- ORG
Alexa -- ORG
Alexa -- ORG
third -- ORDINAL
Alexa -- ORG
Alexa -- ORG
from spacy import displacy
displacy.render(text_doc,style='ent',jupyter=True)
# print out list of tagged organizations
list_of_org=[]
for token in filtered_text:
    if token.ent_type_=='ORG':
        list_of_org.append(token.text)
print(list_of_org)
['Amazon', 'Alexa', 'Alexa', 'AI', 'Amazon', 'Amazon', 'Echo', 'Amazon', 'Lab126', 'Alexa', ...]
# TASK
## Extractive summarization reuses sentences and phrases
## from the original text to build the summary
import gensim
# Note: gensim.summarization was removed in Gensim 4.0, so this requires an older release (e.g. gensim 3.8.x)
from gensim.summarization import summarize
article_text='''Artificial Intelligence (AI) is a sub-field of computer science focused on creating intelligent programs that can perform tasks generally done by humans, including perception, learning, reasoning, pattern recognition, and decision-making. The term AI covers machine learning, predictive analytics, natural language processing, and robotics. AI is already part of our everyday lives, but its opportunities come with challenges for society and the law. There are different types of AI, including narrow AI (programmed to be competent in one specific area), artificial general intelligence (tasks across multiple fields), and artificial superintelligence (AI that exceeds human levels of intelligence).
In general, Artificial Intelligence (AI) develops so rapidly and it has the potential to revolutionize the way we live and work. However, along with the tremendous benefits come significant ethical concerns. AI algorithms are trained on large amounts of data, and if that data is biased or flawed, it can lead to biased outcomes that disproportionately affect certain groups of people. Additionally, the increasing use of AI systems raises questions about privacy and security, as these systems collect vast amounts of personal data, with or without consent.
To understand more on this, we will explore the ethical implications of AI, including the impact of bias, privacy, and security concerns. We will also discuss potential solutions to these issues and the importance of ethical considerations in the development and deployment of AI systems. To see a curated list of scary AI, visit aweful-ai.'''
short_summary=summarize(article_text)
print(short_summary)
Artificial Intelligence (AI) is a sub-field of computer science focused on creating intelligent programs that can perform tasks generally done by humans, including perception, learning, reasoning, pattern recognition, and decision-making. To understand more on this, we will explore the ethical implications of AI, including the impact of bias, privacy, and security concerns.
summary_by_ratio=summarize(article_text,ratio=0.1)
print(summary_by_ratio)
Artificial Intelligence (AI) is a sub-field of computer science focused on creating intelligent programs that can perform tasks generally done by humans, including perception, learning, reasoning, pattern recognition, and decision-making.
summary_by_word_count=summarize(article_text,word_count=30)
print(summary_by_word_count)
Artificial Intelligence (AI) is a sub-field of computer science focused on creating intelligent programs that can perform tasks generally done by humans, including perception, learning, reasoning, pattern recognition, and decision-making.
### spacy
## With spaCy, we score keywords, then score each sentence by its keywords,
## and print out the highest-scoring sentences to summarize the text
keywords_list = []
desired_pos = ['PROPN', 'ADJ', 'NOUN', 'VERB']
from string import punctuation
for token in filtered_text:
    if(token.text in nlp.Defaults.stop_words or token.text in punctuation):
        continue
    if(token.pos_ in desired_pos):
        keywords_list.append(token.text)
from collections import Counter
dictionary = Counter(keywords_list)
print(dictionary)
highest_frequency = Counter(keywords_list).most_common(1)[0][1]
for word in dictionary:
    dictionary[word] = (dictionary[word]/highest_frequency)
print(dictionary)
score={}
for sentence in text_doc.sents:
    for token in sentence:
        if token.text in dictionary.keys():
            if sentence in score.keys():
                score[sentence]+=dictionary[token.text]
            else:
                score[sentence]=dictionary[token.text]
print(score)
sorted_score = sorted(score.items(), key=lambda kv: kv[1], reverse=True)
text_summary=[]
no_of_sentences=4
total = 0
for i in range(len(sorted_score)):
    text_summary.append(str(sorted_score[i][0]).capitalize())
    total += 1
    if(total >= no_of_sentences):
        break
print(text_summary)
['Most devices with alexa allow users to activate the device using a wake-word (such as alexa or amazon); other devices (such as the amazon mobile app on ios or android and amazon dash wand) require the user to push a button to activate alexa listening mode, although, some phones also allow a user to say a command, such as "alexa" or "alexa wake".', 'Amazon alexa, also known simply as alexa, is a virtual assistant ai technology developed by amazon, first used in the amazon echo smart speakers developed by amazon lab126.', 'Users are able to extend the alexa capabilities by installing "skills" additional functionality developed by third-party vendors, in other settings more commonly called apps such as weather programs and audio features.', 'Currently, interaction and communication with alexa are available only in english, german, french, italian, spanish, portuguese, japanese, and hindi.']
## Generative (abstractive) summarization generates original sentences
## and phrases to summarize the text
import torch
from transformers import pipeline
raw_text='''You can notice that in the extractive method, the sentences of the summary are all taken from the original text. There is no change in structure of any sentence.
Generative text summarization methods overcome this shortcoming. The concept is based on capturing the meaning of the text and generating entirely new sentences to best represent them in the summary.
These are more advanced methods and are best for summarization. Here, I shall guide you on implementing generative text summarization using Hugging Face.'''
summarizer = pipeline("summarization", model="stevhliu/my_awesome_billsum_model", max_length=50)
summarizer(raw_text)
[{'summary_text': 'Generative text summarization methods overcome this shortcoming. The concept is based on capturing the meaning of the text and generating entirely new sentences to best represent them in the summary.'}]
## Language translation
from transformers import pipeline
translator_model=pipeline(task='translation_en_to_ro')
translator_model
translator_model('Hello, good morning, welcome to this country!')
[{'translation_text': 'Salut, bună dimineaţă, bine aţi venit în această ţară!'}]
## Text generation
from transformers import GPT2Tokenizer
from transformers import GPT2DoubleHeadsModel
tokenizer=GPT2Tokenizer.from_pretrained('gpt2-medium')
model=GPT2DoubleHeadsModel.from_pretrained('gpt2-medium')
my_text='Nowadays people are'
ids=tokenizer.encode(my_text)
ids
my_tensor=torch.tensor([ids])
model.eval()
result=model(my_tensor)
predictions=result[0]
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_index
predicted_text = tokenizer.decode(ids + [predicted_index])
print(predicted_text)
Nowadays people are not
no_of_words_to_generate=30
text = 'Nowadays people are not'
for i in range(no_of_words_to_generate):
    input_ids = tokenizer.encode(text)
    input_tensor = torch.tensor([input_ids])
    result = model(input_tensor)
    predictions = result[0]
    predicted_index = torch.argmax(predictions[0,-1,:]).item()
    text = tokenizer.decode(input_ids + [predicted_index])
print(text)
Nowadays people are not the only ones who have been affected by the recent spate of violence. "I've been in the area for a while now and I
## Chatbot
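## Assumption (the original post does not show which checkpoint was loaded): the helpers
## below need a conversational tokenizer and model with a generate() API; a checkpoint
## such as BlenderBot would produce replies like the sample output further down.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained('facebook/blenderbot-400M-distill')
model = AutoModelForSeq2SeqLM.from_pretrained('facebook/blenderbot-400M-distill')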
def take_last_tokens(inputs, note_history, history):
    """Filter the last 128 tokens"""
    if inputs['input_ids'].shape[1] > 128:
        inputs['input_ids'] = torch.tensor([inputs['input_ids'][0][-128:].tolist()])
        inputs['attention_mask'] = torch.tensor([inputs['attention_mask'][0][-128:].tolist()])
        note_history = ['</s> <s>'.join(note_history[0].split('</s> <s>')[2:])]
        history = history[1:]
    return inputs, note_history, history
def add_note_to_history(note, note_history):
    """Add a note to the historical information"""
    note_history.append(note)
    note_history = '</s> <s>'.join(note_history)
    return [note_history]
def chat(message, history):
    history = history or []
    if history:
        history_useful = ['</s> <s>'.join([str(a[0])+'</s> <s>'+str(a[1]) for a in history])]
    else:
        history_useful = []
    history_useful = add_note_to_history(message, history_useful)
    inputs = tokenizer(history_useful, return_tensors="pt")
    inputs, history_useful, history = take_last_tokens(inputs, history_useful, history)
    reply_ids = model.generate(**inputs)
    response = tokenizer.batch_decode(reply_ids, skip_special_tokens=True)[0]
    history_useful = add_note_to_history(response, history_useful)
    list_history = history_useful[0].split('</s> <s>')
    history.append((list_history[-2], list_history[-1]))
    return history, history
chat("hello",['how are you'])
(['how are you', ('hello', ' Hi there, how are you? I just got back from a trip to the beach.')])
## Question answering
## In this task, the model receives a paragraph containing the information,
## then extracts the span of words that answers a given question
from datasets import load_dataset
squad = load_dataset("squad", split="train[:5000]")
squad = squad.train_test_split(test_size=0.2)
squad["train"][0]
{'id': '56ce6f81aab44d1400b8878b', 'title': 'To_Kill_a_Mockingbird', 'context': 'Scholars have characterized To Kill a Mockingbird as both a Southern Gothic and coming-of-age or Bildungsroman novel. The grotesque and near-supernatural qualities of Boo Radley and his house, and the element of racial injustice involving Tom Robinson contribute to the aura of the Gothic in the novel. Lee used the term "Gothic" to describe the architecture of Maycomb\'s courthouse and in regard to Dill\'s exaggeratedly morbid performances as Boo Radley. Outsiders are also an important element of Southern Gothic texts and Scout and Jem\'s questions about the hierarchy in the town cause scholars to compare the novel to Catcher in the Rye and Adventures of Huckleberry Finn. Despite challenging the town\'s systems, Scout reveres Atticus as an authority above all others, because he believes that following one\'s conscience is the highest priority, even when the result is social ostracism. However, scholars debate about the Southern Gothic classification, noting that Boo Radley is in fact human, protective, and benevolent. Furthermore, in addressing themes such as alcoholism, incest, rape, and racial violence, Lee wrote about her small town realistically rather than melodramatically. She portrays the problems of individual characters as universal underlying issues in every society.', 'question': 'What genre of book is To Kill a Mockingbird typically called?', 'answers': {'text': ['Southern Gothic and coming-of-age or Bildungsroman novel'], 'answer_start': [60]}}
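## The walkthrough above stops after loading the data; as an illustrative sketch
## (assumption: using a ready-made SQuAD-fine-tuned checkpoint rather than training one here),
## extractive question answering can be run with a pipeline:
from transformers import pipeline
question_answerer = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
sample = squad["train"][0]
print(question_answerer(question=sample["question"], context=sample["context"]))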
## Text classification
## In this example, we classify the sentiment of movie reviews
## with a pretrained BERT
from datasets import load_dataset
imdb = load_dataset("imdb")
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)
tokenized_imdb = imdb.map(preprocess_function, batched=True)
from transformers import DataCollatorWithPadding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
import numpy as np
import evaluate
accuracy = evaluate.load("accuracy")   # assumed metric setup for compute_metrics below
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
from transformers import create_optimizer
import tensorflow as tf
batch_size = 16
num_epochs = 5
batches_per_epoch = len(tokenized_imdb["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained(
"distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)
!pip install datasets
tf_train_set = model.prepare_tf_dataset(
tokenized_imdb["train"],
shuffle=True,
batch_size=16,
collate_fn=data_collator,
)
tf_validation_set = model.prepare_tf_dataset(
tokenized_imdb["test"],
shuffle=False,
batch_size=16,
collate_fn=data_collator,
)
import tensorflow as tf
model.compile(optimizer=optimizer)
from transformers.keras_callbacks import KerasMetricCallback
metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)
callbacks = [metric_callback]
# model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=3, callbacks=callbacks)
text = "This was a masterpiece. Not completely faithful to the books, but enthralling from beginning to end. Might be my favorite of the three."
from transformers import pipeline
classifier = pipeline("sentiment-analysis", model="stevhliu/my_awesome_model")
classifier(text)
[{'label': 'LABEL_1', 'score': 0.9994940757751465}]
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_model")
inputs = tokenizer(text, return_tensors="tf")
from transformers import TFAutoModelForSequenceClassification
model = TFAutoModelForSequenceClassification.from_pretrained("stevhliu/my_awesome_model")
logits = model(**inputs).logits
predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]
'LABEL_1'