NLP Pipelines for Beginners¶
Basic NLP using spaCy models.
Imports and Configuration¶
from collections import Counter
from typing import List
import spacy
from spacy import Language
from spacy.matcher import Matcher
from spacy.tokens.doc import Doc
Load a Test Document¶
test_file = "data/fidelity_1.txt"
with open(test_file, "r") as file:
    test_doc = file.read()
test_doc[:100]
'Fidelity International acquires LGIM’s UK personal investing arm\nBy Michael Klimes 23rd October 2020'
Load the spaCy Language Model¶
A spaCy language model can be thought of as a pipeline of text processing stages that maps documents into tokens and their annotations (attributes of token objects). For full details, see the spaCy docs.
nlp = spacy.load("en_core_web_sm")
test_doc_ = nlp(test_doc)
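The stages in the loaded pipeline can be listed via nlp.pipe_names; a quick sketch (the component list shown is typical of en_core_web_sm v3 and may vary by version):
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']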
Sentences¶
Documents can be processed on a sentence-by-sentence basis.
sentences = list(test_doc_.sents)
print(f"There are {len(sentences)} sentences in the document.")
There are 33 sentences in the document.
Sentence detection here is based on the full stop as the delimiter. We can use other tokens as sentence delimiters by adding a new text processing stage to the spaCy NLP pipeline.
@Language.component("custom_sentence_delimiters")
def custom_sentence_delimiters(doc: Doc) -> Doc:
    delimiters = ["..."]
    for token in doc[:-1]:
        # mark the token after each delimiter as the start of a new sentence
        if token.text in delimiters:
            doc[token.i + 1].is_sent_start = True
    return doc
nlp.add_pipe("custom_sentence_delimiters", before="parser")
test_text = "This is a sentence... with... customized ... delimiters."
[sent for sent in nlp(test_text).sents]
[This is a sentence..., with..., customized, ..., delimiters.]
Tokens¶
After sentences are detected, they are broken down into tokens.
tokens = [token for token in nlp(test_text)]
tokens
[This, is, a, sentence, ..., with, ..., customized, ..., delimiters, .]
A token is an object with many attributes.
first_token = tokens[0]
print(f"token type = {type(first_token)}")
print(f"token index = {first_token.idx}")
token type = <class 'spacy.tokens.token.Token'>
token index = 0
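Beyond type and index, tokens expose many more attributes; a brief sketch using attributes that are standard in spaCy (output will vary with the token):
for token in tokens[:3]:
    # text, document offset, stop-word flag, punctuation flag, orthographic shape
    print(token.text, token.idx, token.is_stop, token.is_punct, token.shape_)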
Custom tokenizers can be created via nlp.tokenizer = spacy.tokenizer.Tokenizer(...) - see the spaCy docs for more info.
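As a minimal sketch (illustrative, not this notebook's code): constructing a Tokenizer with no rules yields whitespace-only splitting. A fresh model instance is used here so the main nlp pipeline is left untouched.
from spacy.tokenizer import Tokenizer

# A minimal sketch: with no prefix/suffix/infix rules or exceptions,
# the Tokenizer splits on whitespace only.
nlp_ws = spacy.load("en_core_web_sm")
nlp_ws.tokenizer = Tokenizer(nlp_ws.vocab)
[token.text for token in nlp_ws("Don't tokenize this... please")]
# -> ["Don't", 'tokenize', 'this...', 'please']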
Stop words and punctuation can be removed by filtering on token attributes.
[token for token in tokens if not (token.is_stop or token.is_punct)]
[sentence, customized, delimiters]
Tokens also contain an attribute for the lemma of a word.
[token.lemma_ for token in tokens if not token.is_stop and not token.is_punct]
['sentence', 'customize', 'delimiter']
From Tokens to Word Counts¶
doc_tokens = list(test_doc_)
word_freq = Counter(
    [token.lemma_.lower() for token in doc_tokens
     if not token.is_stop and token.is_alpha]
)
word_freq.most_common(10)
[('fidelity', 10), ('investment', 9), ('personal', 7), ('investing', 7), ('lgim', 6), ('customer', 6), ('international', 4), ('uk', 4), ('business', 4), ('platform', 4)]
Part of Speech Tagging¶
All documents that have been through the spaCy pipeline have been Part-of-Speech (POS) tagged; the results can be accessed via a token's attributes.
for token in doc_tokens[:5]:
    print(f"{token.text}|{token.tag_}|{token.pos_}|{spacy.explain(token.tag_)}|")
Fidelity|NNP|PROPN|noun, proper singular|
International|NNP|PROPN|noun, proper singular|
acquires|VBZ|VERB|verb, 3rd person singular present|
LGIM|NNP|PROPN|noun, proper singular|
’s|POS|PART|possessive ending|
Rules-Based Matching¶
You can think of this as an enhanced regex that matches on token attributes, such as POS tags, rather than on raw characters.
def extract_full_name(doc: Doc) -> List[str]:
    matcher = Matcher(nlp.vocab)
    patterns = [[{"POS": "PROPN"}, {"POS": "PROPN"}]]
    matcher.add("FULL_NAME", patterns)
    return [doc[start:end].text for match_id, start, end in matcher(doc)]
Counter(extract_full_name(test_doc_))
Counter({'Fidelity International': 3, 'Michael Klimes': 1, 'General Investment': 1, 'Investment Management': 1, 'Personal Investing': 2, 'Stuart Welch': 1, 'Cavendish Online': 1, 'Online Investments': 1, 'Investments Limited': 1, 'Michelle Scrimgeour': 1})
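Patterns can also carry quantifiers via the OP key, much like regex operators; a hedged sketch (the pattern and sentence are illustrative, not from the notebook):
matcher = Matcher(nlp.vocab)
# an optional adjective followed by a noun; "?" makes the adjective optional
matcher.add("ADJ_NOUN", [[{"POS": "ADJ", "OP": "?"}, {"POS": "NOUN"}]])
doc = nlp("The new platform serves customers.")
[doc[start:end].text for _, start, end in matcher(doc)]
# e.g. ['new platform', 'platform', 'customers']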
Phrase Detection¶
Noun phrases (chunks) can be extracted automatically by spaCy.
noun_chunks = [chunk for chunk in test_doc_.noun_chunks]
noun_chunks[:5]
[Fidelity International, LGIM’s UK personal investing arm, Michael Klimes, 23rd, October]
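Each chunk is a Span whose root token ties it back to the dependency parse; a quick sketch using standard Span attributes:
for chunk in noun_chunks[:3]:
    # chunk text, head word of the chunk, and its dependency relation
    print(f"{chunk.text}|{chunk.root.text}|{chunk.root.dep_}")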
Named Entity Recognition¶
for ent in test_doc_.ents:
    print(f"{ent.text}|{ent.label_}|{spacy.explain(ent.label_)}")
Fidelity International|ORG|Companies, agencies, institutions, etc.
LGIM|ORG|Companies, agencies, institutions, etc.
UK|GPE|Countries, cities, states
Michael Klimes|PERSON|People, including fictional
Fidelity International|ORG|Companies, agencies, institutions, etc.
Legal & General Investment Management’s|ORG|Companies, agencies, institutions, etc.
UK|GPE|Countries, cities, states
Fidelity’s|ORG|Companies, agencies, institutions, etc.
UK|GPE|Countries, cities, states
almost 300,000|CARDINAL|Numerals that do not fall under another type
5.8bnin|MONEY|Monetary values, including unit
Fidelity’s Personal Investing|ORG|Companies, agencies, institutions, etc.
280,000|CARDINAL|Numerals that do not fall under another type
20.3bn|MONEY|Monetary values, including unit
the next 12 months|DATE|Absolute or relative dates or periods
Fidelity|ORG|Companies, agencies, institutions, etc.
today|DATE|Absolute or relative dates or periods
over 3,000|CARDINAL|Numerals that do not fall under another type
Isa, Sipp|ORG|Companies, agencies, institutions, etc.
Fidelity’s|ORG|Companies, agencies, institutions, etc.
daily|DATE|Absolute or relative dates or periods
Fidelity’s|ORG|Companies, agencies, institutions, etc.
Android|ORG|Companies, agencies, institutions, etc.
June|DATE|Absolute or relative dates or periods
Fidelity|ORG|Companies, agencies, institutions, etc.
Fidelity International|ORG|Companies, agencies, institutions, etc.
Stuart Welch|PERSON|People, including fictional
Cavendish Online Investments Limited|ORG|Companies, agencies, institutions, etc.
UK|GPE|Countries, cities, states
LGIM|ORG|Companies, agencies, institutions, etc.
Michelle Scrimgeour|PERSON|People, including fictional
LGIM|ORG|Companies, agencies, institutions, etc.
two|CARDINAL|Numerals that do not fall under another type
Fidelity International’s|ORG|Companies, agencies, institutions, etc.
LGIM|ORG|Companies, agencies, institutions, etc.
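Entities are ordinary spans, so the Counter approach from the word-count section works here too; a quick sketch:
# tally entity labels across the document
Counter(ent.label_ for ent in test_doc_.ents).most_common(5)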