NLP Pipelines for Beginners¶
Basic NLP using spaCy models.
Imports and Configuration¶
from collections import Counter
from typing import List
import spacy
from spacy import Language
from spacy.matcher import Matcher
from spacy.tokens.doc import Doc
Load a Test Document¶
test_file = "data/fidelity_1.txt"
with open(test_file, "r") as file:
    test_doc = file.read()
test_doc[:100]
'Fidelity International acquires LGIM’s UK personal investing arm\nBy Michael Klimes 23rd October 2020'
Load the spaCy Language Model¶
A spaCy language model can be thought of as a pipeline of text processing stages that maps documents into tokens and their annotations (attributes of token objects). For full details, see the spaCy docs.
nlp = spacy.load("en_core_web_sm")
test_doc_ = nlp(test_doc)
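The stages in the loaded pipeline can be listed via nlp.pipe_names; a quick sketch (the component list shown is typical of en_core_web_sm v3 and may vary by version):
print(nlp.pipe_names)
# e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']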
Sentences¶
Documents can be processed on a sentence-by-sentence basis.
sentences = list(test_doc_.sents)
print(f"There are {len(sentences)} sentences in the document.")
There are 33 sentences in the document.
Sentence detection here is based on the full stop as the delimiter. We can use other tokens as sentence delimiters by adding a new text processing stage to the spaCy NLP pipeline.
@Language.component("custom_sentence_delimiters")
def custom_sentence_delimiters(doc: Doc) -> Doc:
    delimiters = ["..."]
    for token in doc[:-1]:
        # mark the token after each delimiter as the start of a new sentence
        if token.text in delimiters:
            doc[token.i + 1].is_sent_start = True
    return doc
nlp.add_pipe("custom_sentence_delimiters", before="parser")
test_text = "This is a sentence... with... customized ... delimiters."
[sent for sent in nlp(test_text).sents]
[This is a sentence..., with..., customized, ..., delimiters.]
Tokens¶
After sentences are detected, they are broken down into tokens.
tokens = [token for token in nlp(test_text)]
tokens
[This, is, a, sentence, ..., with, ..., customized, ..., delimiters, .]
A token is an object with many attributes.
first_token = tokens[0]
print(f"token type = {type(first_token)}")
print(f"token index = {first_token.idx}")
token type = <class 'spacy.tokens.token.Token'>
token index = 0
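Beyond type and index, tokens expose many more attributes; a brief sketch using attributes that are standard in spaCy (output will vary with the token):
for token in tokens[:3]:
    # text, document offset, stop-word flag, punctuation flag, orthographic shape
    print(token.text, token.idx, token.is_stop, token.is_punct, token.shape_)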
Custom tokenizers can be created via nlp.tokenizer = spacy.tokenizer.Tokenizer(...) - see the spaCy docs for more info.
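As a minimal sketch (illustrative, not this notebook's code): constructing a Tokenizer with no rules yields whitespace-only splitting. A fresh model instance is used here so the main nlp pipeline is left untouched.
from spacy.tokenizer import Tokenizer

# A minimal sketch: with no prefix/suffix/infix rules or exceptions,
# the Tokenizer splits on whitespace only.
nlp_ws = spacy.load("en_core_web_sm")
nlp_ws.tokenizer = Tokenizer(nlp_ws.vocab)
[token.text for token in nlp_ws("Don't tokenize this... please")]
# -> ["Don't", 'tokenize', 'this...', 'please']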
Stop words and punctuation can be removed by filtering on token attributes.
[token for token in tokens if not (token.is_stop or token.is_punct)]
[sentence, customized, delimiters]
Tokens also contain an attribute for the lemma of a word.
[token.lemma_ for token in tokens if not token.is_stop and not token.is_punct]
['sentence', 'customize', 'delimiter']
From Tokens to Word Counts¶
doc_tokens = list(test_doc_)
word_freq = Counter(
    [token.lemma_.lower() for token in doc_tokens
     if not token.is_stop and token.is_alpha]
)
word_freq.most_common(10)
[('fidelity', 10), ('investment', 9), ('personal', 7), ('investing', 7), ('lgim', 6), ('customer', 6), ('international', 4), ('uk', 4), ('business', 4), ('platform', 4)]
Part of Speech Tagging¶
All documents that have been through the spaCy pipeline have been Part-of-Speech (POS) tagged; the results can be accessed via a token's attributes.
for token in doc_tokens[:5]:
    print(f"{token.text}|{token.tag_}|{token.pos_}|{spacy.explain(token.tag_)}|")
Fidelity|NNP|PROPN|noun, proper singular|
International|NNP|PROPN|noun, proper singular|
acquires|VBZ|VERB|verb, 3rd person singular present|
LGIM|NNP|PROPN|noun, proper singular|
’s|POS|PART|possessive ending|
Rules-Based Matching¶
You can think of this as an enhanced regex that matches on token attributes, such as POS tags, rather than on raw characters.
def extract_full_name(doc: Doc) -> List[str]:
    matcher = Matcher(nlp.vocab)
    patterns = [[{"POS": "PROPN"}, {"POS": "PROPN"}]]
    matcher.add("FULL_NAME", patterns)
    return [doc[start:end].text for match_id, start, end in matcher(doc)]
Counter(extract_full_name(test_doc_))
Counter({'Fidelity International': 3, 'Michael Klimes': 1, 'General Investment': 1, 'Investment Management': 1, 'Personal Investing': 2, 'Stuart Welch': 1, 'Cavendish Online': 1, 'Online Investments': 1, 'Investments Limited': 1, 'Michelle Scrimgeour': 1})
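Patterns can also carry quantifiers via the OP key, much like regex operators; a hedged sketch (the pattern and sentence are illustrative, not from the notebook):
matcher = Matcher(nlp.vocab)
# an optional adjective followed by a noun; "?" makes the adjective optional
matcher.add("ADJ_NOUN", [[{"POS": "ADJ", "OP": "?"}, {"POS": "NOUN"}]])
doc = nlp("The new platform serves customers.")
[doc[start:end].text for _, start, end in matcher(doc)]
# e.g. ['new platform', 'platform', 'customers']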
Phrase Detection¶
Noun phrases (chunks) can be extracted automatically by spaCy.
noun_chunks = [chunk for chunk in test_doc_.noun_chunks]
noun_chunks[:5]
[Fidelity International, LGIM’s UK personal investing arm, Michael Klimes, 23rd, October]
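Each chunk is a Span whose root token ties it back to the dependency parse; a quick sketch using standard Span attributes:
for chunk in noun_chunks[:3]:
    # chunk text, head word of the chunk, and its dependency relation
    print(f"{chunk.text}|{chunk.root.text}|{chunk.root.dep_}")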
Named Entity Recognition¶
for ent in test_doc_.ents:
    print(f"{ent.text}|{ent.label_}|{spacy.explain(ent.label_)}")
Fidelity International|ORG|Companies, agencies, institutions, etc.
LGIM|ORG|Companies, agencies, institutions, etc.
UK|GPE|Countries, cities, states
Michael Klimes|PERSON|People, including fictional
Fidelity International|ORG|Companies, agencies, institutions, etc.
Legal & General Investment Management’s|ORG|Companies, agencies, institutions, etc.
UK|GPE|Countries, cities, states
Fidelity’s|ORG|Companies, agencies, institutions, etc.
UK|GPE|Countries, cities, states
almost 300,000|CARDINAL|Numerals that do not fall under another type
5.8bnin|MONEY|Monetary values, including unit
Fidelity’s Personal Investing|ORG|Companies, agencies, institutions, etc.
280,000|CARDINAL|Numerals that do not fall under another type
20.3bn|MONEY|Monetary values, including unit
the next 12 months|DATE|Absolute or relative dates or periods
Fidelity|ORG|Companies, agencies, institutions, etc.
today|DATE|Absolute or relative dates or periods
over 3,000|CARDINAL|Numerals that do not fall under another type
Isa, Sipp|ORG|Companies, agencies, institutions, etc.
Fidelity’s|ORG|Companies, agencies, institutions, etc.
daily|DATE|Absolute or relative dates or periods
Fidelity’s|ORG|Companies, agencies, institutions, etc.
Android|ORG|Companies, agencies, institutions, etc.
June|DATE|Absolute or relative dates or periods
Fidelity|ORG|Companies, agencies, institutions, etc.
Fidelity International|ORG|Companies, agencies, institutions, etc.
Stuart Welch|PERSON|People, including fictional
Cavendish Online Investments Limited|ORG|Companies, agencies, institutions, etc.
UK|GPE|Countries, cities, states
LGIM|ORG|Companies, agencies, institutions, etc.
Michelle Scrimgeour|PERSON|People, including fictional
LGIM|ORG|Companies, agencies, institutions, etc.
two|CARDINAL|Numerals that do not fall under another type
Fidelity International’s|ORG|Companies, agencies, institutions, etc.
LGIM|ORG|Companies, agencies, institutions, etc.
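Entities are ordinary spans, so the Counter approach from the word-count section works here too; a quick sketch:
# tally entity labels across the document
Counter(ent.label_ for ent in test_doc_.ents).most_common(5)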