NLP

NLP Basics (2) - Text Preprocessing with spaCy

Basic Concepts of Text Preprocessing

Posted by PCLiu on April 25, 2024

In this article, we will explore the basics of the spaCy library and how to use it for text preprocessing. spaCy is a Python library for natural language processing that provides many features, including part-of-speech tagging, named entity recognition, and syntactic parsing. Here we will use spaCy for preprocessing tasks such as tokenization, part-of-speech tagging, and named entity recognition.

Installation of spaCy

First, install the spaCy library:

pip install spacy

Once the installation is complete, download a spaCy model:

python -m spacy download en_core_web_sm
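
Optionally, you can check that the installed models are compatible with your spaCy version (a sanity check, not required for the rest of this article):

python -m spacy validate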

The Doc Object for Processed Text

We first create a spaCy object called nlp, which bundles nearly all of spaCy's functionality. We then use this object to process text, producing a doc object that contains the processed text. en_core_web_sm is one of spaCy's models; it contains an English vocabulary and grammar rules. To process other languages, you can download the corresponding model from the spaCy website.

import spacy

nlp = spacy.load("en_core_web_sm")
introduction_doc = nlp("In 1991, the World Wide Web was born. It was a medium for sharing information.")
print([token.text for token in introduction_doc])

The output is:

['In', '1991', ',', 'the', 'World', 'Wide', 'Web', 'was', 'born', '.', 'It', 'was', 'a', 'medium', 'for', 'sharing', 'information', '.']

Here spaCy has already split the text into tokens for us, and each token's content is available through token.text.
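
The same workflow applies to other languages once the corresponding model is installed. As an illustrative sketch, spaCy publishes zh_core_web_sm for Chinese (assuming it has been downloaded first):

# run once beforehand: python -m spacy download zh_core_web_sm
import spacy

zh_nlp = spacy.load("zh_core_web_sm")
zh_doc = zh_nlp("這是一個簡單的例子。")
print([token.text for token in zh_doc])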

The Doc Object for Processed Text (Passing a .txt File)

The same approach also works for .txt files:

import pathlib

file_name = "introduction.txt"
introduction_doc = nlp(pathlib.Path(file_name).read_text(encoding="utf-8"))
print([token.text for token in introduction_doc])

Sentence Detection

spaCy can also detect sentence boundaries; the sentences of a doc are available through doc.sents:

about_text = (
      "Gus Proto is a Python developer currently"
      " working for a London-based Fintech"
      " company. He is interested in learning"
      " Natural Language Processing."
)
about_doc = nlp(about_text)
sentences = list(about_doc.sents)
print(sentences)

Output:

[Gus Proto is a Python developer currently working for a London-based Fintech company.,  
 He is interested in learning Natural Language Processing.]
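
By default, sentence boundaries come from the model, but spaCy also lets you adjust them with a custom pipeline component. Below is a minimal sketch, assuming we want an ellipsis ("...") to start a new sentence; the component must be registered before the parser so that the boundaries can still be modified:

from spacy.language import Language

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    # mark the token after each "..." as the start of a new sentence
    for token in doc[:-1]:
        if token.text == "...":
            doc[token.i + 1].is_sent_start = True
    return doc

custom_nlp = spacy.load("en_core_web_sm")
custom_nlp.add_pipe("set_custom_boundaries", before="parser")
ellipsis_doc = custom_nlp("Gus, can you, ... never mind, I forgot what I was saying.")
print(list(ellipsis_doc.sents))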

Tokens in spaCy

Through token.idx we can get each token's starting position (character offset) in the original text:

print(f"{'Token':12} {'Start Index'}")
for token in about_doc:
    print(f"{token.text:12} {token.idx}")

Output:

Token        Start Index
Gus          0
Proto        4
is           10
a            13
Python       15
developer    22
currently    32
working      42
for          50
a            54
London       56
-            62
based        63
Fintech      69
company      77
.            84
He           86
is           89
interested   92
in           103
learning     106
Natural      115
Language     123
Processing   132
.            142
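
Since token.idx is a character offset into the original string, it can also be used to slice the raw text directly. A quick sketch reusing about_text and about_doc from above:

for token in about_doc[:3]:
    # recover each token's surface form from the raw string via its offset
    print(about_text[token.idx : token.idx + len(token.text)])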

spaCy can also tell us whether a token consists only of alphabetic characters, whether it is punctuation, whether it is a stop word, and so on (stop words are covered in the next section):

print(
    f"{'Text with Whitespace':25}"
    f"{'Is Alphanumeric?':20}"
    f"{'Is Punctuation?':20}"
    f"{'Is Stop Word?'}"
)

for token in about_doc:
    print(
        f"{str(token.text_with_ws):25}"
        f"{str(token.is_alpha):20}"
        f"{str(token.is_punct):20}"
        f"{str(token.is_stop)}"
    )

Output:

Text with Whitespace     Is Alphanumeric?    Is Punctuation?     Is Stop Word?
Gus                      True                False               False
Proto                    True                False               False
is                       True                False               True
a                        True                False               True
Python                   True                False               False
developer                True                False               False
currently                True                False               False
working                  True                False               False
for                      True                False               True
a                        True                False               True
London                   True                False               False
-                        False               True                False
based                    True                False               False
Fintech                  True                False               False
company                  True                False               False
.                        False               True                False
He                       True                False               True
is                       True                False               True
interested               True                False               False
in                       True                False               True
learning                 True                False               False
Natural                  True                False               False
Language                 True                False               False
Processing               True                False               False
.                        False               True                False

Stop Words

Stop words are words that are usually ignored in text processing (removing them has little effect on the overall meaning), such as is, the, and a. spaCy provides a STOP_WORDS set that we can use to check whether a token is a stop word.

Here we print 10 stop words from the set (the set is unordered, so which 10 you get is arbitrary):

spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS
for stop_word in list(spacy_stopwords)[:10]:
    print(stop_word)

Output:

made
we
will
nothing
afterwards
thru
alone
much
side
's

These are all common English stop words.
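
The full set contains a few hundred words (the exact count depends on your spaCy version):

print(len(spacy_stopwords))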

Through token.is_stop we can check whether a token is a stop word, and we can use this attribute to filter stop words out:

custom_about_text = (
     "Gus Proto is a Python developer currently"
     " working for a London-based Fintech"
     " company. He is interested in learning"
     " Natural Language Processing."
 )

nlp = spacy.load("en_core_web_sm")
about_doc = nlp(custom_about_text)
print([token for token in about_doc if not token.is_stop])

Output:

[Gus, Proto, Python, developer, currently, working, London, -, based, Fintech, company, ., interested, learning, Natural, Language, Processing, .]

As you can see, stop words such as is and a have been filtered out.
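
You can also extend the default stop word list yourself. A minimal sketch, assuming we want to treat the (hypothetical) filler word btw as a stop word; note that both the default set and the flag cached on the vocabulary entry need updating:

nlp.Defaults.stop_words.add("btw")  # extend the default stop word set
nlp.vocab["btw"].is_stop = True     # update the flag cached on the lexeme

doc = nlp("btw Gus is a Python developer.")
print([token.text for token in doc if not token.is_stop])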

Lemmatization

Lemmatization is the process of reducing a word to its base form, e.g., running becomes run. spaCy provides the lemma_ attribute for accessing a token's base form.

conference_help_text = (
     "Gus is helping organize a developer"
     " conference on Applications of Natural Language"
     " Processing. He keeps organizing local Python meetups"
     " and several internal talks at his workplace."
 )

nlp = spacy.load("en_core_web_sm")
conference_help_doc = nlp(conference_help_text)

for token in conference_help_doc:
    if str(token) != str(token.lemma_):  # only print when the lemma differs from the original token
        print(f"{str(token):>20} : {str(token.lemma_)}")

Output:

                  is : be
                  He : he
               keeps : keep
          organizing : organize
             meetups : meetup
               talks : talk

As you can see, is is reduced to be, keeps is reduced to keep, and so on.

Word Frequency

We can use Counter to count how many times each word appears in the text. Before counting, we filter out stop words and punctuation:

from collections import Counter
nlp = spacy.load("en_core_web_sm")
complete_text = (
     "Gus Proto is a Python developer currently"
     " working for a London-based Fintech company. He is"
     " interested in learning Natural Language Processing."
     " There is a developer conference happening on 21 July"
     ' 2019 in London. It is titled "Applications of Natural'
     ' Language Processing". There is a helpline number'
     " available at +44-1234567891. Gus is helping organize it."
     " He keeps organizing local Python meetups and several"
     " internal talks at his workplace. Gus is also presenting"
     ' a talk. The talk will introduce the reader about "Use'
     ' cases of Natural Language Processing in Fintech".'
     " Apart from his work, he is very passionate about music."
     " Gus is learning to play the Piano. He has enrolled"
     " himself in the weekend batch of Great Piano Academy."
     " Great Piano Academy is situated in Mayfair or the City"
     " of London and has world-class piano instructors."
 )
complete_doc = nlp(complete_text)

words = [
     token.text
     for token in complete_doc
     if not token.is_stop and not token.is_punct # filter out stop words and punctuation
]

print(Counter(words).most_common(5))

Output:

[('Gus', 4), ('London', 3), ('Natural', 3), ('Language', 3), ('Processing', 3)]

As you can see, Gus appears 4 times, while London, Natural, Language, and Processing each appear 3 times.
From this result alone we can tell that the text is about Gus, London, and Natural Language Processing.
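
For comparison, counting without removing stop words shows why the filtering matters: the top entries are then dominated by function words like is and a, which say little about the topic. A quick sketch reusing complete_doc:

words_all = [token.text for token in complete_doc if not token.is_punct]
print(Counter(words_all).most_common(5))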

Part-of-Speech Tagging

Part-of-speech (POS) tagging labels each token in the text with its part of speech, such as noun, verb, or adjective. spaCy provides the pos_ attribute for accessing a token's part of speech.

Eight common parts of speech are:

Noun
Pronoun
Adjective
Verb
Adverb
Preposition
Conjunction
Interjection

Here is a simple example:

about_text = (
     "Gus Proto is a Python developer currently"
     " working for a London-based Fintech"
     " company. He is interested in learning"
     " Natural Language Processing."
 )

about_doc = nlp(about_text)

for token in about_doc:
    print(
        f"""
        TOKEN: {str(token)}
        =====
        POS: {token.pos_}
        EXPLANATION: {spacy.explain(token.tag_)}"""
        )

Output (first five tokens shown):

        TOKEN: Gus
        =====
        POS: PROPN
        EXPLANATION: noun, proper singular

        TOKEN: Proto
        =====
        POS: PROPN
        EXPLANATION: noun, proper singular

        TOKEN: is
        =====
        POS: AUX
        EXPLANATION: verb, 3rd person singular present

        TOKEN: a
        =====
        POS: DET
        EXPLANATION: determiner

        TOKEN: Python
        =====
        POS: PROPN
        EXPLANATION: noun, proper singular
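
A common follow-up to POS tagging is collecting tokens by part of speech. A small sketch reusing about_doc, gathering nouns and adjectives into separate lists:

nouns = []
adjectives = []
for token in about_doc:
    if token.pos_ == "NOUN":
        nouns.append(token)
    elif token.pos_ == "ADJ":
        adjectives.append(token)

print(nouns)
print(adjectives)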

Processing Functions

Below is an example that combines these steps into a text preprocessing pipeline with spaCy: tokenization, filtering out stop words and punctuation, and lemmatization.

nlp = spacy.load("en_core_web_sm")
complete_text = (
     "Gus Proto is a Python developer currently"
     " working for a London-based Fintech company. He is"
     " interested in learning Natural Language Processing."
     " There is a developer conference happening on 21 July"
     ' 2019 in London. It is titled "Applications of Natural'
     ' Language Processing". There is a helpline number'
     " available at +44-1234567891. Gus is helping organize it."
     " He keeps organizing local Python meetups and several"
     " internal talks at his workplace. Gus is also presenting"
     ' a talk. The talk will introduce the reader about "Use'
     ' cases of Natural Language Processing in Fintech".'
     " Apart from his work, he is very passionate about music."
     " Gus is learning to play the Piano. He has enrolled"
     " himself in the weekend batch of Great Piano Academy."
     " Great Piano Academy is situated in Mayfair or the City"
     " of London and has world-class piano instructors."
 )

complete_doc = nlp(complete_text)

def is_token_allowed(token):
    # keep tokens that are non-empty, not stop words, and not punctuation
    return bool(token and str(token).strip() and not token.is_stop and not token.is_punct)

def preprocess_token(token):
    # normalize a token to its lowercase lemma
    return token.lemma_.strip().lower()

complete_filtered_tokens = [
     preprocess_token(token)
     for token in complete_doc
     if is_token_allowed(token)
]

print(complete_filtered_tokens)

Output:

['gus', 'proto', 'python', 'developer', 'currently', 'work', 'london', 'base', 'fintech', 'company', 'interested', 'learn', 'natural', 'language', 'processing', 'developer', 'conference', 'happen', '21', 'july', '2019', 'london', 'title', 'application', 'natural', 'language', 'processing', 'helpline', 'number', 'available', '+44', '1234567891', 'gus', 'helping', 'organize', 'keep', 'organize', 'local', 'python', 'meetup', 'internal', 'talk', 'workplace', 'gus', 'present', 'talk', 'talk', 'introduce', 'reader', 'use', 'case', 'natural', 'language', 'processing', 'fintech', 'apart', 'work', 'passionate', 'music', 'gus', 'learn', 'play', 'piano', 'enrol', 'weekend', 'batch', 'great', 'piano', 'academy', 'great', 'piano', 'academy', 'situate', 'mayfair', 'city', 'london', 'world', 'class', 'piano', 'instructor']
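
Because the tokens are now lemmatized, a frequency count over complete_filtered_tokens groups inflected forms together (for example, learning and learn both count toward learn), which is usually what you want for topic analysis:

from collections import Counter

print(Counter(complete_filtered_tokens).most_common(5))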

Rule-based Matching

Rule-based matching finds spans of text by defining patterns over token attributes. spaCy provides the Matcher class for defining such rules. Here is a simple example:

Say we want to match Gus Proto as a pattern called FULL_NAME. Since Gus and Proto are both tagged PROPN, we can build the pattern on that property:

from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

about_text = (
     "Gus Proto is a Python developer currently"
     " working for a London-based Fintech"
     " company. He is interested in learning"
     " Natural Language Processing."
 )

about_doc = nlp(about_text)

matcher = Matcher(nlp.vocab)

def extract_full_name(nlp_doc):
    names = []
    pattern = [{"POS": "PROPN"}, {"POS": "PROPN"}]  # two consecutive proper nouns (PROPN)
    matcher.add("FULL_NAME", [pattern])
    matches = matcher(nlp_doc)
    for _, start, end in matches:
        span = nlp_doc[start:end]
        names.append(span.text)
    return names

print(extract_full_name(about_doc))

Output:

['Gus Proto', 'Natural Language', 'Language Processing']

As you can see, Gus Proto, Natural Language, and Language Processing were all matched, since each consists of two consecutive PROPN tokens.
One problem remains: the single phrase Natural Language Processing produced two overlapping matches, because Natural, Language, and Processing are all tagged PROPN.
If we only want to match actual person names like Gus Proto, we can instead use Named Entity Recognition, introduced next.
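
As an aside, spaCy's spacy.util.filter_spans utility removes overlapping spans (preferring longer spans), which is another way to deduplicate matcher results. A minimal sketch reusing the matcher from above:

from spacy.util import filter_spans

matches = matcher(about_doc)
spans = [about_doc[start:end] for _, start, end in matches]
print([span.text for span in filter_spans(spans)])  # overlapping spans removed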

Named Entity Recognition

Named entity recognition (NER) identifies named entities in a text, such as person names, place names, and organization names. spaCy provides the ent_type_ attribute for accessing a token's entity type.

All named entities in a text are available through doc.ents:

nlp = spacy.load("en_core_web_sm")

piano_class_text = (
     "Great Piano Academy is situated"
     " in Mayfair or the City of London and has"
     " world-class piano instructors."
 )

piano_class_doc = nlp(piano_class_text)
for ent in piano_class_doc.ents:
    print(
        f"""
        {ent.text = }
        {ent.start_char = }
        {ent.end_char = }
        {ent.label_ = }
        spacy.explain('{ent.label_}') = {spacy.explain(ent.label_)}"""
        )

Output:

        ent.text = 'Great Piano Academy'
        ent.start_char = 0
        ent.end_char = 19
        ent.label_ = 'ORG'
        spacy.explain('ORG') = Companies, agencies, institutions, etc.

        ent.text = 'Mayfair'
        ent.start_char = 35
        ent.end_char = 42
        ent.label_ = 'GPE'
        spacy.explain('GPE') = Countries, cities, states

        ent.text = 'the City of London'
        ent.start_char = 46
        ent.end_char = 64
        ent.label_ = 'GPE'
        spacy.explain('GPE') = Countries, cities, states

As you can see, Great Piano Academy is recognized as an ORG, while Mayfair and the City of London are recognized as GPE.
ORG stands for "Companies, agencies, institutions, etc." and GPE for "Countries, cities, states"; spacy.explain() returns the description of each label.

Here is a simple example that identifies the person names in a text and redacts them:

survey_text = (
    "Out of 5 people surveyed, James Robert,"
    " Julie Fuller and Benjamin Brooks like"
    " apples. Kelly Cox and Matthew Evans"
    " like oranges."
)

def replace_person_names(token):
    # ent_iob != 0 means the token carries an entity annotation
    if token.ent_iob != 0 and token.ent_type_ == "PERSON":
        return "[REDACTED] "
    return token.text_with_ws

def redact_names(nlp_doc):
    with nlp_doc.retokenize() as retokenizer:
        for ent in nlp_doc.ents:
            retokenizer.merge(ent)  # merge each entity into a single token
    tokens = map(replace_person_names, nlp_doc)
    return "".join(tokens)

survey_doc = nlp(survey_text)
print(redact_names(survey_doc))

Output:

Out of 5 people surveyed, [REDACTED] , [REDACTED] and [REDACTED] like apples. [REDACTED] and [REDACTED] like oranges.

As you can see, James Robert, Julie Fuller, Benjamin Brooks, Kelly Cox, and Matthew Evans have all been redacted.
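
spaCy also bundles the displaCy visualizer, which renders recognized entities with inline highlighting and is handy for inspecting NER results; the call below starts a small local web server:

from spacy import displacy

displacy.serve(piano_class_doc, style="ent")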

References

[1] Natural Language Processing With spaCy in Python by Taranjeet Singh
[2] spaCy official website

If you enjoyed this article, please stay tuned to my blog :)