Natural Language Processing - Chapter 3: Basic principles for NLP
N.L.P.
NATURAL LANGUAGE PROCESSING
Teacher: Lê Ngọc Tấn
Email: letan.dhcn@gmail.com
Blog:
Industrial University of Ho Chi Minh City (Trường Đại học Công nghiệp Tp. HCM)
Faculty of Information Technology
Chapter 3
Basic principles for NLP
NLP. p.2
POS – part of speech tagging (1)
The idea of parts of speech goes back at least to Aristotle
in the West (384-322 BCE).
Dionysius Thrax of Alexandria (c. 100 BCE) gave us the idea,
still with us today, that there are 8 traditional parts of speech:
– Thrax: noun, verb, article, adverb, preposition,
conjunction, participle, pronoun
– School grammar: noun, verb, adjective, adverb,
preposition, conjunction, interjection, pronoun
POS – part of speech examples for English
Tag    Class         Examples
N      noun          chair, bandwidth, pacing
V      verb          study, debate, munch
ADJ    adjective     purple, tall, ridiculous
ADV    adverb        unfortunately, slowly
P      preposition   of, by, to
PRO    pronoun       I, me, mine
DET    determiner    the, a, that, those
CONJ   conjunction   and, or
Open vs. Closed Classes
– Closed:
• determiners: a, an, the
• pronouns: she, he, I
• prepositions: on, under, over, near, by,
• Why “closed”? Membership is relatively fixed: new words are rarely added to these classes.
– Open:
• Nouns, Verbs, Adjectives, Adverbs.
POS – part of speech tagging (2)
Words often have more than one POS:
Ex:
the back door : JJ (Adjective)
on my back : NN (noun)
promised to back the bill : VB (verb, base form)
But within a given sentence, each occurrence of a word has
exactly one POS tag.
To do POS tagging, we need to choose a standard
set of tags to work with, e.g. Penn TreeBank,
Brown,
POS tagsets
There are many POS tagsets
MINIPAR
PENN TREE BANK: 36 tags, not counting punctuation tags
VCL – VIETNAMESE COMPUTATIONAL LINGUISTICS
(see A.3.1, page 373 and A.4, page 378 of the textbook)
Penn TreeBank POS tagset
POS – part of speech tagging (3)
Definition: The process of assigning a part-of-speech or
lexical class marker to each word in a corpus.
The POS tagging problem is to decide on the correct
POS tag for a particular instance of a word
Example:
Input: I can can a can
Ambiguity (candidate tags for each word):
– I: pronoun
– can: auxiliary | verb | noun
– can: auxiliary | verb | noun
– a: determiner
– can: auxiliary | verb | noun
Output: I/pronoun can/auxiliary can/verb a/determiner can/noun
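To make the size of the ambiguity concrete, here is a minimal Python sketch that enumerates every candidate tag sequence for the example sentence. The toy lexicon and the function name `candidate_taggings` are assumptions made for illustration; they are not part of the course material.

```python
from itertools import product

# Hypothetical toy lexicon: each word maps to the tags it may take.
LEXICON = {
    "i": ["pronoun"],
    "can": ["auxiliary", "verb", "noun"],
    "a": ["determiner"],
}

def candidate_taggings(sentence):
    """Return every possible tag sequence for a whitespace-tokenized sentence."""
    words = sentence.lower().split()
    options = [LEXICON[w] for w in words]
    # The Cartesian product of the per-word options gives all sequences.
    return [list(zip(words, tags)) for tags in product(*options)]

taggings = candidate_taggings("I can can a can")
print(len(taggings))  # 1 * 3 * 3 * 1 * 3 = 27 candidate sequences
```

A tagger has to pick one of these 27 sequences; the methods listed on the next slide differ in how they make that choice.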
POS – Methods of tagging
HMM: Hidden Markov Model
Rules-based
Maximum entropy
Neural network
Decision tree
Transformation-Based Learning (TBL), an efficient and
widely used method
– fast-TBL method
– fTBL-toolkit
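As a rough illustration of the transformation-based idea, the Python sketch below first gives each word its most likely tag and then applies an ordered list of correction rules. Everything here is an assumption for illustration: the most-likely-tag table and the two rules are written by hand, whereas a real TBL system learns its rules from a tagged corpus, and this is not the fTBL toolkit's interface.

```python
# Hypothetical most-likely tag per word (a real system would estimate
# these from a tagged corpus).
MOST_LIKELY_TAG = {"i": "pronoun", "can": "auxiliary", "a": "determiner"}

# Hand-written transformation rules: (from_tag, to_tag, required previous tag).
# In real TBL these rules are learned automatically, not written by hand.
RULES = [
    ("auxiliary", "verb", "auxiliary"),   # an auxiliary after an auxiliary becomes a verb
    ("auxiliary", "noun", "determiner"),  # an auxiliary after a determiner becomes a noun
]

def tbl_tag(words):
    """Tag words with an initial guess, then apply each rule in order."""
    tags = [MOST_LIKELY_TAG[w.lower()] for w in words]
    for from_tag, to_tag, prev_tag in RULES:
        before = list(tags)  # each rule is matched against the tags as they were before it ran
        for i in range(1, len(tags)):
            if before[i] == from_tag and before[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip(words, tags))

result = tbl_tag("I can can a can".split())
print(result)
```

With these two assumed rules the initial tagging (every "can" tagged as auxiliary) is corrected to pronoun, auxiliary, verb, determiner, noun.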
POS tagging – Evaluation
So once you have your POS tagger running, how
do you evaluate it?
– Overall error rate with respect to a gold-standard test
set.
– Error rates on particular tags
– Error rates on particular words
– Tag confusions...
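The measures listed above can be computed from two aligned tag lists. The following Python sketch assumes the gold and predicted tags are already aligned token by token; the function name `evaluate` and the five-token example are made up for illustration.

```python
from collections import Counter

def evaluate(gold, predicted):
    """Compare two equal-length tag lists and return error statistics."""
    assert len(gold) == len(predicted)
    pairs = list(zip(gold, predicted))
    errors = sum(g != p for g, p in pairs)
    per_tag_total = Counter(gold)
    per_tag_errors = Counter(g for g, p in pairs if g != p)
    # Confusions count how often gold tag g was mistaken for predicted tag p.
    confusions = Counter((g, p) for g, p in pairs if g != p)
    return {
        "error_rate": errors / len(gold),
        "tag_error_rates": {t: per_tag_errors[t] / n for t, n in per_tag_total.items()},
        "confusions": confusions,
    }

# Made-up example: the third token should be VB but was tagged MD.
gold = ["PRP", "MD", "VB", "DT", "NN"]
pred = ["PRP", "MD", "MD", "DT", "NN"]
report = evaluate(gold, pred)
print(report["error_rate"])        # overall error rate
print(report["tag_error_rates"])   # error rate per gold tag
print(report["confusions"])        # (gold, predicted) confusion counts
```

Error rates on particular words can be obtained the same way by counting over (word, gold, predicted) triples instead of tag pairs.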
The result is compared with a manually coded
“Gold Standard”
– Typically accuracy reaches 96-97%
– This may be compared with result for a baseline
tagger (one that uses no context).
Important: 100% is impossible even for human
annotators.
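A baseline tagger that uses no context can be sketched as follows: each word always receives the tag it had most often in the training data. The tiny training set, the default tag for unknown words, and the function names are assumptions for illustration only.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Map each word to the tag it received most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}

def baseline_tag(model, words, default="NN"):
    """Tag each word independently; unknown words get the default tag."""
    return [model.get(w.lower(), default) for w in words]

def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Made-up training data using Penn TreeBank style tags.
train = [
    [("the", "DT"), ("back", "JJ"), ("door", "NN")],
    [("on", "IN"), ("my", "PRP$"), ("back", "NN")],
    [("his", "PRP$"), ("back", "NN"), ("hurts", "VBZ")],
]
model = train_baseline(train)

test_words = ["the", "back", "door"]
gold = ["DT", "JJ", "NN"]
pred = baseline_tag(model, test_words)
print(pred, accuracy(gold, pred))
```

Because "back" is most often a noun in this toy training set, the baseline mistags the adjective use in "the back door"; a context-aware tagger is expected to beat this score.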
Parsing
Definition : Parsing is the process of analyzing a text,
made of a sequence of tokens, to determine its
grammatical structure with respect to a given formal
grammar
Syntactic analysis
There are two techniques in parsing:
– Top–Down
– Bottom–Up
Parsing relies on a grammar formalism, for example:
– Context Free Grammar (CFG)
– Probabilistic Context Free Grammar (PCFG)
– Lexical Functional Grammar (LFG)
Grammar
Simple Grammar
Sentence types
Declaratives: A plane left.
S → NP VP
Imperatives: Leave!
S → VP
Yes-No Questions: Did the plane leave?
S → Aux NP VP
WH Questions: When did the plane leave?
S → WH-NP Aux NP VP
Derivations
A derivation is a sequence of rules applied to a
string that accounts for that string
– Covers all the elements in the string
– Covers only the elements in the string
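A derivation can be traced step by step by rewriting the leftmost occurrence of a rule's left-hand side. The Python sketch below does this for the declarative "a plane left"; the lexical rules (Det, Noun, Verb) and the function name `derive` are assumptions added for illustration.

```python
def derive(start, steps):
    """Apply each (lhs, rhs) rule to the leftmost occurrence of lhs.

    Returns the list of sentential forms, from the start symbol to the
    final string. Raises ValueError if a rule cannot be applied.
    """
    form = [start]
    history = [list(form)]
    for lhs, rhs in steps:
        i = form.index(lhs)        # leftmost occurrence of the symbol
        form[i:i + 1] = rhs        # replace it with the rule's right-hand side
        history.append(list(form))
    return history

# Assumed rule sequence for the declarative sentence "a plane left".
steps = [
    ("S", ["NP", "VP"]),
    ("NP", ["Det", "Noun"]),
    ("Det", ["a"]),
    ("Noun", ["plane"]),
    ("VP", ["Verb"]),
    ("Verb", ["left"]),
]
history = derive("S", steps)
for form in history:
    print(" ".join(form))
```

The final form contains all the words of the string and nothing else, which is exactly the two coverage conditions stated above.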
Top-Down Search
Since we’re trying to find trees rooted with an S
(Sentence), why not start with the rules that give
us an S?
Then we can work our way down from there to
the words.
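Top-down search can be sketched as a small recursive-descent recognizer: start from S, expand rules, and check the predicted categories against the words. The grammar and lexicon below are toy assumptions (the slide's own sample grammar is not reproduced here), and the grammar deliberately has no left-recursive rules, which this simple strategy cannot handle.

```python
# Hypothetical toy grammar and lexicon, for illustration only.
GRAMMAR = {
    "S": [["NP", "VP"], ["Aux", "NP", "VP"], ["VP"]],
    "NP": [["Det", "Noun"]],
    "VP": [["Verb", "NP"], ["Verb"]],
}
LEXICON = {
    "Det": {"that", "the", "a"},
    "Noun": {"flight", "plane", "book"},
    "Verb": {"book", "left", "leave"},
    "Aux": {"did"},
}

def expand(symbol, words, pos):
    """Yield every position reachable after deriving `symbol` from words[pos:]."""
    if symbol in LEXICON:
        # Pre-terminal: it must match the next word.
        if pos < len(words) and words[pos] in LEXICON[symbol]:
            yield pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand_sequence(rhs, words, pos)

def expand_sequence(symbols, words, pos):
    """Yield every position reachable after deriving all symbols in order."""
    if not symbols:
        yield pos
        return
    for mid in expand(symbols[0], words, pos):
        yield from expand_sequence(symbols[1:], words, mid)

def recognize(sentence):
    """True if some expansion of S covers exactly the whole sentence."""
    words = sentence.lower().split()
    return any(end == len(words) for end in expand("S", words, 0))

print(recognize("Book that flight"))
print(recognize("Did the plane leave"))
print(recognize("flight that book"))
```

For "Book that flight" the recognizer first tries S → NP VP and S → Aux NP VP, fails on the first word, and succeeds with S → VP; that wasted effort on rules the input cannot support is the typical cost of top-down search.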
Sample grammar
Parsing “Book that flight”
Bottom-Up Parsing
Of course, we also want trees that cover the input
words. So we might also start with trees that link
up with the words in the right way.
Then we work our way up from there to larger and
larger trees.
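Bottom-up parsing can be sketched with a chart that is filled from single words up to the whole sentence, in the style of the CKY algorithm. The rules and lexicon below are toy assumptions restricted to unary and binary rules so that the chart logic stays short; this is a recognizer sketch, not a full parser.

```python
# Hypothetical toy grammar, split into binary and unary rules.
BINARY = {
    ("NP", "VP"): {"S"},
    ("Det", "Noun"): {"NP"},
    ("Verb", "NP"): {"VP"},
}
UNARY = {"VP": {"S"}, "Verb": {"VP"}}
LEXICON = {"book": {"Verb", "Noun"}, "that": {"Det"}, "flight": {"Noun"}}

def close_unary(symbols):
    """Add every category reachable from `symbols` through unary rules."""
    closed = set(symbols)
    agenda = list(symbols)
    while agenda:
        sym = agenda.pop()
        for parent in UNARY.get(sym, ()):
            if parent not in closed:
                closed.add(parent)
                agenda.append(parent)
    return closed

def bottom_up_recognize(sentence):
    """Fill a chart of categories per span, smallest spans first."""
    words = sentence.lower().split()
    n = len(words)
    chart = {}
    for i, word in enumerate(words):
        chart[(i, i + 1)] = close_unary(LEXICON.get(word, set()))
    for span in range(2, n + 1):
        for start in range(0, n - span + 1):
            end = start + span
            found = set()
            for mid in range(start + 1, end):
                # Combine every pair of adjacent smaller constituents.
                for left in chart[(start, mid)]:
                    for right in chart[(mid, end)]:
                        found |= BINARY.get((left, right), set())
            chart[(start, end)] = close_unary(found)
    return "S" in chart.get((0, n), set()), chart

ok, chart = bottom_up_recognize("Book that flight")
print(ok)
print(chart[(1, 3)])  # categories found for "that flight"
```

Note that the chart also records an S over the single word "book"; bottom-up search builds such constituents even when they can never be part of a complete tree, which is the mirror image of the wasted work in top-down search.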
Parsing “Book that flight”
Ambiguity
Morphology, morphemes, stems and lemma
Morphology is the study of the internal structure of words
and of how word forms are built
Ex: book, books, house, house-hold
A morpheme is the smallest meaningful unit of a word
(it need not coincide with a syllable)
Ex:
information → inform-ation
reading → read-ing
Stemming: reducing a word to its stem by stripping
affixes
Ex (word → stem):
pretties → pretty
useful → use
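A very small suffix-stripping stemmer can reproduce the two examples above. The rule list and the minimum stem length below are assumptions chosen for illustration; real stemmers such as the Porter stemmer use many more rules and conditions and may produce different stems.

```python
# Hypothetical suffix rules, tried in order: (suffix, replacement).
SUFFIX_RULES = [("ies", "y"), ("ful", ""), ("ing", ""), ("s", "")]

def stem(word, min_stem_length=2):
    """Strip the first matching suffix, keeping at least min_stem_length letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_length:
            return word[:-len(suffix)] + replacement
    return word

for w in ["pretties", "useful", "reading", "books"]:
    print(w, "->", stem(w))
```

The stemmer looks only at the surface form of the word: it uses no dictionary and no part-of-speech information, so it is fast but can return strings that are not real words.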
Lemmatization: grouping together the different inflected
forms of a word so they can be analyzed as a single item, the lemma
Ex (word → lemma):
going → go
reading → read
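Unlike suffix stripping alone, lemmatization normally consults a vocabulary and the word's part of speech. The Python sketch below assumes a tiny vocabulary, a table of irregular forms, and per-POS suffix rules; all of these tables and the function name `lemmatize` are hypothetical and only serve to illustrate the idea.

```python
# Hypothetical resources, for illustration only.
IRREGULAR = {("went", "verb"): "go", ("mice", "noun"): "mouse"}
SUFFIX_RULES = {
    "verb": [("ing", ""), ("ed", ""), ("s", "")],
    "noun": [("s", "")],
}
VOCABULARY = {"go", "read", "book", "house"}

def lemmatize(word, pos):
    """Return the lemma of `word` for the given part of speech."""
    word = word.lower()
    if (word, pos) in IRREGULAR:
        return IRREGULAR[(word, pos)]
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[:-len(suffix)] + replacement
            # Accept the candidate only if it is a known dictionary form.
            if candidate in VOCABULARY:
                return candidate
    return word

print(lemmatize("going", "verb"))
print(lemmatize("reading", "verb"))
print(lemmatize("reading", "noun"))
print(lemmatize("went", "verb"))
```

The same surface form can have different lemmas depending on its tag: "reading" as a verb maps to "read", while as a noun it stays "reading". This is one reason lemmatization is usually run after POS tagging.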
Lexicology, syntax and semantics
Lexicography is a branch of linguistics concerned with compiling lexical resources:
– Thesaurus
– Lexicon
– Dictionaries
– Encyclopedia
Syntax
Semantics
Etymology
Study of the origin of words and of how words are built from morphemes
Example:
anti (morpheme) + poison (morpheme)
= antipoison (word)
Classification of words structure:
– Simple word. Ex: book, boy, sister
– Complex/derived word. Ex: babysitter
– Compound word. Ex: tall-boy , swimming-pool