Natural Language Processing - Chapter 3: Basic principles for NLP


N.L.P. NATURAL LANGUAGE PROCESSING
Teacher: Lê Ngọc Tấn
Email: letan.dhcn@gmail.com
Industrial University of Ho Chi Minh City (Trường Đại học Công nghiệp Tp. HCM), Faculty of Information Technology (Khoa Công nghệ thông tin)

Chapter 3: Basic principles for NLP

POS – part-of-speech tagging (1)
• In the West, the idea of parts of speech goes back perhaps as far as Aristotle (384-322 BCE).
• Dionysius Thrax of Alexandria (c. 100 BCE) gave us the idea, still in use today, that there are 8 traditional parts of speech:
  – Thrax: noun, verb, article, adverb, preposition, conjunction, participle, pronoun
  – School grammar: noun, verb, adjective, adverb, preposition, conjunction, interjection, pronoun

POS – part-of-speech examples for English
  Tag    Class         Examples
  N      noun          chair, bandwidth, pacing
  V      verb          study, debate, munch
  ADJ    adjective     purple, tall, ridiculous
  ADV    adverb        unfortunately, slowly
  P      preposition   of, by, to
  PRO    pronoun       I, me, mine
  DET    determiner    the, a, that, those
  CONJ   conjunction   and, or

Open vs. Closed Classes
– Closed:
  • determiners: a, an, the
  • pronouns: she, he, I
  • prepositions: on, under, over, near, by, ...
  • Why "closed"? Their membership is small and essentially fixed; new words are rarely added.
– Open:
  • nouns, verbs, adjectives, adverbs

POS – part-of-speech tagging (2)
• Words often have more than one POS. Ex:
  – the back door: JJ (adjective)
  – on my back: NN (noun)
  – promised to back the bill: VB (verb, base form)
• Within a given sentence, however, each occurrence of a word has only one POS tag.
• To do POS tagging, we need to choose a standard set of tags to work with, e.g. Penn Treebank or Brown.

POS tagsets
• There are many POS tagsets:
  – MINIPAR
  – Penn Treebank: 36 tags, not counting punctuation tags
  – VCL – Vietnamese Computational Linguistics (A.3.1, p. 373 and A.4, p. 378 of the textbook)

Penn Treebank POS tagset
[Figure: Penn Treebank POS tagset table]

POS – part-of-speech tagging (3)
• Definition: the process of assigning a part-of-speech or lexical class marker to each word in a corpus.
• The POS tagging problem is to decide on the correct POS tag for a particular instance of a word.
• Example:
  Input: I can can a can
  Ambiguity:
    I      pronoun
    can    auxiliary | verb | noun
    can    auxiliary | verb | noun
    a      determiner
    can    auxiliary | verb | noun
  Output: I/pronoun can/auxiliary can/verb a/determiner can/noun

POS – Methods of tagging
• HMM: Hidden Markov Model
• Rule-based
• Maximum entropy
• Neural network
• Decision tree
• Transformation-Based Learning (TBL), presented in this course as the most efficient and most widely used method
  – fast-TBL method
  – fTBL-toolkit

POS tagging – Evaluation
• Once your POS tagger is running, how do you evaluate it?
  – Overall error rate with respect to a gold-standard test set
  – Error rates on particular tags
  – Error rates on particular words
  – Tag confusions, ...
• The tagger's output is compared with a manually annotated "gold standard".
  – Accuracy typically reaches 96-97%.
  – This can be compared with the result of a baseline tagger (one that uses no context).
• Important: 100% accuracy is out of reach, even for human annotators.

Parsing
• Definition: parsing is the process of analyzing a text, made of a sequence of tokens, to determine its grammatical structure with respect to a given formal grammar.
• Parsing is also called syntactic analysis.
• There are two parsing techniques:
  – Top-down
  – Bottom-up
• The parsing step relies on a grammar formalism, such as:
  – Context-Free Grammar (CFG)
  – Probabilistic Context-Free Grammar (PCFG)
  – Lexical Functional Grammar (LFG)

Grammar

Simple Grammar
[Figure: simple grammar]

Sentence types
• Declaratives: A plane left.  S → NP VP
• Imperatives: Leave!  S → VP
• Yes-No questions: Did the plane leave?  S → Aux NP VP
• WH questions: When did the plane leave?  S → WH-NP Aux NP VP

Derivations
• A derivation is a sequence of rules applied to a string that accounts for that string:
  – it covers all the elements in the string
  – it covers only the elements in the string
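The notion of a derivation can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the four S rules are the ones listed under "Sentence types", while the NP, VP and word-level rules, the name `leftmost_derivation`, and the lowercase, punctuation-free input are all hypothetical additions so that the example is complete. The search always rewrites the leftmost non-terminal and accepts a result only if it covers all, and only, the input words.

```python
# The four S rules come from the "Sentence types" slide. Every other rule
# (NP, VP and the word-level rules) is an assumption added for illustration.
GRAMMAR = {
    "S": [["NP", "VP"], ["VP"], ["Aux", "NP", "VP"], ["WH-NP", "Aux", "NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V"]],
    "Det": [["a"], ["the"]],
    "N": [["plane"]],
    "V": [["left"], ["leave"]],
    "Aux": [["did"]],
    "WH-NP": [["when"]],
}


def leftmost_derivation(start, words):
    """Return the sentential forms of a leftmost derivation of `words`, or None."""

    def expand(form, steps):
        for i, symbol in enumerate(form):
            if symbol in GRAMMAR:  # leftmost non-terminal
                if form[:i] != words[:i]:
                    return None  # words produced so far do not match the input
                for rhs in GRAMMAR[symbol]:
                    new_form = form[:i] + rhs + form[i + 1:]
                    # No rule produces an empty string, so forms never shrink.
                    if len(new_form) > len(words):
                        continue
                    found = expand(new_form, steps + [new_form])
                    if found is not None:
                        return found
                return None
        # Only words are left: accept if they are exactly the input words.
        return steps if form == words else None

    return expand([start], [[start]])


for sentence in ["a plane left", "leave", "did the plane leave"]:
    steps = leftmost_derivation("S", sentence.split())
    print(" => ".join(" ".join(form) for form in steps))
```

Each printed line is one derivation, read left to right from S down to the words. A word sequence the grammar cannot account for, such as "plane a left", yields `None`.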
Top-Down Search
• Since we are trying to find trees rooted in an S (sentence), why not start with the rules that give us an S?
• Then we can work our way down from there to the words.

Sample grammar
[Figure: sample grammar]

Parsing "Book that flight"
[Figure: top-down parse of "Book that flight"]

Bottom-Up Parsing
• Of course, we also want trees that cover the input words, so we might instead start with trees that link up with the words in the right way.
• Then we work our way up from there to larger and larger trees.

Parsing "Book that flight"
[Figure: bottom-up parse of "Book that flight"]

Ambiguity
[Figure: ambiguity]

Morphology, morphemes, stems and lemma
• Morphology is the study of the internal structure of words, i.e. how the form of a word relates to the units it is built from. Ex: book, books, house, household
• A morpheme is the smallest meaningful unit of a word (not necessarily a syllable). Ex: information → inform-ation, reading → read-ing
• Stemming: reducing a word to its stem by stripping affixes. Ex: pretties → pretty, useful → use
  – The result is called the stem.
• Lemmatization: grouping together the different inflected forms of a word so they can be analyzed as a single item. Ex: going → go, reading → read
  – The result is called the lemma.

Lexicology, syntax and semantics
• Lexicography is the branch of linguistics concerned with compiling reference works on words:
  – Thesaurus
  – Lexicon
  – Dictionaries
  – Encyclopedia
• Syntax
• Semantics

Etymology
• The study of words and of the structure of words (in the strict sense, etymology is the study of word origins).
• Example: anti (morpheme) + poison (morpheme) = antipoison (word)
• Classification of word structure:
  – Simple word. Ex: book, boy, sister
  – Complex/derived word. Ex: babysitter
  – Compound word. Ex: tall-boy, swimming-pool
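The difference between stemming and lemmatization described in the morphology slides can be sketched in a few lines of Python. This is a toy illustration, not a real stemmer: the suffix rules in `SUFFIX_RULES`, the small dictionary `LEMMAS`, and the function names are all hypothetical and were chosen only to reproduce the slide examples (pretties, useful, going, reading).

```python
# Hypothetical suffix rules as (suffix, replacement), tried in this order.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ful", ""), ("s", "")]

# Hypothetical lemma dictionary; a real lemmatizer uses a full lexicon.
LEMMAS = {"going": "go", "went": "go", "reading": "read", "books": "book"}


def stem(word):
    """Strip the first matching suffix, keeping at least two letters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


def lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned as-is."""
    return LEMMAS.get(word, word)


for word in ["pretties", "useful", "going", "reading", "went"]:
    print(word, "stem:", stem(word), "lemma:", lemmatize(word))
```

The stemmer only manipulates the surface form, so an irregular form such as "went" passes through unchanged, while the dictionary lookup maps it to "go". Conversely, the lookup does nothing for words missing from its table, which is why practical systems need a large lexicon or a morphological analyzer.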
