Natural Language Processing - Chapter 4: Computational Linguistics


N.L.P. NATURAL LANGUAGE PROCESSING
Teacher: Lê Ngọc Tấn
Email: letan.dhcn@gmail.com
Blog:
Industrial University of Ho Chi Minh City (Trường Đại học Công nghiệp Tp. HCM), Faculty of Information Technology (Khoa Công nghệ thông tin)

Chapter 4: Computational Linguistics

What is computational linguistics?
  • An interdisciplinary field dealing with the statistical or rule-based modeling of natural language from a computational perspective
  • Corpus, corpora
  • Pre-processing: normalization, tokenization
  • Alignment methods
  • Programming

Corpus Definitions
  • What is a corpus?
    - A large collection of texts
    - Corpora: the plural of corpus, that is, a set of corpora
  • Golden (gold-standard) corpora
    - Brown Corpus
    - Susanne Corpus
    - EUROPARL Corpus
  • A corpus can be annotated, for example with part-of-speech (POS) tags

Corpus Categories (1)
  • Schema of corpus evolution (figure not reproduced)

Corpus Categories (2)
  • What is a comparable corpus?
    A corpus whose texts are not parallel but are still closely related because they convey the same information

Corpus Categories (3)
  • What is a noisy parallel corpus?
    A corpus whose sentences are stored in pairs across two languages, but where the alignment between the sentences is not strictly 1-1

Corpus Categories (4)
  • What is a parallel corpus?
    A corpus whose sentences are stored in pairs across two languages

Corpus Categories (5)
  • What is a bilingual parallel corpus?
    Among multilingual corpora, a parallel corpus with only two languages, in which the corpus in one language is a translation of the corpus in the other

Parallel corpora applications
  • Teaching second languages
  • Translation didactics
  • Terminology studies
  • Multilingual edition
  • Product internationalization
  • Automatic translation
  • Multilingual information retrieval

Alignment Methods (1)
  • Approaches
    - Alignment techniques make the corpora useful and exploitable
  • Methods
    - Text alignment
    - Sentence alignment
    - Word alignment
  • How to evaluate the alignment methods?

Alignment Methods (2)
  • The alignment types between language 1 and language 2:
    - 1-0: omission
    - 0-1: addition
    - 1-1: exact correspondence
    - m-n: fusion, with m > 1 and n > 1

Alignment Methods (3)
  • Difficulties for alignment methods:
    - Syntactic analysis
    - Structural differences between languages

How to evaluate alignment methods?
  • How to evaluate the alignment methods?
    - Calculate the precision, the recall and the F-measure
    - Calculate the error rates
    - Calculate metrics such as BLEU, NIST, TER, PER and PER*
  • The F-measure, or F1 score, is
    - A measure of a test's accuracy
    - A weighted (harmonic) average of the precision and the recall

Normalization, Lemmatization and Tokenization
  • Normalization
    - To process a corpus into one standard format
  • Lemmatization
    - To determine the lemma for a given word
    - To group together the different inflected forms of a word so they can be analyzed as a single item
  • Tokenization
    - To break a text into words, symbols, phrases or other meaningful elements
    - Each such element is called a token
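
The three pre-processing steps named above (normalization, tokenization, lemmatization) can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the pipeline used in the course. The function names `normalize`, `tokenize` and `lemmatize` and the tiny `LEMMAS` lookup table are assumptions invented for this example; a real lemmatizer relies on a full dictionary or on morphological analysis.

```python
import re
import unicodedata
from typing import List

# Hypothetical toy lemma table (an assumption for this sketch only).
LEMMAS = {
    "are": "be", "is": "be", "was": "be", "were": "be",
    "corpora": "corpus", "texts": "text", "sets": "set",
}


def normalize(text: str) -> str:
    """Bring text into one standard format: Unicode NFC, lowercase, single spaces."""
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> List[str]:
    """Break text into tokens: runs of word characters or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def lemmatize(tokens: List[str]) -> List[str]:
    """Map each token to its lemma; tokens missing from the toy table are left unchanged."""
    return [LEMMAS.get(token, token) for token in tokens]


if __name__ == "__main__":
    raw = "Corpora   are sets of TEXTS."
    tokens = tokenize(normalize(raw))
    print(tokens)             # ['corpora', 'are', 'sets', 'of', 'texts', '.']
    print(lemmatize(tokens))  # ['corpus', 'be', 'set', 'of', 'text', '.']
```

Normalizing before tokenizing keeps the tokenizer simple, and the dictionary lookup shows how inflected forms such as "are" and "were" end up grouped under the single lemma "be".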
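
For the evaluation of alignment methods, precision, recall and the F-measure are commonly computed by comparing the links an aligner proposes with a gold-standard set of links. With A the predicted links and G the gold links, precision is |A ∩ G| / |A|, recall is |A ∩ G| / |G|, and F1 = 2PR / (P + R). The sketch below assumes that an alignment is represented as a set of (source index, target index) pairs; the function name `alignment_scores` and the example links are hypothetical.

```python
from typing import Set, Tuple

# A link pairs a position in language 1 with a position in language 2.
Link = Tuple[int, int]


def alignment_scores(predicted: Set[Link], gold: Set[Link]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted links against gold links."""
    correct = len(predicted & gold)  # links found in both sets
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up example data: four gold links, three predicted, two of them correct.
    gold = {(0, 0), (1, 1), (2, 2), (3, 3)}
    predicted = {(0, 0), (1, 1), (2, 3)}
    p, r, f = alignment_scores(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
    # prints: precision=0.67 recall=0.50 F1=0.57
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is preferred over a plain arithmetic average when the two values diverge.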
