N.L.P.
NATURAL LANGUAGE PROCESSING
Teacher: Lê Ngọc Tấn
Email:
[email protected]
Blog:
Industrial University of Ho Chi Minh City (Trường Đại học Công nghiệp Tp. HCM)
Faculty of Information Technology (Khoa Công nghệ thông tin)
Chapter 4
Computational Linguistics
NLP. p.2
What is computational linguistics?
It is an interdisciplinary field concerned with the statistical
or rule-based modeling of natural language from a
computational perspective.
Corpus, corpora
Pre-processing: normalization, lemmatization, tokenization
Alignment Methods
Programming
Corpus Definitions
What is a corpus?
– A corpus is a large collection of texts
– Corpora: the plural of corpus
Golden (gold-standard) corpora
– Brown Corpus
– Susanne Corpus
– EUROPARL Corpus
A corpus can be annotated, for example tagged with parts of speech (POS)
Corpus Categories (1)
Schema of corpus evolution
Corpus Categories (2)
What is a comparable corpus?
A corpus whose texts are not parallel (not translations of each other)
but are still closely related because they convey the same information
Corpus Categories (2)
What is a noisy parallel corpus?
A corpus whose sentences are stored in pairs across two languages,
but whose alignment is not strictly 1-1
Corpus Categories (3)
What is a parallel corpus?
A corpus whose sentences are stored in pairs, each pair holding a
sentence in one language and its counterpart in the other language
Corpus Categories (4)
What is a bilingual parallel corpus?
In multilingual corpora, the corpus in one language is a translation
of the corpus in another language; a bilingual parallel corpus
involves exactly two languages
Applications of parallel corpora
Teaching second languages
Translation didactics
Terminology studies
Multilingual edition
Product internationalization
Automatic translation
Multilingual information retrieval
Alignment Methods (1)
Approaches
– Alignment techniques make corpora useful and exploitable
Methods
– Text alignment
– Sentence alignment
– Word alignment
How to evaluate the alignment methods?
Alignment Methods (2)
The alignment types between language 1 and language 2:
1 – 0 : omission
0 – 1 : addition
1 – 1 : exact correspondence
m – n : fusion, with m > 1 and n > 1
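The four types above can be sketched as a small Python function that names the type of an aligned group from the number of units on each side. The function name `alignment_type`, the `beads` list and its indices are hypothetical, chosen only for illustration. Groups such as 2-1 or 1-2 are not covered by the four listed types, so this sketch reports them as "unclassified".

```python
def alignment_type(m: int, n: int) -> str:
    """Name the alignment type for m units in language 1 and n units in language 2."""
    if m < 0 or n < 0 or (m == 0 and n == 0):
        raise ValueError("m and n must be non-negative and not both zero")
    if (m, n) == (1, 0):
        return "omission"
    if (m, n) == (0, 1):
        return "addition"
    if (m, n) == (1, 1):
        return "exact correspondence"
    if m > 1 and n > 1:
        return "fusion"
    # Assumption: anything else (e.g. 2-1) is outside the four listed types.
    return "unclassified"


# Hypothetical aligned groups: (sentence indices in language 1, in language 2)
beads = [([0], [0]), ([1], []), ([], [1]), ([2, 3], [2, 3])]
labels = [alignment_type(len(src), len(tgt)) for src, tgt in beads]
print(labels)
```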
Alignment Methods (3)
Difficulties in alignment:
Syntactic analysis
Structural differences between languages
How to evaluate alignment methods?
– Calculate the precision, the recall and the F-measure
– Calculate the error rates
– Calculate metrics such as BLEU, NIST, TER, PER and PER*
The F-measure, or F1 score, is
– A measure of a test’s accuracy
– The harmonic mean of the precision and the recall
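As a minimal sketch of the first evaluation step, precision, recall and F1 can be computed by comparing the links proposed by an aligner with a reference alignment. The function name `precision_recall_f1` and the two link sets are hypothetical examples, not data from a real corpus. Each link is a pair (index in language 1, index in language 2).

```python
def precision_recall_f1(predicted: set, reference: set) -> tuple:
    """Return (precision, recall, F1) for predicted links against reference links."""
    correct = len(predicted & reference)  # links found in both sets
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical alignment links, for illustration only
reference = {(0, 0), (1, 1), (2, 2), (3, 3)}
predicted = {(0, 0), (1, 1), (2, 3)}

p, r, f = precision_recall_f1(predicted, reference)
print(f"{p:.3f} {r:.3f} {f:.3f}")  # prints: 0.667 0.500 0.571
```

Here two of the three predicted links are correct (precision 2/3), and two of the four reference links are found (recall 1/2), which gives F1 = 4/7.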
Normalization, Lemmatization and Tokenization
Normalization
– To process a corpus into one standard format
Lemmatization
– To determine the lemma for a given word
– To group together the different inflected forms of a word so
they can be analyzed as a single item
Tokenization
– To break a text into words, symbols, phrases, or other
meaningful elements
– Each such element is called a token
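The three steps can be illustrated with a short Python sketch that uses only the standard library. It is a simplification under stated assumptions: normalization here means Unicode NFC, lowercasing and whitespace cleanup, tokenization is a single regular expression, and lemmatization is a lookup in a tiny hypothetical table (`LEMMAS`). A real system would use a morphological dictionary or a trained lemmatizer instead.

```python
import re
import unicodedata

# Hypothetical toy lemma table, far from complete; for illustration only.
LEMMAS = {"corpora": "corpus", "texts": "text", "are": "be", "aligned": "align"}


def normalize(text: str) -> str:
    """Apply Unicode NFC, lowercase the text, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> list:
    """Split the text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def lemmatize(tokens: list) -> list:
    """Replace each token by its lemma when the toy table knows it."""
    return [LEMMAS.get(tok, tok) for tok in tokens]


raw = "Parallel   CORPORA are\taligned, sentence by sentence."
tokens = tokenize(normalize(raw))
lemmas = lemmatize(tokens)
print(tokens)
print(lemmas)
```

With this input, "corpora", "are" and "aligned" are mapped to "corpus", "be" and "align", while tokens missing from the table are left unchanged.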