Extracting Multilingual Lexicons from Parallel Corpora

June 24, 2017 | Author: Dan Ioan Tufiş | Category: Cognitive Science, Parallel Corpora, Data Format
TUFIŞ, BARBU, ION: EXTRACTING MULTILINGUAL LEXICONS


Extracting multilingual lexicons from parallel corpora

DAN TUFIŞ, ANA MARIA BARBU AND RADU ION
Romanian Academy (RACAI), 13 “13 Septembrie”, 050711, Bucharest 5, Romania
[email protected], [email protected], [email protected]

Abstract. The paper describes our recent developments in the automatic extraction of translation equivalents from parallel corpora. We describe three increasingly complex algorithms: a simple baseline iterative method and two more elaborate non-iterative versions. While the baseline algorithm is described mainly for illustrative purposes, the non-iterative algorithms outline the use of different working hypotheses, which may be motivated by different kinds of applications and, to some extent, by the languages concerned. The first two algorithms rely on cross-lingual POS preservation, while in the third POS invariance is not an extraction condition. The evaluation of the algorithms was conducted on three different corpora and several pairs of languages.

Keywords: alignment, evaluation, lemmatization, tagging, translation equivalence

1 INTRODUCTION

Automatic extraction of bilingual lexicons from parallel texts might seem a futile task, given that more and more bilingual lexicons are printed nowadays and can easily be turned into machine-readable lexicons. However, if one considers merely the possibility of automatically enriching the presently available electronic lexicons, with very limited manpower and lexicographic expertise, the problem reveals a lot of potential. The scientific and technological advancement in many domains is a constant source of new-term coinage, and therefore keeping multilingual lexicography up to date in such areas

1 Computers and the Humanities Volume 38, Issue 2, May 2004, pp. 163-198 © 2004. Kluwer Academic Publishers. Printed in the Netherlands.


is very difficult unless computational means are used. On the other hand, bilingual translation lexicons appear to be quite different from the corresponding printed lexicons meant for human users. The marked difference between printed bilingual lexicons and the bilingual lexicons needed for automatic translation is not really surprising. Traditional lexicography deals with translation equivalence (the underlying concept of bilingual lexicography) in an inherently discrete way. What is to be found in a printed dictionary or lexicon (bi- or multilingual) is just a set of general basic translations. In the case of specialised registers, general lexicons are usually not very useful. The recent interest in the semantic markup of texts, motivated by Semantic Web technologies, raises the issue of exploiting the markup existing in the text in one language to automatically generate the semantic annotations in the parallel text in the second language. Finding the lexical correspondences in a parallel text creates the possibility of bidirectional import of semantic annotations that might exist in either of the two parallel texts. The basic concept in extracting translation lexicons is the notion of the translation equivalence relation (Gale and Church, 1991). One widely accepted definition (Melamed, 2001) characterises translation equivalence as a (symmetric) relation that holds between two texts in different languages, such that expressions appearing in corresponding parts of the two texts are reciprocal translations. These expressions are called translation equivalents. A parallel text, or bitext, with its translation equivalents linked is an aligned bitext. Translation equivalence may be defined at various granularity levels: paragraph, sentence, lexical. Automatic detection of the translation equivalents in a bitext becomes increasingly difficult as the granularity becomes finer. Here we are concerned with the finest alignment granularity, namely the lexical one.
Unless stated otherwise, in the rest of the paper by translation equivalents we will mean lexical translation equivalents.



Most approaches to the automatic extraction of translation equivalents fall roughly into two categories. The hypothesis-testing methods, such as (Gale and Church, 1991; Smadja et al., 1996), rely on a generative device that produces a list of translation equivalence candidates (TECs), each of them being subject to an independence statistical test. The TECs that show an association measure higher than expected under the independence assumption are taken to be translation-equivalence pairs (TEPs). The TEPs are extracted independently of one another, and the process might therefore be characterised as local maximisation (greedy). The estimating approach (e.g. Brown et al., 1993; Kupiec, 1993; Hiemstra, 1997) is based on building a statistical bitext model from data, the parameters of which are estimated according to a given set of assumptions. The bitext model allows for global maximisation of the translation equivalence relation, considering not individual translation equivalents but sets of translation equivalents (sometimes called assignments). There are pros and cons for each type of approach, some of them discussed in (Hiemstra, 1997). Our method comes closer to the hypothesis-testing approach. It first generates a list of translation-equivalent candidates and then successively extracts the most likely translation-equivalence pairs. The extraction process does not need a pre-existing bilingual lexicon for the languages considered. Yet, if such a lexicon exists, it can be used to eliminate spurious candidate translation-equivalence pairs and thus to speed up the process and increase its accuracy.

2 CORPUS ENCODING

In our experiments, we used three parallel corpora. The largest one, henceforth “NAACL2003”, is bilingual (Romanian and English), contains 866,036 words in the English part and 770,635 words in the Romanian part, and consists mainly of journalistic texts. The raw texts in this corpus were collected and provided by Rada Mihalcea from the University of North Texas for the purpose of the Shared Task on word alignment organised by Rada Mihalcea and Ted Pedersen at the HLT-NAACL2003 workshop on “Building and Using Parallel Texts: Data Driven Machine Translation and Beyond” (see http://www.cs.unt.edu/~rada/wpt/). The smallest parallel text, henceforth “VAT”, is trilingual (French, Dutch and English), contains about 42,000 words per language, and is a legal text (the EEC Sixth VAT Directive, 77/388/EEC). It was built for the FF-POIROT European project (http://www.starlab.vub.ac.be/research/projects/poirot/) as a testbed for multilingual term extraction and alignment. The third corpus, henceforth “1984”, is the result of the MULTEXT-EAST and CONCEDE European projects and is based on Orwell’s novel Nineteen Eighty-Four, translated into 6 languages (Bulgarian, Czech, Estonian, Hungarian, Romanian, Slovene) with the English original included as the hub. Each translation was aligned to the English hub, thus yielding 6 bitexts. From these 6 integral bitexts, containing on average 100,000 words per language, we selected only the sentences that were 1:1 aligned with the English original, thus obtaining a 7-language parallel corpus with an average of 80,000 words per language. Of the three corpora this is the most accurate, being hand-validated (Erjavec and Ide, 1998; Tufiş et al., 1998).

The input to our algorithms is a parallel corpus encoded according to a simplified version of the XCES specification (http://www.cs.vassar.edu/XCES). This encoding requires preliminary pre-processing of each monolingual part of the parallel corpus and, afterwards, the sentence alignment of all monolingual texts. The aligned fragments of text, in the two or more languages present in the parallel corpus, make up a translation unit. Each translation unit consists of several segments, one per language. A segment is made of one uniquely identified sentence. Each sentence is made up of one or more tokens, for which the lemma and the morpho-syntactic code are explicitly encoded as tag attributes. More often than not, a token corresponds to what is generally called a word, but this is not always the case. Depending on the lexical resources used in the (monolingual) text segmentation, a multiword expression may be treated as a single


lexical token and encoded as such. As an example of the encoding used by our algorithms, Figure 1 shows the translation unit “Ozz.42” of the “1984” corpus:

    EN: There were no windows in it at all.
    RO: Nu avea deloc ferestre.
    SL: Oken na njem sploh ni bilo.
    CS: Nemělo vůbec okna.
    BG: То изобщо нямаше прозорци.
    ET: Sel ei olnud ühtki akent.
    HU: Egyáltalán nem voltak rajta ablakok.

Figure 1: Corpus encoding for the translation equivalence extraction algorithms

The next section briefly describes the pre-processing steps used by the corpus generation module that provides the input for the translation equivalence extraction algorithms.
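The encoding just described can be modelled in memory along the following lines. This is an illustrative sketch only: the class and field names are ours, not part of the XCES encoding, and the tags shown are invented examples.

```python
from dataclasses import dataclass

@dataclass
class Token:
    wordform: str
    lemma: str
    tag: str        # morpho-syntactic code, e.g. "Ncfp-n" (hypothetical)

@dataclass
class TranslationUnit:
    tu_id: str
    segments: dict  # language code -> list of Token, one segment per language

# A fragment of the "Ozz.42" translation unit, with invented tag values:
tu = TranslationUnit(
    tu_id="Ozz.42",
    segments={
        "en": [Token("windows", "window", "NNS")],
        "ro": [Token("ferestre", "fereastră", "Ncfp-n")],
    },
)
assert tu.segments["ro"][0].lemma == "fereastră"
```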

3 PRELIMINARY PROCESSING

3.1 SEGMENTATION; WORDS AND MULTIWORD LEXICAL TOKENS

A lexical item is usually considered to be a space- or punctuation-delimited string of characters. However, especially in multilingual studies, it is convenient, and frequently linguistically motivated, to consider some sequences of traditional words as making up a single lexical unit. For translation purposes, treating multiword expressions as single lexical units is regular practice, justified by both conceptual and computational reasons. The recognition of multiword expressions as single lexical tokens, and the splitting of


single words into multiple lexical tokens (where appropriate) are generically called text segmentation, and the program that performs this task is called a segmenter or a tokenizer. In the following we will refer to words and multiword expressions as lexical tokens or, simply, tokens.

The multilingual segmenter we used, MtSeg, is a public-domain tool (http://www.lpl.univ-aix.fr/projects/multext/MtSeg/) developed by Philippe di Cristo within the MULTEXT project. The segmenter is able to recognise dates, numbers and various fixed phrases, to split clitics or contractions, etc. We implemented a collocation extractor based on NSP, an n-gram statistical package (http://www.d.umn.edu/~tpederse/nsp.html) developed by Ted Pedersen. The list of generated n-grams is subject to a regular-expression filtering that takes into account language-specific constituency restrictions. After validation, the new multi-word expressions may be added to the segmenter’s resources. A complementary approach to overcoming the inherent incompleteness of the language-specific tokenization resources is described at length in (Tufiş, 2001).

3.2 SENTENCE ALIGNMENT

We used a slightly modified version of Gale and Church’s CharAlign sentence aligner (Gale and Church, 1993). In general, the sentence alignments of all bitexts of our multilingual corpora are of type 1:1, i.e. in most cases (more than 95%) one sentence is translated as one sentence. In the following we will refer to the alignment units as translation units (TUs). Sentence alignment is in general a highly accurate process, but in our corpora the alignment is error-free, either because of manual validation and correction (“1984”) or because the raw texts were published already aligned by their authors (“VAT” and “NAACL2003”).

3.3 TAGGING AND LEMMATIZATION

For highly inflectional languages, morphological variation may diffuse the statistical evidence for translation equivalents. To avoid data sparseness, we added a tagging and lemmatization phase as a front-end pre-processing of the parallel corpus. For instance, the English adjective “unacceptable”, occurring nine times in one of our corpora, was translated into Romanian by nine different word forms, representing inflected forms (singular/plural, masculine/feminine) of three adjectival lemmas (inacceptabil, inadmisibil, intolerabil): inacceptabil, inacceptabile, inadmisibil, inadmisibile, inadmisibilă, inadmisibilului, intolerabil, intolerabilă and intolerabilei. Without lemmatization, all translation pairs would be “hapax legomena” pairs and their statistical recognition and extraction would thus be hampered. Lemmatization ensured sufficient evidence for the algorithm to extract all three translations of the English word.

The monolingual lexicons developed within MULTEXT-EAST contain, for each word form, its lemma and the morpho-syntactic codes that apply to it. With these monolingual resources, lemmatisation is a by-product of tagging: knowing the word form and its associated tag, lemma extraction, for those words that are in the lexicon, is just a matter of lexicon lookup; for unknown words, the lemma is implicitly set to the word form itself, unless a lemmatiser is available. Erjavec and Ide (1998) describe the MULTEXT-EAST lexicon encoding principles; a detailed presentation of their application to Romanian is given in (Tufiş et al., 1997).

For morpho-syntactic disambiguation we use a tiered-tagging approach with combined language models (Tufiş, 1999) based on TnT, a trigram HMM tagger (Brants, 2000). For Romanian, this approach has been shown to provide an average accuracy of more than 98.5%. The tiered-tagging model relies on two different tagsets. The first, best suited for statistical processing, is used internally, while the second (used in a


morpho-syntactic lexicon and in most cases more linguistically motivated) is used in the tagger’s output. The mapping between the two tagsets is in most cases deterministic (via a lexicon lookup); in the rare cases where it is not, a few regular expressions resolve the non-determinism. The idea of tiered tagging works not only for very fine-grained tagsets, but also for very low-information tagsets, such as those containing only the part of speech. In such cases the mapping from the hidden tagset to the coarse-grained tagset is strictly deterministic. In (Tufiş, 2000) we showed that using the coarse-grained tagset directly (14 non-punctuation tags) gave a 93% average accuracy, while with the tiered-tagging and combined-language-model approach (92 non-punctuation tags in the hidden tagset) the accuracy was never below 99.5%.
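The two lookups described above, lemma retrieval as a by-product of tagging and the deterministic hidden-to-output tag mapping, can be sketched as follows. The lexicon entries and tag names are invented for illustration; they are not the actual MULTEXT-EAST resources.

```python
# Hypothetical fragments of the monolingual resources:
lexicon = {("ferestre", "Ncfp-n"): "fereastră"}   # (word form, tag) -> lemma
tag_map = {"Ncfp-n": "N"}                          # hidden tag -> coarse tag

def lemma_of(wordform: str, tag: str) -> str:
    # Known words: plain lexicon lookup. Unknown words fall back to the
    # word form itself, as described in the paper.
    return lexicon.get((wordform, tag), wordform)

print(lemma_of("ferestre", "Ncfp-n"))  # fereastră
print(lemma_of("foobar", "X"))         # foobar (unknown word)
print(tag_map["Ncfp-n"])               # N (deterministic tagset mapping)
```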

4 LEXICON EXTRACTION

4.1 UNDERLYING ASSUMPTIONS

Extracting translation equivalents from parallel corpora is a very complex task that can easily turn into a computationally intractable enterprise. Fortunately, several assumptions can be made in order to simplify the problem and lower the computational complexity of its solution. We have to mention, however, that these empirical simplifications usually produce information loss and/or noisy results. Post-processing, as we will describe in section 5.3, may significantly improve both precision and recall by eliminating some wrong translation-equivalence pairs and finding some good ones previously undiscovered. The assumptions we used in our basic algorithm are the following:

• a lexical token in one half of the TU corresponds to at most one non-empty lexical unit in the other half of the TU; this is the 1:1 mapping assumption, which underlies the work of many other researchers (e.g. Kay and Röscheisen, 1993; Melamed, 2001; Ahrenberg et al., 2000; Hiemstra, 1997; Brew and McKelvie, 1996). When this hypothesis does not hold, the result is a partial translation. Remember, however, that a lexical token may be a multi-word expression tokenized as such by an adequate segmenter; non-translated tokens are not of interest here.

• a lexical token, if used several times in the same TU, is used with the same meaning; this assumption is used explicitly by (Melamed, 2001) and implicitly by all the previously mentioned authors. The rationale for it comes from the pragmatics of regular natural-language communication: reusing a lexical token in the same sentence with a different meaning places extra cognitive load on the recipient and is therefore usually avoided; exceptions to this communicative behaviour, more often than not, represent either bad style or a play on words.

• a lexical token in one part of a TU can be aligned to a lexical token in the other part of the TU only if the two tokens have the same part of speech; this is a very efficient way to cut the combinatorial complexity and avoid dealing with irregular cross-POS translations; as we will show in section 4.4, this assumption can be nicely circumvented without too high a price in computational performance;

• although word order is not an invariant of translation, it is not random either; when two or more candidate translation pairs are equally scored, the one containing tokens that are closer in relative position is preferred. This preference is also used in (Ahrenberg et al., 2000).

Based on sentence alignment, POS tagging and lemmatisation, the first step is to compute a list of translation-equivalence candidates (TECL). By collecting all the tokens of the same POSk (in the order they appear in the text, removing duplicates) in each part of TUj (the j-th translation unit) one builds the ordered sets LSjPOSk and LTjPOSk. For each POSi let TUjPOSi be defined as LSjPOSi ⊗ LTjPOSi, with ‘⊗’ representing the Cartesian product operator. Then CTUj (the candidates in the j-th TU) is defined as:

    CTUj = ∪ (i = 1 .. no. of POS) TUjPOSi

With these notations, and considering that there are n translation units in the whole bitext, TECL is defined as:

    TECL = ∪ (j = 1 .. n) CTUj
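The construction of TECL above can be sketched as follows. Each half of a translation unit is represented here as a list of (lemma, POS) pairs; the data and the POS labels are invented for illustration.

```python
from itertools import product

def candidates_of_tu(src, tgt):
    # Per POS shared by the two halves, take the Cartesian product of the
    # ordered, duplicate-free source and target token sets (LSjPOS ⊗ LTjPOS).
    cands = []
    for pos in sorted({p for _, p in src} & {p for _, p in tgt}):
        ls = list(dict.fromkeys(w for w, p in src if p == pos))
        lt = list(dict.fromkeys(w for w, p in tgt if p == pos))
        cands.extend(product(ls, lt))
    return cands

def build_tecl(bitext):
    # TECL is the union over all translation units of the per-TU candidates.
    tecl = []
    for src, tgt in bitext:
        tecl.extend(candidates_of_tu(src, tgt))
    return tecl

bitext = [([("window", "N"), ("see", "V")],
           [("fereastră", "N"), ("vedea", "V")])]
print(build_tecl(bitext))  # [('window', 'fereastră'), ('see', 'vedea')]
```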

TECL contains a lot of noise and many TECs are very improbable, so filtering is necessary. Any filtering eliminates many wrong TECs but also some good ones. The ratio between the number of good TECs rejected and the number of wrong TECs rejected is the criterion we used in deciding which test to use and what the threshold score should be, below which a TEC is removed from TECL. After various empirical tests we decided to use the log-likelihood test (LL) with the threshold set to 9.
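A Dunning-style log-likelihood (G²) score over a 2×2 contingency table is the usual form of this independence test; the sketch below (with invented counts) shows how candidates scoring below the threshold of 9 would be filtered out.

```python
from math import log

def loglikelihood(n11, n12, n21, n22):
    # n11: TUs where both tokens occur; n12/n21: one occurs without the
    # other; n22: neither occurs. G2 = 2 * sum of O * ln(O / E).
    n = n11 + n12 + n21 + n22
    rows, cols = (n11 + n12, n21 + n22), (n11 + n21, n12 + n22)
    g2 = 0.0
    for obs, r, c in ((n11, 0, 0), (n12, 0, 1), (n21, 1, 0), (n22, 1, 1)):
        exp = rows[r] * cols[c] / n
        if obs > 0:
            g2 += 2.0 * obs * log(obs / exp)
    return g2

score = loglikelihood(10, 2, 3, 985)   # invented counts for one candidate
print(score > 9)  # True: this candidate would pass the threshold
```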

4.2 THE BASELINE ALGORITHM (BASE)

Our baseline is a simple iterative algorithm with some similarities to the algorithm presented in (Ahrenberg et al., 2000); unlike it, however, our algorithm avoids computing various probabilities (or, more precisely, probability estimates) and scores (t-score). Based on the TECL, an initial Sm × Tn contingency table (TBL0) is constructed for each POS (see Figure 2), with Sm the number of token types in the first part of the bitext and Tn the number of token types in the other part.
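Building such a table from TECL amounts to counting pair occurrences; a minimal sketch with invented data:

```python
from collections import Counter

# Cell (i, j) of TBL0 counts how often the pair <TSi, TTj> occurs in TECL.
tecl = [("window", "fereastră"), ("window", "fereastră"), ("window", "geam")]
tbl0 = Counter(tecl)

print(tbl0[("window", "fereastră")])  # 2
print(tbl0[("window", "geam")])       # 1
```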


            TT1   ...   TTn
    TS1     n11   ...   n1n   n1*
    ...     ...   ...   ...   ...
    TSm     nm1   ...   nmn   nm*
            n*1   ...   n*n   n**

Figure 2. Contingency table with counts for TECs at step K

The rows of the table are indexed by the distinct source tokens and the columns by the distinct target tokens (of the same POS). Each cell (i, j) contains the number of occurrences in TECL of the pair <TSi, TTj>. All the pairs that at step k satisfy equation (EQ1) below are recorded as TEPs and removed from the contingency table TBLk (their cells (i, j) are zeroed), thus obtaining a new contingency table TBLk+1.

(EQ1) TPk = { <TSi, TTj> ∈ TBLk | ∀p, q: (nij ≥ niq) ∧ (nij ≥ npj) }

Equation (EQ1) expresses the common intuition that, in order to select <TSi, TTj> as a translation-equivalence pair, the number of associations of TSi with TTj must be higher than (or at least equal to) that of TSi with any other TTq (q ≠ j), and the same must hold the other way around. One of the main deficiencies of the BASE algorithm is that it is quite sensitive to what (Melamed, 2001) calls indirect associations. If <TSi, TTj> has a high association score and TTj collocates with TTk, it might very well happen that <TSi, TTk> also gets a high association score. Although, as observed by Melamed, indirect associations generally have lower scores than the direct (correct) ones, they can receive higher scores than many correct pairs; this not only generates wrong translation equivalents, but also eliminates several correct pairs from further consideration. To weaken this sensitivity, we additionally imposed an occurrence threshold on the selected pairs, so that equation (EQ1) became:


(EQ2) TPk = { <TSi, TTj> ∈ TBLk | ∀p, q: (nij ≥ niq) ∧ (nij ≥ npj) ∧ (nij ≥ 3) }

This modification significantly improved the precision (to more than 98%) but seriously degraded the recall, with more than 75% of the correct pairs being missed. The BASE algorithm’s sensitivity to indirect associations, and thus the necessity of an occurrence threshold, is explained by the fact that it looks at the association scores globally, without checking whether the tokens in a TEC are both in the same TU.
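One selection step under (EQ1) and (EQ2) can be sketched as follows: a pair is kept when its count is a maximum of both its row and its column, and (per EQ2) is at least 3. The counts form an invented toy table.

```python
def select_teps(counts, min_count=3):
    # counts: dict (source token, target token) -> n_ij
    row_max, col_max = {}, {}
    for (s, t), n in counts.items():
        row_max[s] = max(row_max.get(s, 0), n)
        col_max[t] = max(col_max.get(t, 0), n)
    # EQ1: n_ij maximal in its row and column; EQ2 adds n_ij >= min_count.
    return {(s, t) for (s, t), n in counts.items()
            if n >= min_count and n == row_max[s] and n == col_max[t]}

counts = {("window", "fereastră"): 5, ("window", "geam"): 2,
          ("door", "ușă"): 4, ("door", "fereastră"): 1}
print(sorted(select_teps(counts)))
# [('door', 'ușă'), ('window', 'fereastră')]
```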

4.3 A BETTER EXTRACTION ALGORITHM (BETA)

To diminish the influence of indirect associations, and thus remove the occurrence threshold, we modified the BASE algorithm so that the maximum score is considered not globally but within each TU. This brings BETA closer to the competitive linking algorithm described in (Melamed, 2001). The competing pairs are only the TECs generated from the current TU, out of which the pair with the best LL score (computed, as before, from the entire corpus) is selected first. Based on the 1:1 mapping hypothesis, any TEC containing either of the tokens in the winning pair is discarded. Then the next best-scored TEC in the current TU is selected, and again the remaining pairs that include one of the two tokens in the selected pair are discarded. The multiple-step control in BASE, where each TU was scanned several times (once in each iteration), is no longer necessary. The BETA algorithm sees each TU only once, but the TU is processed until no further TEPs can be reliably extracted or the TU is emptied. This modification improves both the precision and the recall as compared to the BASE algorithm. When two or more TEC pairs of the same TU share a token and are equally scored, the algorithm has to choose only one of them, in accordance with the 1:1 mapping hypothesis. We used two heuristics: string-similarity scoring and relative distance.
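BETA's per-TU competitive linking can be sketched as below. The LL scores would come from corpus-wide counts; here they are a stub dictionary with invented values, and the tie-breaking heuristics are omitted.

```python
def extract_from_tu(tu_cands, ll_score, threshold=9.0):
    # Repeatedly take the best-scored remaining candidate of this TU and
    # discard every candidate sharing a token with it (1:1 mapping).
    pairs = sorted(tu_cands, key=lambda p: ll_score.get(p, 0.0), reverse=True)
    teps, used_src, used_tgt = [], set(), set()
    for s, t in pairs:
        if ll_score.get((s, t), 0.0) < threshold:
            break                      # no further pair is reliable
        if s not in used_src and t not in used_tgt:
            teps.append((s, t))
            used_src.add(s)
            used_tgt.add(t)
    return teps

ll = {("window", "fereastră"): 40.0, ("window", "geam"): 12.0,
      ("no", "nu"): 25.0}
cands = [("window", "fereastră"), ("window", "geam"), ("no", "nu")]
print(extract_from_tu(cands, ll))
# [('window', 'fereastră'), ('no', 'nu')]
```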



The similarity measure, COGN(TS, TT), is very similar to the XXDICE score described in (Brew and McKelvie, 1996). If TS is a string of k characters α1α2...αk and TT is a string of m characters β1β2...βm, then we construct two new strings T’S and T’T by inserting special displacement characters into TS and TT where necessary. The displacement characters cause both T’S and T’T to have the same length p (max(k, m) ≤ p < k + m), with matching characters aligned; the similarity score is then computed from the number q of aligned matching characters, pairs with q ≤ 2 being scored as dissimilar.

Using the COGN test as a filtering device is a heuristic based on the cognate conjecture, which says that when the two tokens of a translation pair are orthographically similar, they are very likely to have similar meanings (i.e. they are cognates). The threshold for the COGN(TS, TT) test was empirically set to 0.42. This value depends on the pair of languages in the bitext. The actual implementation of the COGN test includes a language-dependent normalisation step, which strips some suffixes, discards diacritics, reduces some consonant doublings, etc. This normalisation step was hand-written but, based on available lists of cognates, it could be induced automatically. The second filtering condition, DIST(TS, TT), considers the relative distance between the tokens of a pair and is defined as follows (where n and m are the indexes of TS and TT in the considered TU):


if (<TS, TT> ∈ LSjPOSk ⊗ LTjPOSk) and (TS is the n-th element in LSjPOSk) and (TT is the m-th element in LTjPOSk) then DIST(TS, TT) = |n − m|

The COGN(TS, TT) test is a more reliable heuristic than DIST(TS, TT), so the TEC with the highest similarity score is the preferred one. If the similarity score is irrelevant, the weaker DIST(TS, TT) filter gives priority to the pairs with the smallest relative distance between the constituent tokens. The main use of the BETA algorithm so far has been in the European project BALKANET (Stamou et al., 2002), aimed at building a EuroWordNet-like lexical ontology. We used this algorithm for the automatic acquisition of bilingual Romanian-English resources and also for consistency checking of the interlingual projection of the consortium’s monolingual wordnets. The multilinguality of EuroWordNet and its BALKANET extension is ensured by linking monolingual synsets to interlingual records that correspond to the Princeton WordNet synsets. If two or more monolingual wordnets are consistently projected over the interlingual index, then translation equivalents extracted from a parallel corpus should (ideally) be projected over the same interlingual record or, more realistically, onto interlingual records that correspond to closely related meanings (according to a given metric). For this particular use, POS identity of the translation equivalents was a definite requirement. In general, however, imposing POS identity on the translation equivalents is too restrictive for a series of multilingual applications. On the other hand, in the vast majority of cases, the cross-lingual variation of the POS of translation equivalents is not arbitrary. This observation led us to the implementation of TREQ, an improved translation-equivalents extractor, more general than BASE.
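The two tie-breaking heuristics can be sketched as follows. Note that this is not the exact XXDICE-style displacement alignment of the paper: as an approximation we score 2q/(k+m) with q the length of the longest common subsequence, and we omit the language-dependent normalisation step.

```python
def lcs_len(a: str, b: str) -> int:
    # Length of the longest common subsequence (standard dynamic program).
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]

def cogn(ts: str, tt: str) -> float:
    # Approximate COGN score: fraction of matched characters.
    q = lcs_len(ts.lower(), tt.lower())
    return 2.0 * q / (len(ts) + len(tt))

def dist(n: int, m: int) -> int:
    # DIST(TS, TT): |n - m|, the tokens' relative positions in their TU.
    return abs(n - m)

print(cogn("inacceptabil", "unacceptable") > 0.42)  # True: likely cognates
print(dist(3, 5))  # 2
```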


4.4 A FURTHER ENHANCED EXTRACTION ALGORITHM (TREQ)

Besides the algorithmic developments discussed in this section, TREQ has been equipped with a graphical user interface that integrates additional functionality for exploiting parallel corpora (editing the parallel corpora, generating word-alignment maps, multi-word term extraction, building multilingual and multi-word terminological glossaries, etc.). In section 4.1 we described four simplifying assumptions used in the implementation of the translation-equivalents extraction procedures. The implementation of TREQ dispenses with two of them, namely the assumption that translation equivalence preserves the POS and the assumption that repeated tokens in a sentence have the same meaning.

4.4.1 Meta-categories

As noted before, when translation equivalents have different parts of speech, this alternation is not arbitrary and can be generalized. TREQ allows the user to define, for each language pair, the possible POS alternations. A set of grammar categories in one language that can be mapped by the translation-equivalence relation onto one or more categories in the other language is called a meta-category. The user defines the meta-categories for each language and then specifies their interlingual correspondence. For instance, English participles and gerunds are often translated by Romanian nouns or adjectives, and vice versa. So, for this pair of languages we defined, in both languages, the meta-category MC1 subsuming common nouns, adjectives and (impersonal) verbs, and stipulated that if the source lexical token belongs to MC1, then its translation equivalent should belong to the same meta-category. Another example of a meta-category we found useful, MC2, subsumes the following pronominal adjectives: demonstrative, indefinite and negative. These types of adjectives are used differently in the two languages (e.g. a negative adjective allowed in Romanian by the negative


concord phenomenon has as its equivalent an indefinite or even a demonstrative adjective in English). For uniformity, any category not explicitly included in a user-defined meta-category is considered the single subsumed category of an automatically generated meta-category. The cross-lingual mapping of these singleton meta-categories is equivalent to POS identity. For instance, the abbreviations, which in our multilingual corpora are labeled with the tag X, are subsumed in this way by the MC30 meta-category. In order not to lose information from the tagged parallel corpora, TREQ adds the meta-category (actually a number) as a prefix to the actual tag of each token. The search space (TECL) is computed as described in section 4.1, the only modification being that the meta-category prefix is used instead of the POS. Figure 3 shows the English and Romanian segments from Figure 1 with the meta-category prefix added to the token tags:

    EN: There were no windows in it at all.
    RO: Nu avea deloc ferestre.

Figure 3: Corpus encoding using meta-categories for the POS tagging
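The meta-category prefixing can be sketched as follows. The tag names and meta-category identifiers are invented (in TREQ the prefix is actually a number), and only the MC1 grouping follows the paper's noun/adjective/verb example.

```python
# User-defined grouping: these POS prefixes share meta-category MC1.
META = {"Nc": "MC1", "Af": "MC1", "Vm": "MC1"}

def meta_prefix(tag: str) -> str:
    # Categories outside any user-defined meta-category fall into an
    # automatically generated singleton one (equivalent to POS identity).
    pos = tag[:2]
    mc = META.get(pos, "MC_" + pos)
    return mc + ":" + tag

print(meta_prefix("Ncfp-n"))  # MC1:Ncfp-n
print(meta_prefix("X"))       # MC_X:X  (singleton meta-category)
```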


As the TECL becomes much larger with the introduction of meta-categories, the memory-based book-keeping mechanisms were optimized to release unnecessarily occupied memory and take advantage, in case of large parallel corpora, of virtual memory (disk resident). Besides accounting for real POS alternations in translation, the meta-category has the advantage that it overcomes some tagging errors which could also result in POS alternations. But probably the most important advantage of the meta-category mechanism is the possibility of working with very different tagsets. In (Tufiş et al. 2003) we describe a system (based on TREQ) participating in a shared task on RomanianEnglish word-alignment. The English parts of the training and evaluation data were tagged using the Penn TreeBank tagset while the Romanian parts were tagged using the MULTEXT-EAST tagset. Using meta-categories was a very convenient way of coping with the different encodings and granularities of the two tagsets. Finally, we should observe that the algorithm by no means requires that the metacategories with the same cross-lingual identifier subsume the same grammatical categories in the two languages; and also, that defining a meta-category that subsumes all the categories in the languages considered is equivalent to completely ignoring the POS information (thus tagging becomes unnecessary). 4.4.2 Repeated tokens The second simplifying hypothesis which was dropped in the TREQ implementation was to assume that the same token (with the same POS tag), used several times in the same sentence, has the same meaning. Based on this assumption, in the previous versions, only one occurrence of the token was preserved. As this hypothesis didn’t save significant computational resources we decided to keep all the repeated tokens. This modification slightly improved the precision of the algorithm allowing the extraction of translation pairs that appeared only in one translation unit, but several times. Also, when


the tokens repeated in one language were translated differently (by synonyms) in the other language, not purging the duplicates allowed the extraction of (synonymic) translation pairs which would otherwise have been lost.

4.4.3 Other improvements

We evaluated the cognate conjecture for the Romanian-English pair of languages and found it to be correct in more than 98% of the cases when the similarity threshold was set to 0.68. We also noted that many candidates, rejected either because of a low log-likelihood score or because they occurred only once, were cognates. Therefore, we modified the algorithm to also include in the list of extracted translation equivalents all the candidates which, in spite of failing the log-likelihood test, have a cognate score above the 0.68 threshold. This change improved both precision and recall (see the next section).

4.4.4 The Graphical User Interface

The Graphical User Interface has been developed mainly for the purpose of validation and correction (in context) of the translation equivalents, a task committed to linguists without (too much) computer training. Besides lexical translation-equivalents extraction, the Graphical User Interface incorporates several other useful corpus management and mining utilities: a) selecting a corpus from a collection of several corpora; b) editing and correcting the tokenization, tagging or lemmatization; c) updating the extracted lexicons accordingly; d) extracting compound-term translations in one language based on an inventory of compound terms in the other language; e) extracting multi-word collocations (monolingually) for updating the segmenter's resources for the languages concerned. Figure 4 exemplifies the parameter settings for the extraction process: the parallel corpus, the language pairs, the statistical method used for independence-hypothesis testing, the


test threshold, the type of alignment (either by POS or by meta-categories), and, in the case of POS alignment, which grammatical categories are of interest for the extracted lexicon.

Figure 4: Parameter settings for a GUI-TREQ translation-extraction session
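The independence-hypothesis test mentioned above is based on a log-likelihood score (cf. the cited Dunning, 1993). As an illustration only, and not the original Perl code, Dunning's G² statistic over a 2x2 co-occurrence contingency table can be sketched as:

```python
import math

def log_likelihood_ratio(n11, n12, n21, n22):
    """Dunning's G^2 statistic for a 2x2 co-occurrence contingency table.

    n11: translation units containing both the source and the target word;
    n12: units with the source word but not the target; n21: the converse;
    n22: units containing neither word."""
    n = n11 + n12 + n21 + n22
    g2 = 0.0
    for obs, row, col in ((n11, n11 + n12, n11 + n21),
                          (n12, n11 + n12, n12 + n22),
                          (n21, n21 + n22, n11 + n21),
                          (n22, n21 + n22, n12 + n22)):
        expected = row * col / n
        if obs > 0:  # a zero cell contributes nothing to the sum
            g2 += obs * math.log(obs / expected)
    return 2.0 * g2

# Perfectly independent counts yield a score of (near) zero, while a
# strongly associated candidate pair gets a high score.
print(round(log_likelihood_ratio(10, 10, 10, 10), 6))   # 0.0
```

Candidates whose score falls below the test threshold set in the dialog are rejected as statistically independent.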

Figure 5 displays the results of the extraction process. By displaying the running texts as pairs of aligned sentences in the two languages, the Graphical User Interface facilitates the in-context evaluation of the extracted translation equivalents. Pointing to a word in either language displays its translation equivalent in the other language.


Figure 5: The “1984” corpus: Romanian-English translation equivalents extracted from the OZZ.113 translation unit

A detailed presentation of the facilities and operation procedures is given in the TREQ user manual (Mititelu, 2003).
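The cognate test of section 4.4.3 relies on an orthographic similarity score. The exact COGN measure is defined earlier in the paper; the sketch below uses a common longest-common-subsequence surrogate (our own choice, for illustration only):

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def cognate_score(src: str, tgt: str) -> float:
    """A simple similarity in [0, 1]: LCS length normalized by the
    mean length of the two (lowercased) words."""
    src, tgt = src.lower(), tgt.lower()
    if not src or not tgt:
        return 0.0
    return 2.0 * lcs_len(src, tgt) / (len(src) + len(tgt))

# A Romanian-English cognate pair scores well above a 0.68 threshold,
# while an unrelated pair stays far below it.
print(cognate_score("problema", "problem") > 0.68)   # True
print(cognate_score("carte", "window") > 0.68)       # False
```

Under such a measure, candidates failing the log-likelihood test can still be accepted whenever their similarity score exceeds the 0.68 threshold.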

5 EXPERIMENTS AND EVALUATION

We conducted translation equivalents extraction experiments on the three corpora mentioned before (“1984”, “VAT” and “NAACL2003”) and for various pairs of languages.


The bilingual lexicons extracted from the integral bitexts for English-Estonian, English-Hungarian, English-Romanian and English-Slovene were evaluated by native speakers of the languages paired with English, all with a good command of English. The evaluation protocol specified that all the translation pairs are to be judged in context, so that if a pair is found to be correct in at least one context, it should be judged as correct. The evaluation was done for both the BASE and BETA algorithms, but on different scales. The BASE algorithm was run on all 6 integral bitexts with English as the hub, and 4 out of the 6 bilingual lexicons were hand-validated. The lexicons contained all parts of speech defined in the MULTEXT-EAST lexicon specifications except for interjections, particles and residuals. The BETA and TREQ algorithms were run on the Romanian-English partial bitext extracted from the "1984" 7-language parallel corpus, and we validated only the noun pairs. We also re-ran the BASE algorithm, for comparison purposes, on the Romanian-English partial bitext. The translation equivalents extracted from the "VAT" corpus by means of TREQ were not explicitly evaluated, but were used in a multilingual term-extraction experiment for the purposes of the FF-POIROT European project. The preliminary comparative evaluation, conducted by native speakers of French and Dutch with an excellent command of English, showed that both the precision (approx. 80%) and the recall (approx. 75%) of our results are significantly better than those of the other extractors used in the comparison. Since we do not yet have the details of this evaluation, we will not go into further detail here. The bilingual lexicon extracted from the "NAACL2003" corpus by TREQ has been evaluated based on the test data used by the organisers of the HLT-NAACL2003 Shared Task on word alignment. The test text has been manually aligned at word level.
This valuable data and the program that computes the precision, recall and F-measure of any alignment against a gold standard were graciously made public after the close of the shared-task competition. From the word-aligned bitext used for evaluation we


removed the null alignments (words not translated in either part of the bitext) and purged the duplicate translation pairs, thus obtaining the gold-standard Romanian-English lexicon. The evaluation considered all the words. The tables below give an overview of the corpora and of the gold-standard alignment text we used for the evaluation of the translation-equivalents extractors.

Language            BU     CZ     EN     ET     HU     RO     SI
No. of tokens*      72020  66909  87232  66058  68195  85569  76177
No. of word forms*  15093  17659   9192  16811  19250  14023  16402
No. of lemmas*       8225   8677   6871   8403   9729   6987   7157

Figure 6. The "1984" corpus overview
* the counts refer only to 1:1 aligned sentences and do not include interjections, particles and residuals

Language            EN     FR     NL
No. of occurrences  41722  45458  40594
No. of word forms*   3473   3961   3976
No. of lemmas*       2641   2755   3165

Figure 7. The "VAT" corpus overview

Language            EN      RO
No. of tokens       866036  770653
No. of word forms*  27598   48707
No. of lemmas*      19139   23134

Language            EN    RO
No. of tokens       4940  4563
No. of word forms*  1517  1787
No. of lemmas*      1289  1370

Figure 8. Overview of the "NAACL2003" corpus (first table) and the word-aligned bitext (second table)

5.1 THE EVALUATION OF THE BASE ALGORITHM

For validation purposes we limited the number of iteration steps to 4. The extracted lexicons contain adjectives (A), conjunctions (C), determiners (D), numerals (M), nouns


(N), pronouns (P), adverbs (R), prepositions (S) and verbs (V). Figure 9 shows the evaluation results provided by the human evaluators2. The precision (Prec) was computed as the number of correct TEPs divided by the total number of extracted TEPs. The recall (considered for the non-English language in the bitext) was computed in two ways: the first one, Rec*, took into account only the tokens processed by the algorithm (those that appeared at least three times); the second one, Rec, took into account all the tokens irrespective of their frequency counts. Rec* is defined as the number of source lemma types in the correct TEPs divided by the number of lemma types in the source language with at least 3 occurrences. Rec is defined as the number of source lemma types in the correct TEPs divided by the number of lemma types in the source language. The F-measure is defined as 2*Prec*Rec/(Prec+Rec), and we consider it to be the most informative score. The rationale for showing Rec* is to estimate the proportion of missed tokens out of the considered ones. This might be of interest when precision is of the utmost importance.

Bitext (4 steps)    ET-EN              HU-EN              RO-EN              SI-EN
Entries             1911               1935               2227               1646
Prec/Rec/F-measure  96.18/18.79/31.16  96.89/19.27/32.14  98.38/25.21/40.13  98.66/22.69/36.89
Rec*                57.86              56.92              58.75              57.92

Figure 9. "1984" integral bitexts; partial evaluation of the BASE algorithm after 4 iteration steps with the occurrence threshold set to 3

The lexicons evaluation was performed fully for Estonian, Hungarian and Romanian, and partially for Slovene (the first step was fully evaluated, while the rest were evaluated on randomly selected pairs). As one can see in Figure 9, the precision is higher than 98% for Romanian and Slovene, almost 97% for Hungarian and more than 96% for Estonian. The Rec* measure ranges from 50.92% (Slovene) to 63.90% (Estonian). The


standard recall Rec varies between 19.27% and 32.46% (quite modest, since on average the BASE algorithm did not consider 60% of the lemmas). Due to the low Rec value, the composite F-measure is also low (ranging between 31.16% and 41.13%), in spite of the very good precision. Our analysis showed that the extracted entries have different accuracy depending on the part of speech. Noun extraction had the second worst accuracy (the worst was the adverb), and we therefore considered that an in-depth evaluation of this case would be more informative than a global evaluation. Moreover, to facilitate the comparison between the BASE and BETA algorithms, we set no limit on the number of steps, lowered the occurrence threshold to 2 and extracted only the noun pairs from the partial Romanian-English bitext included in the "1984" 7-language parallel corpus. The BASE program stopped after 10 steps with 1673 extracted noun translation pairs, out of which 112 were wrong (see Figure 10). Compared with the 4-step run, the precision decreased to 93.30%, but both Rec (36.45%) and the F-measure significantly increased, showing that an occurrence threshold of 2 leads to a better Precision/Recall compromise than 3.

Noun types in text  Entries  Correct entries  Noun types in correct entries  Prec/Rec/F-measure
3116                1673     1561             1136                           93.30/36.45/52.42

Figure 10. "1984" corpus; evaluation of the BASE algorithm with the noun lexicon extracted from the Romanian-English partial bitext; 10 iteration steps, the occurrence threshold set to 2

If the occurrence threshold is removed, the indirect-association sensitivity degrades the precision of BASE too much for the lexicon to be really useful.

5.2 THE EVALUATION OF THE BETA ALGORITHM

The BETA algorithm preserves the simplicity of the BASE algorithm but significantly improves its global performance (F-measure), due to a much better recall (Rec) obtained at the expense of some loss in precision (Prec). Keeping the occurrence threshold set at two (that is, ignoring hapax-legomena translation-equivalence candidates), the results of the BETA evaluation on the same data are shown in Figure 11 below:

Noun types in text  Entries  Correct entries  Noun types in correct entries  Prec/Rec/F-measure
3116                2291     2183             1735                           95.28/55.68/70.28

Figure 11. "1984" corpus; partial evaluation of the BETA algorithm (noun lexicon extracted from the partial Romanian-English bitext), the occurrence threshold set to 2

Moreover, the indirect-association sensitivity is very much reduced, so that removing the occurrence threshold yields even better global results:

Noun types in text  Entries  Correct entries  Noun types in correct entries  Prec/Rec/F-measure
3116                3128     2516             2114                           80.43/67.84/73.60

Figure 12. "1984" corpus; partial evaluation of the BETA algorithm (noun lexicon extracted from the partial Romanian-English bitext), no occurrence threshold

Besides the occurrence threshold, the BETA algorithm offers another way to trade off Prec for Rec: the COGN similarity score. In the experiments evaluated in Figure 12, this threshold was set to 0.42. We should mention that, in spite of the general practice in computing recall for bilingual lexicon-extraction tasks (be it Rec* or Rec), this is only an approximation of the real recall. The reason is that computing the real recall requires a gold standard with all the words aligned by human evaluators. Usually such a gold-standard bitext is not available, and the recall is either approximated as above, or evaluated on a small sample whose result is taken to be more or less true for the whole bitext.

5.3 THE EVALUATION OF THE TREQ ALGORITHM

To facilitate comparison with the BASE and BETA algorithms, we ran TREQ on the same data and used the same evaluation procedure for the extracted noun translation pairs. The results are shown in Figure 13 and, as expected, they are superior to those provided by BETA.

Noun types in text  Entries  Correct entries  Noun types in correct entries  Prec/Rec/F-measure
3116                3001     2525             2084                           84.14/66.88/74.52

Figure 13. "1984" corpus; evaluation of the TREQ algorithm (noun lexicon extracted from the Romanian-English partial bitext)

All the previous evaluations were based on an approximation of the recall measure, motivated by the lack of a gold-standard lexicon. As mentioned before, for the purposes of the shared task on word alignment at the NAACL2003 workshop, the organisers created a short hand-aligned Romanian-English bitext (248 sentences), which was made public after the competition. We used this word-alignment data to extract a Gold Standard Romanian-English Lexicon, allowing a precise evaluation of the recall. The complete set of links in the word-aligned bitext contains 7149 links. Each token in either language is bi-directionally linked to a token representing its translation in the other language, or to the empty string if it was not translated. Removing the empty links, we were left with 6195 links representing pairs of translation equivalents. Deleting the links for punctuation, purging the links corresponding to identical lexical pairs and eliminating the pairs not preserving the meta-category3, we obtained a Gold Standard Lexicon containing 1706 entries. Out of these entries, 1547 are POS-preserving translation pairs, the rest being legitimate alternations. The Gold Standard Lexicon includes all the grammatical categories defined in the revised MULTEXT-EAST specifications for lexicon encoding (Erjavec, 2001). Figure 14 shows the exact evaluation of the TREQ performance.


No. of entries in the Gold Standard  No. of entries extracted  No. of correct entries  Prec/Rec/F-measure
1706                                 1308                      1041                    79.58/61.01/69.06

Figure 14. "NAACL2003" word-aligned bitext; exact evaluation of the TREQ algorithm
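The derivation of the Gold Standard Lexicon from the word-aligned bitext can be sketched as below; the link representation and the PUNCT tag are our assumptions for illustration, not the actual data format:

```python
def build_gold_standard(links, meta_category):
    """Turn word-alignment links into a gold-standard lexicon.

    links: iterable of (src_word, src_tag, tgt_word, tgt_tag) tuples,
    with None standing for the empty (null) side of a link;
    meta_category: maps a POS tag onto its cross-lingual meta-category.
    The steps mirror the text: drop null links, drop punctuation links,
    drop identical lexical pairs, purge duplicates and keep only
    meta-category-preserving pairs."""
    lexicon = set()
    for src, src_tag, tgt, tgt_tag in links:
        if src is None or tgt is None:            # null link
            continue
        if src_tag == "PUNCT" or tgt_tag == "PUNCT":
            continue
        if src == tgt:                            # identical lexical pair
            continue
        if meta_category(src_tag) != meta_category(tgt_tag):
            continue
        lexicon.add((src, tgt))                   # set() purges duplicates
    return lexicon

# Toy example: four links collapse into a single lexicon entry.
meta = lambda tag: {"N": "NV", "V": "NV", "PUNCT": "PUNCT"}.get(tag, tag)
links = [("carte", "N", "book", "N"), ("carte", "N", "book", "N"),
         ("si", "C", None, None), (".", "PUNCT", ".", "PUNCT")]
print(build_gold_standard(links, meta))  # {('carte', 'book')}
```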

The scores of the exact evaluation are significantly lower than expected when compared with the approximate evaluation procedure used before on the "1984" corpus. Given the scarcity of statistical evidence (the NAACL evaluation bitext is almost 20 times smaller than the bitext extracted from the "1984" corpus), the performance decrease is not surprising. On the other hand, the exact calculation of the recall shows that considering only the lemma types in one part of the bitext and of the lexicon (as the approximate recall calculation does) slightly over-estimates the real recall by ignoring the multiple senses a lemma might have. If we compute the recall as we did before, it shows an increase of more than 2% (63.08%) and thus a better F-measure (70.67%). We mentioned at the beginning of the paper that, by adding a post-processing phase to the basic translation-equivalence extraction procedure, one may further improve the accuracy and coverage of the extracted lexicons. In the next section we give an overview of such a post-processing phase and show how the performance of the translation-equivalence extraction was improved.

5.3.1 TREQ-AL and word-alignment maps

In (Tufiş et al. 2003) we described our TREQ-AL system, which participated in the Shared Task proposed by the organizers of the workshop on "Building and Using Parallel Texts: Data Driven Machine Translation and Beyond" at the HLT-NAACL 2003 conference (http://www.cs.unt.edu/~rada/wpt). TREQ-AL builds on TREQ and generates a word-alignment map for a parallel text (a bitext). The word alignment as it


was defined in the shared task is a different and harder problem than the translation-equivalence extraction addressed so far. In a lexicon-extraction task, a translation pair is considered correct if there is at least one context in which it has been correctly observed; a multiply-occurring pair counts only once in the final lexicon. This is in sharp contrast with the alignment task, where each occurrence of the same pair counts equally. The word-alignment task requires that each word (irrespective of its POS) and punctuation mark in both parts of the bitext be paired with a translation in the other part (or with the null translation, if that is the case). Such a pair is called a link. In a non-null link, both elements of the link are non-empty words from the bitext. If either the source word or the target word is not translated in the other language, this is represented by a null link. Finally, the evaluations of the two tasks, even if both use measures such as precision and recall, have to be judged differently. The null links have no significance in a lexicon-extraction task, while in a word-alignment task they play an important role (in the Romanian-English gold-standard data the null links represent 13.35% of the total number of links). TREQ-AL being built on TREQ, any improvement in the precision and recall of the extracted lexicons has a crucial impact on the precision and recall of the alignment links produced by TREQ-AL. This is also true the other way around: as described in (Tufiş et al., 2003), several wrong translation pairs extracted by TREQ are discarded by TREQ-AL, and many translation pairs missed by TREQ are generated by the alignment of TREQ-AL. This is clearly shown by the scores in Figure 15 as compared to those in Figure 13.

Noun types in text  Entries  Correct entries  Noun types in correct entries  Prec/Rec/F-measure
3116                3724     3263             2450                           87.62/75.08/80.87

Figure 15. "1984" corpus; evaluation of the TREQ-AL algorithm (noun lexicon extracted from the Romanian-English partial bitext)


The first three columns in Figure 16 give the initial evaluation of TREQ-AL on the shared-task data.

            Non-null links only  Null links included  TREQ-AL lexicon
Precision   81.38%               60.43%               84.42%
Recall      60.71%               62.80%               77.72%
F-measure   69.54%               61.59%               80.93%

Figure 16. Evaluation of TREQ-AL in the "NAACL2003" shared task on word alignment

The error analysis pinpointed some minor programming errors, which we were able to fix in a short period of time. We also decided to see how an external resource, namely a bilingual seed lexicon, would improve the performance of TREQ and TREQ-AL. We used our Romanian WordNet, still under development, as the source of a seed bilingual lexicon. The Romanian WordNet contains 11,000 verb and noun synsets which are linked to the Princeton WordNet. From one Romanian synset SRO, containing M literals, and its equivalent English synset SEN, containing N literals, we generated M*N translation pairs, thus producing a bilingual seed lexicon of about 40,000 entries. This lexicon contains some noise, since not all M*N translation pairs obtained from two linked synsets are expected to be real translation-equivalence pairs4. In Figure 17 we give the new evaluation results (using the official programs and evaluation data) of the new versions of TREQ and TREQ-AL.

            Non-null links only  Null links included  TREQ-AL lexicon  TREQ lexicon
Precision   84.43%               65.58%               86.68%           79.58%
Recall      64.34%               66.08%               81.96%           61.01%
F-measure   73.03%               65.83%               84.26%           69.06%

Figure 17. Re-evaluation of TREQ-AL and TREQ on the "NAACL2003" shared task without a seed lexicon
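The seed-lexicon generation described above is essentially a cross-product of the literals of each pair of interlingually linked synsets; a minimal sketch (the data shapes are our assumptions):

```python
from itertools import product

def seed_lexicon(aligned_synsets):
    """Generate M*N translation pairs from each pair of aligned synsets.

    aligned_synsets: iterable of (romanian_literals, english_literals)
    pairs, one per interlingually linked synset. Not every generated
    pair is a true translation equivalent, so the result is noisy."""
    pairs = set()
    for ro_literals, en_literals in aligned_synsets:
        pairs.update(product(ro_literals, en_literals))
    return pairs

# Two linked synsets with 2 and 3 literals produce 2*3 = 6 seed pairs.
aligned = [(["drum", "cale"], ["road", "way", "path"])]
print(len(seed_lexicon(aligned)))  # 6
```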


As shown in Figure 17, TREQ-AL dramatically improves the performance of TREQ: the precision increased by more than 7%, while the recall of TREQ-AL is more than 20% better than the recall of TREQ. The evaluation of TREQ-AL when TREQ started with a seed lexicon showed no improvement of the final extracted dictionary. However, the results for the word-alignment shared task improved (apparently the frequency of what was found versus what was lost made the difference, which is anyway not statistically significant).

            Non-null links only  Null links included  TREQ-AL lexicon
Precision   84.72%               66.07%               86.56%
Recall      64.73%               66.43%               81.85%
F-measure   73.39%               66.25%               84.13%

Figure 18. Re-evaluation of TREQ-AL and TREQ on the "NAACL2003" shared task with an initial bilingual dictionary

Figures 19a and 19b show the performance of all participating teams on the Romanian-English word-alignment shared task. There were two distinct evaluations: the NON-NULL-alignments evaluation considered only the links representing non-null translations, while the NULL-alignments evaluation took into account both the non-null and the null translations. RACAI.RE.2 is the evaluation of TREQ-AL with an initial seed lexicon and RACAI.RE.1 is the evaluation of TREQ-AL without one. The systems were evaluated in terms of three figures of merit: the Fs-measure, the Fp-measure and the AER (Alignment Error Rate). Since the Romanian Gold Standard contains only sure alignments, the AER reduces to 1 - Fp-measure. For all systems that assigned only sure alignments, Fp-measure = Fs-measure (see Mihalcea & Pedersen (2003) for further details).
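For reference, the AER over a proposed alignment A, sure links S and probable links P is conventionally defined as 1 - (|A∩S| + |A∩P|)/(|A| + |S|); when the gold standard contains only sure links (P = S), it indeed reduces to 1 - F-measure, as can be checked with a small sketch:

```python
def f_measure(a: set, gold: set) -> float:
    """Harmonic mean of precision and recall of alignment a against gold."""
    hits = len(a & gold)
    return 2.0 * hits / (len(a) + len(gold))

def aer(a: set, sure: set, probable: set) -> float:
    """Alignment Error Rate: 1 - (|A&S| + |A&P|) / (|A| + |S|)."""
    return 1.0 - (len(a & sure) + len(a & probable)) / (len(a) + len(sure))

# With sure-only gold data (P = S), AER equals 1 - F-measure.
a = {(1, 1), (2, 2), (3, 4)}   # proposed links (src pos, tgt pos)
s = {(1, 1), (2, 2), (4, 4)}   # sure gold links
print(round(aer(a, s, s), 4) == round(1.0 - f_measure(a, s), 4))  # True
```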


[Bar chart (omitted): F-measure Sure, F-measure Probable and AER for each participating system, ranked by performance.]

Figure 19a. NAACL2003 Shared Task: ranked results of Romanian-English non-NULL alignments

[Bar chart (omitted): the same three measures under the NULL-alignments evaluation.]

Figure 19b. NAACL2003 Shared Task: ranked results of Romanian-English NULL alignments

6 IMPLEMENTATION, CONCLUSIONS AND FURTHER WORK

The extraction programs BASE, BETA and TREQ, as well as TREQ-AL, run on both Windows and Unix machines5. Throughput is very high: on a Pentium 4 (1.7 GHz) with 512 MB of RAM, extracting the noun bilingual lexicon from "1984" took 109 seconds (72 s for TREQ plus 37 s for TREQ-AL), while the full dictionary was generated in 285 seconds (204 s for TREQ plus 81 s for TREQ-AL). These figures are comparable to those reported in (Tufiş and Barbu, 2002) for BETA, although the machine on which those evaluations were conducted was a less powerful Pentium II (233 MHz) with 96 MB of RAM. An approach quite similar to our BASE algorithm (also implemented in Perl) is presented in (Ahrenberg et al., 2000). They used a frequency threshold of 3, and the best results reported are 92.5% precision and 54.6% partial recall (what we called Rec*). The BETA and TREQ algorithms exploit the idea of competitive linking underlying Melamed's extractor (Melamed, 2001), although our program never returns to a visited translation unit. Melamed's evaluation is made in terms of accuracy and coverage, where accuracy is more or less our precision and coverage is defined as the percentage of tokens in the corpus for which a translation has been found. With the best 90% coverage, the accuracy of his lexicon was 92.8±1.1%. Coverage is a much weaker evaluation function than recall, especially for large corpora, since it favours frequent tokens to the detriment of hapax legomena. Melamed (2001) showed that the 4.5% most frequent translation-pair types in the Hansard parallel corpus cover more than 61% of the tokens in a random sample of 800 sentences. Moreover, the approximation used by Melamed in computing coverage over-estimates it, since it does not consider whether the translations found for the words in the corpus are correct or not. Based on the Gold Standard Lexicon, we could compute the exact precision, recall and coverage, as well as the approximated coverage (Melamed's way).
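The contrast between type-based recall and Melamed-style token-based coverage can be made concrete with toy counts (our own, for illustration): frequent types dominate the token count, so coverage can look far better than recall.

```python
def type_recall(found_types, all_types):
    """Share of distinct source types for which a translation was found."""
    return len(found_types & all_types) / len(all_types)

def token_coverage(found_types, freq):
    """Melamed-style coverage: share of corpus TOKENS whose type received
    a translation, regardless of whether that translation is correct."""
    total = sum(freq.values())
    covered = sum(n for t, n in freq.items() if t in found_types)
    return covered / total

# Five types, but two frequent ones carry almost all the tokens.
freq = {"the": 50, "of": 30, "lexicon": 1, "bitext": 1, "hapax": 1}
found = {"the", "of"}
print(round(type_recall(found, set(freq)), 2))   # 0.4
print(round(token_coverage(found, freq), 2))     # 0.96
```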
As Figure 20 shows, in spite of a very small text, there are significant differences between the exact coverage and the estimated coverage. The differences would be even more significant in the case of a larger text.

           Exact Coverage  Estimated Coverage
Romanian   91.91%          96.92%
English    91.98%          97.21%

Precision: 86.56%

Figure 20. Exact and Estimated Coverage for the lexicon extracted by TREQ-AL from the NAACL2003 Gold Standard Alignment

We ran TREQ-AL (without the seed lexicon mentioned before) on the entire NAACL2003 corpus, extracting a 48,287-entry lexicon. Following Melamed's (2001) procedure, we took five random samples (with replacement) of 100 entries and validated them by hand. The resulting average precision was 91.67%, with an estimated coverage of 95.21% for Romanian and 96.56% for English. However, as demonstrated in Figure 20, without a gold standard such estimated evaluations should be regarded cautiously. All the algorithms we presented are based on a 1:1 mapping hypothesis. We argued that when a language-specific tokenizer is responsible for pre-processing the input to the extractor, the 1:1 mapping approach is no longer an important limitation. Incompleteness of the segmenter's resources may be accounted for by a post-processing phase that recovers the partial translations. In (Tufiş, 2001) such a recovery phase is presented, taking advantage of the already extracted entries. Additional means, such as collocation extraction based on n-gram statistics and partial-grammar filtering (as included in the GUI-TREQ), are effective ways of continuously improving the segmenter's resources and greatly relax the restrictions imposed by the 1:1 mapping hypothesis. Finally, we should note that although TREQ is quite mature, TREQ-AL is under further development and we are confident that there is ample room for future performance improvements.


Acknowledgements

The research on translation equivalents started as an AUPELF/UREF co-operation project with LIMSI/CNRS (CADBFR) and used the multilingual corpus and multilingual lexical resources developed within the MULTEXT-EAST, TELRI and CONCEDE EU projects. The continuous improvements of the methods and tools described in this paper were motivated and supported by two European projects we are currently involved in: FF-POIROT (IST-2001-38248) and BALKANET (IST-2000-29388). We are grateful to the editor of this issue and to an anonymous reviewer, who did a great job in improving the content and the readability of this paper. All the remaining problems are entirely ours.

References

Ahrenberg, L., Andersson, M., Merkel, M. (2000). "A knowledge-lite approach to word alignment", in Véronis, J. (ed.), Parallel Text Processing. Text, Speech and Language Technology Series, Vol. 13, Kluwer Academic Publishers, pp. 97-116.

Brants, T. (2000). "TnT - A Statistical Part-of-Speech Tagger", in Proceedings of ANLP-2000, April 29 - May 3, Seattle, WA.

Brew, C., McKelvie, D. (1996). "Word-pair extraction for lexicography." Available at http://www.ltg.ed.ac.uk/~chrisbr/papers/nemplap96.

Brown, P. F., Della Pietra, S. A., Della Pietra, V. J. and Mercer, R. L. (1993). "The mathematics of statistical machine translation: parameter estimation", Computational Linguistics, 19/2, pp. 263-311.

Dunning, T. (1993). "Accurate Methods for the Statistics of Surprise and Coincidence", Computational Linguistics, 19/1, pp. 61-74.

35 Computers and the Humanities Volume 38, Issue 2, May 2004, pp. 163-198 © 2004. Kluwer Academic Publishers. Printed in the Netherlands.

TUFIŞ, BARBU, ION: EXTRACTING MULTILINGUAL LEXICONS

36

Dimitrova, L., Erjavec, T., Ide, N., Kaalep, H., Petkevic, V. and Tufiş, D. (1998). "Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and East European Languages", in Proceedings of ACL-COLING'1998, Montreal, Canada, pp. 315-319.

Gale, W. A. and Church, K. W. (1991). "Identifying word correspondences in parallel texts", in Fourth DARPA Workshop on Speech and Natural Language, pp. 152-157.

Gale, W. A. and Church, K. W. (1993). "A Program for Aligning Sentences in Bilingual Corpora", Computational Linguistics, 19/1, pp. 75-102.

Erjavec, T. (ed.) (2001). "Specifications and Notations for MULTEXT-East Lexicon Encoding", Multext-East/Concede Edition, March 21, 210 pages. Available at http://nl.ijs.si/ME/V2/msd/html/.

Erjavec, T., Lawson, A., Romary, L. (1998). East Meets West: A Compendium of Multilingual Resources. TELRI-MULTEXT EAST CD-ROM, ISBN 3-922641-46-6.

Erjavec, T., Ide, N. (1998). "The Multext-East corpus", in Proceedings of LREC'1998, Granada, Spain, pp. 971-974.

Hiemstra, D. (1997). "Deriving a bilingual lexicon for cross language information retrieval", in Proceedings of Gronics, pp. 21-26.

Ide, N., Veronis, J. (1995). "Corpus Encoding Standard", MULTEXT/EAGLES Report. Available at http://www.lpl.univ-aix.fr/projects/multext/CES/CES1.html.

Kay, M., Röscheisen, M. (1993). "Text-Translation Alignment", Computational Linguistics, 19/1, pp. 121-142.

Kupiec, J. (1993). "An algorithm for finding noun phrase correspondences in bilingual corpora", in Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp. 17-22.

Melamed, D. (2001). Empirical Methods for Exploiting Parallel Texts. The MIT Press, Cambridge, Massachusetts, 195 pages.


Mihalcea, R., Pedersen, T. (2003). "An Evaluation Exercise for Word Alignment". In Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, May-June, Edmonton, Canada, pp. 1-10.

Mititelu, C. (2003). TREQ User Manual. Technical Report, RACAI, May, 25 pages.
Smadja, F., McKeown, K.R. and Hatzivassiloglou, V. (1996). "Translating collocations for bilingual lexicons: A statistical approach". In Computational Linguistics, 22/1, pp. 1-38.
Stamou, S., Oflazer, K., Pala, K., Christodoulakis, D., Cristea, D., Tufiş, D., Koeva, S., Totkov, G., Dutoit, D., Grigoriadou, M. (2002). "BALKANET: A Multilingual Semantic Network for the Balkan Languages". In Proceedings of the International Wordnet Conference, Mysore, India, 21-25 January.
Tufiş, D., Barbu, A.M., Pătraşcu, V., Rotariu, G., Popescu, C. (1997). "Corpora and Corpus-Based Morpho-Lexical Processing". In Tufiş, D., Andersen, P. (eds.), Recent Advances in Romanian Language Technology, Editura Academiei, pp. 35-56.
Tufiş, D., Ide, N., Erjavec, T. (1998). "Standardized Specifications, Development and Assessment of Large Morpho-Lexical Resources for Six Central and Eastern European Languages". In Proceedings LREC'1998, Granada, Spain, pp. 233-240.
Tufiş, D. (2000). "Using a Large Set of Eagles-compliant Morpho-Syntactic Descriptors as a Tagset for Probabilistic Tagging". In Proceedings LREC'2000, Athens, pp. 1105-1112.
Tufiş, D. (2001). "Partial translations recovery in a 1:1 word alignment approach". RACAI Technical Report (in Romanian), 18 pages.
Tufiş, D. (2002). "A cheap and fast way to build useful translation lexicons". In Proceedings of the 19th International Conference on Computational Linguistics, COLING-2002, Taipei, 25-30 August, pp. 1030-1036.

Tufiş, D., Barbu, A.M. (2002). "Revealing translators' knowledge: statistical methods in constructing practical translation lexicons for language and speech processing". In International Journal of Speech Technology, Kluwer Academic Publishers, no. 5, pp. 199-209.


Tufiş, D., Barbu, A.M., Ion, R. (2003). "TREQ-AL: A word alignment system with limited language resources". In Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond, May-June, Edmonton, Canada, pp. 36-39.


NOTES

1. MtSeg has tokenization resources for many Western European languages, further enhanced in the MULTEXT-EAST project (Erjavec and Ide, 1998; Dimitrova et al., 1998; Tufiş et al., 1998) with corresponding resources for Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene.

2. The lexicons were evaluated by Heiki Kaalep of the University of Tartu (ET-EN), Tamás Váradi of the Linguistic Institute of the Hungarian Academy (HU-EN), Ana-Maria Barbu of RACAI (RO-EN) and Tomaž Erjavec of IJS Ljubljana. All of them are gratefully acknowledged.

3. This was necessary because of the way the Gold Standard Alignment dealt with compounds: an expression in Romanian having N words, aligned to its equivalent expression in English containing M words, was represented by N*M word links. In such cases we considered only one lexicon entry instead of N*M.

4. The existing errors in our synset definitions might be the simplest explanation.

5. The programs are written in Perl and we tested them on Unix, Linux and Windows. The graphical user interface of TREQ combines technologies like DHTML, XML, and XSL with the languages HTML, JavaScript, Perl, and PerlScript.

