Disambiguating discourse connectives using parallel corpora: senses vs. translations
Thomas Meyer, Andrei Popescu-Belis (Idiap Research Institute)
Bruno Cartoni, Sandrine Zufferey (University of Geneva)
Charlotte Roze, Laurence Danlos (Alpage Group, INRIA and University of Paris VII)
Discourse and Corpus Linguistics 2011, Birmingham
Research questions
Does the disambiguation of rhetorical relations signaled by discourse connectives help machine translation?
  discourse connectives in source and target language
  textual coherence depends on rhetorical relations
How can discourse connectives be annotated?
  sense annotation
  translation spotting using parallel corpora and contrastive analysis
How to build annotated resources?
  Multilingual database of discourse connectives and its usage in NLP
Variability of connectives in translation
Syntactical constructs in source and target language
EN: So what we want the European Patent Office to do is something on the behalf of the European Commission while the Office itself is not a Community institution.
FR: Aussi ce que nous souhaitons, c'est que l'Office européen des brevets agisse au nom de la Commission européenne tout en n'étant pas une institution communautaire.
Ambiguity in the source language
EN: I have been having fun since this conference started.
FR: J'ai eu plaisir depuis que cette conférence a commencé.
FR-MT: * J'ai eu plaisir car cette conférence a commencé.
Omission of the connective in the target language
EN: Max fell because Peter pushed him.
FR: Max tombait, [_] Peter l'a poussé.
Manual annotation of senses
The Penn Discourse Treebank (PDTB) (Prasad et al., 2008)
100 types of explicit and implicit connectives annotated in the WSJ corpus with multiple senses and their arguments from a hierarchy of 129 possible sense combinations
Example: [ In addition, its machines are typically easier to operate Arg1], [ so [Contingency:Cause:result] customers require less assistance from software Arg2].
The LexConn database (Roze et al., 2010)
Database of 330 French connectives with 30 relations (~ RST) and examples
Example: Tandis que [Background] tu restais silencieux et contemplatif, ton frère et ta sœur se parlaient entre eux.
Manual sense annotation in Europarl
Annotation of the French connective alors que
423 occurrences
Labels: Background, Contrast
2 annotators, agreement (kappa) = 0.43
Ambiguous example:
FR: La monnaie unique va entrer en vigueur au milieu de la tourmente financière, alors que de nombreux compléments...n'ont pas été apportés.
(EN: The single currency is going to come into force in the midst of financial turmoil, while a great many additional factors...have not been taken into consideration.)
Annotation of the English connective since
727 occurrences
Labels: Temporal, Causal, Temporal/Causal
3 annotators, agreement (kappa) = 0.77
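The agreement figures above are kappa values, which correct raw agreement for the agreement expected by chance. As a minimal sketch for the two-annotator case, Cohen's kappa can be computed as below. The slides do not state which kappa variant was used (in particular for the three-annotator figure), and the labels in the example are hypothetical toy data, not the Europarl annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels, NOT the annotations reported on the slides.
annotator_1 = ["Contrast", "Contrast", "Background", "Background", "Contrast", "Background"]
annotator_2 = ["Contrast", "Background", "Background", "Background", "Contrast", "Contrast"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 4 of 6 items, but half of that agreement is expected by chance, so kappa is well below the raw agreement rate.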
Translation spotting
First step: manual annotation of the translation of discourse connectives in bi-phrases
two annotators highlight the translation (or ε, or a reformulation) for all occurrences of a given connective
EN: While we have a duty to tackle this problem within EU waters, ultimately this is a problem which requires international action.
FR: Bien que nous ayons le devoir de traiter ce problème au niveau des eaux de l'UE, il s'agit en dernier ressort d'un problème qui exige des actions au niveau international.
Spotted translation: bien que
EN: No wonder Richard Holbrooke recently boasted that Europe slept while President Clinton resolved a particular European crisis.
FR: Il n'y a dès lors rien d'étonnant à ce que M. Richard Holbrooke nous ait récemment nargués en disant que l'Europe dormait pendant que le président Clinton résolvait une crise européenne particulière.
Spotted translation: pendant que
Translation spotting (2)
Experiments on the Europarl Corpus (Koehn, 2005)
Extraction of directional corpora (Cartoni and Meyer, 2011)
while (489), translation EN-FR
Clusters
56 %: tout en V-gerund (22 %), tant que (22 %), tandis que (11 %)
30 %: tandis que (56 %), alors que (40 %)
14 %: même si (100 %)
although (347), translation EN-DE
Clusters
76.7 %: obwohl (74 %), zwar (9 %), auch wenn (9 %)
23.3 %: obgleich (43 %), obwohl (29 %)
Translation spotting (3)
Second step: manual clustering of senses
Finding a theory-independent minimal set of semantic labels necessary for a correct automated translation
Example for while
French translation | Frequency | Clustered sense
alors que | 18 % | Contrast/Temporal
si / même si / bien que / s'il est vrai que | 25 % | Concession
tandis que / mais | 9 % | Contrast
tant que | 2 % | Temporal/Causal
pendant | 1 % | Temporal/Duration
puisque | 1 % | Temporal/Causal
lorsque | 0.8 % | Temporal/Punctual
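The clustering step amounts to grouping spotted translations under a shared sense label and adding up their frequencies. The sketch below does this for the rows of the while table above. The function name and data layout are illustrative assumptions, not part of the authors' tooling.

```python
from collections import defaultdict

# (French translation, frequency in %, clustered sense), copied from the
# "while" table above.
spotted = [
    ("alors que", 18.0, "Contrast/Temporal"),
    ("si / même si / bien que / s'il est vrai que", 25.0, "Concession"),
    ("tandis que / mais", 9.0, "Contrast"),
    ("tant que", 2.0, "Temporal/Causal"),
    ("pendant", 1.0, "Temporal/Duration"),
    ("puisque", 1.0, "Temporal/Causal"),
    ("lorsque", 0.8, "Temporal/Punctual"),
]


def share_per_sense(rows):
    """Sum the translation frequencies that fall under each clustered sense."""
    totals = defaultdict(float)
    for _translation, freq, sense in rows:
        totals[sense] += freq
    return dict(totals)


sense_share = share_per_sense(spotted)
# Share of occurrences accounted for by the rows listed in the table.
covered = sum(freq for _translation, freq, _sense in spotted)
```

The listed rows add up to 56.8 % of the occurrences, so the table shows only part of the observed translations.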
List of translation-spotted connectives (EN -> FR)
Connective | Senses | Number of annotated sentences | Created resources (after correction)
while | contrast/temporal, concession, contrast, temporal (dur, punct, causal) | 499 | 294 bi-sentences
although | contrast, concession | 197 | 183 bi-sentences
though | contrast, concession | 200 | 155 bi-sentences
even though | contrast, concession | 212 | 191 bi-sentences
since | causal1 + causal2 + causal3 / temporal | 423 | 423 bi-sentences (C, T, T/C); 414 bi-sentences (C, C1, C2, T)
as | preposition or connective (and then, causal, concession, comparison, temporal) | 600 | 600 bi-sentences
Total: 1846 bi-sentences
List of translation-spotted connectives (FR -> EN)
Connective | Senses | Number of annotated sentences | Created resources (after correction)
dans la mesure où | causal / explanation | 175 | 150 bi-sentences
alors que | contrast, temporal, temp/contrast | 423 | 366 bi-sentences (T, C, T/C)
bien que | concession/contrast | 55 | 51 bi-sentences
pourtant | contrast/concession | 312 | 250 bi-sentences
Total: 817 bi-sentences
Semi-automated translation spotting
see Danlos and Roze, 2011
use TransSearch (Huet et al., 2009) on bi-sentences from the Hansard Corpus (Roukos et al., 1995)
manually correct its mistakes
biggest problem: translations like ε → connective and connective → ε
Example for the French connective en effet:
Correct automated transpots: FR: 68%, EN: 57%
Translation of/into "en effet" | Original French | Original English
indeed | 713 (33%) | 532 (26.9%)
ε | 574 (26.6%) | 570 (28.8%)
in fact | 669 (31%) | 410 (20.7%)
Multilingual tables based on Europarl data
EN connective | Sense | DE translation | Frequency | Rank
whereas | contrast | während | 60% | 1
whereas | contrast | [[]] | 20% | 2
FR connective | Sense | EN translation | Frequency | Rank
tandis que | contrast | whereas | 44% | 1
tandis que | contrast | while | 33% | 2
tandis que | temporal | while | 83% | 1
EN connective | Sense | DE translation | Frequency | Rank
while | temporal | wenn | 22.22% | 1
while | temporal | [[]] | 22.22% | 1
Building a multilingual database of connectives
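Rows of the kind shown in the multilingual tables (connective, sense, translation, frequency, rank) can be derived from sense-annotated, translation-spotted occurrences by counting. The following is a minimal sketch under that assumption. The function name is hypothetical, the observations are toy data rather than Europarl counts, and letting tied translations share a rank is a design choice made here, not something the slides specify.

```python
from collections import Counter, defaultdict


def build_table(observations):
    """Turn (connective, sense, translation) triples into ranked table rows."""
    grouped = defaultdict(Counter)
    for connective, sense, translation in observations:
        grouped[(connective, sense)][translation] += 1

    rows = []
    for (connective, sense), counts in sorted(grouped.items()):
        total = sum(counts.values())
        # Most frequent translation first; alphabetical order breaks ties.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        rank, previous = 0, None
        for position, (translation, count) in enumerate(ranked, start=1):
            if count != previous:  # tied translations share a rank
                rank, previous = position, count
            frequency = round(100 * count / total, 2)
            rows.append((connective, sense, translation, frequency, rank))
    return rows


# Hypothetical toy observations, NOT counts from the Europarl data.
observations = (
    [("tandis que", "temporal", "while")] * 5
    + [("tandis que", "temporal", "as")] * 1
    + [("tandis que", "contrast", "whereas")] * 2
    + [("tandis que", "contrast", "while")] * 2
)
table = build_table(observations)
```

Frequencies are computed within each (connective, sense) pair, so the same translation can rank differently depending on the sense it conveys.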
Using the multilingual database in NLP
Automated disambiguation of connectives
  also of use in discourse parsing: look-up list, rhetorical relations signaled, features from other languages
Machine learning of translationese (see Ilisei et al., 2010; Baroni and Bernardini, 2005)
Integration into SMT processes
  modification of phrase tables
  training and testing with (automatically) annotated corpora
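The slides do not say how annotated corpora are fed to the SMT system. One simple possibility, shown here purely as an assumption, is to attach the sense label to the connective token before training, so that each sense can receive its own phrase-table entries. The `token_sense` format and the function name are hypothetical.

```python
def tag_connectives(tokens, senses):
    """Append a sense label to each annotated connective token.

    tokens: word tokens of one source sentence.
    senses: dict mapping a token index to its sense label, e.g. produced by
            manual annotation or an automatic classifier (assumed input).
    """
    return [
        f"{token}_{senses[index]}" if index in senses else token
        for index, token in enumerate(tokens)
    ]


# The ambiguous "since" example from the slides, labelled by hand here.
sentence = "I have been having fun since this conference started .".split()
tagged = tag_connectives(sentence, {5: "temporal"})
tagged_text = " ".join(tagged)
```

With the label attached, a temporal since and a causal since become distinct source tokens, which is what would allow a system to learn depuis que for one and car for the other.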
Conclusions
Discourse connectives are relevant in translation
The direction of translation should always be taken into account when using parallel corpora (Cartoni and Meyer, 2011)
Variation and explicitation of connectives in translation (Cartoni et al., 2011)
Contrastive analyses reveal interesting facts
Translation spotting is a more reliable approach than sense annotation (in the MT context)
The multilingual database of connectives is linguistically descriptive, useful for second language learning, and can be integrated into several NLP tasks
References
Baroni M., Bernardini S. (2005). A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing 21(3).
Cartoni, B., Meyer, T. (2011). Building "directional corpora" for unbiased contrastive analysis. Proceedings of Discourse and Corpus Linguistics 2011, Birmingham, UK.
Cartoni, B., Zufferey, S., Meyer, T., Popescu-Belis, A. (2011). How Comparable are Parallel Corpora? Measuring the Distribution of General Vocabulary and Connectives. Proceedings of 4th Workshop on Building and Using Comparable Corpora, Portland, OR.
Danlos, L., Roze, C. (2011). Traduction (automatique) des connecteurs de discours. In Actes de la 18è Conférence sur le Traitement Automatique des Langues Naturelles (TALN), Montpellier, FR.
Huet, S., Julien, B., Langlais, P. (2009). Intégration de l’alignement de mots dans le concordancier bilingue TransSearch. In Actes de la 16è Conférence sur le Traitement Automatique des Langues Naturelles (TALN), Senlis, France.
Ilisei, I., Inkpen, D., Corpas Pastor, G., Mitkov, R. (2010). Identification of Translationese: A Machine Learning Approach. In Gelbukh, A. (Ed), Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science. Springer, Berlin / Heidelberg.
Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., Webber, B. (2008). The Penn Discourse Treebank 2.0. Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC).
Roze, C., Danlos, L., Muller, P. (2010). LEXCONN: a French Lexicon of Discourse Connectives. Proceedings of Multidisciplinary Approaches to Discourse (MAD).
Roukos, S., Graff, D., Melamed, D. (1995). Hansard French/English. Linguistic Data Consortium, Philadelphia.