Disambiguating discourse connectives using parallel corpora: senses vs. translations

July 6, 2017 | Author: Andrei Popescu-Belis | Category: Corpus

Thomas Meyer, Andrei Popescu-Belis
Idiap Research Institute

Bruno Cartoni, Sandrine Zufferey
University of Geneva

Charlotte Roze, Laurence Danlos
Alpage Group, INRIA and University of Paris VII

Discourse and Corpus Linguistics 2011, Birmingham

Research questions

- Does the disambiguation of rhetorical relations signaled by discourse connectives help machine translation?
  - discourse connectives in source and target language
  - textual coherence depends on rhetorical relations
- How can discourse connectives be annotated?
  - sense annotation
  - translation spotting
  - using parallel corpora and contrastive analysis
- How to build annotated resources?
  - multilingual database of discourse connectives and its usage in NLP

Variability of connectives in translation

- Syntactic constructs in source and target language
  EN: So what we want the European Patent Office to do is something on the behalf of the European Commission while the Office itself is not a Community institution.
  FR: Aussi ce que nous souhaitons, c'est que l'Office européen des brevets agisse au nom de la Commission européenne tout en n'étant pas une institution communautaire.
- Ambiguity in the source language
  EN: I have been having fun since this conference started.
  FR: J'ai eu plaisir depuis que cette conférence a commencé.
  FR-MT: * J'ai eu plaisir car cette conférence a commencé.
- Omission of the connective in the target language
  EN: Max fell because Peter pushed him.
  FR: Max tombait, [_] Peter l'a poussé.

Manual annotation of senses

The Penn Discourse Treebank (PDTB) (Prasad et al., 2008)
- 100 types of explicit and implicit connectives annotated in the WSJ corpus
- with multiple senses and their arguments, from a hierarchy of 129 possible sense combinations
- Example: [In addition, its machines are typically easier to operate Arg1], [so [Contingency:Cause:result] customers require less assistance from software Arg2].

The LexConn database (Roze et al., 2010)
- Database of 330 French connectives with 30 relations (~ RST) and examples
- Example: Tandis que [Background] tu restais silencieux et contemplatif, ton frère et ta sœur se parlaient entre eux.

Manual sense annotation in Europarl

Annotation of the French connective alors que
- 423 occurrences
- Labels: Background, Contrast
- 2 annotators: agreement (kappa) = 0.43
- Ambiguous example:
  FR: La monnaie unique va entrer en vigueur au milieu de la tourmente financière, alors que de nombreux compléments...n'ont pas été apportés.
  (EN: The single currency is going to come into force in the midst of financial turmoil, while a great many additional factors...have not been taken into consideration.)

Annotation of the English connective since
- 727 occurrences
- Labels: Temporal, Causal, Temporal/Causal
- 3 annotators: agreement (kappa) = 0.77
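The agreement figures above are kappa values. As a minimal sketch of what such a figure measures, the snippet below computes Cohen's kappa for two annotators. The slides do not say which kappa variant was used (the three-annotator case for since would need a multi-rater variant), and the label sequences are invented toy data, not the Europarl annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items that received the same label twice.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented toy labels (assumption), in the style of the alors que annotation.
annotator_1 = ["Contrast", "Contrast", "Background", "Background", "Contrast", "Background"]
annotator_2 = ["Contrast", "Background", "Background", "Background", "Contrast", "Contrast"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

Kappa discounts the agreement expected by chance, which is why a two-label task with moderate raw agreement can still yield a low value.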

Translation spotting

First step: manual annotation of the translation of discourse connectives in bi-phrases
- two annotators highlight the translation (or ε, or reformulations) for all occurrences of a certain connective

Example with spotted translation "bien que":
EN: While we have a duty to tackle this problem within EU waters, ultimately this is a problem which requires international action.
FR: Bien que nous ayons le devoir de traiter ce problème au niveau des eaux de l'UE, il s'agit en dernier ressort d'un problème qui exige des actions au niveau international.

Example with spotted translation "pendant que":
EN: No wonder Richard Holbrooke recently boasted that Europe slept while President Clinton resolved a particular European crisis.
FR: Il n'y a dès lors rien d'étonnant à ce que M. Richard Holbrooke nous ait récemment nargués en disant que l'Europe dormait pendant que le président Clinton résolvait une crise européenne particulière.

Translation spotting (2)

- Experiments on the Europarl Corpus (Koehn, 2005)
- Extraction of directional corpora (Cartoni and Meyer, 2011)

while (489), translation EN-FR

Clusters                                                       Share
tout en V-gerund (22 %), tant que (22 %), tandis que (11 %)    56 %
tandis que (56 %), alors que (40 %)                            30 %
même si (100 %)                                                14 %

although (347), translation EN-DE

Clusters                                                       Share
obwohl (74 %), zwar (9 %), auch wenn (9 %)                     76.7 %
obgleich (43 %), obwohl (29 %)                                 23.3 %

Translation spotting (3)

Second step: manual clustering of senses
- Finding a theory-independent minimal set of semantic labels necessary for a correct automated translation
- Example for while:

French translation                            Frequency   Clustered sense
alors que                                     18 %        Contrast/Temporal
si / même si / bien que / s'il est vrai que   25 %        Concession
tandis que / mais                             9 %         Contrast
tant que                                      2 %         Temporal/Causal
pendant                                       1 %         Temporal/Duration
puisque                                       1 %         Temporal/Causal
lorsque                                       0.8 %       Temporal/Punctual
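The clustering table for while can be read as a look-up from a spotted French translation to a sense label. The sketch below illustrates that reading only; the function names, the "UNLABELLED" fallback, and the toy list of spotted translations are assumptions, not part of the authors' tooling.

```python
from collections import Counter

# Clustered senses of English "while", keyed by the spotted French
# translation (entries copied from the table for "while").
WHILE_SENSES = {
    "alors que": "Contrast/Temporal",
    "si": "Concession",
    "même si": "Concession",
    "bien que": "Concession",
    "s'il est vrai que": "Concession",
    "tandis que": "Contrast",
    "mais": "Contrast",
    "tant que": "Temporal/Causal",
    "pendant": "Temporal/Duration",
    "puisque": "Temporal/Causal",
    "lorsque": "Temporal/Punctual",
}


def sense_from_translation(translation, table=WHILE_SENSES):
    """Return the clustered sense for a spotted translation, or None."""
    key = " ".join(translation.lower().split())  # normalise case and spacing
    return table.get(key)


def label_occurrences(spotted_translations):
    """Count sense labels over a list of spotted translations."""
    counts = Counter()
    for translation in spotted_translations:
        counts[sense_from_translation(translation) or "UNLABELLED"] += 1
    return counts


# Invented toy input (assumption): four spotted translations of "while".
sense_counts = label_occurrences(["Tandis que", "même si", "alors  que", "tout en"])
```

Translations that are absent from the table, such as "tout en" here, stay unlabelled rather than being forced into a cluster.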

List of translation-spotted connectives (EN -> FR)

Connective    Senses                                                                           Annotated sentences   Created resources (after correction)
while         contrast/temporal, concession, contrast, temporal (dur, punct, causal)           499                   294 bi-sentences
although      contrast, concession                                                             197                   183 bi-sentences
though        contrast, concession                                                             200                   155 bi-sentences
even though   contrast, concession                                                             212                   191 bi-sentences
since         causal1 + causal2 + causal3 / temporal                                           423                   423 bi-sentences (C, T, T/C); 414 bi-sentences (C, C1, C2, T)
as            preposition or connective (and then, causal, concession, comparison, temporal)   600                   600 bi-sentences
Total                                                                                                                1846 bi-sentences

List of translation-spotted connectives (FR -> EN)

Connective          Senses                              Annotated sentences   Created resources (after correction)
dans la mesure où   causal / explanation                175                   150 bi-sentences
alors que           contrast, temporal, temp/contrast   423                   366 bi-sentences (T, C, T/C)
bien que            concession/contrast                 55                    51 bi-sentences
pourtant            contrast/concession                 312                   250 bi-sentences
Total                                                                         817 bi-sentences

Semi-automated translation spotting

- see Danlos and Roze, 2011
- use TransSearch (Huet et al., 2009) on bi-sentences from the Hansard Corpus (Roukos et al., 1995)
- manually correct its mistakes
- biggest problem: translations like ε → connective and connective → ε
- Example for the French connective en effet: correct automated transpots: FR: 68%, EN: 57%

Translation of/into "en effet"

Translation   Original French   Original English
indeed        713 (33%)         532 (26.9%)
ε             574 (26.6%)       570 (28.8%)
in fact       669 (31%)         410 (20.7%)

Multilingual tables based on Europarl data

EN connective   Sense      DE translation   Frequency   Rank
whereas         contrast   während          60%         1
whereas         contrast   [[]]             20%         2

FR connective   Sense      EN translation   Frequency   Rank
tandis que      contrast   whereas          44%         1
tandis que      contrast   while            33%         2
tandis que      temporal   while            83%         1

Building a multilingual database of connectives

EN connective   Sense      DE translation   Frequency   Rank
while           temporal   wenn             22.22%      1
while           temporal   [[]]             22.22%      1
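Rows of this shape (connective, sense, translation, frequency, rank) can be derived from sense-labelled, translation-spotted occurrences. The sketch below is one hedged way to do so. The input format and function name are assumptions, the observations are invented toy data, and "wohingegen" is only a placeholder translation. Tied frequencies share a rank, as in the while/temporal rows above; letting the next rank follow without a gap is a choice made here, not something the slides specify.

```python
from collections import Counter, defaultdict


def build_translation_table(observations):
    """Turn (connective, sense, translation) triples into table rows.

    Each row is (connective, sense, translation, frequency_percent, rank).
    Frequencies are computed per (connective, sense) pair; equal counts
    receive the same rank.
    """
    grouped = defaultdict(Counter)
    for connective, sense, translation in observations:
        grouped[(connective, sense)][translation] += 1

    rows = []
    for (connective, sense), counts in sorted(grouped.items()):
        total = sum(counts.values())
        rank, previous_count = 0, None
        # Most frequent translation first; ties broken alphabetically.
        for translation, count in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
            if count != previous_count:
                rank += 1
                previous_count = count
            rows.append((connective, sense, translation, round(100 * count / total, 2), rank))
    return rows


# Invented toy observations (assumption); "[[]]" is copied from the tables
# as an opaque translation value.
toy = (
    [("whereas", "contrast", "während")] * 3
    + [("whereas", "contrast", "[[]]")]
    + [("whereas", "contrast", "wohingegen")]
)
rows = build_translation_table(toy)
```

Grouping by (connective, sense) rather than by connective alone is what allows the same translation, such as while for tandis que, to appear with different frequencies under different senses.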

Using the multilingual database in NLP

- Automated disambiguation of connectives
  - also of use in discourse parsing:
    - look-up list
    - rhetorical relations signaled
    - features from other languages
- Machine learning of translationese (see Ilisei et al., 2010; Baroni and Bernardini, 2005)
- Integration into SMT processes
  - modification of phrase tables
  - training and testing with (automatically) annotated corpora
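One possible reading of "training and testing with (automatically) annotated corpora" is to mark each connective in the source text with a sense label before the text reaches the SMT system. The sketch below illustrates only that idea: the `token|sense` tag format, the function names, and the stand-in classifier are assumptions, and multi-word connectives are not handled.

```python
import re


def tag_connectives(sentence, classify, connectives=("while", "since", "although")):
    """Append a sense tag to each listed connective, e.g. 'since|temporal'.

    `classify(connective, sentence)` returns a sense label or None; a real
    system would plug in a trained disambiguation model here.
    """
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in connectives) + r")\b",
        re.IGNORECASE,
    )

    def add_tag(match):
        token = match.group(0)
        sense = classify(token.lower(), sentence)
        return token if sense is None else f"{token}|{sense}"

    return pattern.sub(add_tag, sentence)


def toy_classifier(connective, sentence):
    """Stand-in classifier (assumption): fixed labels, no real disambiguation."""
    return {"since": "temporal"}.get(connective)


tagged = tag_connectives(
    "I have been having fun since this conference started.", toy_classifier
)
untouched = tag_connectives("While we wait, nothing changes.", toy_classifier)
```

With tags like these, a temporal since and a causal since become distinct source tokens, so they can be learned as separate entries with separate translations.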

Conclusions

- Discourse connectives are relevant in translation
- The direction of translation should always be taken into account when using parallel corpora (Cartoni and Meyer, 2011)
- Variation and explicitation of connectives in translation (Cartoni et al., 2011)
- Contrastive analyses reveal interesting facts
- Translation spotting is a more reliable approach than sense annotation (in the MT context)
- The multilingual database of connectives is linguistically descriptive, useful for second language learning, and can be integrated in several NLP tasks

References

Baroni, M., Bernardini, S. (2005). A new approach to the study of translationese: Machine-learning the difference between original and translated text. Literary and Linguistic Computing 21(3).

Cartoni, B., Meyer, T. (2011). Building "directional corpora" for unbiased contrastive analysis. Proceedings of Discourse and Corpus Linguistics 2011, Birmingham, UK.

Cartoni, B., Zufferey, S., Meyer, T., Popescu-Belis, A. (2011). How Comparable are Parallel Corpora? Measuring the Distribution of General Vocabulary and Connectives. Proceedings of the 4th Workshop on Building and Using Comparable Corpora, Portland, OR.

Danlos, L., Roze, C. (2011). Traduction (automatique) des connecteurs de discours. In Actes de la 18è Conférence sur le Traitement Automatique des Langues Naturelles (TALN), Montpellier, France.

Huet, S., Bourdaillet, J., Langlais, P. (2009). Intégration de l'alignement de mots dans le concordancier bilingue TransSearch. In Actes de la 16è Conférence sur le Traitement Automatique des Langues Naturelles (TALN), Senlis, France.

Ilisei, I., Inkpen, D., Corpas Pastor, G., Mitkov, R. (2010). Identification of Translationese: A Machine Learning Approach. In Gelbukh, A. (Ed.), Computational Linguistics and Intelligent Text Processing, Lecture Notes in Computer Science. Springer, Berlin / Heidelberg.

Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., Webber, B. (2008). The Penn Discourse Treebank 2.0. Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC).

Roze, C., Danlos, L., Muller, P. (2010). LEXCONN: a French Lexicon of Discourse Connectives. Proceedings of Multidisciplinary Approaches to Discourse (MAD).

Roukos, S., Graff, D., Melamed, D. (1995). Hansard French/English. Linguistic Data Consortium, Philadelphia.
