Why do computer scientists fail to produce an accurate Arabic lexical resource? - لماذا يفشل المهندسون الحاسوبيّون في إنتاج موارد لغوية دقيقة ؟

Alexis Amid Neme and Eric Laporte

In an automatic morphological analysis of English texts, a program provides the singular form related to each plural form. What about Arabic? For broken plurals, this is often not the case. For example, if the word knives occurs in an English text, a morphological program automatically gives the analysis knives = {singular="knife", number="plural"}; similarly, if OasoliHap (weapons) occurs in an Arabic text (see note 1), programs should give OasoliHap = {singular="silaAH", number="broken_plural"} [سِلاح، أَسْلِحَة].

Morphological analysis is usually the first step in a Natural Language Processing (NLP) pipeline. Out of every 10 plural forms in Arabic texts, approximately 3 are regular and 7 are broken plurals, so there is no way to avoid or circumvent the problem.

A broken plural form is a non-suffixal plural obtained by interdigitating a root into a pattern. The root consists of consonants only. The pattern is a combination of vowels and sometimes consonants too, with "slots" for the root consonants. Native speakers intuitively relate singular and broken plural forms, for instance silaAH - OasoliHap (weapon - weapons) [سِلاح، أَسْلِحَة]. Most educated native speakers are able to explain how the plural form OasoliHap is generated from its singular form, using basic knowledge of traditional morphology, as follows:

- Extract the root consonants from the singular: silaAH => slH
- Apply the plural pattern OaFoEiLap to the root, where F, E and L mark the slots for the root consonants and "o" denotes the zero-vowel or sukuun; in other words, intersect the two strings: [slH & OaFoEiLap] => OasoliHap, or [slH & Oa1o2i3ap] => OasoliHap, where the digits denote the positions of the root consonants.
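The two steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names are hypothetical, and the root is extracted by simply dropping the vowel symbols (a, i, u, the sukuun "o" and the long-vowel letter A), which happens to work for this example but is not a general root-extraction method.

```python
# Symbols treated as non-root material in this toy example (an assumption):
# short vowels, the sukuun "o", and the long-vowel alef "A".
VOWEL_SYMBOLS = set("aiuoA")

# In a pattern, F, E and L (or the digits 1, 2, 3) are slots for root consonants.
SLOTS = {"F": 0, "E": 1, "L": 2, "1": 0, "2": 1, "3": 2}


def extract_root(singular: str) -> str:
    """Keep only the consonants of a transliterated singular form."""
    return "".join(ch for ch in singular if ch not in VOWEL_SYMBOLS)


def interdigitate(root: str, pattern: str) -> str:
    """Fill the slots of the pattern with the root consonants, in order."""
    return "".join(root[SLOTS[ch]] if ch in SLOTS else ch for ch in pattern)


root = extract_root("silaAH")              # "slH"
print(interdigitate(root, "OaFoEiLap"))    # OasoliHap
print(interdigitate(root, "Oa1o2i3ap"))    # OasoliHap
```

The digit notation avoids any ambiguity between the slot letters and the transliteration letters, since E also stands for a consonant in the transliteration.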

1. In this transliteration, upper-case and lower-case letters, e.g. E and e, denote distinct, independent consonants, and "o" denotes the zero-vowel or sukuun: ء, c; آ, C; أ, O; ؤ, W; إ, I; ئ, e; ا, A; ب, b; ة, p; ت, t; ث, v; ج, j; ح, H; خ, x; د, d; ذ, J; ر, r; ز, z; س, s; ش, M; ص, S; ض, D; ط, T; ظ, Z; ع, E; غ, g; ف, f; ق, q; ك, k; ل, l; م, m; ن, n; ه, h; و, w; ى, Y; ي, y; ـً, F; ـٌ, N; ـٍ, K; ـَ, a; ـُ, u; ـِ, i; ـّ, G; ـْ, o.
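The transliteration can be turned into a small conversion table. The sketch below assumes that the Arabic side of the table follows the standard Unicode order of Arabic letters and diacritics (U+0621 to U+063A, then U+0641 to U+0652), which is what the order of the Roman symbols suggests; the names ROMAN, TO_ARABIC, to_arabic and to_roman are hypothetical.

```python
# Roman symbols in the order given in the note.
ROMAN = "cCOWIeAbptvjHxdJrzsMSDTZEgfqklmnhwYyFNKauiGo"

# Assumed Arabic counterparts: hamza .. ghain (U+0621-U+063A),
# then feh .. sukun (U+0641-U+0652); the tatweel U+0640 is skipped.
ARABIC = [chr(cp) for cp in range(0x0621, 0x063B)] + \
         [chr(cp) for cp in range(0x0641, 0x0653)]

TO_ARABIC = dict(zip(ROMAN, ARABIC))
TO_ROMAN = {arabic: roman for roman, arabic in TO_ARABIC.items()}


def to_arabic(text: str) -> str:
    """Convert a transliterated string to Arabic script, symbol by symbol."""
    return "".join(TO_ARABIC.get(ch, ch) for ch in text)


def to_roman(text: str) -> str:
    """Convert Arabic script back to the transliteration."""
    return "".join(TO_ROMAN.get(ch, ch) for ch in text)


print(to_arabic("OasoliHap"))   # the vowelled Arabic spelling of "weapons"
print(to_arabic("OaeimGap"))    # the vowelled Arabic spelling of "imams"
```

Because the mapping is one symbol to one character, the conversion is reversible: to_roman(to_arabic(x)) returns x for any string written in the 44 symbols.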

However, broken plurals are one of the most difficult chapters of Arabic morphology. Among the 25 possible broken plural patterns, only educated speakers can explain why a given pattern is chosen, here OaFoEiLap, and even then their explanations are likely to be confused. Given the singular IimaAm (imam), native speakers immediately provide the broken plural OaeimGap (imams), and may also notice that the singular and plural patterns are the same as in the previous example. Yet even educated speakers cannot explain the form OaeimGap on the basis of its singular form. Only experts in morpho-phonology can explain the sequence of operations that yields the plural form [إِمام، أَئِمَّة]:

- Extract the root consonants from the singular: IimaAm => Imm => {h}mm, where {h} represents the 5 possible orthographic forms of the hamza, or glottal stop.
- Apply the plural pattern OaFoEiLap to the root, then adjust the result step by step:
  [Imm & OaFoEiLap] => Oa{h}omimap
  Vowel shift before repeated consonant => Oa{h}immap
  Gemination marker => Oa{h}imGap
  Selection of the hamza form => OaeimGap
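The derivation above can be mimicked as an ordered pipeline of rewrite rules. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, each rule is reduced to a single regular expression, and the hamza selection is a toy rule (a following "i" gives e, a following "u" gives W, anything else gives O) assumed only for illustration.

```python
import re


def interdigitate(root, pattern):
    """Fill the digit slots of the pattern with the root segments."""
    return "".join(root[int(ch) - 1] if ch.isdigit() else ch for ch in pattern)


def vowel_shift(form):
    """o C V C -> V C C when the same consonant C is repeated around the vowel."""
    return re.sub(r"o(.)([aiu])\1", r"\2\1\1", form, count=1)


def mark_gemination(form):
    """Replace the second of two identical consonants by the marker G."""
    return re.sub(r"([^aiuoA{}])\1", r"\1G", form)


def select_hamza(form):
    """Toy rule (assumption): choose the hamza grapheme from the next symbol."""
    def choose(match):
        following = form[match.end():match.end() + 1]
        return {"i": "e", "u": "W"}.get(following, "O")
    return re.sub(r"\{h\}", choose, form)


def derive_plural(root, pattern="Oa1o2i3ap"):
    """Return every intermediate form, applying the rules in a fixed order."""
    forms = [interdigitate(root, pattern)]
    for rule in (vowel_shift, mark_gemination, select_hamza):
        forms.append(rule(forms[-1]))
    return forms


print(derive_plural(["{h}", "m", "m"]))
# ['Oa{h}omimap', 'Oa{h}immap', 'Oa{h}imGap', 'OaeimGap']
```

Under this toy rule, selecting the hamza form too early, directly on Oa{h}omimap, picks a different grapheme because {h} is then followed by the sukuun rather than by "i", which illustrates why the order of the rules matters.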

These operations and the preconditions of their application involve plenty of details. Furthermore, the rules need to be applied in a specific order: here, you can select the hamza grapheme only after you insert the gemination marker. In other words, you must be acquainted with many recipes and with the many ingredients of each recipe, and you cannot fry the beef before adding and heating the butter. Moreover, a definiteness and case suffix must be added to the noun, chosen among 3x3 possibilities. Finally, a final letter called "teh marbutah", pronounced "t" and noted {p}, must be replaced by a bare "t" when a possessive pronoun is added at the end, as for example with the genitive case and the 3rd person masculine plural possessive pronoun in OasoliHapi - OasoliHatihim [أَسْلِحَةِ - أَسْلِحَتِهِم]. The endings of a plural surface form may be represented by the formula: OaeimGa{p}{definiteness-case-suffix}[]. Such facts are so numerous in Arabic morpho-phonology that no accurate orthographic resource can dispense with such detailed descriptions.

An accurate Arabic lexical resource requires detailed morpho-phonological descriptions, and hence a dedicated expert. The expert deals with surface and lexical representations, abstract morphemes such as patterns and roots, and related operations such as phoneme deletion, insertion, substitution, shift, and reduplication. The expert also has to deal with case suffixes, orthographic variations, spelling adjustments, agglutination grammars, etc. An accurate orthographic lexical resource should include variations for verbs, nouns, and adjectives in all their inflected forms: voice, tense, person, mood, gender, number, definiteness, and case.
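The teh marbutah adjustment mentioned above can be sketched as follows. This is a hypothetical helper, not part of any existing tool: it assumes that the stem is given in the transliteration with a final "p", that the case ending is passed as a single vowel symbol, and that the possessive pronoun is passed in its surface form (here "him").

```python
def attach_ending(stem: str, case_vowel: str, pronoun: str = "") -> str:
    """Add a case vowel and an optional possessive pronoun to a noun stem.

    When a pronoun follows, a stem-final teh marbutah "p" is written
    as a bare "t", as in the example of the text.
    """
    if pronoun and stem.endswith("p"):
        stem = stem[:-1] + "t"
    return stem + case_vowel + pronoun


print(attach_ending("OasoliHap", "i"))          # OasoliHapi
print(attach_ending("OasoliHap", "i", "him"))   # OasoliHatihim
```

The sketch covers only this one spelling adjustment; the 3x3 definiteness and case endings are not enumerated here because the text does not list them.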

Now, imagine managing 80 000 lexical entries in an Arabic dictionary, 10 000 of them broken plural nouns. To return to our opening question: why not relate a broken plural to its singular form by annotating the plural form, as in English? It is not surprising that automatic morphological analysers for Arabic usually do not associate a broken plural with the corresponding singular, since the matter is very tricky. Without a dedicated expert in morpho-phonology, computer scientists simply cannot handle 80 000 lexical entries with full attention to morpho-phonological formalization, using the syntactic and morpho-phonological definitions and concepts and the software devices proper to the field, and sort them out in order to produce an accurate Arabic lexical resource.

Next article: Do computer scientists question the traditional model of Arabic morphology? (What to keep and what to drop from the model?)
