A Trigger-based classifier

Conference paper · April 2009
Available at: http://www.researchgate.net/publication/274082027

Mourad Abbas (1), Kamel Smaili (2) and Daoud Berkani (3)

(1) Speech Processing Laboratory, CRSTDLA, 1 rue Djamel Eddine Alafghani, Algiers 16200, Algeria, [email protected]
(2) INRIA-LORIA, Parole team, B.P. 101, 54602 Villers-les-Nancy, France, [email protected]
(3) Signal and Communication Lab., PNS, 10 rue Hassen Badi, 16200 Algiers, Algeria, [email protected]

Abstract
Topic identification consists in finding, among a set of topics, the topic(s) treated in a piece of text (paragraph, article, ...). It relies on topic training corpora which represent the specificities of each topic. In this paper, we present a new method of topic identification based on computing trigger pairs: the TR-classifier (TRiggers-based classifier). The number of topics to be identified is six: culture, religion, economy, local news, international news and sports. Since the TR-classifier identifies topics by means of trigger pairs, the first step is the construction of a vocabulary for each topic. Vocabularies are composed of words ranked by decreasing frequency, and their sizes are very small: 100, 200 and 300 words. For each word of a vocabulary, the average mutual information (AMI) is computed, and the triggers with the highest AMI values are selected. The corpus used in our experiments is extracted from texts downloaded from the Omani newspaper Alwatan.

Introduction
Natural language can be viewed as a stochastic process in which every document, or other contextual unit of text, is considered a random variable with some probability distribution. Another view, from information theory, considers language as an information source which emits a sequence of symbols from a finite set of elements, the vocabulary (Rosenfeld, 1994). Among the several approaches used for language modelling are those based on history. The choice of such models is not arbitrary: a document's history contains potential information sources which can be exploited by models using a short history, such as n-grams and n-classes, or by those using a long history, which include both triggers and the cache model. The long-distance bigram experiment reported in (Huang et al., 1993) and the Shannon game program implemented at IBM showed that the information present in the longer-distance history is significant (Rosenfeld, 1994). We therefore decided to investigate triggers. To carry out our experiments, we downloaded Arabic texts from the website of the Omani newspaper Alwatan. In the following sections, we present the TR-classifier, a novel method of topic identification (Abbas, 2008). Its main idea is to quantify the relationships between words in order to characterize each topic (Abbas, 2008; Haton et al., 2006; Rosenfeld, 1994; Langlois, 2002). The presence of a word in a text can trigger another word; for example, the word "match" can trigger a multitude of words: "football", "basketball", "referee", etc. Average mutual information measures the correlation between these words, which ultimately allows each topic to be represented by its own characterizing triggers.
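The correlation measure behind triggers is average mutual information. As a rough sketch (assuming the standard four-term AMI of a trigger pair estimated from co-occurrence counts over text windows, in the style of Rosenfeld, 1994; the paper does not spell out the estimator), it could be computed as:

```python
import math

def ami(n_ab, n_a, n_b, n):
    """Average mutual information of a candidate trigger pair (a -> b).

    Counts are taken over n text windows of the training corpus:
      n_ab: windows containing both a and b
      n_a:  windows containing a
      n_b:  windows containing b
    """
    def term(p_xy, p_x, p_y):
        # Contribution p(x,y) * log2( p(x,y) / (p(x) p(y)) ), 0 if degenerate.
        if p_xy <= 0 or p_x <= 0 or p_y <= 0:
            return 0.0
        return p_xy * math.log2(p_xy / (p_x * p_y))

    p_a, p_b, p_ab = n_a / n, n_b / n, n_ab / n
    # Four terms: (a,b), (a,not b), (not a,b), (not a,not b).
    return (term(p_ab, p_a, p_b)
            + term(p_a - p_ab, p_a, 1 - p_b)
            + term(p_b - p_ab, 1 - p_a, p_b)
            + term(1 - p_a - p_b + p_ab, 1 - p_a, 1 - p_b))
```

Independent words yield an AMI near zero, while words that systematically co-occur get a high positive value, which is what the trigger selection exploits.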

TR-classifier
The triggers of a word wk are the set of words that are highly correlated with it. The main idea of the TR-classifier is to compute the average mutual information of each pair of words belonging to the vocabulary Vi. The word pairs, or "triggers", considered important for a topic identification task are those with the highest average mutual information (AMI) values (GuoDong & KimTeng, 1999; Tillman & Ney, 1996). Each topic Ti is then endowed with a number M of selected triggers, computed from its training corpus. Identifying topics with the TR-method consists in:
• Finding the corresponding triggers of each word wk ∈ Vi, where Vi is the vocabulary of topic Ti.
• Selecting the best M triggers that characterize the topic Ti.
• In the test step, extracting, for each word wk of the test document, its corresponding triggers.
• Computing the values Qi using the TR-distance given by equation (1):

Q_i = [ Σ_k AMI(w_k, w_k^i) ] / [ Σ_{l=0}^{n-1} (n − l) ]    (1)
where i stands for the i-th topic and the denominator is a normalization of the AMI computation; the w_k^i are the triggers included in the test document d that characterize the topic Ti.
• The decision to label the test document with topic Ti is obtained by choosing arg max Qi.
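The identification steps above can be sketched as follows. The data layout (a dict per topic mapping trigger pairs to AMI values) and the function name are hypothetical illustrations, not the authors' implementation:

```python
def tr_classify(doc_words, topic_triggers):
    """Score a test document against each topic with the TR-distance.

    doc_words: list of tokens of the test document.
    topic_triggers: dict mapping topic -> {(w, w_prime): ami_value} for
    the M best trigger pairs selected for that topic.
    Returns (best_topic, scores).
    """
    words = set(doc_words)
    n = len(doc_words)
    # Normalization term of equation (1): sum_{l=0}^{n-1} (n - l).
    norm = sum(n - l for l in range(n))
    scores = {}
    for topic, triggers in topic_triggers.items():
        # Sum AMI over the topic's trigger pairs found in the document.
        s = sum(a for (w, wp), a in triggers.items()
                if w in words and wp in words)
        scores[topic] = s / norm
    # Decision: arg max over Qi.
    return max(scores, key=scores.get), scores
```

A toy usage: with sports triggers {("match", "football"): 1.0}, a document containing "match" and "football" scores highest for sports.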

Document representation and vocabulary building
For a topic identification task, large corpora are needed. Thus, we started by downloading Arabic texts from the 2004 archives of the Omani newspaper Alwatan and storing them in a suitable way. The size of the resulting corpus is about 10 million words, corresponding to 9000 articles distributed over the six topics we chose to identify. 90 % of these articles are reserved for training and the rest for evaluation. The studied topics are: culture, religion, economy, local news, international news and sports. We carried out the elementary operations required for the topic detection task, such as eliminating insignificant words and discarding words whose frequencies are below a given threshold. The number of words representing each topic, before and after removing insignificant words, is shown in Table 1.

Topic         N. words before   N. words after
Culture           1,359,210        1,013,703
Religion          3,122,565        2,133,577
Int. news           855,945          630,700
Economy           1,460,462        1,111,246
Loc. news         1,555,635        1,182,299
Sports            1,423,549        1,067,281
Total             9,813,366        7,139,486

Table 1: The size of the Alwatan corpus before and after discarding insignificant words

Experiments and results
The TR-classifier uses topic vocabularies composed of words ranked by decreasing frequency. In these experiments we used very small sizes for the six topic vocabularies: 100, 200 and 300 words. The TR-classifier was evaluated by varying both the topic vocabulary sizes and the triggers number N.
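The preprocessing and vocabulary construction described above can be sketched as follows. The stopword list and the frequency threshold are placeholders: the paper gives neither the actual list of insignificant words nor the threshold value.

```python
from collections import Counter

# Sample Arabic function words; the real stopword list is much larger.
STOPWORDS = {"في", "من", "على", "و"}

def build_vocab(tokens, stopwords=STOPWORDS, min_freq=5, size=100):
    """Build a topic vocabulary from the topic's training tokens:
    remove insignificant words, discard words rarer than min_freq
    (an assumed threshold), and keep the `size` most frequent words,
    ranked by decreasing frequency as in the paper."""
    counts = Counter(t for t in tokens if t not in stopwords)
    ranked = [w for w, c in counts.most_common() if c >= min_freq]
    return ranked[:size]
```

Running this once per topic on its training corpus yields the six ranked vocabularies of size 100, 200 or 300 used by the TR-classifier.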

To build the vocabulary, we used the term frequency method, which consists in calculating the frequency of each word in the corpus. It has been shown in (Yang & Pedersen, 1997) that, although simple, it gives good results. Mutual information (Seymore et al., 1998) and document frequency are also good term selection methods. Some methods use a general vocabulary, whereas for the TR-classifier we built, for each topic, a vocabulary based on its corresponding training corpus. Documents are represented using the well-known bag-of-words method: each word of the document is weighted by an adequate value. The weights are those commonly used in text categorization, particularly for the TFIDF classifier (Joachims, 1996). Hence, after removing non-content words, we calculated both the frequency of each word in a document, called Term Frequency, and the Document Frequency of a word, i.e. the number of documents in which the word occurs at least once. The weight of each term is then the product of Term Frequency and Inverse Document Frequency (Joachims, 1996; Salton, 1991; Seymore & Rosenfeld, 1997). A document d is thus transformed into a compact vector whose dimension corresponds to the size of the vocabulary.
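The TF x IDF weighting described here can be sketched as below. This is a minimal unsmoothed variant; real implementations differ in normalization and smoothing details.

```python
import math
from collections import Counter

def tfidf_vectors(docs, vocab):
    """Represent each document (a list of tokens) as a vector over
    `vocab`, weighting each term by TF * IDF: term frequency in the
    document times log(N / document frequency)."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[w] * math.log(n_docs / df[w]) if df[w] else 0.0
                        for w in vocab])
    return vectors
```

A word occurring in every document gets IDF log(N/N) = 0, so ubiquitous words carry no weight, which is the point of the IDF factor.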

Figure 1: TR-Classifier performance (Recall vs. triggers number) using a vocabulary size of 100

In order to enhance performance, we conducted further experiments in which we increased N. Indeed, for N = 40, the average recall rate reaches 79.44 %, an improvement of 8 %. The values N = 60 and N = 80 led to Recalls of 82.33 % and 83.11 %, respectively. At this point the improvement becomes very slight, so we decided to use larger vocabularies: 200 and 300 words. Performance in terms of Recall using a vocabulary of size 100 is presented in Figure 1.


The choice of N = 20, with a topic vocabulary of size 100, led to an average recall rate of 71.55 %. For some topics we achieved good results. The unsatisfactory Recall values of the other topics are due to their diversity; splitting them into subtopics turns out to be appropriate, even necessary, to obtain the best performance. The three remaining topics, on the other hand, are relatively easy to identify, particularly "Sports", which had a Recall rate of 93.33 %.
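The Recall rates reported here are, per topic, the fraction of that topic's test documents labeled correctly. A minimal sketch of the metric:

```python
def recall_per_topic(gold, predicted):
    """Per-topic recall from parallel lists of gold and predicted
    labels: correct predictions for a topic divided by the number of
    test documents of that topic."""
    out = {}
    for t in set(gold):
        idx = [i for i, g in enumerate(gold) if g == t]
        out[t] = sum(predicted[i] == t for i in idx) / len(idx)
    return out
```

The average recall rates quoted in the text are the mean of these per-topic values over the six topics.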

Figure 2: TR-Classifier performance (Recall vs. triggers number) using a vocabulary size of 200

Figure 3: TR-Classifier performance (Recall vs. triggers number) using a vocabulary size of 300

Using a vocabulary of size 200 led to a slight improvement in Recall of nearly 1 %. This improvement, however, required N = 160 to reach 84 % Recall (see Figure 2). Since a vocabulary size of 200 brought no significant improvement over the previous experiments, we went on changing the vocabulary size and varying the triggers number. We obtained better Recall values with a vocabulary of size 300: for N = 250, the best Recall rate achieved is 89.69 % (Figure 3). As the figures show, if we increase the vocabulary size for a fixed triggers number, performance decreases continuously. Consequently, when the vocabulary size increases, more triggers are needed; increasing the vocabulary size and the triggers number jointly is necessary and leads to better performance.

Conclusion
In this paper, we presented a new method for identifying topics. We tested its performance on an Arabic corpus extracted from an Arabic newspaper. The elementary operations on this corpus, such as removing insignificant words and computing term frequencies, document frequencies and average mutual information, were carried out with a tool we implemented in C++. We have seen that increasing the vocabulary size improves the representativeness of each topic, but this alone is not enough to obtain the best results; a suitable triggers number must also be selected to enhance the method's performance. Considering the very small topic vocabularies we used, we can conclude that the TR-classifier led to satisfactory results.

Bibliographical References
Abbas, M. (2008). Topic Identification for Automatic Speech Recognition. PhD thesis, Electrical and Computer Engineering Department, Polytechnic National School, Algiers.
GuoDong, Z. & KimTeng, L. (1999). Interpolation of n-gram and mutual information based trigger pair language models for Mandarin speech recognition. Computer Speech and Language, 13, 125--141.
Haton, J.P., Cerisara, C., Fohr, D., Laprie, Y. & Smaili, K. (2006). Reconnaissance automatique de la parole: du signal à son interprétation. France: Dunod.
Huang, X., Alleva, F., Hwang, M.Y. & Rosenfeld, R. (1993). An Overview of the SPHINX-II Speech Recognition System. In Proceedings of the ARPA Human Language Technology Workshop (pp. 81--86). Morgan Kaufmann.
Joachims, T. (1996). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical report, School of Computer Science, Carnegie Mellon University, Pittsburgh.
Langlois, D. (2002). Notions d'événements distants et d'événements impossibles en modélisation stochastique du langage: application aux modèles n-grammes de mots et de séquences. PhD thesis, Henri Poincaré University, Nancy 1.
Rosenfeld, R. (1994). Adaptive Statistical Language Modeling: A Maximum Entropy Approach. PhD thesis, Computer Science Department, Carnegie Mellon University.
Salton, G. (1991). Developments in Automatic Text Retrieval. Science, 253, 974--979.
Seymore, K., Chen, S. & Rosenfeld, R. (1998). Nonlinear interpolation of topic models for language model adaptation. In Proceedings of the International Conference on Spoken Language Processing.
Seymore, K. & Rosenfeld, R. (1997). Using Story Topics for Language Model Adaptation. In Proceedings of the European Conference on Speech Communication and Technology.
Tillman, C. & Ney, H. (1996). Selection criteria for word trigger pairs in language modeling. In L. Miclet & C. de la Higuera (Eds.), Grammatical Inference: Learning Syntax from Sentences. Lecture Notes in Artificial Intelligence, 1147, 95--106.
Yang, Y. & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In 14th International Conference on Machine Learning (pp. 412--420). San Francisco, US.
