A trigram hidden Markov model for metadata extraction from heterogeneous references


Information Sciences 181 (2011) 1538–1551


Bolanle Ojokoh a,b, Ming Zhang a,*, Jian Tang a

a School of Electronic Engineering and Computer Science, Peking University, Beijing 100871, PR China
b Department of Computer Science, Federal University of Technology, P.M.B. 704 Akure, Nigeria

Article history: Received 24 March 2010; received in revised form 25 December 2010; accepted 2 January 2011; available online 11 January 2011.

Keywords: Metadata extraction; Hidden Markov models; Bibliography; Second order; Shrinkage

Abstract

Our objective was to explore an efficient and accurate extraction of metadata such as author, title and institution from heterogeneous references, using hidden Markov models (HMMs). The major contributions of the research were (i) the development of a trigram, full second-order hidden Markov model that gives more priority to words emitted in transitions to the same state, with a corresponding new Viterbi algorithm; (ii) the introduction of a new smoothing technique for transition probabilities; and (iii) a modification of the back-off shrinkage technique for emission probabilities. The effect of the size of the data set on the training procedure was also measured. Comparisons were made with other related works, and the model was evaluated with three different data sets. The results showed overall accuracy, precision, recall and F1-measure of over 95%, suggesting that the method outperforms other related methods in the task of metadata extraction from references. © 2011 Elsevier Inc. All rights reserved.

1. Introduction

The dramatic growth of digital libraries in recent years has not only simplified access to existing information sources, but has also made the task of finding, extracting and aggregating relevant information difficult. In the bibliographic research community, many studies address citation analysis, grouping and the creation of social networks for subsequent mining. A prerequisite for such tasks is an accurate reference metadata extraction process. References are most commonly found in the late section of an article; this section is often labeled "References", "Bibliography" or "List of References", and the information it normally contains includes the author names, title, journal, volume, number (issue), year, and page information. These constitute an important kind of metadata, valuable for literature search, analysis, and evaluation [9]. Automatic reference extraction is particularly difficult because of inconsistent formatting, semantically overloaded punctuation and field separators, and the existence of many dramatically different reference styles.

Inspired by the work of Yin et al. [30], where the inner emission probability was computed according to the bigram sequence relationship of words within the same field, we describe a method that utilizes trigram HMMs giving more priority to the words emitted in transitions to the same state for the task of metadata extraction from references. We propose a three-dimensional transition matrix in which the probability of transitioning to a new state depends not only on the current state, as in the traditional HMM, but also on the previous state. Our method improves on those adopted by previous studies by recommending a new approach for smoothing transition probabilities, a modified shrinkage technique for smoothing emission probabilities, and an optimization of the emission vocabulary.

* Corresponding author. Tel.: +86 13601036730; fax: +86 10 62765822. E-mail addresses: [email protected] (B. Ojokoh), [email protected] (M. Zhang), [email protected] (J. Tang). doi:10.1016/j.ins.2011.01.014


The rest of this paper is organized as follows: Section 2 discusses related work. In Section 3, an overview of the trigram hidden Markov model is presented. Section 4 describes the approach adopted here in solving the problem of metadata extraction and the system architecture, while emphasizing the contributions of this paper. In Section 5, the experimental results are presented. Finally, Section 6 concludes the paper.

2. Related work

Many studies have addressed the extraction of metadata from references. The methods adopted generally fall into two broad categories: rule- and knowledge-based methods [6,10,18] and machine learning methods [1,2,4,7,8,28,29]. Ojokoh [18] used a rule-based approach for the task over a number of reference styles. Day et al. [6] utilized a knowledge-based approach for solving the problem. Recently, Gupta et al. [10] used a combination of regular-expression-based heuristics and knowledge-based systems to extract metadata from the references of some biological science papers. Cortez et al. [4] and Cortez and da Silva [5] employed an unsupervised machine learning approach for metadata extraction with relatively good results.

Related studies involving scholarly digital libraries have been conducted by many researchers, and automated reference extraction systems will be relevant for these studies. For instance, Shi et al. [26] worked on anchor text extraction; Papavlasopoulos et al. [20] worked on evaluating the scientific impact of journals; and Kerne et al. [13] developed applications that could be useful for representing the output of automated metadata extraction applied to specific contexts such as ours.

Several researchers have applied the hidden Markov model to different information extraction tasks, with good performance on structured, semi-structured and free text [17]. For instance, HMMs have been used for extraction of gene names and locations from scientific abstracts [14], named-entity recognition [1], and extraction of addresses [2,28]. Seymore et al. [24] explored the use of HMMs for learning model structure from data, but centered their research on the task of extracting information from the headers of computer science research papers.

The reference metadata extraction problem has been solved using HMMs by several researchers. Connan and Omlin [3] developed a variety of models for different reference styles, while Yin et al. [30] explored a bigram HMM which used word sequential relations and position information in text fields in addition to the word frequency used by traditional HMMs. Geng and Yang [8] proposed the use of special states for delimiters and HTML tags while extracting metadata from Web references. More recently, Hetzner [12] used HMMs for metadata extraction from the same data set used in this work. He assigned two states to each label and one state to each delimiter, implementing the traditional HMM. He further recommended a better method for model selection, mapping optimization and the employment of a second-order HMM, all of which are considered in our work. Our research was therefore partly motivated by the need to solve the problems posed in [12], as it creates a more efficient HMM for metadata extraction. It is the first to use a full second-order HMM with more priority to the words emitted in transitions to the same state for the task of metadata extraction from references. Peng and McCallum [21] applied an HMM to metadata extraction with only second-order transitions.
The inclusion of word sequential relations and position information in text fields has been adopted with success [1,30]. While those works applied these ideas at the first-order level, this work pays more attention to their use within the same state and at a higher order. Researchers outside the metadata extraction domain have shown that higher-order HMMs can achieve higher performance than first-order HMMs, for example in speech recognition [15] and part-of-speech tagging [29]. More recently, Shahin [25] confirmed that using second-order HMMs enhanced speaker authentication performance compared to first-order models. Generally, higher-order HMMs have been shown to incorporate more information into the training procedures of the practical problems they solve. Embedding more information into the learning process, especially when there are plenty of training data, leads to more accurate estimates.

3. The trigram hidden Markov model

A hidden Markov model (HMM) is a finite state automaton comprising stochastic state transitions and symbol emissions. The automaton models a probabilistic generative process whereby a sequence of symbols is produced by starting at a designated start state, transitioning to a new state, emitting one of a set of output symbols selected by that state, transitioning again, emitting another symbol, and so on, until a designated final state is reached. Associated with each of the set of states, S = {s_1, …, s_n}, are a probability distribution over the symbols in the emission vocabulary V = {w_1, …, w_m} and a probability distribution over its set of outgoing transitions. When using an HMM to perform metadata extraction, the goal is to determine the most likely sequence of states that generates a given sequence of symbols. The Viterbi algorithm is the standard method for calculating this [23]; it solves the problem in O(TN^2) time, where T is the length of the sequence and N is the number of states [7]. After obtaining the results, each word is put in its corresponding field according to the state sequence.

The hidden Markov model proposed in this work differs from the traditional HMM, which utilizes the common Viterbi algorithm [23]. It is a new model referred to as a trigram HMM: a full second-order HMM that places more priority on the words emitted in transitions to the same state. The new transition matrix is three dimensional, so that the probability of transitioning to a new state depends not only on the current state, as in the traditional HMM, but also on the previous state. This allows more information to be incorporated into the model.
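To fix notation before the trigram extension, the following is a minimal sketch of the standard first-order Viterbi decoding just described; the two-state model, the probabilities and the 1e-12 floor for unseen symbols are illustrative assumptions, not the paper's trained values.

```python
# Standard first-order Viterbi decoding (O(T * N^2)): recover the most
# likely state sequence for an observed symbol sequence. All model
# parameters here are illustrative stand-ins.

def viterbi(tokens, states, start_p, trans_p, emit_p):
    delta = {s: start_p[s] * emit_p[s].get(tokens[0], 1e-12) for s in states}
    backpointers = []
    for tok in tokens[1:]:
        psi, new_delta = {}, {}
        for s in states:
            best = max(states, key=lambda r: delta[r] * trans_p[r][s])
            psi[s] = best
            new_delta[s] = delta[best] * trans_p[best][s] * emit_p[s].get(tok, 1e-12)
        backpointers.append(psi)
        delta = new_delta
    last = max(states, key=lambda s: delta[s])   # best final state
    path = [last]
    for psi in reversed(backpointers):           # backtrack
        path.append(psi[path[-1]])
    return list(reversed(path))

states = ["author", "title"]
start_p = {"author": 0.9, "title": 0.1}
trans_p = {"author": {"author": 0.7, "title": 0.3},
           "title": {"author": 0.1, "title": 0.9}}
emit_p = {"author": {"initial": 0.5, "startcap": 0.5},
          "title": {"startcap": 0.8, "comma": 0.2}}
print(viterbi(["initial", "startcap", "startcap"], states, start_p, trans_p, emit_p))
# -> ['author', 'title', 'title']
```

The trigram model described below replaces the single-state indexing of delta and psi with state pairs, which is where the three-dimensional transition matrix enters.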


Hence, the trigram hidden Markov model is a four-tuple (S, A, V, B):

T = (S, A, V, B).  (1)

(i) S is the set of states,

S = {S_1, …, S_N}.  (2)

(ii) A is the transition probability matrix,

A = {a_{ijk}},  (3)

where a_{ijk} = P(q_{t+2} = S_k | q_{t+1} = S_j, q_t = S_i) and

\sum_{k=1}^{N} a_{ijk} = 1,  1 ≤ i, j ≤ N.

The second-order transition and emission probabilities were obtained based on the fact that a higher-order HMM can be transformed into an equivalent first-order HMM [11,22]. This fact establishes the simplicity of the proposed model. Therefore:

a_{ijk} = a_{ij} · a_{jk},  (4)

a_{ij} = P(S_i → S_j),  (5)

a_{jk} = P(S_j → S_k),  (6)

where 1 ≤ i, j, k ≤ N.

(iii) V is the emission vocabulary,

V = {V_1, …, V_M},  (7)

where M is the number of emission symbols in the discrete vocabulary V.

(iv) B is the emission probability matrix, whose elements were obtained from products of the first-order emission probabilities described in Eqs. (8)–(10):

b_j(V_m) = P(V_{m-1}V_m, S_i → S_j)  if i = j,
b_j(V_m) = P_f(V_m | S_i → S_j)      otherwise.  (8)

P(V_{m-1}V_m, S_i → S_j) is the probability of emitting symbol V_m as an inner symbol preceded by V_{m-1} while transitioning from state i to state j; the probability is computed for all symbols appearing immediately to the left of V_m. P_f(V_m | S_i → S_j) is the probability of emitting symbol V_m as the first symbol while transitioning from state i to state j. Hence,

B = b_j(V_m) · b_k(V_m),  (9)

where

b_k(V_m) = P(V_{m-1}V_m, S_j → S_k)  if j = k,
b_k(V_m) = P_f(V_m | S_j → S_k)      otherwise,  (10)

and

\sum_{m=1}^{M} b_k(V_m) = 1,  1 ≤ j, k ≤ N.
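Because Eqs. (4) and (9) express the second-order parameters as products of first-order quantities, they can be assembled directly from a trained first-order model. A minimal sketch, with a random matrix standing in for trained values:

```python
import numpy as np

# Sketch of Eqs. (4) and (9): the second-order parameters are products of
# first-order ones, so no trigram-level estimation is needed. The random
# matrix is an illustrative stand-in for a trained first-order model.

N = 4                                     # number of states (illustrative)
rng = np.random.default_rng(0)
A1 = rng.dirichlet(np.ones(N), size=N)    # first-order a_ij, each row sums to 1

# Eq. (4): a_ijk = a_ij * a_jk, a three-dimensional transition tensor
A2 = A1[:, :, None] * A1[None, :, :]
assert A2.shape == (N, N, N)
assert np.isclose(A2[1, 2, 3], A1[1, 2] * A1[2, 3])

def emission_B(bj, bk):
    # Eq. (9): the emission term is the product of the two first-order
    # emission probabilities along the (i -> j -> k) transition pair
    return bj * bk
```

In the Viterbi recursion of Section 4.5, this factorization means each second-order step simply chains two first-order scores, which is what keeps training and decoding simple.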

4. Architecture of the approach

Here, we describe the approach adopted in solving the problem of metadata extraction. Each subsection emphasizes the significant contributions and the methods exploited in the implementation of the trigram hidden Markov model for the task of metadata extraction from heterogeneous references. Fig. 1 presents the general architecture of the metadata extraction process.

4.1. The training phase

The reference dataset used in this research was created by the Cora project [16]. It was used because of its public availability and relatively widespread usage. It contains 500 tagged references, each consisting of some or all of these 13 fields: author, title, editor, booktitle, date, journal, volume, tech, institution, pages, location, publisher and note. The Cora dataset is a very heterogeneous set of reference data [12].

The model was first constructed to produce all the word sequences contained in the data set, with the start state having as many outgoing transitions as there are word sequences, and each word sequence being represented by a unique path with

[Fig. 1. Architecture of the HMM-metadata extraction process. Components: tagged data sets, training phase, trained model, taxonomy, Viterbi algorithm, testing data sets, structured records, evaluation.]

one state per word. Neighbour-merging and V-merging techniques [24,27,30] were then used to perform a series of iterative merges in order to produce an optimal structure of states. Section 4.4 describes how the emission vocabulary was obtained. Values were obtained for the probability matrices using maximum likelihood estimation. For instance, for the transition,

P(S_i → S_j) = c(S_i → S_j) / \sum_{S} c(S_i → S),  (11)

where c(S_i → S_j) is the number of transitions from state S_i to S_j, while \sum_{S} c(S_i → S) is the total number of transitions from state S_i in the training data.

For the emission:

(i) P(V_{m-1}V_m, S_i → S_j) = c((S_i → S_j) ↑ V_{m-1}V_m) / c((S_i → S_j) ↑ V_{m-1}),  (12)

where c((S_i → S_j) ↑ V_{m-1}V_m) is the number of times V_{m-1}V_m appears in transitions from state S_i to S_j, while c((S_i → S_j) ↑ V_{m-1}) is the total number of emissions of V_{m-1} from state S_i to S_j.

(ii) P_f(V_m | S_i → S_j) = c((S_i → S_j) ↑ V_m) / \sum_{p} c((S_i → S_j) ↑ p),  (13)

where c((S_i → S_j) ↑ V_m) is the number of times V_m is emitted in transitions from state S_i to S_j, while \sum_{p} c((S_i → S_j) ↑ p) is the total number of emissions from state S_i to S_j.
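To make the counting concrete, here is a minimal sketch of how the quantities in Eqs. (11)–(13) can be gathered from tagged sequences; the data layout, field names and symbols are illustrative assumptions, and only the transition estimate (Eq. (11)) is normalized here.

```python
from collections import defaultdict

# Maximum-likelihood counts behind Eqs. (11)-(13). Each training reference
# is a list of (symbol, state) pairs; names below are illustrative.

def estimate(tagged_refs):
    trans = defaultdict(lambda: defaultdict(int))   # c(Si -> Sj)
    first = defaultdict(lambda: defaultdict(int))   # c((Si -> Sj) ^ Vm), cf. Eq. (13)
    inner = defaultdict(lambda: defaultdict(int))   # c((Si -> Sj) ^ Vm-1 Vm), cf. Eq. (12)
    for ref in tagged_refs:
        for (sym_prev, st_prev), (sym, st) in zip(ref, ref[1:]):
            trans[st_prev][st] += 1
            if st_prev == st:
                inner[st][(sym_prev, sym)] += 1     # inner-symbol bigram emission
            else:
                first[(st_prev, st)][sym] += 1      # first-word emission
    # Eq. (11): normalize transition counts into probabilities
    P_trans = {si: {sj: c / sum(row.values()) for sj, c in row.items()}
               for si, row in trans.items()}
    return P_trans, first, inner

tagged = [[("initial", "author"), ("comma", "author"),
           ("startcap", "title"), ("startcap", "title"),
           ("datelike", "date")]]
P_trans, first, inner = estimate(tagged)
print(P_trans["author"])   # {'author': 0.5, 'title': 0.5}
```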

4.2. Smoothing

Various researchers have proposed different methods for smoothing. Laplace smoothing [1], absolute discounting [7], a modification of absolute discounting [19] and naïve smoothing [12] are some of these. For the task of reference metadata extraction, shrinkage has been shown to be more effective [7,30] than the traditional smoothing techniques mentioned earlier, especially when training data are insufficient. This technique for eliminating zero probabilities makes it possible to increase the weight of important estimates while discounting unimportant ones. Back-off shrinkage was employed in the bigram model [30] at four levels – bigram, unigram, global and uniform – combining the estimates with a weighted average and adjusting the weights according to the importance of each model. This research proposes a modification of back-off shrinkage that gives more weight to the words emitted in transitions to the same state; this idea helped to increase the accuracy and precision of the results by reasonable margins. The importance of the unigram was also reduced in this research: since the model operates at the second order, the unigram model is unnecessary when smoothing the emission probabilities, except for the inner emission probabilities, where only a small share is required.

In addition, many HMM researchers pay no attention to smoothing transition probabilities and emphasize only the smoothing of emission probabilities [1,2,28,30]. A naïve smoothing technique was adopted for transition probabilities by Hetzner [12]. This work, however, proposes shrinkage for both transition and emission probabilities. Smoothing the transition probabilities at the first-order level, before obtaining the second-order probabilities, also helped to improve the results, since zero values are eliminated before the second-order parameters are formed. The smoothing equations used in this research are given in Eqs. (14)–(16) below.
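As a concrete illustration of this weighted back-off idea, here is a minimal sketch of the interpolations that Eqs. (14)–(16) below make precise; the helper names are ours, and the λ weights are those reported in this section.

```python
# Sketch of the shrinkage interpolations of Eqs. (14)-(16): each raw
# estimate is mixed with progressively more general distributions, so no
# probability is ever exactly zero. Helper names are illustrative.

def smooth_first(p_first, p_global, lams=(0.95, 0.05)):
    # first-word emission smoothing, cf. Eq. (15)
    return lams[0] * p_first + lams[1] * p_global

def smooth_inner(p_bigram, p_unigram, p_global, lams=(0.9, 0.05, 0.05)):
    # inner-emission smoothing, cf. Eq. (16): a small unigram share remains
    return lams[0] * p_bigram + lams[1] * p_unigram + lams[2] * p_global

# an unseen inner bigram (raw estimate 0.0) still receives a little mass:
print(smooth_inner(0.0, 0.02, 0.001))   # 0.00105
```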


For the transition probabilities:

P(S_i → S_j) = λ_1 P(S_i → S_j) + λ_2 P(S_i ∪ S_j),  (14)

where λ_1 = 0.99999995 and λ_2 = 0.5 × 10⁻⁶; P(S_i ∪ S_j) is the probability of occurrence of state S_i or S_j in the entire training sample.

For first-word emission probabilities, we have:

P_f(V_m | S_i → S_j) = λ_1 P_f(V_m | S_i → S_j) + λ_2 P_global(V_m),  (15)

where λ_1 = 0.95 and λ_2 = 0.05.

For the other (inner) emission probabilities, P(V_{m-1}V_m, S_i → S_j) is computed as:

λ_1 P(V_{m-1}V_m, S_i → S_j) + λ_2 P(V_m | S_i) + λ_3 P_global(V_m),  (16)

where λ_1 = 0.9, λ_2 = 0.05 and λ_3 = 0.05. P_global(V_m) is the global probability, that is, the probability of V_m occurring in the entire training data; P(V_m | S_i) is the unigram probability of V_m in state S_i.

4.3. Out-of-vocabulary problem

The minimum frequency method [24,30] was used to deal with the out-of-vocabulary problem, which occurs when words not encountered during training are found during testing. The minimum frequency was set to 3, so words occurring once or twice were mapped to unknown words and their emission probability was estimated from the training data.

4.4. Taxonomy and feature extraction

It is important to select the number of symbols critically and to map each of them appropriately to tokens when extracting metadata using an HMM. In order to ensure mapping optimization while adequately extracting the features of the tokens found in the data set, 20 symbols were used in this research. The emission vocabulary is described in Table 1. Regular expressions, a knowledge base of state names and a list of common words found in specific fields were used for mapping tokens to the symbols in the vocabulary. These features were computed in the order listed in Table 1, so that in non-disjoint cases (for example, 1999 matches both datelike and purenumber) the earlier symbol takes precedence.

4.5. The new Viterbi algorithm for the trigram model

A trained hidden Markov model, with values for the various parameters, was obtained after going through the procedures described above. This model consists of values for the transition and emission probability matrices, alongside the sequence of symbols derived from each of the taxonomised references in the testing data set. These parameter values were passed to the Viterbi algorithm to obtain the sequence of states (q_1, …, q_T) most likely to have produced the symbol sequence representing the reference. The result consists of structured records.

Table 1
Emission vocabulary.

Emission symbol    Examples from sample
Comma              ,
Fullstop           .
Initial            A., W., B.G., M.-C
Startcap           Formalising, Third, Detecting
Hyphen             -
Notelike           Appear
Purenumber         4, 0054, 881
Location           Vienna, Sydney, Osaka
Institutename      Dept, University, Department, Division
Techlike           PhD, thesis, Masters, Ph.D., Master's
Ampersand          &
Journallike        Journal, J.
Volumelike         Vol., No., Nos.
Datelike           January, February, Dec., 1999
Pagelike           pp., pages
Acronym            ACM, SIAM, SIGPLAN, AAAI, MIT
bktitlelike        Proc., Proceedings, Report, Conference
Editorlike         Editors
Publisherlike      Publisher, press
Others             and, in, the, and other words not matching the other symbols
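The ordered mapping itself is straightforward to sketch. The rules below cover only a small illustrative subset of the 20 symbols, with simplified patterns and word lists of our own; the paper's actual regular expressions and knowledge base are not published in full.

```python
import re

# Sketch of the ordered token-to-symbol mapping of Table 1: rules are tried
# top to bottom, so an ambiguous token such as "1999" resolves to datelike
# before purenumber. Patterns and word lists are an illustrative subset.

MONTHS = {"january", "february", "march", "dec.", "jan."}
RULES = [
    ("comma",      lambda t: t == ","),
    ("fullstop",   lambda t: t == "."),
    ("initial",    lambda t: re.fullmatch(r"[A-Z]\.(-?[A-Z]\.?)*", t) is not None),
    ("datelike",   lambda t: t.lower() in MONTHS
                             or re.fullmatch(r"(19|20)\d\d", t) is not None),
    ("pagelike",   lambda t: t.lower() in {"pp.", "pages"}),
    ("purenumber", lambda t: t.isdigit()),
    ("startcap",   lambda t: t[:1].isupper()),
]

def to_symbol(token):
    for symbol, matches in RULES:
        if matches(token):
            return symbol
    return "others"   # catch-all, as in Table 1

print([to_symbol(t) for t in ["M.-C", "1999", "0054", "Formalising", "and"]])
# -> ['initial', 'datelike', 'purenumber', 'startcap', 'others']
```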


In this work, the original Viterbi algorithm [23] was modified for compatibility with the trigram model. The modified algorithm is presented as follows:

(i) Initialization. Beginning at state 0:

δ_0(i, j) = 1,  1 ≤ i, j ≤ N,  (17)

ψ_0(i, j) = 0,  1 ≤ i, j ≤ N.  (18)

(ii) Recursion:

δ_t(j, k) = max_{1 ≤ i ≤ N} [δ_{t-1}(i, j) a_{ijk}] B,  (19)

ψ_t(j, k) = arg max_{1 ≤ i ≤ N} [δ_{t-1}(i, j) a_{ijk}],  (20)

for 2 ≤ t ≤ T, 1 ≤ j, k ≤ N.

(iii) Termination:

P* = max_{1 ≤ i, j ≤ N} [δ_T(i, j)],  (21)

q*_T = arg_j max_{1 ≤ i, j ≤ N} [δ_T(i, j)],  (22)

q*_{T-1} = arg_i max_{1 ≤ i, j ≤ N} [δ_T(i, j)].  (23)

(iv) Path (state sequence) backtracking:

q*_t = ψ_{t+2}(q*_{t+1}, q*_{t+2}),  t = T − 2, T − 3, …, 2, 1.  (24)
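Under the assumptions that A2 is the a_ijk tensor of Eq. (4) and that emit(t) returns the (N × N) second-order emission term B for position t, the four steps above can be sketched as follows; the handling of the first two emissions is simplified relative to the paper.

```python
import numpy as np

# Sketch of the modified Viterbi of Eqs. (17)-(24): delta and psi are
# indexed by state *pairs* (j, k) rather than single states. A2[i, j, k]
# is the trigram transition tensor; emit(t) is the second-order emission
# term B. Both are assumed to be given.

def trigram_viterbi(T, N, A2, emit):
    delta = np.ones((N, N))                   # Eq. (17): delta_0(i, j) = 1
    psi = np.zeros((T, N, N), dtype=int)      # Eq. (18): psi_0(i, j) = 0
    for t in range(2, T):
        scores = delta[:, :, None] * A2       # scores[i, j, k] = delta(i, j) a_ijk
        psi[t] = scores.argmax(axis=0)        # Eq. (20): best predecessor i
        delta = scores.max(axis=0) * emit(t)  # Eq. (19): recursion with term B
    # Eqs. (21)-(23): the best final pair gives the last two states
    q_prev, q_last = np.unravel_index(delta.argmax(), delta.shape)
    path = [int(q_prev), int(q_last)]
    for t in range(T - 1, 1, -1):             # Eq. (24): backtrack over pairs
        path.insert(0, int(psi[t, path[0], path[1]]))
    return path
```

Because the maximization runs over pairs, one decoding step costs O(N^3) instead of O(N^2), i.e. O(TN^3) overall; with the 20-symbol vocabulary and the merged state set this remains cheap in practice.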

The modification at the initialization stage was adopted from [2,30]. Another contribution of this work is the modification at the recursion stage of the Viterbi algorithm (Eq. (19)) to incorporate the new emission probabilities at the second-order level. The output of the Viterbi algorithm is the sequence of states {q_1, …, q_T} most likely to have produced the symbol sequence formed from the testing references. Eventually, a structured set of records in the same format as the tagged references used for testing was produced.

4.6. Evaluation

These structured records were compared with the real testing data set to evaluate how well the trained model performed the task of metadata extraction. Evaluation was done by measuring the per-token accuracy, precision, recall and F1-measure for each field. These evaluation measures are defined as follows:

Precision = A / (A + C),  (25)

Recall = A / (A + B),  (26)

Accuracy = (A + D) / (A + B + C + D),  (27)

F1 = 2 · Precision · Recall / (Precision + Recall),  (28)

Overall Precision = \sum_{i=1}^{n} A_i / \sum_{i=1}^{n} (A_i + C_i),  (29)

Average Precision = (\sum_{i=1}^{n} Precision_i) / n,  (30)

where A is the number of correctly extracted tokens (true positives), B is the number of tokens existing but not extracted (false negatives), C is the number of tokens extracted but associated with wrong labels (false positives), D is the number of tokens in labels that did not exist and were not extracted (true negatives), and n is the number of fields.
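A small sketch of how these measures are computed from per-field token counts; the field names and counts below are made up for illustration, not results from the paper.

```python
# Token-level evaluation per Eqs. (25)-(30). counts maps each field to
# (A, B, C, D): true positives, false negatives, false positives, true
# negatives. Numbers are illustrative.

def field_metrics(A, B, C, D):
    precision = A / (A + C)                              # Eq. (25)
    recall = A / (A + B)                                 # Eq. (26)
    accuracy = (A + D) / (A + B + C + D)                 # Eq. (27)
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (28)
    return precision, recall, accuracy, f1

counts = {"author": (95, 2, 3, 900), "title": (180, 10, 8, 800)}

# overall precision pools counts across fields (Eq. (29)); average
# precision averages the per-field values (Eq. (30))
overall_p = (sum(a for a, b, c, d in counts.values())
             / sum(a + c for a, b, c, d in counts.values()))
average_p = sum(field_metrics(*v)[0] for v in counts.values()) / len(counts)
print(f"overall={overall_p:.4f} average={average_p:.4f}")
```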


5. Experimental results and discussion

Experiments were conducted using three different data sets: the Cora data set; the Flux-CiM data set, publicly available (http://www.dcc.ufam.edu.br/eccv/flux-cim) and used by Cortez et al. [4]; and a manually created heterogeneous data set (hereafter referred to as ManCreat) [30]. Four-fold cross-validation was used: each data set was split into four parts, such that in each run a different part was used as the test set while the other three parts were merged as the training set, giving four rounds of experiments. The training set was used to train the HMM, and the testing set was used to evaluate the extraction. The final results of each experiment are the average of the four runs.

5.1. Experiments with the Cora data set

Table 2 shows the evaluation results of the trigram model for metadata extraction from the Cora dataset. Generally, the author field was extracted with the best precision (98.31%) and recall (99.54%), while the date field was extracted with the highest accuracy (99.70%). The reason is clear: the high frequency of occurrence of the author and date fields helped the training of the model significantly. The author, date, title, pages, volume and booktitle fields were extracted with F1-measures of 98.91, 98.16, 96.73, 97.20, 95.53 and 92.71% respectively. All other fields had F1-measures over 80%, except for the note field, which also had somewhat low results in related work: its frequency of occurrence in the data set is low, and it contains many different types of words. The average accuracy of extraction was 99.25%, which is relatively high.

Table 2
Accuracy, precision, recall and F1-measure of the trigram HMM on the Cora data set (%).

Field        Acc.     Prec.    Recall   F1
Title        98.32    96.36    97.14    96.73
Author       99.43    98.31    99.54    98.91
Date         99.70    97.93    98.39    98.16
Pages        99.66    96.33    98.10    97.20
Volume       99.71    97.04    94.24    95.53
Journal      98.83    90.31    85.66    87.76
Booktitle    98.10    91.56    93.92    92.71
Publisher    99.60    91.04    88.87    89.82
Location     99.33    89.04    88.76    88.53
Tech         99.38    83.64    85.60    83.69
Institution  99.49    85.50    90.47    87.55
Note         99.28    82.11    54.88    60.33
Editor       99.41    92.78    85.36    88.61
Overall      99.25    95.06    95.18    95.12
Average      99.25    91.69    89.30    89.66

Table 3 shows the results of experiments carried out using the baseline methods and our proposed approach, evaluated in terms of overall and average precision, recall and F1-measure. We refer to the baseline methods as TradHMM [23], ImpHMM [12] and Bigram [30]. TradHMM implements the traditional first-order HMM as described in [23], using the same methods of model structure selection and parameter estimation as this work. ImpHMM (improved HMM) assigns two states to each label and one state to each delimiter, but implements the original Viterbi algorithm; the published results [12] were adopted for this method. Bigram [30] implements an HMM using word sequential relations and position information in text fields, in addition to word frequency at the emission level.

Our proposed implementation of the HMM (Trigram) outperformed the baseline methods. As Table 3 shows, Trigram achieved the best performance, indicating the effectiveness of the method. Trigram outperformed TradHMM by 21.4, 26.5 and 24.7 percentage points in average precision, recall and F1-measure respectively, and its overall results were also significantly better. ImpHMM, with its two states per tag and explicit delimiter states, performed better than TradHMM, showing that the method adopted in that work can be effective. Bigram had the next best results after Trigram, improving on the other HMM implementations. Trigram surpassed all these implementations, gaining an additional 7.7, 2.1 and 5.0 percentage points in average precision, recall and F1 respectively over ImpHMM, and it also performed significantly better than Bigram in both overall and average terms.

Table 3
Performance of HMM methods (%).

Method    Scope     Prec.    Recall   F1
TradHMM   Overall   78.32    77.37    77.84
          Average   70.35    62.76    64.98
ImpHMM    Overall   93.40    93.30    93.30
          Average   84.00    87.20    83.70
Bigram    Overall   93.10    93.20    93.10
          Average   88.70    86.10    86.20
Trigram   Overall   95.10    95.20    95.20
          Average   91.70    89.30    89.70


The results show that the positive effect of considering sequential emission information in transitions to the same state is very significant, and the effect of extending this to second order is also notable. Extending the transitions to second order helps as well, though not as much as giving priority to emissions when transitioning to the same state. It can also be deduced from the results that the treatment of the emission-level probabilities of a hidden Markov model generally contributes more to its performance than the transition probabilities.

5.1.1. Related experiments

ImpHMM [12] is the major and most recent work apart from ours that used the Cora data set to solve the same problem with an HMM. Further comparisons were therefore made with the published field-level results of this HMM implementation. Figs. 2–4 show the results of these comparisons in terms of precision, recall and F1-measure. In this research, metadata extraction was done with higher precision and recall than in Hetzner [12] for eight fields (author, date, journal, institution, booktitle, location, publisher and volume), which shows the importance of the approaches used in the various parts of this work. In fact, the difference in precision for the institution field was above 20%. The note and editor fields had higher precision but lower recall than in Hetzner [12], because of some existing but not extracted fields. The title and pages fields were extracted with slightly lower precision than in Hetzner [12]; for recall, the pages field's performance was almost the same, while the title field performed better in this work. The F1-measure, combining precision and recall, was higher in this research for ten of the thirteen fields, while the results of Hetzner [12] were better in F1-measure for only three fields.

The performance of the model in extracting metadata from the note field was significantly poorer. The same trend was noted by Hetzner [12] and by Peng and McCallum [21], who used conditional random fields to extract metadata from the data set. Hetzner [12] extracted this field with 29.5% precision, while this research did so with 82.11%; although the recall was lower than in Hetzner [12], the F1-measure was significantly higher. It is particularly difficult to learn this field accurately, because its words are too general and inconsistent, and the field occurs only a few times in the data set.

[Fig. 2. Precision of trigram and ImpHMM, by field.]

[Fig. 3. Recall of trigram and ImpHMM, by field.]

[Fig. 4. F1-measure of trigram and ImpHMM, by field.]

[Fig. 5. Average performance of trigram and unsupervised approach with Flux-CiM (CS) data set.]

[Fig. 6. Effect of training data size on overall model performance.]

Table 4
Accuracy, precision, recall and F1-measure from the Flux-CiM (HS) data set (%).

Field     Acc.     Prec.    Recall   F1
Title     99.80    99.61    99.94    99.78
Author    99.76    99.60    99.65    99.62
Date      99.98    100      99.58    99.58
Pages     99.95    99.50    99.76    99.63
Volume    99.94    99.88    98.74    99.30
Journal   99.88    99.98    98.75    99.36
Overall   99.88    99.66    99.65    99.65
Average   99.88    99.76    99.33    99.54

[Fig. 7. Effect of training data size on average model performance.]

5.2. Experiments with the Flux-CiM data sets

Cortez et al. [4] proposed an unsupervised method for metadata extraction from references. They used two data sets to evaluate their system: one from the health sciences (HS) domain, consisting of 820 structured reference records, and the other from the computer science (CS) domain, consisting of 722 reference records. Our method was tested with both data sets.

The evaluation results show that the trigram model performed excellently in the metadata extraction task. The average accuracy, precision, recall and F1-measure for the Flux-CiM (HS) data set were 99.88, 99.76, 99.33 and 99.54% respectively (Table 4), while the overall accuracy, precision, recall and F1-measure were 99.88, 99.66, 99.65 and 99.65% respectively. This establishes that the HMM is suitable for the task of metadata extraction from heterogeneous data. The results obtained from these experiments are also better than those published previously [4] for extraction from the same data sets. For instance, the average precision, recall and F1-measure reported by Cortez et al. [4] were 98.09, 95.42 and 96.71% respectively for the Flux-CiM (CS) data set, whereas in this research they were 99.56, 99.17 and 99.36% for the same data set (Fig. 5).

5.2.1. Effect of data size on training

Experiments were also conducted with the Flux-CiM (HS) data set to evaluate the effect of data size on the learning ability of the model. Two smaller subsets were extracted from the entire Flux-CiM (HS) data set: the first, one-third of the data set in size, comprised 275 structured references; the second, two-thirds of the data set, comprised 537 references. Experiments were carried out separately on each of these subsets and on all 820 references.

Table 5
Accuracy, precision, recall and F1-measure of the trigram HMM on the ManCreat data set.

Field      Acc.      Prec.     Recall    F1
Title      0.9701    0.9417    0.9411    0.9414
Author     0.9872    0.9805    0.9720    0.9761
Date       0.9907    0.9805    0.8973    0.9363
Pages      0.9932    0.9582    0.9160    0.9360
Volume     0.9960    0.8608    0.8412    0.8444
Issue      0.9961    0.8443    0.8894    0.8647
Journal    0.9648    0.9072    0.9124    0.9098
Url        0.9952    0.8453    0.9517    0.8936
Publisher  0.9969    0.7869    0.7837    0.7789
Location   0.9892    0.8659    0.9117    0.8878
Overall    0.9884    0.9349    0.9324    0.9337
Average    0.9884    0.8719    0.8947    0.8776

[Fig. 8. Overall performance of trigram and bigram with ManCreat data set.]

Fig. 6 shows that the larger the training data size, the better the overall results. Nevertheless, the results confirm that the HMM is a fast learner [2]: with just one-third of the data set, an overall accuracy of 99.80% was reached; adding a second third increased the accuracy by only 0.07%, and the final third by a further 0.01%. The trends were similar for overall precision, recall and F1-measure: adding more data increased the values, but only minimally. For the average performance, the situation was different. The average accuracy of the model increased with the size of the data, but the average precision was best for the two-thirds subset, while the average recall and F1-measure decreased as the data size increased (Fig. 7), showing that almost everything the model needed for effective training had already been obtained from the first one-third of the data set.

5.3. Experiments with the ManCreat data set

The ManCreat data set consists of 712 manually tagged references and was used for experiments by Yin et al. [30]. The data set is somewhat different from the Cora data set used to train the trigram hidden Markov model in this research: the sets of fields present in the two data sets are not completely the same. For instance, the booktitle and tech fields are in Cora but missing


in the ManCreat data set. Nevertheless, the trigram model was retrained with this data set and experiments were conducted to measure its performance. Table 5 presents the results of the evaluation. In spite of the discrepancies, the trigram model outperformed the bigram model, according to the published results of Yin et al. [30], in the overall task of metadata extraction, as shown in Fig. 8. This confirms that the trigram model with more priority to the words emitted in transitions to the same state generalizes well to previously unknown domains.

6. Conclusion and future work

A trigram (full second-order) hidden Markov model (HMM) giving more priority to the words emitted in transitions to the same state was applied to metadata extraction from heterogeneous references. Back-off shrinkage was proposed for smoothing transition probabilities, and a modification of the existing back-off shrinkage technique was recommended for emission smoothing. The simplicity of the model was established by performing smoothing at the first-order level, thus solving the problem of zero probabilities beforehand; in addition, values for the second-order parameters were obtained directly from the first-order ones. Experimental results showed that our approach outperforms the baseline HMM methods and other related methods for the task of metadata extraction. The trigram HMM performed particularly well in precision, with overall precision greater than 95%. The average accuracy of extraction over all fields was over 99%, the average precision was over 90%, and the average recall and F1-measure were above 89%. We also confirmed that the trigram HMM is a fast learner: even without a very large training set, a reasonable level of accuracy, precision, recall and F1-measure is achievable.

Although reference metadata extraction problems have been solved with success over the years, the need for further improvement keeps arising; hence, solutions yielding better precision and accuracy over seemingly good results remain worthwhile. In addition, the discoveries and contributions of this work can be applied to other new problems within and outside the natural language processing context.

Acknowledgments

This study is partially supported by the National High Technology Research and Development Program of China (863 Program No. 2009AA01Z143), the Specialized Research Fund for the Doctoral Program of Higher Education of China and the Special Fund for Fast Sharing of Science Papers in the Net Era by CSTD ("FSSP" Grant Nos. 20090001110106, 20100001110203). Thanks to Ping Yin for his contributions.

References

[1] D. Bikel, S. Miller, R. Schwartz, R. Weischedel, Nymble: a high-performance learning name-finder, in: Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., 1997, pp. 194–201.
[2] V.R. Borkar, K. Deshmukh, S. Sarawagi, Automatic segmentation of text into structured records, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2001, pp. 175–186.
[3] J. Connan, C. Omlin, Bibliography extraction with hidden Markov models, Technical Report US-CS-TR-00-6, Department of Computer Science, University of Stellenbosch, February 2000.
[4] E. Cortez, A.S. da Silva, M.A. Goncalves, F. Mesquita, E.S. de Moura, FLUX-CiM: flexible unsupervised extraction of citation metadata, in: Proceedings of the Joint Conference on Digital Libraries, ACM, Vancouver, Canada, 2007, pp. 215–223.
[5] E. Cortez, A.S. da Silva, Unsupervised strategies for information extraction by text segmentation, in: Proceedings of the Fourth SIGMOD Ph.D. Workshop on Innovative Database Research, 2010.
[6] M.-Y. Day, R.T.-H. Tsai, C.-L. Sung, C.-C. Hsieh, C.-W. Lee, S.-H. Wu, K.-P. Wu, C.-S. Ong, W.-L. Hsu, Reference metadata extraction using a hierarchical knowledge representation framework, Decision Support Systems 43 (2007) 152–167.
[7] D. Freitag, A. McCallum, Information extraction with HMMs and shrinkage, in: Proceedings of the AAAI-99 Workshop on Machine Learning and Information Extraction, 1999.
[8] J. Geng, J. Yang, AUTOBIB: automatic extraction of bibliographic information on the Web, in: Proceedings of the International Database Engineering and Applications Symposium, IEEE Computer Society, 2004, pp. 193–204.
[9] C.L. Giles, K.D. Bollacker, S. Lawrence, Digital libraries and autonomous citation indexing, IEEE Computer 32 (1999) 67–71.
[10] D. Gupta, B. Morris, T. Catapano, G. Sautter, A new approach towards bibliographic reference identification, parsing and inline citation matching, in: Proceedings of the International Conference on Contemporary Computing, India, 2009, pp. 93–102.
[11] S.O. Hansson, Do we need second-order probabilities?, Dialectica 6 (2008) 525–533.
[12] E. Hetzner, A simple method for citation metadata extraction using hidden Markov models, in: Proceedings of the Joint Conference on Digital Libraries, Pittsburgh, PA, USA, 2008, pp. 280–283.
[13] A. Kerne, A.M. Webb, S. Damaraju, N. Lupfer, A. Mathur, Meta-metadata: a metadata semantics language for collection representation applications, in: Proceedings of CIKM, Toronto, Ontario, Canada, 2010.
[14] T.R. Leek, Information extraction using hidden Markov models, Master's thesis, University of California, San Diego, 1997.
[15] J.F. Mari, J.P. Haton, A. Kriouile, Automatic word recognition based on second-order hidden Markov models, IEEE Transactions on Speech and Audio Processing 5 (1997) 22–25.
[16] A. McCallum, Cora data set, 2005.
[17] B.A. Ojokoh, Automated information extraction system for heterogeneous digital library documents, Doctoral Consortium of the ACM/IEEE Joint Conference on Digital Libraries, Vancouver, British Columbia, Canada, June 2007.
[18] B.A. Ojokoh, Rule-based metadata extraction for heterogeneous references, Oriental Journal of Computer Science and Technology 2 (2009).
[19] B.A. Ojokoh, O.S. Adewale, S.O. Falaki, Improving on the smoothing technique for obtaining emission probabilities in hidden Markov models, Oriental Journal of Computer Science and Technology 1 (2008) 15–24.


[20] S.H. Papavlasopoulos, M.S. Poulos, N.T. Korfiatis, G.D. Bokos, A non-linear index to evaluate a journal's scientific impact, Information Sciences 180 (2010) 2156–2175.
[21] F. Peng, A. McCallum, Accurate information extraction from research papers using conditional random fields, in: Proceedings of the Human Language Technology Conference and North American Chapter of the Association for Computational Linguistics, 2004.
[22] J.A. du Preez, Efficient training of high-order hidden Markov models using first-order representations, Computer Speech and Language 12 (1998) 23–39.
[23] L.R. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE 77 (1989) 257–286.
[24] K. Seymore, A. McCallum, R. Rosenfeld, Learning hidden Markov model structure for information extraction, in: Workshop on Machine Learning for Information Extraction, 1999.
[25] I. Shahin, Improving speaker identification performance under the shouted talking condition using the second-order hidden Markov models, EURASIP Journal on Applied Signal Processing 5 (2005) 482–486.
[26] S. Shi, F. Xing, M. Zhu, Z. Nie, J.-R. Wen, Anchor text extraction for academic search, in: Proceedings of the Workshop on Text and Citation Analysis for Scholarly Digital Libraries, ACL-IJCNLP, Suntec, Singapore, 2009, pp. 10–18.
[27] A. Stolcke, S.M. Omohundro, Hidden Markov model induction by Bayesian model merging, in: S.J. Hanson, J.D. Cowan, C.L. Giles (Eds.), Advances in Neural Information Processing Systems, Morgan Kaufmann, 1992, pp. 11–18.
[28] K. Taghva, J. Coombs, R. Pereda, T. Nartker, Address extraction using hidden Markov models, in: Proceedings of the Document Recognition and Retrieval XII Conference, California, USA, 2005, pp. 119–126.
[29] S.M. Thede, M.P. Harper, A second-order hidden Markov model for part-of-speech tagging, in: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, MD, 1999, pp. 175–182.
[30] P. Yin, M. Zhang, Z. Deng, D. Yang, Metadata extraction from bibliographies using bigram HMM, in: Proceedings of the Seventh International Conference on Asian Digital Libraries, LNCS 3334, 2004, pp. 310–319.

Dr. Bolanle Ojokoh holds B.Sc., M.Tech. and Ph.D. degrees in Computer Science. She has taught Computer Science courses for 10 years at the Federal University of Technology, Akure, Nigeria. Her research interests include metadata extraction, digital libraries, quality evaluation and text mining. She has published several papers in learned journals and academic conferences. She is currently pursuing postdoctoral research at Peking University, China.

Prof. Ming Zhang received her B.Sc., M.Sc. and Ph.D. from Peking University, China. Her research interests include metadata extraction, document classification, opinion mining, recommendation systems, Web 2.0 and social networking. She is a member of the editorial review board of the International Journal of Digital Library Systems. She has played active roles in the program committees of several international conferences, including ICADL (the International Conference on Asia–Pacific Digital Libraries), GCC (the International Conference on Grid and Cooperative Computing) and CSWS 2009 (the third Chinese Semantic Web Symposium). She has published more than 80 academic papers related to digital libraries.

Jian Tang received a B.Sc. degree in Mathematics from Beijing Normal University, China. He is currently pursuing his Ph.D. degree at Peking University, China. His research interests are metadata extraction and conditional random fields.
