A scaleable document clustering approach for large document corpora


Information Processing and Management 42 (2006) 1163–1175
www.elsevier.com/locate/infoproman

Niall Rooney a,*, David Patterson a, Mykola Galushka a, Vladimir Dobrynin b

a Northern Ireland Knowledge Engineering Laboratory, University of Ulster, Jordanstown, Newtownabbey BT37 0QB, UK
b Faculty of Applied Mathematics and Control Processes, St. Petersburg State University, 35 University Ave., St. Petersburg 198504, Russia

Received 17 June 2005; received in revised form 21 October 2005; accepted 24 October 2005. Available online 27 December 2005.

Abstract

In this paper, the scalability and quality of the contextual document clustering (CDC) approach is demonstrated for large data-sets using the whole Reuters Corpus Volume 1 (RCV1) collection. CDC is a form of distributional clustering, which automatically discovers contexts of narrow scope within a document corpus. These contexts act as attractors for clustering documents that are semantically related to each other. Once clustered, the documents are organized into a minimum spanning tree so that the topical similarity of adjacent documents within this structure can be assessed. The predefined categories from three different document category sets are used to assess the quality of CDC in terms of its ability to group and structure semantically related documents given the contexts. Quality is evaluated based on two factors: the category overlap between adjacent documents within a cluster, and how well a representative document categorizes all the other documents within a cluster. As the RCV1 collection was collated in a time ordered fashion, it was possible to assess the stability of clusters formed from documents within one time interval when presented with new unseen documents at subsequent time intervals. We demonstrate that CDC is a powerful and scaleable technique with the ability to create stable clusters of high quality. Additionally, to our knowledge this is the first time that a collection as large as RCV1 has been analyzed in its entirety using a static clustering approach. © 2005 Elsevier Ltd. All rights reserved.

Keywords: Information retrieval; Document clustering

1. Introduction

The purpose of information retrieval (IR) is to store documents electronically and to support the user's search for relevant documents (Baeza-Yates & Ribeiro-Neto, 1999). Most information retrieval systems operate on the principle of indexing the documents in the collection. A popular indexing approach is based on the vector space model (Sebastiani, 2002), where documents and queries are represented as weighted vectors of indexed terms and the size of the vectors matches the vocabulary space. The weight of each index term within each document indicates its significance in terms of its representation and discriminative power. Documents are deemed similar based on a cosine or Euclidean distance metric, and fast search algorithms can be applied to match documents to a query. The results of a query are in general returned in the form of a ranked list. This approach, combined with machine learning classification techniques, has been shown to be the most successful approach for document categorization problems (Sebastiani, 2002).

However, such indexing approaches suffer from the lack of any semantic organization of the document collection and an inability to support the user in their query formulation. In fact, it is an implicit assumption in such systems that the user will always be able to construct quality keyword queries which competently express their information needs. In reality, cognitive research has shown that users often are not fully aware of their information needs and perform an iterative process of refining their search based on an increasing clarification of those needs (Belew, 2000). The need for exploratory search mechanisms can only be supported if the document collection is organized into a meaningful structure which allows part or all of the document collection to be browsed at each stage of a search. This has prompted researchers (Cutting, Karger, Pedersen, & Tukey, 1992; Hearst & Pedersen, 1996; Mechkour, Harper, & Muresan, 1998; Liu & Croft, 2004) to re-examine the process of cluster-based information retrieval, as it clearly has the capability to facilitate such mechanisms.

Generally, clustering algorithms can be categorized as hierarchical (agglomerative and divisive) or partitional in nature (Jain, Murty, & Flynn, 1999). Partitional clustering, such as the well known K-means, tends to be a much more efficient approach to clustering, although it in general requires a priori knowledge of the number of clusters. Most clustering algorithms determine the similarity between points based on a distance measure. In terms of clustering a corpus of documents, a more natural measure of the similarity between documents is based on the word distributions of the documents. A probabilistic or information theoretic framework can then be applied to cluster documents or words/features in the documents. Such approaches belong to the area of distributional clustering (Baker & McCallum, 1998; Dhillon, Manella, & Kumar, 2003; Pereira, Tishby, & Lee, 1993), where the similarity measure is based on an information theoretic divergence criterion (Lin, 1991).

The most recent research in distributional clustering has focused primarily on the Information Bottleneck (IB) clustering framework (Slonim & Tishby, 2000; Tishby, Pereira, & Bialek, 1999). The IB method is based on the following abstract concepts. Given the empirical joint probability distribution of two variables, one variable is "compressed" so that the maximum mutual information is maintained between both. In the context of clustering, the variables represent the set of documents and the set of words. Using this method, it is possible to form either word clusters or document clusters.
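By way of background, the following minimal sketch illustrates the vector space model described above. It is not part of CDC; the tf-idf weighting scheme and the toy documents are illustrative assumptions rather than details taken from the paper.

```python
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity between two sparse term-weight vectors (dicts).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def tfidf_vectors(docs):
    # docs: list of token lists; returns one weighted vector per document.
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

docs = [["cataract", "surgery", "patient"],
        ["patient", "ward", "nurse"],
        ["cataract", "lens", "patient"]]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))  # the third doc is closer to the first
```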
In terms of document clustering, the original IB algorithm was agglomerative and sub-optimal in nature (i.e. it does not necessarily form the "best" clusters). A number of new algorithms have been presented based on IB, with the intent of improving its performance on text categorization problems, its efficiency, or both (Bekkerman, El-Yaniv, Tishby, & Winter, 2001; Bekkerman, El-Yaniv, Tishby, & Winter, 2002; El-Yaniv & Souroujon, 2001; Slonim & Tishby, 2000; Slonim, Friedman, & Tishby, 2002). For example, the sequential IB method (Slonim et al., 2002) is partitional, which addresses the efficiency drawback and ensures a more optimal partitioning of documents into clusters given the mutual information. Gondek and Hoffman (2004) provided an extension to the IB approach that allows clustering to factor out existing knowledge orthogonal to the classes of interest.

The technique described in this paper, referred to as CDC, is in a similar vein to IB in that it is based on distributional clustering; however, CDC's focus is on detecting contexts which act as attractors for documents that are semantically related given the context. A context is defined as a probability distribution of terms over the word set. A context can be either general or narrow in scope, depending on the assessment of the probability distribution. For example, in a collection of documents relating to medical literature, the context "patient" is likely to be general, in that it can occur in many documents covering a number of vastly different subjects, whereas the context "cataract" is likely to be narrower, as it refers to a particular medical condition. CDC is a distributional clustering technique which uses the narrow contexts identified in the collection to cluster documents that are related. Typically there will be a large number of narrow contexts in a document collection; as such, a large number of clusters are formed.

In this paper, we demonstrate the quality of CDC's document clustering approach for a large document collection by presenting the results of experiments carried out on the Reuters corpus RCV1 (Rose, Stevenson, & Whitehead, 2002). We have previously shown the quality of this document clustering approach in comparison to other approaches for smaller data-sets such as the Reuters-21578 and 20 Usenet Newsgroups collections (Dobrynin, Patterson, & Rooney, 2004). The main objective of this paper is to demonstrate the ability of CDC to group and structure semantically related documents in a large data-set, and to assess the technique using independent objective measures. By "large", we mean not only large in the number of documents in the collection, but also large in the number of document terms or features and large in the potential number of clusters. The RCV1 collection contains approximately 35 times more documents than the popular Reuters-21578 collection, and approximately 10 times the number of distinct words after stemming.

The fact that the sequential implementation of CDC was able to pre-process, identify cluster contexts, cluster the whole collection and discover the internal structure of clusters on a standard desktop (2.2 GHz, 1 GB RAM) within 60 h is a consequence of its scalability. This important property of CDC precludes benchmarking the results of the experiments against other document clustering approaches in the same environment, as many clustering methods, such as hierarchical methods, lack scalability due to their non-linear time complexity. Hierarchical methods such as CURE (Guha, Rastogi, & Shim, 1998) try to improve efficiency by the use of random sampling; however, CURE has only been shown to be effective for very low dimensional data. Even partitional clustering approaches that are scaleable, such as spherical K-means, have a time complexity O(t·N·|S|), where t is the number of iterations, N is the number of clusters and |S| is the number of non-empty term weights in a sparse document-term weight matrix (Dhillon, Fan, & Guan, 2001). This is greater than the time complexity of CDC, as the process of context formation/clustering is shown to have time complexity O(N·|S|) in Section 2.1.4. Even allowing for techniques that can improve the efficiency of K-means by a constant factor (Dhillon et al., 2001; McCallum, Nigam, & Ungar, 2000), the performance of K-means is still dependent on the random choice of initial partitioning of the data and the appropriate choice of the number of clusters. As such, a number of repeat re-clustering runs would be required to effectively assess the performance of such a partitional method, which, although possible with much smaller data sets, would be computationally expensive for the RCV1 collection. Cutting et al. (1992) presented a method to try to improve the quality of K-means by using an agglomerative clustering approach, the Buckshot algorithm, to better choose the initial cluster centroids. Buckshot has drawbacks, as it puts limits on how many clusters can be formed and, like K-means, it is not deterministic. Also, the combination of Buckshot and K-means is less efficient than K-means by itself. In contrast, CDC is not dependent on an arbitrary choice of parameters that impacts its performance. Perhaps as a consequence, no results, to our knowledge, have been presented to date on clustering the whole of the RCV1 collection.
2. Methodology

The CDC concept was first presented in Dobrynin et al. (2004) and was initially assessed on much smaller data-sets. For the purposes of exposition, we present the most salient points concerning the clustering approach. The CDC approach is based on the following assumptions:

- A large number of narrow contexts exist in a document corpus.
- There are certain context terms in the document corpus which allow for the identification of narrow contexts.
- Narrow contexts in the corpus act as attractors for clusters of documents which are semantically related given the context.

It is these assumptions that make it possible to identify cluster attractors and split the corpus into a large flat structure of thematically homogeneous clusters in time linear in the size of the collection (Dobrynin et al., 2004). There is no explicit semantic description of a cluster, but it should be clear to an expert from examining the cluster content why the documents are considered semantically related. The following sections describe the clustering process and the internal cluster organization.

2.1. Clustering

In the following definitions, a term refers to a word in the document corpus and a context is a probability distribution over all terms. Different contexts can therefore be represented by different probability distributions over the same set of terms. A context can be viewed as being either general or narrow in scope: a context is described as narrow if it meets bounds on its entropy and document frequency, and as general otherwise.

2.1.1. Narrow context word discovery

Let X denote the set of all documents in the document corpus and Y denote the set of all terms present in X. Given a context term z ∈ Y, we define its context as the probability distribution of the set of words which co-occur with the given term. More specifically, the context of the term z is represented in the form of a conditional probability distribution p(Y|z), where the random variable Y takes values from Y and p(y|z) is equal to the probability of randomly selecting the term y in a randomly selected document in which the term z occurs. We can approximate this distribution as

p(y|z) = \frac{\sum_{x \in X(z)} tf(x, y)}{\sum_{x \in X(z),\, t \in Y} tf(x, t)}

where tf(x, y) is the term frequency of the term y in document x and X(z) is the set of all documents from the corpus which contain the term z. In most cases the context of the term z is too general in scope to present useful information about the corpus, so we are interested only in identifying context terms z which identify narrow contexts. Narrow contexts are identified by a consideration of the entropy of the probability distribution p(Y|z),

H(Y|z) = -\sum_{y} p(y|z) \log p(y|z)

and the document frequency of the context term, df(z). Let Y(z) denote the set of all distinct terms from documents in X(z). When there is a uniform distribution of terms from Y(z), the entropy H(Y|z) is equal to log|Y(z)|. According to Heaps' law (Baeza-Yates & Ribeiro-Neto, 1999),

\log |Y(z)| = O(\log |X(z)|)

i.e. the entropy of a term z is related to the log of the size of the set of documents containing the term, X(z). As the document frequency of the term z, df(z), equals |X(z)|, the entropy of the context is related to the logarithm of the document frequency of its context term, so high document frequency terms have higher entropy values than low document frequency terms. This means it is insufficient to choose contexts based on a consideration of their entropy alone. To take into account the dependency between the document frequency df(z) of the term z and the entropy of its context, we divide context terms into disjoint subsets based on their document frequency:

Y' = \bigcup_{i} Y_i, \qquad Y_i = \{z : z \in Y,\; df_i \le df(z) < df_{i+1}\}, \quad i = 1, \ldots, r

Here the thresholds df_i satisfy the condition df_{i+1} = a·df_i, where a > 1 is a constant. The choice of document frequency thresholds is based on the following principles. As in other IR applications, we consider terms that have very low or very high document frequency as not being informative, and hence such terms are not considered as contextual terms (Sebastiani, 2002). The interval range [df_i, df_{i+1}) for i ≥ 2 is set to be at least as large as the preceding interval range [df_{i-1}, df_i). This allows for the fact that entropy increases logarithmically with document frequency, so that if the document frequency interval doubles in size, the corresponding interval range for entropy remains constant. However, as terms with medium-to-high document frequency tend to have a sparser distribution, the last interval range is set to be much larger than twice its antecedent in size.
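To make the discovery step concrete, here is a minimal sketch of computing a term's context distribution p(y|z) and its entropy H(Y|z), from which terms would then be binned by document frequency df(z). The toy corpus, data structures and helper names are our own reconstruction, not code from the paper.

```python
import math
from collections import Counter, defaultdict

def context_distribution(z, docs, doc_ids_with):
    # p(y|z): term frequencies summed over all documents containing z, normalized.
    counts = Counter()
    for i in doc_ids_with[z]:
        counts.update(docs[i])          # docs[i] is a Counter of term frequencies
    total = sum(counts.values())
    return {y: c / total for y, c in counts.items()}

def entropy(dist):
    # H(Y|z) = -sum_y p(y|z) log p(y|z)
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

# docs: term-frequency Counters; doc_ids_with: inverted index term -> doc ids.
docs = [Counter(t.split()) for t in
        ["cataract lens surgery", "cataract patient lens", "patient ward nurse"]]
doc_ids_with = defaultdict(set)
for i, d in enumerate(docs):
    for t in d:
        doc_ids_with[t].add(i)

for z in sorted(doc_ids_with):
    dist = context_distribution(z, docs, doc_ids_with)
    print(z, "df =", len(doc_ids_with[z]), "H =", round(entropy(dist), 3))
```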

It can be shown that H(Y|z) is bounded from above by a linear function of the index i, so as a consequence it is possible to set a linear threshold function Hmax(i) to select, from each subset, a set Z of terms which identify narrow contexts:

Z = \bigcup_{i} Z_i, \qquad Z_i = \{z : z \in Y_i,\; H(Y|z) \le H_{max}(i)\}, \quad i = 1, \ldots, r

However, instead of setting upper threshold values Hmax(i), the approach adopted in this paper is to assume that in total there are N narrow word contexts. For every i = 1, ..., r, a set Z_i ⊂ Y_i is selected such that

|Z_i| = \frac{N \cdot |Y_i|}{\sum_{j=1,\ldots,r} |Y_j|}

and z_1 ∈ Z_i, z_2 ∈ Y_i \setminus Z_i ⟹ H(Y|z_1) ≤ H(Y|z_2). That is, we select from each subset a proportionate number of context terms (proportional to the size of the subset |Y_i| relative to the size of the union of all subsets Y') that fall in this interval and which have the lowest entropy. Then Z = ∪_i Z_i, where Z is the total set of selected narrow context terms. The choice of N is set so that as many natural narrow contexts are identified as are present in the collection.

After selecting the set of narrow contexts, we merge narrow contexts together if the Jensen–Shannon (JS) divergence (Lin, 1991) between them is below a certain threshold, since in this case they effectively identify the same context. The JS divergence between the probability distributions p_1 and p_2 representing two contexts is

JS_{\{0.5, 0.5\}}[p_1, p_2] = H[\bar{p}] - 0.5\,H[p_1] - 0.5\,H[p_2]

where H[p] denotes the entropy of the probability distribution p and \bar{p} denotes the average probability distribution \bar{p} = 0.5 p_1 + 0.5 p_2. As a result of merging, fewer than N narrow contexts may be retained.

2.1.2. Cluster formation

The narrow contexts {p(Y|z)}_{z ∈ Z} are cluster attractors. Every document is grouped into at most one cluster, i.e. this is a hard clustering approach. Every document x is represented by the probability distribution p(Y|x), where

p(y|x) = \frac{tf(x, y)}{\sum_{t \in Y} tf(x, t)}

and the distance between a document x and the context for the term z is calculated by the JS divergence. A document x is therefore assigned to the cluster with attractor z if

z = \arg\min_{t \in Z} JS_{\{0.5, 0.5\}}[p(Y|x), p(Y|t)]
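A minimal sketch of the JS divergence and this hard assignment rule follows. The entropy helper restates the definition from Section 2.1.1, and the toy attractor and document distributions are invented for illustration.

```python
import math

def entropy(p):
    return -sum(v * math.log(v) for v in p.values() if v > 0)

def js_divergence(p1, p2):
    # JS[p1, p2] = H[0.5*p1 + 0.5*p2] - 0.5*H[p1] - 0.5*H[p2]
    avg = {t: 0.5 * p1.get(t, 0.0) + 0.5 * p2.get(t, 0.0)
           for t in set(p1) | set(p2)}
    return entropy(avg) - 0.5 * entropy(p1) - 0.5 * entropy(p2)

def assign(doc_dist, attractors):
    # Hard assignment: the attractor (narrow context) with least JS divergence.
    return min(attractors, key=lambda z: js_divergence(doc_dist, attractors[z]))

attractors = {"cataract": {"cataract": 0.5, "lens": 0.3, "surgery": 0.2},
              "ward": {"ward": 0.4, "nurse": 0.4, "patient": 0.2}}
doc = {"cataract": 0.4, "surgery": 0.4, "patient": 0.2}
print(assign(doc, attractors))  # -> "cataract"
```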

Every cluster C(z) can be represented by two probability distributions over the term set Y: one is the cluster's attractor p(Y|z), the other is the cluster's centroid p_avr(Y|C(z)), which is the average of the probability distributions of the documents assigned to the cluster:

p_{avr}(y|C(z)) = \frac{\sum_{x \in C(z)} p(y|x)}{|C(z)|}

2.1.3. Internal cluster structure discovery

The inner structure of a cluster is represented by a graph in which each vertex represents a document. Each vertex stores a weight equal to the distance between the document and the cluster attractor it belongs to. Any two vertices in the graph are connected by an undirected edge whose weight is equal to the distance between the corresponding documents; this distance is determined, as before, using the JS divergence. The standard Kruskal's algorithm is used to find the minimum spanning tree (MST), which spans all graph vertices and has the minimum total distance for its edges. This structure allows an effective means of browsing clusters (Dobrynin, Patterson, Galushka, & Rooney, 2005).
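The MST construction itself is standard. A compact sketch of Kruskal's algorithm with union-find follows, assuming the pairwise JS divergences have already been computed; the edge weights below are illustrative.

```python
def kruskal_mst(n, edges):
    # edges: (weight, u, v) for the complete graph over a cluster's documents,
    # where weight is the JS divergence between documents u and v.
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    mst = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:               # adding the edge does not create a cycle
            parent[ru] = rv
            mst.append((u, v, w))
            if len(mst) == n - 1:
                break
    return mst

# Four documents; the weights stand in for pairwise JS divergences.
edges = [(0.10, 0, 1), (0.40, 0, 2), (0.30, 1, 2),
         (0.50, 0, 3), (0.20, 2, 3), (0.60, 1, 3)]
print(kruskal_mst(4, edges))  # [(0, 1, 0.1), (2, 3, 0.2), (1, 2, 0.3)]
```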

2.1.4. CDC time complexity

Let L be the average length of a document in terms of the number of distinct terms, and df_avr the average document frequency of a term. Each context requires determining the term frequency of every term that co-occurs with its context term, over all documents in which the context term occurs. Assume, for simplicity, that each document has length L. As there are |Y| terms in the collection, context generation requires |Y|·L·df_avr operations. Let |S| = |Y|·df_avr. The selection of N narrow contexts requires the identification of the N terms of minimum entropy. This can be done in O(|Y| log N) operations which, according to Heaps' law, can be estimated as O(|X|^β log N), where β < 1. Finally, clustering the documents requires comparing every document to at most N attractors, which requires O(|X|·N) operations. So the total time complexity of CDC consists of:

1. O(|S|·L) operations to generate all contexts;
2. O(|X|^β log N) operations to select narrow contexts;
3. O(|X|·N) operations to cluster documents.

The complexity of step 2 is less than that of step 3, so the total complexity of CDC is bounded by O(|S|·max(L, N)), as |S| ≫ |X|. Because L is constant, this complexity can be given as O(|S|·N). This is less than the time complexity of partitional clustering approaches such as K-means, which is O(t·|S|·N), where t is the number of iterations required for the iterative algorithm to converge and N is the number of clusters (Dhillon et al., 2001).

3. Experimental evaluation

We assess the CDC approach to document clustering on the Reuters Corpus Volume 1 (RCV1) collection (Rose et al., 2002). Following Lewis, Yang, Rose, and Lin (2004), we refer to this original collection of over 800,000 manually categorized newswire stories as RCV1-v1, and to the pre-processed version of this collection produced by Lewis et al. (2004) as RCV1-v2. Lewis et al. (2004) divided RCV1-v2 chronologically into a training set (referred to in this paper as RCV1-v2-train) consisting of 23,149 documents for articles published from August 20, 1996 to August 31, 1996, and a test set (RCV1-v2-test) consisting of the 781,265 remaining documents (September 1, 1996 to August 19, 1997). Documents in the collection have been categorized according to three category sets: a Topics category set consisting of 103 categories, an Industries category set consisting of 354 categories and a Regions category set consisting of 366 categories. Note that category information was not used in forming clusters, but is used as an independent means of assessing the quality of clusters. We clustered both RCV1-v2-train and a variant of the RCV1-v2 collection which we refer to as RCV1-v2*. RCV1-v2* contains the same documents as RCV1-v2, but they are not pre-processed; instead they are drawn from the original raw collection RCV1. This allowed a clustering process based on all words in the documents in this collection, and not just words in the training set. Prior to clustering the RCV1-v2* collection, documents were parsed by converting all words in the text of a document into lower case, deleting stop-words using a standard stop-word list (SMART, 571 words) and applying the Porter algorithm for stemming. We consider a word to be any maximal sequence of symbols which starts with a symbol in the range a–z, ends with a symbol in the range a–z or 0–9, and in between contains only symbols from the set {a–z, 0–9, _, -, /}.
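One plausible encoding of this word definition as a regular expression is sketched below. This is our own reconstruction rather than the parser used by the authors; stop-word removal and Porter stemming would follow tokenization.

```python
import re

# Word pattern as described above: starts with a-z, ends with a-z or 0-9,
# and may contain a-z, 0-9, '_', '-' or '/' in between. The second
# alternative admits a single a-z letter on its own.
WORD = re.compile(r"[a-z][a-z0-9_\-/]*[a-z0-9]|[a-z]")

def tokenize(text):
    return WORD.findall(text.lower())

print(tokenize("The U.S.-based Fed raised rates on 1996/08/20."))
# -> ['the', 'u', 's', 'based', 'fed', 'raised', 'rates', 'on']
```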
It should be noted that no additional information about the document set, apart from the text of the documents themselves, was used. As such, the approach is totally unsupervised in nature and therefore has no manual knowledge engineering overheads. In total, 507,883 different word stems remained after parsing, in comparison to the 47,236 stems identified for RCV1-v2-train.

The number of frequency intervals and their bounds is based on a consideration of the distribution of document frequencies of all terms for each data set; as such, these values are data-set specific. For RCV1-v2-train, the number of document frequency intervals was five and the bounds were set as follows:

5 < df(z) < 12
12 ≤ df(z) < 25
25 ≤ df(z) < 50
50 ≤ df(z) < 100
100 ≤ df(z) < 5000

The number of contexts selected, N, was set to 1000. For RCV1-v2*, the number of document frequency intervals was seven and the intervals were set as:

25 < df(z) < 50
50 ≤ df(z) < 100
100 ≤ df(z) < 200
200 ≤ df(z) < 400
400 ≤ df(z) < 800
800 ≤ df(z) < 1600
1600 ≤ df(z) < 10000

The difference in the number of frequency intervals and their bounds for the two data-sets is a consequence of RCV1-v2* containing nearly 35 times the number of documents of RCV1-v2-train. The number of contexts selected, N, was set to 4000. The values of N for RCV1-v2-train and RCV1-v2* were set so as to identify as many narrow contexts as are contained within each corpus and to try to ensure that the clusters on average would not contain a large number of documents. Clustering the RCV1-v2-train collection resulted in 838 contexts and 720 non-empty clusters, and clustering the whole collection resulted in 3053 contexts and 2505 non-empty clusters (empty clusters arise when no documents are assigned to those particular narrow contexts). Fig. 1 shows the distribution of clusters for RCV1-v2* according to their size: 65% contain 100 documents or fewer and 77% have a size less than 500 documents. As such, the majority of clusters are small in size relative to the size of the whole collection. Three sets of experiments were carried out to consider three issues relating to the quality of the clustering process, namely: how close in similarity adjacent documents in the MST are; how homogeneous the clusters are in terms of document classification; and how stable the discovered contexts and clusters are over time.

Fig. 1. Distribution of clusters by size for RCV1-v2* (fraction of clusters vs. cluster size).

[...] 0.5, that is, they have at least half of their categories in common. At the other end of the graph it can be seen that only 8% of pairs share no category labels. These results provide strong evidence for the hypothesis that CDC is a competent approach for clustering semantically related documents together around a common context, especially when it is considered that there are 103 different possible categories a document can have, and many documents have more than one category. This is further emphasized when the graph in Fig. 3, for the randomly drawn document pairs, is analyzed. Here there are almost no documents with JC values above the [0.3, 0.4) interval for Topics, indicating no category overlap in these intervals. Additionally, it can be seen that 61% of pairs share almost no category labels ([0.0, 0.1) interval).

Fig. 2. JC distribution for adjacent documents in RCV1-v2-train (fraction of documents per JC interval, for the Topics, Regions and Industries category sets).

Fig. 3. JC distribution for random pairs of documents in RCV1-v2-train (fraction of documents per JC interval, for the Topics, Regions and Industries category sets).

A similar profile is observed for the Regions category set. Here an even larger proportion (63%) of document pairs have the highest category overlap, compared to only 8% for the random experiment, and over 72% of pairs have a JC coefficient >0.5. Regions also displayed a larger proportion of document pairs in the [0.0, 0.1) interval (23%) in comparison to Topics; however, there was also a larger proportion of randomly drawn documents in the same interval (87%). This provides additional support for the quality of CDC clustering. Conversely, the results were worse for the Industries category set. Here only 27% of pairs had the highest category overlap, and the proportion with a JC coefficient >0.5 was only 0.35. This would tend to indicate that CDC's semantic clustering follows the Topics and Regions categories more closely, these appearing to be orthogonal in their knowledge to Industries. However, as can be seen in Fig. 3, the likelihood of a pair of randomly selected documents having the highest category overlap in terms of the Industries category set is virtually zero; therefore CDC clustering is still able to group documents to some level of similarity with respect to this category set. It should also be noted that Industries produced the largest proportion of pairs (58%) in the [0.0, 0.1) JC interval range, which re-emphasizes CDC's poorer ability to cluster the documents in terms of this category set, although this proportion is still much less than in the random experiment (97%).
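Assuming JC denotes the Jaccard coefficient |A ∩ B| / |A ∪ B| between the category sets A and B of two documents (our reading of the measure used here), the per-edge evaluation can be sketched as follows; the category codes and edges are toy values.

```python
def jaccard(a, b):
    # JC between two category sets; 1.0 when identical, 0.0 when disjoint.
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jc_histogram(mst_edges, categories, bins=10):
    # Fraction of adjacent MST document pairs falling in each JC interval.
    hist = [0] * bins
    for u, v, _w in mst_edges:
        jc = jaccard(categories[u], categories[v])
        hist[min(int(jc * bins), bins - 1)] += 1
    n = len(mst_edges)
    return [h / n for h in hist] if n else hist

categories = {0: {"C15", "GSPO"}, 1: {"C15"}, 2: {"M11"}}
mst_edges = [(0, 1, 0.1), (1, 2, 0.3)]
print(jc_histogram(mst_edges, categories))
```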


Fig. 4. JC distribution for adjacent documents in RCV1-v2* (fraction of documents per JC interval, for the Topics, Regions and Industries category sets).

Fig. 4 shows the JC distribution for the clusters formed for the total collection. Noticeably, this graph is near-identical in shape to Fig. 2, showing that CDC maintains the quality of its clustering for the large corpus. The only difference is that the proportion of pairs in the [0.9, 1.0] JC interval for Topics, Regions and Industries is 4%, 7% and 7% higher respectively. Additionally, at the opposite end of the graph, in interval [0.0, 0.1), the proportion of pairs for Regions is 8% lower and for Industries it is 12% lower; the proportion for Topics remains unchanged. Therefore, it can be concluded that overall performance is better with the total collection, due to the extra knowledge contained within the additional documents and hence the increased number of stemmed terms included within this collection. (This also resulted in an increased number of contexts and non-empty clusters; refer to Section 3.)

4.2. Cluster homogeneity

In the second set of experiments we investigate how well a representative document (either real or prototypical) describes the contents of a cluster, by carrying out classification experiments in which the category set of the representative document is used to classify all other documents within the same cluster. Our goal is to determine how homogeneous the clusters are in terms of document classification. In this sense, we use the category information associated with documents to assess CDC, in a similar fashion to the classification method proposed by Slonim and Tishby (2000) and Slonim et al. (2002) to assess the performance of unsupervised methods on labeled data sets. These experiments were carried out on the RCV1-v2* collection. It would be of interest to benchmark these results against other well known unsupervised clustering methods; however, the number of documents and distinct words in RCV1-v2* made this prohibitively time-expensive to do in the same computational environment.

We consider two approaches to forming a representative for a cluster: the first forms a prototypical document for the cluster, the second selects a real document within the cluster. The first approach, Cluster Classification (CC), assesses the categories of all documents in the cluster and assigns to them those categories whose popularity is above a defined threshold. Let T(z(d), b) be the set of all categories, assigned by experts to documents d from the cluster C(z(d)) with context z(d), whose popularity exceeds the threshold b (where popularity is defined as the percentage of cluster documents having the category). The classifier CC then assigns the category set T(z(d), b) to document d. We consider T(z(d), b) as the category set of a prototypical document for the cluster. Here we assume that the most popular document categories are the most representative for the cluster and as such should be used to classify all documents within the cluster. The second approach, Nearest Document to Centroid Classification (NDC), identifies the document closest to the cluster centroid and assigns its categories to all other documents in the cluster. Let CD(z(d)) be the document from cluster C(z(d)) nearest to the centroid of the cluster. The classifier NDC then assigns the category set T(CD(z(d))) to document d. The assumption here is that the document closest to the centroid is the most representative of the cluster context and is therefore best placed to classify the other documents in the cluster. Clearly, the nearest document to the cluster centroid is excluded from the evaluation itself so as not to bias the results.
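The two representative-based classifiers can be sketched as follows; the data structures, function names and toy example are our own, and the divergence argument stands for the JS divergence helper shown earlier.

```python
from collections import Counter

def cc_categories(cluster_docs, doc_categories, b):
    # T(z, b): categories whose popularity in the cluster exceeds threshold b,
    # where popularity is the fraction of cluster documents carrying the category.
    counts = Counter(c for d in cluster_docs for c in doc_categories[d])
    n = len(cluster_docs)
    return {c for c, k in counts.items() if k / n > b}

def ndc_categories(cluster_docs, doc_categories, doc_dists, centroid, divergence):
    # Categories of the document nearest (by JS divergence) to the cluster centroid;
    # that document itself is excluded from the subsequent evaluation.
    nearest = min(cluster_docs, key=lambda d: divergence(doc_dists[d], centroid))
    return set(doc_categories[nearest]), nearest

doc_categories = {0: {"C15"}, 1: {"C15", "GSPO"}, 2: {"C15"}, 3: {"M11"}}
print(cc_categories([0, 1, 2, 3], doc_categories, b=0.5))  # {'C15'}: popularity 0.75
```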
We assess these classifiers on the clusters formed for RCV1-v2* by calculating the micro-averaged precision (p), recall (r) and F-measure (f) defined below, where TP_i is the number of documents assigned category i for which category i is indeed one of the categories, within the given category set CS, assigned to the document; FP_i is the number of documents that are wrongly assigned category i; and FN_i is the number of documents that are incorrectly not assigned category i. We assess these values for all three category sets.

p = \frac{\sum_{i=1}^{|CS|} TP_i}{\sum_{i=1}^{|CS|} (TP_i + FP_i)}

r = \frac{\sum_{i=1}^{|CS|} TP_i}{\sum_{i=1}^{|CS|} (TP_i + FN_i)}

f = \frac{2 \sum_{i=1}^{|CS|} TP_i}{\sum_{i=1}^{|CS|} (2\,TP_i + FP_i + FN_i)}

For the CC classifier, we calculated the effect of varying b from 0.2 to 0.5 (as b increases, the determination of category popularity becomes more stringent). The results of this assessment for the two classifiers are shown in Tables 1 and 2.
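Since the micro-averaged measures reduce to sums of per-category counts, they can be computed as in this small sketch; the counts are invented for illustration.

```python
def micro_prf(tp, fp, fn):
    # tp, fp, fn: per-category counts over the whole category set.
    TP, FP, FN = sum(tp.values()), sum(fp.values()), sum(fn.values())
    p = TP / (TP + FP) if TP + FP else 0.0
    r = TP / (TP + FN) if TP + FN else 0.0
    f = 2 * TP / (2 * TP + FP + FN) if TP else 0.0
    return p, r, f

# Toy counts for a three-category set.
tp = {"C15": 80, "GSPO": 40, "M11": 20}
fp = {"C15": 20, "GSPO": 10, "M11": 30}
fn = {"C15": 10, "GSPO": 20, "M11": 30}
print(micro_prf(tp, fp, fn))  # (0.7, 0.7, 0.7)
```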

Table 1
CC classifier results for RCV1-v2*

Category set   b     p      r      f
Topics         0.2   0.562  0.722  0.632
Topics         0.3   0.682  0.636  0.659
Topics         0.4   0.775  0.567  0.655
Topics         0.5   0.829  0.552  0.641
Regions        0.2   0.575  0.544  0.559
Regions        0.3   0.673  0.489  0.566
Regions        0.4   0.781  0.427  0.552
Regions        0.5   0.817  0.404  0.541
Industries     0.2   0.460  0.228  0.305
Industries     0.3   0.571  0.181  0.282
Industries     0.4   0.645  0.160  0.257
Industries     0.5   0.732  0.127  0.216

Table 2
NDC classifier results for RCV1-v2*

Category set   p      r      f
Topics         0.617  0.564  0.589
Regions        0.564  0.470  0.468
Industries     0.158  0.216  0.183

Focusing on the results for the CC classifier, it is clear that as b increases the precision rises but the recall falls, regardless of the category set. This is due to T(z(d), b) becoming smaller under an increasingly stringent definition of "popular" categories, leading to fewer false positives but more false negatives. The optimal value of b in terms of the F-measure is 0.3 for Topics and Regions (0.2 for Industries). The NDC classifier performed worse than CC, providing F-measures that were always lower than the poorest result from CC. NDC gave somewhat poorer results, as might be expected, because only the categories of one document (the document nearest the centroid) are used to predict the classes of all others, as opposed to using the majority categories of all documents, as with CC. However, the NDC classifier has the advantage over CC that no new knowledge needs to be determined in order to assess the homogeneity of clusters. The results from the chosen classification approaches show that there must be a large degree of homogeneity within clusters, in that when the same categories are assigned to all documents within a cluster, we still obtain high precision, recall and F-measure for an unsupervised technique. This supports our earlier findings and provides more evidence for the quality of the clustering process.

Currently, only sparse results can be found in the literature concerning the use of purely unsupervised clustering techniques for document classification to assess the quality of the clustering process on this data-set. For example, Slonim et al. (2002) provided results for sequential information bottleneck on RCV1-v2-train, using only categories assigned from the Topics set. Document classification was performed by assigning documents in a cluster the most dominant label in the cluster, hence converting the problem into single-label classification, and only the 10 most popular categories were used in their experiment. They recorded a precision result of 0.835. Given that CC and NDC used multi-labeled classification (103 labels) using all category information, the results for Topics precision (with a maximum of 0.829) compare favorably.

4.3. Context/cluster stability over time

Having demonstrated the quality of the clustering process in terms of the semantic relationships between adjacent documents, we now investigate the stability of the discovered contexts (and hence the clusters) over time. It is reasonable to project that the set of narrow contexts may change over time as new documents are added to the corpus. If this were so, re-clustering would be necessary on a frequent basis to discover the new contexts. To investigate this, we carried out an experiment on the clustered RCV1-v2-train set whereby eight sets of 1000 documents from the RCV1-v2-test set (taken at regular intervals, every 100,000 documents, starting from the first 1000 documents in the test set) were assessed in terms of their document similarity. As the data-set is ordered chronologically, these sets represent news documents at different sequential time intervals throughout the year. Document similarity was determined by comparing the JC between each test document and the nearest document within an appropriate cluster, where the nearest document was the document with the least JS divergence from the test document. The appropriate cluster was determined by finding the cluster with the least JS divergence between the test document and its cluster context attractor. The JC distribution was then calculated over all the test documents within a given set.
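This per-document test procedure can be sketched by combining the js_divergence and jaccard helpers from the earlier sketches; the data structures are assumed, not taken from the paper.

```python
def nearest_document_jc(test_dist, test_cats, attractors, clusters,
                        doc_dists, doc_categories):
    # 1. Pick the cluster whose attractor has least JS divergence to the test doc.
    z = min(attractors, key=lambda a: js_divergence(test_dist, attractors[a]))
    # 2. Within that cluster, find the nearest document by JS divergence.
    nearest = min(clusters[z], key=lambda d: js_divergence(test_dist, doc_dists[d]))
    # 3. Score topical agreement as the Jaccard coefficient of the category sets.
    return jaccard(test_cats, doc_categories[nearest])
```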


Fig. 5. The effect on the JC distribution for the eight interval test sets, compared with the training set.

Fig. 5 shows the JC distributions for the Topics category set for the training set (by way of comparison) and for the eight interval sets drawn from the test set. From this it can be seen that the distribution of the JC values remains reasonably constant across the eight interval sets. There is of course some variation in the given JC interval values across interval sets; e.g. for interval test set five, the fraction of documents in the [0.9, 1.0] range dips in comparison to the other interval sets. Such an interval set may represent a rare news event, not present in the training data, which might require the addition of newly discovered narrow contexts. The remaining test interval sets show no such pattern, so this can be regarded as a unique event. Therefore, in general, the contexts discovered for the original training set are also appropriate for the test interval sets, and as a consequence the clusters and narrow contexts can be regarded as being stable over time.

5. Conclusions

We have shown that the scalability of the contextual document clustering approach provides a means to effectively cluster large document collections such as RCV1-v2. We have shown that the similarity between adjacent documents within clusters is high with respect to their Topics and Regions category assignments. We have also demonstrated that the narrow contexts (clusters) discovered on the training set are very stable over time. This is significant and shows the competency of the technique in identifying important, stable underlying themes within the corpus. Also, as demonstrated by the classification experiments, the clusters are reasonably homogeneous with respect to the Topics and Regions assignments. The results are poorer for the Industries category set. This may reflect the view proposed by Gondek and Hoffman (2004) that clustering documents so as to maximize cluster quality in terms of one source of background knowledge may require conditioning out other, possibly orthogonal, background knowledge. However, the use of supervised techniques to classify documents according to the Industries categories has also given poor results, indicating that it may by itself be a poor classification scheme (Lewis et al., 2004).

In future work we plan to consider subdividing (larger) clusters into sub-clusters using knowledge inherent within the MST, such as identifying "transitional points" where the distance between adjacent documents is larger than a certain threshold.

This may be an indication that the cluster contains sub-clusters of documents that are more highly relevant to each other than their general relevancy to the rest of the cluster. We recognize that the CDC approach has only been assessed on a collection containing documents that are relatively small in size; documents in RCV1 contain between a few hundred and several thousand words. It is possible that documents that are much larger in size would be related to more than one context; this may require such documents to be split using section breaks or by identifying topic changes, and CDC to be refined to allow for soft clustering. We intend to investigate this issue in future work.

References

Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. ACM Press.
Baker, L. D., & McCallum, A. K. (1998). Distributional clustering of words for text classification. In Proceedings of the 21st ACM international conference on research and development in information retrieval (SIGIR-98) (pp. 96–103).
Bekkerman, R., El-Yaniv, R., Tishby, N., & Winter, Y. (2001). On feature distributional clustering for text categorization. In Proceedings of the 24th ACM international conference on research and development in information retrieval (SIGIR-01) (pp. 146–153).
Bekkerman, R., El-Yaniv, R., Tishby, N., & Winter, Y. (2002). Distributional word clusters vs. words for text categorization. Journal of Machine Learning Research, 1, 1–48.
Belew, R. (2000). Finding out about: A cognitive perspective on search engine technology and the WWW. Cambridge University Press.
Cutting, D., Karger, D., Pedersen, J., & Tukey, J. (1992). Scatter/Gather: A cluster-based approach to browsing large document collections. In Proceedings of the 15th ACM international conference on research and development in information retrieval (SIGIR-92) (pp. 318–329).
Dhillon, I. S., Fan, J., & Guan, Y. (2001). Efficient clustering of very large document collections. In Data mining for scientific and engineering applications (pp. 357–381). Kluwer Academic Publishers.
Dhillon, I. S., Manella, S., & Kumar, R. (2003). A divisive information-theoretic feature clustering algorithm for text classification. Journal of Machine Learning Research, 3, 1265–1287.
Dobrynin, V., Patterson, D., Galushka, M., & Rooney, N. (2005). SOPHIA: An interactive cluster-based retrieval system for the OHSUMED collection. IEEE Transactions on Information Technology in Biomedicine, 9(2).
Dobrynin, V., Patterson, D., & Rooney, N. (2004). Contextual document clustering. In Proceedings of the 26th European conference on information retrieval research, LNCS Vol. 2997 (pp. 167–180). Springer.
El-Yaniv, R., & Souroujon, O. (2001). Iterative double clustering for unsupervised and semi-supervised learning. In Proceedings of the 12th European conference on machine learning (pp. 121–132). Springer.
Gondek, T., & Hoffman, T. (2004). Non-redundant data clustering. In Proceedings of the 4th IEEE international conference on data mining (pp. 75–82). IEEE CS Press.
Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD international conference on management of data (pp. 73–84).
Hearst, M. A., & Pedersen, J. O. (1996). Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of the 19th international ACM SIGIR conference on research and development in information retrieval (SIGIR-96) (pp. 76–84).
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
Lewis, D., Yang, Y., Rose, T., & Lin, F. (2004). RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5, 362–397.
Lin, J. (1991). Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory, 37(1), 145–151.
Liu, X., & Croft, W. B. (2004). Cluster-based retrieval using language models. In Proceedings of the 27th annual international conference on research and development in information retrieval (SIGIR-04) (pp. 186–193).
McCallum, A., Nigam, K., & Ungar, L. H. (2000). Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 169–178).
Mechkour, M., Harper, D. J., & Muresan, G. (1998). The WebCluster project: Using clustering for mediating access to the WWW. In Proceedings of the 21st ACM international conference on research and development in information retrieval (SIGIR-98) (pp. 357–358).
Pereira, F., Tishby, N., & Lee, L. (1993). Distributional clustering of English words. In 30th annual meeting of the Association for Computational Linguistics (pp. 183–190), Columbus, Ohio.
Rose, T., Stevenson, M., & Whitehead, M. (2002). The Reuters Corpus Volume 1—from yesterday's news to tomorrow's language resources. In Proceedings of the 3rd international conference on language resources and evaluation.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Slonim, N., & Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. In Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval (SIGIR-00) (pp. 208–215).
Slonim, N., Friedman, N., & Tishby, N. (2002). Unsupervised document classification using sequential information maximization. In Proceedings of the 25th annual international conference on research and development in information retrieval (SIGIR-02) (pp. 129–136).
Tishby, N., Pereira, F., & Bialek, W. (1999). The information bottleneck method. In Proceedings of the 37th annual Allerton conference on communication, control, and computing (pp. 368–377).
