Key Element-Context Model: An Approach to Efficient Web Metadata Maintenance

Ba-Quy Vuong, Ee-Peng Lim, Aixin Sun, Chew-Hung Chang, Kalyani Chatterjea, Dion Hoe-Lian Goh, Yin-Leng Theng, and Jun Zhang
Nanyang Technological University, Singapore
{vuon0001,aseplim}@ntu.edu.sg

Abstract. In this paper, we study the problem of maintaining metadata for open Web content. In digital libraries such as DLESE, NSDL and G-Portal, metadata records are created for good-quality Web content objects so as to make them more accessible. These Web objects are dynamic, making it necessary to update their metadata records. As Web metadata maintenance involves manual effort, we propose to reduce this effort by introducing the Key element-Context (KeC) model, which monitors only those changes made on the Web page content regions that concern metadata attributes while ignoring other changes. We also develop evaluation metrics to measure the number of alerts and the amount of effort in updating Web metadata records. The KeC model has been evaluated on metadata records defined for Wikipedia articles, and its performance with different settings is reported. The model is implemented in G-Portal as a metadata maintenance module.

1 Introduction

In a digital library (DL), creating and maintaining metadata records for Web pages with high-quality content serves three important purposes. Firstly, Web content is largely uncensored, and it requires some domain knowledge and experience to distinguish the good content objects from the poor ones. Digital librarians therefore play a critical role in selecting good-quality content and creating metadata records. Secondly, the existence of metadata records allows one to organize and present the Web content objects according to classification or grouping (e.g., task-based grouping) schemes adopted by a digital library. In this case, the metadata records serve as proxies of Web content objects, and accessing these Web content objects is no different from accessing other non-Web content objects. Finally, metadata records contain attributes that are searchable. Again, this allows Web content objects to be queried like other digital library objects.

There are already several metadata creation efforts for Web pages in digital library projects [1,2,6]. The National Science Digital Library (NSDL, http://www.nsdl.org), funded by the National Science Foundation, has been indexing and creating metadata for quality Web resources in the science, technology, engineering, and mathematics domains for research and education. The Digital Library for Earth System Education (DLESE, http://www.dlese.org) project maintains a collection of metadata records about Web pages and other resources relevant to earth science education. In our digital library system known as G-Portal, we also create and maintain metadata for selected Web content objects related to geography education, and provide a wide range of services to help learners organize and collaborate in the learning process [2].

Once created, metadata records of Web content objects often require updates due to changes made on the referenced Web pages. This is known as the Web metadata maintenance problem. Web metadata maintenance is challenging for several reasons:

– Autonomous changes to Web content. Web pages, residing on different sites, can be changed at any time by their owners. Such changes often happen without alerting the DL systems that maintain metadata about the affected pages.

– Manual effort to update metadata records. Whenever changes are detected on Web pages, the respective metadata owners have to update the affected metadata records, and the updates require manual inspection and judgement.

Web metadata maintenance consists of three major subtasks, namely (a) scheduled monitoring, (b) change detection, and (c) metadata record update. Scheduled monitoring periodically fetches the latest versions of the Web content objects so as to detect changes. Here, we assume that the Web content objects do not publish changes to DL systems as soon as changes occur; a pull-based monitoring approach is therefore required. Change detection refers to comparing different versions of a Web content object to determine whether changes have been made. Once changes are detected, the metadata records of the changed Web content objects have to be updated in subtask (c).

We observe that a metadata record may be derived from only some content regions in a Web page. Therefore, not all changes to a Web page result in changes to its metadata. To reduce false alarms in subtask (b) and to minimize manual effort in subtask (c), we propose a Key element-Context (KeC) model that allows metadata owners to select Web page content regions for monitoring. The main idea is to narrow the scope of Web page change detection using the concepts of key element and context. Given a metadata attribute, there is a content region in the Web page which is used to derive the value of the attribute directly. We call this content region the key element, and we introduce a concept known as context to help locate a key element. For a given metadata attribute, different choices of context(s), options to locate the context(s), and options to locate the key element may lead to different numbers of alerts and amounts of maintenance effort. We therefore develop evaluation metrics to measure these two types of overhead.

The KeC model was tested on a set of metadata records created for Wikipedia pages. The empirical results showed that the maintenance overheads could be reduced by making appropriate choices of context and key elements, and of the options to locate them. Compared with the naive approach of monitoring whole Web page changes, our proposed model showed significant improvements. The KeC model has been implemented as part of G-Portal as a Web metadata maintenance module. Details of this module are not covered in this paper due to space constraints.

The rest of the paper is organized as follows. Section 2 describes related research. We present the KeC model in Section 3 and define the evaluation metrics in Section 4. After reporting our experiments in Section 5, we conclude the paper in Section 6.

2 Related Work

Monitoring dynamic Web pages for metadata maintenance is a new and challenging problem. Sharaf et al. proposed a model for freshness-aware scheduling of continuous queries [5]. Continuous queries are queries registered by users and executed whenever a new update occurs in the Web page, in order to maintain an up-to-date result in a real-time fashion [7]. Pandey et al. proposed a Continuous Adaptive Monitoring (CAM) method consisting of a few phases in which Web pages are monitored based on the resources allocated [4]. The main purpose of this approach is to optimize a schedule for monitoring each Web page. Neither work provides a mechanism for monitoring specific Web content regions for metadata maintenance, and both assume that Web changes are “pushed” from Web sites to applications. In real life, however, it is more common to detect changes to Web pages using a “pull” (polling) approach. One of the general-purpose approaches closest to detecting relevant Web page changes for metadata maintenance is WebCQ [3]. It allows a user to specify Web objects for monitoring and tracks information at the Web page level. However, it does not support structured Web objects and is not designed for monitoring the Web content regions from which metadata attributes are derived.

3 KeC Model

3.1 Key Element and Context

Our proposed KeC model requires the metadata owner to define, for each metadata attribute, a key element. A key element is a content region in the Web page referenced by the metadata record, and this content region is directly used to derive the value of the attribute. A key element can take various forms, including text, tables, images, etc. Consider the following metadata record created for Wikipedia's Singapore page shown in Figure 1. It consists of three metadata attributes, all related to economy.

Metadata Attribute            Value
Total GDP                     $123.4 billion
Per capita                    $28,368
Manufacturing Contribution    $34.55 billion
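To make the relationship between attributes and key elements concrete, the following is a minimal Python sketch of how such a metadata record might be represented. The class and field names are illustrative assumptions of ours, not part of the paper or of G-Portal's actual implementation, and the stored content strings are hypothetical.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class KeyElement:
    """A content region in the referenced Web page from which an attribute value is derived."""
    description: str                  # e.g. "Total GDP cell in the country information box"
    content: str                      # last observed content of the region
    derivation: Optional[str] = None  # how the value is computed when it is not copied verbatim

@dataclass
class MetadataAttribute:
    name: str
    value: str
    key_element: KeyElement

@dataclass
class MetadataRecord:
    page_url: str
    attributes: List[MetadataAttribute] = field(default_factory=list)

# Hypothetical record for the Singapore example in Figure 1.
record = MetadataRecord(
    page_url="http://en.wikipedia.org/wiki/Singapore",
    attributes=[
        MetadataAttribute("Total GDP", "$123.4 billion",
                          KeyElement("GDP (total) cell in the information box", "$123.4 billion")),
        MetadataAttribute("Per capita", "$28,368",
                          KeyElement("GDP (per capita) cell in the information box", "$28,368")),
        MetadataAttribute("Manufacturing Contribution", "$34.55 billion",
                          KeyElement("Sentence on manufacturing share in the Economy section",
                                     "manufacturing contributed 28% of GDP",
                                     derivation="0.28 x $123.4 billion")),
    ],
)

Note that the Manufacturing Contribution value is derived manually from its key element rather than copied verbatim, which is why the record keeps a derivation note alongside the monitored content.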


Fig. 1. Wikipedia Article on Singapore (the original figure marks a key element region and its enclosing context region on the page)

As illustrated in the figure, the key elements of the Total GDP and Per capita attributes are two cells in an information box on the right of the page, where the values of the two attributes are found. The Manufacturing Contribution attribute, on the other hand, has a key element that is a sentence in the page, and its value is derived manually from the information provided as 0.28 × $123.4 billion = $34.55 billion. The fact that some metadata attribute values are not directly extractable from the Web page is not uncommon and should be considered in the design of a Web metadata maintenance module. A change to a key element most likely implies that an update has to be made on the corresponding metadata attribute. The kind of update to be made, however, very much depends on the way the attribute value is "derived" from the key element.

With key elements defined for metadata attributes, a Web metadata maintenance module can focus on only those changes to the Web page that affect the key elements. Alerts caused by other, unimportant changes to the Web page are known as false alarms. A good Web metadata maintenance module should therefore aim to reduce false alarms by monitoring changes to key elements only. A key element in a Web page can be identified either by its content or by its location.

– By content: This uses the latest content of the key element to identify its content region. For example, for key elements that are text regions, the text content can be used for identification. For key elements that include media objects such as image, audio, and video files, we can check their file names, timestamps, and/or hash values. This works well for key elements whose contents are unique. By using the content to identify a key element, we indirectly detect changes to it. The main drawback, however, is that one has to scan through the corresponding Web page to locate the key element(s).


– By location: This uses some location information to identify a key element. The location can range from a byte offset from the beginning of the Web page (the most rigid form) to some combination of HTML tag path and byte offset. The advantage of using location is that it does not require scanning the Web page for the key element's content. It cannot, however, accommodate minor changes to the key element's location that do not really affect the metadata attribute value, especially when the location of the key element is rigidly defined.

To overcome the inherent limitations of using the key element alone to track a Web page content region, we introduce the context as a larger content region enclosing a key element. In other words, we define for each key element a context that covers the key element's content region. Within a context, exactly one occurrence of the key element of a metadata attribute is expected. The content of the context may not have any relationship with the metadata attribute; its main role is to demarcate a content region within which an occurrence of the key element is to be found. This is important for a key element identified by content, in case the key element's content is not unique in the Web page. In addition, we then only need to scan for the key element within the context, a smaller region compared with the entire Web page. The context is also important for a key element identified by location, because it gives more flexibility to the specification of the key element's location. Instead of being defined with respect to the entire Web page, the key element's location can be defined with respect to the context. Hence, changes outside the context are not taken as location changes to the key element within the context. For example, in Figure 1, the Total GDP attribute has a key element defined by a cell in the information box. The context of this key element could be the information box itself. It does not really matter where the Total GDP cell is located as long as it is found within the information box (the context); the key element is said to float within the context. The identification of the context, and the accompanying options for identifying the key element and the context, are discussed in the next section.
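The two identification strategies can be sketched in a few lines of Python. This is a simplified illustration under the assumption that the page, context, and key element are plain text spans; the function names and the offset-based addressing are our own, not an API defined in the paper.

from typing import Optional, Tuple

def locate_context_by_signature(page: str, begin_sig: str, end_sig: str) -> Optional[Tuple[int, int]]:
    """Find a floating context delimited by a pair of signature patterns.
    Returns (start, end) offsets of the context in the page, or None if not found."""
    start = page.find(begin_sig)
    if start == -1:
        return None
    end = page.find(end_sig, start + len(begin_sig))
    if end == -1:
        return None
    return start, end + len(end_sig)

def locate_key_element_by_content(page: str, context_span: Tuple[int, int], content: str) -> Optional[int]:
    """Identify a key element by its last known content, scanning only inside the context."""
    start, end = context_span
    pos = page.find(content, start, end)
    return None if pos == -1 else pos

def locate_key_element_by_location(page: str, context_span: Tuple[int, int], offset: int, length: int) -> Optional[str]:
    """Identify a key element by a fixed byte offset relative to the context start.
    Returns the content found at that location, to be compared against the stored content."""
    start, end = context_span
    if start + offset + length > end:
        return None
    return page[start + offset : start + offset + length]

A monitoring module would first locate the context (by a fixed location or, as here, by its signatures) and then apply one of the two key-element lookups, alerting the owner when either step fails or when the content found differs from the stored value.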

3.2 Identification Options

Given a context, one can specify either of two identification options for finding the key element within it:

– Fixed key element: The key element is assigned a fixed location within its context. This is specified when the key element must stay at the same place in the context, even when the context itself may move around in the Web page (i.e., the context has a floating location).

– Floating key element: The key element is free to move within the context and is identified by its content (e.g., the Total GDP example).

The fixed location of a key element within the context can be either a byte offset, or a combination of an HTML tag path from the beginning of the context and a byte offset.

Fig. 2. Wikipedia Article on FIFA Standings Table (the original figure marks the key element, Context 1, and Context 2 regions in the table)

Each context itself, similar to a key element, is also assigned an identification option of either fixed or floating location.

– Fixed context: A context is designated a fixed location if it has to stay at the same place in the Web page, identified by a byte offset or by an offset from the beginning of an HTML element located by a tag path.

– Floating context: A floating context is used when its location in the Web page is not important. In this case, we can use the content of the context for identification, or a pair of signature patterns that mark the beginning and end of the context.

To sum up, the KeC model provides the following four combinations of identification options for a given pair of key element and context: fixed context and fixed key element, fixed context and floating key element, floating context and fixed key element, and floating context and floating key element. These options generate alerts differently with respect to a change in the Web page. For example, for the fixed context and fixed key element option, the metadata owner is alerted whenever the location of the context is not found in the Web page, the key element's location within the context is not found, or the key element's content has changed. The details of the other options are discussed in Section 4.
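A compact way to view the four combinations is as a two-flag specification attached to each monitored attribute. The sketch below, in Python, is our own illustration of such a specification and of the alert conditions for the fixed context and fixed key element case; the data structure is hypothetical and not taken from G-Portal.

from dataclasses import dataclass

@dataclass
class KeCSpec:
    """Identification options for one metadata attribute."""
    context_fixed: bool      # True: context located by byte offset / tag path; False: by content or signatures
    key_element_fixed: bool  # True: key element located by offset within the context; False: by content

# The four combinations provided by the KeC model.
COMBINATIONS = [
    KeCSpec(context_fixed=True,  key_element_fixed=True),   # fixed context, fixed key element
    KeCSpec(context_fixed=True,  key_element_fixed=False),  # fixed context, floating key element
    KeCSpec(context_fixed=False, key_element_fixed=True),   # floating context, fixed key element
    KeCSpec(context_fixed=False, key_element_fixed=False),  # floating context, floating key element
]

def should_alert_fixed_fixed(context_found_at_location: bool,
                             key_location_found: bool,
                             key_content_unchanged: bool) -> bool:
    """Alert conditions for the fixed context / fixed key element option:
    the context's location is missing, the key element's location within the
    context is missing, or the content at that location has changed."""
    return (not context_found_at_location
            or not key_location_found
            or not key_content_unchanged)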

3.3 Nested Context

So far in the KeC model, a key element is identified using one context; in other words, the model adopts a single context. This is appropriate for cases where the Web page or the monitored information is not structurally complex, e.g., the information box containing Total GDP in Figure 1. For more complicated cases where the key element cannot be easily identified using just one enclosing context, we define a nested context to be a context that encloses another context, and this enclosed context may in turn contain one or more smaller contexts. The largest context contains all other contexts, while the smallest one contains only the key element. Each nested context can also be identified by either a fixed or a floating location within its enclosing context.

For example, Figure 2 shows the standings table of national football teams. Assume that a metadata attribute concerns the position of the Brazil team, not its points. An appropriate combination of identification options is to use a context nested in another context and the key element as shown: Context 2 is identified by floating location, Context 1 is identified by fixed location, and the key element is also identified by floating location. As long as both contexts are found and the key element's content does not change, no alert is fired. On the other hand, changes to the location of Context 1 suggest changes in the position of the Brazil team, and changes in the key element's content may be the result of the team's removal from the standings table. In both cases, the metadata owner should be alerted to make appropriate revisions to his/her metadata. If this example were handled with a single context instead, by assigning the standings table as the context with the fixed location option and the particular row containing the Brazil team as the key element with the floating location option, a much higher number of alerts would be generated compared with the nested context approach, due to frequent changes in the team's points.
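Continuing the hypothetical specification style from the earlier sketches, a nested context can be represented as a chain of enclosing regions, each with its own fixed/floating flag. The example below mirrors the FIFA standings scenario; the class names, field names, and values are illustrative assumptions only.

from dataclasses import dataclass
from typing import List

@dataclass
class ContextLevel:
    description: str
    fixed_location: bool  # True: located by offset within the enclosing region; False: by content/signatures

@dataclass
class NestedKeC:
    # Outermost context first; the innermost level immediately encloses the key element.
    contexts: List[ContextLevel]
    key_element_fixed: bool
    key_element_content: str

# Hypothetical specification for "position of the Brazil team" in Figure 2.
brazil_position = NestedKeC(
    contexts=[
        ContextLevel("Context 2: the standings table", fixed_location=False),          # floating
        ContextLevel("Context 1: the table row at a given rank", fixed_location=True), # fixed within the table
    ],
    key_element_fixed=False,           # floating: identified by content
    key_element_content="Brazil",
)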

4 Evaluation Metrics

To evaluate the performance of our proposed KeC model, we propose a set of evaluation metrics which assume that there is a set of metadata attributes to be monitored. Each attribute has one key element, and all attributes share the same context. Although these metrics apply only to the single-context model, they can easily be extended to evaluate the performance of nested contexts. The evaluation metrics are divided into the total number of alerts and the true user effort. The total number of alerts refers to the number of messages that a metadata owner receives asking him/her to examine metadata attributes for possible revisions. Some of these alerts may eventually result in changes to metadata attributes, but others may not. The true user effort measures the amount of effort that a metadata owner spends to revise metadata attributes. To keep it simple, we only consider the effort of searching and checking the key elements' content of the affected attributes. This effort is measured by the amount of Web page content (in bytes) that the owner needs to examine. All notations used in our proposed evaluation metrics are given in Table 1.

Let K be the set of key elements sharing the same context. In the simplest case, where the KeC model is not used, any single change to the Web page triggers a user alert on all attributes. In terms of effort, the user needs to scan the entire Web page to see whether the change causes some update to the metadata attribute values. The number of alerts and the true user effort for this option are defined in Equations 1 and 2 respectively.

A_1 = N_W    (1)

E_1 = \sum_{v=2}^{N_W} L_{Wv}    (2)
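As a purely hypothetical illustration of Equations 1 and 2 (the numbers below are ours, not taken from the experiments): suppose a monitored page changes N_W = 10 times during the monitoring period and every version is about 50 KB long. Then

A_1 = N_W = 10,    E_1 = \sum_{v=2}^{10} L_{Wv} \approx 9 \times 50 \text{ KB} = 450 \text{ KB},

i.e., without the KeC model the owner is alerted on every single change and, in the worst case, has to rescan close to half a megabyte of page content.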

If the KeC model is used, each metadata attribute can be monitored with one of the following four options: fixed context and fixed key element, fixed context and floating key element, floating context and fixed key element, and floating context and floating key element. To keep the analysis simple, we assume that all attributes use the same option.

Table 1. Notations in Evaluation Metrics

N_W      # of times a Web page W is changed
N_Cl     # of times the context's location is not found
N_Cs     # of times the context is not found
N_Kli    # of times the i-th key element's location within the context is not found
N'_Kci   # of times the i-th key element's location is found but its content is not found
N_Kci    # of times the i-th key element's content is not found within the context
L_Wv     Length of the Web page at version v
L'_Wv    Equals L_Wv if the context's location is not found, and 0 otherwise
L''_Wv   Equals L_Wv if the context is not found, and 0 otherwise
L_Cv     Length of the context at version v if the context is found but the key element's location within the context is not found, and 0 otherwise
L'_Cv    Length of the context at version v if the context is found but the key element's content is not found within the context, and 0 otherwise
L_Kvi    Length of key element i at version v if its location is found but its content is not found, and 0 otherwise

Fixed context and fixed key element: The metadata owner is alerted whenever the location of the context is not found in the Web page, a key element's location within the context is not found, or a key element's content has changed. In these cases, the user effort is determined as follows:

– Location of the context is not found: The user effort involves the length of the Web page, since the new location of the context has to be determined.

– Location of a key element is not found: Since the context can be found, its region can be highlighted for the user to scan for the key element. Thus, the user effort involves the length of the context region.

– A key element's content has changed: Since the key element's location is unchanged, the user effort involves examining the new content within the key element's region so that the metadata owner can identify the changes.

The number of alerts and the true user effort are defined in Equations 3 and 4 respectively.

A_2 = N_{Cl} + \sum_{\text{key element } i \in K} (N_{Kli} + N'_{Kci})    (3)

E_2 = \sum_{v=2}^{N_W} \left( L'_{Wv} + L_{Cv} + \sum_{\text{key element } i \in K} L_{Kvi} \right)    (4)

Fixed context and floating key element: The metadata owner is alerted when the context's location is not found or a key element's content within the context is not found. If the context's location is not found, the metadata owner needs to search the entire Web page for the key element. If the context's location is found but not the key element's content, the context region can be highlighted to guide the search for the key element. Thus, the total number of alerts and the true user effort are defined in Equations 5 and 6 respectively.

A_3 = N_{Cl} + \sum_{\text{key element } i \in K} N_{Kci}    (5)

E_3 = \sum_{v=2}^{N_W} \left( L'_{Wv} + L'_{Cv} \right)    (6)

Floating context and fixed key element: If the context is not found or a key element's location within the context is not found, an alert is sent to the metadata owner. When the context is found but not all key elements' locations, the context region can be highlighted to guide the key element search. Thus, the total number of alerts and the user effort are defined by Equations 7 and 8 respectively.

A_4 = N_{Cs} + \sum_{\text{key element } i \in K} (N_{Kli} + N'_{Kci})    (7)

E_4 = \sum_{v=2}^{N_W} \left( L''_{Wv} + L_{Cv} + \sum_{\text{key element } i \in K} L_{Kvi} \right)    (8)

Floating context and floating key element: An alert is sent to the metadata owner whenever the context's content or a key element's content is not found. The context region, if found, can be highlighted to guide the search for the key element. Thus, the total number of alerts and the user effort are defined by Equations 9 and 10 respectively.

A_5 = N_{Cs} + \sum_{\text{key element } i \in K} N_{Kci}    (9)

E_5 = \sum_{v=2}^{N_W} \left( L''_{Wv} + L'_{Cv} \right)    (10)
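The metrics above are straightforward to compute once the per-version monitoring outcomes have been recorded. The following Python sketch is our own illustration of two of the four options; the observation record and its field names are assumptions made for the example, and the primed/unprimed counters of Table 1 are rendered here as key_content_changed (location found, content changed) versus key_content_missing (content not found in the context). L_Cv and L'_Cv are read as being counted once per affected version.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class VersionObservation:
    """Monitoring outcome for one new version v >= 2 of the Web page."""
    page_length: int                        # L_Wv
    context_location_found: bool            # False contributes to N_Cl (fixed-context options)
    context_found: bool                     # False contributes to N_Cs (floating-context options)
    context_length: int                     # length of the context at this version, if found
    key_location_missing: Dict[str, bool]   # per key element i: location within context not found (N_Kli)
    key_content_changed: Dict[str, bool]    # per key element i: location found but content changed (N'_Kci)
    key_content_missing: Dict[str, bool]    # per key element i: content not found within context (N_Kci)
    key_length: Dict[str, int]              # per key element i: its length at this version (L_Kvi)

def fixed_context_fixed_key(observations: List[VersionObservation]) -> Tuple[int, int]:
    """Alerts and effort following Equations 3 and 4."""
    alerts, effort = 0, 0
    for o in observations:
        if not o.context_location_found:
            alerts += 1                      # N_Cl
            effort += o.page_length          # L'_Wv: rescan the whole page
            continue
        some_key_location_missing = False
        for i, loc_missing in o.key_location_missing.items():
            if loc_missing:                  # N_Kli
                alerts += 1
                some_key_location_missing = True
            elif o.key_content_changed.get(i, False):   # N'_Kci
                alerts += 1
                effort += o.key_length.get(i, 0)        # L_Kvi
        if some_key_location_missing:
            effort += o.context_length       # L_Cv: rescan the context for this version
    return alerts, effort

def floating_context_floating_key(observations: List[VersionObservation]) -> Tuple[int, int]:
    """Alerts and effort following Equations 9 and 10."""
    alerts, effort = 0, 0
    for o in observations:
        if not o.context_found:
            alerts += 1                      # N_Cs
            effort += o.page_length          # L''_Wv
            continue
        missing = [i for i, m in o.key_content_missing.items() if m]
        alerts += len(missing)               # sum of N_Kci over key elements
        if missing:
            effort += o.context_length       # L'_Cv
    return alerts, effort

The remaining two options (Equations 5 to 8) follow the same pattern, swapping which counters and lengths are accumulated.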

5 Experiment

To evaluate the effectiveness of the proposed model, we conducted experiments on a set of Wikipedia articles. The objective is to see how well the proposed model performs in terms of reducing the number of alerts and the true user effort. We also investigated the situations in which each identification option performs best.

5.1 Data Sets

We use two data sets. The first comprises 174 Wikipedia articles on randomly selected countries, obtained from a publicly available listing of countries. We simulated the Web page evolution process by retrieving the edit history of each article and extracting versions from this edit history. The country articles' edit histories were retrieved for the period from July 1, 2006 to December 31, 2006. The second data set consists of only one article, about the FIFA World Ranking (http://en.wikipedia.org/wiki/Fifa_world_rankings). The edit history used spans July 1, 2004 to December 31, 2006.
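To give a sense of how such an evolution simulation might be wired up, here is a small self-contained Python sketch that replays a sequence of stored page versions through a change-detection callback and tallies alerts. The directory layout, file naming, and the detect callback are our own assumptions for illustration; the paper does not describe the authors' actual harness.

import os
from typing import Callable, List, Tuple

def load_versions(version_dir: str) -> List[str]:
    """Load saved page versions, assuming one file per version, named so that
    lexicographic order equals chronological order (e.g. 0001.txt, 0002.txt, ...)."""
    names = sorted(f for f in os.listdir(version_dir) if not f.startswith("."))
    versions = []
    for name in names:
        with open(os.path.join(version_dir, name), encoding="utf-8") as fh:
            versions.append(fh.read())
    return versions

def replay(versions: List[str], detect: Callable[[str, str], bool]) -> Tuple[int, int]:
    """Feed consecutive version pairs to a change detector.
    detect(old, new) returns True when an alert would be raised.
    Returns (number of version transitions, number of alerts)."""
    alerts = 0
    for old, new in zip(versions, versions[1:]):
        if detect(old, new):
            alerts += 1
    return len(versions) - 1, alerts

if __name__ == "__main__":
    # Example: the naive baseline (Option 1) alerts on any textual difference.
    versions = load_versions("data/Singapore")  # hypothetical path to saved versions
    transitions, alerts = replay(versions, lambda old, new: old != new)
    print(f"{transitions} version transitions, {alerts} alerts under the naive baseline")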

5.2 Experiment Setup

We applied the proposed model and evaluation metrics to measure performance in three cases: monitoring changes for metadata derived from the information box tables of country articles, monitoring changes for metadata derived from sentences in the Economy paragraph of the country articles, and monitoring changes for metadata derived from the FIFA standings table.

Experiment 1: Information box table of country articles. In this experiment, we built a metadata record for each country. Each metadata record consists of 15 attributes whose values are derived from the information box table. The attributes were selected so that the first versions of most articles contain them. For articles that cover fewer than 15 of these attributes, we used the subset of attributes that appear in the article's first version. Our statistics showed that no article contained fewer than 12 of the attributes in its first version. The identification options shown in Table 2 were used, in which Content Region 1 is the information box and Content Region 2 is a cell in the information box.

Experiment 2: Sentences in the Economy paragraph of country articles. In this experiment, we built one metadata record for each country. Each metadata record contains one attribute, derived from the third sentence in the first version of the Economy paragraph. The identification options are shown in Table 2, where Content Region 1 is the Economy paragraph and Content Region 2 is the third sentence in the first version of the Economy paragraph.

Experiment 3: FIFA standings table. In this experiment, we built one metadata record with eight attributes corresponding to the positions of eight football teams in the standings table. The identification options are shown in Table 2, where Content Region 1 is the standings table and Content Region 2 is a cell in the standings table. In all three experiments, for the floating-context options we used the context's signature patterns to locate Content Region 1.

5.3 Experimental Results

Table 3 shows the average total number of alerts and true user effort for the different identification options in the three experiments.

Experiment 1: Information box table of country articles. As shown in Table 3, Options 5 and 7 gave the smallest number of alerts (60). Both options assigned the information box as the context and each information box cell as a key element.


Table 2. Identification options for experiments

Option   Context            Context location    Key element        Key element location
Opt 1    No monitoring model is deployed.
Opt 2    Full Web page      Fixed location      Content Region 1   Fixed location
Opt 3    Full Web page      Fixed location      Content Region 1   Floating location
Opt 4    Content Region 1   Fixed location      Content Region 2   Fixed location
Opt 5    Content Region 1   Fixed location      Content Region 2   Floating location
Opt 6    Content Region 1   Floating location   Content Region 2   Fixed location
Opt 7    Content Region 1   Floating location   Content Region 2   Floating location

Table 3. Experimental Results

Experiment              Metric                   Opt 1   Opt 2   Opt 3   Opt 4   Opt 5   Opt 6   Opt 7
Exp 1 (823 versions)    Total Alerts             823     103     103     165     60      165     60
                        True User Effort (MB)    38.07   4.48    4.57    0.65    0.68    0.659   0.68
Exp 2 (823 versions)    Total Alerts             823     37.4    37.6    41.9    41      31      30
                        True User Effort (MB)    38.07   1.78    1.70    1.78    1.80    0.81    0.82
Exp 3 (1113 versions)   Total Alerts             1113    241     241     588     116     588     116
                        True User Effort (MB)    23.47   0.35    0.41    0.22    0.06    0.22    0.06

Option 5 uses a fixed context and Option 7 a floating context. Their numbers of alerts were much smaller than the number generated when the proposed monitoring model is not used. In terms of user effort, it turns out that Option 4 was the best, rather than Options 5 and 7. This is because the fixed context and fixed key element settings always help to visually guide the metadata owner to the new context or key element more easily, thus reducing the effort of updating the metadata attributes.

Experiment 2: Sentences in the Economy paragraph of country articles. Option 7 (floating paragraph, floating sentence) returned the best result in terms of the total number of alerts: it outperformed Options 2, 3, 4 and 5 and was slightly better than Option 6. In terms of true user effort, Option 6 (floating paragraph, fixed sentence) was the best. It reduced the effort very significantly compared with no monitoring, roughly halved the user effort of Options 2, 3, 4 and 5, and was slightly better than Option 7.

Experiment 3: FIFA standings table. As shown in Table 3, Options 5 and 7 returned the best results in terms of the number of alerts, reducing it substantially. We also notice that the options assigning the Web page as the context and the standings table as the key element (Options 2 and 3) generated fewer alerts than the options assigning the standings table as the context and cells as fixed key elements (Options 4 and 6). This is because in Options 2 and 3 any important change alerts the metadata owner once, while in Options 4 and 6 the same change may trigger more than one alert when more than one metadata attribute is monitored. However, in terms of true user effort, Options 4 and 6 had better results than Options 2 and 3, as their alerts already contain information that reduces the maintenance effort. Overall, the best options in this experiment in terms of user effort were again Options 5 and 7; using them reduces the effort of Option 1 by a factor of 372.8.

6 Conclusions

As the Web increasingly becomes the preferred source of information, it is necessary to create and maintain metadata for useful Web content. This paper introduces the KeC model to reduce the maintenance effort by tracking only the relevant content regions in Web pages. With the various identification options provided, a metadata owner can select the most appropriate key element and context specifications and identification options for the monitored data. The paper also introduces evaluation metrics that measure the performance of the proposed model by the number of alerts generated and the user's maintenance effort. We conducted experiments on three different Web metadata monitoring scenarios. The results showed that our proposed KeC model significantly reduced the number of alerts as well as the amount of user effort. The proposed KeC model has been implemented in the Web metadata monitoring subsystem of G-Portal, a digital library system for geography education.

References

1. Lagoze, C., Krafft, D., Cornwell, T., Dushay, N., Eckstrom, D., Saylor, J.: Metadata aggregation and automated digital libraries: a retrospective on the NSDL experience. In: Proceedings of the ACM/IEEE-CS Joint Conference on Digital Libraries, Chapel Hill, NC, pp. 230–239 (2006)
2. Lim, E.-P., Goh, D.H.-L., Liu, Z., Ng, W.-K., Khoo, C.S.-G., Higgins, S.E.: G-Portal: A map-based digital library for distributed geospatial and georeferenced resources. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, Portland, Oregon, pp. 351–358 (2002)
3. Liu, L., Pu, C., Tang, W.: WebCQ – detecting and delivering information changes on the Web. In: Proceedings of ACM CIKM, McLean, Virginia, pp. 512–519 (2000)
4. Pandey, S., Ramamritham, K., Chakrabarti, S.: Monitoring the dynamic Web to respond to continuous queries. In: Proceedings of the World Wide Web Conference, Budapest, Hungary, pp. 659–668 (2003)
5. Sharaf, M.A., Labrinidis, A., Chrysanthis, P.K., Pruhs, K.: Freshness-aware scheduling of continuous queries in the dynamic Web. In: Proceedings of the ACM Workshop on the Web and Databases, Baltimore, Maryland, pp. 73–78 (2005)
6. Sumner, T., Dawe, M.: Looking at digital library usability from a reuse perspective. In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries, Roanoke, VA, pp. 416–425 (2001)
7. Terry, D., Goldberg, D., Nichols, D., Oki, B.: Continuous queries over append-only databases. In: Proceedings of ACM SIGMOD, San Diego, California, pp. 321–330 (1992)
