
Quantifying and Measuring Metadata Completeness

Merkourios Margaritopoulos, Thomas Margaritopoulos, Ioannis Mavridis, and Athanasios Manitsaris
Department of Applied Informatics, University of Macedonia, 54006, Thessaloniki, Greece. E-mail: {mermar, margatom, mavridis, manits}@uom.gr

Completeness of metadata is one of the most essential characteristics of their quality. An incomplete metadata record is a record of degraded quality. Existing approaches to measure metadata completeness limit their scope in counting the existence of values in fields, regardless of the metadata hierarchy as defined in international standards. Such a traditional approach overlooks several issues that need to be taken into account. This paper presents a fine-grained metrics system for measuring metadata completeness, based on field completeness. A metadata field is considered to be a container of multiple pieces of information. In this regard, the proposed system is capable of following the hierarchy of metadata as it is set by the metadata schema and admeasuring the effect of multiple values of multivalued fields. An application of the proposed metrics system, after being configured according to specific user requirements, to measure completeness of a real-world set of metadata is demonstrated. The results prove its ability to assess the sufficiency of metadata to describe a resource and provide targeted measures of completeness throughout the metadata hierarchy.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 63(4):724-737, 2012. Received April 4, 2011; revised October 3, 2011; accepted October 17, 2011. © 2011 ASIS&T. Published online 7 December 2011 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.21706

Introduction

Metadata can be simply defined as "data about data" or "information about information." The purpose of metadata is to provide adequate, correct, and relevant information to users so as to obtain a true picture of the resource they describe without having to access it. Completeness of metadata refers to their sufficiency to fully describe a resource covering all its possible aspects. Given the more refined definition for metadata by Greenberg (2003), as structured data about an object that supports functions associated with the designated object, it becomes clear that sufficiency to fully describe a resource is directly associated with the particular activities or processes that metadata are intended to support. Furthermore, Sicilia, García, Pagés, Martínez, and Gutiérrez (2005) argue that each kind of activity or process demands a number of concrete metadata elements. Thus,

completeness of metadata depends on the specific functionality or usage expected by the particular metadata application. Quoting from Ochoa and Duval (2009, p. 71), “Completeness is the degree to which the metadata instance contains all the information needed to have a comprehensive representation of the described resource.” A comprehensive representation varies according to the application and the community of use. Numerous metadata standards have been established in an attempt to define sufficient descriptions of a resource from different perspectives and satisfy diverse functionalities. Theoretically, a sufficient description exists when all metadata elements of a standard are populated with values. However, in practice this is not what happens in the real world. Relevant surveys by Guinchard (2002), Najjar, Ternier, and Duval (2003) and Friesen (2004) have shown that indexers tend to fill out only particular metadata elements that could be considered “popular,” while they ignore other less popular elements. The creation of metadata is a task requiring major labor and financial cost and, most important, the involvement of knowledgeable and experienced people (Liddy et al., 2002; Barton, Currier, & Hey, 2003). Since all these requirements are, generally, difficult to be fully met, it is rather common, in the majority of digital repositories, to have incomplete metadata. The issue of incomplete metadata records is rather problematic, especially, in collections resulting from harvesting from metadata databases (Dushay & Hillmann, 2003) or from automatically generated metadata (Greenberg, Spurgin, & Crystal, 2005; Ochoa, Cardinaels, Meire, & Duval, 2005; Margaritopoulos M., Margaritopoulos, T., Kotini, & Manitsaris, 2008). On the other hand, when searching for information, users of metadata limit their search criteria by using only a very small percentage of metadata elements, as highlighted by Najjar et al. (2004). This fact, as Stvilia, Gasser, Twidale, Shreeves, and Cole (2004) note, shows that completeness is often in conflict with simplicity. However, this conflict between completeness and simplicity of a metadata standard is attempted to be balanced by application profiles (Hillmann & Phipps, 2007). Application profiles, among other things, select certain metadata fields of one or more standards and


consider them as mandatory or as optional, based on the needs of the particular users’ community. The concept of completeness of metadata, as an object of study, is integrated in the more general concept of metadata quality. The research community considers completeness of metadata a fundamental quality characteristic. Several dimensions or characteristics of information quality and metadata quality have been proposed by researchers (Bruce & Hillmann, 2004; Stvilia, Gasser, Twidale, & Smith, 2007; Margaritopoulos T., Margaritopoulos, M., Mavridis, & Manitsaris, 2008; Arazy & Kopak, 2011) in an effort to define quality and provide the necessary means for its assessment and improvement. Among the numerous suggested quality dimensions, completeness is one of the most essential. Indeed, it can be considered a prerequisite to assess quality, since incomplete records are in any case not of quality due to lack of essential information (Sicilia et al., 2005). Among all the constituent elements of metadata quality, completeness is the easiest to be quantified and measured by automatic means without human intervention. This article is organized as follows. The next section gives a brief account of related work on research efforts on metadata completeness measurement, followed by the section presenting the motivation that led to the design of a new fine-grained metrics system for metadata completeness, which is introduced afterwards. The paper continues presenting an implementation of the proposed metrics on two real-world metadata records. First, the system is configured according to user requirements and then the measurement results themselves are presented, after applying the respective mathematical formulas. A discussion of the proposed metrics follows and, finally, the last section concludes the article.

Related Work Several researchers have created metrics to measure metadata quality by computing indicators of quality and, among them, the completeness indicator of a metadata record. Completeness is measured based on the presence or absence of values in metadata fields1 defined in different metadata standards. In some research efforts, the fields to be considered when completeness is measured are selected by the application that uses the metadata based on their importance for the specific process or activity handling the metadata. In a study to assess the quality of metadata records in the National Science Digital Library (NSDL) repository, Bui and Park (2006) harvested over 1 million Dublin Core metadata records and assessed metadata quality in terms of metadata uses in frequency, consistency, completeness, and accuracy. Completeness is given a percentage measure based on the presence or absence of values in the 15 elements of the simple Dublin Core standard (Dublin Core Metadata Element Set).

It is assumed that an element is marked as present if one instance of the element is populated with a value. In Hughes (2004) an algorithm to assess metadata quality of the Open Language Archive Community (OLAC) metadata repository is presented. The algorithm computes an overall quality score of a metadata record based on several metadata characteristics, among them the element absence penalty. Certain elements are considered necessary; thus, their absence results in reduced quality, introducing in this way the idea of the weighted importance of the elements in terms of their contribution to metadata quality. Moen, Stewart, and McClure (1998) describe techniques and procedures used in an exploratory, systematic assessment of one particular set of metadata records used in implementations of the U.S. Government Information Locator Service (GILS). Completeness is expressed only by counting the number of elements in a metadata record and no measure is given in relation to a “complete” record. “Mandatory” and “optional” elements are measured separately, thus underlining the importance of the context of use when assessing completeness. Furthermore, repeatable elements (multiple instances of the same element) are taken into account. Moreira, Gonçalves, Laender, and Fox (2009) present a tool to perform automatic evaluation of a digital library. For the evaluation of metadata, completeness is defined according to the number of fields of the Dublin Core standard found in a record. However, the examined records are checked only for the existence of the 15 fields of the simple DC standard. A complete record with 15 fields present is assigned a completeness measure of 1. Multiple instances of the same field are not considered. Metrics to measure metadata quality in digital repositories are introduced in Ochoa and Duval (2009). More specifically, completeness is defined as the number of fields that contain a non-null value divided by the total number of fields of the record according to the metadata standard, as in: com =

\sum_{i=1}^{N} P(i) \Big/ N \qquad (1)

where N is the number of fields and P(i) is 1 if the i-th field has a non-null value, 0 otherwise. In cases where the fields of a record are not of equal importance, any particular application might assign weights of importance to each field. Therefore, the measure of “weighted completeness” is computed as: com =

\sum_{i=1}^{N} \big( a_i \cdot P(i) \big) \Big/ \sum_{i=1}^{N} a_i \qquad (2)

where a_i is the weight of field i. Thus, a field is considered either complete or noncomplete, and can only be assigned one of two discrete values.
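To make these baseline metrics concrete, the following is a minimal Python sketch of Equations (1) and (2), assuming a record is represented as a dict mapping field names to values (None meaning absent). The record layout and the example fields and weights are illustrative assumptions, not taken from the cited studies.

# Minimal sketch of Equations (1) and (2): completeness as the (weighted)
# fraction of fields holding a non-null value. The dict-based record layout
# and the example fields/weights are illustrative assumptions.

def completeness(record, fields, weights=None):
    """Weighted completeness of `record` over `fields` (Equation (2));
    with equal weights it reduces to Equation (1)."""
    if weights is None:
        weights = {f: 1.0 for f in fields}   # Equation (1): equal weights
    present = {f: 1.0 if record.get(f) is not None else 0.0 for f in fields}
    return sum(weights[f] * present[f] for f in fields) / sum(weights[f] for f in fields)

record = {"Title": "A paper", "Creator": "A. Author", "Subject": None}
fields = ["Title", "Creator", "Subject", "Date"]   # illustrative subset of DC
print(completeness(record, fields))                # 2 filled / 4 fields = 0.5
print(completeness(record, fields,
                   weights={"Title": 0.4, "Creator": 0.3,
                            "Subject": 0.2, "Date": 0.1}))   # 0.4 + 0.3 = 0.7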

1 The terms "metadata field" and "metadata element" are used interchangeably throughout this article.

Motivation

All of the above-referenced research efforts define completeness at the record level by measuring the number of


elements or providing formulas to compute completeness of a record based on the existence or absence of values in the metadata fields of a standard. Some of them take into account the relative importance of each field, either by setting weights of importance or by only considering certain fields and ignoring others. Although such metrics seem to provide meaningful measures of completeness of a metadata record, several issues are overlooked, raising obvious concerns. A source of concern is the case of multivalued fields (fields with cardinality greater than 1). In Ochoa and Duval (2009), a multivalued field is considered to be complete if at least one instance of the field exists. However, this proposition overlooks the fact that a multivalued field can be loaded with a variable amount of information. As a consequence, such variations are not reflected in the measure of completeness of the field's record, yielding degraded results. The need to consider the multiple instances of multivalued fields is recognized in Bui and Park (2006), where the instances of the Subject field are recorded separately, since the authors recognize the need to analyze the richness of the various subject areas in a repository. Another major source of concern is the case of aggregate fields (or container elements). For example, metadata standards like IEEE LOM (Learning Object Metadata) and LoC METS (Metadata Encoding & Transmission Standard; http://www.loc.gov/standards/mets) define a hierarchy of data elements, including aggregate data elements and simple data elements (leaf nodes of the hierarchy). Only leaf nodes have individual values defined through their associated value space and datatype. If completeness is measured taking into account only the simple fields that have individual values, then the whole structure of the metadata schema is ignored. There is no meaning in assigning a 0 or 1 as an existence indicator of value when measuring completeness in the 58 simple fields of the IEEE LOM metadata schema, regardless of the parent node they belong to and ignoring the hierarchy and interdependence of the fields. For example, how can we count multiple instances of the simple field "2.3.1 Entity" belonging to multiple instances of the parent aggregate field "2.3 Contribute"? Moreover, none of the above literature examines the presence of multiple representation forms of the same value in a metadata field as a possible factor that might influence completeness. This source of additional information included in a metadata record is a matter of possible consideration when completeness of the record is to be measured. The above discussion brings out the obvious need to redefine completeness in a way that takes care of all the aforementioned issues of concern. In an effort to measure completeness, we will comply with the following principles: Completeness of a record must only be measured against the specifications of the metadata standard or application profile used by any given application. The standard or the application profile prescribes a certain number of fields as necessary placeholders of information to describe a resource. Completeness is then considered as the degree to which these

placeholders are full with values. That degree may swing in a continuum from total presence to absolute absence of values. Thus, a requirement for a metrics system to measure completeness would be its ability to assign values in a continuous range between these two endpoints. Intuitively, a range of values in the closed interval [0, 1] seems to adequately satisfy this requirement, since nonexistence of completeness would map to the value of 0 and total completeness would map to the value of 1, offering the convenience of easily translating the values to percentages. Limiting the range value of completeness in this condensed interval should not restrict the ability of the metrics system to detect any slight variation in the amount of information contained in a metadata record. Therefore, a critical requirement for the system would be to be able to differentiate the measure of completeness so as to reflect such variations. The following section will introduce a new fine-grained metrics system for metadata completeness based on the above requirements. Based on the analysis of the introductory and the motivation section of the paper, it is clear that completeness is only examined in terms of the existence of values in metadata elements and is not concerned with the values themselves, that is, we do not deal with the so-called “information content” expected to be found inside a metadata value. Such an orientation—for example, counting the number of words in free text fields and comparing to some minimum standard—would represent an absolutely different approach than the present one to the issue of metadata completeness that is definitely out of the scope of the present research effort. Yet we need to emphasize that the evaluation of completeness as a metadata quality parameter based on the existence of values in metadata elements (regardless of the actual content of these values) does not, in any way, underestimate or overlook this aspect. Within the context of our conceptual framework for metadata quality assessment introduced in Margaritopoulos T., Margaritopoulos, M., Mavridis, and Manitsaris (2008), the evaluation of the content of a value is an issue assigned to the other two dimensions of metadata quality, i.e., correctness and relevance (the measurement of which still remains an open research issue)–providing that a value is already there. This condition, i.e., the existence of values, is evaluated and measured by completeness. The Proposed Metadata Completeness Metrics System The keystone concept of the proposed fine-grained metrics system for measuring metadata completeness is the concept of completeness of a field, which will be defined in the next subsection. Completeness of a Metadata Field Just as completeness of a metadata record is defined as the weighted average of the existence or nonexistence of values in its fields (according to the metadata standard), completeness of a metadata field is defined following the same logic


but also taking into account the nature of the field (simple, aggregate, multivalued). Thus, completeness of a field f, represented as COM(f), will be given the disjunctive definition of Equation (3):

COM(f) = \begin{cases} \sum_{k=1}^{m(f)} a_k(f) \cdot P_k(f), & \text{if } f \text{ is a simple field} \\ \sum_{k=1}^{m(f)} a_k(f) \cdot COM_k(f), & \text{if } f \text{ is an aggregate multi-valued field} \\ \sum_{k=1}^{nsf(f)} a_k(f) \cdot COM(sf_k(f)), & \text{if } f \text{ is an aggregate single-valued field} \end{cases} \qquad (3)

In Equation (3), f represents any node (field) belonging to the hierarchy of the metadata schema. For example, for the IEEE LOM metadata schema, f may be a field like "5.2 Educational.Learning resource type," a category like "1 General," or even the whole record, which is the root node of the hierarchy. The first case of the split definition of Equation (3) defines completeness of a simple field, either single-valued or multivalued. The minimum number of values field f should have in order to be considered complete is represented by m(f). Considering the values of field f to be stored in m(f) placeholders, P_k(f) is defined as an existence indicator marking the presence (equal to 1) or absence (equal to 0) of a value in the k-th placeholder. Then, completeness of a simple field is defined as the weighted average of the existence indicators of its values. In the special case of a single-valued field, obviously m(f) is equal to 1. In a multivalued field, m(f) is a number depending on the semantics of the field. Sometimes it is defined by reality itself, for example, the number of languages contained in a multilanguage document. In cases where m(f) is not a real-life fact, it should be defined by the application in which the metadata are used, for example, the number of third-party annotations for a resource. To give an example of the calculation involved in the first case of Equation (3), let us assume that we have a DC metadata record with two instances of the simple multivalued field "Creator" (f = "Creator") registered in the record. If the resource associated with the metadata record has been created by three creators (m(f) = 3), then the existence indicators of the (three) expected values would be P_1(f) = 1, P_2(f) = 1, P_3(f) = 0. Considering the expected values to be equally weighted (a_1(f) = 0.33, a_2(f) = 0.33, a_3(f) = 0.33) and applying the first case of Equation (3), completeness of the field "Creator" would be:

COM(f) = a_1(f) · P_1(f) + a_2(f) · P_2(f) + a_3(f) · P_3(f) = 0.33 · 1 + 0.33 · 1 + 0.33 · 0 = 0.66

The second case of the split definition of Equation (3) is similar to the first one, with the difference that the values of an aggregate field are structured fields themselves,

bearing a measure of completeness of their own, other than the two discrete values of the existence indicator P_k(f). Thus, completeness of a multivalued aggregate field is computed as a weighted average of the completeness measures of its values (instances), with COM_k(f) being the completeness measure of the k-th value of the field. For example, completeness of the category "8 Annotation" of IEEE LOM will be computed as the weighted average of the measures of completeness of each "8 Annotation" instance. In this particular case, m(f) will be a certain number of annotations expected to be found in a record by a metadata application that would be regarded as sufficient in terms of information provided by previous users on the educational use of the associated learning object. Assuming that in an IEEE LOM metadata record two instances of the field "8 Annotation" (f = "8 Annotation") are registered, while the metadata application expects four instances (m(f) = 4), and the measures of completeness of the two registered instances are 0.8 and 0.4 (COM_1(f) = 0.8, COM_2(f) = 0.4, COM_3(f) = 0, COM_4(f) = 0), while each expected instance is considered to be equally weighted (a_1(f) = 0.25, a_2(f) = 0.25, a_3(f) = 0.25, a_4(f) = 0.25), then, applying the second case of Equation (3), completeness of the field f = "8 Annotation" would be:

COM(f) = a_1(f) · COM_1(f) + a_2(f) · COM_2(f) + a_3(f) · COM_3(f) + a_4(f) · COM_4(f) = 0.25 · 0.8 + 0.25 · 0.4 + 0.25 · 0 + 0.25 · 0 = 0.3

In the first two cases of Equation (3), a_k(f) represents the weights of importance of the values of a multivalued field. In order to limit the measures of completeness to the closed interval [0, 1], these weights are forced to have a sum of 1. The values of a multivalued field are generally considered to be equally weighted, that is, no value is regarded as more important than the rest. However, there might be applications where this default assumption does not fit. For example, an application might wish to assign progressively reduced weights to the creators of a resource being registered in a metadata record, as it may not consider any additional creators as important as the previous ones. In our previous example, where we had a DC metadata record with two instances of the simple multivalued field "Creator" while the resource associated with the record was actually created by three creators, if we had assigned progressively reduced weights to the "Creator" instances (e.g., a_1(f) = 0.6, a_2(f) = 0.3, a_3(f) = 0.1), then completeness of the field "Creator" would have been:

COM(f) = a_1(f) · P_1(f) + a_2(f) · P_2(f) + a_3(f) · P_3(f) = 0.6 · 1 + 0.3 · 1 + 0.1 · 0 = 0.9

Another special case of weight assignment is that of a conditional weight for an aggregate field, based on the value of one of its subfields. For example, referring to IEEE LOM fields, an application might wish to assign a higher weight to an instance of "2.3 Lifecycle.Contribute" in which


the subfield "2.3.1 Lifecycle.Contribute.Role" is "author" than to a different instance where the role of the contributor is "editor." In the last case of the split definition of Equation (3), completeness of a single-valued aggregate field is computed from the measures of completeness of its subfields, referring to the fields of the next lower level down the metadata hierarchy. The symbol nsf(f) represents the number of subfields of field f according to the metadata standard or application profile used, and sf_k(f) is the k-th subfield of field f. For example, for each (single-valued) instance of the aggregate field "8 Annotation" of IEEE LOM, nsf("8 Annotation") = 3, which is the number of the next lower level subfields—8.1 through 8.3—and sf_2("8 Annotation") is "8.2 Annotation.Date." The variable a_k(f) represents the weight of each subfield, which, in this case, is its relative importance to its parent field. Again, the sum of these weights is considered to be equal to 1. If an instance of "8 Annotation" contains a value for the subfield "8.1 Annotation.Entity," no value for the subfield "8.2 Annotation.Date," and a value for the subfield "8.3 Annotation.Description," then, since all three subfields are simple single-valued fields, the completeness measure of each will be equal to its existence indicator, that is, COM(sf_1(f)) = 1, COM(sf_2(f)) = 0, and COM(sf_3(f)) = 1. If we assign a weight of 0.4 to both subfields "8.1" and "8.3" and a weight of 0.2 to subfield "8.2" (a_1(f) = 0.4, a_2(f) = 0.2, a_3(f) = 0.4), then the completeness measure of this particular instance of field "8" will be calculated according to the third case of Equation (3), i.e.:

COM(f) = a_1(f) · COM(sf_1(f)) + a_2(f) · COM(sf_2(f)) + a_3(f) · COM(sf_3(f)) = 0.4 · 1 + 0.2 · 0 + 0.4 · 1 = 0.8

The definition of Equation (3) for the completeness of a field has a clearly recursive nature, since it uses previous values from lower field levels to create new ones. For example, completeness of the category "1 General" of IEEE LOM is computed as the weighted average of the measures of completeness of its eight subfields (1.1 through 1.8). Seven of these eight subfields are simple fields. In case they are single-valued, such as "1.7 General.Structure," completeness is equal to the existence indicator of the value. In the case of multivalued fields, such as "1.4 General.Description," completeness is computed as the weighted average of the measures of completeness of their instances. The completeness of each instance of the multivalued simple fields is, again, equal to the existence indicator of its value. Field "1.1 General.Identifier" is an aggregate field, so its completeness will be computed as the weighted average of the measures of completeness of its subfields ("1.1.1 General.Identifier.Catalog" and "1.1.2 General.Identifier.Entry"), which, in turn, are simple fields whose completeness is equal to the existence indicators of their values. Thus, traversing the sub-tree of the category "1 General" of IEEE LOM down to its leaves, we compute the completeness of this aggregate field.
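As a concrete illustration of this recursion, here is a minimal Python sketch of Equation (3). The encoding of schema nodes and record instances as nested dicts and lists (with "kind", "m", "weights", and "subfields" keys) is an assumption made for the example, not the paper's own data model; the "8 Annotation" fragment reproduces the worked example above.

# A sketch of the recursion in Equation (3). Schema nodes are dicts; record
# instances are lists (one entry per registered instance). The names and the
# nesting convention are illustrative assumptions.

def com_field(node, instances):
    """COM(f) for schema `node`, given the list of registered `instances`."""
    m = node.get("m", 1)
    weights = node.get("weights") or [1.0 / m] * m
    if node["kind"] == "simple":
        # First case of Eq. (3): weighted share of the m expected slots filled.
        return sum(w for k, w in enumerate(weights) if k < len(instances))
    if m > 1:
        # Second case: weighted average of the completeness of each instance.
        scores = [com_instance(node, inst) for inst in instances]
        return sum(w * s for w, s in zip(weights, scores))
    # Third case: single-valued aggregate field.
    return com_instance(node, instances[0] if instances else {})

def com_instance(node, instance):
    """Completeness of one instance of an aggregate field (third case of Eq. (3))."""
    return sum(w * com_field(sub, instance.get(name, []))
               for name, (w, sub) in node["subfields"].items())

# Fragment mirroring the "8 Annotation" example (m(f) = 4 expected instances):
annotation = {"kind": "aggregate", "m": 4, "subfields": {
    "8.1 Entity":      (0.4, {"kind": "simple", "m": 1}),
    "8.2 Date":        (0.2, {"kind": "simple", "m": 1}),
    "8.3 Description": (0.4, {"kind": "simple", "m": 1}),
}}
registered = [
    {"8.1 Entity": ["J. Doe"], "8.3 Description": ["very useful"]},   # COM = 0.8
    {"8.1 Entity": ["A. Roe"]},                                       # COM = 0.4
]
print(round(com_field(annotation, registered), 2))   # 0.25*0.8 + 0.25*0.4 = 0.3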

In general, in order to compute completeness of any node we start from the node and traversing downwards the hierarchy of the metadata schema, we recursively compute the measures of completeness of the nodes of its sub-tree, applying Equation (3). The base of the recursion is always the existence indicator of a single value of a simple field. Completeness of the whole record is computed the same way—starting from the top level (the root node) of the hierarchy. Completeness at the Representation Level When traversing the nodes of a metadata hierarchy in order to measure completeness of a record or of a specific field, one could say that the end of this process is the lowest level of the hierarchy or a field of a leaf node with no further structure of subfields.Yet the measure of completeness cannot just be computed by checking the presence of values in a simple field because such a field may still contain additional information that might affect the ideal representation of the described resource. This additional information belongs to the representation level of the values. It is possible for the values of certain datatypes to be represented in different forms that are semantically equivalent, e.g., a particular text in different languages, such as in the datatype “langstring” of the IEEE LOM standard. The existence of different representations should affect the measure of completeness of the respective field as being valuable additional information. Following the same logic, other forms of representation that could be set by a metadata standard might be different visual or audio versions of the same piece of textual information, which are simultaneously present in a metadata field. The consideration of the representation level does not allow us to think of completeness of a single value of a simple field as having only the two discrete values of the existence indicator Pk (f ). Completeness of such a field should be computed as the weighted average of its existent forms of representation. Thus, if we consider the different forms of representation of a single value of a simple field, completeness of a simple field is formulated as: COMrepresentations(f ) =

\sum_{k=1}^{m(f)} \Big[ a_k(f) \cdot \sum_{j=1}^{L} b_j \cdot R_j(k, f) \Big]

(4) where L is the number of all the possible different representation forms, bj is the weight of the j-th representation form and Rj (k, f ) is the existence indicator of this specific representation form of the k-th single value of field f. The sum of the weights bj is again set to 1. The minimum number of the possible different representations—for the represented value to be considered complete—and the weight of each is an issue determined by the application. In the case of different languages of the same textual information, a minimum number of languages that would make the value of the textual field to be considered complete might be the number of


languages of a certain geographical region (e.g., the European Union languages) or the number of languages used in a particular educational institution. As for the weights of the languages, apparently mother languages of a particular place or languages most frequently used would be assigned a higher weight. Equation (4) seems to adequately cover the issue of completeness of a simple field from any point of view. However, an essential problem emerges. If a field is a multivalued one, then, applying Equation (4), the measure of completeness will be computed taking into account all the instances of the different representation forms, regardless of the total number of values they express. For example, let us assume a situation where completeness of field “1.5 General.Keyword” of IEEE LOM is to be measured. The maximum number of values m(f) is set to 5 and the maximum number of languages L is set to 3, while we assign equal weights for all the different values (ak = 1/5, k = 1, . . . , 5) and all the different languages (bj = 1/3, j = 1,2,3) to express these values. We will consider two different cases. In case “a” the field is populated with one keyword expressed in three different languages, so the measure of completeness (according to Equation (4)) will be equal to: (1/5) ∗ (1/3) ∗ [(1 + 1 + 1) + (0 + 0 + 0) + (0 + 0 + 0) + (0 + 0 + 0) + (0 + 0 + 0)] = 0.2 In case “b” the field is populated with two keywords each expressed in one language, so completeness of the field would be equal to: (1/5) ∗ (1/3) ∗ [(1 + 0 + 0) + (1 + 0 + 0) + (0 + 0 + 0) + (0 + 0 + 0) + (0 + 0 + 0)] = 0.13 Apparently, in terms of completeness, one keyword outperforms two keywords just because this keyword is expressed in more languages than the two keywords, each of which is expressed in only one language. Consequently, we need to make a distinction between measurement of completeness that takes into account only the number of values of a field (which, obviously, is the most important) and measurement of completeness taking into account the number of the different representations of these values. The former is the one expressed by the first case of the split definition of Equation (3), from now on called COMvalues (f ). The latter is the one expressed by Equation (4). The above example involving keywords expressed in different languages is presented in graphical form in Figure 1. The requirement, in order to maintain the importance of the different values against their different representations, is that more values must always result in a higher completeness score than fewer values, regardless of the number of representation forms the values are expressed in. If we compute completeness as the weighted average of COMvalues (f ) and COMrepresentations (f ) and assign a higher weight to COMvalues (f ), then this requirement is met. Hence, the

measure of completeness of a simple field—to substitute the first case of Equation (3)—becomes:

COM(f) = c \cdot COM_{values}(f) + (1 - c) \cdot COM_{representations}(f) \qquad (5)

The value of weight c expresses the maximum measure of completeness of a complete field (taking into account only its values), regardless of their representations. The value of weight 1 − c expresses the additional amount of completeness, attributed to the representations of the values, so as to reach the maximum value of completeness (the value of 1) of a complete field. In the previous example of keywords expressed in different languages, if we set c = 0.8 (a value higher than 0.5), then for case "a" we have:

COM(f) = 0.8 · COM_values + (1 − 0.8) · COM_representations = 0.8 · 1/5 + 0.2 · 3/15 = 0.2

while for case "b" we have:

COM(f) = 0.8 · COM_values + (1 − 0.8) · COM_representations = 0.8 · 2/5 + 0.2 · 2/15 = 0.35

Weight c acts as expected, forcing case "b" to have a larger measure of completeness than case "a," since it employs two keywords compared to one, regardless of the number of different languages these keywords are expressed in.
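The interplay of Equations (4) and (5) can be sketched as follows for a simple multivalued field whose single values may carry several representation forms (such as langstring translations). The list-of-lists encoding of the field and the parameter values mirror the keyword example above and are otherwise illustrative assumptions.

# Sketch of Equations (4) and (5) for a simple multivalued field. Each value is
# encoded as the list of representation forms (e.g., languages) it is expressed
# in; this encoding and the parameters below are illustrative assumptions.

def com_values(values, m):
    """COM_values(f): share of the m expected value slots filled (equal weights)."""
    return min(len(values), m) / m

def com_representations(values, m, n_forms):
    """Equation (4) with equal weights a_k = 1/m and b_j = 1/n_forms."""
    return sum((1.0 / m) * (1.0 / n_forms) * len(forms) for forms in values[:m])

def com_simple_field(values, m, n_forms, c):
    """Equation (5): weighted blend of value- and representation-level completeness."""
    return c * com_values(values, m) + (1 - c) * com_representations(values, m, n_forms)

# "1.5 General.Keyword" example: m(f) = 5 keywords, L = 3 languages, c = 0.8.
case_a = [["en", "fr", "de"]]          # one keyword expressed in three languages
case_b = [["en"], ["en"]]              # two keywords, each in one language
print(round(com_simple_field(case_a, 5, 3, 0.8), 2))   # 0.8*1/5 + 0.2*3/15 = 0.2
print(round(com_simple_field(case_b, 5, 3, 0.8), 2))   # 0.8*2/5 + 0.2*2/15 = 0.35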

Application to Real-World Metadata Records In order to put the proposed metadata completeness metrics into action and prove its applicability and effectiveness, we apply the metrics on real-world metadata records, the first one following the simple Dublin Core and the second one following the IEEE LOM metadata schema. A sample DC metadata record harvested from the CADAIR open access online research repository of the Aberystwyth University of Wales (http://cadair.aber.ac.uk/dspace/) through the OAI/PMH metadata harvesting protocol (Open Archives Initiative, 2008) is displayed in Figure 2. The metadata schema is the simple Dublin Core (Dublin Core Metadata Element Set). The repository contains metadata records describing resources such as scientific articles and papers, conference proceedings, books, technical reports, etc. In Figure 3, a sample IEEE LOM metadata record harvested from the open access Slovenian Education Network repository (http://sio.edus.si/LreTomcat) through the OAI/PMH metadata harvesting protocol (Open Archives Initiative, 2008) is shown. The record follows the LRE (Learning Resource Exchange) application profile (by the European Schoolnet: http://eun.org) based on the IEEE LOM


FIG. 1. Completeness based on values or representations.

FIG. 2. A sample DC metadata record.

metadata schema. The repository contains metadata records describing learning objects of multimedia content. Before applying the formula of Equation (3) and performing any mathematical calculations, a proper configuration of the metrics system has to be performed by setting values to the parameters that customize the measurement according to the needs and requirements of its users. Such a process is required to be accomplished separately for the DC and the IEEE LOM metadata records. The parameters are set for two user profiles: an administrator of the repository hosting the metadata and an instructor planning to use the resources associated with the repository's metadata for educational purposes in order to prepare a lesson. The configuration of the metrics system is most important, since the measures of completeness are affected by the parameter values (which are set by the metadata user) in a decisive way. The relative weights of importance and the minimum number of values a multivalued field should have in order for a user to consider it complete influence the amount of information of the metadata record needed by this user to have an ideal (in terms of completeness)

representation of the associated resource. The completeness measure of the record is computed against this amount of information, which expresses full completeness. Configuration of the DC Metadata Record All 15 fields of the simple Dublin Core have been defined as multivalued. Furthermore, the simple DC schema is flat— its fields have no internal structure—thus, the only parameters to set are the minimum number of values for each field in order to consider the field complete and any relative weights of importance to reflect differentiated degrees of significance of each field to different users. The weights of importance for each one of the 15 fields of the simple DC for the administrator profile and the instructor profile are set as follows. For the administrator profile: Fields identifying the resource, like “Identifier” and “Title” are considered of the highest importance. In the second order of importance, we classify fields as “Type” and “Format” describing the


FIG. 3. A sample IEEE LOM metadata record.

nature and format of the resource (necessary to the administrator for its technical manipulation), as well as fields referring to its main contributors like “Creator” and “Publisher.” Next, we place fields referring to the temporal dimension of a resource’s lifecycle and any associated rights of use, like “Date” and “Rights.” Next in order of importance come fields describing the content of the resource, like “Language,” “Subject,” and “Description.” Last, fields of restricted interest to a repository administrator, like “Coverage,” “Source,” “Contributor,” and “Relation” take their place in the order of importance.

For the instructor profile: The most important fields for an instructor are considered fields referring to the content of the resource, like “Title,” “Subject,” and “Description.” In the second order of importance come fields referring to the resource’s accessibility, like “Rights,” “Language,” and “Identifier” (which most of the time contains a URL address enabling access to the resource), as well as the field “Creator.” In descending order of importance we place the fields “Type” and “Format,” “Date” and “Coverage,” and “Source,” “Publisher,” “Contributor,” and “Relation.” Following the above order of importance, we assign numerical values to


the weights, always making sure that their sum is 1. The exact weights of importance are shown in Table 1.

TABLE 1. Weights of relative importance of DC fields according to user profile.

DC field       Weight (admin profile)   Weight (instructor profile)
Identifier     0.200                    0.067
Title          0.200                    0.166
Creator        0.100                    0.067
Format         0.100                    0.050
Publisher      0.100                    0.017
Date           0.051                    0.033
Language       0.022                    0.067
Rights         0.051                    0.067
Type           0.100                    0.050
Contributor    0.008                    0.017
Description    0.022                    0.166
Coverage       0.008                    0.033
Relation       0.008                    0.017
Source         0.008                    0.017
Subject        0.022                    0.166

Last, we need to set the minimum number of occurrences (the m(f) parameters) of a field in order for it to be considered complete. As mentioned, the m(f) of a multivalued field is determined either by reality itself, or by the needs and requirements of the application that uses the metadata. In this regard, the number of creators of a resource, or the number of languages contained in a multilanguage document, is an objective, real fact, while the number of dates ("DC.Date") of a resource is subjective information depending on the number of the events of the resource's life that the application considers sufficiently important for their associated points or periods of time to be worth registering in the metadata record describing the resource. In the general case, given the fact that measurement of completeness is a process that can take place at any time (long after the indexing of the resource) and cannot necessarily involve experienced indexers or people who are aware of the nature and history of the resources (for example, when measuring completeness of metadata harvested in massive quantities from third-party repositories), information about the real life of the resources may not be easily available. In the absence of knowledge of the true values, there is no other way but to make an estimation for the m(f) parameters that is most likely to approximate their real values. This estimation may take into account any common characteristics of the resources and the requirements that we set as users of the metadata application. Hence, for the majority of the fields we set m(f) = 1, as we require only one occurrence of the field in order to consider it complete. For the rest of the fields: We set m(f) = 3 for the field "Creator," since for the kind of resources from which the particular resource was drawn, three creators is an average estimate. We set m(f) = 2 for the field "Identifier," since we expect two identifiers, a global one (e.g., URI, ISBN, etc.) and a local one to identify the resource in the repository of origin.

TABLE 2. Minimum number of occurrences of a DC field in order to be considered complete.

DC field       Minimum no. of values
Identifier     2
Title          1
Creator        3
Format         1
Publisher      1
Date           2
Language       1
Rights         1
Type           1
Contributor    1
Description    1
Coverage       1
Relation       1
Source         1
Subject        5

We set m(f) = 2

for the field “Date,” since we expect two dates associated with authorship and publication of the resource. Last, we set m(f) = 5 for the field “Subject,” since in this field one can register keywords describing the resource, thus we expect five keywords (five occurrences of the field with associated values) in order to consider it complete. The minimum number of occurrences of a field in order to be considered complete is shown in Table 2. Configuration of the IEEE LOM Metadata Record Thirty out of the 77 fields of IEEE LOM metadata schema have been defined as multivalued. Moreover, the metadata schema is hierarchical, i.e., many of its fields have internal structure (aggregate fields), comprising subfields, down to the depth of four levels, while the values of some of the simple fields can be expressed according to the langstring datatype, which offers the capability of having various translations of the same alphanumeric value in different languages. The parameters we need to set in order to configure the system, before performing any measurement, are the minimum number of values for each multivalued field in order to consider the field complete and the relative weights of importance for each field. We consider a langstring datatype field to be complete having only one langstring value. Thus, the minimum number of languages for langstring fields is set to 1 and the value of the c parameter for the relative contribution of the different values of a langstring datatype field to the measure of its completeness (in relation to its different translations) is also set to 1. The weights of importance for each field were set for the two user profiles, the administrator profile and the instructor profile. The weights for each field were set according to the relative importance of the field to its parent field. For example, in the first level (the nine categories), fields (categories) a repository administrator is, mainly, interested in are “1 General” and “4 Technical,” since these categories contain fields


for recognizing and identifying the described learning object, as well as fields for its technical manipulation. Next come the categories "2 Lifecycle" and "3 Meta-metadata," which offer the administrator information about the version of the learning object, its contributors, and the metadata record itself that describes the object. In the third order of importance we place "9 Classification," in the fourth "6 Rights," in the fifth "7 Relation," and in the last, sixth order we place the categories "5 Educational" and "8 Annotation," which are of the least importance to an administrator. Thus, we assign numerical values to the weights of fields of the same level, always making sure that their sum equals 1. In much the same way, we set weights for all the remaining fields down the hierarchy, as well as for all the fields for the instructor profile. All the weights set are shown in Table 3.

TABLE 3. Weights of relative importance of IEEE LOM fields for admin and instructor profiles.

Field      Weight (admin)   Weight (instructor)
1          0.222            0.278
1.1        0.375            0.025
1.1.1      0.400            0.400
1.1.2      0.600            0.600
1.2        0.188            0.250
1.3        0.125            0.125
1.4        0.075            0.188
1.5        0.100            0.188
1.6        0.063            0.088
1.7        0.038            0.050
1.8        0.038            0.088
2          0.133            0.067
2.1        0.300            0.300
2.2        0.100            0.250
2.3        0.600            0.450
2.3.1      0.300            0.300
2.3.2      0.400            0.400
2.3.3      0.300            0.300
3          0.133            0.022
3.1        0.400            0.400
3.1.1      0.400            0.400
3.1.2      0.600            0.600
3.2        0.300            0.300
3.2.1      0.300            0.300
3.2.2      0.400            0.400
3.2.3      0.300            0.300
3.3        0.100            0.100
3.4        0.200            0.200
4          0.222            0.078
4.1        0.143            0.286
4.2        0.429            0.143
4.3        0.286            0.343
4.4        0.036            0.114
4.4.1      1.000            1.000
4.4.1.1    0.250            0.250
4.4.1.2    0.250            0.250
4.4.1.3    0.250            0.350
4.4.1.4    0.250            0.150
4.5        0.036            0.043
4.6        0.036            0.043
4.7        0.036            0.029
5          0.022            0.200
5.1        0.091            0.052
5.2        0.091            0.273
5.3        0.091            0.052
5.4        0.091            0.052
5.5        0.091            0.091
5.6        0.091            0.182
5.7        0.091            0.091
5.8        0.091            0.052
5.9        0.091            0.052
5.10       0.091            0.052
5.11       0.091            0.052
6          0.078            0.078
6.1        0.417            0.417
6.2        0.417            0.417
6.3        0.167            0.167
7          0.056            0.056
7.1        0.400            0.400
7.2        0.600            0.600
7.2.1      0.800            0.800
7.2.1.1    0.400            0.400
7.2.1.2    0.600            0.600
7.2.2      0.200            0.200
8          0.022            0.111
8.1        0.333            0.333
8.2        0.333            0.333
8.3        0.333            0.333
9          0.111            0.111
9.1        0.375            0.375
9.2        0.500            0.500
9.2.1      0.250            0.250
9.2.2      0.750            0.750
9.2.2.1    0.500            0.500
9.2.2.2    0.500            0.500
9.3        0.063            0.063
9.4        0.063            0.063

As for the minimum number of values that each multivalued field should have in order to be considered complete, what mainly influences this parameter is the requirement of the application and the reality of the described resources, which are learning objects of multimedia content. For the majority of the fields, m(f) is set to 1. A small number of fields require the value of m(f) to be greater than 1. For example, for the field "1.5 General.Keyword" we require three keywords (m(f) = 3) in order to consider the field complete. Likewise, for the field "2.3 Lifecycle.Contribute," we require two instances of this field (m(f) = 2). This is due to the fact that each instance of this field acquires its structured meaning from its subfield "2.3.1 Lifecycle.Contribute.Role" and that the number of the most significant roles for this kind of resource is two—"author" and "publisher." All m(f) values greater than 1 are shown in Table 4.

TABLE 4. Minimum number of occurrences of an IEEE LOM multi-valued field in order to be considered complete.

Field      Minimum no. of values
1.5        3
2.3        2
4.1        2
7          2
8          2
9          3
9.2        3
9.2.2      2
9.4        3

At the bottom line of the configuration process of the completeness metrics system, what needs to be stressed is that it adjusts the measurement according to the requirements of the

end users and makes the metrics highly customizable on the basis of the exact representation of a resource (in terms of its metadata completeness) that an end user will consider ideal, so as to assign the top score of 1 to a full and complete metadata record. The above examples offered practical directions and guidelines on how to resolve the issues of setting values to the parameters that configure the system. In this process, a possible source of concern is the m(f) parameter, representing the minimum number of values a multivalued field should have in order to be considered complete. In the examples presented, the m(f) parameters whose values are determined by reality (and not by the subjective assessment of an end user) are supposed to be unknown; hence, simple hints to estimate and approximate the real values were provided to overcome this uncomfortable situation. This was done in order to present the whole process without having to access the primary sources of information for each resource (as would be the case if a considerable number of metadata records to be tested for completeness were harvested from third-party repositories). However, it is


FIG. 4. Completeness measurement results of the DC record of Figure 2.

obvious that using the proposed metrics system to measure completeness—after being configured according to reality and the requirements of its users—is one issue, while the estimation of an attribute of a resource (by trying to "guess" its true value)—when information about the resource's real life is inaccessible—is another, completely different issue; thus, the two issues should not be confused.

Measurement Results

After having configured the metadata completeness metrics system based on user profiles, metadata schemata, and types of the described resources, we conducted the measurements by performing the relevant mathematical calculations on the two XML-encoded metadata records displayed in Figures 2 and 3. A software tool, developed for this task in order to automate the process, was utilized. The application of Equation (3) produced the following results.

Completeness Measurement of the DC Record

We measured completeness for each one of the 15 fields of the DC standard, as well as for the whole record itself (displayed in Figure 2). Three different completeness measures were calculated, according to three independent configurations of the metrics system corresponding to different user profiles. The first configuration was based on the administrator profile, the second one was based on the instructor profile

(parameter values in Table 1), while the third one followed the traditional approach, which considers all fields to be of equal importance and ignores the possibility of fields with multiple cardinality (as is the case with the DC standard). In the third profile, called "flat profile," all 15 fields of the simple DC are equally weighted and the m(f) parameter for all fields is 1. Figure 4 displays the completeness measurement results of the DC record of Figure 2. The completeness measures of the DC fields shown are those calculated according to the administrator and instructor profiles (taking into account the parameters of Table 1), while in the flat profile completeness of each field is valued as either 0 or 1, since multivalued fields are not considered. Comparing the completeness measures of the record for the two profiles, we conclude that the record is "more complete" for an instructor than for an administrator, since the amount of information it contains better suits this particular user's needs and requirements. Applying the completeness metrics on a slightly differentiated version of the DC record of Figure 2—after having reduced it by using only one "Identifier" instead of two—we get decreased completeness for the administrator and instructor profiles, while for the "traditional" flat profile fewer identifiers have no effect on the completeness of the record. Figure 5 displays completeness for the "degraded" DC record with only one "Identifier." For the administrator profile, completeness of the record falls from 71% to 61%, while for the


FIG. 5. Completeness measurement results of the "degraded" DC record of Figure 2, with one "Identifier."

instructor profile it is reduced from 43% to 39%. For the flat profile, completeness remains unchanged (47%). Thus, the proposed "fine-grained" metrics system proves its capability to highlight and measure even slight differences in the informational content (measured in terms of field instances) of a metadata record, which are reflected in its completeness measure.

Completeness Measurement of the IEEE LOM Record

We measured completeness for each one of the 77 fields (simple or aggregate) of the IEEE LOM standard by applying the recursive formula of Equation (3) to the XML-encoded IEEE LOM metadata record of Figure 3. Two different measurements were implemented, corresponding to two different configurations of the metrics system: one for the administrator profile and one for the instructor profile. The absence of a "flat" profile is due to the fact that no previous work has fully designed and implemented an attempt to measure completeness of a record following a hierarchical schema. Table 5 is populated with completeness measures for each one of the IEEE LOM fields, which were calculated according to the values of the parameters defined for each one of the two profiles. The measure of completeness of all missing fields is 0. Completeness of the whole record is displayed in the last row of Table 5. Table 5 quantitatively expresses the structure of the metadata record based on user requirements. What is really achieved by these measurements is the ability to measure

completeness at any node of the metadata hierarchy. Measuring only the number of populated fields is not enough to obtain a clear understanding of the amount of information a hierarchical record holds. This is because, by only measuring the number of occurrences of a field belonging to a parent aggregate multivalued field, the hierarchy of the schema is ignored and it is not possible to distinguish whether this particular field occurs more than once in the same instance of its parent field, or whether its occurrences belong to different instances of its parent.

Discussion

The way the proposed metrics system for measuring metadata completeness was designed offers the ability to differentiate each measure of completeness according to any slight variation in the amount of information loaded into the metadata fields acting as placeholders of values. This is a result of admeasuring the effect of any multiple values a multivalued field might have on the total measure of completeness. One less "Identifier" in the DC record of Figure 2 results in reduced completeness for the administrator and instructor profiles, while for the traditional "flat" profile one less value has no effect on the completeness measure of the record. Another significant advantage of the proposed metrics system is its ability to spot and quantify problems of completeness at any node of a hierarchy of metadata fields. The system follows the hierarchy of the schema and computes existence indicators of field values, weighing these indicators


according to the importance of each field in relation to its parent node.

TABLE 5. Completeness measures of fields of the IEEE LOM record displayed in Figure 3.

Field      Completeness (admin profile)   Completeness (instructor profile)
1          57%                            78%
1.1        100%                           100%
1.1.1      100%                           100%
1.1.2      100%                           100%
1.2        100%                           100%
1.3        100%                           100%
1.5        89%                            89%
2          8%                             9%
2.3        30%                            30%
2.3.3      100%                           100%
3          90%                            90%
3.1        100%                           100%
3.1.1      100%                           100%
3.1.2      100%                           100%
3.2        100%                           100%
3.2.1      100%                           100%
3.2.2      100%                           100%
3.2.3      100%                           100%
3.4        100%                           100%
4.1        33%                            33%
4.1        100%                           100%
4.3        100%                           100%
5          27%                            9%
5.2        100%                           100%
6          42%                            42%
6.1        100%                           100%
RECORD     30%                            41%

The demonstrated measurement of completeness of the IEEE LOM record of Figure 3 proves this capability. Moreover, the proposed metrics system adds a new dimension to the measurement of completeness by taking into account the representation level of a single value of a simple field. This level constitutes information which, in many cases, might be of significant importance for the metadata application and influences completeness of the field according to the assigned application weight of importance (the value of 1 − c of Equation (5)). This dimension was not taken into account in the measurement of the IEEE LOM record of Figure 3. However, this capability of the metrics system can provide valuable information to the metadata application users. For example, in multilingually oriented applications, where different translations of text are of great importance, such "nuances" of information might represent a key factor affecting the measure of metadata completeness. Undoubtedly, a point of discussion is the configuration of the metrics system, i.e., the selection of the parameter values, before conducting the actual measurement and doing the math calculations. Depending on the metadata schema, which defines the metadata elements and their interrelations within the schema structure, a significant number of parameters is required to be set manually. Although the need to set the parameters may appear as an additional burden to metadata users, the truth is that such tasks have already been

Undoubtedly, a point of discussion is the configuration of the metrics system, that is, the selection of the parameter values before the actual measurement and calculations are carried out. Depending on the metadata schema, which defines the metadata elements and their interrelations within the schema structure, a significant number of parameters must be set manually. Although the need to set these parameters may appear to be an additional burden on metadata users, such tasks have in fact already been undertaken by application profile creators, contributing in this way to the effort of balancing the conflict between completeness and simplicity noted by Stvilia et al. (2004). The purpose of an application profile is to adapt or combine existing schemas into a package tailored to the functional requirements of a particular application, while retaining interoperability with the original base schemas (Duval, Hodgins, Sutton, & Weibel, 2002). Hillmann and Phipps (2007) characterize an application profile as a "template for expectation," arguing that it holds high potential for quantifying metadata quality in terms of completeness: the expectation expressed in the application profile determines how complete an individual metadata record ought to be. The configuration parameters proposed here constitute one aspect of this "template for expectation" and can be included in the provisions of an application profile. Since the purpose of an application profile is served by specific mechanisms, one of which is cardinality enforcement (Duval et al., 2002), the configuration parameters of the proposed completeness metrics can be defined by enriching the outputs of this mechanism. Just as an application profile specifies whether a metadata element is mandatory or optional, based on the importance of the element for the particular community of use, it can also specify the relative importance of each element by assigning an arithmetic value as the element's weight. Following the same logic, an application profile can specify the exact multiplicity an element requires in order to be considered complete.

The configuration process can be implemented within the context of creating an application profile and can benefit from any relevant effort to facilitate or automate this broader task. For example, provided that a suitable software interface is available, the weights of importance of metadata elements can be assigned automatically from the relative frequency with which users employ each element when issuing queries to a metadata repository (Ochoa & Duval, 2009).

Beyond the benefits and advantages described above, measuring completeness of metadata with the proposed fine-grained metrics system can offer valuable help in fulfilling the specific requirements set by the context of use. The particular process or activity handling the metadata determines the pragmatics of measuring and defines its exact purpose. For example, measuring completeness can serve as an important tool for evaluating automatic metadata generation methods and techniques. Another potential application is the use of targeted measures of completeness as additional criteria for filtering the results produced by a search engine. When searching for learning objects, for instance, a teacher preparing a lesson might wish to ensure that the metadata of the returned results contain a certain amount of educational information. The teacher could therefore set a threshold value for the completeness of specific elements (the educational fields) of the returned learning objects and filter the results according to this criterion, as in the sketch below.
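A minimal sketch of such a filter follows, reusing the illustrative Field and completeness() definitions from the earlier sketch; the choice of IEEE LOM category 5 ("Educational") as the targeted subtree and the 0.7 threshold are assumptions made for this example, not values prescribed by the metrics system.

```python
# Hypothetical post-search filter: keep only resources whose "Educational"
# subtree (IEEE LOM category 5) reaches a minimum targeted completeness.
# Reuses the illustrative Field / completeness() sketch given above.

EDUCATIONAL_THRESHOLD = 0.7  # arbitrary example value

def filter_by_educational_completeness(results, threshold=EDUCATIONAL_THRESHOLD):
    """results: iterable of (resource, educational_subtree) pairs, where
    educational_subtree is the Field node for LOM category 5 of that
    resource's metadata record."""
    return [resource
            for resource, educational_subtree in results
            if completeness(educational_subtree) >= threshold]
```

In practice the threshold, the targeted subtree, and the underlying weights would all be drawn from the application profile discussed above.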


Conclusion – Future Work

In this paper a metrics system for measuring the completeness of metadata was presented. In an effort to address the inadequacies of the traditional approach, which measures the completeness of a metadata record by simply counting the presence or absence of values in fields, the proposed system defines completeness at the field level in a recursive way, following the hierarchy of the metadata schema. Multivalued and aggregate fields were taken into consideration, as well as the representation level of semantically equivalent information in the metadata fields. The result is a set of metrics that reflects the needs and requirements of the application level (determined by weighting factors specified by the particular process or activity of the metadata application) and can easily be implemented by automated means. The proposed metrics system was put into action and used to measure the completeness of real-world metadata records, and the results show that it fulfills its requirements.

The next step following this work is an experimental implementation of the metrics system and its application to a working database of metadata in a digital repository. The resulting targeted measures of completeness for particular fields (especially in a hierarchical metadata schema) are expected to provide valuable information and to support conclusions about completeness, and metadata quality in general, that would otherwise be impossible to reach.

References

Arazy, O., & Kopak, R. (2011). On the measurability of information quality. Journal of the American Society for Information Science and Technology, 62(1), 89–99.
Barton, J., Currier, S., & Hey, J.M.N. (2003). Building quality assurance into metadata creation: An analysis based on the learning objects and e-prints communities of practice. In Proceedings of the International Conference on Dublin Core and Metadata Applications: Supporting Communities of Discourse and Practice (pp. 39–48). Singapore: Dublin Core Metadata Initiative.
Bruce, T.R., & Hillmann, D.I. (2004). The continuum of metadata quality: Defining, expressing, exploiting. In D.I. Hillmann & E. Westbrooks (Eds.), Metadata in practice (pp. 238–256). Chicago: ALA Editions.
Bui, Y., & Park, J. (2006). An assessment of metadata quality: A case study of the National Science Digital Library metadata repository. In Haidar Moukdad (Ed.), CAIS/ACSI 2006: Information science revisited: Approaches to innovation.
Dublin Core Metadata Element Set. Retrieved from http://dublincore.org/documents/dces/
Dushay, N., & Hillmann, D. (2003). Analyzing metadata for effective use and re-use. In Proceedings of the International Conference on Dublin Core and Metadata Applications: Supporting Communities of Discourse and Practice (pp. 1–10). Singapore: Dublin Core Metadata Initiative.
Duval, E., Hodgins, W., Sutton, S., & Weibel, S. (2002). Metadata principles and practicalities. D-Lib Magazine, 8(4). Retrieved from http://www.dlib.org/dlib/april02/weibel/04weibel.html
Friesen, N. (2004). International LOM survey: Report (draft). Retrieved from http://arizona.openrepository.com/arizona/bitstream/10150/106473/1/LOM_Survey_Report2.doc
Greenberg, J. (2003). Metadata and the World Wide Web. In M. Dekker (Ed.), Encyclopaedia of library and information science (pp. 1876–1888). New York: Marcel Dekker.

Greenberg, J., Spurgin, K., & Crystal, A. (2005). AMeGA (Automatic Metadata Generation Applications) project: Final report. University of North Carolina & Library of Congress. Retrieved from http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf
Guinchard, C. (2002). Dublin Core use in libraries: A survey. OCLC Systems & Services, 18(1), 40–50.
Hillmann, D.I., & Phipps, J. (2007). Application profiles: Exposing and enforcing metadata quality. In Proceedings of the International Conference on Dublin Core and Metadata Applications: Application Profiles: Theory and Practice (pp. 52–62). Singapore: Dublin Core Metadata Initiative.
Hughes, B. (2004). Metadata quality evaluation: Experience from the Open Language Archives Community. In Z. Chen et al. (Eds.), International Conference on Asian Digital Libraries (ICADL 2004), Lecture Notes in Computer Science, 3334, 320–329.
Learning Object Metadata. Retrieved from http://ltsc.ieee.org/wg12/files/LOM_1484_12_1_v1_Final_Draft.pdf
Liddy, E., Allen, E., Harwell, S., Corieri, S., Yilmazel, O., Ozgencil, N., . . . (2002). Automatic metadata generation & evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 401–402). New York: ACM Press.
Margaritopoulos, M., Margaritopoulos, T., Kotini, I., & Manitsaris, A. (2008). Automatic metadata generation by utilising pre-existing metadata of related resources. International Journal of Metadata, Semantics and Ontologies, 3(4), 292–304.
Margaritopoulos, T., Margaritopoulos, M., Mavridis, I., & Manitsaris, A. (2008). A conceptual framework for metadata quality assessment. In International Conference on Dublin Core and Metadata Applications: Metadata for Semantic and Social Applications (pp. 104–116). Singapore: Dublin Core Metadata Initiative.
Moen, W., Stewart, E., & McClure, C. (1998). Assessing metadata quality: Findings and methodological considerations from an evaluation of the US Government Information Locator Service (GILS). In Proceedings of the Advances in Digital Libraries Conference (pp. 246–255). Los Alamitos, CA: IEEE Computer Society.
Moreira, B., Gonçalves, A., Laender, A., & Fox, E. (2009). Automatic evaluation of digital libraries with 5SQual. Journal of Informetrics, 3(2), 102–123.
Najjar, J., Ternier, S., & Duval, E. (2003, November). The actual use of metadata in ARIADNE: An empirical analysis. Paper presented at the ARIADNE Conference, Leuven, Belgium.
Najjar, J., Ternier, S., & Duval, E. (2004, June). User behavior in learning object repositories: An empirical analysis. Paper presented at the ED-MEDIA 2004 World Conference on Educational Multimedia, Hypermedia and Telecommunications, Lugano, Switzerland.
Ochoa, X., Cardinaels, K., Meire, M., & Duval, E. (2005, June). Frameworks for the automatic indexation of learning management systems content into learning object repositories. Paper presented at the ED-MEDIA 2005 World Conference on Educational Multimedia, Hypermedia and Telecommunications, Montreal, Canada.
Ochoa, X., & Duval, E. (2009). Automatic evaluation of metadata quality in digital libraries. International Journal on Digital Libraries, 10(2–3), 67–91.
Open Archives Initiative (2008). OAI Protocol for Metadata Harvesting (OAI-PMH) v2.0. Retrieved from http://www.openarchives.org/OAI/openarchivesprotocol.html
Sicilia, M.A., García, E., Pagés, C., Martínez, J.J., & Gutiérrez, J. (2005). Complete metadata records in learning object repositories: Some evidence and requirements. International Journal of Learning Technology, 1(4), 411–424.
Stvilia, B., Gasser, L., Twidale, M., Shreeves, S., & Cole, T. (2004). Metadata quality for federated collections. In Proceedings of the Ninth International Conference on Information Quality (pp. 111–125). Boston: MIT Press.
Stvilia, B., Gasser, L., Twidale, M., & Smith, L. (2007). A framework for information quality assessment. Journal of the American Society for Information Science and Technology, 58(12), 1720–1733.
