
Quantifying and Measuring Metadata Completeness

Merkourios Margaritopoulos, Thomas Margaritopoulos, Ioannis Mavridis, and Athanasios Manitsaris
Department of Applied Informatics, University of Macedonia, 54006, Thessaloniki, Greece. E-mail: {mermar, margatom, mavridis, manits}@uom.gr

Completeness of metadata is one of the most essential characteristics of their quality. An incomplete metadata record is a record of degraded quality. Existing approaches to measure metadata completeness limit their scope in counting the existence of values in fields, regardless of the metadata hierarchy as defined in international standards. Such a traditional approach overlooks several issues that need to be taken into account. This paper presents a fine-grained metrics system for measuring metadata completeness, based on field completeness. A metadata field is considered to be a container of multiple pieces of information. In this regard, the proposed system is capable of following the hierarchy of metadata as it is set by the metadata schema and admeasuring the effect of multiple values of multivalued fields. An application of the proposed metrics system, after being configured according to specific user requirements, to measure completeness of a real-world set of metadata is demonstrated. The results prove its ability to assess the sufficiency of metadata to describe a resource and provide targeted measures of completeness throughout the metadata hierarchy.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE AND TECHNOLOGY, 63(4):724-737, 2012. Received April 4, 2011; revised October 3, 2011; accepted October 17, 2011. © 2011 ASIS&T. Published online 7 December 2011 in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.21706

Introduction

Metadata can be simply defined as "data about data" or "information about information." The purpose of metadata is to provide adequate, correct, and relevant information to users so as to obtain a true picture of the resource they describe without having to access it. Completeness of metadata refers to their sufficiency to fully describe a resource covering all its possible aspects. Given the more refined definition for metadata by Greenberg (2003), as structured data about an object that supports functions associated with the designated object, it becomes clear that sufficiency to fully describe a resource is directly associated with the particular activities or processes that metadata are intended to support. Furthermore, Sicilia, García, Pagés, Martínez, and Gutiérrez (2005) argue that each kind of activity or process demands a number of concrete metadata elements. Thus,

completeness of metadata depends on the specific functionality or usage expected by the particular metadata application. Quoting from Ochoa and Duval (2009, p. 71), “Completeness is the degree to which the metadata instance contains all the information needed to have a comprehensive representation of the described resource.” A comprehensive representation varies according to the application and the community of use. Numerous metadata standards have been established in an attempt to define sufficient descriptions of a resource from different perspectives and satisfy diverse functionalities. Theoretically, a sufficient description exists when all metadata elements of a standard are populated with values. However, in practice this is not what happens in the real world. Relevant surveys by Guinchard (2002), Najjar, Ternier, and Duval (2003) and Friesen (2004) have shown that indexers tend to fill out only particular metadata elements that could be considered “popular,” while they ignore other less popular elements. The creation of metadata is a task requiring major labor and financial cost and, most important, the involvement of knowledgeable and experienced people (Liddy et al., 2002; Barton, Currier, & Hey, 2003). Since all these requirements are, generally, difficult to be fully met, it is rather common, in the majority of digital repositories, to have incomplete metadata. The issue of incomplete metadata records is rather problematic, especially, in collections resulting from harvesting from metadata databases (Dushay & Hillmann, 2003) or from automatically generated metadata (Greenberg, Spurgin, & Crystal, 2005; Ochoa, Cardinaels, Meire, & Duval, 2005; Margaritopoulos M., Margaritopoulos, T., Kotini, & Manitsaris, 2008). On the other hand, when searching for information, users of metadata limit their search criteria by using only a very small percentage of metadata elements, as highlighted by Najjar et al. (2004). This fact, as Stvilia, Gasser, Twidale, Shreeves, and Cole (2004) note, shows that completeness is often in conflict with simplicity. However, this conflict between completeness and simplicity of a metadata standard is attempted to be balanced by application profiles (Hillmann & Phipps, 2007). Application profiles, among other things, select certain metadata fields of one or more standards and


consider them as mandatory or as optional, based on the needs of the particular users’ community. The concept of completeness of metadata, as an object of study, is integrated in the more general concept of metadata quality. The research community considers completeness of metadata a fundamental quality characteristic. Several dimensions or characteristics of information quality and metadata quality have been proposed by researchers (Bruce & Hillmann, 2004; Stvilia, Gasser, Twidale, & Smith, 2007; Margaritopoulos T., Margaritopoulos, M., Mavridis, & Manitsaris, 2008; Arazy & Kopak, 2011) in an effort to define quality and provide the necessary means for its assessment and improvement. Among the numerous suggested quality dimensions, completeness is one of the most essential. Indeed, it can be considered a prerequisite to assess quality, since incomplete records are in any case not of quality due to lack of essential information (Sicilia et al., 2005). Among all the constituent elements of metadata quality, completeness is the easiest to be quantified and measured by automatic means without human intervention. This article is organized as follows. The next section gives a brief account of related work on research efforts on metadata completeness measurement, followed by the section presenting the motivation that led to the design of a new fine-grained metrics system for metadata completeness, which is introduced afterwards. The paper continues presenting an implementation of the proposed metrics on two real-world metadata records. First, the system is configured according to user requirements and then the measurement results themselves are presented, after applying the respective mathematical formulas. A discussion of the proposed metrics follows and, finally, the last section concludes the article.

Related Work Several researchers have created metrics to measure metadata quality by computing indicators of quality and, among them, the completeness indicator of a metadata record. Completeness is measured based on the presence or absence of values in metadata fields1 defined in different metadata standards. In some research efforts, the fields to be considered when completeness is measured are selected by the application that uses the metadata based on their importance for the specific process or activity handling the metadata. In a study to assess the quality of metadata records in the National Science Digital Library (NSDL) repository, Bui and Park (2006) harvested over 1 million Dublin Core metadata records and assessed metadata quality in terms of metadata uses in frequency, consistency, completeness, and accuracy. Completeness is given a percentage measure based on the presence or absence of values in the 15 elements of the simple Dublin Core standard (Dublin Core Metadata Element Set).

It is assumed that an element is marked as present if one instance of the element is populated with a value. In Hughes (2004) an algorithm to assess metadata quality of the Open Language Archive Community (OLAC) metadata repository is presented. The algorithm computes an overall quality score of a metadata record based on several metadata characteristics, among them the element absence penalty. Certain elements are considered necessary; thus, their absence results in reduced quality, introducing in this way the idea of the weighted importance of the elements in terms of their contribution to metadata quality. Moen, Stewart, and McClure (1998) describe techniques and procedures used in an exploratory, systematic assessment of one particular set of metadata records used in implementations of the U.S. Government Information Locator Service (GILS). Completeness is expressed only by counting the number of elements in a metadata record and no measure is given in relation to a “complete” record. “Mandatory” and “optional” elements are measured separately, thus underlining the importance of the context of use when assessing completeness. Furthermore, repeatable elements (multiple instances of the same element) are taken into account. Moreira, Gonçalves, Laender, and Fox (2009) present a tool to perform automatic evaluation of a digital library. For the evaluation of metadata, completeness is defined according to the number of fields of the Dublin Core standard found in a record. However, the examined records are checked only for the existence of the 15 fields of the simple DC standard. A complete record with 15 fields present is assigned a completeness measure of 1. Multiple instances of the same field are not considered. Metrics to measure metadata quality in digital repositories are introduced in Ochoa and Duval (2009). More specifically, completeness is defined as the number of fields that contain a non-null value divided by the total number of fields of the record according to the metadata standard, as in: com =

\sum_{i=1}^{N} P(i) \Big/ N \qquad (1)

where N is the number of fields and P(i) is 1 if the i-th field has a non-null value, 0 otherwise. In cases where the fields of a record are not of equal importance, any particular application might assign weights of importance to each field. Therefore, the measure of “weighted completeness” is computed as: com =

\sum_{i=1}^{N} \big( a_i \cdot P(i) \big) \Big/ \sum_{i=1}^{N} a_i \qquad (2)

where a_i is the weight of field i. Thus, a field is considered either complete or noncomplete, and can only be assigned one of two discrete values.
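To make these baseline metrics concrete, the following is a minimal Python sketch of Equations (1) and (2), assuming a record is represented as a dict mapping field names to values (None meaning absent). The record layout and the example fields and weights are illustrative assumptions, not taken from the cited studies.

# Minimal sketch of Equations (1) and (2): completeness as the (weighted)
# fraction of fields holding a non-null value. The dict-based record layout
# and the example fields/weights are illustrative assumptions.

def completeness(record, fields, weights=None):
    """Weighted completeness of `record` over `fields` (Equation (2));
    with equal weights it reduces to Equation (1)."""
    if weights is None:
        weights = {f: 1.0 for f in fields}   # Equation (1): equal weights
    present = {f: 1.0 if record.get(f) is not None else 0.0 for f in fields}
    return sum(weights[f] * present[f] for f in fields) / sum(weights[f] for f in fields)

record = {"Title": "A paper", "Creator": "A. Author", "Subject": None}
fields = ["Title", "Creator", "Subject", "Date"]   # illustrative subset of DC
print(completeness(record, fields))                # 2 filled / 4 fields = 0.5
print(completeness(record, fields,
                   weights={"Title": 0.4, "Creator": 0.3,
                            "Subject": 0.2, "Date": 0.1}))   # 0.4 + 0.3 = 0.7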

1 The terms "metadata field" and "metadata element" are used interchangeably throughout this article.

Motivation

All of the above-referenced research efforts define completeness at the record level by measuring the number of


elements or providing formulas to compute completeness of a record based on the existence or absence of values in the metadata fields of a standard. Some of them take into account the relative importance of each field, either by setting weights of importance or by only considering certain fields and ignoring others. Although such metrics seem to provide meaningful measures of completeness of a metadata record, several issues are overlooked, raising obvious concerns. A source of concern is the case of multivalued fields (fields with cardinality greater than 1). In Ochoa and Duval (2009), a multivalued field is considered to be complete if at least one instance of the field exists. However, this proposition overlooks the fact that a multivalued field can be loaded with a variable amount of information. As a consequence, such variations are not reflected in the measure of completeness of the field's record, yielding degraded results. The need to consider the multiple instances of multivalued fields is recognized in Bui and Park (2006), where the instances of the Subject field are recorded separately, since the authors recognize the need to analyze the richness of the various subject areas in a repository. Another major source of concern is the case of aggregate fields (or container elements). For example, metadata standards like IEEE LOM (Learning Object Metadata) and LoC METS (Metadata Encoding & Transmission Standard; http://www.loc.gov/standards/mets) define a hierarchy of data elements, including aggregate data elements and simple data elements (leaf nodes of the hierarchy). Only leaf nodes have individual values defined through their associated value space and datatype. If completeness is measured taking into account only the simple fields that have individual values, then the whole structure of the metadata schema is ignored. There is no meaning in assigning a 0 or 1 as an existence indicator of value when measuring completeness in the 58 simple fields of the IEEE LOM metadata schema, regardless of the parent node they belong to and ignoring the hierarchy and interdependence of the fields. For example, how can we count multiple instances of the simple field "2.3.1 Entity" belonging to multiple instances of the parent aggregate field "2.3 Contribute"? Moreover, none of the above literature examines the presence of multiple representation forms of the same value in a metadata field as a possible factor that might influence completeness. This source of additional information included in a metadata record is a matter of possible consideration when completeness of the record is to be measured. The above discussion brings out the obvious need to redefine completeness in a way that takes care of all the aforementioned issues of concern. In an effort to measure completeness, we will comply with the following principles: Completeness of a record must only be measured against the specifications of the metadata standard or application profile used by any given application. The standard or the application profile prescribes a certain number of fields as necessary placeholders of information to describe a resource. Completeness is then considered as the degree to which these

placeholders are full with values. That degree may swing in a continuum from total presence to absolute absence of values. Thus, a requirement for a metrics system to measure completeness would be its ability to assign values in a continuous range between these two endpoints. Intuitively, a range of values in the closed interval [0, 1] seems to adequately satisfy this requirement, since nonexistence of completeness would map to the value of 0 and total completeness would map to the value of 1, offering the convenience of easily translating the values to percentages. Limiting the range value of completeness in this condensed interval should not restrict the ability of the metrics system to detect any slight variation in the amount of information contained in a metadata record. Therefore, a critical requirement for the system would be to be able to differentiate the measure of completeness so as to reflect such variations. The following section will introduce a new fine-grained metrics system for metadata completeness based on the above requirements. Based on the analysis of the introductory and the motivation section of the paper, it is clear that completeness is only examined in terms of the existence of values in metadata elements and is not concerned with the values themselves, that is, we do not deal with the so-called “information content” expected to be found inside a metadata value. Such an orientation—for example, counting the number of words in free text fields and comparing to some minimum standard—would represent an absolutely different approach than the present one to the issue of metadata completeness that is definitely out of the scope of the present research effort. Yet we need to emphasize that the evaluation of completeness as a metadata quality parameter based on the existence of values in metadata elements (regardless of the actual content of these values) does not, in any way, underestimate or overlook this aspect. Within the context of our conceptual framework for metadata quality assessment introduced in Margaritopoulos T., Margaritopoulos, M., Mavridis, and Manitsaris (2008), the evaluation of the content of a value is an issue assigned to the other two dimensions of metadata quality, i.e., correctness and relevance (the measurement of which still remains an open research issue)–providing that a value is already there. This condition, i.e., the existence of values, is evaluated and measured by completeness. The Proposed Metadata Completeness Metrics System The keystone concept of the proposed fine-grained metrics system for measuring metadata completeness is the concept of completeness of a field, which will be defined in the next subsection. Completeness of a Metadata Field Just as completeness of a metadata record is defined as the weighted average of the existence or nonexistence of values in its fields (according to the metadata standard), completeness of a metadata field is defined following the same logic


but also taking into account the nature of the field (simple, aggregate, multivalued). Thus, completeness of a field f, represented as COM(f), will be given the disjunctive definition of Equation (3):

COM(f) = \begin{cases} \sum_{k=1}^{m(f)} a_k(f) \cdot P_k(f), & \text{if } f \text{ is a simple field} \\ \sum_{k=1}^{m(f)} a_k(f) \cdot COM_k(f), & \text{if } f \text{ is an aggregate multi-valued field} \\ \sum_{k=1}^{nsf(f)} a_k(f) \cdot COM(sf_k(f)), & \text{if } f \text{ is an aggregate single-valued field} \end{cases} \qquad (3)

In Equation (3), f represents any node (field) belonging to the hierarchy of the metadata schema. For example, for the IEEE LOM metadata schema, f may be a field like "5.2 Educational.Learning resource type," a category like "1 General," or even the whole record, which is the root node of the hierarchy. The first case of the split definition of Equation (3) defines completeness of a simple field, either single-valued or multivalued. The minimum number of values field f should have in order to be considered complete is represented by m(f). Considering the values of field f to be stored in m(f) placeholders, P_k(f) is defined as an existence indicator marking the presence (equal to 1) or absence (equal to 0) of a value in the k-th placeholder. Then, completeness of a simple field is defined as the weighted average of the existence indicators of its values. In the special case of a single-valued field, obviously m(f) is equal to 1. In a multivalued field, m(f) is a number depending on the semantics of the field. Sometimes it is defined by reality itself, for example, the number of languages contained in a multilanguage document. In cases where m(f) is not a real-life fact, it should be defined by the application in which the metadata are used, for example, the number of third-party annotations for a resource. To give an example of the calculation involved in the first case of Equation (3), let us assume that we have a DC metadata record with two instances of the simple multivalued field "Creator" (f = "Creator") registered in the record. If the resource associated with the metadata record has been created by three creators (m(f) = 3), then the existence indicators of the (three) expected values would be P_1(f) = 1, P_2(f) = 1, P_3(f) = 0. Considering the expected values to be equally weighted (a_1(f) = 0.33, a_2(f) = 0.33, a_3(f) = 0.33) and applying the first case of Equation (3), completeness of the field "Creator" would be:

COM(f) = a_1(f) · P_1(f) + a_2(f) · P_2(f) + a_3(f) · P_3(f) = 0.33 · 1 + 0.33 · 1 + 0.33 · 0 = 0.66

The second case of the split definition of Equation (3) is similar to the first one, with the difference that the values of an aggregate field are structured fields themselves,

bearing a measure of completeness of their own, other than the two discrete values of the existence indicator P_k(f). Thus, completeness of a multivalued aggregate field is computed as a weighted average of the completeness measures of its values (instances), with COM_k(f) being the completeness measure of the k-th value of the field. For example, completeness of the category "8 Annotation" of IEEE LOM will be computed as the weighted average of the measures of completeness of each "8 Annotation" instance. In this particular case, m(f) will be a certain number of annotations expected to be found in a record by a metadata application that would be regarded as sufficient in terms of information provided by previous users on the educational use of the associated learning object. Assuming that in an IEEE LOM metadata record two instances of the field "8 Annotation" (f = "8 Annotation") are registered, while the metadata application expects four instances (m(f) = 4), and the measures of completeness of the two registered instances are 0.8 and 0.4 (COM_1(f) = 0.8, COM_2(f) = 0.4, COM_3(f) = 0, COM_4(f) = 0), while each expected instance is considered to be equally weighted (a_1(f) = 0.25, a_2(f) = 0.25, a_3(f) = 0.25, a_4(f) = 0.25), then, applying the second case of Equation (3), completeness of the field f = "8 Annotation" would be:

COM(f) = a_1(f) · COM_1(f) + a_2(f) · COM_2(f) + a_3(f) · COM_3(f) + a_4(f) · COM_4(f) = 0.25 · 0.8 + 0.25 · 0.4 + 0.25 · 0 + 0.25 · 0 = 0.3

In the first two cases of Equation (3), a_k(f) represents the weights of importance of the values of a multivalued field. In order to limit the measures of completeness to the closed interval [0, 1], these weights are forced to have a sum of 1. The values of a multivalued field are generally considered to be equally weighted, that is, no value is regarded as more important than the rest. However, there might be applications where this default assumption does not fit. For example, an application might wish to assign progressively reduced weights to the creators of a resource being registered in a metadata record, as it may not consider any additional creators as important as the previous ones. In our previous example, where we had a DC metadata record with two instances of the simple multivalued field "Creator" while the resource associated with the record was actually created by three creators, if we had assigned progressively reduced weights to the "Creator" instances (e.g., a_1(f) = 0.6, a_2(f) = 0.3, a_3(f) = 0.1), then completeness of the field "Creator" would have been:

COM(f) = a_1(f) · P_1(f) + a_2(f) · P_2(f) + a_3(f) · P_3(f) = 0.6 · 1 + 0.3 · 1 + 0.1 · 0 = 0.9

Another special case of weight assignment is that of a conditional weight for an aggregate field, based on the value of one of its subfields. For example, referring to IEEE LOM fields, an application might wish to assign a higher weight to an instance of "2.3 Lifecycle.Contribute" in which


the subfield "2.3.1 Lifecycle.Contribute.Role" is "author" than to a different instance where the role of the contributor is "editor." In the last case of the split definition of Equation (3), completeness of a single-valued aggregate field is computed from the measures of completeness of its subfields, referring to the fields of the next lower level down the metadata hierarchy. The symbol nsf(f) represents the number of subfields of field f according to the metadata standard or application profile used, and sf_k(f) is the k-th subfield of field f. For example, for each (single-valued) instance of the aggregate field "8 Annotation" of IEEE LOM, nsf("8 Annotation") = 3, which is the number of the next lower level subfields—8.1 through 8.3—and sf_2("8 Annotation") is "8.2 Annotation.Date." The variable a_k(f) represents the weight of each subfield, which, in this case, is its relative importance to its parent field. Again, the sum of these weights is considered to be equal to 1. If an instance of "8 Annotation" contains a value for the subfield "8.1 Annotation.Entity," no value for the subfield "8.2 Annotation.Date," and a value for the subfield "8.3 Annotation.Description," then, since all three subfields are simple single-valued fields, the completeness measure of each will be equal to its existence indicator, that is, COM(sf_1(f)) = 1, COM(sf_2(f)) = 0, and COM(sf_3(f)) = 1. If we assign a weight of 0.4 to both subfields "8.1" and "8.3" and a weight of 0.2 to subfield "8.2" (a_1(f) = 0.4, a_2(f) = 0.2, a_3(f) = 0.4), then the completeness measure of this particular instance of field "8" will be calculated according to the third case of Equation (3), i.e.:

COM(f) = a_1(f) · COM(sf_1(f)) + a_2(f) · COM(sf_2(f)) + a_3(f) · COM(sf_3(f)) = 0.4 · 1 + 0.2 · 0 + 0.4 · 1 = 0.8

The definition of Equation (3) for the completeness of a field has a clearly recursive nature, since it uses previous values from lower field levels to create new ones. For example, completeness of the category "1 General" of IEEE LOM is computed as the weighted average of the measures of completeness of its eight subfields (1.1 through 1.8). Seven of these eight subfields are simple fields. In case they are single-valued, such as "1.7 General.Structure," completeness is equal to the existence indicator of the value. In the case of multivalued fields, such as "1.4 General.Description," completeness is computed as the weighted average of the measures of completeness of their instances. The completeness of each instance of the multivalued simple fields is, again, equal to the existence indicator of its value. Field "1.1 General.Identifier" is an aggregate field, so its completeness will be computed as the weighted average of the measures of completeness of its subfields ("1.1.1 General.Identifier.Catalog" and "1.1.2 General.Identifier.Entry"), which, in turn, are simple fields whose completeness is equal to the existence indicators of their values. Thus, traversing the sub-tree of the category "1 General" of IEEE LOM down to its leaves, we compute the completeness of this aggregate field.
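As a concrete illustration of this recursion, here is a minimal Python sketch of Equation (3). The encoding of schema nodes and record instances as nested dicts and lists (with "kind", "m", "weights", and "subfields" keys) is an assumption made for the example, not the paper's own data model; the "8 Annotation" fragment reproduces the worked example above.

# A sketch of the recursion in Equation (3). Schema nodes are dicts; record
# instances are lists (one entry per registered instance). The names and the
# nesting convention are illustrative assumptions.

def com_field(node, instances):
    """COM(f) for schema `node`, given the list of registered `instances`."""
    m = node.get("m", 1)
    weights = node.get("weights") or [1.0 / m] * m
    if node["kind"] == "simple":
        # First case of Eq. (3): weighted share of the m expected slots filled.
        return sum(w for k, w in enumerate(weights) if k < len(instances))
    if m > 1:
        # Second case: weighted average of the completeness of each instance.
        scores = [com_instance(node, inst) for inst in instances]
        return sum(w * s for w, s in zip(weights, scores))
    # Third case: single-valued aggregate field.
    return com_instance(node, instances[0] if instances else {})

def com_instance(node, instance):
    """Completeness of one instance of an aggregate field (third case of Eq. (3))."""
    return sum(w * com_field(sub, instance.get(name, []))
               for name, (w, sub) in node["subfields"].items())

# Fragment mirroring the "8 Annotation" example (m(f) = 4 expected instances):
annotation = {"kind": "aggregate", "m": 4, "subfields": {
    "8.1 Entity":      (0.4, {"kind": "simple", "m": 1}),
    "8.2 Date":        (0.2, {"kind": "simple", "m": 1}),
    "8.3 Description": (0.4, {"kind": "simple", "m": 1}),
}}
registered = [
    {"8.1 Entity": ["J. Doe"], "8.3 Description": ["very useful"]},   # COM = 0.8
    {"8.1 Entity": ["A. Roe"]},                                       # COM = 0.4
]
print(round(com_field(annotation, registered), 2))   # 0.25*0.8 + 0.25*0.4 = 0.3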

In general, in order to compute completeness of any node we start from the node and traversing downwards the hierarchy of the metadata schema, we recursively compute the measures of completeness of the nodes of its sub-tree, applying Equation (3). The base of the recursion is always the existence indicator of a single value of a simple field. Completeness of the whole record is computed the same way—starting from the top level (the root node) of the hierarchy. Completeness at the Representation Level When traversing the nodes of a metadata hierarchy in order to measure completeness of a record or of a specific field, one could say that the end of this process is the lowest level of the hierarchy or a field of a leaf node with no further structure of subfields.Yet the measure of completeness cannot just be computed by checking the presence of values in a simple field because such a field may still contain additional information that might affect the ideal representation of the described resource. This additional information belongs to the representation level of the values. It is possible for the values of certain datatypes to be represented in different forms that are semantically equivalent, e.g., a particular text in different languages, such as in the datatype “langstring” of the IEEE LOM standard. The existence of different representations should affect the measure of completeness of the respective field as being valuable additional information. Following the same logic, other forms of representation that could be set by a metadata standard might be different visual or audio versions of the same piece of textual information, which are simultaneously present in a metadata field. The consideration of the representation level does not allow us to think of completeness of a single value of a simple field as having only the two discrete values of the existence indicator Pk (f ). Completeness of such a field should be computed as the weighted average of its existent forms of representation. Thus, if we consider the different forms of representation of a single value of a simple field, completeness of a simple field is formulated as: COMrepresentations(f ) =

\sum_{k=1}^{m(f)} \Big[ a_k(f) \cdot \sum_{j=1}^{L} b_j \cdot R_j(k, f) \Big]

(4) where L is the number of all the possible different representation forms, bj is the weight of the j-th representation form and Rj (k, f ) is the existence indicator of this specific representation form of the k-th single value of field f. The sum of the weights bj is again set to 1. The minimum number of the possible different representations—for the represented value to be considered complete—and the weight of each is an issue determined by the application. In the case of different languages of the same textual information, a minimum number of languages that would make the value of the textual field to be considered complete might be the number of


languages of a certain geographical region (e.g., the European Union languages) or the number of languages used in a particular educational institution. As for the weights of the languages, apparently mother languages of a particular place or languages most frequently used would be assigned a higher weight. Equation (4) seems to adequately cover the issue of completeness of a simple field from any point of view. However, an essential problem emerges. If a field is a multivalued one, then, applying Equation (4), the measure of completeness will be computed taking into account all the instances of the different representation forms, regardless of the total number of values they express. For example, let us assume a situation where completeness of field “1.5 General.Keyword” of IEEE LOM is to be measured. The maximum number of values m(f) is set to 5 and the maximum number of languages L is set to 3, while we assign equal weights for all the different values (ak = 1/5, k = 1, . . . , 5) and all the different languages (bj = 1/3, j = 1,2,3) to express these values. We will consider two different cases. In case “a” the field is populated with one keyword expressed in three different languages, so the measure of completeness (according to Equation (4)) will be equal to: (1/5) ∗ (1/3) ∗ [(1 + 1 + 1) + (0 + 0 + 0) + (0 + 0 + 0) + (0 + 0 + 0) + (0 + 0 + 0)] = 0.2 In case “b” the field is populated with two keywords each expressed in one language, so completeness of the field would be equal to: (1/5) ∗ (1/3) ∗ [(1 + 0 + 0) + (1 + 0 + 0) + (0 + 0 + 0) + (0 + 0 + 0) + (0 + 0 + 0)] = 0.13 Apparently, in terms of completeness, one keyword outperforms two keywords just because this keyword is expressed in more languages than the two keywords, each of which is expressed in only one language. Consequently, we need to make a distinction between measurement of completeness that takes into account only the number of values of a field (which, obviously, is the most important) and measurement of completeness taking into account the number of the different representations of these values. The former is the one expressed by the first case of the split definition of Equation (3), from now on called COMvalues (f ). The latter is the one expressed by Equation (4). The above example involving keywords expressed in different languages is presented in graphical form in Figure 1. The requirement, in order to maintain the importance of the different values against their different representations, is that more values must always result in a higher completeness score than fewer values, regardless of the number of representation forms the values are expressed in. If we compute completeness as the weighted average of COMvalues (f ) and COMrepresentations (f ) and assign a higher weight to COMvalues (f ), then this requirement is met. Hence, the

measure of completeness of a simple field—to substitute the first case of Equation (3)—becomes:

COM(f) = c \cdot COM_{values}(f) + (1 - c) \cdot COM_{representations}(f) \qquad (5)

The value of weight c expresses the maximum measure of completeness of a complete field (taking into account only its values), regardless of their representations. The value of weight 1 − c expresses the additional amount of completeness, attributed to the representations of the values, so as to reach the maximum value of completeness (the value of 1) of a complete field. In the previous example of keywords expressed in different languages, if we set c = 0.8 (a value higher than 0.5), then for case "a" we have:

COM(f) = 0.8 · COM_values + (1 − 0.8) · COM_representations = 0.8 · 1/5 + 0.2 · 3/15 = 0.2

while for case "b" we have:

COM(f) = 0.8 · COM_values + (1 − 0.8) · COM_representations = 0.8 · 2/5 + 0.2 · 2/15 = 0.35

Weight c acts as expected, forcing case "b" to have a larger measure of completeness than case "a," since it employs two keywords compared to one, regardless of the number of different languages these keywords are expressed in.
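The interplay of Equations (4) and (5) can be sketched as follows for a simple multivalued field whose single values may carry several representation forms (such as langstring translations). The list-of-lists encoding of the field and the parameter values mirror the keyword example above and are otherwise illustrative assumptions.

# Sketch of Equations (4) and (5) for a simple multivalued field. Each value is
# encoded as the list of representation forms (e.g., languages) it is expressed
# in; this encoding and the parameters below are illustrative assumptions.

def com_values(values, m):
    """COM_values(f): share of the m expected value slots filled (equal weights)."""
    return min(len(values), m) / m

def com_representations(values, m, n_forms):
    """Equation (4) with equal weights a_k = 1/m and b_j = 1/n_forms."""
    return sum((1.0 / m) * (1.0 / n_forms) * len(forms) for forms in values[:m])

def com_simple_field(values, m, n_forms, c):
    """Equation (5): weighted blend of value- and representation-level completeness."""
    return c * com_values(values, m) + (1 - c) * com_representations(values, m, n_forms)

# "1.5 General.Keyword" example: m(f) = 5 keywords, L = 3 languages, c = 0.8.
case_a = [["en", "fr", "de"]]          # one keyword expressed in three languages
case_b = [["en"], ["en"]]              # two keywords, each in one language
print(round(com_simple_field(case_a, 5, 3, 0.8), 2))   # 0.8*1/5 + 0.2*3/15 = 0.2
print(round(com_simple_field(case_b, 5, 3, 0.8), 2))   # 0.8*2/5 + 0.2*2/15 = 0.35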

Application to Real-World Metadata Records In order to put the proposed metadata completeness metrics into action and prove its applicability and effectiveness, we apply the metrics on real-world metadata records, the first one following the simple Dublin Core and the second one following the IEEE LOM metadata schema. A sample DC metadata record harvested from the CADAIR open access online research repository of the Aberystwyth University of Wales (http://cadair.aber.ac.uk/dspace/) through the OAI/PMH metadata harvesting protocol (Open Archives Initiative, 2008) is displayed in Figure 2. The metadata schema is the simple Dublin Core (Dublin Core Metadata Element Set). The repository contains metadata records describing resources such as scientific articles and papers, conference proceedings, books, technical reports, etc. In Figure 3, a sample IEEE LOM metadata record harvested from the open access Slovenian Education Network repository (http://sio.edus.si/LreTomcat) through the OAI/PMH metadata harvesting protocol (Open Archives Initiative, 2008) is shown. The record follows the LRE (Learning Resource Exchange) application profile (by the European Schoolnet: http://eun.org) based on the IEEE LOM


FIG. 1. Completeness based on values or representations.

FIG. 2. A sample DC metadata record.

metadata schema. The repository contains metadata records describing learning objects of multimedia content. Before applying the formula of Equation (3) and performing any mathematical calculations, a proper configuration of the metrics system has to be performed by setting values to the parameters that customize the measurement according to the needs and requirements of its users. Such a process is required to be accomplished separately for the DC and the IEEE LOM metadata records. The parameters are set for two user profiles: an administrator of the repository hosting the metadata and an instructor planning to use the resources associated with the repository's metadata for educational purposes in order to prepare a lesson. The configuration of the metrics system is most important, since the measures of completeness are affected by the parameter values (which are set by the metadata user) in a decisive way. The relative weights of importance and the minimum number of values a multivalued field should have in order for a user to consider it complete influence the amount of information of the metadata record needed by this user to have an ideal (in terms of completeness)

representation of the associated resource. The completeness measure of the record is computed against this amount of information, which expresses full completeness. Configuration of the DC Metadata Record All 15 fields of the simple Dublin Core have been defined as multivalued. Furthermore, the simple DC schema is flat— its fields have no internal structure—thus, the only parameters to set are the minimum number of values for each field in order to consider the field complete and any relative weights of importance to reflect differentiated degrees of significance of each field to different users. The weights of importance for each one of the 15 fields of the simple DC for the administrator profile and the instructor profile are set as follows. For the administrator profile: Fields identifying the resource, like “Identifier” and “Title” are considered of the highest importance. In the second order of importance, we classify fields as “Type” and “Format” describing the


FIG. 3. A sample IEEE LOM metadata record.

nature and format of the resource (necessary to the administrator for its technical manipulation), as well as fields referring to its main contributors like “Creator” and “Publisher.” Next, we place fields referring to the temporal dimension of a resource’s lifecycle and any associated rights of use, like “Date” and “Rights.” Next in order of importance come fields describing the content of the resource, like “Language,” “Subject,” and “Description.” Last, fields of restricted interest to a repository administrator, like “Coverage,” “Source,” “Contributor,” and “Relation” take their place in the order of importance.

For the instructor profile: The most important fields for an instructor are considered fields referring to the content of the resource, like “Title,” “Subject,” and “Description.” In the second order of importance come fields referring to the resource’s accessibility, like “Rights,” “Language,” and “Identifier” (which most of the time contains a URL address enabling access to the resource), as well as the field “Creator.” In descending order of importance we place the fields “Type” and “Format,” “Date” and “Coverage,” and “Source,” “Publisher,” “Contributor,” and “Relation.” Following the above order of importance, we assign numerical values to


the weights, always making sure that their sum is 1. The exact weights of importance are shown in Table 1.

TABLE 1. Weights of relative importance of DC fields according to user profile.

DC field       Weight (admin profile)   Weight (instructor profile)
Identifier     0.200                    0.067
Title          0.200                    0.166
Creator        0.100                    0.067
Format         0.100                    0.050
Publisher      0.100                    0.017
Date           0.051                    0.033
Language       0.022                    0.067
Rights         0.051                    0.067
Type           0.100                    0.050
Contributor    0.008                    0.017
Description    0.022                    0.166
Coverage       0.008                    0.033
Relation       0.008                    0.017
Source         0.008                    0.017
Subject        0.022                    0.166

Last, we need to set the minimum number of occurrences (the m(f) parameters) of a field in order for it to be considered complete. As mentioned, the m(f) of a multivalued field is determined either by reality itself, or by the needs and requirements of the application that uses the metadata. In this regard, the number of creators of a resource, or the number of languages contained in a multilanguage document, is an objective, real fact, while the number of dates ("DC.Date") of a resource is subjective information depending on the number of the events of the resource's life that the application considers sufficiently important for their associated points or periods of time to be worth registering in the metadata record describing the resource. In the general case, given the fact that measurement of completeness is a process that can take place at any time (long after the indexing of the resource) and cannot necessarily involve experienced indexers or people who are aware of the nature and history of the resources (for example, when measuring completeness of metadata harvested in massive quantities from third-party repositories), information about the real life of the resources may not be easily available. In the absence of knowledge of the true values, there is no other way but to make an estimation for the m(f) parameters that is most likely to approximate their real values. This estimation may take into account any common characteristics of the resources and the requirements that we set as users of the metadata application. Hence, for the majority of the fields we set m(f) = 1, as we require only one occurrence of the field in order to consider it complete. For the rest of the fields: We set m(f) = 3 for the field "Creator," since for the kind of resources from which the particular resource was drawn, three creators is an average estimate. We set m(f) = 2 for the field "Identifier," since we expect two identifiers, a global one (e.g., URI, ISBN, etc.) and a local one to identify the resource in the repository of origin.

TABLE 2. Minimum number of occurrences of a DC field in order to be considered complete.

DC field       Minimum no. of values
Identifier     2
Title          1
Creator        3
Format         1
Publisher      1
Date           2
Language       1
Rights         1
Type           1
Contributor    1
Description    1
Coverage       1
Relation       1
Source         1
Subject        5

We set m(f) = 2

for the field “Date,” since we expect two dates associated with authorship and publication of the resource. Last, we set m(f) = 5 for the field “Subject,” since in this field one can register keywords describing the resource, thus we expect five keywords (five occurrences of the field with associated values) in order to consider it complete. The minimum number of occurrences of a field in order to be considered complete is shown in Table 2. Configuration of the IEEE LOM Metadata Record Thirty out of the 77 fields of IEEE LOM metadata schema have been defined as multivalued. Moreover, the metadata schema is hierarchical, i.e., many of its fields have internal structure (aggregate fields), comprising subfields, down to the depth of four levels, while the values of some of the simple fields can be expressed according to the langstring datatype, which offers the capability of having various translations of the same alphanumeric value in different languages. The parameters we need to set in order to configure the system, before performing any measurement, are the minimum number of values for each multivalued field in order to consider the field complete and the relative weights of importance for each field. We consider a langstring datatype field to be complete having only one langstring value. Thus, the minimum number of languages for langstring fields is set to 1 and the value of the c parameter for the relative contribution of the different values of a langstring datatype field to the measure of its completeness (in relation to its different translations) is also set to 1. The weights of importance for each field were set for the two user profiles, the administrator profile and the instructor profile. The weights for each field were set according to the relative importance of the field to its parent field. For example, in the first level (the nine categories), fields (categories) a repository administrator is, mainly, interested in are “1 General” and “4 Technical,” since these categories contain fields


for recognizing and identifying the described learning object, as well as fields for its technical manipulation. Next come the categories "2 Lifecycle" and "3 Meta-metadata," which offer the administrator information about the version of the learning object, its contributors, and the metadata record itself that describes the object. In the third order of importance we place "9 Classification," in the fourth "6 Rights," in the fifth "7 Relation," and in the last, sixth order we place the categories "5 Educational" and "8 Annotation," which are of the least importance to an administrator. Thus, we assign numerical values to the weights of fields of the same level, always making sure that their sum equals 1. In much the same way, we set weights for all the remaining fields down the hierarchy, as well as for all the fields for the instructor profile. All the weights set are shown in Table 3.

TABLE 3. Weights of relative importance of IEEE LOM fields for admin and instructor profiles.

Field      Weight (admin)   Weight (instructor)
1          0.222            0.278
1.1        0.375            0.025
1.1.1      0.400            0.400
1.1.2      0.600            0.600
1.2        0.188            0.250
1.3        0.125            0.125
1.4        0.075            0.188
1.5        0.100            0.188
1.6        0.063            0.088
1.7        0.038            0.050
1.8        0.038            0.088
2          0.133            0.067
2.1        0.300            0.300
2.2        0.100            0.250
2.3        0.600            0.450
2.3.1      0.300            0.300
2.3.2      0.400            0.400
2.3.3      0.300            0.300
3          0.133            0.022
3.1        0.400            0.400
3.1.1      0.400            0.400
3.1.2      0.600            0.600
3.2        0.300            0.300
3.2.1      0.300            0.300
3.2.2      0.400            0.400
3.2.3      0.300            0.300
3.3        0.100            0.100
3.4        0.200            0.200
4          0.222            0.078
4.1        0.143            0.286
4.2        0.429            0.143
4.3        0.286            0.343
4.4        0.036            0.114
4.4.1      1.000            1.000
4.4.1.1    0.250            0.250
4.4.1.2    0.250            0.250
4.4.1.3    0.250            0.350
4.4.1.4    0.250            0.150
4.5        0.036            0.043
4.6        0.036            0.043
4.7        0.036            0.029
5          0.022            0.200
5.1        0.091            0.052
5.2        0.091            0.273
5.3        0.091            0.052
5.4        0.091            0.052
5.5        0.091            0.091
5.6        0.091            0.182
5.7        0.091            0.091
5.8        0.091            0.052
5.9        0.091            0.052
5.10       0.091            0.052
5.11       0.091            0.052
6          0.078            0.078
6.1        0.417            0.417
6.2        0.417            0.417
6.3        0.167            0.167
7          0.056            0.056
7.1        0.400            0.400
7.2        0.600            0.600
7.2.1      0.800            0.800
7.2.1.1    0.400            0.400
7.2.1.2    0.600            0.600
7.2.2      0.200            0.200
8          0.022            0.111
8.1        0.333            0.333
8.2        0.333            0.333
8.3        0.333            0.333
9          0.111            0.111
9.1        0.375            0.375
9.2        0.500            0.500
9.2.1      0.250            0.250
9.2.2      0.750            0.750
9.2.2.1    0.500            0.500
9.2.2.2    0.500            0.500
9.3        0.063            0.063
9.4        0.063            0.063

As for the minimum number of values that each multivalued field should have in order to be considered complete, what mainly influences this parameter is the requirement of the application and the reality of the described resources, which are learning objects of multimedia content. For the majority of the fields, m(f) is set to 1. A small number of fields require the value of m(f) to be greater than 1. For example, for the field "1.5 General.Keyword" we require three keywords (m(f) = 3) in order to consider the field complete. Likewise, for the field "2.3 Lifecycle.Contribute," we require two instances of this field (m(f) = 2). This is due to the fact that each instance of this field acquires its structured meaning from its subfield "2.3.1 Lifecycle.Contribute.Role" and that the number of the most significant roles for this kind of resource is two—"author" and "publisher." All m(f) values greater than 1 are shown in Table 4.

TABLE 4. Minimum number of occurrences of an IEEE LOM multi-valued field in order to be considered complete.

Field      Minimum no. of values
1.5        3
2.3        2
4.1        2
7          2
8          2
9          3
9.2        3
9.2.2      2
9.4        3

At the bottom line of the configuration process of the completeness metrics system, what needs to be stressed is that it adjusts the measurement according to the requirements of the

end users and makes the metrics highly customizable on the basis of the exact representation of a resource (in terms of its metadata completeness) that an end user will consider ideal, so as to assign the top score of 1 to a full and complete metadata record. The above examples offered practical directions and guidelines on how to resolve the issues of setting values to the parameters that configure the system. In this process, a possible source of concern is the m(f) parameter, representing the minimum number of values a multivalued field should have in order to be considered complete. In the examples presented, the m(f) parameters whose values are determined by reality (and not by the subjective assessment of an end user) are supposed to be unknown; hence, simple hints to estimate and approximate the real values were provided to overcome this uncomfortable situation. This was done in order to present the whole process without having to access the primary sources of information for each resource (as would be the case if a considerable number of metadata records to be tested for completeness were harvested from third-party repositories). However, it is


FIG. 4. Completeness measurement results of the DC record of Figure 2.

obvious that using the proposed metrics system to measure completeness—after being configured according to reality and the requirements of its users—is one issue, while the estimation of an attribute of a resource (by trying to "guess" its true value)—when information about the resource's real life is inaccessible—is another, completely different issue; thus, the two issues should not be confused.

Measurement Results

After having configured the metadata completeness metrics system based on user profiles, metadata schemata, and types of the described resources, we conducted the measurements by performing the relevant mathematical calculations on the two XML-encoded metadata records displayed in Figures 2 and 3. A software tool, developed for this task in order to automate the process, was utilized. The application of Equation (3) produced the following results.

Completeness Measurement of the DC Record

We measured completeness for each one of the 15 fields of the DC standard, as well as for the whole record itself (displayed in Figure 2). Three different completeness measures were calculated, according to three independent configurations of the metrics system corresponding to different user profiles. The first configuration was based on the administrator profile, the second one was based on the instructor profile

(parameter values in Table 1), while the third one followed the traditional approach, which considers all fields to be of equal importance and ignores the possibility of fields with multiple cardinality (as is the case with the DC standard). In the third profile, called "flat profile," all 15 fields of the simple DC are equally weighted and the m(f) parameter for all fields is 1. Figure 4 displays the completeness measurement results of the DC record of Figure 2. The completeness measures of the DC fields shown are those calculated according to the administrator and instructor profiles (taking into account the parameters of Table 1), while in the flat profile completeness of each field is valued as either 0 or 1, since multivalued fields are not considered. Comparing the completeness measures of the record for the two profiles, we conclude that the record is "more complete" for an instructor than for an administrator, since the amount of information it contains better suits this particular user's needs and requirements. Applying the completeness metrics on a slightly differentiated version of the DC record of Figure 2—after having reduced it by using only one "Identifier" instead of two—we get decreased completeness for the administrator and instructor profiles, while for the "traditional" flat profile fewer identifiers have no effect on the completeness of the record. Figure 5 displays completeness for the "degraded" DC record with only one "Identifier." For the administrator profile, completeness of the record falls from 71% to 61%, while for the


FIG. 5. Completeness measurement results of the "degraded" DC record of Figure 2, with one "Identifier."

instructor profile it is reduced from 43% to 39%. For the flat profile, completeness remains unchanged (47%). Thus, the proposed "fine-grained" metrics system proves its capability to highlight and measure even slight differences in the informational content (measured in terms of field instances) of a metadata record, which are reflected in its completeness measure.

Completeness Measurement of the IEEE LOM Record

We measured completeness for each one of the 77 fields (simple or aggregate) of the IEEE LOM standard by applying the recursive formula of Equation (3) to the XML-encoded IEEE LOM metadata record of Figure 3. Two different measurements were implemented, corresponding to two different configurations of the metrics system: one for the administrator profile and one for the instructor profile. The absence of a "flat" profile is due to the fact that no previous work has fully designed and implemented an attempt to measure completeness of a record following a hierarchical schema. Table 5 is populated with completeness measures for each one of the IEEE LOM fields, which were calculated according to the values of the parameters defined for each one of the two profiles. The measure of completeness of all missing fields is 0. Completeness of the whole record is displayed in the last row of Table 5. Table 5 quantitatively expresses the structure of the metadata record based on user requirements. What is really achieved by these measurements is the ability to measure

completeness at any node of the metadata hierarchy. Measuring only the number of populated fields is not enough to obtain a clear understanding of the amount of information a hierarchical record holds. This is because, by only measuring the number of occurrences of a field belonging to a parent aggregate multivalued field, the hierarchy of the schema is ignored and it is not possible to distinguish whether this particular field occurs more than once in the same instance of its parent field, or whether its occurrences belong to different instances of its parent.

Discussion

The way the proposed metrics system for measuring metadata completeness was designed offers the ability to differentiate each measure of completeness according to any slight variation in the amount of information loaded into the metadata fields acting as placeholders of values. This is a result of admeasuring the effect of any multiple values a multivalued field might have on the total measure of completeness. One less "Identifier" in the DC record of Figure 2 results in reduced completeness for the administrator and instructor profiles, while for the traditional "flat" profile one less value has no effect on the completeness measure of the record. Another significant advantage of the proposed metrics system is its ability to spot and quantify problems of completeness at any node of a hierarchy of metadata fields. The system follows the hierarchy of the schema and computes existence indicators of field values, weighing these indicators


according to the importance of each field in relation to its parent node.

TABLE 5. Completeness measures of fields of the IEEE LOM record displayed in Figure 3.

Field      Completeness (admin profile)   Completeness (instructor profile)
1          57%                            78%
1.1        100%                           100%
1.1.1      100%                           100%
1.1.2      100%                           100%
1.2        100%                           100%
1.3        100%                           100%
1.5        89%                            89%
2          8%                             9%
2.3        30%                            30%
2.3.3      100%                           100%
3          90%                            90%
3.1        100%                           100%
3.1.1      100%                           100%
3.1.2      100%                           100%
3.2        100%                           100%
3.2.1      100%                           100%
3.2.2      100%                           100%
3.2.3      100%                           100%
3.4        100%                           100%
4.1        33%                            33%
4.1        100%                           100%
4.3        100%                           100%
5          27%                            9%
5.2        100%                           100%
6          42%                            42%
6.1        100%                           100%
RECORD     30%                            41%

The demonstrated measurement of completeness of the IEEE LOM record of Figure 3 proves this capability. Moreover, the proposed metrics system adds a new dimension to the measurement of completeness by taking into account the representation level of a single value of a simple field. This level constitutes information which, in many cases, might be of significant importance for the metadata application and influences completeness of the field according to the assigned application weight of importance (the value of 1 − c of Equation (5)). This dimension was not taken into account in the measurement of the IEEE LOM record of Figure 3. However, this capability of the metrics system can provide valuable information to the metadata application users. For example, in multilingually oriented applications, where different translations of text are of great importance, such "nuances" of information might represent a key factor affecting the measure of metadata completeness. Undoubtedly, a point of discussion is the configuration of the metrics system, i.e., the selection of the parameter values, before conducting the actual measurement and doing the math calculations. Depending on the metadata schema, which defines the metadata elements and their interrelations within the schema structure, a significant number of parameters is required to be set manually. Although the need to set the parameters may appear as an additional burden to metadata users, the truth is that such tasks have already been

Undoubtedly, a point of discussion is the configuration of the metrics system, that is, the selection of the parameter values before the actual measurement and calculations are carried out. Depending on the metadata schema, which defines the metadata elements and their interrelations within the schema structure, a significant number of parameters must be set manually. Although the need to set these parameters may appear to be an additional burden on metadata users, such tasks have in fact already been undertaken by application profile creators, contributing in this way to the effort of balancing the conflict between completeness and simplicity noted by Stvilia et al. (2004). The purpose of an application profile is to adapt or combine existing schemas into a package tailored to the functional requirements of a particular application, while retaining interoperability with the original base schemas (Duval, Hodgins, Sutton, & Weibel, 2002). Hillmann and Phipps (2007) characterize an application profile as a "template for expectation," arguing that it holds high potential for quantifying metadata quality in terms of completeness: the expectation expressed in the application profile determines how complete an individual metadata record ought to be. The configuration parameters proposed here constitute one aspect of this "template for expectation" and can be included in the provisions of an application profile. Since the purpose of an application profile is served by specific mechanisms, one of which is cardinality enforcement (Duval et al., 2002), the configuration parameters of the proposed completeness metrics can be defined by enriching the outputs of this mechanism. Just as an application profile specifies whether a metadata element is mandatory or optional, based on the importance of the element for the particular community of use, it can also specify the relative importance of each element by assigning an arithmetic value as the element's weight. Following the same logic, an application profile can specify the exact multiplicity an element requires in order to be considered complete.

The configuration process can be implemented within the context of creating an application profile and can benefit from any relevant effort to facilitate or automate this broader task. For example, provided that a suitable software interface is available, the weights of importance of metadata elements can be assigned automatically from the relative frequency with which users employ each element when issuing queries to a metadata repository (Ochoa & Duval, 2009).

Beyond the benefits and advantages described above, measuring completeness of metadata with the proposed fine-grained metrics system can offer valuable help in fulfilling the specific requirements set by the context of use. The particular process or activity handling the metadata determines the pragmatics of measuring and defines its exact purpose. For example, measuring completeness can serve as an important tool for evaluating automatic metadata generation methods and techniques. Another potential application is the use of targeted measures of completeness as additional criteria for filtering the results produced by a search engine. When searching for learning objects, for instance, a teacher preparing a lesson might wish to ensure that the metadata of the returned results contain a certain amount of educational information. The teacher could therefore set a threshold value for the completeness of specific elements (the educational fields) of the returned learning objects and filter the results according to this criterion, as in the sketch below.
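A minimal sketch of such a filter follows, reusing the illustrative Field and completeness() definitions from the earlier sketch; the choice of IEEE LOM category 5 ("Educational") as the targeted subtree and the 0.7 threshold are assumptions made for this example, not values prescribed by the metrics system.

```python
# Hypothetical post-search filter: keep only resources whose "Educational"
# subtree (IEEE LOM category 5) reaches a minimum targeted completeness.
# Reuses the illustrative Field / completeness() sketch given above.

EDUCATIONAL_THRESHOLD = 0.7  # arbitrary example value

def filter_by_educational_completeness(results, threshold=EDUCATIONAL_THRESHOLD):
    """results: iterable of (resource, educational_subtree) pairs, where
    educational_subtree is the Field node for LOM category 5 of that
    resource's metadata record."""
    return [resource
            for resource, educational_subtree in results
            if completeness(educational_subtree) >= threshold]
```

In practice the threshold, the targeted subtree, and the underlying weights would all be drawn from the application profile discussed above.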


Conclusion – Future Work

In this paper a metrics system for measuring the completeness of metadata was presented. In an effort to address the inadequacies of the traditional approach, which measures the completeness of a metadata record by simply counting the presence or absence of values in fields, the proposed system defines completeness at the field level in a recursive way, following the hierarchy of the metadata schema. Multivalued and aggregate fields were taken into consideration, as well as the representation level of semantically equivalent information in the metadata fields. The result is a set of metrics that reflects the needs and requirements of the application level (determined by weighting factors specified by the particular process or activity of the metadata application) and can easily be implemented by automated means. The proposed metrics system was put into action and used to measure the completeness of real-world metadata records, and the results show that it fulfills its requirements.

The next step following this work is an experimental implementation of the metrics system and its application to a working database of metadata in a digital repository. The resulting targeted measures of completeness for particular fields (especially in a hierarchical metadata schema) are expected to provide valuable information and to support conclusions about completeness, and metadata quality in general, that would otherwise be impossible to reach.

References

Arazy, O., & Kopak, R. (2011). On the measurability of information quality. Journal of the American Society for Information Science and Technology, 62(1), 89–99.
Barton, J., Currier, S., & Hey, J.M.N. (2003). Building quality assurance into metadata creation: An analysis based on the learning objects and e-prints communities of practice. In Proceedings of the International Conference on Dublin Core and Metadata Applications: Supporting Communities of Discourse and Practice (pp. 39–48). Singapore: Dublin Core Metadata Initiative.
Bruce, T.R., & Hillmann, D.I. (2004). The continuum of metadata quality: Defining, expressing, exploiting. In D.I. Hillmann & E. Westbrooks (Eds.), Metadata in practice (pp. 238–256). Chicago: ALA Editions.
Bui, Y., & Park, J. (2006). An assessment of metadata quality: A case study of the National Science Digital Library metadata repository. In Haidar Moukdad (Ed.), CAIS/ACSI 2006: Information science revisited: Approaches to innovation.
Dublin Core Metadata Element Set. Retrieved from http://dublincore.org/documents/dces/
Dushay, N., & Hillmann, D. (2003). Analyzing metadata for effective use and re-use. In Proceedings of the International Conference on Dublin Core and Metadata Applications: Supporting Communities of Discourse and Practice (pp. 1–10). Singapore: Dublin Core Metadata Initiative.
Duval, E., Hodgins, W., Sutton, S., & Weibel, S. (2002). Metadata principles and practicalities. D-Lib Magazine, 8(4). Retrieved from http://www.dlib.org/dlib/april02/weibel/04weibel.html
Friesen, N. (2004). International LOM survey: Report (draft). Retrieved from http://arizona.openrepository.com/arizona/bitstream/10150/106473/1/LOM_Survey_Report2.doc
Greenberg, J. (2003). Metadata and the World Wide Web. In M. Dekker (Ed.), Encyclopaedia of library and information science (pp. 1876–1888). New York: Marcel Dekker.

Greenberg, J., Spurgin, K., & Crystal, A. (2005). AMeGA (Automatic Metadata Generation Applications) project: Final report. University of North Carolina & Library of Congress. Retrieved from http://www.loc.gov/catdir/bibcontrol/lc_amega_final_report.pdf
Guinchard, C. (2002). Dublin Core use in libraries: A survey. OCLC Systems & Services, 18(1), 40–50.
Hillmann, D.I., & Phipps, J. (2007). Application profiles: Exposing and enforcing metadata quality. In Proceedings of the International Conference on Dublin Core and Metadata Applications: Application Profiles: Theory and Practice (pp. 52–62). Singapore: Dublin Core Metadata Initiative.
Hughes, B. (2004). Metadata quality evaluation: Experience from the Open Language Archives Community. In Z. Chen et al. (Eds.), International Conference on Asian Digital Libraries (ICADL 2004), Lecture Notes in Computer Science, 3334, 320–329.
Learning Object Metadata. Retrieved from http://ltsc.ieee.org/wg12/files/LOM_1484_12_1_v1_Final_Draft.pdf
Liddy, E., Allen, E., Harwell, S., Corieri, S., Yilmazel, O., Ozgencil, N., . . . (2002). Automatic metadata generation & evaluation. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 401–402). New York: ACM Press.
Margaritopoulos, M., Margaritopoulos, T., Kotini, I., & Manitsaris, A. (2008). Automatic metadata generation by utilising pre-existing metadata of related resources. International Journal of Metadata, Semantics and Ontologies, 3(4), 292–304.
Margaritopoulos, T., Margaritopoulos, M., Mavridis, I., & Manitsaris, A. (2008). A conceptual framework for metadata quality assessment. In International Conference on Dublin Core and Metadata Applications: Metadata for Semantic and Social Applications (pp. 104–116). Singapore: Dublin Core Metadata Initiative.
Moen, W., Stewart, E., & McClure, C. (1998). Assessing metadata quality: Findings and methodological considerations from an evaluation of the US Government Information Locator Service (GILS). In Proceedings of the Advances in Digital Libraries Conference (pp. 246–255). Los Alamitos, CA: IEEE Computer Society.
Moreira, B., Gonçalves, A., Laender, A., & Fox, E. (2009). Automatic evaluation of digital libraries with 5SQual. Journal of Informetrics, 3(2), 102–123.
Najjar, J., Ternier, S., & Duval, E. (2003, November). The actual use of metadata in ARIADNE: An empirical analysis. Paper presented at the ARIADNE Conference, Leuven, Belgium.
Najjar, J., Ternier, S., & Duval, E. (2004, June). User behavior in learning object repositories: An empirical analysis. Paper presented at the ED-MEDIA 2004 World Conference on Educational Multimedia, Hypermedia and Telecommunications, Lugano, Switzerland.
Ochoa, X., Cardinaels, K., Meire, M., & Duval, E. (2005, June). Frameworks for the automatic indexation of learning management systems content into learning object repositories. Paper presented at the ED-MEDIA 2005 World Conference on Educational Multimedia, Hypermedia and Telecommunications, Montreal, Canada.
Ochoa, X., & Duval, E. (2009). Automatic evaluation of metadata quality in digital libraries. International Journal on Digital Libraries, 10(2–3), 67–91.
Open Archives Initiative (2008). OAI Protocol for Metadata Harvesting (OAI-PMH) v2.0. Retrieved from http://www.openarchives.org/OAI/openarchivesprotocol.html
Sicilia, M.A., García, E., Pagés, C., Martínez, J.J., & Gutiérrez, J. (2005). Complete metadata records in learning object repositories: Some evidence and requirements. International Journal of Learning Technology, 1(4), 411–424.
Stvilia, B., Gasser, L., Twidale, M., Shreeves, S., & Cole, T. (2004). Metadata quality for federated collections. In Proceedings of the Ninth International Conference on Information Quality (pp. 111–125). Boston: MIT Press.
Stvilia, B., Gasser, L., Twidale, M., & Smith, L. (2007). A framework for information quality assessment. Journal of the American Society for Information Science and Technology, 58(12), 1720–1733.
