Parallel Computing 28 (2002) 673–704 www.elsevier.com/locate/parco

Parallel data intensive computing in scientific and commercial applications

Mario Cannataro (a), Domenico Talia (b,*), Pradip K. Srimani (c)

(a) ICAR-CNR, Via P. Bucci 41-C, 87036 Rende (CS), Italy
(b) DEIS, University of Calabria, Via P. Bucci, Cubo 41-C, 87036 Rende (CS), Italy
(c) Department of Computer Science, Clemson University, Clemson, SC 29634-0974, USA

Received 11 March 2001; received in revised form 20 November 2001

Abstract

Applications that explore, query, analyze, visualize, and, in general, process very large scale data sets are known as Data Intensive Applications. Large scale data intensive computing plays an increasingly important role in many scientific activities and commercial applications, whether it involves data mining of commercial transactions, experimental data analysis and visualization, or intensive simulation such as climate modeling. By combining high performance computation, very large data storage, high bandwidth access, and high-speed local and wide area networking, data intensive computing enhances the technical capabilities and usefulness of most systems. The integration of parallel and distributed computational environments will produce major improvements in performance for both computing intensive and data intensive applications in the future. The purpose of this introductory article is to provide an overview of the main issues in parallel data intensive computing in scientific and commercial applications and to encourage the reader to explore the more in-depth articles that follow in this special issue. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Data intensive algorithms; Parallel computing; Parallel databases; Parallel I/O

* Corresponding author. Tel.: +39-984-494726; fax: +39-984-839054.
E-mail addresses: [email protected] (M. Cannataro), [email protected] (D. Talia), srimani@cs.clemson.edu (P.K. Srimani).

0167-8191/02/$ - see front matter © 2002 Elsevier Science B.V. All rights reserved.
PII: S0167-8191(02)00091-1

1. Introduction to data intensive computing

Today the information stored in digital memories is enormous and its size is still growing very rapidly. Whereas until a few years ago the main problem was the lack of information, the challenge now is the large volume of information to deal with and the associated complexity of processing it and of extracting and visualizing the parts relevant to a given application. NASA satellite systems [1] for earth observation (EO) generate 50 gigabytes of images each hour, and the Human Genome project [2] has collected several terabytes of data on the human genetic code that must be analyzed. Since 1993 the New York Stock Exchange has been generating a database of USA market transactions at about 2 gigabytes per month. In the very near future, the Large Hadron Collider (LHC) at CERN [3] will produce several petabytes of data per year. In order for these huge databases and data repositories to be of any meaningful use, new efficient algorithms and strategies will be needed to process the data quickly and efficiently.

Data intensive applications are defined to be those that explore, query, analyze, visualize, and, in general, process very large-scale data sets. Data intensive applications can benefit from the use of parallel and distributed computing systems both to improve performance and to improve the quality of data management and analysis. When data intensive algorithms, applications and tools are implemented on high-performance computers, they can analyze massive databases in a reasonable time. Performing the analysis of massive data sets will almost certainly require the sustained application of some tens of teraflops of computing power. Faster processing also means that users can experiment with more models to understand complex data. In summary, high performance parallel and distributed computing makes it practical and cost-effective for users to analyze larger quantities of data.

For researchers and professionals working in the area of parallel computation, data intensive applications are growing in importance, and they introduce interesting new dimensions such as irregularity, data representation and storage, decentralized data access, multiple parallelization strategies and heterogeneity that have not been so critical in scientific and numerical computing. For organizations that want to use data intensive applications in their day-to-day work, parallel computation offers increased performance, which in turn may translate into commercial advantage.

Much of the recent research in practical parallel computing appears to be driven by the demands of scientific and engineering applications. Grand challenge applications and, more recently, the DOE's ASCI project have provided a great incentive for the development of high performance parallel architectures, programming environments, and software tools for performing large-scale complex simulation of physical phenomena. Large-scale data intensive parallel applications have become the leading consumer of high performance parallel computing because of their complexity as well as the sheer size of the data. The demand for data intensive applications (e.g., data mining, knowledge management, web computing and e-commerce) is growing very rapidly, and such applications can potentially become the biggest consumers of parallel computing. The rapid growth in data intensive applications also stimulates the demand for raw data storage capacity. Applications such as data warehousing, data mining, online transaction processing, and multimedia Internet and intranet browsing have resulted in a near doubling of the total storage capacity shipped annually.

This paper describes the main aspects of parallel data intensive computing. It provides an overview of the main issues of high-performance data intensive computing and discusses some application areas that show different aspects of research and systems solutions for data intensive computing.

The organization of the paper is as follows. In Section 2, we discuss the main areas of information processing, such as databases, data warehouses, online analytical processing (OLAP) and data mining, where parallel computing is effectively used both to speed up data management and to improve analysis techniques. Section 3 describes parallel hardware and storage systems issues, and Section 4 presents some relevant scientific and business application areas where the deployment of data intensive parallel algorithms and systems is a crucial factor. Finally, Section 5 concludes the paper and outlines some future challenges in the area.

2. Information processing and management overview

Traditional data intensive applications are normally designed in a layered fashion. The most common decomposition uses a three-layer architecture (assuming a client/server scenario):
1. The User layer is responsible for the interaction between the user and the application.
2. The Application layer embeds the application logic (e.g., the business rules of a sales process).
3. The Data layer offers a high-level interface to access data, hiding data organization details and data access primitives.

The simplest approach to enhance the performance of such an existing application on parallel and distributed architectures, without changing the behavior of the application, is to parallelize each layer of the architecture separately, avoiding bottlenecks in the interface mechanisms between layers. For instance, at the User layer, typically implemented on a sequential PC or a simpler terminal (PDA, WAP phone, etc.) executing a web browser, performance is usually enhanced by increasing the system performance or by employing low-level parallelism. When we consider thousands or millions of users accessing an application, the network becomes the bottleneck; protocols that utilize parallel communication primitives at lower levels (e.g., p-sockets) should be employed when increasing bandwidth or using priority-based allocation schemes is not cost effective.

The availability of low cost, off-the-shelf, high bandwidth communication networks and protocols and the emergence of peer-to-peer protocols make it possible to develop distributed applications composed of thousands or millions of cooperating agents, each with its own processor, memory, disks, file system, database, etc.

Although emerging applications use decentralized interaction modes (peer-to-peer protocols) and the amount of data being used is increasingly semi-structured [4] (XML, RDF, HTML), centralized database systems will continue to play a significant role in the data intensive application arena. Specialized DBMSs for semi-structured data have started being deployed (e.g., [5]) and traditional DBMSs have started adding features to handle such data. But, as of now, most data analysis applications have been developed for centralized databases. We can roughly identify the following classes:
• online transaction processing (OLTP) applications, such as electronic commerce and flight reservation, that access the database with small, short-lived transactions touching a small portion of the data; they require fast response times and high throughput because of the large volume of concurrent requests;
• decision support applications, such as data mining and OLAP [6], that access the database with complex, long-lived transactions touching a large portion of the total data; they typically require hours or days to execute, the major requirement being faster execution time;
• database utility commands, whose execution is yet another difficult problem because it occurs while the system remains operational and the data remains available to other applications; the execution of such commands needs to be online (data remains available), incremental (accessing parts of the database), parallel, and recoverable.

2.1. Parallelism in database systems

In parallel database systems the sources of parallelism are various, yielding parallel activities of different grain sizes [7]. At the lowest layer, data are partitioned (and possibly replicated) among the computational elements of a parallel architecture. Considering more abstract layers we have:
• parallel access and retrieval of the stored data (e.g., sort, scan, etc.),
• parallel execution of a single database operation (e.g., join, index creation, etc.),
• intra-transaction parallelism, i.e. parallel (concurrent) execution of different database operations belonging to the same logical unit of work,
• inter-transaction parallelism, i.e. parallel execution of different transactions.

At the lowest layer the parallelism mainly originates from data decomposition and allocation over a set of different physical disks, managed in parallel by different processors. At the higher layers we have functional parallelism. Relational queries are ideally suited to parallel execution: they consist of uniform operations applied to uniform streams of data (relations). Since each operation produces a relation, the operators can be composed into highly parallel dataflow graphs, which can lead to pipelined parallelism (two operators working in series) or partitioned parallelism (input data partitioned among independent operator instances, each working on a different data partition) [8].
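As a concrete (and deliberately simplified) illustration of partitioned parallelism, the sketch below hash-partitions two relations on the join key and runs an independent join worker on each pair of co-partitions; the per-partition results are then concatenated. It assumes nothing about any particular DBMS, and the relation contents are invented.

```python
from multiprocessing import Pool

NUM_PARTITIONS = 4  # one worker per partition (illustrative choice)

def partition(rows, key_index):
    """Hash-partition rows on the join key."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        parts[hash(row[key_index]) % NUM_PARTITIONS].append(row)
    return parts

def join_partition(args):
    """Join one pair of co-partitions with an in-memory hash join."""
    left, right = args
    index = {}
    for l in left:                       # build phase on the left input
        index.setdefault(l[0], []).append(l)
    return [l + r for r in right for l in index.get(r[0], [])]  # probe phase

def parallel_join(left_rows, right_rows):
    left_parts = partition(left_rows, 0)
    right_parts = partition(right_rows, 0)
    with Pool(NUM_PARTITIONS) as pool:   # independent operator instances
        results = pool.map(join_partition, zip(left_parts, right_parts))
    return [row for part in results for row in part]

if __name__ == "__main__":
    customers = [(1, "Ann"), (2, "Bob"), (3, "Carl")]
    orders = [(1, "book"), (3, "disk"), (1, "pen")]
    print(parallel_join(customers, orders))
```

A pipelined variant would instead stream the output of one operator directly into the next; real parallel database systems combine both forms of parallelism.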

Inter-transaction parallelism is exploited by executing in parallel the different transactions submitted to the system, maintaining their ACID (atomicity, consistency, isolation, durability) properties by using sophisticated concurrency control algorithms. Many modern DBMSs exploit this form of parallelism, with the main goal of optimizing throughput (transactions executed per time unit) and average response time.

All modern DBMSs exploit inter-transaction parallelism and partitioned parallelism (parallel access to the stored data), and, with less emphasis, intra-transaction parallelism; but they offer little support for the parallel execution of single database operators. The reason is that such DBMSs are mainly designed for OLTP applications, whose requirements are completely different from those of decision support. Decision support applications often require the execution of a single SQL operator, the select statement, with very complex parameters, nested sub-selects, etc. So, to meet the requirements of decision support queries (to increase speed-up, or to maintain scale-up), the parallel execution of single database operators is crucial. Examples of such operators are the parallel join (implemented by using sorting or hash functions) and parallel index creation. A particularly complex join used in decision support applications is the so-called star-join (we discuss this further in the data warehouse section).

Traditionally, three types of system architectures are used to implement parallel database systems [8,9].

Shared nothing, also known as massively parallel processing (MPP). Each processor has its own memory and peripherals, and data is exchanged by message passing only, through a general-purpose or ad-hoc communication network. Examples are: clusters, IBM SP2, Tandem, NCR [10]. The principal characteristics of MPP architectures are:
• great scalability, up to hundreds or thousands of processors,
• communication can be a major bottleneck,
• scalability requires computation of appropriate grain size with respect to communication bandwidth and latency,
• data partitioning is static,
• data is not shared among processors,
• data managed by different processors must be combined using message passing,
• great reliability and availability due to fault isolation.

Shared everything, also known as symmetric multi-processor (SMP). Memory and disks are shared among processors. Examples are: Convex, IBM DB2 UDB, Sequent, Sequoia, SGI, Sun [11]. The principal characteristics of SMP architectures are:
• lower scalability with respect to MPP (such systems scale to tens of processors),
• easy programming, having the choice of different programming paradigms,
• efficient and easy to implement load balancing, due to shared memory,
• data shared among all processors,
• data combined via shared memory,
• data can be dynamically partitioned among processors,
• lower reliability and availability due to low fault isolation because of the shared memory.

Shared disk: Each processor has its own memory, but it shares the disks with the other processors through a communication network. This is a hybrid solution between MPP and SMP. It has the advantage of easing application migration from a centralized architecture (single node configuration) and offers good reliability. Since this architecture is quite similar to SMP, we consider only MPP and SMP architectures in the rest of the paper.

The main drawback of MPP systems for parallel databases is the need for static partitioning. Whereas dynamic partitioning could, in principle, be implemented by message passing, it is not cost-effective due to the large volumes of data to be moved across the network. The essence of MPP programming paradigms is to move control instead of data; thus, it is difficult to change the number of partitions and to size them such that they are balanced for a wide range of applications. Moreover, it is hard to treat difficult value distributions (skews). On the contrary, in SMP systems dynamic partitioning allows the assignment of data to processors on demand, leveraging the shared memory to move data in an efficient way. Shared memory also allows for flexible load balancing and offers many options for parallel algorithms (from data sharing to message passing). SMP systems are usually more manageable than MPP ones. On the other hand, the MPP architecture scales better than SMP, allowing higher speedup. In conclusion, the model behind parallel databases lends itself to MPP systems (in spite of the difficulties caused by static partitioning). However, decision support applications pose some new problems for parallel database systems that can be more easily solved with SMP architectures. On the other hand, cluster-based architectures composed of off-the-shelf components are emerging as low-cost, easy-to-build MPP systems. The price/performance ratio will probably win, in the medium term, over the other performance criteria (such as ease of programming and load balancing) that would favor SMP systems.

2.2. Parallel data warehouses and OLAP

Data warehousing and OLAP are the core elements of decision support applications. As mentioned above, decision support applications impose different requirements on database technology compared to traditional OLTP applications. Data warehousing [6,12,13] is a collection of decision support technologies enabling the decision maker (executive, manager, analyst) to do the job (making decisions based on information) better and faster. A data warehouse is a "subject-oriented, integrated, time-varying, non-volatile collection of data used primarily in decision making" [14]. In recent years, data warehouses have been developed for many domains: manufacturing, retail, financial services, transportation, telecommunications, utilities, and healthcare. The data warehouse and the operational database(s) are maintained separately; the former has to support OLAP, whereas the latter also continues to support OLTP applications.

OLTP applications typically automate traditional day-to-day data processing tasks of an organization, such as order entry and banking transactions. They consist of short, atomic, isolated transactions requiring detailed, up-to-date data and accessing a few (tens of) records, typically via their primary keys. Consequently, the database is designed to minimize concurrency conflicts. The main requirements are consistency, recoverability and transaction throughput. Data warehouses, in contrast, are devoted to decision support, where historical, summarized and consolidated data is more important than detailed, individual records. Because data warehouses contain consolidated data coming from several operational databases, they are larger than operational databases. The typical workload consists of complex queries that access a large portion of the data, performing many scans, joins, and aggregates. The main requirements are therefore query throughput and response times.

2.2.1. Conceptual model and OLAP

A data warehouse conceptually builds on a multi-dimensional data model. It comprises a set of numeric measures (i.e. the objects of analysis, such as sales or budget), each depending on a set of dimensions (the context of the measure, such as the time and location of a sale), whose values can vary along hierarchies (e.g. day, week, month, quarter, year). As an example, if we consider a sale amount as a measure, possible dimensions could be the city, the product type, and the date when the sale was made. In this way, each measure is a value in the multi-dimensional space of dimensions that uniquely determine it. Data organized using the multi-dimensional data model are named datacubes (see Fig. 1). OLAP is a set of techniques and operations allowing the querying and exploration of the multi-dimensional data model. Spreadsheets and reports are the main front-end applications for OLAP.

Fig. 1. A datacube with sales as the measure and time, product and location as dimensions (source: Han [15]).
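A minimal sketch of the multi-dimensional model (with invented products, cities and sales figures): each cell of the cube is a sales measure indexed by (product, city, day) coordinates, and a roll-up aggregates the time dimension from the day level up to the month level of its hierarchy.

```python
from collections import defaultdict

# Cell coordinates: (product, city, day) -> sales measure (invented figures)
cube = {
    ("laptop", "Rome",  "2001-03-11"): 5,
    ("laptop", "Rome",  "2001-03-20"): 3,
    ("laptop", "Milan", "2001-04-02"): 7,
    ("phone",  "Rome",  "2001-03-15"): 9,
}

def roll_up_to_month(cube):
    """Aggregate the day level of the time hierarchy up to the month level."""
    monthly = defaultdict(int)
    for (product, city, day), sales in cube.items():
        month = day[:7]                      # "YYYY-MM"
        monthly[(product, city, month)] += sales
    return dict(monthly)

print(roll_up_to_month(cube))
# {('laptop', 'Rome', '2001-03'): 8, ('laptop', 'Milan', '2001-04'): 7,
#  ('phone', 'Rome', '2001-03'): 9}
```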

Fig. 2. Data warehousing multi-tier architecture (source Han).

2.2.2. Data warehouse architectures

Fig. 2 [6,15] shows the architecture of a typical data warehouse. It comprises the back-end tools and utilities that extract, transform, and load data from the data sources (operational databases, etc.) into the data storage, which is the core of the data warehouse. Here, data can be stored using a relational data model (e.g., star schema) or a multi-dimensional data model implemented by arrays. The OLAP engine implements the OLAP operations by means of lower level operations. Finally, usually at the user layer, we have tools for data analysis, reporting, querying and data mining. OLAP servers are usually implemented using different approaches, depending on the degree of integration with the relational database.
1. Relational OLAP (ROLAP) servers use relational or extended-relational DBMSs to store and manage warehouse data, and OLAP middleware to support the missing pieces. The main strength of ROLAP servers is that they exploit the scalability and the transactional features of relational systems.
2. Multi-dimensional OLAP servers directly support the multi-dimensional view of data through an array-based multi-dimensional storage engine (sparse matrix techniques) and fast indexing to pre-computed summarized data. This approach has the advantage of good indexing properties, but provides poor storage utilization, especially when the data set is sparse.
3. Hybrid OLAP servers use a relational database at the low level and arrays at the high level.
4. Specialized SQL servers provide advanced query language and query processing support for SQL queries over star and snowflake schemas in read-only environments.

Data warehousing systems use many data extraction and cleaning tools, and load and refresh utilities for populating warehouses. In short, the main back-end tools and utilities that complete a data warehouse are:
• data extraction: get data from multiple, heterogeneous, and external sources;
• data cleaning: detect errors in the data and rectify them when possible;
• data transformation: convert data from legacy or host format to the warehouse format;
• load: sort, summarize, consolidate, compute views, check integrity, and build indices and partitions;
• refresh: propagate the updates from the data sources to the warehouse.

2.2.3. Parallel processing in data warehouses

In data warehouses, i.e. in the read-only environments of decision support systems, parallelism is used in low-level operations, such as scan and indexing, as well as in high-level OLAP operations. For example, piggybacking scan (used in [16]) reduces the total work as well as the response time by overlapping the scans of multiple concurrent requests. Moreover, multi-dimensional analysis in OLAP and in scientific and statistical databases uses operations requiring summary information on multi-dimensional data sets. These are aggregate operations along one or more dimensions of numerical data values and/or on hierarchies defined on them.

In recent years, researchers have started to investigate how to parallelize data warehouse operations. Here we provide a partial summary. Garcia-Molina et al. [17] highlight the need for providing parallelism in all the phases of the data warehouse life cycle, from maintenance to OLAP execution. They describe some of the distributed and parallel issues of creating, maintaining and querying a data warehouse. A very complex task is the efficient maintenance of a data warehouse, which becomes even more challenging when parallel and distributed algorithms are used. Since the warehouse update is performed by a set of distributed processes that receive data from autonomous sources, maintaining warehouse consistency is a difficult task. In general, there are different ways to implement the maintenance tasks, so their parallelization is an important research area with important implications for performance. Another task that could benefit from a parallel/distributed implementation is the warehouse load, because traditional approaches either have too much overhead or repeat all of the work in case of an interruption.

The CUBE operator, introduced by Gray et al. [18], generalizes the GROUP-BY operator by computing aggregates for every possible combination of the specified attributes. The data structure used to implement the datacube is the main issue in computing the CUBE and other complex OLAP operations. There is a trade-off between reducing the storage space and the corresponding increase in access time for each sparse data structure, in comparison to multi-dimensional arrays. This trade-off depends on many parameters, such as the number of dimensions, the size of each dimension and the degree of sparsity of the data. Complex operations such as those required for OLAP can be very expensive in terms of data access time if efficient data structures are not used.
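The following sketch shows what the CUBE operator computes, not how Gray et al. implement it: for every subset of the specified attributes it groups the rows on that subset and sums the measure, the empty subset yielding the grand total. The attribute names and rows are invented.

```python
from itertools import combinations
from collections import defaultdict

rows = [  # (product, city, sales) -- invented sample data
    ("laptop", "Rome", 5), ("laptop", "Milan", 7), ("phone", "Rome", 9),
]
attributes = ("product", "city")

def cube(rows, attributes):
    """Aggregate the measure for every combination of the grouping attributes."""
    results = {}
    for k in range(len(attributes) + 1):
        for group_by in combinations(range(len(attributes)), k):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in group_by)  # values of the kept attributes
                totals[key] += row[-1]                 # the measure is the last column
            results[tuple(attributes[i] for i in group_by)] = dict(totals)
    return results

for group_by, totals in cube(rows, attributes).items():
    print(group_by or ("ALL",), totals)
# (): grand total 21; ('product',): laptop 12, phone 9; ('city',): Rome 14, Milan 7;
# ('product', 'city'): one entry per base cell
```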

Recently, the original sequential algorithm for CUBE has been enhanced by means of shared-memory or message-passing parallel implementations. The work by Ng et al. [19] presents four parallel versions of the Iceberg-Cube algorithm on a cluster of PCs. The algorithms are scalable, but their performance depends on many parameters, such as the density of the cube, low versus high dimensionality, and support for truly online processing. Thus, there is no algorithm that works efficiently in all situations. Datta et al. [20] propose parallel processing techniques for the data warehousing environment. They offer some partitioning and indexing techniques called DataIndexes for a shared-nothing architecture. Goil and Chaudhary, in the PARSIMONY system, suggest a chunking method to provide a multi-dimensional index structure for efficient dimension-oriented data access for OLAP [21]. PARSIMONY is a system implementing OLAP operations on multi-dimensional datacubes. Performance results for high-dimensional data sets on a distributed memory parallel machine (IBM SP-2) show good speedup and scalability. Although, as claimed by the authors, this research could be extended to other shared-nothing environments, the cluster environment offers several new challenges. Rudra and Gopalan [22] describe some problems and issues arising when non-dedicated clusters are used to run data warehousing applications. In a dedicated cluster, parallel computation can be exploited on the entire cluster, as the resources are shared by all the workstations. On the other hand, in a non-dedicated cluster environment, individuals own the workstations and applications are executed by stealing idle cycles. Parallel computing on a dynamically changing set of non-dedicated workstations is called adaptive parallel computing. This adaptive nature of non-dedicated clusters introduces a new problem in exploiting parallelism in data warehouses. Rudra and Gopalan investigate the use of adaptive clusters as applied to data parallelism in data warehousing and propose three adaptation strategies.

2.3. Parallel data mining

Data mining is the semi-automated analysis of large volumes of data, looking for relationships and knowledge that are implicit in the data and are 'interesting' in the context of an organization's use of the data. Research and development work in the area of knowledge discovery and data mining concerns the study and definition of techniques, methods, and tools for the extraction of novel, useful, and implicit patterns from data. Knowledge discovery in large data repositories aims to find what is interesting and to represent it in an understandable way [23]. Mining large data sets requires large computational resources. In fact, data mining algorithms working on very large data sets take a very long time on conventional computers to produce results. One approach to reduce response time is sampling. But in some cases reducing the data may result in inaccurate models, and in other cases it is not useful at all (e.g. outlier identification).

The other approach is parallel computing. High performance computers and parallel data mining algorithms offer the most efficient way to mine very large data sets [24,25]. Several parallel data mining algorithms have been designed for association rule discovery, classification and clustering [26]. It is not uncommon to have sequential data mining applications that require several days or weeks to complete their task. Parallel computing systems can therefore bring significant benefits in speeding up data mining and knowledge discovery applications by exploiting the inherent parallelism in data mining algorithms. There are three broad strategies: independent parallelism, task parallelism, and SPMD parallelism. In task parallelism (or control parallelism) each process executes different operations on (a different partition of) the data set. In SPMD parallelism a set of processes executes the same algorithm in parallel on different partitions of the data set, and the processes exchange partial results. Finally, independent parallelism is exploited when processes are executed in parallel in an independent way; generally each process has access to the whole data set. These three strategies are not necessarily alternatives for parallelizing data mining algorithms: they can be combined to improve both the performance and the accuracy of results. In particular, some hybrid approaches have shown interesting scalable performance in the implementation of parallel association rule and classification algorithms [26,27].

In combination with the parallelization strategies, data partition strategies are also important. In parallel data mining, different data partition strategies can be used:
• sequential partitioning: separate partitions are defined without overlapping among them;
• cover-based partitioning: some data can be replicated on different partitions;
• range-based query partitioning: partitions are defined on the basis of queries that select data according to attribute values.

As in other data intensive applications, some architectural issues should be addressed in the parallel implementation of data mining techniques:
• distributed memory versus shared memory implementation,
• interconnection topology of processors,
• optimal communication strategies,
• load balancing of parallel data mining algorithms,
• memory usage and optimization, and
• I/O impact on algorithm performance.

These issues (and others) must be taken into account in the process of exploiting parallelism in data mining algorithms. The architectural issues are strongly related to the parallelization strategies, and there is a mutual influence between the knowledge extraction strategy and the architectural features. For instance, increasing the degree of parallelism in some cases corresponds to an increase of the communication overhead among the processors. However, communication costs can also be balanced by the improved knowledge that a data mining algorithm can gain from parallelization: at each iteration, the processors share the approximate models each of them has produced, so each processor executes the next iteration using its own previous work and also the knowledge produced by the other processors. This approach can improve the rate at which a data mining algorithm finds a model for the data (knowledge) and make up for the time lost in communication.
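A toy sketch of the SPMD strategy with the exchange of partial models described above: each worker runs the same clustering step (a simple k-means-like update) on its own data partition, and after every iteration the partial results are merged into a single model that all workers use in the next iteration. The data and the number of workers are invented; this is not a reference implementation of any of the cited systems.

```python
from multiprocessing import Pool
import random

def local_step(args):
    """One SPMD iteration on a local partition: assign points, sum per cluster."""
    points, centroids = args
    sums = [[0.0, 0] for _ in centroids]           # (sum, count) per centroid
    for p in points:
        i = min(range(len(centroids)), key=lambda c: abs(p - centroids[c]))
        sums[i][0] += p
        sums[i][1] += 1
    return sums

def spmd_kmeans(partitions, centroids, iterations=5):
    with Pool(len(partitions)) as pool:
        for _ in range(iterations):
            partials = pool.map(local_step, [(part, centroids) for part in partitions])
            # Exchange step: merge the partial models produced by all workers
            merged = [[sum(p[i][0] for p in partials), sum(p[i][1] for p in partials)]
                      for i in range(len(centroids))]
            centroids = [s / c if c else centroids[i]
                         for i, (s, c) in enumerate(merged)]
    return centroids

if __name__ == "__main__":
    random.seed(0)
    data = [random.gauss(0, 1) for _ in range(500)] + [random.gauss(10, 1) for _ in range(500)]
    partitions = [data[i::4] for i in range(4)]    # 4 SPMD workers, 4 partitions
    print(spmd_kmeans(partitions, centroids=[1.0, 9.0]))
```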

Parallel execution of different data mining algorithms and techniques can be integrated not just to obtain high performance but also to obtain a better, more accurate model. Here we list some promising research directions in the parallel data mining area:
• parallel algorithms, environments and tools for interactive high performance data mining and knowledge discovery;
• parallel text mining;
• parallel and distributed Web mining;
• integration of parallel data mining with parallel data warehouses;
• parallelizing all phases of the knowledge discovery in data (KDD) process and support of efficient data warehouses.

Besides these very promising areas, we would like to mention the importance of the integrated use of clusters and grids for distributed and parallel knowledge discovery. Grid-integrated clusters of computers that execute the same or different data mining or KDD algorithms can be seen as massively parallel computers that mine very large data sets. The development of software architectures, environments and tools for grid-based data mining will result in grid-aware parallel and distributed knowledge discovery (PDKD) systems that will support high performance data mining applications on geographically distributed data sources [28]. Two books [29,30] describe a large number of parallel data mining algorithms and systems and present distributed data mining frameworks.

2.4. From OLAP to online analytical mining

The high quality of data in data warehouses (integrated, consistent, cleaned data), the availability of the information processing infrastructure surrounding data warehouses (ODBC, OLEDB, Web access, service facilities, reporting and OLAP tools), and the developments in data mining algorithms and tools are the driving forces that will accelerate the evolution of OLAP toward online analytical mining (OLAM) [15]. This new class of applications will provide OLAP-based exploratory data analysis, such as data mining with drilling, dicing and pivoting, and online selection of data mining functions, such as the integration and swapping of multiple mining functions, algorithms, and tasks. In almost all of the tasks of OLAM and OLAP it is possible to exploit parallelism or the distribution of data. Research in the field of parallel and distributed data mining has been outlined in the previous section, whereas the exploitation of parallelism in OLAM engines is a future research theme. The paper [31] presents an application of OLAP and data mining in the biotechnology industry. In addition, the authors indicate future developments of such systems, showing how data warehouses and the related techniques (OLAP and DM) can be usefully applied to the computational science domain.

2.5. Web and digital libraries

In recent years the Internet, and in particular the World Wide Web (WWW), has emerged as the ideal platform for developing globally available distributed applications. The Web uses a powerful communication paradigm (TCP/IP with all its application protocols, such as HTTP, SMTP, etc.) to offer easy yet rich user access through browsing and multimedia. Moreover, its openness to new architectural and communication standards (e.g. the emerging peer-to-peer protocols and their related applications) facilitates both the integration of different programming paradigms and the support of different platforms and user terminals. As an example, the Java platform scales from a powerful parallel multiprocessor down to a small, simple smart card.

Web applications, i.e. applications designed with WWW standards, or in general to be used over the Internet, such as electronic commerce, search engines, digital portals, digital libraries (DLs), distance learning, etc., combine different features that make them radically different from previous applications of information technology (IT) [32]. The main reasons for these differences are:
1. Web applications are a hybrid between hypermedia [33] and information systems (IS).
2. Web applications are usually accessed globally, i.e. by different classes of users, distributed worldwide, using different devices.

As in hypermedia, information is accessed in an exploratory way, differently from information systems, where access is provided through fixed or form-based interfaces. Moreover, the way in which the hypermedia is explored and presented can change the subsequent content and layout of information. The navigational property of hypermedia, i.e. their graph-based representation, adds new ways to manipulate and search documents, not available or easily obtainable using the traditional (relational) IS data model. As in information systems, the increasing size of managed data, the increasing demand for availability and reliability and the worldwide distribution of applications require the use of consolidated and reliable architectural solutions, such as DBMSs, client-server computing, and (distributed) object-oriented computing (e.g. CORBA). Nevertheless, emerging computing platforms, such as Enterprise Java Beans or Microsoft COM+ and .NET, are being integrated into traditional development platforms.

The peculiar requirements of web-based applications are [34]:
1. The necessity to support heterogeneous classes of users (in terms of interests, computer skills, etc.). New man-machine interfaces are therefore needed to provide service to different classes of users and different classes of terminals. The requirements are:

• customization and possibly dynamic adaptation of content structure, presentation layout and navigation primitives (e.g. related-information buttons, summary buttons, etc.);
• support of exploratory access through navigational interfaces, offering a high level of graphical quality, animation, and different kinds of rendering;
• support of proactive behavior, i.e. recommendation and filtering.
2. Ubiquitous application availability (anytime, anywhere, anyhow). The application has to be available at any time and from any place where the user could potentially be, i.e. accessible via wired and wireless devices and networks, and it should be accessed in a "scalable" manner (anyhow) by heterogeneous user terminals characterized by different user interfaces, processing and storage capacities, etc., such as PCs, PDAs, WAP phones, handheld devices, etc.
3. Use of globally available information sources. This requires managing a mix of structured (databases) and semi-structured (file systems, HTML, XML, multimedia) data sources, distributed over the Internet, with different security policies [35].

Moreover, we have the usual requirements of IS and data-intensive applications: (1) performance, security, scalability, and availability; (2) interoperability with legacy systems and integration of heterogeneous data sources; (3) ease of evolution and maintenance.

The architectural solution largely adopted to deploy web applications is the multi-layered client/server architecture (see Fig. 3). As an example, Fig. 3 shows a 4-tier architecture used to deploy WAP-based applications, i.e. applications that are accessed via wireless networks using terminals employing the Wireless Application Protocol [36]. We use this example to illustrate the details of the architecture layers as follows:

Fig. 3. An example of a multi-layer client-server architecture.

1. User layer: Essentially, it is a wired or wireless device running a browser and a communication protocol. Wired terminals usually run an HTML browser communicating via HTTP over TCP/IP; wireless terminals can employ different, specialized browsers (e.g. a WML micro-browser) and protocols (e.g. WAP).
2. Network layer: This layer adapts the protocols and formats between the Application and the User layers. In traditional web applications this layer simply does not exist. In advanced, multi-channel applications, the Network layer is essential to decouple different environments. As an example, it is common to have an application that can be accessed via a GSM phone using WAP or SMS, or via a plain telephone using DTMF tones and text-to-speech functionality. In this case, the Network layer, which acts as an HTTP proxy, has to adapt the web environment to both wired and wireless telephone environments.
3. Application layer: It implements the so-called application business logic. It can simply be an HTTP server, but is usually a more complex suite of software modules, named an Application Server. In traditional web applications, e.g. an IS adapted to be accessed from the Web, the Application Server can be a wrapper around a legacy application, or can directly implement the logic of the IS. On the other hand, in modern web applications it can also implement the adaptive part of the application, needed to deal with the different classes of users accessing the system. This component can "simply" adapt the content to different terminals, or it could use web mining techniques to classify the user's behavior before the adaptation process.
4. Data layer: It is usually composed of a set of databases; in modern applications, it can also integrate different sources of data, such as legacy applications and semi-structured data (HTML, XML).

Enhancing the performance of a Web application requires enhancing each layer of the implementing architecture, avoiding bottlenecks between the different layers. Moreover, due to the overwhelming number of requests coming to a globally available application, techniques to balance the requests among different servers must be used. To face the problems related to the efficient collection of requests along the network, content delivery networks (CDNs) are usually employed. They are on the frontier between high performance distributed computing and distributed caching; the interested reader can refer to [37]. The load on a globally available web application is usually a mix of:
• a large number of small-sized tasks requested by remote users (e.g. search requests to a search engine, reservations to a travel agency, and so on);
• a smaller number of complex tasks (e.g. maintenance commands).
This load involves both the Application and the Data layers. Techniques and methodologies for parallel databases and data warehouse systems have already been reported, so we concentrate on the parallel implementation of application servers. It is also worth mentioning that new architectures employing fully distributed and decentralized paradigms, such as those based on peer-to-peer protocols, are emerging; at present they only allow the sharing of data (Morpheus, Napster, etc.); details can be found in [38].

2.5.1. Scalable internet servers

Due to the popularity of many Internet applications, the load on Internet servers (web servers, proxy/caching servers, firewalls, etc.) is rapidly increasing. The next generation of Web services will have to support a very high rate of requests with varying needs, imposing a big challenge on the networking infrastructure and the server environments. Since network bandwidth increases faster than server capacity, the server could be the next bottleneck of Web services. Recent studies [39] predict that in a few years Internet servers able to sustain 10 million hits/min will be needed. At the same time, the servers must exhibit a very high level of robustness and availability while maintaining short transaction response times. The efficient execution of Web applications will need to employ parallelism in the Application layer and, above all, in the Web server: (1) intra-server parallelism, i.e. parallel execution of the web server tasks (e.g. a task or a thread for each HTTP request); (2) inter-server parallelism, i.e. concurrent execution of different web servers.

The major technical problems facing the designers of highly scalable Internet servers are [40]:
1. The servers should provide high performance, high scale-up to handle huge content and hit rates, and high speed-up to accommodate dynamic content generation. Moreover, the architecture should adapt to increasing LAN speeds.
2. Support of distributed Internet services by means of peer-to-peer computing.
3. The design of web server O/S and I/O needs to be revisited in the context of Internet workloads.
4. Support for quality of service.
5. Overload and load control should be addressed and efficiently managed, as in telecommunications.
6. Support of high performance, large data repositories for server performance and traffic logs; these logs will increasingly serve as the basis for data mining and dynamic user profiling.
7. Support of wireless services; the limited power of wireless terminals requires asymmetric protocols that concentrate the processing load on the server, such as dynamic adaptation of content and layout to different user and terminal needs.

Another important issue is the implementation of large server farms. Internet Service Providers need to implement different web services (e.g. a different web site for each customer) using a single point of access. We briefly comment on the common name-based virtual hosting technique, available with HTTP 1.1. Major HTTP servers now allow name-based virtual hosting (or host header support) rather than the older IP-based virtual hosting. In particular, HTTP 1.1 compliant web servers can be configured to use the name-based virtual hosting feature to support multiple web sites with a single IP address. The combined use of the name-based virtual hosting feature and DNS Round Robin techniques allows achieving highly available Web server clusters (using processor redundancy to enhance availability) or highly parallel Web server clusters (using processor redundancy to increase concurrency) [41,42].
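A minimal sketch of name-based virtual hosting (host names and pages are invented): a single server bound to one IP address inspects the Host header of each HTTP/1.1 request and serves the site it maps to; handling each request in its own thread is a simple form of the intra-server parallelism mentioned above, and DNS Round Robin could spread clients across several such servers.

```python
from http.server import ThreadingHTTPServer, BaseHTTPRequestHandler

# One IP address, several logical web sites selected via the Host header
SITES = {
    "www.example-shop.test": b"<h1>Shop front page</h1>",
    "www.example-news.test": b"<h1>News front page</h1>",
}

class VirtualHostHandler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        host = (self.headers.get("Host") or "").split(":")[0]
        body = SITES.get(host, b"<h1>Unknown site</h1>")
        self.send_response(200 if host in SITES else 404)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # ThreadingHTTPServer handles each request in its own thread
    ThreadingHTTPServer(("", 8080), VirtualHostHandler).serve_forever()
```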

2.5.2. Digital libraries

DLs [43,44] are probably the most significant examples of data-intensive Web applications. They provide access to heterogeneous digital contents, e.g. publications, multimedia, scientific data, etc., and represent an important top layer in the data management stack. They provide a more powerful set of tools to manipulate data sets, by supporting services on top of data repositories. Usually services can be invoked against any of the content stored in the DL, and often they are registered as methods of object-relational databases. Access to data sets is metadata-based, through catalog services. With the present state of the art, in the case of multimedia data sets (e.g. digital images, audio, video) the metadata is generated a priori, i.e. when the data is entered into the DL (examples: medical image databases, video clips from media suppliers, etc.). The limitation is that, if a relevant piece of metadata a user needs for a particular search was not extracted when the data was entered into the DL, the data cannot be retrieved. What is needed are dynamic retrieval techniques that can execute at run time (data search time) and that can be tailored to meet a particular user's demands. Examples of such techniques can be found in [83] for MPEG compressed news feeds and in [84] for images from digital image libraries. These techniques are still in their infancy, but it is clear that the processing needs of such systems are orders of magnitude higher than when (static) metadata is used. The data storage and handling problem remains the same; but added to this are the very compute intensive algorithms for the dynamic feature extraction needed to decide whether an object in the library matches (in some way) the search criteria.

Common services/requirements for DLs are: publication of new data sets, attribute-based information discovery, access to heterogeneous data resources, authentication and access control techniques, scheduling and load balancing for computing and I/O resources, and distributed execution of DL services [45]. Nevertheless, current implementations are loosely coupled to local resources, i.e. they only use local file systems and algorithms (services) are executed on local databases. Research topics in DLs include the full support of heterogeneous data integration and its generalization to a distributed environment where services are invoked against non-local resources. Since distributed DLs will require high-bandwidth networks, high-performance computers and common authentication and security mechanisms, they are natural candidates to be implemented over computational grids, leveraging ongoing efforts in basic and data-management services developed in the grid field.

2.6. Data intensive grids

Computational grids are geographically distributed environments for high performance computation that manage a large number of systems and offer them as a unified system accessible to users via a single interface.

Grid middleware offers services for managing large numbers of diverse computational resources administered by independent organizations, and provides application developers with a simplified view of the resulting computational environment. These grids provide the computational infrastructure for powerful new tools for scientific investigation and virtual organization operations, including desktop supercomputing, smart instruments, collaborative environments, and distributed supercomputing.

During the development and use of grids, people realized that access to distributed data is typically as important as access to distributed computational resources. Distributed scientific and engineering applications typically require access to large amounts of data; moreover, grid applications also require widely distributed access to data from many places by many people (for example, as in virtual collaborative environments). In many cases, application requirements call for the ability to read large datasets and to create new datasets; they often do not require the ability to change existing datasets. Consequently, the so-called Data Grids have been developed. A Data Grid is a distributed infrastructure that stores large local datasets, holds locally stored replicas of datasets from remote locations, and accesses remote datasets that are not replicated locally. In order to support these features, Data Grid middleware provides both a means of managing different types of datasets and a number of basic mechanisms that generalize the requirements of most data intensive applications. The main goal is to offer a generic infrastructure in the form of core data transfer services and generic data management libraries. These fundamental building blocks can then be used in a variety of interesting ways to build systems and applications. Three important research projects aiming at the development of Data Grids are the Globus Data Grid, the European Data Grid and the Particle Physics Data Grid.

The Globus Data Grid project [46] is currently engaged in defining and developing a persistent Data Grid environment that offers:
• a high-performance, secure, robust data transfer mechanism,
• a set of tools for creating and manipulating replicas of large datasets,
• a mechanism for maintaining a catalog of dataset replicas.

The Globus Data Grid services are:
• GridFTP: this service provides secure FTP mechanisms to move data; the protocol provides a superset of the features offered by the various Grid storage systems currently in use;
• RSDF: replica selection and data filtering services select a replica that will provide an application with data access characteristics that optimize a desired performance criterion;
• RM: the Replica Manager service creates (or deletes) copies of file instances, or replicas, within specified storage systems;
• MS: metadata information management services provide a means for publishing and accessing metadata about the data grid itself, including information about file instances and the contents of file instances;
• SS: storage system and data access services provide functions for creating, destroying, reading, writing, and manipulating file instances and storage systems.
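The replica selection idea behind a service such as RSDF can be illustrated with a purely hypothetical sketch that does not use the Globus APIs: given a catalog of replica locations with estimated bandwidth and latency, choose the replica that minimizes the estimated transfer time for a file of a given size.

```python
# Hypothetical replica catalog: file -> list of (site, bandwidth MB/s, latency s)
CATALOG = {
    "lhc/run42/events.dat": [
        ("cern.ch",     50.0, 0.020),
        ("in2p3.fr",    80.0, 0.035),
        ("clemson.edu", 10.0, 0.120),
    ],
}

def estimated_transfer_time(size_mb, bandwidth_mbs, latency_s):
    return latency_s + size_mb / bandwidth_mbs

def select_replica(logical_name, size_mb):
    """Choose the replica with the smallest estimated transfer time."""
    replicas = CATALOG[logical_name]
    return min(replicas,
               key=lambda r: estimated_transfer_time(size_mb, r[1], r[2]))

if __name__ == "__main__":
    site, bw, lat = select_replica("lhc/run42/events.dat", size_mb=2000)
    print(f"fetch from {site}: ~{estimated_transfer_time(2000, bw, lat):.0f} s")
```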

For each of the listed services, a C and/or Java application programming interface (API) is defined for use by developers; the higher-level components of a PDKD system can use these services by calling the corresponding APIs.

The European Data Grid [47] is a project funded by the European Union with the aim of setting up a computational and data-intensive grid of resources for the analysis of data coming from scientific exploration. The project, started in 2001, is developing a grid infrastructure for handling, and storing in a distributed way, the very large amount of data coming from the LHC detector at a rate of 100 megabytes/s, which corresponds to about 8.6 terabytes/day. This massive production of data related to physics experiments will result in about 3 or more petabytes per year that will be stored at different sites located in Europe, which will also hold replicas of data coming from the CERN laboratories. The main goal of the Data Grid initiative is to develop and test the technological infrastructure that will enable the implementation of scientific "collaboratories" where researchers and scientists will perform their activities regardless of geographical location. It will also allow interaction with colleagues from sites all over the world as well as the sharing of data and instruments on a scale previously unimaginable. The project will devise and develop scalable software solutions and test beds in order to handle many petabytes of distributed data, tens of thousands of computing resources (processors, disks, etc.), and thousands of simultaneous users from multiple research institutions. The project covers three real data intensive computing applications: (1) high energy physics (HEP), (2) biology and medical image processing, and (3) EO.

The Data Grid project participants are directly collaborating with key members of the Globus project and with the Grid Physics Network (GriPhyN) project funded by the US National Science Foundation. The software produced will extend the state of the art in international, large-scale, data intensive grid computing within the framework of the Global Grid Forum, the initiative born to coordinate US, European and Asian projects. Instead of a flow of people and resources to the laboratory, Data Grids send data out from the laboratories to the universities and research centers, where direct access to the data by faculty, research staff and students will vastly increase its impact. Demonstrations taking advantage of direct data access would extend this educational impact to the public, providing a rare opportunity to see science in the making and inspiring the next generation of scientists.

A Data Grid combines computing, data storage, and networking in a highly transparent infrastructure that can be mined efficiently by the research community to develop scientific results. There are important advantages that derive from deploying computing resources in the form of a Data Grid. A highly decentralized grid will enable a user situated anywhere in the world to efficiently access data and mobilize large-scale resources for scientific analysis or for decision support systems. But the most profound result is the flow of data outward from the laboratories to their intellectual communities, strengthening the research infrastructure and broadening access to researchers, students and professionals.

3. Parallel hardware and storage systems

Data intensive applications demand high performance, low-cost, highly reliable computing systems offering balanced computing, storage, I/O, and network communication performance. Cluster computing, a cost-effective technique to connect many small-scale computers to build a large-scale parallel computer, is the first step toward building more powerful, interconnected, high performance computing systems. Cluster computers put more demand on the I/O subsystem, and specifically on storage. The last few years have seen a dramatic increase in microprocessor performance, and computer system performance is doubling every 18-24 months (Moore's Law). On the other hand, the rate of increase in storage access speed is much lower than the rate of increase in processor performance. Improvements in disk access times must face mechanical constraints, so they have been less than 10% per year. Storage system performance is becoming a major bottleneck in system performance, and it will limit the speedup of sequential as well as parallel systems, as implied by Amdahl's law: "speedup is limited by the slowest system component". The lack of faster storage systems is probably the most important bottleneck in designing data intensive applications, since they involve large amounts of data, often stored in a hierarchy of storage systems (with decreasing performance from RAM to disk, CD-ROM, tape, etc.). Moreover, in parallel systems (such as clusters) it is necessary to improve I/O performance to balance the increasing processor performance. Hence, providing large storage capacity with high access speed is now a critical issue to be considered in the design of computer systems [48].

Research in the field of disk arrays and parallel I/O addresses both the large capacity and the high performance problems. Integrating multiple disks together allows creating highly reliable, mass storage systems, and accessing disk arrays in parallel allows improving I/O performance. An important aspect is the availability of the storage system, which is not enhanced simply by grouping more disks, but by using an integrated approach to solve the capacity and performance problems. The innovative redundant arrays of inexpensive disks (RAID) idea [49] improves the reliability, capacity, and performance of large-scale storage systems. Besides RAID, there are several methods to improve the performance of storage systems: parallel I/O, caching, pre-fetching, server-less file systems, and adaptive techniques based on storage access patterns. Caching and pre-fetching mechanisms keep the most used/useful data in a cache, using sophisticated techniques to increase the cache hit ratio, thus reducing disk I/O access time. Parallel file systems (PFSs) use several devices in parallel. Furthermore, some high performance storage subsystems such as storage area networks (SANs) and network-attached storage (NAS) provide efficient I/O interfaces with rich semantics. SANs enable the storage to be externalized from the server, and allow storage to be shared among multiple servers. Some of these systems are described in the following. Starting from the basic RAID architecture, some RAID-based high performance (distributed) mass storage systems have been developed, e.g. Petal and RAID-x. Also, the network-attached secure disk (NASD) is yet another new way to design network-attached storage systems.
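To see how a slow storage component caps overall speedup in the sense of Amdahl's law, the short calculation below (with an invented I/O fraction) assumes that the computational part of a job scales linearly while its I/O time does not.

```python
def overall_speedup(io_fraction, compute_speedup):
    """Amdahl-style bound: I/O stays serial, computation speeds up by compute_speedup."""
    return 1.0 / (io_fraction + (1.0 - io_fraction) / compute_speedup)

# A job spending 20% of its time in I/O (invented figure)
for p in (4, 16, 64, 1_000_000):
    print(p, round(overall_speedup(0.20, p), 2))
# 4 -> 2.5, 16 -> 4.0, 64 -> 4.71, and at most 1/0.20 = 5 no matter how many processors
```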

Often in high-performance storage systems the I/O bandwidth is limited by the memory bandwidth of the server. RAID-II addresses this problem by using a crossbar memory system (called the XBUS board) that connects the disks directly to the high-speed network. Using the crossbar, large data requests bypass the server, giving file server clients better disk array bandwidth. Petal is a collection of network-connected servers that cooperatively manage a pool of physical disks. Petal provides an abstraction of collections of data blocks, called virtual disks, and guarantees the availability of this block-organized storage system. NASD provides scalable storage bandwidth while avoiding the cost of dedicated servers in charge of transferring data from peripheral devices (e.g. SCSI) to clients. NASD implements four main ideas: direct transfer to clients, secure interfaces, asynchronous non-critical-path oversight, and variable-sized data objects. The RAID-x (redundant array of inexpensive disks at level x) architecture allows distributed I/O over server-less cluster computers. It employs orthogonal striping and mirroring across all distributed disks in the cluster, significantly improving parallel I/O bandwidth and hiding disk-mirroring overhead in the background. RAID-x thus enhances the scalability and reliability of cluster computing applications.

3.1. Parallel I/O and parallel file systems

File systems are fundamental components of operating systems that hide the low-level details associated with accessing storage devices and logically couple storage devices and caches. PFSs attempt to exploit the parallelism of high performance computers, specifically the availability of many I/O devices, to reduce the latency and increase the throughput of I/O operations. Moreover, they should hide the details of the physical storage organization and increase data availability and reliability by leveraging parallel storage systems.

Traditional parallel (e.g. MPP and SMP) systems use a small number of I/O nodes, i.e. nodes connected to some storage media (disk, tape, SAN, etc.), using several disks in parallel (e.g. RAID) to increase I/O bandwidth and data availability. Moreover, some consolidated strategies (e.g. two-phase I/O, data sieving and collective I/O) are commonly employed to improve the overall I/O throughput [50]. Nevertheless, these systems can suffer I/O bottlenecks, since many processors share a relatively small number of disks. On the other hand, cluster computers can provide a larger data storage capacity, since usually each node has at least one disk attached. The main drawback is that data has to be accessed through the cluster network, which is typically slower than the I/O bus. However, since each node has at least a local disk, the I/O bottleneck can be reduced by increasing locality in data access: only data that really has to be shared by different nodes needs to be sent across the cluster network. Locality in data access can be improved by adding more disks on some nodes. This also allows the overall storage capacity to be increased in a cost effective manner, without degrading the per-node I/O bandwidth and latency. However, it is a complex task to find the distribution of data that maximizes I/O throughput or, in other words, that reduces the need to move data between nodes. Besides application needs, there
are many parameters to be considered: data transfer rates and sizes of disks, network bandwidth, disk contention, network contention, and the capacity and speed of memory and cache modules. We briefly describe the most common parallel I/O techniques, which form the core of PFSs and of specialized parallel I/O applications.

3.1.1. Parallel I/O techniques

The main goals of parallel I/O techniques can be summarized as follows:

• increase the overall I/O bandwidth by using as many disks in parallel as possible;
• decrease the I/O latency by reducing the number of disk accesses;
• reduce the application execution time by overlapping computation, network communications and I/O operations.

I/O techniques can be implemented at different levels of the operating system; we can distinguish between the following three generic classes.

Application level techniques organize the application's memory such that main memory variables are directly mapped to disk objects. In this way, by using data where it is stored, it is possible to exploit data locality. In the two-phase method, all data requests are first searched locally and then sent to the requesting nodes over the network [51,52]. In data sieving, large sequential blocks of the disk are first read into main memory; when requested, smaller data blocks are extracted and delivered, avoiding disk latency [53].

Device level techniques, such as disk-directed I/O or server-directed I/O, reorganize the application's logical I/O requests and apply an optimal sequence of physical disk accesses, with the main goal of optimizing the application's performance [54,55].

Access anticipation methods, i.e. caching and pre-fetching techniques such as informed pre-fetching [56] or two-phase data administration [57], try to foresee (look ahead) the I/O access behavior of the application in order to reduce disk latency. The look-ahead can be based on the programmer's hints, can be learned from previous runs, or can be identified via runtime data flow analysis.

3.1.2. Parallel file systems

Parallel I/O techniques are the building blocks used to construct high-performance software architectures, in particular specialized application libraries and general-purpose PFSs. Some recent research prototypes attempt to introduce optimization techniques based on information provided by the user or extracted from the application's ''behavior'' (intelligent I/O systems).

Application libraries are sets of highly specialized I/O functions that allow experts of a specific application field to build high-performance solutions. However, such libraries are often highly specialized and complex to use for non-expert users, leading to monolithic software for specific problems. MPI-IO, the I/O extension of the standardized MPI library, is an example of such a library [57,58]. The Portable PFS (PPFS) is an application library that supports different application-level data management strategies [59]. As an example, an application using
PPFS can announce access patterns and control caching and pre-fetching. PFSs, on the other hand, attempt to solve the problems of application libraries by offering a standard interface to the storage system that is coherent with the primitives of sequential file systems. PFSs have received much attention from both industry and academia; currently available systems can be divided into three different classes: commercial PFSs, distributed file systems, and research prototypes.

Commercial PFSs: Every major commercial parallel computer vendor offers a proprietary PFS. Well-known proprietary PFSs are PFS for the Intel Paragon [60], PIOFS (Vesta) and GPFS for the IBM SP [61], HFS for the HP Exemplar [62], and XFS for the SGI Origin2000 [63]. Vesta is a PFS designed for cluster computers with parallel I/O subsystems, offering parallel file access. Unlike traditional file systems, where a file is a sequence of bytes, files in Vesta are sets of partitions (disjoint sequences of bytes) that are accessed in parallel. The partitioning (static or dynamic) reduces the need for synchronization and coordination during access. Some control over the layout of data is also provided, so the layout can be matched with the anticipated access patterns. These file systems are independent of the application, but due to their proprietary design they often lack flexibility, because only pre-designed, hardware-dependent access techniques are supported. Moreover, they are available only on the specific platforms on which the vendor has implemented them. However, vendors are moving their proprietary PFSs to the cluster arena; for example, CXFS is the cluster version of the SGI XFS file system, developed for Linux.

Distributed file systems are designed for distributed access to files from multiple client machines; in particular, they are not designed for the high-bandwidth concurrent writes that parallel applications typically require, so they are not usually suitable for MPP, SMP and cluster computers. Some notable distributed file systems are NFS [64], AFS/Coda [65], InterMezzo [66], xFS [67], and the global file system (GFS) [68]. GFS is a distributed file system that allows clients to connect to (centralized, yet parallel) storage devices through switched networks called SANs. GFS offers an interface to a virtual centralized file system: each client views a unique, virtually attached global storage device. To maintain consistency, GFS implements a locking mechanism. Zebra is a network file system that increases throughput by striping file data across multiple servers: it collects all the new data from each process into a single stream, which is then striped across multiple servers (a simple illustration of such striping is sketched below). This approach allows high performance for reads and writes of large files as well as for writes of small files, and an approach similar to RAID is used to increase availability. An extension of network file systems is represented by server-less network file systems: the nodes of the system cooperate as peers to provide file system services. They store, cache, or control basic data blocks using a decentralized approach; moreover, a node can take over a failed component, providing high availability through redundant data storage. Leveraging the speed of local area networks, these file systems provide better performance and scalability than traditional file systems.
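To make the striping idea concrete, the following is a minimal, purely illustrative C sketch of round-robin placement of stripe units across storage servers, in the spirit of Zebra or RAID-0 data layout; the stripe size and server count are assumptions, and real systems add logging, parity, caching and recovery on top of such a mapping.

#include <stdio.h>

#define STRIPE_BYTES 65536   /* 64 KB stripe unit (illustrative assumption) */
#define NUM_SERVERS  4       /* number of storage servers (assumption)      */

/* Map a logical file offset to (server, offset within that server's segment). */
static void map_offset(long logical, int *server, long *local)
{
    long stripe = logical / STRIPE_BYTES;   /* which stripe unit             */
    long within = logical % STRIPE_BYTES;   /* offset inside the stripe unit */
    *server = (int)(stripe % NUM_SERVERS);  /* round-robin placement         */
    *local  = (stripe / NUM_SERVERS) * STRIPE_BYTES + within;
}

int main(void)
{
    long offsets[] = { 0L, 70000L, 300000L };
    for (int i = 0; i < 3; i++) {
        int server;
        long local;
        map_offset(offsets[i], &server, &local);
        printf("logical offset %7ld -> server %d, local offset %ld\n",
               offsets[i], server, local);
    }
    return 0;
}

Reading or writing a large sequential range then touches all servers in turn, which is how striping turns a single logical stream into parallel transfers.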
Research PFSs: Some research projects in the areas of parallel I/O and PFSs are PIOUS [69], PPFS [59], Galley [70], Armada [71], Panda [72], ViPIOS [73] and the parallel virtual file system (PVFS). PIOUS focuses on viewing I/O from the viewpoint of
transactions, PPFS research focuses on adaptive caching and pre-fetching, and Galley looks at disk-access optimization and alternative file organizations. PVFS [74] is a PFS for Linux clusters. Although a research system, it provides high bandwidth for concurrent read/write operations from multiple processes or threads to a common file. Moreover, it supports multiple APIs: a native PVFS API, the UNIX/POSIX I/O API, and the MPI-IO API, so applications developed with the UNIX I/O API can be used with PVFS files. PVFS is primarily a user-level implementation, i.e. installing it does not require modifications to the Linux kernel. PVFS currently uses TCP for all internal communication; as a result, it does not depend on any particular message-passing library. On the other hand, this is also a current limitation of PVFS: even on fast gigabit networks, the communication performance is limited to that of TCP on those networks, which is usually unsatisfactory. Many of these research systems can be considered intelligent I/O systems: as in database systems, a logical I/O environment is provided in which the application developer describes what he/she wants, and the system tries to optimize the I/O requests and accesses. These systems are general and flexible, but they are mostly research prototypes, not intended for everyday use by others.
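To illustrate the kind of concurrent access these interfaces are designed for, the following is a minimal, hedged MPI-IO sketch in C in which every process reads its own disjoint block of a shared file with one collective call; the file path and block size are illustrative assumptions, and error handling is omitted for brevity. PFSs that implement the MPI-IO API, such as PVFS, can serve such collective requests with large, well-aligned accesses (e.g. via two-phase collective I/O).

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK_DOUBLES 1048576              /* doubles per process (assumption) */

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;
    MPI_Offset offset;
    double *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    buf = malloc(BLOCK_DOUBLES * sizeof(double));

    /* Hypothetical path on a volume served by a parallel file system. */
    MPI_File_open(MPI_COMM_WORLD, "/pvfs/dataset.bin",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* Each rank addresses a disjoint block of the common file. */
    offset = (MPI_Offset) rank * BLOCK_DOUBLES * sizeof(double);

    /* Collective read: the MPI-IO layer may merge the per-process
       requests into a few large, contiguous file accesses. */
    MPI_File_read_at_all(fh, offset, buf, BLOCK_DOUBLES, MPI_DOUBLE,
                         MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}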

4. Application areas

Examples of data intensive applications include satellite data processing, medical image databases, high performance relational databases, data and web mining, web search and indexing, telemicroscopy, distributed interactive simulation, high-volume visualization, and detailed modeling of complex phenomena. It is critical that future parallel machines be designed to accommodate the characteristics of these classes of data intensive applications. There are a number of data intensive application areas, including:

• Large scientific databases. Very soon, a few dozen petabyte archives with the appropriate data extraction mechanisms required to support gigabit applications will be available. Generally, scientific instruments and platforms create this data. Examples include astronomical databases, HEP databases, nuclear physics databases, climatic databases, genetics databases, medical databases, and databases of network data.
• Simulation data. With the increasing power of parallel computers and computer clusters, simulations can today produce terabyte-size data sets. Another source of data intensive applications arises when scientists try to access these simulations remotely.
• E-business data. With more and more business being conducted over the web, large e-business data sets are becoming common. Many of these data sources are distributed, with, for example, customer data in a data center in one city, online data in a data center in another city, and inventory data in a data center in a third city. Making informed decisions in real or near real time using this distributed data is another important source of gigabit applications.
• Internet games. Interactive online games over the Internet are becoming increasingly popular and will soon play a very important role in the entertainment market. As games become very complex and highly interactive, they involve large amounts of data to be stored and transmitted over the net.

4.1. Scientific applications

In some of these application areas Data Grids could play a significant role. In the future, companies and public administrations will enjoy the same benefits that follow from transparent computing and data access linking distant sites. This benefit will extend to every part of the economy as the computing domain expands to wireless devices and as mobile agents connect these elements to global networks. Given the sustained exponential increase of computing power and network throughput over time, this vision of computing and advanced networking could be realized by the end of the first decade of the 21st century. To illustrate how Data Grids can be used, we outline two data intensive projects in the scientific arena.

The Earth Systems Grid [75] is an experimental data grid for scientists collaborating on climate studies. The data is collected from ground- and satellite-based sensors or generated via simulations. Scientists can register and ''publish'' their data for use by the community. Applications allow scientists to specify parameters for climate model visualizations using intuitive settings and then gather the required data from the community's distributed data systems. A metadata catalog is used to identify relevant datasets and files, and Globus data grid technologies are used to locate ''nearby'' copies of the files and to transfer data to the local software using high-speed data transfer.

Another interesting project is GriPhyN. The GriPhyN collaboration is a team of experimental physicists and IT researchers who plan to implement the first petabyte-scale computational environments for data intensive science in the 21st century. The CMS and ATLAS experiments at the LHC at CERN will search for the origins of mass and probe matter at the smallest length scales; the laser interferometer gravitational-wave observatory (LIGO) will detect the gravitational waves of pulsars, supernovae and in-spiraling binary stars; and the Sloan Digital Sky Survey will carry out an automated sky survey enabling systematic studies of stars, galaxies, nebulae, and large-scale structure.

4.2. E-commerce

E-commerce applications are especially content-rich and data intensive. Parallel and distributed computing platforms with large storage facilities are essential to successfully integrate all components of a distributed business application and to ensure data integrity and acceptable response time. The growth of Internet technologies has unleashed a wave of innovations that are having a tremendous impact on the way organizations interact with their partners and customers. In order to survive the massive competition created by the new online economy, traditional business is under intense pressure to take advantage of the information revolution made possible by
the Internet and the web. Millions of organizations (of any size) are moving their main operations to the web for more automation, more efficient business processes, and global visibility. With the advent of the Internet and the web, both business-to-customer (B2C) e-services (e.g., virtual malls, customized news delivery, traffic monitoring and route planning) and business-to-business (B2B) e-services have been implemented. B2B e-services allow organizations to form alliances, joining their applications, databases, and systems in order to share their costs, skills and resources in offering value-added services. Examples of B2B e-services include procurement, customer relationship management, finance, billing, traffic information services, accounting, human resources, supply chain and manufacturing. Both these forms of e-commerce require the storage of, and access to, very large data repositories.

The infrastructure needed to support e-services is much broader than that of traditional transaction processing systems. Indeed, today's e-commerce systems are complex assemblies of web servers, databases, legacy applications, workflows, middleware, networking services, etc. E-service infrastructures need to address several challenging requirements, including interoperability, scalability, availability, quality of service, dependability, security, rapid development and deployment, evolution, and business process management. Considering the future of e-business, service providers will need hundreds of servers working in concert as digital marketplaces become increasingly complex. A digital marketplace not only has to process transactions, it also has to coordinate information from all participating systems so that all the buyers and sellers in the system have access to all relevant information in real time. This computing infrastructure will be based on distributed parallel computers equipped with data management capabilities and resources.

4.3. Search engines

Most studies on the amount of information on the Internet (i.e., volume, number of publicly accessible Web pages and hosts, number of users, etc.) show tremendous growth [76]. For example, the following numbers refer to the latest Matrix Information and Directory Services (MIDS) measures of Internet hosts:

• 1969: the ARPANET is established (SRI-Menlo Park and UCLA exchange test messages);
• 1999: 64,446,260 Internet hosts;
• 2001: 156,571,000 Internet hosts.

Although estimates differ, the size of the Internet appears to be growing at an exponential rate. Moreover, periodically checking older predictions, ''it appears that existing estimates significantly underestimate the size of the Web'' [77]. Given the enormous volume of the Internet, users are increasingly using search engines and search services to find specific information. Usually search engines index a portion of the WWW, producing an internal database. When asked, they all do keyword
searches against the database, but often various factors, such as the size of the database, the frequency of update, and the search capability and speed, may lead to different results. A recent survey on information retrieval on the Web can be found in [78].

Since search engines manipulate documents, full-text indexes are the usual solution. Traditionally these structures associate each indexed word with the documents in which the word occurs (i.e. their URLs) and possibly its position(s) in the document(s) (offset or relative position). This is efficient for answering possibly sophisticated keyword searches (a minimal sketch of this posting-list idea is given at the end of this subsection). To support thousands of simultaneous queries over millions of documents efficiently, many search engines, such as AltaVista or Google, use clusters of computers to retrieve the Web references and maintain structures in main memory that (i) cover the whole database, (ii) allow answering the most frequent queries fast, and (iii) minimize access to secondary memory when this is unavoidable. The main techniques used by search engines to classify web pages are:

• Indexing: ''indexing is building a data structure that will allow quick searching of the text'' [79]. Four approaches to indexing documents on the Web are (1) human or manual indexing; (2) automatic indexing; (3) intelligent or agent-based indexing; and (4) metadata- and annotation-based indexing.
• Clustering, i.e. grouping similar documents together to expedite information retrieval [80].
• Ranking algorithms. Many techniques have been developed for ranking retrieved documents for a given input query. Although information about the ranking algorithms of the major search engines is not publicly available, it seems that many of them use vector space models, i.e. each document is modeled by a vector whose coordinates represent the attributes of the document [81].

Meta-search engines or meta-crawlers (e.g. Dogpile, Mamma, Metacrawler, SavvySearch) send searches in parallel to several search engines. They usually do not have any internal database, since they use those of the search engines. Since meta-search engines do not allow many search variables, their best use is to find hits on obscure items or to see whether something is on the Internet at all. Given the increase in Internet volume and search engine usage, the volume of indexed web pages is also increasing at an exponential rate. Fig. 4(a) and (b) shows, respectively, the number of indexed pages as of 15 August 2001, and its rate of increase in the period December 1995–June 2001, for the major search engines (source: Search Engine Watch) [82]. The numerous available search engines are compared against each other by many web services available on the Internet.

In summary, searching the Internet is a challenging data intensive problem that can exploit parallelism, and currently does so, at least partially, at many levels: indexing and searching of the database, clustering of similar pages, and ranking and filtering of retrieved pages. From the architectural point of view, parallelism can be exploited at the Application layer (comprising the web server) and at the Data layer (parallel database systems). Of course, all the techniques for collecting user requests in an efficient way, avoiding congestion and overload, have to be employed, e.g. by distributing the load over many servers located at different points of the Internet, as in CDNs (see [37]).

Fig. 4. Indexing in search engines (GG = Google, FAST = FAST, AV = AltaVista, INK = Inktomi, WT = WebTop.com, NL = Northern Light, EX = Excite); source: Search Engine Watch.

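To make the inverted-index idea underlying keyword search concrete, the following is a minimal, purely illustrative C sketch: each indexed word maps to the list of documents (here, URLs) that contain it, and a query is answered by looking up that posting list. The vocabulary and documents are tiny assumptions; production engines also store word positions, compress the postings, and partition the index across many cluster nodes, as discussed above.

#include <stdio.h>
#include <string.h>

#define MAX_POSTINGS 8

struct posting_list {
    const char *term;                  /* indexed word                    */
    int doc_ids[MAX_POSTINGS];         /* documents containing the word   */
    int n;                             /* number of postings              */
};

/* Illustrative document collection (URLs). */
static const char *docs[] = {
    "http://example.org/parallel-io",
    "http://example.org/data-grid",
    "http://example.org/raid",
};

/* Illustrative inverted index: word -> posting list of document ids. */
static struct posting_list inv_index[] = {
    { "parallel", { 0, 1 }, 2 },
    { "storage",  { 0, 2 }, 2 },
    { "grid",     { 1 },    1 },
};

/* Keyword lookup: find the word in the vocabulary, then list its documents. */
static void search(const char *term)
{
    for (size_t i = 0; i < sizeof(inv_index) / sizeof(inv_index[0]); i++) {
        if (strcmp(inv_index[i].term, term) == 0) {
            printf("'%s' occurs in %d document(s):\n", term, inv_index[i].n);
            for (int j = 0; j < inv_index[i].n; j++)
                printf("  %s\n", docs[inv_index[i].doc_ids[j]]);
            return;
        }
    }
    printf("'%s' is not in the index\n", term);
}

int main(void)
{
    search("parallel");
    search("grid");
    return 0;
}

In a parallel setting the index, and hence the search, can be partitioned by term or by document across the nodes of a cluster, which is one way the parallelism mentioned above is exploited.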

5. Conclusion and future challenges

The dramatic increase of data stored in digital memories requires more and more efficient techniques for the storage, access and analysis of very large data repositories. In this setting, parallel and distributed computing can offer effective algorithms, techniques and tools for the efficient implementation of data intensive applications. Despite the current achievements, which are bringing many benefits in several scientific and commercial areas, data intensive applications today face many challenges, including:

• High level frameworks. Current network frameworks do not directly support the needs of data intensive application developers. Better tools for distributed data management and network utilization are needed for data intensive applications to be developed. Data grid middleware is a notable approach in this direction.
• Geographic distribution. As the amount of stored data grows and as the number of people interested in accessing it grows, new problems will emerge, including data distribution policies, data caching policies, data replication, metadata definition and management, and data multi-casting.
• Balancing network bandwidth with data archive sizes. Distributed applications require access to external memory data at comparable rates in order to build effective end-to-end applications.
• Efficient analysis. There are two basic ways of interacting with large data sets: extracting small subsets for further analysis and study, and looking for patterns and
correlations in large sets. The latter (data analysis or data mining) requires high-performance computational resources not just to get results in less time, but also to perform more accurate analyses that produce better models.

Acknowledgements

The authors are grateful to Dr. Gerhard Joubert for many helpful suggestions that improved the contents and the presentation of this paper.

References

[1] U.M. Fayyad, S.G. Djorgovski, N. Weir, Automating the analysis and cataloging of sky surveys, in: Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, Cambridge, 1996, pp. 471–494.
[2] K.H. Fasman, A.J. Cuticchia, D.T. Kingsbury, The GDB (TM) human genome database, Nucl. Acid. R. 22 (17) (1994) 3462–3469.
[3] W. Hoschek, J. Jaen-Martinez, A. Samar, H. Stockinger, K. Stockinger, Data management in an international data grid project, in: Proceedings of the IEEE/ACM International Workshop on Grid Computing Grid'2000, LNCS vol. 1971, Springer Verlag, 2000, pp. 77–90.
[4] S. Abiteboul, P. Buneman, D. Suciu, Data on the Web: From Relations to Semistructured Data and XML, Morgan Kaufmann Publishers, San Francisco, CA, 1999.
[5] H. Schoening, J. Wosch, Tamino––An internet database system, in: Proceedings of the 7th International Conference on Extending Database Technology (EDBT), LNCS vol. 1777, Springer Verlag, 2000.
[6] S. Chaudhuri, U. Dayal, An overview of data warehousing and OLAP technology, ACM SIGMOD Record 26 (1) (1997) 65–74.
[7] A. Reuter, Methods for parallel execution of complex database queries, Parallel Computing 25 (13–14) (1999) 2177–2188.
[8] D.J. DeWitt, J. Gray, Parallel database systems: the future of high performance database systems, CACM 35 (6) (1992) 85–98.
[9] M. Tamer Özsu, P. Valduriez, Distributed and parallel database systems, ACM Computing Surveys 28 (1) (1996) 125–128.
[10] M.J. Flynn, K.W. Rudd, Parallel architectures, in: The Computer Science and Engineering Handbook, 1997, pp. 482–495.
[11] T.F. Leighton, Introduction to Parallel Algorithms and Architectures: Algorithms and VLSI, Morgan Kaufmann Publishers, San Francisco, CA, 2001.
[12] J. Widom, Research problems in data warehousing, in: Proceedings of the 4th International Conference on Information and Knowledge Management (CIKM), November 1995.
[13] P. Vassiliadis, C. Quix, Y. Vassiliou, M. Jarke, Data warehouse process management, Information Systems 26 (3) (2001) 205–236.
[14] W.H. Inmon, Building the Data Warehouse, QED Technical Publishing Group, Wellesley, MA, 1992.
[15] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, CA, 2000.
[16] P.M. Fernandez, Red brick warehouse: a read-mostly RDBMS for open SMP platforms, in: Proceedings SIGMOD'94, 1994, p. 492.
[17] H. Garcia-Molina, W.J. Labio, J.L. Wiener, Y. Zhuge, Distributed and parallel computing issues in data warehousing, in: Proceedings of ACM Principles of Distributed Computing Conference, 1999.
[18] J. Gray, A. Bosworth, A. Layman, H. Pirahesh, Data cube: a relational aggregation operator generalizing group-by, cross-tab, and sub-total, in: Proceedings ICDE'96, 1996, pp. 152–159.
[19] R.T. Ng, A.S. Wagner, Y. Yin, Iceberg-cube computation with PC clusters, in: Proceedings SIGMOD 2001, 2001.
[20] A. Datta, B. Moon, H.M. Thomas, A case for parallelism in data warehousing and OLAP, in: Proceedings of DEXA Workshop'98, 1998, pp. 226–231.
[21] S. Goil, A.N. Choudhary, A parallel scalable infrastructure for OLAP and data mining, in: Proceedings of IDEAS 1999, 1999, pp. 178–186.
[22] A. Rudra, R. Gopalan, Adaptive use of a cluster of PCs for data warehousing applications: some problems and issues, in: Proceedings of ACM SAC 2000, vol. 2, 2000, pp. 698–703.
[23] M.J.A. Berry, G. Linoff, Data Mining Techniques for Marketing, Sales, and Customer Support, Wiley Computer Publishing, 1997.
[24] A.A. Freitas, S.H. Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic, Dordrecht, 1998.
[25] D. Skillicorn, Strategies for parallel data mining, IEEE Concurrency 7 (1999).
[26] M.J. Zaki, Parallel and distributed association mining: a survey, IEEE Concurrency 7 (1999) 14–25.
[27] M. Joshi, G. Karypis, V. Kumar, ScalParC: a scalable and parallel classification algorithm for mining large datasets, in: Proceedings of the International Parallel Processing Symposium, 1998.
[28] M. Cannataro, D. Talia, P. Trunfio, Knowledge grid: high performance knowledge discovery services on the grid, in: Proceedings 2nd International Workshop GRID 2001, LNCS 2242, Springer Verlag, 2001, pp. 38–50.
[29] M.J. Zaki, C.-T. Ho (Eds.), Large-Scale Parallel Data Mining, LNAI State-of-the-Art Survey, vol. 1759, Springer Verlag, 2000.
[30] H. Kargupta, P. Chan (Eds.), Advances in Distributed and Parallel Knowledge Discovery, AAAI Press, 2000.
[31] N. Huyn, Data analysis and mining in the life sciences, SIGMOD Record 30 (3) (2001) 76–85.
[32] B. Myers et al., Strategic directions in human–computer interaction, ACM Computing Surveys 28 (4) (1996) 794–809.
[33] J. Nielsen, Computer Science and Engineering Handbook, CRC Press, Boca Raton, FL, 1996.
[34] P. Fraternali, Tools and approaches for developing data-intensive web applications: a survey, ACM Computing Surveys 31 (3) (1999) 227–263.
[35] M.F. Fernandez, A. Morishima, D. Suciu, Efficient evaluation of XML middle-ware queries, in: Proceedings SIGMOD 2001, 2001.
[36] M. Cannataro, D. Pascuzzi, A component-based architecture for the development and deployment of WAP-compliant transactional services, in: Proceedings of HICSS-34, Hawaii, 3–8 January 2001, IEEE Computer Society Press, 2001.
[37] Scalable Internet Services, special issue, IEEE Internet Computing 5 (4) (2001) 36–75.
[38] K. Kant, R. Iyer, V. Tewari, On the potential of peer-to-peer computing: classification and evaluation, Technical Report, available at http://kkant.ccwebhost.com/download.htm.
[39] K. Kant, P. Mohapatra, Scalable internet servers: issues and challenges, in: 1st International Workshop on Performance and Architecture of Web Servers (PAWS-2000), 2000.
[40] K. Kant, P. Mohapatra, Current research trends in internet servers, in: 2nd International Workshop on Performance and Architecture of Web Servers (PAWS 2001), 16–17 June 2001, Cambridge, MA.
[41] Configuring Apache and IIS for High Availability Web Server Clustering, available at http://www.polyserve.com.
[42] Improving Apache, available at http://www.availability.com.
[43] Special Issue on Digital Libraries, Communications of the ACM 44 (5) (2001).
[44] The SMETE Digital Library, available at http://www.smete.org/.
[45] R.W. Moore et al., Data intensive computing, in: The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann Publishers, San Francisco, CA, 1998.
[46] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke, The data grid: towards an architecture for the distributed management and analysis of large scientific data sets, Journal of Network and Computer Applications (2001).
[47] H. Jin, T. Cortes, R. Buyya (Eds.), High Performance Mass Storage and Parallel I/O, Wiley Press, New York, USA, 2001.
[48] D.A. Patterson, G. Gibson, R.H. Katz, A case for redundant arrays of inexpensive disks (RAID), in: Proceedings of ACM SIGMOD 1988, 1988, pp. 109–116.
[49] E. Schikuta, H. Wanek, Parallel I/O, in: M. Baker (Ed.), Cluster Computing White Paper, University of Portsmouth, UK, 2000.
[50] R. Thakur, A. Choudhary, An extended two-phase method for accessing sections of out-of-core arrays, Scientific Programming 5 (4) (1996).
[51] A. Choudhary, R. Thakur, R. Bordawekar, S. More, S. Kutipidi, PASSION: optimized I/O for parallel applications, IEEE Computer 29 (6) (1996) 70–78.
[52] A. Choudhary, R. Bordawekar, M. Harry, R. Krishnaiyer, R. Ponnusamy, T. Singh, R. Thakur, PASSION: parallel and scalable software for input–output, SCCS-636, ECE Department, Syracuse University, September 1994.
[53] D. Kotz, Disk-directed I/O for MIMD multiprocessors, ACM Transactions on Computer Systems 15 (1) (1997) 41–74.
[54] K. Seamons, Y. Chen, P. Jones, J. Jozwiak, M. Winslett, Server-directed collective I/O in Panda, in: Proceedings of Supercomputing '95, IEEE Computer Society Press, San Diego, CA, 1995.
[55] A. Tomkins, R. Patterson, G. Gibson, Informed multi-process prefetching and caching, in: Proceedings of ACM Sigmetrics '97, Washington, June 1997.
[56] T. Fuerle, O. Jorns, E. Schikuta, H. Wanek, Meta-ViPIOS: harness distributed I/O resources with ViPIOS, Journal of Research Computing and Systems (1999), Special Issue on Parallel Computing.
[57] W. Gropp, E. Lusk, R. Thakur, Using MPI-2: Advanced Features of the Message-Passing Interface, MIT Press, Cambridge, MA, 1999.
[58] Message Passing Interface Forum, MPI-2: Extensions to the Message-Passing Interface, July 1997, available at http://www.mpi-forum.org/docs/docs.html.
[59] J. Huber, C.L. Elford, D.A. Reed, A.A. Chien, D.S. Blumenthal, PPFS: a high performance portable parallel file system, in: Proceedings of the 9th ACM International Conference on Supercomputing, ACM Press, Barcelona, 1995, pp. 385–394.
[60] Intel Scalable Systems Division, Paragon system user's guide, 1993.
[61] P.F. Corbett et al., Parallel file systems for the IBM SP computers, IBM Systems Journal 34 (2) (1995) 222–248.
[62] R. Bordawekar, S. Landherr, D. Capps, M. Davis, Experimental evaluation of the Hewlett–Packard Exemplar file system, ACM Sigmetrics Performance Evaluation Review 25 (3) (1997) 21–28.
[63] XFS: a next generation journalled 64-bit filesystem with guaranteed rate I/O, available at http://www.sgi.com/Technology/xfs-whitepaper.html.
[64] H. Stern, Managing NFS and NIS, O'Reilly & Associates, 1991.
[65] P.J. Braam, The Coda distributed file system, Linux Journal 50 (1998).
[66] P.J. Braam, M. Callahan, P. Schwan, The InterMezzo filesystem, in: Proceedings of the O'Reilly Perl Conference 3, August 1999.
[67] T.E. Anderson et al., Serverless network file systems, in: Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, ACM Press, 1995, pp. 109–126.
[68] K.W. Preslan et al., A 64-bit, shared disk file system for Linux, in: Proceedings of the Seventh NASA Goddard Conference on Mass Storage Systems, IEEE Computer Society Press, 1999.
[69] S.A. Moyer, V.S. Sunderam, PIOUS: a scalable parallel I/O system for distributed computing environments, in: Proceedings of the Scalable High-Performance Computing Conference, 1994, pp. 71–78.
[70] N. Nieuwejaar, D. Kotz, The Galley parallel file system, Parallel Computing 23 (4) (1997) 447–476.
[71] R. Oldfield, D. Kotz, The Armada File System, available at http://www.cs.dartmouth.edu/dfk/armada/design.html.
[72] K. Seamons, Y. Chen, P. Jones, J. Jozwiak, M. Winslett, Server-directed collective I/O in Panda, in: Proceedings of Supercomputing '95, IEEE Computer Society Press, San Diego, CA, 1995.
[73] T. Fuerle, O. Jorns, E. Schikuta, H. Wanek, Meta-ViPIOS: harness distributed I/O resources with ViPIOS, Journal of Research Computing and Systems (1999), Special Issue on Parallel Computing.
[74] P.H. Carns, W.B. Ligon III, R.B. Ross, R. Thakur, PVFS: a parallel file system for Linux clusters, in: Proceedings of the 4th Annual Linux Showcase and Conference, Atlanta, GA, 2000, pp. 317–327.
[75] I. Foster, S. Hammond, Prototyping an Earth System Grid, in: Workshop on Advanced Networking Infrastructure, National Center for Atmospheric Research, Boulder, CO, 1999.
[76] Matrix Information and Directory Services, available at http://www.matrix.net.
[77] S. Lawrence, C. Giles, Searching the World Wide Web, Science 280 (1998) 98–100.
[78] M. Kobayashi, K. Takeda, Information retrieval on the web, ACM Computing Surveys 32 (2) (2000).
[79] R.A. Baeza-Yates, Introduction to data structures and algorithms related to information retrieval, in: W.B. Frakes, R. Baeza-Yates (Eds.), Information Retrieval: Data Structures and Algorithms, Prentice-Hall, NJ, 1999, pp. 13–27.
[80] P.G. Anick, S. Vaithyanathan, Exploiting clustering and phrases for context-based information retrieval, SIGIR Forum 31 (1) (1997) 314–323.
[81] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, Addison-Wesley, Reading, MA, 1999.
[82] Search Engine Watch, available at http://searchenginewatch.com/.
[83] G. Falkemeier, G.R. Joubert, O. Kao, A system for analysis and presentation of MPEG compressed newsfeeds, in: J.-Y. Roger, B. Stanford-Smith, P.T. Kidd (Eds.), Business and Work in the Information Society: New Technologies and Applications, IOS Press, 1999, pp. 454–460.
[84] O. Kao, G.R. Joubert, Efficient dynamic image retrieval using the À Trous wavelet transformation, in: Proceedings of The Second IEEE Pacific-Rim Conference on Multimedia, Advances in Multimedia Information Processing, LNCS 2195, Springer, 2001, pp. 343–350.
