Industrial Analytics Pipelines

July 9, 2017 | Author: Jiang Zheng | Category: Design Patterns, Data Science, Big Data



Industrial Analytics Pipelines

K. Eric Harper (1), Jiang Zheng (1), Sam Ade Jacobs (1), Aldo Dagnino (1), Anton Jansen (2), Thomas Goldschmidt (3), Adamantios Marinakis (4)

(1) ABB US Corporate Research Center, 940 Main Campus Dr., Raleigh, NC 27606, USA
(2) ABB Corporate Research, Västerås, Sweden
(3) ABB Corporate Research, Ladenburg, Germany
(4) ABB Corporate Research, Baden-Dättwil, Switzerland

{eric.e.harper, jiang.zheng, sam.jacobs}@us.abb.com, [email protected], [email protected], [email protected]

Abstract— Decreasing cost and increasing capabilities of instrumentation, networks and data repositories have pervaded the industrial automation and power markets and opened the door for large scale collection and analysis of data. There are a variety of potential architectures and technology stacks that can be applied to these types of activities. However, no single technology stack or architecture fits all the scenarios. With limited data science training and experience, it is difficult and time consuming for highly specialized domain experts to choose the optimal approach. In this paper, we introduce an architectural pattern for the design of a flexible core analytics platform which is extensible using different pipelines. The pipeline pattern provides an accelerated start to implementing industrial analytics applications. The platform enables domain experts to compose pipelines in series and in parallel at scale with the right quality attribute trade-offs to deliver significant business value. Our use of the proposed platform is illustrated with real-world industrial applications, which necessitate various data handling and processing capabilities. These examples show the importance of the platform to non-data experts: reducing the learning curve for applying data science, providing a systematic rating process for choosing the pipeline types, and lowering the barriers for industrial businesses to leverage analytics.

Keywords— industrial analytics; data science; product line architectures

I. INTRODUCTION

Industrial companies take pride in advanced engineering: no one else knows the detailed design and operation of their products, or is better suited to analyzing measured data collected under laboratory and field conditions. Insights come from system models and the physics, mechanics, and dynamics of the component interactions. Better yet, the engineers have extensive experience solving customer issues, which leads to product refinements and enhancements.

Many engineering models are approximations, for example linear simplifications of non-linear behavior. Approximations can be validated using statistical models. This is especially true in reliability engineering [1] and process control [2]. Statistics has its origins in the analysis of population and market data collected by hand, finding commonality and variance in the samples. Modern statistics depends on IT resources: computer databases and processors that can store and evaluate large amounts of data. Statistics in non-engineering domains has evolved into the fourth paradigm [3]: data-intensive scientific discovery. Known as data science, this discipline incorporates exploration of data and construction of technology stacks for delivering business value.

Application of data science in industry brings together operational (OT) and information (IT) technology platform stakeholders, all anticipating the potential for generating significant business value. Each domain is invested in its own architectures with accepted standards and common practices. Choosing the best software architecture depends on identifying and addressing the non-functional requirements, and this creates common ground for stakeholders to design industrial analytics applications. Industrial analytics is the intersection of data science and software architecture.

Industry has reported success with targeted applications [4] where data science has been employed for specific analytics tasks. However, what most of these cases have in common is that they rely on heavily specialized solutions. An implementation provides business value in one scenario, but the solution in another application domain might require a completely different approach to be realized optimally.

A variety of analytics approaches have emerged for defined types of tasks commonly referred to as "big data". Different initiatives in the open source community, many of them governed by the Apache Software Foundation, such as Apache Hadoop [17] and Apache Cassandra [35], have emerged in recent years. Distributions of open source software [5] combine these implementations into comprehensive technology stacks. Our core platform builds on these distributions as a product line architecture, where analytics algorithms can be composed in series and in parallel to support the broadest number of applications. We can configure analytics applications with the available choices, implementing orchestration from data capture and pre-processing to analysis and visualization.
Functional requirements alone do not determine the optimal selection when choosing from the variety of algorithm development style alternatives. The non-functional requirements must also be weighed and prioritized. Quality attribute measures are a proven technique and process for assessing software architecture non-functional requirement alternatives and trade-offs. This same approach can be used for evaluation of analytics alternatives. Our proposed pipeline architecture pattern brings the benefits of big data analytics to industrial domains by lowering the knowledge required to deliver applications and allowing for systematic reuse.

Market signals regarding industrial analytics and strong encouragement from our senior management put our team on a multi-year journey to investigate and review analytics alternatives, and then apply our knowledge to application domains benefiting our businesses and customers. The architecture pattern and evaluation framework described in this paper are results from those efforts. The main contributions of this paper are an architectural pattern allowing different technology stacks to be composed together in a reusable way and an approach for choosing the right technology stack for industrial analytics.

The rest of the paper is structured as follows: Section II presents related work, Section III introduces the concept of analytics pipelines, Section IV presents our approach for choosing the optimal analytics pipelines, Section V summarizes three industrial case studies in which the pipeline approach has been applied, Section VI discusses the lessons learned and limitations, and Section VII concludes this paper and discusses our future work.

II. BACKGROUND AND RELATED WORK

Three areas form the foundation for our contributions to industrial analytics: data science, big data architecture, and software quality attributes.

A. Data Science

The mission of data science [44] is to extract knowledge from data, with or without subject matter expertise, using scientific discipline. The steps are to collect and clean raw data, explore the characteristics and relationships, and develop models and algorithms that uncover patterns and predict outcomes. Finally, the results need to be delivered in a format and terms that can be understood by non-technical stakeholders. The complexity of data science computations and the corresponding value of the results evolve as more insight is discovered from the application domain.
For example, as shown in Figure 1 below, historical data lends hindsight through data exploration and correlation with influencing factors; investigative algorithms then motivate collection of additional data to enhance forecasting; and finally, models are developed that provide better understanding of the domain, automating and optimizing subsequent calculations.

Figure 1 Data science stages [6]
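The stages just described can be sketched as a toy workflow. All function names and data below are illustrative assumptions for this sketch, not from the paper:

```python
# Minimal sketch of the data science stages: collect/clean (hindsight),
# explore (insight), model (foresight). Names and data are illustrative.

def collect(raw_readings):
    """Gather historical measurements, dropping missing data points."""
    return [r for r in raw_readings if r is not None]

def explore(samples):
    """Summarize the data to find influencing factors."""
    return {"mean": sum(samples) / len(samples),
            "max": max(samples), "min": min(samples)}

def model(stats):
    """Fit a trivial forecasting model (here, just the historical mean)."""
    return lambda: stats["mean"]

readings = [10.0, None, 12.0, 11.0, None, 13.0]
samples = collect(readings)
forecast = model(explore(samples))
print(round(forecast(), 2))  # 11.5
```

A real deployment would replace each stage with domain-specific cleaning rules and statistical or physical models, but the staged structure is the same.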

The most effective data scientists are typically well-versed in both science and technology. Given this background, there are no limits to solution alternatives. For example, algorithms and models can be constructed and executed in highly distributed deployments. This results in complex IT and software architectures specifically tuned to the application domains, and algorithms represented as coded programs: difficult to integrate and extend.

In reality these cross-functional skills are difficult to find [7]. There is a talent gap in combined expertise (IT, programming, math, statistics and domain-specific), curiosity, storytelling, and cleverness. For industrial analytics, business-driven, not exploratory, analytics is desired, and unfortunately business goals do not translate directly into analytics projects. The talent gap creates an opportunity to standardize analytics tools and automate algorithm specification and development processes. Significant progress, founded in open source community effort but extended by hosted platform commercial offerings [8], heralds the adoption of data science by a broader audience.

B. Big Data Architecture

Standardization of big data architectures has accelerated with innovations in search, especially using Apache Hadoop [17]. One key limitation of the original Hadoop MapReduce [15][16] framework is a restricted computational model. Subsequent to its emergence as the de facto large-scale data processing platform, the big data community recognized and advocated for a more generic and streamlined architecture platform with the realization that big data is not a single analysis pattern or a single data storage technology. The earlier proponents in this regard are two leading Hadoop distribution vendors: Hortonworks [30] and Cloudera [31]. These and additional vendors provide integrated platforms that support alternative processing engines like SQL, interactive and real-time queries, and streaming as part of their offerings.
Subsequently, YARN [32] was introduced and adopted as a formal abstraction to encapsulate the need for resource management among different applications executing within a Hadoop cluster.

In a similar fashion to Hadoop and its distributions, UC Berkeley's AMPLab [33] and its commercial startup venture, Databricks [34], identified an alternate architecture for big data. From its inception, AMPLab introduced the Berkeley Data Analytics Stack (BDAS) with Spark [18][19][20] (an in-memory computation engine alternative to disk-based Hadoop) at its foundation. BDAS aims to integrate batch analytics, streaming, SQL processing, graph analytics, and machine learning within the same framework. Databricks extends this idea by providing a commercially hosted cloud platform in support of BDAS components such that end users are shielded from the difficulties associated with big data architecture platforms.

Additionally, GraphLab [21][22] focuses on streamlining graph analytics and machine learning for data scientists. It supports an end-to-end prototype-to-production approach that simplifies the data analytics process. Its data architecture includes support for different data sources, highly efficient multicore and out-of-core graph analytics and machine learning toolkits, and tools to explore and visualize data.

C. Software Quality Attributes

Given the variety of choices for realizing analytics, it is difficult and time consuming for highly specialized domain experts to choose the optimal approach that creates business value. In software engineering, quality attributes of a software system refer to constraints on the manner in which the system implements and delivers its functionality [29]. From a requirements engineering perspective, functional requirements define operations that the system must be able to perform. Non-functional requirements (NFR) describe how well the system must perform its functions, and how to measure these aspects of the system. Quality attributes are characteristics of NFRs, such as performance, reliability, scalability, security, and usability. Many industry standards, such as ISO 25011 and ISO/IEC 9126-1, and architecture books and materials by the Software Engineering Institute (SEI), have provided NFR taxonomies that define quality attributes.

Improving one quality attribute of the system can have negative effects on other quality attributes. For example, making a system more secure can make it harder to use, or making it easier to use can make it less secure (security vs. usability). Using platform-specific features can make a system run faster, but that often makes it much more costly to port to another platform (performance vs. portability). Trade-offs must be made by taking into consideration the relative importance of each of the system qualities. Many of the important properties of analytics alternatives are quality attributes. This suggests the potential for a systematic approach for choosing the appropriate combination of analytics tools and algorithms for an application domain.

III. PIPELINE ARCHITECTURE

In this section, we describe the origins of our analytics pipeline pattern and define its components.

A. Architectural Basics

Successful analytics applications typically address a single problem and use specific technology choices. Unfortunately this means the next application needs to start from scratch. However, in practice this does not happen. Instead, the previous solution is reused (copied), even if the trade-offs are ill-suited to the new problem. There are two reasons for this. First, familiarity and success can lead to reasoning done from the perspective of a previous solution rather than letting the problem at hand inform the appropriate quality attribute priorities. Second, the effort required to reconsider the previous architecture decisions might be too costly.

In this paper, we envision a product line approach [14] for analytics applications, facilitating systematic reuse on both the infrastructure and software levels and reducing the effort required to deliver analytics applications. A core platform provides the common functionality and infrastructure for industrial analytics applications, expediting application delivery. This lowers the knowledge and effort necessary to create business value from machine data.

Our core platform needs to be flexible enough to deliver different quality attribute trade-offs when it comes to the choices for industrial analytics applications. We therefore came up with the notion of an analytics pipeline: a flexible way of performing analytics with our core platform. A pipeline is an environment with specific technology choices that can store data and execute analytics programs. Analytics pipelines are composable in serial and parallel combinations.

Inspiration for our analytics pipeline comes from the work of Nathan Marz, who described the so-called Lambda Architecture (LA) pattern [9]. This pattern combines batch and speed (e.g. streaming) processing layers, which work in parallel, as shown in Figure 2. All data coming into the system is dispatched to both the batch and speed processing layers. The batch layer is responsible for maintaining the master data set in an immutable way and for pre-computing the batch views. Another layer, the serving layer, takes the batch views and indexes them, so they can be queried in an ad-hoc and low-latency way.

Figure 2 Lambda architecture pattern
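As a rough illustration of this pattern, the following sketch (invented names, toy data) dispatches each event to both layers and answers queries by merging the batch view with the speed layer's recent data:

```python
# Toy sketch of the Lambda pattern: an immutable master data set with
# periodically recomputed batch views, plus a speed layer holding only
# recent data. All class and variable names are our own illustration.

class BatchLayer:
    def __init__(self):
        self.master = []          # immutable, append-only master data set
        self.view = {}            # pre-computed batch view (sums per key)

    def ingest(self, event):
        self.master.append(event)

    def recompute(self):          # slow batch job, runs periodically
        self.view = {}
        for key, value in self.master:
            self.view[key] = self.view.get(key, 0) + value

class SpeedLayer:
    def __init__(self):
        self.recent = {}          # only data since the last batch run

    def ingest(self, event):
        key, value = event
        self.recent[key] = self.recent.get(key, 0) + value

    def reset(self):              # called once the batch layer catches up
        self.recent = {}

def dispatch(event, batch, speed):
    batch.ingest(event)           # all incoming data goes to both layers
    speed.ingest(event)

def query(key, batch, speed):
    # serving perspective: "old" batch view plus the "fresh" speed delta
    return batch.view.get(key, 0) + speed.recent.get(key, 0)

batch, speed = BatchLayer(), SpeedLayer()
for e in [("temp", 1), ("temp", 2)]:
    dispatch(e, batch, speed)
batch.recompute(); speed.reset()  # batch job finishes, speed layer restarts
dispatch(("temp", 4), batch, speed)
print(query("temp", batch, speed))  # 7
```

The sketch also shows the pattern's main liability mentioned below: the batch and speed paths implement the same aggregation twice and must be kept consistent.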

The speed layer has the responsibility to compensate for the high latency of updates to the serving layer (as the batch layer can have several hours of input latency) and only contains recent data. Hence, from a querying perspective, the data is split based on its freshness: the "old" data is in the serving layer (coming from the batch layer), and the "fresh" data (e.g. from the last couple of hours) comes from the speed layer.

The Lambda pattern has benefits and disadvantages. It offers a highly scalable and fast response system, which can reliably store and retrieve huge amounts of data. On the other hand, the code in the batch/serving and speed layers needs to be kept in sync for the system to deliver consistent results from the two data paths. Furthermore, operation costs are higher as different computing clusters have to be maintained for each layer.

The combination is an instance of a more generic architecture pattern in which different pipelines are composed to realize a particular architectural trade-off. The Lambda architecture pattern defines which type of pipelines should be used and how they are connected to deliver well-known trade-offs. However, conceptually there is no reason why other pipelines could not be chosen and composed to get other trade-offs.

For our core platform a one-size-fits-all approach does not meet our goals. Different application domain requirements are often in direct conflict with each other. Hence, we need flexibility in the platform to choose a trade-off that makes sense in the specific context in which the platform is used. We provide some examples of these specific trade-offs and contexts in Section V. To realize this flexibility in the core platform, we need a design where different technologies, with their specific trade-offs, can be composed together, thereby empowering the platform users to tailor the platform to their specific trade-off needs.

The architecture pattern that realizes this flexibility in our core platform is depicted in Figure 3. In this pattern, new data arrives at a Dispatcher, which sends the data to the relevant Analytics pipelines. The Analytics pipelines offer an environment in which data is analyzed and stored. Clients can either pull data from the Analytics pipeline (e.g. by a query) or get data pushed to them (e.g. by being sent a notification). To make Analytics pipelines composable, an Analytics pipeline can also feed the Dispatcher with data, allowing other Analytics pipelines to tap into these data streams. With this, any hierarchical composition can be made, as one pipeline can feed an arbitrary number of other pipelines. Viewed abstractly, our pattern is a more detailed and distributed instantiation of the pipes and filters [10] architectural pattern.
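A minimal sketch of this composition mechanism, with a Dispatcher routing data to subscribed pipelines and a pipeline optionally feeding its results back for others to consume (all class, topic, and function names here are our own illustration, not the platform's API):

```python
# Toy, single-process sketch of the Dispatcher / Analytics pipeline pattern.
# Real pipelines are distributed environments; names are illustrative.

class Dispatcher:
    def __init__(self):
        self.subscribers = {}                  # topic -> list of pipelines

    def subscribe(self, topic, pipeline):
        self.subscribers.setdefault(topic, []).append(pipeline)

    def publish(self, topic, data):
        for p in self.subscribers.get(topic, []):
            p.ingest(self, data)

class Pipeline:
    def __init__(self, name, transform, out_topic=None):
        self.name, self.transform, self.out_topic = name, transform, out_topic
        self.store = []                        # stands in for Storage*

    def ingest(self, dispatcher, data):
        result = self.transform(data)
        self.store.append(result)              # clients can pull from here
        if self.out_topic:                     # feed the Dispatcher again so
            dispatcher.publish(self.out_topic, result)  # others can compose

d = Dispatcher()
cleaner = Pipeline("clean", lambda x: x * 2, out_topic="clean")
summer = Pipeline("sum", lambda x: x + 1)
d.subscribe("raw", cleaner)                    # serial composition:
d.subscribe("clean", summer)                   # raw -> clean -> sum
d.publish("raw", 10)
print(cleaner.store, summer.store)  # [20] [21]
```

Subscribing two pipelines to the same topic would give the parallel composition; the feedback through the Dispatcher gives the hierarchical one.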

Ideally, the platform acts as an Analytics as a Service (AAAS) platform which, similar to a Platform as a Service (PaaS), allows developers to deploy Analytics Apps. The platform manages deployment of an Analytics App without the developer having to care about infrastructural aspects, and decides and optimizes where and how the Analytics App is deployed. For example, it could decide to create a new analytics cluster to accommodate the app or co-locate it in an existing one. For the AAAS to work, an Analytics App has to define some additional meta-data (see Figure 5). Besides an Implementation, an Ingress Specification is needed for the AAAS to know which data the Analytics App depends on and what type of analytics pipeline can execute the Analytics App. A QoS Specification defines the app's run-time qualities (e.g. result delivery deadlines). The optional Outgress Specification defines the data the Analytics App exposes to other Analytics Apps to make the pipeline composable.
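This meta-data could be captured, for example, as a simple record type. The field names follow the paper's terms, but the concrete structure and example values are assumptions of this sketch:

```python
# Hypothetical record type for the Analytics App meta-data (Implementation,
# Ingress/QoS/Outgress Specifications). Values are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class AnalyticsApp:
    implementation: str      # reference to the packaged analytics code
    ingress_spec: dict       # which data it depends on, which pipeline type
    qos_spec: dict           # run-time qualities, e.g. delivery deadlines
    outgress_spec: dict = field(default_factory=dict)  # optional exposed data

app = AnalyticsApp(
    implementation="vibration-analysis:1.0",
    ingress_spec={"source": "sensor-stream", "pipeline_type": "Streaming"},
    qos_spec={"result_deadline_ms": 500},
    outgress_spec={"topic": "vibration-alerts"},
)
print(app.ingress_spec["pipeline_type"])  # Streaming
```

With such a record, the AAAS scheduler has what it needs to pick an executing pipeline and wire the app's output back into the Dispatcher.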


Figure 3 Analytics pipeline architecture

Figure 4 shows the components and connectors in an Analytics pipeline. First, the Ingress component is connected to the Dispatcher and has the responsibility to transform the incoming data stream into something the Storage* component can handle. Storage* is a (temporary or long term) storage place (e.g. buffer, memory, disk, storage cluster, distributed file system), which makes the data available to other components. The Analytics component is configured with the actual analytic functionality, which is designed using the Engineering interface. A Scheduler manages concurrent access to Storage* and schedules the Analytics tasks. Clients use the Client interface to access the data (including analytics results) of the Analytics component. The Outgress component is responsible for exposing the (analytics) data of the Analytics pipeline to other Analytics pipelines through the Dispatcher.

Figure 4 Analytics pipeline functionality
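One way to picture these components in code is a toy, single-process sketch; the real pipelines are distributed, and the names below are ours rather than the platform's:

```python
# Illustrative sketch of the Figure 4 components: Ingress normalizes the
# incoming stream for Storage*, the Engineering interface configures the
# Analytics function, and Clients pull results. Names are assumptions.

class AnalyticsPipeline:
    def __init__(self):
        self.storage = []                     # Storage*: buffer/disk/cluster
        self.analytic = lambda rows: rows     # default pass-through analytic

    def ingress(self, raw):
        record = {"value": float(raw)}        # transform stream for Storage*
        self.storage.append(record)

    def engineering(self, fn):                # Engineering interface:
        self.analytic = fn                    # configure the analytic

    def client_query(self):                   # Client interface: pull results
        return self.analytic(self.storage)

p = AnalyticsPipeline()
for raw in ["1.5", "2.5", "4.0"]:
    p.ingress(raw)
p.engineering(lambda rows: sum(r["value"] for r in rows) / len(rows))
print(p.client_query())  # mean of the stored values
```

The Scheduler and Outgress components are omitted here; in the platform they would arbitrate concurrent Storage* access and republish results via the Dispatcher.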

From our platform perspective, we would like to have partial implementations of these Analytics pipelines for different technology stacks (e.g. Hadoop, Cassandra, etc.), which the platform users can tailor through the Engineering interface for their specific analytics needs. Furthermore, for our platform to be elastic it should support run-time instantiation of these Analytics pipelines. In this way, different teams can work in parallel on the platform.


Figure 5 Analytics As A Service (AAAS)

To realize the analytics pipeline architecture pattern, many different alternatives can be used for the analytics technology choice and algorithm development style. See Table I below for a catalog of our selections. In the following section, we explain how each pipeline's quality attributes can be used to make different trade-offs.

B. Analytics Pipeline Types

The Hadoop pipeline refers to traditional batch analytics with the Hadoop Distributed File System (HDFS) [16]. This pipeline provides large static capacity, but high latency and response execution times given that the computations are disk bound. It also supports reliability in terms of fault tolerance through data replication. Two technology choices for the Hadoop pipeline are Apache Hive [36] (a SQL engine for Hadoop) and Pig [37] (high-level abstractions for MapReduce programs).

The Indexed pipeline supports indexing such that data processing and information retrieval can be done faster. Apache Lucene [38] and its open source enterprise search extension, Apache Solr [39], are two examples. Solr is one of the most popular enterprise and single-site search engines. Indexing provides real-time, scalable search with support for almost any type of data and file format. Therefore, scalability and data flexibility are expected to be two of the prominent quality attributes.

The RDBMS pipeline refers to traditional relational database systems. One key attribute of RDBMS is its very high productivity. RDBMS is a legacy pipeline in the sense that many transaction systems are built on it and it has served the business community for years. Many data scientists and business analysts are already familiar with RDBMS, so there is no need for retraining. There seems to be no real competitor to RDBMS when it comes to storage and processing of structured data.

The Key-Value Pair pipeline is an example of NoSQL (Not only SQL) database technology. Key-Value Pair and related column store models are direct alternatives to RDBMS row stores, with a focus on simplicity of design, reliability (fault tolerance), and scalability. These quality attributes are expected to feature prominently with the Key-Value Pair pipeline, which provides a streamlined and efficient approach to store, process, and retrieve (using a primary key) unstructured and schema-less data.

The Time Series pipeline supports storing and processing time-series data. Typically, time-series data come from sensors and other instrumented devices. Given the frequency of their occurrence and susceptibility to error due to missing data points, time-series data arrive in large volumes requiring highly scalable and robust solutions with scalability, capacity, and reliability.

The Streaming pipeline is characterized by the low latency required to derive quick insight from streaming and real-time data. There is no need for large static capacity since data are consumed and processed on arrival. Much of the computation is done in memory (high dynamic capacity) and the overall response has to be very fast. Storm [40] and S4 [41] are two popular technology choices for the Streaming pipeline.

The In-Memory pipeline enables large-scale data processing within a cluster's memory. Technologies in this category provide primitives and APIs that support large volumes of data in RAM (random access memory), and repeated computation on that data. Apache Spark [19] is at the forefront of In-Memory data analytics, and combines persistent data storage for high static capacity.
The key differentiating attributes for In-Memory are scalable dynamic capacity, low latency, and low (quick) overall response time. With sufficient RAM, Apache Spark has demonstrated performance 100 times faster than Hadoop MapReduce in many of its publicly available benchmarks [18].

The Single Node pipeline is optimized to run on a single shared-memory computer. These single node alternatives (e.g. R [23][24][25], Weka [26][27]) are limited in terms of scalability but have an expected high rating for productivity. Users of Single Node do not have to deal with the complicated issues associated with distributed data processing. In most cases, data analysts prefer Single Node for quick prototyping and proof of concept before scaling out.

The Graph pipeline provides both intuitive and visual representations of computational problems. GraphLab [21][22] and GraphX [42] are two alternatives. Many data analytics problems can be represented as graphs for data mining. Also, some classical machine learning algorithms map to graphs for which simple but powerful theorems exist. Typically, the graph-based approach represents arbitrary data and systems as nodes and connections (of any type) between the nodes as edges, computed in memory or out-of-core. This provides both high static and dynamic capacities, and low analytics latencies.

The Custom pipeline refers to technology choices that may be of interest to data architects and analysts but are not covered in the list of pipelines already discussed. A custom pipeline provides a high level of data and algorithm flexibility suitable for domain-specific data analytics problems, balanced by lower productivity. Including Custom provides support for assembling and adapting existing analytics implementations in series or parallel with other pipelines.

TABLE I.

ANALYTICS PIPELINE TYPES

Label          | Technology Choice | Development Style
---------------|-------------------|--------------------------
Hadoop         | Hive / Pig        | SQL
Indexed        | Solr / Lucene     | Indexed Query
RDBMS          | JDBC              | SQL
Key-Value Pair | NoSQL             | CQL
Time Series    | Time Series       | Summary, Aggregates
Streaming      | Storm / S4        | Java, Topology
In-Memory      | Interactive       | SQL, Script
Single Node    | R / Weka IDE      | Java, Extension Packages
Graph          | GraphX, GraphLab  | Spark
Custom         | Custom            | Java

IV. NON-FUNCTIONAL PROPERTY RATING OF THE PIPELINES

In this section, we present our approach for choosing the optimal analytics pipelines.

A. Pipeline Identification and Rating Methodology

The iterative process of identifying and evaluating pipeline types is shown in Figure 6. The ratings evaluation framework for the pipeline types evolved over time, starting with simple computer resources and later extending to software quality attributes. Based on the system context and capabilities of the core platform, we first established a baseline by: (1) identifying common pipeline types (such as Hadoop and RDBMS) and general non-functional properties (such as performance and flexibility); (2) conducting an internal individual survey among six experts to rate the different pipelines regarding these non-functional properties; and (3) consolidating the initial results and revising the ratings in a group meeting. Outliers were identified based on the standard deviation of the ratings. Ratings with a standard deviation greater than 2.0 were explicitly discussed to avoid misunderstandings. The experts then revised their ratings after active brainstorming.

We then iteratively conducted the above three steps to expand and refine the analytics pipelines, clarify and refine the non-functional properties, redo the individual surveys by the six experts based on the updated pipelines and non-functional properties, discuss the results again, and revise the ratings if needed. Table II shows the final version of the non-functional properties with their descriptions and the definition of their scales. Each of the participants voted on a scale between 1 and 10 for the non-functional properties of the individual pipelines, where 1 means the corresponding non-functional property is low/bad and 10 means it is high/good. Figure 7 shows the rating dimensions and values for a selected set of analytics pipelines, and the dotted line trace indicates ideal ratings for all the non-functional properties.
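The consolidation step can be illustrated with a short script: average the votes per property and flag any rating whose standard deviation exceeds the 2.0 threshold for group discussion. The votes shown are invented for illustration, not the survey's actual data:

```python
# Sketch of consolidating six expert votes (scale 1-10) per pipeline and
# property, flagging high-disagreement ratings for discussion.
from statistics import mean, stdev

votes = {
    ("Hadoop", "Data Flexibility"): [9, 10, 9, 10, 10, 9],   # broad agreement
    ("Hadoop", "Productivity"):     [2, 8, 5, 9, 3, 7],      # experts disagree
}

for key, v in votes.items():
    avg, sd = mean(v), stdev(v)
    flag = "DISCUSS" if sd > 2.0 else "ok"
    print(f"{key}: avg={avg:.1f} sd={sd:.2f} {flag}")
```

Here the second rating would be flagged and revisited in the group meeting, exactly as the process above prescribes for outliers.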

Figure 6 Pipeline identification and evaluation process TABLE II.

ability to handle almost arbitrary types of data as well as extreme horizontal scalability. -

Regarding productivity RDBMS still has the lead. This is mainly due to the vast availability of good tools as well as experts for developing for this pipeline. However, for all other attributes RDBMs are only rated mediocre. Especially, and maybe a bit surprisingly reliability is rated rather low.

-

Single node has great ratings for flexibility and productivity, but the least attractive properties for the remaining properties compared with the rest of the analytics pipelines.

-

One of the overall winners of the rating seems to be keyvalue stores as they apparently provide a high dynamic capacity, good scalability as well as the best reliability amongst the rated pipelines. As a specific pipeline on top of key-value stores, time-series databases get similarly good ratings. Of course this comes with the drawback of reduced flexibility.

-

Stream processing pipelines lead the area of both analytics latency as well as round-trip response. On the other hand the rating for productivity is rated fairly low. A reason for this is

ANALYTICS PIPELINE PROPERTIES

Property

Description

Data Flexibility

New/unknown data types, without data model modification

Algorithm Flexibility

Variety of supporting libraries, query representations

Productivity

Ratio between effort and cost

Static Capacity

Store or configure permanently

Dynamic Capacity

Process or manage data simultaneously with concurrent tasks

Analytics Latency

Time delay experienced in core data processing

Round Trip Response

Time elapsed between request and response

Scalability

Ease, speed and affordability of changing performance qualities

Reliability

MTBF, operation when faults occur, degree of recovery, no data loss

Scale 1=one data type, 4 = limited number of data types, 6=extensible data types, 10=any data 1=single algorithm, 10=support any algorithm 1=low productivity, 10=high productivity 1=no permanent data storage, GB=3,TB=4, PB=6 EB=7, ZB=8, YB=9, 10=unlimited 1=no permanent data storage, GB=3,TB=4, PB=6 EB=7, ZB=8, YB=9, 10=unlimited decade=1, year=2, month=3, week=4, day=6, hour=7, minute=8, s=9, ms=10 year=1, month=2, week=3, day=4, hour=5, minute=6, s=7, ms=9, us=10 1=no scaling, 2=fix number, 4= exponential, 7=limited linear, 10=linear millennium=10

B. Pipeline rating results Figure 8 shows the average rating properties compared for each pipeline type, and Figure 9 presents the pipeline rating profiles. Table III shows the averages and standard deviations based on the participant votes. The table values in bold warrant further attention. We identify some interesting findings based on the rating results as follows: - Not surprisingly, Hadoop was top rated in both data flexibility as well as static capacity. This reflects the capabilities of HDFS the storage side regarding both, the

Figure 7 Pipeline rating spider diagram

the complex development model for the analytics algorithms. -

Of course a custom-made solution has the best flexibility regarding data and algorithms as anything can be programmed in such a solution. However, it is also perceived the worst in productivity as all the custom code has to created and maintained.

-

Figure 9 suggests that if no extremely high value in one of the categories is required there are options that provide a good overall rating (identifiably by the covered area within the spider diagram). These are for example, Graph databases, in-memory solutions as well as indexed query and key-value pair based solutions.

C. Threats to validity

There are several threats to the validity of the pipeline rating results. First, regarding the design of the rating, we might have missed non-functional properties that would have been important for making sound decisions about specific problems faced in practice. Additionally, the ratings we gathered were not supported by detailed benchmarks of the various non-functional properties, but rather rested on the subjective experience of the people responding to the rating survey.

Figure 8 Ratings for the different pipelines in comparison with highlighting of top-rated pipelines (spider diagrams over the quality attributes; legend: Graph, Hadoop, Indexed, Key Value Pair, RDBMS, Streaming, Time Series, In-Memory)

However, we argue that having results based on a few available experts is already a good starting point. Of course, the results would improve if the rating process were executed with a broader audience of experts to attain greater statistical significance.

V. APPLICATION CASE STUDIES

In this section, we present three industrial analytics cases that use some of the analytics pipelines explained above. The general description of each case is presented, as well as the characteristics of the data it generates. Based on the data and the expected analytic functionality, the main quality attributes of each case are identified, which provide the rationale behind the selection of analytics pipelines for each case. The ideal rating values for each application case are shown in Figure 10, overlaid with the selected analytics pipeline ratings for comparison.

TABLE III. PIPELINE RATING RESULTS (average ± standard deviation of the participant votes)

Pipeline        Data Flex  Alg Flex  Product.  Stat Cap   Dyn Cap   Latency   Rnd Trip  Scalab.   Reliab.
Hadoop          9.5±0.8    6.2±1.7   7.3±0.7   10.0±0.0   8.2±1.7   3.5±1.6   3.8±2.0   9.0±1.2   9.2±0.9
Indexed         8.7±1.1    4.8±1.2   7.3±1.2   9.7±0.7    8.2±1.8   6.2±1.7   5.0±1.5   9.0±1.2   8.7±1.1
RDBMS           7.8±1.6    5.7±1.6   8.8±0.9   5.3±1.8    5.2±1.2   5.0±1.7   4.3±1.4   4.3±1.4   5.0±1.3
Key-Value Pair  8.7±1.4    4.8±1.7   7.2±1.1   9.3±0.5    9.0±0.8   7.3±1.1   6.0±1.7   9.2±0.7   9.5±0.8
Time Series     4.5±1.7    3.2±1.3   7.8±1.9   9.2±0.7    9.0±0.8   7.3±1.1   6.2±2.0   9.2±0.9   9.3±0.9
Streaming       9.3±1.5    7.8±1.8   5.3±1.7   1.5±1.3    8.7±1.1   9.2±0.4   8.8±0.7   8.7±1.1   6.5±1.7
In-Memory       9.2±0.9    8.5±1.0   7.8±0.9   9.3±0.7    5.8±2.2   8.0±0.8   3.7±1.5   8.0±1.3   6.3±1.7
Single Node     7.2±0.7    9.2±1.0   8.2±1.7   3.0±0.6    2.4±1.0   3.4±1.5   3.2±0.7   2.0±1.1   2.6±1.0
Graph           6.8±2.4    6.3±1.9   6.5±1.0   9.2±0.9    7.8±1.7   8.2±1.1   5.0±1.9   9.0±1.4   7.8±1.6
Custom          9.5±1.1    9.5±1.1   3.0±1.6   5.5±1.1    5.7±1.5   5.2±0.9   5.2±0.9   4.7±0.7   4.5±1.1
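One way to operationalize the comparison of an application's ideal rating profile against the pipeline ratings (as overlaid in Figure 10) is a simple distance measure. A sketch with ratings abridged to three attributes for brevity (the attribute keys, the Euclidean-distance choice, and the ideal values are our own illustration):

```python
import math

# Abridged rating profiles (attribute -> average rating); see Table III
# for the full survey results.
pipelines = {
    "Streaming": {"latency": 9.2, "round_trip": 8.8, "static_capacity": 1.5},
    "Hadoop":    {"latency": 3.5, "round_trip": 3.8, "static_capacity": 10.0},
    "In-Memory": {"latency": 8.0, "round_trip": 3.7, "static_capacity": 9.3},
}

def distance(ideal, profile):
    """Euclidean distance between an application's ideal ratings and a pipeline."""
    return math.sqrt(sum((ideal[k] - profile[k]) ** 2 for k in ideal))

# Hypothetical requirements: tight latency and response, little storage need.
ideal = {"latency": 9, "round_trip": 9, "static_capacity": 2}
best = min(pipelines, key=lambda p: distance(ideal, pipelines[p]))
print(best)  # -> Streaming
```

Weighted variants (multiplying each squared term by the attribute's importance) follow the same shape when some quality attributes matter more than others.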

Figure 9 Analytics pipeline ratings


A. Case 1 Wide-Area Monitoring, Protection and Control of Electric Power Systems

Interconnections for electric power production, transmission and distribution (referred to as "power systems" in the related community) make up one of the most complex systems created by humans. Large systems like the North American or European interconnections typically consist of thousands of generators, transmission lines, substations and electricity consumers. To achieve a reliable, uninterrupted supply of electricity, power systems need to be constantly operated such that they remain at a dynamic equilibrium, despite the uncontrollable nature of some of their components (electric loads, renewable generation). Even more challenging, stability needs to be maintained in the face of unexpected events, like the loss of a transformer, a transmission line or a generating unit.


Furthermore, there are threats resulting from the execution of the rating itself. First, we had a relatively small set of participants (6) who, despite quite different backgrounds, were all from the same company, namely ABB Corporate Research. Additionally, even though all of the participants have a strong technical background in computer science, software architecture and data analytics, not all of the analytics pipeline technologies were known to the same extent by all participants.


To this purpose, transmission system operators install measurement devices and software systems that allow them to monitor the dynamic state of their networks in real time, so that proper stabilization actions can be taken in a timely manner when needed. Nowadays, these wide-area monitoring systems increasingly rely on time-synchronized voltage and current phasor measurements taken from all over the system every few milliseconds, processed and stored permanently using a time series analytics pipeline. The transmission system operators make selective use of such data in offline in-memory analysis to gain insight into the reasons for instabilities and to design proper counteractions and/or operational rules to be applied in the future.
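The synchronized processing of these time-stamped measurements hinges on collecting, for each timestamp, the readings from every device before the downstream algorithms run. A minimal buffering sketch (the device names and the all-devices-present completeness rule are our assumptions, not part of any specific product):

```python
from collections import defaultdict

EXPECTED_DEVICES = {"pmu_north", "pmu_south", "pmu_east"}  # assumed PMU set
buffer = defaultdict(dict)  # timestamp -> {device: phasor value}

def on_measurement(timestamp, device, phasor):
    """Buffer one phasor; return the full snapshot once all devices report."""
    buffer[timestamp][device] = phasor
    if set(buffer[timestamp]) == EXPECTED_DEVICES:
        return buffer.pop(timestamp)  # ready for e.g. state estimation
    return None

on_measurement(1000, "pmu_north", 1.02)
on_measurement(1000, "pmu_south", 0.98)
snapshot = on_measurement(1000, "pmu_east", 1.01)
print(snapshot)  # all three phasors share timestamp 1000
```

A production streaming pipeline would add a timeout so a late or lost measurement cannot stall the snapshot indefinitely.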

Figure 10 Application case ratings

More precisely, these phasor measurements need to be communicated in a timely manner from their respective remote locations to a control center where special algorithms are executed in real time. These algorithms take as inputs all the phasor measurements with the exact same timestamp and perform functions such as system state estimation, identification and monitoring of electromechanical oscillations, and calculation of distance from voltage instability, using streaming analytics pipelines. On top of this, special protection (also called remedial action) schemes are designed to allow for automatic actions that stabilize the system when needed. Actuation of such schemes is based on wide-area measurements and consists of opening switches, modifying setpoints of power flow control devices, changing generator setpoints, and similar actions. To be effective, such wide-area monitoring, protection and control systems have strict time and reliability requirements; the stabilizing action should happen at the right moment and at the right location, properties that fit with a streaming analytics pipeline for detecting events and coordinating the appropriate actions in real time. The most important quality attributes for this application include static capacity, analytics latency, round-trip response, scalability and reliability. To this purpose, the streaming and time-series analytics pipelines are the most suitable. Figure 10 (a) shows the streaming, time series, and in-memory analytics pipelines used and the ideal application ratings outline.

B. Case 2 Fault Event Prediction in Power Distribution

Distribution systems take power from bulk-power substations to consumers. Power substations and their associated equipment play an essential role in the distribution of electricity. Much of the power distribution infrastructure in the western world is over 50 years old. A key issue facing utilities is to

efficiently utilize their limited funds for maintenance and repair of power distribution lines. Studies in the UK show that more than 70% of the unplanned customer minutes of lost electrical power are due to problems in the distribution grid [11]. According to a survey conducted by the Lawrence Berkeley National Laboratory, power outages and interruptions cost the United States of America $80 billion annually [11]. Power distribution grid operators typically rely on manual methods and reactive approaches for outage diagnostics and location identification. To reduce time and costs, it is desirable to have automated prediction models that can forecast when a fault event may occur in a distribution network given certain conditions. Moreover, the efficiency of dispatching crews can be improved by more automated diagnostic methods and fault predictive capabilities.

There are many factors identified in the literature that can cause fault events in a power distribution grid [12][13]. These factors can be broadly classified into (a) physical properties of the distribution grid; (b) electrical values of the grid; (c) weather conditions; (d) degradation of assets or components in the grid; and (e) type of grid infrastructure. The analyses conducted to predict faults in the power distribution grid use historical data from weather conditions (precipitation types, volumes of precipitation, pressure, temperature, wind speed, lightning, and humidity), grid electric value readings at the time of a fault event (current, power, and voltage values), power loadings, and the type of distribution grid infrastructure (overhead or underground).

The data analyzed in this application consists of electrical values generated by sensors in intelligent electronic devices connected at the beginning of a power distribution feeder. These electric values are generated at a rate of one reading per second.
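As described next, these per-second electric values must be joined with weather readings that arrive only every five minutes; that alignment amounts to carrying each weather sample forward until the next one arrives. A forward-fill sketch with invented sample data:

```python
import bisect

# Weather readings every 300 s as (timestamp, temperature); invented values.
weather = [(0, 21.0), (300, 21.4), (600, 22.1)]
weather_ts = [t for t, _ in weather]

def weather_at(ts):
    """Most recent weather reading at or before ts (forward fill)."""
    i = bisect.bisect_right(weather_ts, ts) - 1
    return weather[i][1] if i >= 0 else None

# Join each per-second electric reading with the prevailing weather.
electric = [(299, 13.2), (300, 13.4), (451, 13.1)]  # (timestamp, current in A)
aligned = [(ts, amps, weather_at(ts)) for ts, amps in electric]
print(aligned)
```

In a streaming deployment the same forward-fill happens incrementally: the pipeline keeps only the latest weather record per station and tags each electric reading with it on arrival.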
Another type of data is measurements collected from weather stations located at various points along the power distribution grid. These data are collected every five minutes and need to be aligned with the electric value data. The analysis needs to run in near real time to monitor the power distribution grid. To achieve the desired analytic results, a streaming analytics pipeline implements this application, as latency, round-trip response, scalability, data flexibility, and algorithm flexibility are the most important quality attributes. In this application, the requirements for data flexibility, algorithm flexibility, analytics latency, round-trip response, and scalability are quite high due to the near real-time requirements. Figure 10 (b) shows the streaming analytics pipeline used and the ideal application ratings outline.

C. Case 3 Intelligent Alarm Management

Large chemical, refining, power generation, and other processing plants need control systems to keep the processes operating successfully to produce high-quality products. Plant operators in supervisory centers use control systems to monitor production processes and manage possible alarms that may occur when process quality values are out of specification. Modern control systems produce large quantities of data and can potentially show a large volume of alarms to the operator. The trends in data availability, automation, analytics, visualization, cheap sensors, and data storage have continued to

increase the sophistication and complexity of control systems, and this in turn has increased the need for alarm management systems capable of efficiently simplifying the work of plant operators. Alarm management systems incorporate ergonomics, instrumentation, systems thinking, advanced analytics, and sophisticated visualization to guide the design of an alarm system and increase its usability. Most often, the major usability problem is too many alarms annunciated during a plant upset, commonly referred to as an alarm flood or alarm burst. There can also be other problems with an alarm system, such as poorly designed alarms, improperly set alarm points, ineffective annunciation, and unclear alarm messages.

Modern control systems need to have Intelligent Alarm Management (IAM) capabilities that can dynamically filter the process alarms based on historical plant operation and conditions so that only the currently significant alarms are annunciated to the operators. The fundamental purpose of dynamic alarm annunciation is to alert the operator to relevant abnormal operating situations. This feature focuses the operator's attention by eliminating extraneous alarms, providing better recognition of critical problems, predicting alarms given certain events, and ensuring swifter, more accurate operator response to any abnormalities in the plant.

This application case refers to a big data IAM system that allows various types of sophisticated analyses of large volumes of historical alarm data to help reduce the number of redundant and nuisance alarms, predict alarms given certain events, and identify tripping alarms that may cause costly stoppages. The measurements analyzed in this application are events generated by supervisory systems, either part of process control or alarm management, dedicated to collecting state transitions. Alarms are generated in real time as a facility (manufacturing, process manufacturing, petrochemical, nuclear, etc.) operates, and these alarms are either analyzed in batch to fine-tune the alarm management system or in real time to present critical alarms to the operator. To achieve the required analytic results in this application, the streaming and graph analytics pipelines were used. The most important quality attributes in the application include algorithm flexibility, scalability, productivity, and reliability. The requirements for algorithm flexibility and scalability are quite high, as different analytic methods have been developed to analyze alarms. Figure 10 (c) shows the streaming and graph analytics pipelines used, and the ideal application ratings outline.

VI. DISCUSSION

The evolution of our pipeline pattern architecture and analytics pipeline types has its roots in our collaborative agile architecture process [43]. In this section we summarize the lessons learned and limitations of our investigation.

A. Lessons Learned

The technology choices for analytics alternatives are rapidly evolving. Basing our architecture on the current (or even proposed future) industry solution is risky at best. The architectural decisions need to be driven by business and common customer needs with the potential for generating business value. Our architecture is described in terms of expectations and the capabilities needed to address those concerns.

Systematic pipeline type selection is enabled by our evaluation process, and over time each pipeline type becomes easier to deploy. However, it is crucial that the experience gained from creating an implementation is captured in detailed steps so that the next implementers do not start from scratch. Validation of the business value delivered should be added to the community knowledge to improve confidence in the specific pipeline ratings.

Very few industrial organizations have the resources and the luxury of extended time to market needed to develop custom analytics infrastructures. Leveraging open source community work products can leap business initiatives forward. That said, combining disparate release versions to create best-of-breed deployments is fraught with pain. It is more efficient to start with a commercial analytics framework distribution where those issues have already been resolved.

B. Limitations

Our architecture pattern is not an optimal solution. It is a trade-off between flexibility and complexity: complexity in IT operations as well as a lack of data transparency. Each pipeline type brings with it a set of algorithm development styles that may not be compatible with the engineering staff holding essential subject matter expertise for the application domain.

The pipeline type rating process includes surveying technology experts for their guidance on the value of each pipeline property. An expert with an affinity for a specific pipeline type may hope that all properties are maximized, based possibly on the future potential of the technology stack as opposed to its current practice. The ratings are therefore driven by subjective perceptions and would benefit from a significantly larger sample audience for consistency.

VII. CONCLUSION AND FUTURE WORK

Industrial analytics provides a framework to derive actionable knowledge for critical decision making from data generated and collected from industrial systems, machines, equipment, and industrial processes. The primary raw material in industrial analytics is data. Data generated in industrial systems can originate from various sources (equipment sensors, process sensors, humans, computers, etc.), can take various forms (numerical thresholds, measurement data, extrapolated values, textual data, pictorial data, etc.), and can arrive at a large range of rates (millisecond, second, or minute intervals). Similarly, depending on the types of decisions that need to be made in industrial systems, the analyses of the generated data require responses within a large spectrum of time intervals (real time, near real time, every minute, every hour, every week, etc.). Throughout our experience in developing architectures for a variety of industrial analytics applications, the concept of analytics pipelines has played a key role in defining the "right" architecture.

Analytics pipelines provide a framework for a product line environment in which to develop industrial analytic applications. The analytics pipeline approach provides a core platform that delivers a functional infrastructure for developing architectures of industrial analytic applications that satisfy their quality attributes. The core platform described in this paper has proven to be very flexible in providing different quality trade-offs for architecting industrial analytic applications. An analytics pipeline provides an environment with specific technology choices that can store data and execute analytics that result in actionable knowledge.

The work presented in this paper will continue to evolve as new applications are developed. For example, new quality attributes such as security, availability, and maintainability will need to be formalized and incorporated into the current analytics pipelines structure. Moreover, analytics technology continues to advance at a very rapid pace, and this evolution is expected to bring new dimensions that will need to be considered in the analytics pipelines platform.

ACKNOWLEDGMENT

We would like to thank ABB Corporate Research for sponsoring and supporting this work.

REFERENCES

[1] K. C. Kapur and M. Pecht, "Reliability Engineering," John Wiley and Sons, Hoboken, New Jersey, 2014.
[2] F. Ruggeri, R. Kenett, and F. Faltin, "Encyclopedia of Statistics in Quality and Reliability," John Wiley and Sons, Hoboken, New Jersey, 2007.
[3] T. Hey, S. Tansley, and K. Tolle, "The Fourth Paradigm: Data-Intensive Scientific Discovery," Microsoft Research, 2009.
[4] W. Ruh, "Bringing Dark Data into the Light," Bits and Bytes blog, http://billruh.geblogs.com/2014/09/05/%EF%BB%BFbringing-darkdata-into-the-light/, September 2014.
[5] T. White, "Hadoop: The Definitive Guide," O'Reilly Media, Sebastopol, CA, 2012.
[6] N. W. Grady and W. Chang, "Big Data Interoperability Framework: Definitions," ACM (preprint used with permission), 2014.
[7] F. Provost and T. Fawcett, "Data Science and its Relationship to Big Data and Data-Driven Decision Making," Big Data, vol. 1, no. 1, pp. 51-59, March 2013.
[8] E. Collins, "Big Data in the Public Cloud," IEEE Cloud Computing, vol. 1, no. 2, pp. 13-15, July 2014.
[9] N. Marz and J. Warren, "Big Data: Principles and Best Practices of Scalable Realtime Data Systems," Manning Publications, 2013, ISBN 9781617290343.
[10] M. Shaw and D. Garlan, "Software Architecture: Perspectives on an Emerging Discipline," Prentice Hall, 1996, ISBN 0131829572.
[11] J. Hamson, "Urban Network Development," Power Engineering Journal, 2001, pp. 224-232; L. K. Hamachi and J. Eto, "Understanding the Cost of Power Interruptions to U.S. Electricity Consumers," September 2004.
[12] P. Heine, J. Turunen, M. Lehtonen, and A. Oikarinen, "Measured Faults during Lightning Storms," Proc. IEEE Power Tech 2005, Russia, 2005, pp. 1-5.
[13] O. Quiroga, J. Meléndez, and S. Herraiz, "Fault-pattern discovery in sequences of voltage sag events," 14th International Conference on Harmonics and Quality of Power (ICHQP), 2010, pp. 1-6.
[14] J. Bosch, "Design & Use of Software Architecture: Adopting and Evolving a Product-Line Approach," Addison Wesley, 2000.
[15] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proceedings of the 6th Symposium on Operating Systems Design & Implementation, p. 10, December 2004.
[16] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System," Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST'10), pp. 1-10, May 2010.
[17] Apache Hadoop, http://hadoop.apache.org/, retrieved in November 2014.
[18] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica, "Spark: Cluster Computing with Working Sets," Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, pp. 10-10, Boston, MA, June 2010.
[19] Apache Spark, http://spark.apache.org/, retrieved in November 2014.
[20] Apache Spark Team, "MLlib Programming Guide," http://spark.incubator.apache.org/docs/latest/mllib-guide.html, retrieved in November 2014.
[21] Y. Low, D. Bickson, J. Gonzalez, C. Guestrin, A. Kyrola, and J. M. Hellerstein, "Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud," Proceedings of the VLDB Endowment, vol. 5, no. 8, pp. 716-727, 2012.
[22] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein, "GraphLab: A New Framework for Parallel Machine Learning," arXiv preprint arXiv:1006.4990, 2010.
[23] R. Ihaka and R. Gentleman, "R: A Language for Data Analysis and Graphics," Journal of Computational and Graphical Statistics, vol. 5, no. 3, pp. 299-314, 1996.
[24] R Development Core Team, "R: A Language and Environment for Statistical Computing," Vienna: R Foundation for Statistical Computing, 2004.
[25] R Development Core Team, "R for Machine Learning," http://cran.r-project.org/web/views/MachineLearning.html, retrieved in November 2014.
[26] M. Hall, "Mining Big Data using Weka 3," http://www.cs.waikato.ac.nz/ml/weka/bigdata.html, retrieved in November 2014.
[27] M. Hall, "Weka 3: Data Mining Software in Java," http://www.cs.waikato.ac.nz/ml/weka/, retrieved in November 2014.
[28] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, "The WEKA Data Mining Software: An Update," ACM SIGKDD Explorations Newsletter, vol. 11, no. 1, June 2009.
[29] L. Bass, P. Clements, and R. Kazman, "Software Architecture in Practice," Second Edition, Addison-Wesley, 2007.
[30] Hortonworks, http://www.hortonworks.com/, retrieved in November 2014.
[31] Cloudera, http://www.cloudera.com/, retrieved in November 2014.
[32] Apache Hadoop NextGen MapReduce (YARN), http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YARN.html, retrieved in November 2014.
[33] AMPLab, University of California at Berkeley, https://amplab.cs.berkeley.edu/, retrieved in November 2014.
[34] Databricks, http://www.databricks.com/, retrieved in November 2014.
[35] Apache Cassandra, http://cassandra.apache.org/, retrieved in November 2014.
[36] Apache Hive, http://hive.apache.org/, retrieved in November 2014.
[37] Apache Pig, http://pig.apache.org/, retrieved in November 2014.
[38] Apache Lucene, http://lucene.apache.org/core/, retrieved in November 2014.
[39] Apache Solr, http://lucene.apache.org/solr/, retrieved in November 2014.
[40] Apache Storm, https://storm.apache.org/, retrieved in November 2014.
[41] Apache S4, http://incubator.apache.org/s4/, retrieved in November 2014.
[42] Spark GraphX, https://spark.apache.org/graphx/, retrieved in November 2014.
[43] K. E. Harper and A. Dagnino, "Agile Software Architecture in Advanced Data Analytics," Software Architecture (WICSA), 2014 IEEE/IFIP Conference on, IEEE, 2014.
[44] V. Dhar, "Data Science and Prediction," Communications of the ACM, vol. 56, no. 12, pp. 64-73, December 2013.
