Efficient compilation of large rule bases using logical access paths

June 13, 2017 | Autor: Timos Sellis | Categoría: Information Systems, Accessibility, DATABASE MANAGEMENT SYSTEM, Database, Rule Based, Information System

Share Embed

Laporkan tautan ini

Descripción

0306-4379190 53.00+O.OO Copyright mc 1990 Pergamon Press plc

hjormarion SystemsVol. 15, No. I, pp. 73-84,1990 Printed in Great Britain. All rights reserved

EFFICIENT

COMPILATION USING LOGICAL

TIMOS K. SELLIS,?

Computer Science Department,

NICK

OF LARGE RULE BASES ACCESS PATHS

RoussoPouLost

and RAYMOND T. NG

University of Maryland, College Park, MD 20742. U.S.A

(Received for publication 18 October 1989) Abstract-This paper presents an efficient way to compile and store large deductive databases. Deductive rules are modeled through logical access paths and then integrated into a global structure, the Rule-base Access Path Schema (RAP Schema). The RAP Schema stores information on the interrelationships that exist among the computations required for rule executions. Ground data as well as data derived through some rule executions and cached for later reuse are also modeled in the RAP Schema. Some of the advantages of using this rule base organization scheme include the automatic detection of shared computations, the inclusion of cached results of rule executions and query results in the knowledge base. The details of the compilation and query processing phases are the focus of the paper.

1. INTRODUCTION The problem of defining and processing large rule bases has attracted a lot of interest recently. It has been widely recognized that expert systems will be used in many future applications, including manufacturing and comengineering processes, munications. Such systems utilize a very large number of rules, which, according to Hayes-Roth [l], is expected to reach the 100,000 figure by 1990. It is conceivable that such a large knowledge base cannot, and perhaps should not, for space reasons, reside in This is exactly the point where main memory. database management systems (DBMS) come to play. In particular, relational database systems seem very attractive due to the common foundations they have with logic programming, namely first-order logic [2]. In the last 5 yr we have seen an “explosion” in database research related to logic programming. The basic idea has been the introduction of logic in the database world, not only as a sound foundation for formalizing concepts, but as a query language as well. This paper discusses an issue that the authors believe to have a significant impact on the performance of any system working on a large rule base-the problem of compiling, storing and accessing a large set of rules. It is assumed that the underlying media for such a system is a relational database which is used to store data (or facts). Many projects currently under way make this assumption. The effect this has on the design of the system is that eventually user requests will be translated to calls to the underlying DBMS,

tAlso with Advanced

the University of Maryland Computer Studies (UMIACS).

Institute

for

which of course stores its data in secondary storage. Therefore, one must make sure that the final queries produced are processed efficiently. We have witnessed a lot of significant efforts in this direction, which shows the increasing interest. However, most of the proposals, with the exception of [3], assume that the rule base is given in a “flat” form, rules are simply stored in this system and called upon, if needed, to further expand a user query. Once all rules have been substituted with their definitions, the resulting query is processed by the DBMS. Intelligent algorithms for unification as well as recursive-query processing have been devised to assist in providing efficient execution plans. It is clear that given a rule base, one would like to compile it, identify the various “access paths” and store the resulting information in an intelligent way that will allow efficient query processing. This has the desirable effect that existing knowledge about the infrastructure of the rule base (e.g. which rule definitions use which other rule definitions) can be identified only once and be used later. In this paper we extend this idea of rule compilation based on the concept of Logical Access Path Schema (LAP Schema) [4]. The LAP Schema provides an integrated representation of shared assess paths. The explicit detection and representation of alternative and/or common subpaths was the basis of optimization of a LAP Schema as described in [5]. Since the execution of a rule includes one or several alternative access paths, many of which are common to other rules as well, the compilation of a rule base into a LAP Schema and the use of the latter for improving efficiency of rule execution seems very appropriate. Another advantage of the LAP Schema is the ability to support cached results. Frequently-used access paths may be executed and the results be stored for

74

TIMOS

K.

future reuse. The above organization can support materialized results in an efficient way using compilation and integration techniques. Finally, we provide intelligent mechanisms for query processing. Access paths are detected from the LAP Schema and further processed based on the ideas of MultipleQuery Optimization [6]. Such techniques allow sharing of the execution effort among several subpaths. The organization of the paper is as follows: Section 2 presents the basic idea behind the rule base compilation and describes both previous work and our approach based on logical access paths. Then in Section 3 we study the problem of rule integration, that supports dynamic environments where new rules are added to or deleted from the system. Section 4 then looks at the problem of query processing. We conclude in Section 5 with a summary and future research issues. A preliminary version of this work appeared in [7]. 2. RULE COMPILATION

unique rule identifier predicate name number of arguments argument list

and the schema of RULEBODY ruleid: bodyno: pname: numarg: arg:

is:

unique rule identifier body number predicate name number of arguments argument list.

For example, the following rule base: Rl. R2. R3. R4.

parent(X, Y) tfather(X, Y) parent(X, Y) +mother(X, Y) ancestor(X, Y)+parent(X, Y) ancestor(X, Y)cparent(X, 2) ~ancestor(Z,Y)

will be represented as

ruleid RI R2 R3 R4

RULE Table head numarg parent parent ancestor ancestor

2 2 2 2

al.

ruleid

RULEBODY Table bodyno numarg pname

Rl R2 R3 R4 R4

arg (X, (X, (X, (X,

Y) Y) Y) Y)

1 1 1 1 2

father mother parent parent ancestor

2 2 2 2 2

arg (X, (X, (X, (X. (Z,

Y) Y) Y) Z) Y)

The original rule can be easily reconstructed by taking the natural join of RULE and RULEBODY on the field ruleid. Clearly the above method is straightforward but one needs to provide support for the efficient execution of the join, or in general the search for rule descriptions, given the head predicate and the type of arguments, i.e. variables, constants or even function symbols. Such a method is described in [9, lo], where it is suggested that a rule such as R4 be represented as: c(ancestor(X,

One straightforward way of organizing rule bases is what we will call the “flat” method. Rules are simply stored in relations in the underlying DBMS and retrieved if needed. For example, in [8] where Prolog is interfaced to a relational database management system (INGRES), two relations are used to represent rules in a database, namely RULE and RULEBODY. The schema of RULE is: ruleid: head: numarg: arg:

SELLIS et

Y),

parent(X, Z),

ancestor(5 Y)))

and stored as textual data in the RULE relation. In addition, an index based on multi-attribute hashing techniques is used. Each relation field determines a bit string and the bit strings of all the fields account for the tuple descriptor, which is stored in the index. It is a rather sophisticated scheme that takes into account both the predicate names and the types of the arguments. In the same paper the authors suggest several optimizations for such hashing schemes to improve search performance. However, they recognize the fact that rule bodies may have variable length and they suggest that each set of rules defining the same predicate be grouped into a separate relation. This in turn makes access more expensive, since retrieval of a set of rules requires accessing multiple relations. Moreover, a disadvantage of the method lies in its lack of recording the basic structure of the rule base, i.e. the interdependencies among the various rules. Kellog et al. recognized the above fact in [3]. They suggest that in addition to some kind of an index like the one presented above (Rule/Goal Index), a Predicate Connection Graph (PCG) should be maintained to reflect the high-level interrule structure. For example, Fig. 1 shows the predicate connection graph and the rule/goal index for the above example rule base. Modeling interrule associations with the rule/goal index provides an efficient mechanism for identifying applicable rules. However, we find that the PCG is rather general and does not capture information enough to speed-up query processing as well. For example, the proof schema of [3] does not model adequately sharing among rule executions which greatly affects query processing in the cases where sharing of data accesses is observed at a great extent [6].

Efficient compilation of large rule bases

----

Pred

T

__--_______-_----___--

Rule/Goal

Index

Fig. 1. Example

of a Rule/Goal

Our approach is intended to model interrule associations at a deeper level. There are two major goals in this effort. First, we would like to represent rules at a level which facilitates query processing as well as the detection of relationships. Compilation of rules at the “intensional” level, similar to the PCG, does not add very much to the performance of the query processor, since it fails to reveal the details of the computations to be performed. Second, one may decide to cache the results of executing some rules due to the frequency in which such rules are used. This feature is especially interesting in the case of static databases but it has been proven to be useful for view processing as well as in very dynamic environments 14, 5, 111. Following the ideas of [4] we choose to model the PCG at a much deeper level, the access path level, using the idea of constructing Logical Access Paths (LAP). A Logical Access Path (LAP) is a logical representation of the computation involved in answering a query made against the database. It consists of a target relation representing the answer to a given query, a set of base relations, a set of intermediate relations created during the process of answering the query and the sequence of the relational operators

14

(4 Fig. 2. Graphical representation (a) selection: (b) projection;

I

Index and PCG

invoked in answering the query. LAPS are integrated in a schema, called the Logical Access Path Schema (LAP Schema) [4], so that the sub-paths shared by several LAPS are collapsed together and uniquely represented. This is one of the unique advantages of using LAPS, i.e. the exploitation of possible shared computations. In order to build cost-effective LAP Schemas, the frequency of subpath sharing was the basis of the collective optimization technique developed in [5]. The LAP Schema can be represented by a unique directed graph in which the nodes are labeled with the names of the relations and/or results of computations. Nodes are related to each other through the relational operators represented by directed arcs. The arcs that correspond to some of the well-known unary and binary relational operators are shown in Fig. 2. Since each rule is a conjunction of relational operators, the LAP Schema is an integrated representation of all the rule plans of execution. Originally, the LAP Schema was acyclic without an explicit representation of recursive statements. We extend it here to represent recursive definitions as well. Figure 3 shows the LAP Schema for the example rule base of Fig. 1. The top level nodes correspond to base

(b)

(4 of relational operators [4]: (c) join; and (d) union.

Fig. 3. Using LAP Schema to represent recursive rules (F, M, P, A stand for father, mother, parent and ancestor, PO is a temporary result).

TIMOSK. SELLISef al.

16

relations while the remaining nodes represent target and/or intermediate results. Note the multiplicity of the paths from parent to ancestor. Queries are also represented by directed but acyclic query graphs. The objective is to integrate a given query graph with the current LAP Schema in order to find alternative access paths that can be used to answer the query and select the one that will minimize the execution time. In order to achieve this, all the connections among the relations in the Schema and the relations in the graph have to be identified. This process is described in detail next.

3.2. The Rule-Base Access Path Schema Given the above definitions, processing a ruie can be thought of as a sequence of two major steps. First, a set of selections and joins is performed over a set of predicates, the ones appearing in the body. Then, the predicate in the head of the rule is constructed using a projection operation. We will refer to such a sequence of operations as the rule execution plan, or just ruleplun if this does not cause any ambiguity. For notational simplicity we will represent rule plans as follows: let

3. THE RAP SCHEMA This section first presents a model for rules and describes a structure for representing rule interrelationships, the Rule-base Access Path Schema (RAP Schema). It then discusses the algorithm for integrating rules into a RAP Schema, i.e. rule compilation. The last part of this section describes how the RAP Schema supports deletion of rules and negated predicates. 3.1. A model for rules We assume that a rule base R is given as a set of U-predicates (for User -defined predicates) (PI, P2, _. . , P,), each predicate defined on a set of attributes. A rule is comprised of a head and a body. The body of a rule is a conjunction of U-predicates, selection and join clauses. A selection clause is an S-predicate (for System predicate) which has the form OP(A, cons), where A is an attribute of some U-predicate, OP E {EQ, NE, LT, LE, GT, GE} and cons some constant. A join clause is a clause of the form P,(xl,xz,...,~,)APz(ylryz,...,yrn), where P, , P, are U-predicates and xi, yi are attributes of P, and Pz, respectively. Notice that the above model supports not only the standard join operator in relational algebra, but also other operators such as: Intersection-This is the case where the two U-predicates have exactly the same number of identical attributes, i.e. P,(+,X2,.‘~,%) A P&, x2, . . . , x,). Cartesian product-This is the case where the two U-predicates have no attribute in common, i.e. the general form P,(x,, x2,. . ,x,) AP,(Y,,Yz,. . * ,.Y,,). We will use the term “predicate” hereafter to indicate U-predicates unless there is confusion. The head of the rule is used to incorporate the projection operator, i.e. a way to limit the number of attributes that appear in the head predicate. In addition, there may be more than one rule and with the same head predicate. This corresponds to the union operator in relational algebra, and we will use this operator in the following to model multiply defined predicates.

AM%,,

~..,X,,iJA

..’ AP,(X”,,,...,X,,i”)

be a rule. Pi, i 2 1 can be either a U- or S-predicate. The corresponding rule plan is:

where: X=the set {xO,I,X~,~,...~X~,~“}~ nx = the projection on X, J= the sequence of joins performed in the body, and cry= a set of selections performed on U-predicates and on attributes in set Y. The above plan is the canonical form we use to represent rule plans. Clearly, there may be more than one way (i.e. rule plan) to process a series of relational operators. Transformation rules that are used to generate equivalent, but more efficient plans, will be discussed in the following subsection in the context of integration rules. As mentioned in the previous sections, given a rule base, one would like to organize the rules in some kind of network to represent the interrelationships among various rules. We will use the idea of logical access paths to construct a similar structure based on rule plans, instead of just the predicates involved in rule definitions. That is, we will attempt to include the info~ation contained in both the PCG and the Rule/Goal Index of Fig. I, in a single more complex structure, the Rule-Base Access Path Schema (RAP Schema). The RAP Schema is a labeled direction graph RAP(V, A, L,, L,), V, A, L, and L, being the set of vertices, arcs, vertex labels and arc labels, respectively. Vertices correspond to predicates and results of selection, projection, join and union operations. Arcs are used to show how each operation uses its arguments. Labels of vertices are defined as follows: l l

The label of predicate P is P. The label of a vertex corresponding to an intermediate result, i.e. the result of performing some operation on a predicate or a set of predicates, is a unique, system-generated label.

17

Efficient compilation of large rule bases

(‘4

identical to C, or D is an instance of C, then D is subsumed by C. Variable substitution exploits the transitivity of relational predicates, such as GT(r, t) = GT(r, s) A GE(s. t), to extend subsumption. For example, let C = R(u, c) A GT(c, 5) and D = R(x, p) A GT(y, IO) be two clauses. Clause C does not subsume D using the subsumption algorithm. Let C be transformed using the variable substitution C’ = R(u, c) A GT(a, z) A GE@, 5). Applying now the subsumption algorithm we can find a substitution B = {x,/u, IO\z} that makes C’B ED [tautology GE(lO,5) is removed]. Therefore, C’ subsumes D. Variable substitution may be necessary in both clauses. For example let C = R(x, jj) A GT(y, 10) and D = R(u, v) A GE(v, 5). Clause D does not subsume the body of C using the subsumption algorithm. Variable substitution on C produces C’= R(x, y) A GT(t, -_)A GE@, 10) which still does not either subsume or is subsumed by D. However, if we apply variable substitution on D to obtain D’ = R(u, v) A GE(c, w) A GE(w, 5) then using the substitution 0 = {x[u, (0,lO[~i} we obtain D’O = R(x, J) A GE(y, IO) A GE(lO,5) which clearly subsumes C’. The variable subsumption is systematically applied to any clause containing a relational predicate containing a constant v using the folIowing axioms:

y/a,

(4

(4

Fig. 4. Examples of RAP Schema: (a) selection; (b) projection; (c) join; and (d) union.

Finally, arc labels are defined as follows: OF

wF EF

U

to indicate a selection clause F, to indicate a join clause on the set F of attributes, to indicate a projection on the set F of attributes, and to indicate a union of two predicates.

Arcs corresponding to binary operations are labeled with identical labels. Figure 4 shows the operations shown in Fig. 2 of the previous section and the corresponding RAP Schemas. A RAP Schema is built by incrementally adding new rule plans whenever new rules are defined. The construction process, referred to as rule integration, will be discussed next. 3.3. The transformation rules To avoid the confusion between the deductive rules and integration rules, we will refer to the latter ones hereafter as transformations. The first class of transformations corresponds to the relational operators directly related to the generalization hierarchy, i.e. selection (a) and union (u). The second class corresponds to the operators related to the aggregation hierarchy, i.e. join (w) and projection (n). Clearly, deductive rules are defined in terms of the above operations. A transformation is applicable if the result of applying the rule can be derived from one or more results of other rules by applying some operator 0. When a transformation can be used, the operator 0 is integrated in the RAP Schema as an alternate path. 3.3.1. ~~~surn~tio~. The generalization rules are based on the concept of subsumption of the formulas that correspond to the rule definitions. Definition [ 12]--A clause C subsumes a clause D if and only if there is a substitution 8 such that CY SD. D is called a subsumed clause. For example, if C = P(x) andD=P(a)VQ(a)and6J={aIx),thenCB=P(a). Since CB CD, C subsumes D. Note that if D is

y

GT(u, w) = (3v)GT(u, v) AGE(v, w) GE(u, w) = (3v)GE(u, v) AGE(v, w) LT(u, w) = (Jv)LT(u,v) A LE(v, w) LE(u, w) = (3v)LE(u, v) A LE(v, w) 3.3.2. Subsumption transformations. Let E be an algebraic expression that corresponds to the body of a rule. We define the partial order I on the set of algebraic expressions as follows: E, < EZ if we substitute the same predicates in both E, and E2 and the results obtained for E, is a subset of that corresponding for E:. Equivalence E, = El is defined if both E, < El and E, < E, hold. Using the subsumption as defined above, we obtain the following algebraic transformations:

1. Selection subsumption: crF(E) I uF. (E) iff F’ subsumes F (subsumption here is meant with or without variable substitution). 2. Projection subsumption: n,(E) I: n,,(E) iff FS F ‘. 3. Join subsumption: E,w&,IE,w~E, iff F’cF. 3.3.3. Negation transformatjon. The following transformation provides support for negated predicates in a RAP Schema. The last part of this section describes in detail how these transformations are used. Commuting selection and not operator: o,[nor(E)] = a,{not[a,(E)]).

TIMOSK. SELLIS et al.

78

3.3.4. Identify transformations. The following are algebraic transformations based on the equivalence of algebraic expressions [ 131. Such transformations are used to exploit identical access paths: 1. Commutative

AtCADAS(B),

law for joins:

E, w,Ez = Ez wFE, .

where variables are omitted for brevity, is the sequence of component rules:

2. Cascade of projections: I[&.

In order to adopt a canonical representation for rule plans, joins are performed on pairs of predicates in their lexicographical order. For example, the rule plan for the rule:

B,+S(B),

(E)] = n,(E) where FE F ‘.

B, +J@, , Ch

3. Cascade of selections:

A +-J(B,, D). QF,I5~#91=

4. commuting

5~2[5, (-91 =

selection and projection: if F ’ involves

nF[bF. (E)] = bF. [x#?)] attributes in F. 5. Commuting

CF,A&G. only

selection and join:

5F(EI wF.E2) 5 o~(E,)w~. Fare attributes of E,,

E, if all attributes of

provided 5f(El P+E,) E 5~~(E,)~~5F*(E~) that F = Fl A Fz and F, involves only attributes of E, and Fz only attributes of E,, provided a,(E, b+‘E,) = 5,,[5,,(E,)b+Ez] that F = F, A F2 and Fl involves only attributes of E, , but Fz involves attributes of both E, and E,. Commuting

selection and union:

Each component rule in the rule plan is integrated into the RAP Schema in the plan order. The integration algorithm consists of the following two phases. Given a rule of the form mentioned above, the first phase finds the predicates E, and T in the schema, and adds the appropriate arcs from Eis to T, if necessary. The second phase uses the transformations presented in the previous subsection to identify predicates in the RAP Schema which either subsume T or are subsumed by T. Appropriate arcs are then added to reflect the interrelationships. The algorithm is shown in Fig. 5. The inputs are a RAP Schema, possibly empty, and a component rule. The output is a new RAP Schema. As an example, Fig. 6 shows the RAP Schema for the

5~6%u~:,)=uF& ba,tEd. Commuting

projection

with join:

provided zF(EI D+ E2) = H.=,(E,)B+IcF~(F~) that F = F, uF, and that FL are attributes of E, and F2 are attributes of E,. commuting

projection with union:

Subsumption, negation and identity transformations are used to integrate rules in a RAP Schema. The integration algorithm is presented in the next subsection.

Integrate (RAP,R) A component rule R is of either one of the following forms: 1) T+ S (E,), 01 2) T+ J (El,&

).

begin ,/*Phase 1: batch

i~ca~~~g

(1.1)

Mark

(1.2)

for each predicate

all predicates

rule with existing mm *,I in RAPas

begin Find E, in RAP, If Ei does not exist then create Ei in RAF, end; Find Tin RAP; If Tdoes not exist then create Tin RAP, Add ares from Ei’s to T as stated in the rule R, if these arcs do not exist already ; Mark E.‘s I and T as “tested”;

3.4. The integration algorithm.

(1.3) (1.4)

The integration algorithm can be used for rule base compilation or for the integration of a newly-defined rule with an existing RAP Schema. As mentioned at the beginning of this section, the head and the body of a rule is used to construct a rule plan which consists of a sequence of “component” rules. Each of these component rules is of either one of the following forms:

/* Phase 2: Checking transformation

(1) a section clause: T+S(E,) where S represents a selection on predicate E,, or (2) a join clause: TeJ(E, , E2) where J represents a join on predicates E, and E,.

“untested”;

Ei do

(2.1)

for each predicate

F that

rules */ is “untested”

do

begin Check transformation rules for F and T Add arcs from F to T or vice versa if appropriate; Mark F as “tested”; end;

end; Fig. 5. The integration algorithm.

19

Efficient compilation of large rule bases

that no predicate is checked with T more than once. Due to the fact that arcs added in the second phase of the algorithm are independent of each other, it is sufficient to check each predicate with T only once. Compared with the original algorithm described in [7], this algorithm allows predicates that are in a cycle to remain in the Schema and therefore be used in future integration. For example, consider the RAP Schema for the following four rules: Ql. Q2. Q3. 04.

Fig. 6. An example of RAP Schema.

following

rule base:

Rl. R2. R3. R4. R5. R6. R7.

P4(x, y) P5(y, z) P67(x, y, z) P65(x, y, z) P2(x, y) P3(y, z) Pl (x, y, z)

+-PG(x, tP-l(y, tP6(x. +P6(x, + P4(x, +P5(y, +PZ(x,

y) z) y) y) y) z) y)

AGT(x, AGE(z, A P7(y, A P5(y, A GT(x, A GT(z, A P3(y,

5) 2) z) z) 10) 5) z)

Despite the possible existence of recursive rules, and thus cycles in the RAP Schema, the integration algorithm always terminates. Like the standard depth-first search algorithm for directed graphs, the integration algorithm marks predicates and ensures

(a)

(c)

After

“Person” “Taxable”

After

integrating

01 +Q2

phase

1 of integrating

Q4

Person(x) cEmployee(x) Person(x) +Owner(x) Employee(y) +-Person(x) Taxable(x) +-Person(x)

A Manages(x,

y)

Figure 7 shows the various stages of the integration. Note also that this algorithm integrates each component rule in an iterative manner, whereas the original solution integrates each level of a rule plananalogous to a component rule-in a recursive fashion. The integration algorithm shown in the Appendix is an optimized version of the one shown in Fig. 5. One optimization is the labeling of predicates in cycles so that these predicates can be processed separately by efficient techniques for evaluating recursive rules [14]. Furthermore, to reduce the number of times the transformation rules are tested, various conditions are included in the algorithm to filter off predicates which bear no relationship with T. 3.5. Deletion of rules As a functional opposite of integration, deletion of rules from a RAP Schema is worth supporting. The

(b)

IdI

After

After

and ore

marked “tested”.

Fig. 7. Integration of recursive rules.

integrating

integrating

~3

Q4

TWOS K. SELLIS etal.

80

Schema resulting from such an operation is identical to the Schema for the remaining rules. For example, if R4 of the rule base shown in Fig. 6 is to be deleted, the predicate P65 and all its arcs are removed. But deleting R6 is less straightforward as it requires removing the arcs from P5 to P3, from P7 to P3, and from P65 to PI, while retaining P3 and the arc from P3 to PI. Thus, rule deletion in general may involve predicates and arcs that are not directly specified in the rule being deleted. To facilitate the removal of arcs and predictes from a RAP Schema, the arc label of an arc is extended to include a rule tag which records the numbers of the rules through which the arc was initially created. These rule tags are generated during rule integration-at the same time when arcs are added. Specifically, in Step 1.4 of the integration algorithm in Fig. 5, the rule tag is the number of R. As for Step 2.1, the rule tag includes the numbers of all the rules used in transformation testing. Figure 8 shows the RAP Schema with rule tags for the rule base shown in Fig. 6. Deletion of rules now involves checking the tags and removing the appropriate arcs. A predicate is removed from the RAP Schema if no arc is incident on it. This process can be further enhanced by indexing the arcs on their rule tags. 3.6. Supported for negated predicates The model and operations for rules presented in the previous subsections can be extended to allow negated predicates to appear in the body of a rule. While we regard the computational and theoretical aspects of negation (such as negation as failure, stratification theory, etc.) beyond the scope of this paper, we focus here on supporting negated predicates at the RAP Schema level, i.e. at the level of access paths. The first extension is the addition of a not operator which defines a negated predicate in terms of its positive counterpart. The order of the rule plan from extended therefore is %V(er)l to

Fig. 9. RAP Schema with the nof operator.

~*~~~~*(~ot)J~. However, the not operator alone is not sufficient to promote sharing of access paths among rules. Consider for instance the following two rules: Sl . Q(x)tnot S2.

(P(x))

R(x)+P(x)

A LE(x, 10)

A LE(x, 7)

Their RAP Schema as shown in Fig. 9 does not exploit any sharing between the rules. But since the not operator can be expressed in terms of the difference operator in relational algebra, selection can be pushed across the not operator. The transformation rule:

a&ot(E)l = oy(nof[ad~)l), of Section 3.3.3 is used. Figure IO shows the effect of

this rule on the RAP Schema of 51 and 52. Notice that new temporary predicates such as PI are introduced. While in general we try not to add temporary predicates to the Schema, here we rely on such predicates to support a high degree of sharing. Demorgan’s kaws are also used to enhance sharing of access paths. Figure 11 is the RAP Schema after integrating the rule: 53. T(x)+not((P(x))

A EQ(x, 2)

0

RI .R6

no1

P2

t>0 Fig. 8. RAP Schema with tags.

Fig. 10. Commuting section with not.

Efficient compilation of large rule bases

Fig. I I. RAP Schema after integrating S3.

into the Schema of Fig. 10. If the meaning of the predicate P3 is regarded as “in P and (x = 2)“, then with an application of Demorgan’s laws, the meaning of P4 can be defined as “not in P or (X # 2)“. Similarly, since the meaning of P2 is “not in P or (x > IO)“, it is subsumed by P4. The arcs from P4 to 0, from P2 to T and from Q to T are created in the same way. Note, however, that some of these arcs may have little practical value, as all possible arcs are shown here for illustrative purposes. Furthermore, Demorgan’s laws can be used in situations similar to the following example. Consider the RAP Schema for the rule base: Tl.

P(x)+-not(Q(x))

A R(x)

T2. S(x)tnot(P(x)) As shown in Fig. 12, the Schema includes the arcs from 0 to S and from Rl to S, the result of applying Demorgan’s laws to the definition of P. Such use of Demorgan’s laws is advisable if Rl already exists in the Schema for the benefits of other rules.

Fig. 13. An example RAP Schema.

4. QUERY PROCESSING As mentioned in Section 2, queries are processed in a way similar to rule integration. The simplest case is a query which asks for the result of a rule execution without any variable instantiation. In this case, the RAP Schema is used to suggest the possible ways of deriving the answer to the query. Example 1

Consider the rule base of rules RI-R7 of Section 3 and shown in Fig. 13. Dark circles represent ground predicates or rules whose results have been cached due to prior executions. A query that asks for PI (x, y, z) can be performed in many ways, namely those that perform selections in various orders. However, the system identifies as the most promising method the following: 0 since this execution plan can use already stored results. The general case of queries with arbitrary conditions is handled similarly to rule integration. That is, a query:

is thought of as a two-step process. First a new rule:

Q -0(X,,,,

. . >%.,,k+~Ih>

AP,GQ.,,..~>

Fig. 12. One use of Demorgan’s law.

X2.,* 1 A

”

1-~I.,,)

A P” (Xn,, .

3 X,,‘” 1

is assumed to have been defined, where variables , x,,,~ are the free variables of Il-predicates, X o., , i.e. variables not used in selection clauses. Then, rule Q-O is attempted to be integrated with the Schema. We do not have to include Q-O in the Schema; this process can only be used to identify possible ways of processing the given query. One case where query predicates may be included in the RAP Schema is discussed at the end of the section.

82

TIMOSK. SELLISet al.

Example 2

Suppose again that the RAP Schema of Fig. 13 is given and the following query is asked: +-PB(y, z) A GE(z, 8) Then, an attempt to integrate the corresponding

rule:

Q-O+-PB(y,z)AGE(z,8) with this the RAP Schema would detect that the following way of processing this query is the most promising one: [CD* PSI

q

if more than one possible execution plan is derived for a given query, the system eliminates the cost of each of these plans and selects the cheapest one to execute. Example 3

Consider asking the query: c Employee(x) against the RAP Schema shown in Fig. 7, where a cycle exists. To answer the query, the system first recognizes that the predicate is in a cycle. The Appendix discusses how predicates in cycles can be identified. The system then initializes Person with facts from Employee and Owner, and joins Person with Manages to derive additions to Employee and Person. Such iteration continues until no new addition to Employee is generated. In fact, the system may as well use one of the many well-known techniques for reducing computation cost of processing recursive rules ([14] provides a good survey of such methods). In general, recursive rule processing is considered as an optimization strategy to be used by the query processor, not a part of the rule base processor. Notice that this is also the case in [3] where recursive rules are recognized and processed separately. cl In contrast to other schemes, our proposal offers the significant advantage of exploring a high degree of sharing among rule execution plans. First, consider the case of a rule that has multiple definitions (union). Among such definitions there may be a high overlap on the predicates found in the bodies of the rules. Clearly, our integration scheme will be able to identify common s&expressions at compile time and use them as effectively as possible at execution time. For example, consider a rule consisting of 10 different definitions and a query that requests the result of applying such a rule on the database. Then all 10 rules must be processed to find the answer to the query. If these ruies share expensive operations such as joins on common relations or selections that subsume other selections, it will be of great advantage to the query processor to use such common access paths. The work of [6, 1.5-l 71 provides algorithms for the efficient use of such common results.

Second, as mentioned above, one may decide not to discard results of queries due to the frequency of incoming similar requests. Then the RAP Schema can be easily used to accommodate query results as weil. We mentioned earlier that queries are really new rule de~nitions followed by a simple request for the result of rule execution. Therefore, integrating queries permanently, in contrast to the “virtual” integration performed above, is as easy. In such a system, query results can be cached and reused if beneficial, in the same sense that Finkelstein [ 181, Larson [ 19) and Roussopoulos [ 11,201 have used them for relational database systems. Sellis in 1211 provides a discussion on caching schemes and algorithms that support such schemes. Keilog et al. [3] also suggested some sharing among execution paths, but only limited to identical results and within the processing of a single query. As caching has been recognized as a crucial processing tactic in many other contexts, such as in [.5,22] in the context of view processing, in [23,24] for PROLOG to relational database management system interfaces, and in [25] for new generation relational database systems, we find that our proposal will be of significant value to future systems supporting very large rule bases. Finally, we comment on the advantage of our proposal in terms of query optimization. As mentioned above and illustrated through examples, the RAP Schema exploits several possible access paths to be used by the query processor. In the past the common approach, with the exception of PROLOG, has been to let the rule processor do its work by substituting predicates with rule definitions and finally generating a query to be processed by some underlying data manager. Using the RAP Schema we have traded some processing time at rule definition time for more efficiency at execution time. That is, the compilation of the rules, although it does not depend on the underlying data, it provides a lot more knowledge about the structure of the rules. The result is a rule base representation which allows the interrelationship of the input rules. It is this rule base representation that can then be used by the query processor to identify multiple ways of processing a given query. This compilation saves a lot of processing time by not having to recognize common substructures and repeatedly optimize the query in the context of several rules. 5. CONCLUSIONS We have introduced a new method for compiling large rule bases on the idea of logical access paths. Such an organization provides more advantages compared to previous approaches, mainly enhancing the performance of the query processor. Interrule associations are detected at a deep operation level and are used to build an access path schema. Such a

Efficient compilation of large rule bases schema is useful for: (1) the automatic detection of tions, (2) the inclusion of cached executions, and (3) the inclusion of cached executions in the knowledge

shared computaresults results base.

of

rule

of query

Several levels of optimizations can be performed on such a schema. First, an analysis of space and time requirements can be attempted in the case where frequencies in which ruies are referenced are known and caching is used to avoid recomputation of frequently accessed paths [4]. Second, multiple-query processing and optimization can be applied on queries that require the execution of multiple paths [6]. Such processing will use common subexpressions to avoid multiple computations of the same subqueries. Finally, the RAP Schema provides muitiple access paths that can be used to evaluate the same rule, thus providing more information to the query processor in order to come up with the optimal execution plan. In general, the RAP Schema introduces a trade-off between compilation and query execution time. By making the compilation process deeper, we have reduced query execution time significantly, this effect being more significant in the presence of cached rule execution results. The performance of the integration algorithm can be improved using heuristics to prune down the number of transformations applied. Techniques similar to those used in query optimization [26] can be used for that reason. We have a prototype implementation of a system that supports compilation and integrates logical access paths to obtain a LAP Schema in the context of relational views (20,271. Since views are very similar to deductive rules this system can be used to obtain a rule compilation system as well. Currently, we are extending our prototype to accept multiple view definitions (union operator) and recursive views in order to achieve full support for rules. In parallel, we focus on the problem of query processing and the inclusion of additional transformation rules in the

rule integratjon algorithm. Such rules are used to model semantic information such as integrity constraints and functional dependencies which allow for the detection of even more common access paths and in some cases the removal of rules which are redundant. Acknariedgement-This research was sponsored partially by the National Science Foundation under Grant IRI8719458 and by UMIACS.

83

[3] C. Kellog, A. O’Hare and L. Travis, Optimizing the rule-data interface in a KMS. Proc. 12th. Int. Confi on Very Lurge Dara Eases, Kyoto, August (1986). [4] N. Roussopoulos. The logrcal access path schema of a database. IEEE Trans. Sofiware Engng (SE-8) 6. 563-573 (1982). [S] N. Roussopoulos. View indexing in relational databases ACM Trans. Database Svslems (7) 2. 2588290 (1982). 161 T. Sellis. Multiple-query optimization. ACM Trans. Database Systems (i3) 1, 23-52 (1988). [7] T. Sellis and N. Roussopoulos. Deep compilation of large rule bases. Proc-Second Int. Con/: on E_ynert Database Systems, Tysons Corner. April (1988).

I81 T. Sellis. S. Ghosh and C. C. Lin. Imolementation of * * a Prolog-INGRES interface. ACM SiGMOD Record. pp. 77-88 (1988). [9] K. Ramamohanarao and J. Shepherd. A superimposed codeword indexing scheme for very large prolog databases. Proc. 3rd Ini. ConJ: on Logic Progrumming. London (1986). [lo] J. A. Thorn, K. Ramamohanarao and L. Naish. A superjoin algorithm for deductive databases. Proc. i2fh Ini. Conf Very Large Dafo Bases, Kyoto, August (1986). [Ill N. Roussopoulos. The incremental access method of view cache: concept and cost analysis. .4CM Trans. on Database Systems (1987). [12] C. L. Charm and R. C. Lee. Svmbolic Logic and Mechanicul Theorem Prooing. Academic Press. New

York (1973). 1131 J. Uilman. Principles qf Daatabase Swems. Computer Science Press, Palo Alto (1982). 1141 F. Bancilhon and R. Ramakrishnan. An a~teur’s introduction to recursive query processing. Proc. @‘the I986 ACM-SIGMOD Int. Conf: on rhe Manueement of Data, Washington, D.C., May (1986). ” ” II51 U. S. Chakravarthy and J. Minker. Multiple query processing in deductive databases, Proc. 12th Int. Conf. on Very Lnrge Data Bases, Kyoto, August (1986). t161J. Grant and J. Minker. On Oplimizing the Eraiuation of a set of expressions. Technical Report TR-916, University of Maryland, College Park (1980). ]17] J. Park and A. Segev. Using common subexpressions to optimize multiple queries. Technical Report LBL23597, Lawrence Berkeley Laboratory, University of California, Berkeley (1987). S. Finkelstein. Common expression analysis in data[I81 base applications. Proc. f982 ACM-SIGMOD Int. Conf: on the Management of Data, Orlando, June (1982). u91 P. Larson and H. Yang. Computing queries from derived relations. Proc. Ilth Int. Conf. on Very Large Data Bases. Stockholm, August (1985). 1201N. Roussopoulos. Overview of ADMS: a high performance database management system. Invited Paper, Fail Point Computer Conf., Dallas, October (1987). T. Sellis. Intelligent caching and indexing techniques I-211 for relational database systems. Information &stems 13, 175-185 (1988). P21 P. Valduriez. Join indices. ACM Truns. Database Systems (12) 2, 218-246 (1987). u31 S. Ceri, G. Gottlob and G. Wiederhold. Interfacing relational databases and prolog efficiently. Proc. First In?. Confi on Expert Dafabase S,ystems. Charleston.

April (1986). ~41 Y. Ioannidis, J. Chen, M. Friedman and M. Tsangaris.

REFERENCES [I] F. Hayes-Roth. Invited Talk. IEEE Compcon, San Francisco, February (1987). [?] H. Gallaire and J. Minker. Logic and Data Bases. Plenum Press, New York (1978).

BERMUDA-an architectural perspective on interfacing prolog to a database machine. Proc. Second Inf. Conf on Experi Database Systems, Tysons Corner, April (1988). and L. Rowe. The design of P51 M. Stonedraker POSTGRES. Proc. I986ACM-SIGMOD Inr. Conf. on the Management of Data, Washington. May (1986).

TIMOS K. SELLIS et al.

84

Integrate2

(RAP,@

begin /* Phase

1: same

as before */

/* Phase 2: Checking transformation rules ‘/ for each predicate F that is “untested” do begin case 1: the rule R is of the form T+ S(E,) begin if F is a “select-project-descendant” of some G which is a “select-project-ancestor” of E,, and F is cached then begin Check transformation rules for F and T; Add arcs from F to T or vice versa if appropriate; end; end; 2: the rule R is of the form T+ J(E, ,I$) begin if F is a common “join-descendant” of some G, and G, which are “ancestors” of E, and E, respectively, and F is cached then begin Check transformation rules for F and T Add arcs from F to ‘I’or vice versa if appropriate; end; end; end case; Mark F as “tested”; end; /* Phase 3: Cycle Labeling */ for each cycle Ci in RAPdo begin Label each predicate in Ci so that recursive rules can be used; end;

efficient

techniques

for evaluating

end;

Fig. 14. An optimized integration algorithm. WI P. Selinger et al. Access path selection in a relational database management system. Proc. 1979 ACMSIGMOD

Inr. Conf. on the Management of Data,

Boston, June (1979). [271 M. Salazar. Dynamic query optimization for relational databases. M.Sc. Thesis, University of Maryland, College Park (1986). APPENDIX An optimized integration algorithm is shown in Fig. 14. While Phase I is the same as before, Phase 2 includes conditions to exempt inappropriate predicates from transformation rules testing. For example, if the rule being integrated is a selection on E, it is then sufficient to check predicates F that are produced via a sequence of selections and/or projections from some predicate G, which also produces E through another sequence of selections and projections. We use the informal notions of ‘Lancestors0 and “descendants” defined as follows.

given a RAP Schema as an ordinary directed graph, if there exists a path from predicate E to predicate F, then E is an “ancestor” of F, and F a “descendant” of E. The classification of “ancestors” and “descendants” can be further refined if the types of arcs in a path are also taken into consideration. For example, E may be called a “select-project-ancestor” of F, and thus F a “select-project-descendant” of E, if only selection and projection arcs constitute the path from E to F. In addition, when the integration algorithm is invoked for query processing, it may be beneficial to only consider predicates that are cached. Finally, the algorithm includes a new phase for optimizing the computation of recursive rules. Since a RAP Schema can be regarded as an ordinary directed graph, cycles in the Schema can be detected using standard algorithms in graph theory. Then by labeling the predicates in these cycles, efficient techniques for evaluating recursive rules can be applied to the predicates.

Lihat lebih banyak...

Efficient compilation of large rule bases using logical access paths

Descripción

Comentarios