SL Parses The LR Languages

July 24, 2017 | Autor: David Barnard | Categoría: Cognitive Science, Computer Software, Computer Languages
Share Embed


Descripción

Comput. Lang. Vol. 13, No. 2, pp. 65-74, 1988 Printed in Great Britain. All rights reserved

0096-0551/88 $3.00+0.00 Copyright © 1988 Pergamon Press plc

SL PARSES THE LR LANGUAGES DAVID T. BARNARD and JAMES R. CORDY Department of Computing and Information Science, Queen's University, Kingston, Ontario, Canada K7L 3N6 (Received I0 February 1988; in revisedform 25 March 1988) Abstract--SL, the Syntax Language component of S/SL (Syntax/Semantic Language) is a dataless programming language for implementingefficientrecursivedescentparsers. SL is dearly powerfulenough to parse the LL(1) languages, but it is much less obvious that it is sufficientlypowerfulto parse the LR(k) languages. We show that this is in fact the case by giving an algorithm to construct an SL program that parses the same language as a given LR(1) parser. This construction provides both a proof that SL has the parsing power of LR(1) [and hence LR(k)] and a practical method for deriving efficientSL parsers from BNF grammars.

I. I N T R O D U C T I O N SL, the Syntax Language component of the Syntax/Semantic Language S/SL [1] is a dataless programming language explicitly designed for expressing efficient recursive descent parsers. SL has been used to implement production parsers for a number of programming languages including Euclid [2], Concurrent Euclid [3], PT Pascal [4], Turing [5], Turing Plus [6] and many others. It has long been suspected that the rule choice construct of SL extends its parsing power to the power of LR(k) based on [7]. However, the only existing proof of this fact consists of a complex non-constructive argument based on simulation of a deterministic push down automaton (DPDA) by an SL program [8]. Also, in spite of its demonstrated success as a practical technique, acceptance of SL has been hampered by the fact that SL parsers cannot be directly derived from a BNF specification of the language grammar, requiting instead that the user write a new specification of the language grammar in SL form. This paper solves both these problems by presenting an elegant constructive proof that SL has the parsing power of LR(k). The p r o o f consists of a simple, straightforward algorithm to construct an SL program from a given LR(1) parser. This algorithm can be used in conjunction with an LR(I) parser generator to automatically derive an efficient SL parser from a given BNF grammar without the necessity o f hand conversion to SL. 2. L L ( k ) AND L R ( k ) The LL(k) [9] and LR(k) [I0] classes of parsers are "natural" in several senses. LL(k) corresponds to left-to-right reading yielding a canonical leftmost parse, while LR(k)corresponds to left-to-right reading yielding a canonical rightmost parse. LL(k) is a top-down technique, meaning it starts with the goal symbol and grows a parse tree down to where the leaves match tokens in the program, while LR(k) is a bottom-up technique, meaning it starts with the tokens and grows a parse tree up from the leaves to the goal symbol. Here is another useful way to think about them. When determining what production was used to generate a part of the program being parsed, LL(k) uses the first k terminal symbols of that part of the program. It answers the question, "What is this that I see ahead of me?" LR(k) uses the k terminal symbols after that part of the program. It answers the question, "What was that that is now behind me?" LL languages are usually handled by recursive descent parsers, or by table-driven LL parsers [typically LL(I)], and they can be handled in a straightforward way by SL. LR languages are usually handled by table-driven parsers. We will show that they can also be handled by SL, thus proving that SL has the same parsing power as LR. The essential aspect of SL that makes this possible is the rule choice construct. 65

66

DAVID T. BARNARD and JAMES R. CORDY

2.1 Brief review of LR(1) parsing An LR Parser consists of a stack for state symbols (which are described later), an input, an output, and a finite control. The finite control is directed by a two-dimensional table. There is a row of the table for each symbol that can appear on the stack, namely the state symbols and the bottom of stack marker. There is a column of the table for every possible lookahead string; for LR(1) these are the terminal and nonterminal symbols and the endmarker. The entries in the table are Reduce (with a production number), Shift (with a state), Accept, and Error. The parser's finite control is said to be in the state whose symbol is on top of the stack. The actions taken are as follows. On a Reduce action, the number of symbols in the right side of the associated production are popped from the stack, and the symbol on the left side is inserted in the input. On a Shift action, the state specified is pushed onto the stack and the input is advanced one symbol. The unique Accept entry signifies that the input is accepted. Error entries signify syntax errors. States in an L R parser are formed from sets of items. A n item is a production with a marked position in its right side. A right side with k symbols has k + I positions; one before each of the k symbols and one after the last symbol. The occurrence of an item in a state indicatcs that the input recognized so far could possibly form that part of thc production preceding the mark. There are various algorithms for producing the parsc table entries, but we will not describe them here. Figures I-4 show a simple grammar, the LR(1) parsing table gcnerated from it, a different presentation of the parse tablc information using a listrather than a table and including the items, and a trace of the parser on an input string, respcctivcly. 2.2 Brief review of SL SL can be thought of as a recursive procedural programming language without variables. Essentially, an SL program describes a naive but efficient recursive descent parser. SL was developed from earlier work on syntax charts, and with a lower level notation more directly reflecting the structure of the charts [11]. SL programs have an input stream of symbols that can be consumed one at a time and a similar output stream of symbols that can be generated one at a time. A second output stream called the error stream allows diagnostic output. Each SL program consists of a set of possibly recursive procedures called rules whose implementation is supported by a single stack of return addresses produced when rule calls occur. There are two kinds of rules, called procedure rules and choice rules, that correspond to the procedures and functions of a general purpose programming language. Syntactically, a procedure rule is defined using the rule name followed by a colon and the list of actions forming the body of the procedure terminated by a semicolon. A choice rule is defined similarly, using the symbol " > > " and a type name following the rule name to indicate the set of result states that can be returned from the rule. Execution of an SL program begins with the first rule; a rule terminates by reaching the trailing semicolon or by executing a an explicit return action. There are eight actions that can be used in a rule. The call action causes the (potentially recursive) invocation of another rule; it is denoted by the symbol " @ " followed by the name of the rule to be called. The return action causes an immediate return from a rule, it is written as " > > " . The input action causes (actually requires) input of a symbol; the input action is implicit--it is written simply by naming the symbol. The emit action causes a symbol to be written to the output; it is written as "." followed by the name of the symbol. The error action causes a symbol to be written to the diagnostic output; it is written as " ~ " followed by the symbol name. a $4

b

c $3

d

S'

S $1

X $2

_1_ A

S5

0: S'--> S

1: S --> Xb 2: Icd 3: X - - > a Fig. 1. A simple LR grammar.

$6

R3

R3

R3

R3

R1

R1

RI

R1

R2 112 R2 R2 Fig. 2. LR Parser from Fig. 1, tabular form.

SL parses the LR languages

o:

[s'-->.sl [S-- >.Xb] fS->.cd]

s

(s,o(x,2)(c~)O,4)

[X-->.a] 1: 2: 3: 4: 5: 6:

[S'r->S.] [S-->X.b] [s-->c.d] [X->a.] [S-->Xb.] [S-->cd.]

A S s R3 R1 R2

67

(o,5) 0,6)

Fig. 3. LR parser from Fig. 1, list form.

stack

input

0

ab__l_

O4 0 02 025 o 01

b._l_ Xbl_ b_l_ _1_ S_l_ _1_

output move S4

3 1

1,.3 S2 s5 R1 sl A

Fig. 4. Trace of parser on input "ab'.

The cycle action is the repetition construct of S/SL, denoted by the cycle brackets "{" and "}". The list of actions enclosed within cycle brackets are repeated until a cycle exit action or a return action is executed. The exit action, denoted b y " > ", causes an immediate exit from the most tightly enclosing cycle. The choice action is the conditional construct of S/SL, denoted by the choice brackets "[" and "]". A choice selects among a set of alternative lists of actions based on the next symbol in the input (an input choice) or on the result state returned from a call to a choice rule (a rule choice). Each alternative begins with the alternation symbol " [ " followed by a list of comma-separated label symbols ending with a colon, and the corresponding list of actions to be done. The selector for the choice follows the opening choice bracket and precedes the first alternative. The selector for a rule choice is a call to a choice rule, indicated using the call action syntax (i.e. " @ " followed by the rule name). Like input actions, input choices are implicit and are indicated by the absence of an explicit choice selector. The label "*" on an alternative means "otherwise," and indicates that the alternative is to be chosen when the selector does not match any of the explicitly-given labels. For an input choice, the input is consumed unless the otherwise alternative is chosen. Repeated choices can be constructed by directly nesting them inside a cycle "{["..."]}". Before the rules are given in an SL program, definitions corresponding to enumerated types in Pascal can be given. There should be definitions for input symbols, output symbols, error symbols, and whatever state values are returned by choice rules. Figure 5 shows an SL program to parse simple expressions and output them in postfix form. 3. A C O N S T R U C T I O N FOR LR(0) G R A M M A R S We now show how to construct an SL program to mimic an LR parser. We first show how to do this for LR(0) parsers. These are parsers where the parse tables have a particularly attractive property: any row of the table that contains a Reduce action will have that Reduce action specified for all terminal symbols. The defintions of our SL program will involve three types. The input type defines the input symbols of the language. The output type defines one value for each production number; we will write them as o0, ol, and so on. The Position type is used to record how many of the symbols corresponding to the right hand side of a production have been popped off the stack when an LR Reduce move is being simulated. The values of this type must encode a production number and a position within the right side of the production; we write them as PlPl, for production l position 1, and so on. There will be one rule in the SL program for each state in the LR parser. The name of the rule corresponding to state i in the parser is st. The first rule is a procedure rule corresponding to the start state. All other rules are choice rules returning a value of type Position. The essential idea of the construction is to mimic an LR Shift move by a call to the rule implementing the state the LR parser would shift into. This means that the call stack in the SL machine mimics the state stack in the LR parser---every time a state symbol is pushed onto the LR stack, a return address indicating where the corresponding rule should return to is pushed onto the SL stack. To mimic the LR Reduce moves, we must first get rid of entries on this stack, and then make the Shift transition that the LR parser would have made if the left hand side of the

68

DAVID T. BAI~NARDand JAMESR. CORDY input: pPlus pMinus pAsterisk pSlash pLeftParen pRightParen pldent~er;

' +' '-' '*' '/' '(' ')'

output: sAdd sSubtract sMultiply sDivide sIdcntifier;

% % % %

Enumerative declaration of the input token set (A quoted symbol beside a token nante simp~ declares an alternative notation for referring to the token)

% Enumerative declaration of the output token set

rules Expression: @Factor

{[

I '+'" @Factor .sAdd I lit_. @Factor .sSubtract

I *:

>

9;

% Beginning of the first rule (main program) % Call the Factor rule % Input choice nested in a cycle % If mat input token is pPius consume it, % call Factor rule % and emit sAdd % Else if next input token is pMinus consume i~ % call Factor rule % and emit sSubtract % Else % exit the cycle % End of input choice nested in cycle, and end of rule

Factor:

@Primary {[ @Primary

.~t~atlply 1 7"

@Primary .sDivide

J *:

]}; Primary:.

[

I 'C"

@Expression ,), [ pIdentifier:

];

.sldentificr

end Fig. 5. An SL program to parse infix expressions and output flow.

SL parses the LR languages

69

recognized prodution had been inserted in the input. Unwinding the stack is accomplished using the Position type; when a rule recognizes that a Reduce is to be done, it returns a value indicating how much of the right hand side of the production that it has recognized remains on the stack; the rule that called this rule must check whether the Reduce occurred, which it can recognize by the returned value, and, if the Reduce has occurred and this state corresponds to one that should be popped off the stack, it simply returns to its caller, passing back a revised value to indicate that now one fewer state must be popped from the stack. One of the essential things about an LR state on the stack is that it can be the top of stack several times before it is finally removed. However, the very first time that the state appears at the top of the stack, it is as a result of a Shift action that has consumed input (either a terminal or an inserted nonterminal), and there will always be a terminal as the input symbol. This suggests that the overall structure of a rule should be an input choice, with an option specified for each terminal symbol. We do not explicitly encode the Error moves from the LR table, and if all actions are equal, the rule can be simplified by not writing the choice and simply writing the common actions. The specific constructions to be used for each move of the LR parser are as follows. Error: we do not write these explicitly, but use the built-in SL syntax error handling. Accept: this occurs only in one state, and we write it as follows: sl > > Position: > > p0pl; Reduce i: Output oi, and return the value pipj, where j is the length of the right hand side of production i. Shift i: When the LR machine has a shift to state i on input t, the SL rule will have an option for terminal t, which is a rule choice on the rule s~. There will be an alternative in the rule choice for each item in the LR state that has the mark to the left of i. The value label on this alternative will be p~pj, where i is the production number and j is the position of the mark in the item. The action written in this alternative will be a return ofp~pj l i f j is greater than 1. I f j is equal to 1 (indicating that all the symbols corresponding to the right hand side have now been popped from the stack), the action is a rule choice on Sk, where k is the state to which the LR machine will shift on input of the left hand side of production i in the current state. Figure 6 shows the SL program yielded by this construction for the parser in Figs 2 and 3.

4. H A N D L I N G P A R S E R S W I T H C O N F L I C T S The simple construction given in the previous section works only for LR(0) parsers. Parsers that can have both Shift and Reduce moves, or Reduce moves for more than one production, on a single row of the LR parse table are said to have conflicts, and more complex algorithms are needed to construct the tables. It turns out that more complex constructions are needed here to mimic the LR parser with an SL program as well. We proceed in two stages. We first show an elegant construction for such parsers, assuming an extension to SL. We then show how to transform a program in this extended SL into an equivalent program in standard SL. Figures 7 and 8 show a grammar that is LR(1) but not LR(0), and the LR parse table for this grammar. State 3 in the LR parse table is a conflict state. If the input symbol is d, the parser should Shift and go to state 7; on any other input symbol, the parser should Reduce by production 5. We have shown in the table that, in fact, the input symbol b is the only one that can occur in these circumstances, and is the only symbol for which the R5 entry is required in the table.

4.1 An extended SL The straightforward construction described earlier does not work for state 3 in Fig. 8. There is, though, a straightforward construction assuming a simple extension to SL described by [12]. The extension consists of adding a single additional action to SL. The lookahead choice action is denoted by a choice action with "*" as the selector. The effect of this action is to have the next imput symbol examined but not consumed, it remains in the input stream to be read later. This

70

DAVID T. BARNARD and JAMES R. CORDY

Input:

'a'

ia

ib ic id

'b' ~cP

'd';

output: o0 ol 02 03;

type Position:

pop]. plpl plp2 p2pl

p2p2 p3pl;

rules sO:

[ I 'a':

[ @s4 I p3pl: [ @s2 [ plpl: [ @s~

[ pOpI: >>

] ]

%

] 'd:

% else if it is 'c' then consume it and % % %

[ @s3 I p2pl: [ @s2 [ plpl: [ @sl [ p0pl: >>

] ]

sl > > Position: > > p0pl; s2 > > Position:

[

I 'b':

[ @~ ]

];

% Rule for start state % Choose on first input % if it is 'a'then consume it and % call rule for state 4 % if rule for state 4 returns position p3p] then % call rule for state 2 % if it returnsplpl then % call rule for state 1 % if it returnspOpl then % all done; return

I plp2: • • plpl Fig. 6

SL parses the LR languages

71

s3 • • Position:

[ I 'd': [ @s6 I p2p2: > > p2pl

]

s4 • • Position: .03 • • p3pl; s5 > > Position: .ol > > plp2; s6 > > Position: .02 > > p2p2; end

Fig. 6. SL program corresponding to parser in Fig. 2. is n o w a second use o f " * " in a choice; we earlier introduced it as the label on an alternative (see 2.2), where it indicated "otherwise" and where no input was c o n s u m e d for input choices, but here it is used as a selector. The l o o k a h e a d extension was originally designed into SL to allow convenient chain optimization o f expression parsing in SL recursive descent parsers. As we will prove later, it does not change the parsing p o w e r o f SL. Figure 9 shows a rule for state 3 written in the extended SL. Here we m a k e use of the "otherwise" feature of the choice construct, which is written by putting " * " in the value label o f an alternative. Figure 10 shows a n o t h e r version o f the rule, where we explicitly use the input symbol that can legally a p p e a r as a label. Notice that we must explicitly c o n s u m e the input symbol d; its a p p e a r a n c e as a value label in a l o o k a h e a d choice does not have the side effect of consuming it, as happens in the input choice used earlier.

4.2 Handling conflicts without extensions We now show how to handle conflict states without resorting to the use o f an extended SL. The essential idea is that when a l o o k a h e a d choice would be used, we instead use an input choice. F o r the alternatives that subsequently c o n s u m e input (such as the alternative labeled with d in Figs 9 and 10), this means simply deleting the first a c t i o n - - t h e explicit input of the terminal symbol. F o r 0:

[S'-->.S] [S-- > .Xb] [S-- >.cd]

S

(S,1) (X,2) (c,3) (a,4) (Y,5)

[x-->.v] [Y-->.c] O: 1: 2: 3: 4: 5:

S'--> S S--> Xb I cd X--> a [Y Y--> c

Fig. 7. A grammar that is LR(1) but not LR(0). CL 132

D

1:

[S'-->S.]

A

2:

[s-->X.b]

s

(b,6)

3:

[S-->c.d] [Y-->c.] [X-- > a.] [X-->Y.] [S-->Xb.] [S-->cd.]

S R5 R3 R4 R1 R2

(d,7) {b}

4: 5: 6: 7:

Fig. 8. LR parse table for grammar in Fig. 7.

72

DAVID T. BARNARDand JAMESR. CORDY s.3 > > Position:

s3 > > Position:

[,

[,

I 'd':

I 'd':

'd'

'd'

[ @s7

[ @s7

I p2p2: ]

I p2p2:

> > p2pl

]

I*:

• > p2pl

['b':

. .05

.05

> > p5pl

> • p5pl

l;

1;

Fig. 9. SL rule for conflict state.

Fig. 10. Alternative SL rule for conflict state.

the alternatives that do not immediately c o n s u m e input, this m e a n s that additional information must be encoded, namely, the value o f the input symbol that was consumed. Since there are no variables in SL, the only place to encode this information is in an a u g m e n t e d return value for the rule. We extend the type Position to include values o f the f o r m piPstx, where i and j are as before, and x is the n a m e o f an input symbol that has already been consumed. Such a value can be passed back f r o m a rule. T h e rule that called a rule returning such a value, must be modified to not sO:

i

"~C1:

[ @s3 I p5pltb: [ @s5tb I p4pltb: [ @s2tb ] plpl: [ @sl

I pop1: > >

] ] ] l °°.

l;

s2tb > > Position:

[ @s6

I pZpZ:

> > plpl ]; s3 > > Position:

[

I 'd': [ @sT I p2p2: • > p2pl

] I 'b':

];

.05 • • p5pltb

s5tb: % Same as s5 under tile previous construction

Fig. 11. Handling conflict states in standard SL.

SL parses the LR languages

73

consume input the next time an input action or input choice action would have been required by the construction described in the previous section. In some cases this will mean modifying only the rule that called the rule generating the augmented value. In other cases the next action causing input will not appear in the calling rule, but may appear in some other rule called (perhaps indirectly) by that rule. Figure 11 shows how to modify our example to handle the conflict state from the LR grammar with rules written in standard SL. We show only those rules that differ from what would be produced by the construction described in the previous section. Assuming that a set of SL rules has been produced using the lookahead choice action, the way to augment the appropriate returned values and propagate this buffered information through state splitting is as follows. For every lookahead choice: (1) Convert the lookahead choice into an input choice. If an alternative had an input action as its first action, remove that input action. If an alternative did not have an input action as its first action, augment the returned value indicating what symbol has been returned. (This will require augmenting the set of values in the definition of the type Position at the beginning of the SL program.) (2) For every call action that references a state that has had its return values augmented, add an alternative for each augmented result that can be returned; if the action in the extant alternative simply returns a Position, modify it to return the same Position value augmented with the input symbol that was consumed and passed back as part of the augmented value; if the action in the extant alternative is a choice rule, modify it to call a new state whose name is the name of the old state augmented with the name of the previously consumed input symbol. (3) For every newly-created state, if the old state (with the unaugmented name) accepted the symbol that has been buffered (and is thus encoded in the new state's name), delete the explicit input action from the new state; if the old state did not explicitly accept the buffered input symbol, use step 2 above to modify it. The full construction now comprises the steps given in the previous section of the paper, augmented with the use of the lookahead choice action, followed by the elimination of lookahead choices as described here. 5. S U M M A R Y We have proven that SL has the parsing power of LR(k) by showing how to build an SL program that will mimic the action of any LR(1) parser. The construction uses an SL rule for each LR state, an SL rule choice for each LR Shift action, and explicit return stack unwinding in the SL program for each LR Reduce action. The construction is elegant for LR(0) parse tables, and there is an elegant construction for LR(I) parse tables using an extended version of SL. The extended SL program can be converted into standard SL in a straightforward manner that may introduce some additional states. This construction means that SL can be used as an implementation technique for parsers based on BNF grammars as input, rather than the normal programmatic style of SL rule input. The construction is not only attractive in theory, but in practice it should produce parsers that are about as fast and small as standard LR parsers.

REFERENCES 1. Holt R. C., Cordy J. R. and Wortman D. B. An introduction to S/SL: Syntax/SemanticLanguage. ACM TOPLAS 4, 149-178 (1982). 2. Lampson B. W., Homing J. J., London R. L., MitchellJ. G. and Popek G, J. Report on the programming language Euclid. SIGPLAN Not. 12 (February, 1977). 3. Cordy J. R. and Holt R. C. Specificationof concurrent Euclid (Version I). Technical Report CSRG-133, Computer Systems Research Institute, University of Toronto (1981). 4. RosseletJ. A. PT: a Pascal subset. Technical Report CSRG-119, Computer Systems Research Institute, Universityof Toronto (1980). 5. Holt R. C., Matthews P. A., RosseletJ. A. and Cordy J. R. The Turing ProgrammingLanguage: Design and Definition. Prentice-Hall, Englewood Cliffs, NJ (1988).

74

DAVID T. BARNARD and JAMES R. CORDY

6. Holt R. C. and Cordy J. R. The Turing Plus Report (Preliminary). Computer Systems Research Institute, University of Toronto (1985). 7. Lomet D. B. A formalization of transition diagram systems. J. A C M 20, 235--257 (April, 1973). 8. McKenzie P., Spinney B. and Wu T. S/SL parses the LR(k) languages. Technical Note 21, Computer Systems Research Group, University of Toronto (1981). 9. Lewis P. M. II and Stearns R. E. Syntax directed transduction, d. A C M 15, 464~88 (1968). 10. Knuth D. E. On the translation of languages from left to right. Inform. Control g, 609-639 (1965). 11. Barnard D. T. Automatic generation of syntax-repairing and paragraphing parsers. M.Se. Thesis, Technical Report CSRG-52, Computer Systems Research Institute, University of Toronto (1975). 12. Turner J. NSSL: an automated compiler generation system. M.SC. Thesis, Department of Computer Science, University of Toronto (1985), About the Author--DAVID T. BAR~AgD studied computer science at the University of Toronto before joining the Department of Computing and Information Science at Queen's University in 1977, where he is currently Associate Professor and Head. From 1982 to 1987 he served as Director of Computing and Communications Services at Queen's. His research interests are in the application of formal language theory to programming language compilation, module specification, and text processing. Dr Barnard is a member of the Association for Computing Machinery, the IEEE Computer Society, the Association for Computers and the Humanities and the American Scientific Affiliation. About the Author--JA~S R. CORDY received his B.SC. in 1973, M.SC. in 1976 and Ph.D. in computer science in 1986 from the University of Toronto. From 1974-1983 Dr Cordy served as a research associate in the Computer Systems Research Institute at the University of Toronto, and from 1983 to 1985 he was a lecturer in the Department of Computer Science of that same university. He is presently assistant professor of computing and information science at Queen's University at Kingston, Canada. Dr Cordy is co-designer of the programming languages Concurrent Euclid, Turing and Turing Plus, the Turing programming environment and the S/SL compiler specification language. He is a member of the Association for Computing Machinery, the IEEE Computer Society and IFIP working group 2.4.

Lihat lebih banyak...

Comentarios

Copyright © 2017 DATOSPDF Inc.