Report on Superscalar Processor

October 8, 2017 | Author: Nowshed Shiam | Category: Computer Science, Computer Engineering



Student ID: M-145309

Report on Superscalar Processor

To Ryotaro Kobayashi Department of Computer Science & Engineering, Toyohashi University of Technology, Toyohashi, Japan. 17th November, 2014.

Subject: Letter of submission.

Sensei, With due respect and humble submission, I beg to state that I am a student of C.S.E., Master's 1st semester. I am glad to inform you that I have successfully completed the report on "Superscalar Processor".

I, therefore, pray and hope that your honor and intelligence would be kind enough to evaluate my report and oblige thereby.

Yours Sincerely,

Chy Abu Nowshed ID : M-145309 Semester : 1st Department of Computer Science & Engineering Toyohashi University of Technology.


Superscalar Processor

Superscalar is a term coined in the late 1980s. Superscalar processors arrived as the RISC movement gained widespread acceptance, and RISC processors are particularly suited to superscalar techniques. However, the approach can also be applied, with considerable effort, to non-RISC processors (e.g. Intel's P6-based processors and the Pentium 4, and AMD's IA-32-compatible processors). All current desktop and server processors are now superscalar.

A superscalar architecture allows two or more successive instructions to execute simultaneously in different pipelines, so more than one instruction can complete in every clock cycle. The throughput of a superscalar processor is therefore at least twice that of a pipelined scalar processor. In some superscalar processors the sequencing of instruction execution is determined statically (at compile time), whereas in others it is determined dynamically (at run time).

Superscalar versus Superpipelined:

Both the superpipelined and the superscalar implementations depicted in the figure above have the same number of instructions executing at the same time in the steady state. The superpipelined processor falls behind the superscalar processor at the start of the program and at each branch target.

Requirements of Superscalar Processors: The requirements for successful superscalar execution are as follows:
1. The instruction fetch unit fetches more than one instruction at a time from the cache.
2. The instruction decoding logic decides whether the instructions are independent and hence can be executed simultaneously.
3. There are enough execution units to process several instructions simultaneously; the execution units should preferably be pipelined.
4. The cycle time of each pipeline stage matches the cycle time of the fetch and decode logic, so that the pipeline works at optimum efficiency.

Superscalar pipelines:

Figure (a): A superscalar processor with two pipelines.

Earlier it was assumed that each stage in the pipeline takes the same amount of time as every other stage. In practice, some stages take longer than others. Assume that the Fetch, Decode, and Save stages each take 1 clock cycle to complete, but the Execute stage takes 2 clock cycles. In a pipeline, the time spent in each stage must be the same for all stages and is determined by the slowest stage, so the Execute stage drags down all the others: the Fetch, Decode, and Save stages need only 1 clock cycle but must wait an extra clock cycle before accepting the next instruction. As a result, it takes 8 clock cycles to complete the first instruction, and one instruction completes every 2 clock cycles thereafter.

Figure (a) illustrates how adding one more functional unit improves the productivity of the pipeline. The two separate execution units work independently, each taking two clock cycles per instruction; the odd-numbered instructions enter execution unit 1 and the even-numbered instructions enter execution unit 2. It now takes 5 clock cycles to complete the first instruction, and one instruction completes every clock cycle thereafter.

Performance limits of superscalar execution: The superscalar approach depends on the ability to execute multiple instructions in parallel. The term instruction-level parallelism refers to the degree to which, on average, the instructions of a program can be executed in parallel. To increase instruction-level parallelism, the system must cope with the following fundamental limitations:
1. True data dependencies.
2. Procedural dependencies (control hazards).
3. Resource conflicts.
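The timing described above can be checked with a small sketch. This is not from the report; the function names and the per-stage parameters are illustrative, and the model simply counts cycles under the stated assumptions (4 stages, a 2-cycle Execute, two alternating execution units in the second case).

```python
def scalar_completion(n, stages=4, slowest=2):
    # In a simple pipeline every stage advances at the pace of the slowest
    # stage (the 2-cycle Execute), so the first instruction finishes after
    # stages * slowest cycles and one finishes every `slowest` cycles after.
    return [stages * slowest + slowest * i for i in range(n)]

def dual_execute_completion(n):
    # With two independent 2-cycle Execute units taking alternate
    # instructions, the first instruction finishes after 5 cycles and one
    # finishes every cycle thereafter.
    return [5 + i for i in range(n)]

print(scalar_completion(4))        # [8, 10, 12, 14]
print(dual_execute_completion(4))  # [5, 6, 7, 8]
```

The outputs match the text: 8 cycles for the first instruction and one per 2 cycles in the scalar pipeline, versus 5 cycles for the first and one per cycle with the duplicated execution unit.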

Figure (b): Effect of Dependencies


True Data Dependency: Consider the following sequence:

ADD EAX, ECX
MOV EBX, EAX

The second instruction can be fetched and decoded but cannot execute until the first instruction executes, because it needs the data produced by the first instruction. This situation is referred to as a true data dependency. Figure (b) illustrates this dependency in a superscalar machine of degree 2. With no dependency, two instructions can be fetched and executed in parallel. If there is a data dependency between the first and second instructions, the second instruction is delayed as many clock cycles as are required to remove the dependency. One way to compensate for this delay is for the compiler to reorder instructions so that one or more subsequent instructions that do not depend on the memory load can begin flowing through the pipeline. This scheme is less effective for a superscalar pipeline: the independent instructions executed during the load are likely to be executed on the first cycle of the load, leaving the processor with nothing to do until the load completes.

Procedural Dependencies: The instructions following a branch (taken or not taken) have a procedural dependency on the branch and cannot be executed until the branch is executed. Figure (b) illustrates the effect of a branch on a superscalar pipeline of degree 2. This type of dependency also affects a scalar pipeline, but the consequence for a superscalar pipeline is more severe, because a greater magnitude of opportunity is lost with each delay.

Resource Conflicts: A resource conflict is a competition between two or more instructions for the same resource at the same time. Examples of resources include memories, caches, buses, register-file ports, and functional units (e.g., the ALU adder). In the pipeline, a resource conflict exhibits behavior similar to that of a data dependency (Figure (b)). There are some differences, however. Resource conflicts can be overcome by duplicating resources, whereas a true data dependency cannot be eliminated; and when an operation takes a long time to complete, resource conflicts can be minimized by pipelining the appropriate functional unit.

Design Issues: Instruction-Level Parallelism and Machine Parallelism: Instruction-level parallelism is a measure of the average number of instructions in a program that a superscalar processor might be able to execute at the same time. Machine parallelism is a measure of the ability of the processor to take advantage of that instruction-level parallelism; it is determined by the number of instructions that can be fetched and executed at the same time and by the speed and sophistication of the mechanisms the processor uses to find independent instructions.
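The delay that a true data dependency imposes on a superscalar machine of degree 2 (as in Figure (b)) can be sketched as a toy model. This is not from the report: it assumes one-cycle execution and in-order issue, and `deps[i]` names the producer instruction that instruction i must wait for.

```python
def issue_schedule(deps, degree=2):
    # Issue up to `degree` instructions per cycle, in program order.
    # An instruction cannot issue until the cycle after its producer issues.
    start, cycle, issued = [], 0, 0
    for dep in deps:
        ready = start[dep] + 1 if dep is not None else 0
        if issued == degree or ready > cycle:
            cycle = max(cycle + 1, ready)
            issued = 0
        start.append(cycle)
        issued += 1
    return start

print(issue_schedule([None, None, None, None]))  # [0, 0, 1, 1] — no dependency
print(issue_schedule([None, 0, None, None]))     # [0, 1, 1, 2] — I1 waits on I0
```

With no dependency, instructions issue in pairs; with I1 depending on I0, I1 slips a cycle and the whole stream finishes one cycle later, exactly the kind of lost opportunity the text describes.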

Superscalar Instruction-Issue and Completion Policies: Instruction issue refers to the process of initiating instruction execution in the processor's functional units. The instruction-issue policy limits performance because it determines the processor's 'lookahead' capability; that is, the ability of the processor to examine instructions beyond the current point of execution in the hope of finding independent instructions to execute. Instruction-issue policies fall into three categories:

1. In-order issue with in-order completion.
2. In-order issue with out-of-order completion.
3. Out-of-order issue with out-of-order completion.

Figure (c): In-order issue with in-order completion.

Figure (d): In-order issue with out-of-order completion.
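The difference between the two completion policies of Figures (c) and (d) can be illustrated with a toy model. This is not from the report; the instruction names and latencies are made up, and issue is modeled as strictly in-order, one instruction per cycle.

```python
# In-order issue, one instruction per cycle; latencies in clock cycles.
latency = {"I1": 4, "I2": 1, "I3": 1}    # I1 is a slow op, e.g. a multiply
finish = {name: issue + latency[name]    # issue cycle = program position
          for issue, name in enumerate(latency)}

in_order = list(latency)                         # commit in program order
out_of_order = sorted(latency, key=finish.get)   # commit as results finish

print(in_order)      # ['I1', 'I2', 'I3']
print(out_of_order)  # ['I2', 'I3', 'I1']
```

With in-order completion, the quick I2 and I3 must hold their results until the slow I1 writes back; with out-of-order completion they retire as soon as they finish (at cycles 2 and 3, versus I1's cycle 4).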



Figure (e): Out-of-order issue with out-of-order completion.

One common technique used to support out-of-order completion is the reorder buffer: temporary storage for results completed out of order, which are then committed to the register file in program order. A related concept is Tomasulo's algorithm.

Register Renaming: Anti-dependencies and output dependencies are both examples of storage conflicts: multiple instructions compete for the use of the same register locations, generating pipeline constraints that retard performance. The problem is made more acute when register optimization techniques are used. One method for coping with these storage conflicts is based on the traditional solution to resource conflicts: duplication of resources. In this context, the technique is referred to as register renaming. In essence, registers are allocated dynamically by the processor hardware, and they are associated with the values needed by instructions at various points in time. When a new register value is created (i.e., when an instruction that has a register as a destination operand executes), a new register is allocated for that value. Subsequent instructions that access that value as a source operand must go through a renaming process: the register references in those instructions are revised to refer to the register containing the needed value. Thus, the same original register reference in several different instructions may refer to different actual registers, if different values are intended. For example, where R3a, R3b, R3c denote successively allocated instances of architectural register R3:

I1: R3b := R3a op R5a
I2: R4b := R3b + 1
I3: R3c := R5a + 1
I4: R7b := R3c op R4b

Branch Prediction: Any high-performance pipelined machine must address the issue of dealing with branches. With the development of superscalar machines, the delayed-branch strategy has less appeal: multiple instructions would need to execute in the delay slot, raising several problems relating to instruction dependencies. Thus, superscalar machines have returned to pre-RISC techniques of branch prediction.

Why would a superscalar processor use dynamic scheduling? There are three major reasons. First, not all stalls are predictable. Dynamic scheduling allows the processor to hide some of those stalls by continuing to execute instructions while waiting for the stall to end.

Second, if the processor speculates on branch outcomes using dynamic branch prediction, it cannot know the exact order of instructions at compile time, since that order depends on the predicted and actual behavior of branches. Incorporating dynamic speculation to exploit more ILP without dynamic scheduling would significantly restrict the benefits of such speculation. Third, the pipeline structure affects both the number of times a loop must be unrolled to avoid stalls and the process of compiler-based register renaming. Dynamic scheduling allows the hardware to hide most of these details, so users and software distributors do not need to maintain multiple versions of a program for different implementations of the same instruction set.

Superscalar Execution:

Figure (f): Conceptual Depiction of Superscalar Processing.

The program to be executed consists of a linear sequence of instructions. This is the static program as written by the programmer or generated by the compiler. The instruction fetch process, which includes branch prediction, is used to form a dynamic stream of instructions. This stream is examined for dependencies, and the processor may remove artificial dependencies. The processor then dispatches the instructions into a window of execution. Within this window, instructions no longer form a sequential stream but are structured according to their true data dependencies. The processor performs the execution stage of each instruction in an order determined by the true data dependencies and hardware resource availability. Finally, instructions are conceptually put back into sequential order and their results are recorded. This final step is referred to as committing, or retiring, the instruction. Because branch prediction and speculative execution mean that some instructions may complete execution and then have to be abandoned when the predicted branch is not taken, permanent storage and program-visible registers cannot be updated immediately when instructions complete execution.

----------------**END**----------------

References
[1] David A. Patterson and John L. Hennessy, "Computer Organization and Design: The Hardware/Software Interface," 3rd Edition, Morgan Kaufmann, 2005. ISBN 1-55860-604-1.
[2] David Money Harris and Sarah L. Harris, "Digital Design and Computer Architecture," Elsevier.
[3] William Stallings, "Computer Organization and Architecture: Designing for Performance," 8th Edition, Prentice Hall. ISBN-13: 978-0-13-607373-4.
[4] B. Govindarajalu, "Computer Architecture and Organization: Design Principles and Applications," Tata McGraw-Hill, 2004.
[5] http://en.wikipedia.org/wiki/Superscalar
[6] http://www.cis.upenn.edu/~milom/cis501-Fall08/lectures/07_superscalar.pdf
[7] https://www.ida.liu.se/~TDTS08/lectures/12/lec5.pdf
[8] http://umcs.maine.edu/~cmeadow/courses/cos335/COA14.pdf
[9] http://csl.stanford.edu/~christos/publications/2002.micro35.viram.slides.pdf
[10] https://www.youtube.com/watch?v=9NdGVmveCwA
[11] https://www.youtube.com/watch?v=5E_W7EeNs8U
[12] https://www.youtube.com/watch?v=T9B_DtYTKCc
[13] https://www.youtube.com/watch?v=WdRiZEwBhsM
[14] https://www.youtube.com/watch?v=b9346ajIBwg



