William Stallings, Computer Organization and Architecture, 6th Edition - Chapter 15: IA-64 Architecture
Background to IA-64
Pentium 4 appears to be last in x86 line
Intel & Hewlett-Packard (HP) jointly developed
New architecture
  64 bit architecture
  Not extension of x86
  Not adaptation of HP 64 bit RISC architecture
Exploits vast circuitry and high speeds
Systematic use of parallelism
Departure from superscalar

Motivation
Instruction level parallelism
  Explicit in machine instructions
  Not determined at run time by processor
Long or very long instruction words (LIW/VLIW)
Branch predication (not the same as branch prediction)
Speculative loading
Intel & HP call this Explicit Parallel Instruction Computing (EPIC)
IA-64 is an instruction set architecture intended for implementation on EPIC
Itanium is first Intel product

Superscalar v IA-64
(diagram)

Why New Architecture?
Not hardware compatible with x86
Now have tens of millions of transistors available on chip
Could build bigger cache
  Diminishing returns
Add more execution units
  Increase superscaling
  "Complexity wall"
  More units makes processor "wider"
  More logic needed to orchestrate
  Improved branch prediction required
  Longer pipelines required
  Greater penalty for misprediction
  Larger number of renaming registers required
  At most six instructions per cycle

Explicit Parallelism
Instruction parallelism scheduled at compile time
  Included with machine instruction
Processor uses this info to perform parallel execution
Requires less complex circuitry
Compiler has much more time to determine possible parallel operations
Compiler sees whole program

General Organization
(diagram)

Key Features
Large number of registers
  IA-64 instruction format assumes 256
  128 x 64 bit integer, logical & general purpose
  128 x 82 bit floating point and graphic
  64 x 1 bit predicated execution registers (see later)
  To support high degree of parallelism
Multiple execution units
  Expected to be 8 or more
  Depends on number of transistors available
  Execution of parallel instructions depends on hardware available
  8 parallel instructions may be split into two lots of four if only four execution units are available

IA-64 Execution Units
I-Unit
  Integer arithmetic
  Shift and add
  Logical
  Compare
  Integer multimedia ops
M-Unit
  Load and store
  Between register and memory
  Some integer ALU
B-Unit
  Branch instructions
F-Unit
  Floating point instructions

Instruction Format Diagram
(diagram)

Instruction Format
128 bit bundle
  Holds three instructions (syllables) plus template
  Can fetch one or more bundles at a time
  Template contains info on which instructions can be executed in parallel
  Not confined to single bundle
  e.g. a stream of 8 instructions may be executed in parallel
  Compiler will have re-ordered instructions to form contiguous bundles
  Can mix dependent and independent instructions in same bundle
Each instruction is 41 bits long
  More registers than usual RISC
  Predicated execution registers (see later)

Assembly Language Format
    [qp] mnemonic [.comp] dest = srcs //comment
qp - predicate register
  If 1 at execution, then execute and commit result to hardware
  If 0, result is discarded
mnemonic - name of instruction
comp - one or more instruction completers used to qualify mnemonic
dest - one or more destination operands
srcs - one or more source operands
// - comment
Instruction groups and stops indicated by ;;
  Sequence without read after write or write after write
  Do not need hardware register dependency checks

Assembly Examples
    ld8 r1 = [r5] ;;     //first group
    add r3 = r1, r4      //second group
Second instruction depends on value in r1
  Changed by first instruction
  Cannot be in same group for parallel execution

Predication
(diagram)

Speculative Loading
(diagram)

Control & Data Speculation
Control
  AKA speculative loading
  Load data from memory before needed
Data
  Load moved before store that might alter memory location
  Subsequent check on value

Software Pipelining
    L1: ld4 r4=[r5],4 ;;     //cycle 0 load postinc 4
        add r7=r4,r9 ;;      //cycle 2
        st4 [r6]=r7,4        //cycle 3 store postinc 4
        br.cloop L1 ;;       //cycle 3
Adds constant to one vector and stores result in another
No opportunity for instruction level parallelism
All instructions in iteration x are executed before iteration x+1 begins
If there are no address conflicts between loads and stores, independent instructions can be moved from loop x+1 to loop x

Unrolled Loop
    ld4 r32=[r5],4;;     //cycle 0
    ld4 r33=[r5],4;;     //cycle 1
    ld4 r34=[r5],4       //cycle 2
    add r36=r32,r9;;     //cycle 2
    ld4 r35=[r5],4       //cycle 3
    add r37=r33,r9       //cycle 3
    st4 [r6]=r36,4;;     //cycle 3
    ld4 r36=[r5],4       //cycle 4
    add r38=r34,r9       //cycle 4
    st4 [r6]=r37,4;;     //cycle 4
    add r39=r35,r9       //cycle 5
    st4 [r6]=r38,4;;     //cycle 5
    add r40=r36,r9       //cycle 6
    st4 [r6]=r39,4;;     //cycle 6
    st4 [r6]=r40,4;;     //cycle 7

Unrolled Loop Detail
Completes 5 iterations in 7 cycles
  Compared with 20 cycles in original code
Assumes two memory ports
  Load and store can be done in parallel

Software Pipeline Example Diagram
(diagram)

Support For Software Pipelining
Automatic register renaming
  Fixed size area of predicate and fp register file (p16-p63, fr32-fr127) and programmable size area of gp register file (max r32-r127) capable of rotation
  Loop using r32 on first iteration automatically uses r33 on second
Predication
  Each instruction in loop predicated on rotating predicate register
  Determines whether pipeline is in prolog, kernel or epilog
Special loop termination instructions
  Branch instructions that cause registers to rotate and loop counter to decrement

IA-64 Register Set
(diagram)

IA-64 Registers (1)
General Registers
  128 gp 64 bit registers
  r0-r31 static
    References interpreted literally
  r32-r127 can be used as rotating registers for software pipeline or register stack
    References are virtual
    Hardware may rename dynamically
Floating Point Registers
  128 fp 82 bit registers
  Will hold IEEE 754 double extended format
  fr0-fr31 static, fr32-fr127 can be rotated for pipeline
Predicate registers
  64 1 bit registers used as predicates
  pr0 always 1 to allow unpredicated instructions
  pr1-pr15 static, pr16-pr63 can be rotated

IA-64 Registers (2)
Branch registers
  8 64 bit registers
Instruction pointer
  Bundle address of currently executing instruction
Current frame marker
  State info relating to current general register stack frame
  Rotation info for fr and pr
User mask
  Set of single bit values
  Alignment traps, performance monitors, fp register usage monitoring
Performance monitoring data registers
  Support performance monitoring hardware
Application registers
  Special purpose registers

Register Stack
Avoids unnecessary movement of data at procedure call & return
Provides procedure with new frame of up to 96 registers on entry
  r32-r127
Compiler specifies required number
  Local
  Output
Registers renamed so local registers from previous frame hidden
Output registers from calling procedure now have numbers starting r32
Physical registers r32-r127 allocated in circular buffer to virtual registers
Hardware moves register contents between registers and memory if more registers needed

Register Stack Behaviour
(diagram)

Register Formats
(diagram)

Itanium Organization
Superscalar features
  Six wide, ten stage deep hardware pipeline
  Dynamic prefetch
  Branch prediction
  Register scoreboard to optimise for compile time nondeterminism
EPIC features
  Hardware support for predicated execution
  Control and data speculation
  Software pipelining

Itanium Processor Diagram
(diagram)

Required Reading
Stallings chapter 15
Intel web site
IMPACT, University of Illinois
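
Worked example: bundle arithmetic
The Instruction Format slide says a 128 bit bundle holds three 41 bit instructions plus a template, which leaves 128 - 3 x 41 = 5 bits for the template. The Python sketch below packs and unpacks such a bundle as a plain integer. It is an illustration only, not an assembler: the function names are hypothetical, the slot and template values are arbitrary numbers rather than real encodings, and the bit positions (template in the low 5 bits, slots above it) are an assumption taken from the layout described in Intel's Itanium documentation rather than from the slides.

```python
# Sketch of the 128-bit IA-64 bundle layout (names here are hypothetical).
SLOT_BITS = 41       # each instruction (syllable) is 41 bits, per the slides
TEMPLATE_BITS = 5    # what remains of 128 bits after three 41-bit slots
BUNDLE_BITS = TEMPLATE_BITS + 3 * SLOT_BITS  # 5 + 123 = 128


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one integer.

    Assumed layout: template in bits 0-4, slot 0 in bits 5-45,
    slot 1 in bits 46-86, slot 2 in bits 87-127.
    """
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(3)
    ]
    return template, slots


if __name__ == "__main__":
    # Arbitrary placeholder values, not real instruction encodings.
    b = pack_bundle(0x10, [1, 2, 3])
    print(unpack_bundle(b))  # (16, [1, 2, 3])
```

Treating the bundle as one integer keeps the arithmetic visible: the largest template together with three all-ones slots fills exactly 128 bits, with nothing left over.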