William Stallings, Computer Organization and Architecture, 6th Edition - Chapter 18: Parallel Processing

Multiple Processor Organization
- Single instruction, single data stream - SISD
- Single instruction, multiple data stream - SIMD
- Multiple instruction, single data stream - MISD
- Multiple instruction, multiple data stream - MIMD

Single Instruction, Single Data Stream - SISD
- Single processor
- Single instruction stream
- Data stored in single memory
- Uniprocessor

Single Instruction, Multiple Data Stream - SIMD
- Single machine instruction controls simultaneous execution of a number of processing elements on a lockstep basis
- Each processing element has an associated data memory
- Each instruction executed on a different set of data by different processors
- Vector and array processors

Multiple Instruction, Single Data Stream - MISD
- Sequence of data transmitted to a set of processors
- Each processor executes a different instruction sequence
- Never been implemented

Multiple Instruction, Multiple Data Stream - MIMD
- Set of processors simultaneously execute different instruction sequences on different sets of data
- SMPs, clusters and NUMA systems

Taxonomy of Parallel Processor Architectures (diagram)

MIMD - Overview
- General purpose processors
- Each can process all instructions necessary
- Further classified by method of processor communication

Tightly Coupled - SMP
- Processors share memory and communicate via that shared memory
- Symmetric Multiprocessor (SMP)
- Share single memory or pool
- Shared bus to access memory
- Memory access time to a given area of memory is approximately the same for each processor

Tightly Coupled - NUMA
- Nonuniform memory access
- Access times to different regions of memory may differ

Loosely Coupled - Clusters
- Collection of independent uniprocessors or SMPs
- Interconnected to form a cluster
- Communication via fixed path or network connections

Parallel Organizations - SISD (diagram)
Parallel Organizations - SIMD (diagram)
Parallel Organizations - MIMD Shared Memory (diagram)
Parallel Organizations - MIMD Distributed Memory (diagram)

Symmetric Multiprocessors
- A stand-alone computer with the following characteristics:
  - Two or more similar processors of comparable capacity
  - Processors share the same memory and I/O
  - Processors are connected by a bus or other internal connection
  - Memory access time is approximately the same for each processor
  - All processors share access to I/O, either through the same channels or through different channels giving paths to the same devices
  - All processors can perform the same functions (hence symmetric)
  - System controlled by an integrated operating system providing interaction between processors
  - Interaction at job, task, file and data element levels

SMP Advantages
- Performance: if some work can be done in parallel (see the threads sketch after the bus discussion below)
- Availability: since all processors can perform the same functions, failure of a single processor does not halt the system
- Incremental growth: user can enhance performance by adding additional processors
- Scaling: vendors can offer a range of products based on number of processors

Block Diagram of Tightly Coupled Multiprocessor (diagram)

Organization Classification
- Time-shared or common bus
- Multiport memory
- Central control unit

Time-Shared Bus
- Simplest form
- Structure and interface similar to single processor system
- Following features provided:
  - Addressing: distinguish modules on bus
  - Arbitration: any module can be temporary master
  - Time sharing: if one module has the bus, others must wait and may have to suspend
- Now have multiple processors as well as multiple I/O modules

Shared Bus (diagram)

Time-Shared Bus - Advantages
- Simplicity
- Flexibility
- Reliability

Time-Shared Bus - Disadvantages
- Performance limited by bus cycle time
- Each processor should have a local cache to reduce the number of bus accesses
- This leads to problems with cache coherence, solved in hardware (see later)
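The SMP performance advantage above rests on work being divisible across processors. Below is a minimal sketch of that idea using POSIX threads on a shared-memory machine; the thread count, array size, and summing task are illustrative assumptions, not from the chapter.

```c
/* Minimal SMP sketch: each thread sums a disjoint slice of a shared array;
   the main thread joins the workers and combines their partial results.
   Compile with: gcc -pthread */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];
static double partial[NTHREADS];

static void *sum_slice(void *arg) {
    long id = (long)arg;
    long lo = id * (N / NTHREADS);
    long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
    double s = 0.0;
    for (long i = lo; i < hi; i++)
        s += data[i];
    partial[id] = s;   /* each thread writes its own slot: no sharing conflict */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (long i = 0; i < N; i++)
        data[i] = 1.0;
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, sum_slice, (void *)t);
    double total = 0.0;
    for (long t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);   /* wait for each worker, then fold in its result */
        total += partial[t];
    }
    printf("sum = %.1f\n", total);
    return 0;
}
```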
Multiport Memory
- Direct independent access of memory modules by each processor
- Logic required to resolve conflicts
- Little or no modification to processors or modules required

Multiport Memory Diagram (diagram)

Multiport Memory - Advantages and Disadvantages
- More complex: extra logic in the memory system
- Better performance: each processor has a dedicated path to each module
- Can configure portions of memory as private to one or more processors, giving increased security
- Write-through cache policy

Central Control Unit
- Funnels separate data streams between independent modules
- Can buffer requests
- Performs arbitration and timing
- Passes status and control
- Performs cache update alerting
- Interfaces to modules remain the same, e.g. IBM S/370

Operating System Issues
- Simultaneous concurrent processes
- Scheduling
- Synchronization
- Memory management
- Reliability and fault tolerance

IBM S/390 Mainframe SMP (diagram)

S/390 - Key Components
- Processor unit (PU): CISC microprocessor, frequently used instructions hardwired, 64k L1 unified cache with 1-cycle access time
- L2 cache: 384k
- Bus switching network adapter (BSN): includes 2M of L3 cache
- Memory card: 8G per card

Cache Coherence and MESI Protocol
- Problem: multiple copies of the same data in different caches
- Can result in an inconsistent view of memory
- Write-back policy can lead to inconsistency
- Write-through can also give problems unless caches monitor memory traffic

Software Solutions
- Compiler and operating system deal with the problem
- Overhead transferred to compile time
- Design complexity transferred from hardware to software
- However, software tends to make conservative decisions, giving inefficient cache utilization
- Analyze code to determine safe periods for caching shared variables

Hardware Solutions
- Cache coherence protocols
- Dynamic recognition of potential problems at run time
- More efficient use of cache
- Transparent to programmer
- Two categories: directory protocols and snoopy protocols

Directory Protocols
- Collect and maintain information about copies of data in caches
- Directory stored in main memory
- Requests are checked against the directory
- Appropriate transfers are performed
- Creates a central bottleneck
- Effective in large-scale systems with complex interconnection schemes

Snoopy Protocols
- Distribute cache coherence responsibility among cache controllers
- Cache recognizes that a line is shared
- Updates announced to other caches
- Suited to bus-based multiprocessors
- Increases bus traffic

Write Invalidate
- Multiple readers, one writer
- When a write is required, all other caches of the line are invalidated
- Writing processor then has exclusive (cheap) access until the line is required by another processor
- Used in Pentium II and PowerPC systems
- State of every line is marked as modified, exclusive, shared or invalid: MESI

Write Update
- Multiple readers and writers
- Updated word is distributed to all other processors
- Some systems use an adaptive mixture of both solutions

MESI State Transition Diagram (diagram; a state-machine sketch follows below)
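The write-invalidate MESI transitions above can be made concrete as a small state machine. This is a compact C sketch for a single cache line; the event names and encoding are assumptions of this sketch, and real controllers also move data (fills and writebacks), which is only noted in comments.

```c
/* Sketch of the write-invalidate MESI state machine for one cache line. */
#include <stdio.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;

typedef enum {
    LOCAL_READ,   /* this processor reads the line                    */
    LOCAL_WRITE,  /* this processor writes the line                   */
    SNOOP_READ,   /* another cache's read observed on the shared bus  */
    SNOOP_WRITE   /* another cache's write/invalidate seen on the bus */
} event_t;

/* other_sharers: whether the snoop on a LOCAL_READ miss found the line
   in another cache (-> SHARED) or nowhere else (-> EXCLUSIVE). */
mesi_t next_state(mesi_t s, event_t e, int other_sharers) {
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)                 /* read miss: line is filled */
            return other_sharers ? SHARED : EXCLUSIVE;
        return s;                         /* read hit: state unchanged */
    case LOCAL_WRITE:
        /* From SHARED or INVALID, an invalidate goes on the bus first;
           either way this cache ends up with the only valid copy. */
        return MODIFIED;
    case SNOOP_READ:
        if (s == MODIFIED)                /* must supply data / write back */
            return SHARED;
        if (s == EXCLUSIVE)
            return SHARED;
        return s;
    case SNOOP_WRITE:
        /* Another processor wants exclusive access, so our copy dies;
           a MODIFIED line is written back before invalidation. */
        return INVALID;
    }
    return s;
}

int main(void) {
    mesi_t s = INVALID;
    s = next_state(s, LOCAL_READ, 0);   /* -> EXCLUSIVE */
    s = next_state(s, LOCAL_WRITE, 0);  /* -> MODIFIED  */
    s = next_state(s, SNOOP_READ, 0);   /* -> SHARED    */
    printf("final state = %d\n", s);
    return 0;
}
```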
Clusters
- Alternative to SMP
- High performance
- High availability
- Server applications
- A group of interconnected whole computers working together as a unified resource
- Illusion of being one machine
- Each computer is called a node

Cluster Benefits
- Absolute scalability
- Incremental scalability
- High availability
- Superior price/performance

Cluster Configurations - Standby Server, No Shared Disk (diagram)
Cluster Configurations - Shared Disk (diagram)

Operating Systems Design Issues
- Failure management
  - High availability
  - Fault tolerance
  - Failover: switching applications and data from the failed system to an alternative within the cluster
  - Failback: restoration of applications and data to the original system after the problem is fixed
- Load balancing
  - Incremental scalability
  - Automatically include new computers in scheduling
  - Middleware needs to recognise that processes may switch between machines

Parallelizing
- Single application executing in parallel on a number of machines in the cluster
- Compiler: determines at compile time which parts can be executed in parallel, split off for different computers
- Application: written from scratch to be parallel; message passing to move data between nodes; hard to program; best end result
- Parametric computing: if a problem is repeated execution of an algorithm on different sets of data, e.g. simulation using different scenarios; needs effective tools to organize and run

Cluster Computer Architecture (diagram)

Cluster Middleware
- Unified image to user: single system image
- Single point of entry
- Single file hierarchy
- Single control point
- Single virtual networking
- Single memory space
- Single job management system
- Single user interface
- Single I/O space
- Single process space
- Checkpointing
- Process migration

Cluster v. SMP
- Both provide multiprocessor support to high-demand applications
- Both available commercially; SMP for longer
- SMP: easier to manage and control; closer to single processor systems (scheduling is the main difference); less physical space; lower power consumption
- Clustering: superior incremental and absolute scalability; superior availability through redundancy

Nonuniform Memory Access (NUMA)
- Alternative to SMP and clustering
- Uniform memory access:
  - All processors have access to all parts of memory using load and store
  - Access time to all regions of memory is the same
  - Access time to memory is the same for different processors
  - As used by SMP
- Nonuniform memory access:
  - All processors have access to all parts of memory using load and store
  - Access time of a processor differs depending on the region of memory
  - Different processors access different regions of memory at different speeds
- Cache-coherent NUMA (CC-NUMA):
  - Cache coherence is maintained among the caches of the various processors
  - Significantly different from SMP and clusters

Motivation
- SMP has a practical limit to the number of processors: bus traffic limits it to between 16 and 64 processors
- In clusters each node has its own memory: apps do not see a large global memory; coherence is maintained by software, not hardware
- NUMA retains the SMP flavour while giving large-scale multiprocessing, e.g. the Silicon Graphics Origin NUMA with 1024 MIPS R10000 processors
- Objective is to maintain a transparent system-wide memory while permitting multiprocessor nodes, each with its own bus or internal interconnection system

CC-NUMA Organization (diagram)

CC-NUMA Operation
- Each processor has its own L1 and L2 cache
- Each node has its own main memory
- Nodes connected by some networking facility
- Each processor sees a single addressable memory space
- Memory request order (modeled in the sketch below):
  1. L1 cache (local to processor)
  2. L2 cache (local to processor)
  3. Main memory (local to node)
  4. Remote memory, delivered to the requesting (local to processor) cache
- Automatic and transparent
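A sketch of the memory request order just listed, with the caches and the local/remote address split reduced to hypothetical one-entry stubs so the sketch compiles and runs; in real CC-NUMA hardware every step is invisible to the program, which is what the last bullet above means.

```c
/* CC-NUMA request order sketch: L1 -> L2 -> local main memory -> remote node.
   All stubs below are illustrative stand-ins, not real hardware interfaces. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t word_t;

/* Tiny one-entry "caches" and a split address space (low half local). */
static uintptr_t l1_tag = (uintptr_t)-1, l2_tag = (uintptr_t)-1;
static word_t l1_val, l2_val;

static bool l1_lookup(uintptr_t a, word_t *v) { if (a == l1_tag) { *v = l1_val; return true; } return false; }
static bool l2_lookup(uintptr_t a, word_t *v) { if (a == l2_tag) { *v = l2_val; return true; } return false; }
static bool local_memory_holds(uintptr_t a) { return a < 0x1000; }
static word_t local_memory_read(uintptr_t a) { return (word_t)(a * 2); }
static word_t remote_node_read(uintptr_t a) { return (word_t)(a * 2); }  /* via node directories */
static void fill_caches(uintptr_t a, word_t v) { l1_tag = l2_tag = a; l1_val = l2_val = v; }

/* The request order from the slide above. */
static word_t load(uintptr_t addr) {
    word_t v;
    if (l1_lookup(addr, &v)) return v;     /* 1. L1, local to processor  */
    if (l2_lookup(addr, &v)) return v;     /* 2. L2, local to processor  */
    v = local_memory_holds(addr)
            ? local_memory_read(addr)      /* 3. main memory, local node */
            : remote_node_read(addr);      /* 4. remote memory           */
    fill_caches(addr, v);                  /* delivered to requesting cache */
    return v;                              /* automatic and transparent  */
}

int main(void) {
    printf("%u (local miss, then fill)\n", load(0x10));
    printf("%u (now an L1 hit)\n", load(0x10));
    printf("%u (remote node)\n", load(0x2000));
    return 0;
}
```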
Memory Access Sequence
- Each node maintains a directory of the location of portions of memory and cache status
- e.g. node 2 processor 3 (P2-3) requests location 798, which is in the memory of node 1:
  1. P2-3 issues a read request on the snoopy bus of node 2
  2. Directory on node 2 recognises the location is on node 1
  3. Node 2 directory requests node 1's directory
  4. Node 1 directory requests the contents of 798
  5. Node 1 memory puts the data on the (node 1 local) bus
  6. Node 1 directory gets the data from the (node 1 local) bus
  7. Data transferred to node 2's directory
  8. Node 2 directory puts the data on the (node 2 local) bus
  9. Data picked up, put in P2-3's cache and delivered to the processor

Cache Coherence
- Node 1 directory keeps note that node 2 has a copy of the data
- If the data is modified in a cache, this is broadcast to other nodes
- Local directories monitor and purge the local cache if necessary
- Local directory monitors changes to local data in remote caches and marks memory invalid until writeback
- Local directory forces writeback if the memory location is requested by another processor

NUMA Pros & Cons
- Effective performance at higher levels of parallelism than SMP
- No major software changes
- Performance can break down if there is too much access to remote memory
- Can be avoided by:
  - L1 and L2 cache design reducing all memory access; needs good temporal locality of software
  - Good spatial locality of software
  - Virtual memory management moving pages to the nodes that are using them most
- Not transparent: page allocation, process allocation and load balancing changes needed
- Availability?

Vector Computation
- Maths problems involving physical processes present different difficulties for computation
  - Aerodynamics, seismology, meteorology
  - Continuous field simulation
- High precision
- Repeated floating point calculations on large arrays of numbers
- Supercomputers handle these types of problems
  - Hundreds of millions of flops
  - $10-15 million
  - Optimised for calculation rather than multitasking and I/O
  - Limited market: research, government agencies, meteorology
- Array processor
  - Alternative to supercomputer
  - Configured as a peripheral to mainframe and mini
  - Just runs the vector portion of problems

Vector Addition Example (diagram)

Approaches
- General purpose computers rely on iteration to do vector calculations; in the example this needs six calculations (contrasted in the sketch at the end of the chapter)
- Vector processing
  - Assume it is possible to operate on a one-dimensional vector of data
  - All elements in a particular row can be calculated in parallel
- Parallel processing
  - Independent processors functioning in parallel
  - Use FORK N to start an individual process at location N
  - JOIN N causes N independent processes to join and merge following the JOIN
  - O/S co-ordinates JOINs; execution is blocked until all N processes have reached the JOIN

Processor Designs
- Pipelined ALU
  - Within operations
  - Across operations
- Parallel ALUs
- Parallel processors

Approaches to Vector Computation (diagram)

Chaining
- Cray supercomputers
- Vector operation may start as soon as the first element of the operand vector is available and the functional unit is free
- Result from one functional unit is fed immediately into another
- If vector registers are used, intermediate results do not have to be stored in memory

Computer Organizations (diagram)

IBM 3090 with Vector Facility (diagram)
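To make the iteration-versus-vector contrast in the Approaches slide concrete, here is a small C sketch: the scalar loop performs six separate additions, while the vector form expresses one operation over whole operands. GCC's vector extensions are used as a stand-in for a real vector instruction set; that choice, and the specific element values, are assumptions of this sketch.

```c
/* Scalar (iterative) versus vector addition of six elements.
   Compiles with GCC or Clang, which lower the vector form to SIMD hardware. */
#include <stdio.h>

typedef float v8sf __attribute__((vector_size(32)));  /* 8 floats; we use 6 */

int main(void) {
    float a[6] = {1, 2, 3, 4, 5, 6};
    float b[6] = {10, 20, 30, 40, 50, 60};
    float c[6];

    /* Iterative approach: six separate scalar additions. */
    for (int i = 0; i < 6; i++)
        c[i] = a[i] + b[i];

    /* Vector approach: one element-wise addition over whole operands. */
    v8sf va = {1, 2, 3, 4, 5, 6};
    v8sf vb = {10, 20, 30, 40, 50, 60};
    v8sf vc = va + vb;

    for (int i = 0; i < 6; i++)
        printf("scalar %.0f  vector %.0f\n", c[i], vc[i]);
    return 0;
}
```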
