Khoa học máy tính - Chapter 19: Recovery and fault tolerance

Recovery from non-Byzantine faults can be performed by using two approaches: Backward recovery and forward recovery Fault tolerance implemented by maintaining logs E.g., undo or do logs Logs used to implement atomic transactions Two-phase commit protocol (2PC protocol) is used Nested transactions are a resiliency technique Used when transaction involves data in many nodes

22 trang | Chia sẻ: nguyenlam99 | Lượt xem: 822 | Lượt tải: 0

Bạn đang xem trước 20 trang tài liệu Khoa học máy tính - Chapter 19: Recovery and fault tolerance, để xem tài liệu hoàn chỉnh bạn click vào nút DOWNLOAD ở trên

Chapter 19Recovery and Fault ToleranceCopyright © 20081Operating Systems, by Dhananjay Dhamdhere*IntroductionFaults, Failures, and RecoveryByzantine Faults and Agreement ProtocolsRecoveryFault Tolerance TechniquesResiliency2Operating Systems, by Dhananjay Dhamdhere*Faults, Failures, and RecoveryA fault may damage the state of a systemError: a part of the system state that is erroneousFailure: unexpected behavior or situation3Operating Systems, by Dhananjay Dhamdhere*Faults, Failures, and Recovery (continued)Recovery: for reliable operation, system is restored to a consistent state, and operation resumedA recovery is performed when a failure is noticed4Operating Systems, by Dhananjay Dhamdhere*Classes of FaultsFault model: properties that determine the kinds of errors/failures that might result from a faultClasses of faults:System fault  system crashAmnesia and partial amnesia faultsA fail-stop fault brings a system to a haltProcess fault Byzantine faults: malicious or arbitrary actionsStorage fault  amnesia faultsCommunication fault  nonamnesia faults5Operating Systems, by Dhananjay Dhamdhere*Overview of Recovery TechniquesFor non-Byzantine faults, recovery involves restoring system or application to a consistent stateInvolves reexecuting some actions6Operating Systems, by Dhananjay Dhamdhere*Overview of Recovery Techniques (continued)Recovery approaches are classified into:Backward recovery: resetting state of entity affected by fault to a prior state and resuming its operationInvolves reexecution of some actions Forward recovery: repairing erroneous state of a system so system can continue its operationRepair cost depends on the nature of the computationMay involve a certain amount of reexecutionBackward recovery is simpler to implementBut, requires a practical method of producing a consistent state recording of a system7Operating Systems, by Dhananjay Dhamdhere*Byzantine Faults and Agreement ProtocolsByzantine faults have been studied only in the restricted context of agreement between processesByzantine generals problem: Attack or retreatAgreement protocols used for:Byzantine agreement problemConsensus problemInteractive consistency problemImpossibility result: a group of three (m) processes containing one (k) faulty cannot reach agreementIf m > 3, agreement is possible if m > 3 k8Operating Systems, by Dhananjay Dhamdhere*RecoveryA recovery scheme consists of two components:Checkpointing algorithmDecides when to take a checkpoint for a processRecovery algorithmUses checkpoints to roll back processes such that new process states are mutually consistent9Operating Systems, by Dhananjay Dhamdhere*Recovery (continued)State of a process cannot be recovered in isolationMust restore state of computation S’ in which states of all pairs of processes are mutually consistentGoal of recovery algorithm is to decide:Whether a process Pi should be rolled backIdentify checkpoint to which Pi should be rolled backAsynchronous or synchronous checkpointing10Operating Systems, by Dhananjay Dhamdhere*Fault Tolerance TechniquesBasic principle in fault tolerance is to ensure that a fault either does not cause an errorOr that the error can be removed easilyTwo facets of the tolerance of system faults that follow the fail-stop model:Fault tolerance for replicated dataFault tolerance for distributed data11Operating Systems, by Dhananjay Dhamdhere*Logs, Forward Recovery, and Backward RecoveryA log is a record of actions or activities in a processDo logs (also called redo logs)Used to implement forward recoveryUndo logsUsed to implement backward recoveryA write-ahead logging principle is usedA log could be an operation log or a value logExample: intentions list (for atomic actions) is a value log used as a redo logValue logs provide idempotency12Operating Systems, by Dhananjay Dhamdhere*Handling Replicated DataAvailability of data D provided through replicationAt least one copy of D would be accessible from any node despite anticipated faults in the systemIf D may be modified, it is essential to use rules to ensure correctness of data access and updates:Many processes can concurrently read DOnly one process can write into D at any timeReading and writing can’t be performed concurrentlyA process reading D must see the latest value of D13Operating Systems, by Dhananjay Dhamdhere*Handling Replicated Data (continued)Quorum: number of copies of D that must be accessed to perform a specific operation on DQuorum algorithms enforce Rules 1–4 by specifying a read quorum Qr and a write quorum Qw2 x Qw > n and Qr + Qw > nTwo kinds of locks used on D: read and write locksIf a system is required to tolerate faults in up to k nodes, we could choose:Qr = k + 1Qw = n − kn > 2 × k14Operating Systems, by Dhananjay Dhamdhere*Handling Distributed DataDistributed transaction: facility for manipulating files located in different nodes of a distributed system in a mutually consistent mannerAlso called a multisite transactionEach node contains a transaction managerOriginating node contains a transaction coordinatorCoordinator implements the all-or-nothing property of transactions with two-phase commit (2PC) protocolDepending on responses from participating nodes, decides whether to commit or abort transaction15Two-Phase Commit ProtocolPhase 1Actions of transaction coordinator:Write a ‘Prepare Ti’ record in the logSet a time-out and send a ‘Prepare Ti’ message to all nodes participating in the transactionActions of a participating node:If it is ready to commit, write updates in stable storage and a ‘Prepared Ti’ record in the log, and send a ‘Prepared Ti’ message to coordinatorOtherwise, write an ‘Abandoned Ti’ record in the log and send an ‘Abandoned Ti’ message to coordinatorOperating Systems, by Dhananjay Dhamdhere*16Two-Phase Commit ProtocolPhase 2Actions of transaction coordinator:If it receives a ‘Prepared’ reply from all nodes before time-out occurs, write a ‘Commit Ti’ record in the log and send ‘Commit Ti’ messages to all nodesOtherwise, write an ‘Abort Ti’ record in the log and send ‘Abort Ti’ messages to all nodesWait for an acknowledgment from each node and write a ‘Complete Ti’ message in the logActions of a participating node: Depending on the coordinator’s message,Write a ‘Commit Ti’ record in the log and perform commit processingWrite an ‘Abort Ti’ record in log and abandon updates of Ti*Operating Systems, by Dhananjay Dhamdhere17Operating Systems, by Dhananjay Dhamdhere*ResiliencyResiliency techniques focus on minimizing the cost of reexecution when faults occurBasis for resiliency: failures in a distributed system are partial, so some parts of a distributed computation may survive a failureFor example, distributed transactions use resiliency techniques:Nested transactionsTentative commits18Nested TransactionA nested transaction Tik is a part of an atomic transaction TiIt commits only if its parent transaction Ti commitsIt is implemented as followsWhen Tik completes, it is said to have reached a tentative commitWhen Ti wishes to commit, it checks whether all nested transactions have reached a tentative commit and can participate in commit processingIt is implemented using a 2PC protocolOperating Systems, by Dhananjay Dhamdhere*19Nested TransactionResiliency is implemented as follows:If a nested transaction Tik does not respond to a ‘Prepare’ message, the coordinator can retry Tik in the same node or in some other nodeIf Tik had reached tentative commit and its node had failed when ‘Prepare’ message was sentIf the failed node recovers and the coordinator retries Tik in itThe results of Tik, computed before the failure, can be used Operating Systems, by Dhananjay Dhamdhere*20Operating Systems, by Dhananjay Dhamdhere*SummaryRecovery and fault tolerance are two approaches to reliability of a computer systemGenerically called recoveryA third recovery approach is resiliencyA fault causes an error in the state of the system, which leads to a failureA fail-stop fault brings the system to a haltAn amnesia fault makes it lose a part of its stateA Byzantine fault makes it behave in an unpredictable manner21Operating Systems, by Dhananjay Dhamdhere*Summary (continued)Recovery from non-Byzantine faults can be performed by using two approaches:Backward recovery and forward recoveryFault tolerance implemented by maintaining logsE.g., undo or do logsLogs used to implement atomic transactionsTwo-phase commit protocol (2PC protocol) is usedNested transactions are a resiliency technique Used when transaction involves data in many nodes22

Các file đính kèm theo tài liệu này:

chapter_19_2507.ppt