Friday, October 8, 2004, 3:14:25 AM, David Teigland wrote:
> On Thu, Oct 07, 2004 at 07:26:35AM -0400, Jeff wrote:
>> My preference would be that it has the most current copy from
>> the surviving members. If the nodes keep track of the change count,
>> this would be the copy with the highest value. An alternative,
>> although I suspect this is more difficult to implement, would be for
>> each surviving node to return the VALNOTVALID status until it writes
>> the lock value block. In this case after one node has written the value
>> block it would be important that the current, valid, value is used.
>>
>> Here's the problem with simply resetting the value block to zero.
>> We're using the value block as a counter to track whether a block
>> on disk has changed or not. Each cluster member keeps a copy of the
>> value block counter in memory along with the associated disk block.
>> When a process converts a NL lock to a higher mode it reads the
>> current copy of the value block to decide whether it needs to re-read
>> the block from disk.
>>
>> When the lock request completes with VALNOTVALID as a status the
>> process knows that it needs to re-read the block from disk. The big
>> question though is what does it write into the lock value block at
>> that point so the other systems will know this as well. If the lock
>> value block is guaranteed to have the most recent value seen by the
>> existing nodes then the process can simply increment the value and
>> it will know that the result will not match what any other system has
>> cached. If the lock value block is zeroed or set to an arbitrary
>> value from any one of the surviving nodes, then it might be a value
>> which is lower than exists on one or more of the nodes. There are ways
>> we can deal with this but it means more bookkeeping.

> That makes sense. Here's an outline of LVB recovery. While recovering
> resource R on node N:
>
> - If N was the master of R before recovery, we just leave R's LVB contents
>   as they are. (We are certain this LVB was the most recent one written.)
>
> - If N is a new master of R (assigned during recovery) we rebuild
>   R's locks from remaining nodes, then:
>
>   o If any of the locks have mode > CR, we take the LVB from it as
>     the copy for R. (We are certain this is the most recent LVB that
>     was written.)
>
>   o If all locks on R have mode <= CR, we cannot know if any of the
>     LVB's on the remaining locks represent R's last LVB prior to
>     recovery. We can, however, pick the most recent copy from the
>     remaining locks by using LVB sequence numbers. (this is the part
>     we don't do now)
>
> Lock_dlm can use the VALNOTVALID flag to zero the LVB in this last case as
> GFS requires.

And in step #3 the resource is marked VALNOTVALID which is sent across
with subsequent grants until the lock value block is written.
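
To make the three cases in the outline concrete, here is a minimal Python sketch of the selection logic. It is only an illustration of the rules as described in this thread, not the DLM's actual code. The names `Mode`, `Lock`, `lvb_seq` and `recover_lvb` are hypothetical, the sequence number is assumed to increase with every LVB write, and the handling of a resource with no surviving locks is my own assumption.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Tuple


class Mode(IntEnum):
    # Lock modes in increasing numeric order, so "mode > CR" is a plain comparison.
    NL = 0
    CR = 1
    CW = 2
    PR = 3
    PW = 4
    EX = 5


@dataclass
class Lock:
    mode: Mode
    lvb: bytes     # this lock holder's copy of the LVB
    lvb_seq: int   # hypothetical sequence number (higher = written later)


def recover_lvb(was_master: bool, master_lvb: bytes,
                locks: List[Lock]) -> Tuple[bytes, bool]:
    """Return (lvb, valnotvalid) for resource R on node N after recovery."""
    # Case 1: N was already the master, so its LVB is the last one written.
    if was_master:
        return master_lvb, False

    # Case 2: N is a new master and some rebuilt lock has mode > CR;
    # take the LVB from that lock.
    for lock in locks:
        if lock.mode > Mode.CR:
            return lock.lvb, False

    # Assumption (not in the outline): no surviving locks means no copy at all.
    if not locks:
        return b"", True

    # Case 3: all modes <= CR. Pick the copy with the highest sequence
    # number and mark the resource VALNOTVALID until the LVB is rewritten.
    newest = max(locks, key=lambda lock: lock.lvb_seq)
    return newest.lvb, True
```

For example, with surviving locks at NL (sequence 5) and CR (sequence 7), the sketch returns the sequence-7 copy with the flag set. A node that then sees VALNOTVALID can re-read the disk block and write back that value plus one, which will not match the counter cached on any surviving node.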