On Thu, Oct 07, 2004 at 07:26:35AM -0400, Jeff wrote:
> My preference would be that it has the most current copy from
> the surviving members. If the nodes keep track of the change count,
> this would be the copy with the highest value. An alternative,
> although I suspect this is more difficult to implement, would be for
> each surviving node to return the VALNOTVALID status until it writes
> the lock value block. In this case after one node has written the value
> block it would be important that the current, valid, value is used.
>
> Here's the problem with simply resetting the value block to zero.
> We're using the value block as a counter to track whether a block
> on disk has changed or not. Each cluster member keeps a copy of the
> value block counter in memory along with the associated disk block.
> When a process converts a NL lock to a higher mode it reads the
> current copy of the value block to decide whether it needs to re-read
> the block from disk.
>
> When the lock request completes with VALNOTVALID as a status the
> process knows that it needs to re-read the block from disk. The big
> question though is what does it write into the lock value block at
> that point so the other systems will know this as well. If the lock
> value block is guaranteed to have the most recent value seen by the
> existing nodes then the process can simply increment the value and
> it will know that the result will not match what any other system has
> cached. If the lock value block is zeroed or set to an arbitrary
> value from any one of the surviving nodes, then it might be a value
> which is lower than exists on one or more of the nodes. There are ways
> we can deal with this but it means more bookkeeping.

That makes sense. Here's an outline of LVB recovery.

While recovering resource R on node N:

- If N was the master of R before recovery, we just leave R's LVB
  contents as they are. (We are certain this LVB was the most recent
  one written.)
- If N is a new master of R (assigned during recovery) we rebuild R's
  locks from remaining nodes, then:

  o If any of the locks have mode > CR, we take the LVB from it as the
    copy for R. (We are certain this is the most recent LVB that was
    written.)

  o If all locks on R have mode <= CR, we cannot know if any of the
    LVB's on the remaining locks represent R's last LVB prior to
    recovery. We can, however, pick the most recent copy from the
    remaining locks by using LVB sequence numbers. (this is the part
    we don't do now)

Lock_dlm can use the VALNOTVALID flag to zero the LVB in this last case
as GFS requires.

-- 
Dave Teigland <teigland@xxxxxxxxxx>
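
For illustration only, the recovery outline above could be sketched
roughly as follows. This is a toy Python model, not the DLM code: the
names Lock, Resource, recover_lvb and lvb_seq are hypothetical, the
handling of a resource with no rebuilt locks is an assumption, and
setting VALNOTVALID only in the "all locks <= CR" case is one reading
of the last paragraph of the outline.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Mode(IntEnum):
    # DLM lock modes in increasing numeric order.
    NL = 0
    CR = 1
    CW = 2
    PR = 3
    PW = 4
    EX = 5


@dataclass
class Lock:
    mode: Mode
    lvb: bytes
    lvb_seq: int  # hypothetical per-copy LVB sequence number


@dataclass
class Resource:
    lvb: bytes
    locks: List[Lock]          # locks rebuilt from the remaining nodes
    valnotvalid: bool = False  # models the VALNOTVALID status flag


def recover_lvb(res: Resource, was_master: bool) -> None:
    """Pick the LVB for a resource during recovery on this node."""
    if was_master:
        # Old master: its LVB is the most recent one written; keep it.
        return

    if not res.locks:
        # Assumption (not covered by the outline): nothing to rebuild
        # from, so the LVB cannot be trusted.
        res.valnotvalid = True
        return

    # New master: prefer the LVB of any lock held above CR.
    for lock in res.locks:
        if lock.mode > Mode.CR:
            res.lvb = lock.lvb
            res.valnotvalid = False
            return

    # All locks are <= CR: take the copy with the highest sequence
    # number and flag it, since it may not be the last LVB written.
    newest = max(res.locks, key=lambda lock: lock.lvb_seq)
    res.lvb = newest.lvb
    res.valnotvalid = True
```

A caller with GFS-like requirements could zero res.lvb whenever
valnotvalid is set, while a caller using the LVB as a change counter
could keep the value and increment it.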