[Linux-cluster] Re: GFS on md on shared disks?

Ed L Cashin <ecashin@xxxxxxxxxx> · Thu, 07 Oct 2004 14:39:24 -0400

Ken Preslan <kpreslan@xxxxxxxxxx> writes:

...
> Suppose Node A writes inode 23 and Node B writes inode 24 (both at the
> same time).  The following sequence of events could occur:
>
> 1)  Node A locks inode 23 exclusively
> 2)  Node B locks inode 24 exclusively
> 3)  Node A starts writing inode 23.  This consists of:
>     A) Reading the inode off of Disk 0
>     B) Reading the parity block off of Disk 2
>     C) XORing the old version of the Disk 0 block out of the Disk 2 block
>     D) XORing the new version of the Disk 0 block into the Disk 2 block
> 4)  Node B starts writing inode 24.  This consists of:
>     A) Reading the inode off of Disk 1
>     B) Reading the parity block off of Disk 2
>     C) XORing the old version of the Disk 1 block out of the Disk 2 block
>     D) XORing the new version of the Disk 1 block into the Disk 2 block
> 5)  Node A completes writing inode 23.  This consists of:
>     A) Writing the new block to Disk 0
>     B) Writing the new parity block to Disk 2
> 6)  Node A completes writing inode 24.  This consists of:

That's node B, if I am following you correctly.

>     A) Writing the new block to Disk 1
>     B) Writing the new parity block to Disk 2
>
> The problem is that you had two simultaneous read-modify-write operations
> on the parity block.  Neither operation took the other one into account.
> So, the data in the non-parity blocks is correct, but the parity block is
> now corrupt.  As long as you don't lose a disk, you're fine.  But, as soon
> as a disk dies, the values you'll get from reading inode 23 and 24 will
> be completely bogus.

Thanks for the example.  That's about as concrete as one could hope for!
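
To convince myself, here is a toy replay of that interleaving in
Python.  It is only a sketch: single integers stand in for blocks
(real blocks are XORed bytewise), the starting values are arbitrary,
and begin_write/complete_write are names I made up for steps 3/4 and
5/6.  It is not how md is actually structured.

```python
# Toy three-disk RAID5 stripe: disk[0] and disk[1] hold data,
# disk[2] holds parity.  Integers stand in for whole blocks.
disk = [0x23, 0x24, 0x23 ^ 0x24]

def begin_write(data_disk, new_value):
    """Steps 3/4: read old data and old parity, compute the new parity."""
    old_data = disk[data_disk]
    old_parity = disk[2]
    # XOR the old data out of the parity and the new data in.
    new_parity = old_parity ^ old_data ^ new_value
    return data_disk, new_value, new_parity

def complete_write(pending):
    """Steps 5/6: write the new data block and the new parity block."""
    data_disk, new_value, new_parity = pending
    disk[data_disk] = new_value
    disk[2] = new_parity

a = begin_write(0, 0xAA)   # node A, inode 23
b = begin_write(1, 0xBB)   # node B, inode 24: reads the same old parity
complete_write(a)
complete_write(b)          # clobbers node A's parity update

parity_ok = disk[2] == (disk[0] ^ disk[1])
rebuilt_disk0 = disk[1] ^ disk[2]   # what a rebuild yields if disk 0 dies

print(parity_ok, hex(disk[0]), hex(rebuilt_disk0))
```

Both data blocks end up correct (0xAA and 0xBB), but the parity only
reflects node B's update, so parity_ok is False and rebuilding disk 0
gives back the old 0x23 instead of 0xAA.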

> A cluster aware software RAID5 implementation would lock stripes so that
> only one machine could modify a stripe at a time.

That sounds like it would be slow.  Maybe not, though, with
reader-writer locks and a workload where writes are infrequent.
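
For what it's worth, the stripe locking idea can be sketched on a toy
three-disk stripe with integers as blocks.  Big assumption: a
threading.Lock inside one process stands in for a cluster-wide stripe
lock here; a real implementation would need a distributed lock
manager, and write_block is a made-up name.

```python
import threading

# Toy three-disk RAID5 stripe: disk[0] and disk[1] hold data,
# disk[2] holds parity.  Integers stand in for whole blocks.
disk = [0x23, 0x24, 0x23 ^ 0x24]

# Stand-in for a cluster-wide lock on this stripe (assumption: a real
# cluster would use a distributed lock, not an in-process one).
stripe_lock = threading.Lock()

def write_block(data_disk, new_value):
    """Do the whole read-modify-write while holding the stripe lock."""
    with stripe_lock:
        old_data = disk[data_disk]
        old_parity = disk[2]
        disk[data_disk] = new_value
        disk[2] = old_parity ^ old_data ^ new_value

# Two "nodes" writing different data blocks in the same stripe.
writers = [
    threading.Thread(target=write_block, args=(0, 0xAA)),
    threading.Thread(target=write_block, args=(1, 0xBB)),
]
for t in writers:
    t.start()
for t in writers:
    t.join()

parity_ok = disk[2] == (disk[0] ^ disk[1])
print(parity_ok)
```

Whichever writer gets the lock first, the second one reads the parity
the first one wrote, so the parity stays consistent.  The cost is the
one I am worried about: every write to a stripe is serialized.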

-- 
  Ed L Cashin <ecashin@xxxxxxxxxx>