On Thu, Oct 07, 2004 at 12:07:57PM -0400, Ed L Cashin wrote:
> Erling Nygaard <nygaard@xxxxxxxxxx> writes:
>
> > No, this will not work at all.
> >
> > All GFS locking is done on a filesystem level. In order to make this work
> > you need locking on the blocksystem level.
>
> I guess I'm looking for a concrete reason why it won't work.  I've
> been assuming it won't work, but I can't think of a concrete reason.

The reason non-cluster-aware software RAID5 won't work is that the
parity blocks aren't locked correctly.  Let's take a case where there
are 3 disks and look at the contents of one stripe:

              Disk0       Disk1       Disk2
          +-----------+-----------+-----------+
   ....   |           |           |           |
Stripe 12 | inode #23 | inode #24 |  parity   |
   ....   |           |           |           |
          +-----------+-----------+-----------+

Suppose Node A writes inode 23 and Node B writes inode 24 (both at the
same time).  The following sequence of events could occur:

1) Node A locks inode 23 exclusively
2) Node B locks inode 24 exclusively
3) Node A starts writing inode 23.  This consists of:
   A) Reading the inode off of Disk 0
   B) Reading the parity block off of Disk 2
   C) XORing the old version of the Disk 0 block out of the Disk 2 block
   D) XORing the new version of the Disk 0 block into the Disk 2 block
4) Node B starts writing inode 24.  This consists of:
   A) Reading the inode off of Disk 1
   B) Reading the parity block off of Disk 2
   C) XORing the old version of the Disk 1 block out of the Disk 2 block
   D) XORing the new version of the Disk 1 block into the Disk 2 block
5) Node A completes writing inode 23.  This consists of:
   A) Writing the new block to Disk 0
   B) Writing the new parity block to Disk 2
6) Node B completes writing inode 24.  This consists of:
   A) Writing the new block to Disk 1
   B) Writing the new parity block to Disk 2

The problem is that you had two simultaneous read-modify-write
operations on the parity block.  Neither operation took the other one
into account: each node computed its new parity from the old parity
block, so whichever parity write lands last silently discards the
other node's update.
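
To make the interleaving concrete, here is a rough sketch that replays
steps 3 through 6 in Python.  It is only an illustration, not GFS or md
code: the function names, the in-memory "disks" list, and the block
contents are all made up for the example.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length blocks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

def plan_write(disks, data_disk, parity_disk, new_block):
    """Steps A-D: read old data and old parity, compute the new parity."""
    old_block = disks[data_disk]
    old_parity = disks[parity_disk]
    # XOR the old data out of the parity, then XOR the new data in.
    new_parity = xor_blocks(xor_blocks(old_parity, old_block), new_block)
    return new_block, new_parity

def commit_write(disks, data_disk, parity_disk, planned):
    """Write the new data block and the precomputed parity block."""
    new_block, new_parity = planned
    disks[data_disk] = new_block
    disks[parity_disk] = new_parity

def make_stripe():
    """Hypothetical stripe 12: Disk0, Disk1, and parity on Disk2."""
    d0 = b"inode23-old"
    d1 = b"inode24-old"
    return [d0, d1, xor_blocks(d0, d1)]

def racy_update():
    """Both nodes read the parity before either one writes it back."""
    disks = make_stripe()
    plan_a = plan_write(disks, 0, 2, b"inode23-NEW")  # step 3 (Node A)
    plan_b = plan_write(disks, 1, 2, b"inode24-NEW")  # step 4 (Node B)
    commit_write(disks, 0, 2, plan_a)                 # step 5 (Node A)
    commit_write(disks, 1, 2, plan_b)                 # step 6 (Node B)
    return disks

def serialized_update():
    """Same two writes, but one finishes before the other starts."""
    disks = make_stripe()
    commit_write(disks, 0, 2, plan_write(disks, 0, 2, b"inode23-NEW"))
    commit_write(disks, 1, 2, plan_write(disks, 1, 2, b"inode24-NEW"))
    return disks

def rebuild_disk0(disks):
    """Reconstruct Disk0 from Disk1 and parity, as if Disk0 had died."""
    return xor_blocks(disks[1], disks[2])

if __name__ == "__main__":
    racy = racy_update()
    print("parity consistent after race:",
          racy[2] == xor_blocks(racy[0], racy[1]))
    print("Disk0 rebuilt after race:", rebuild_disk0(racy))
    print("Disk0 rebuilt after serialized writes:",
          rebuild_disk0(serialized_update()))
```

In this particular trace both data blocks hold the new contents after
the race, but the parity no longer matches them, and rebuilding Disk 0
from Disk 1 and the parity does not give back what was written.  When
the same two writes run one after the other, the parity stays
consistent.
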
So, the data in the non-parity blocks is correct, but the parity block
is now corrupt.  As long as you don't lose a disk, you're fine.  But, as
soon as a disk dies, the values you'll get from reading inode 23 and 24
will be completely bogus, because the missing block has to be
reconstructed from the bad parity.

A cluster-aware software RAID5 implementation would lock stripes so that
only one machine could modify a stripe at a time.

-- 
Ken Preslan <kpreslan@xxxxxxxxxx>