On Fri, Oct 02, 2009 at 11:51:59AM +1000, Neil Brown wrote: > I have thought about this sort of thing from time to time and I have a > very different idea for how the necessary communication between the > filesystem and MD would happen. I think my approach would completely > address this problem, and doesn't need to add any ioctls (which I am not > keen on). > > I would add two new BIO_RW_ flags to be used with WRITE requests. > The first flag would mean "don't worry about a crash in the middle of > this write, I will validate it after a crash before I rely on the > data." > The second would mean "last time I wrote data near here there might > have been a failure, be extra careful". > > So the first flag would be used during normal filesystem writes for > every block that gets recorded in the journal, and for every write > to the journal. > > The second flag is used after a crash to re-write every block that > could have been in-flight during the crash. Some of those blocks will > be read from the journal and written to their proper home, other will > be read from wherever they are and written back there. > > The first flag would be interpreted by MD as "don't set the bitmap > bit". The second flag would be interpreted as "don't trust the > parity block, but do a reconstruct-write". > > With this scheme you would still need a write-intent-bitmap on the MD > array, but no bits would ever be set if the filesystem were using the > new flags, so no performance impact. You probably could run without a > bitmap, in which case the flag would me "don't mark the array as > active". That makes a lot of sense. The original design used something similar to your second flag to do all the resync work. I changed to an IOCTL approach because there's no good way for a flagged IO to return "the resync you requested could not be done so don't skip the automated resync." But with your suggestion, there's no need to explicitly skip the automated resync at the end either - it just won't act on those blocks, assuming all blocks were previously written with the first flag. I don't like the IOCTLs either :) > I'm not entirely sure about the second flag. Maybe it would be better > to make it a flag for READ and have it mean "validate and correct any > redundancy information (duplicates or parity) for this block before > returning it. Then we could have just one flag, which meant different > things for READ and WRITE. Yes, it needs to be a READ flag to avoid an unnecessary read-write cycle. The list of declare blocks in the ext3 journal just lists the block numbers, not their contents (this is equivalent to data=ordered, not data=journaled). It would be nice to also have a WRITE flag though, for those cases (metadata) where we do know what's being written to the block. In any case, I think it's less confusing to have two flags: the "don't worry about this block for now" flag and the "resync this block now" flag. There's still one complication though, and an IOCTL may still be necessary: at fsck time, we need to perform a journal recovery from e2fsck (userland.) Clearly, resyncing the declared blocks (plus the journal and affected metadata blocks) must be done before e2fsck does any writes to these blocks. An IOCTL seemed to be the best way to do that, but I'm open to other suggestions. > What do you think? I'll send a new patchset, hopefully today. Cheers, Jody > NeilBrown -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html