On Tue, Jul 14, 2015 at 08:22:54AM +1000, NeilBrown wrote:
> On Fri, 10 Jul 2015 10:48:45 -0700 Shaohua Li <shli@xxxxxx> wrote:
>
> > On Fri, Jul 10, 2015 at 04:42:09PM +1000, NeilBrown wrote:
> > > On Thu, 9 Jul 2015 22:18:15 -0700 Shaohua Li <shli@xxxxxx> wrote:
> > >
> > > > On Fri, Jul 10, 2015 at 03:10:44PM +1000, NeilBrown wrote:
> > > > > On Thu, 9 Jul 2015 21:52:43 -0700 Shaohua Li <shli@xxxxxx> wrote:
> > > > >
> > > > > > On Fri, Jul 10, 2015 at 02:36:56PM +1000, NeilBrown wrote:
> > > > > > > On Thu, 9 Jul 2015 21:08:49 -0700 Shaohua Li <shli@xxxxxx> wrote:
> > > > > > >
> > > > > > >
> > > > > > > There is also the issue of what action commits a previous transaction.
> > > > > > > I'm not sure what you had. I'm suggesting that each metadata block
> > > > > > > commits previous transactions. Is that a close-enough match to what
> > > > > > > you had?
> > > > > >
> > > > > > What did you mean by a transaction? In my implementation, a metadata
> > > > > > block and the stripe data/parity that follow it make up an io unit. io
> > > > > > units can finish out of order, but if an io unit has a flush request (the
> > > > > > data has a flush/flush bio or the metadata is a flush block), the io unit
> > > > > > can only start after all previous io units and the disk cache flush
> > > > > > finish. Such an io unit is strictly ordered. The log patch describes this
> > > > > > behavior. Does it match?
> > > > >
> > > > > Yes, a "transaction" is an "io unit". The flushing is the same.
> > > > > I just couldn't remember how, when reading the log on restart, you
> > > > > determined if a given "io unit" was reliably consistent, or whether it
> > > > > should be ignored (having possibly only partially been written).
> > > >
> > > > The metadata block has a checksum for the data of the block. The
> > > > data/parity has its checksum stored in the metadata block. This way we
> > > > can know if metadata and data are consistent.
> > > >
> > >
> > > OK .. though I'm not totally sold on the value of checksums. When a
> > > checksum doesn't match, that means something. When a checksum does
> > > match, it could just be a co-incidence.
> > > I'd rather have a process that made checksums unnecessary, and only use
> > > the checksums as a double-check.
> >
> > We could do something like: write metadata/data, wait, write another
> > metadata. The second metadata indicates the first is on disk. But this
> > can impact performance very much.
>
> The performance consideration is why I suggested a double-buffered
> approach. Write metadata1, data1, metadata2, data2, then don't write
> metadata3 until metadata1 and data1 have been written.
> I haven't actually tried that so I don't know for certain it would help.

Not sure if double buffering works, but you can't write metadata1 till
data1 hits the disk, which has a big penalty. The only possible way is to
organize data/metadata as a big transaction, like jbd does, so the wait
doesn't hurt too much.

> > I think checksum should be fine. It
> > might be just a coincidence, but the rate should be extremely low. jbd2 is
> > using checksum too now.
>
> Maybe I'll have a look at jbd2 - do you know what sort of checksum it
> uses? I'd be surprised it didn't use something quite a bit stronger
> than crc32 for a task like this.

It uses crc32: a 32-bit checksum for every 4k, as far as I can tell.

Thanks,
Shaohua
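
For illustration, the restart-time check described above (the metadata
block carries a checksum of itself plus one checksum per data/parity
block) could be sketched roughly as below. This is only a sketch under
stated assumptions, not the layout used by the log patch: the header
fields, the MAGIC value, the 4k block size and the helper names
build_io_unit() and verify_io_unit() are all made up for the example.
Only the use of crc32 and the "checksums live in the metadata block" idea
come from the thread.

```python
import struct
import zlib

BLOCK_SIZE = 4096                 # assumed block size, not from the patch
MAGIC = 0x6C6F6721                # hypothetical magic number
# Hypothetical header: magic, sequence number, block count, metadata crc32.
HEADER = struct.Struct("<IQII")


def build_io_unit(seq, blocks):
    """Return (metadata, blocks) for one io unit.

    The metadata holds a crc32 for every data/parity block and a crc32
    over the metadata itself (computed with the checksum field zeroed).
    """
    sums = b"".join(struct.pack("<I", zlib.crc32(b)) for b in blocks)
    head = HEADER.pack(MAGIC, seq, len(blocks), 0)
    meta_crc = zlib.crc32(head + sums)
    return HEADER.pack(MAGIC, seq, len(blocks), meta_crc) + sums, blocks


def verify_io_unit(metadata, blocks):
    """Return True only if metadata and every block match their checksums."""
    if len(metadata) < HEADER.size:
        return False                      # truncated metadata block
    magic, seq, count, meta_crc = HEADER.unpack_from(metadata)
    if magic != MAGIC or count != len(blocks):
        return False
    sums = metadata[HEADER.size:HEADER.size + 4 * count]
    if len(sums) != 4 * count:
        return False                      # checksum table cut short
    # Recompute the metadata checksum with the checksum field zeroed.
    head = HEADER.pack(magic, seq, count, 0)
    if zlib.crc32(head + sums) != meta_crc:
        return False                      # metadata itself is torn or stale
    expected = struct.unpack("<%dI" % count, sums)
    # A partially written data/parity block fails its own checksum.
    return all(zlib.crc32(b) == e for b, e in zip(blocks, expected))
```

On restart, an io unit that fails verify_io_unit() would be treated as
partially written and ignored. Note that this does not remove the concern
raised above: a passing check still only says that a mismatch was not
observed, and with a 32-bit checksum random garbage would be expected to
pass about once in 2^32 attempts.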