On Fri, 10 Jul 2015 10:48:45 -0700 Shaohua Li <shli@xxxxxx> wrote:

> On Fri, Jul 10, 2015 at 04:42:09PM +1000, NeilBrown wrote:
> > On Thu, 9 Jul 2015 22:18:15 -0700 Shaohua Li <shli@xxxxxx> wrote:
> >
> > > On Fri, Jul 10, 2015 at 03:10:44PM +1000, NeilBrown wrote:
> > > > On Thu, 9 Jul 2015 21:52:43 -0700 Shaohua Li <shli@xxxxxx> wrote:
> > > >
> > > > > On Fri, Jul 10, 2015 at 02:36:56PM +1000, NeilBrown wrote:
> > > > > > On Thu, 9 Jul 2015 21:08:49 -0700 Shaohua Li <shli@xxxxxx> wrote:
> > > > > >
> > > > > >
> > > > > > There is also the issue of what action commits a previous transaction.
> > > > > > I'm not sure what you had. I'm suggesting that each metadata block
> > > > > > commits previous transactions. Is that a close-enough match to what
> > > > > > you had?
> > > > >
> > > > > What did you mean about a transaction? In my implementation, metadata
> > > > > block and followed stripe data/parity consist of an io unit. io units can
> > > > > be finished out of order. but if io unit has flush request (the data has
> > > > > flush/flush bio or metadata is a flush block), the io unit can only
> > > > > start after all previous io units and disk cache flush finish. Such io
> > > > > unit is strictly ordered. The log patch describes this behavior. Does it
> > > > > match?
> > > >
> > > > Yes, a "transaction" is an "io unit". The flushing is the same.
> > > > I just couldn't remember how, when reading the log on restart, you
> > > > determined if a given "io unit" was reliably consistent, or whether it
> > > > should be ignored (having possibly only partially been written).
> > >
> > > The metadata block has a checksum for data of the block. data/parity has
> > > checksum stored in metadata block. This way we can know if metadata and
> > > data is consistent.
> > >
> >
> > OK .. though I'm not totally sold on the value of checksums. When a
> > checksum doesn't match, that means something. When a checksum does
> > match, it could just be a co-incidence.
> > I'd rather have a process that made checksums unnecessary, and only use
> > the checksums as a double-check.
>
> We could do something like: write metadata/data, wait, write another
> metadata. the second metadata indicates the first is in disk. But this
> can impact performance very much.

The performance consideration is why I suggested a double-buffered
approach. Write metadata1, data1, metadata2, data2, then don't write
metadata3 until metadata1 and data1 have been written.

I haven't actually tried that so I don't know for certain it would help.

> I think checksum should be fine. It
> might be just a coninsidence, but the rate should extremely low. jbd2 is
> using checksum too now.

Maybe I'll have a look at jbd2 - do you know what sort of checksum it
uses? I'd be surprised if it didn't use something quite a bit stronger
than crc32 for a task like this.

NeilBrown

>
> Thanks,
> Shaohua
> --
> To unsubscribe from this list: send the line "unsubscribe linux-raid" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
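
To make the double-buffered ordering concrete, here is a rough toy model
in Python (not kernel code, and not taken from the log patch). The names
DoubleBufferedLog, can_submit, submit, complete and
trusted_without_checksum are all made up for illustration. It assumes io
units carry increasing sequence numbers and that unit N+2 is never
submitted until unit N is on stable storage, so finding unit N+2 during
replay implies unit N was fully written. How a metadata block is
recognised on replay is outside the sketch.

```python
class DoubleBufferedLog:
    """Toy model of the ordering rule: unit N+2 waits for unit N."""

    WINDOW = 2  # assumption: "double-buffered" means two units may overlap

    def __init__(self):
        self.next_seq = 1
        self.durable = set()   # units known to be fully on stable storage
        self.submitted = []    # units whose metadata+data have been issued

    def can_submit(self):
        # The unit two positions back must be durable before we go on.
        gate = self.next_seq - self.WINDOW
        return gate < 1 or gate in self.durable

    def submit(self):
        """Issue metadata+data for the next io unit; return its number."""
        if not self.can_submit():
            raise RuntimeError("gating io unit is not yet durable")
        seq = self.next_seq
        self.next_seq += 1
        self.submitted.append(seq)
        return seq

    def complete(self, seq):
        """Record that unit `seq` has reached stable storage."""
        if seq not in self.submitted:
            raise ValueError("unit was never submitted")
        self.durable.add(seq)


def trusted_without_checksum(seqs_found_in_log):
    """Units proven complete by the presence of a unit WINDOW later."""
    found = set(seqs_found_in_log)
    return sorted(s for s in found if s + DoubleBufferedLog.WINDOW in found)
```

If a crash happens just after unit 3 is submitted, replay finds units 1,
2 and 3. Unit 1 can be trusted from ordering alone; units 2 and 3 would
still have to fall back on their checksums.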