On 8/20/18, NeilBrown <neilb@xxxxxxxx> wrote:
> On Fri, Aug 17 2018, Danil Kipnis wrote:
>
>>>>> > On 08/11/2018 02:06 AM, NeilBrown wrote:
>>>>> >> It might be expected behaviour with async direct IO.
>>>>> >> Two threads writing with O_DIRECT I/O to the same address could
>>>>> >> result in different data on the two devices.  This doesn't seem
>>>>> >> to me to be a credible use-case though.  Why would you ever want
>>>>> >> to do that in practice?
>>>>> >>
>>>>> >> NeilBrown
>>>>> >
>>>>> > My only thought is that while the credible case may be weak, if it
>>>>> > is something that can be protected against with a few conditionals
>>>>> > to prevent the data on the slaves diverging -- then it's worth a
>>>>> > couple of conditions to prevent the nut who knows just enough
>>>>> > about dd from confusing things....
>>>>>
>>>>> Yes, it can be protected against - the code is already written.
>>>>> If you have a 2-drive raid1 and want it to be safe against this
>>>>> attack, simply:
>>>>>
>>>>>   mdadm /dev/md127 --grow --level=raid5
>>>>>
>>>>> This will add the required synchronization between writes so that
>>>>> multiple writes to the one block are linearized.  There will be a
>>>>> performance impact.
>>>>>
>>>>> NeilBrown
>>>>
>>>> Thanks for your comments, Neil.
>>>> Converting to raid5 with 2 drives will not only cause a performance
>>>> drop, it will also disable the redundancy.
>>>> It's clearly a no-go.
>>>
>>> I don't understand why you think it would disable the redundancy;
>>> there are still two copies of every block.  Both RAID1 and RAID5 can
>>> survive a single device failure.
>>>
>>> I agree about performance and don't expect this would be a useful
>>> thing to do; it just seemed the simplest way to explain the cost that
>>> would be involved in resisting this attack.
>>>
>>> NeilBrown
>>
>> Hi Neil,
>>
>> the performance impact one is facing when running raid5 on top of two
>> legs - is it only due to the tracking of the in-flight writes, or is
>> raid5 actually doing some XORing (with zeros?) in that case?  And if
>> the CPU is burned for some reason other than the tracking, do you
>> think it would make sense to expose that
>> "writes-to-the-same-sector-tracking" functionality for the raid1
>> personality as well?
>
> There would be some performance impact due to extra work for the CPU,
> but I doubt it would be much - CPUs are fast these days.
>
> With RAID1, the data isn't copied.  With RAID5 it is - it is copied
> twice: once from the fs or user-space buffer into the stripe cache, and
> once from 'data' to 'parity' slots in the stripe cache.  This copying
> would cause some of the slowdown - memory isn't as fast as CPUs.
>
> Also, all requests are divided into multiple 4K requests as they pass
> through the RAID5 stripe cache.  They should get recombined, but this
> is quite a bit of work and often results in more, smaller requests,
> which doesn't make such good use of the underlying device.
>
> Finally, there is the synchronization overhead of taking locks to make
> sure that requests don't overlap.  With lots of CPUs and very fast
> devices, this can be quite significant.
>
> Would it make sense to track writes to the same sector in RAID1?
> Given the justification provided so far, I would say "no".
> The only use-case that has been demonstrated is performance testing,
> and performance testing isn't hurt by data being different on
> different devices.
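
For reference, a minimal sketch of the kind of workload being discussed
(device name and offset are placeholders, and the race does not hit on
every run):

  # two overlapping O_DIRECT writes of different data to the same block
  dd if=/dev/urandom of=/dev/md127 oflag=direct bs=4k count=1 seek=100 &
  dd if=/dev/urandom of=/dev/md127 oflag=direct bs=4k count=1 seek=100 &
  wait

  # ask md to compare the legs and report mismatches (array must be idle)
  echo check > /sys/block/md127/md/sync_action
  cat /sys/block/md127/md/mismatch_cnt

If the two writes are in flight at the same time, the copies on the two
legs can end up different and mismatch_cnt can be non-zero after the
check completes.
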
> I really think that any genuine use case that cared about data
> consistency would never write data in a way that could result in these
> inconsistencies.
>
> However ... if you have a RAID1 configured with write-behind and a
> write-mostly device, then it is possible that a similar problem could
> arise.
>
> If the app or fs writes to the same block twice in quick succession -
> waiting for the first write to complete before issuing the second
> write - then the data on the non-write-mostly devices will reflect the
> second write as expected.  However, the two writes to the write-mostly
> device could be re-ordered, and you could end up with the data from the
> first write remaining on the write-mostly device.  This is rather
> unlikely, but it is theoretically possible.
> As writes to write-mostly devices are expected to be slow and already
> involve a data copy, adding some synchronization probably wouldn't hurt
> much.
>
> So it might make sense to add some sort of filter to delay overlapping
> writes to the write-mostly devices.  Once that was done, it might not
> be completely pointless to enable the same filter for all writes, if
> someone really wanted to slow down their raid1 for very little real
> gain.
>
> NeilBrown

Hi Neil,

Thanks a lot for the detailed explanation!

Danil Kipnis
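
P.S. For completeness, a rough sketch of the write-mostly / write-behind
setup described above (device names are placeholders, 256 is just an
example write-behind depth, and --write-behind requires a write-intent
bitmap):

  mdadm --create /dev/md0 --level=1 --raid-devices=2 \
        --bitmap=internal --write-behind=256 \
        /dev/fast_dev --write-mostly /dev/slow_dev

With this, writes to /dev/slow_dev can be acknowledged to the upper layer
before they have actually completed on that device, which is where the
re-ordering window described above comes from.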