On Fri, Aug 17 2018, Danil Kipnis wrote:

>>>> > On 08/11/2018 02:06 AM, NeilBrown wrote:
>>>> >> It might be expected behaviour with async direct IO.
>>>> >> Two threads writing with O_DIRECT io to the same address could result in
>>>> >> different data on the two devices.  This doesn't seem to me to be a
>>>> >> credible use-case though.  Why would you ever want to do that in
>>>> >> practice?
>>>> >>
>>>> >> NeilBrown
>>>> >
>>>> > My only thought is that while the credible case may be weak, if it is
>>>> > something that can be protected against with a few conditionals to prevent
>>>> > the data on the slaves from diverging -- then it's worth a couple of
>>>> > conditions to prevent the nut who knows just enough about dd from
>>>> > confusing things....
>>>>
>>>> Yes, it can be protected against - the code is already written.
>>>> If you have a 2-drive raid1 and want it to be safe against this attack,
>>>> simply:
>>>>
>>>>    mdadm /dev/md127 --grow --level=raid5
>>>>
>>>> This will add the required synchronization between writes so that
>>>> multiple writes to the one block are linearized.  There will be a
>>>> performance impact.
>>>>
>>>> NeilBrown
>>> Thanks for your comments, Neil.
>>> Converting to raid5 with 2 drives will not only cause a performance drop,
>>> it will also disable the redundancy.
>>> It's clearly a no go.
>>
>> I don't understand why you think it would disable the redundancy; there
>> are still two copies of every block.  Both RAID1 and RAID5 can survive a
>> single device failure.
>>
>> I agree about performance and don't expect this would be a useful thing
>> to do; it just seemed the simplest way to explain the cost that would be
>> involved in resisting this attack.
>>
>> NeilBrown
>
> Hi Neil,
>
> the performance impact one is facing when running raid5 on top of two legs -
> is it only due to the tracking of the in-flight writes, or is the raid5
> actually doing some XORing (with zeros?) in that case?  And if the cpu is
> burned also for some other reason apart from the tracking, do you think it
> would make sense to expose that "writes-to-the-same-sector-tracking"
> functionality also for the raid1 personality?

There would be some performance impact due to extra work for the CPU, but I
doubt it would be much - CPUs are fast these days.

With RAID1, the data isn't copied.  With RAID5 it is - it is copied twice:
once from the fs or user-space buffer into the stripe cache, and once from
the 'data' to the 'parity' slots in the stripe cache.  This copying would
cause some of the slowdown - memory isn't as fast as CPUs.

Also, all requests are divided into multiple 4K requests as they pass through
the RAID5 stripe cache.  They should get recombined, but this is quite a bit
of work and often results in more, smaller requests, which doesn't make such
good use of the underlying device.

Finally there is the synchronization overhead of taking locks to make sure
that requests don't overlap.  With lots of CPUs and very fast devices, this
can be quite significant.

Would it make sense to track writes to the same sector in RAID1?  Given the
justification provided so far, I would say "no".  The only use-case that has
been demonstrated is performance testing, and performance testing isn't hurt
by the data being different on the different devices.  I really think that
any genuine use case that cared about data consistency would never write
data in a way that could result in these inconsistencies.
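
Just to make the racy scenario concrete, a minimal user-space sketch like the
following is enough to trigger it (purely illustrative - this is not taken
from md or from any test suite): two threads hammer the same 4K block of the
array with O_DIRECT writes of different patterns, and because raid1 does not
serialize overlapping writes, each member can end up persisting a different
"winner" for that block.

    /* Illustration only: race two O_DIRECT writers against the same
     * 4K block of the device named on the command line. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLK 4096

    static const char *dev;

    static void *writer(void *arg)
    {
            char pattern = *(const char *)arg;
            void *buf;
            int fd = open(dev, O_WRONLY | O_DIRECT);

            if (fd < 0 || posix_memalign(&buf, BLK, BLK) != 0)
                    exit(1);
            memset(buf, pattern, BLK);
            for (int i = 0; i < 100000; i++)      /* hammer block 0 */
                    if (pwrite(fd, buf, BLK, 0) != BLK)
                            exit(1);
            free(buf);
            close(fd);
            return NULL;
    }

    int main(int argc, char **argv)
    {
            pthread_t t1, t2;
            char a = 'A', b = 'B';

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <device>\n", argv[0]);
                    return 1;
            }
            dev = argv[1];
            pthread_create(&t1, NULL, writer, &a);
            pthread_create(&t2, NULL, writer, &b);
            pthread_join(t1, NULL);
            pthread_join(t2, NULL);
            return 0;
    }

Build with gcc -pthread, point it at a scratch array, and afterwards compare
the corresponding data block on each member (allowing for the superblock's
data offset) to see whether the legs diverged.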
However ... if you have a RAID1 configured with write-behind and a
write-mostly device, then it is possible that a similar problem could arise.

If the app or fs writes to the same block twice in quick succession - even
waiting for the first write to complete before issuing the second - then the
data on the non-write-mostly devices will reflect the second write, as
expected.  However, the two writes to the write-mostly device could be
re-ordered, and you could end up with the data from the first write remaining
on the write-mostly device.  This is rather unlikely, but it is theoretically
possible.

As writes to write-mostly devices are expected to be slow and already involve
a data copy, adding some synchronization probably wouldn't hurt much.  So it
might make sense to add some sort of filter to delay overlapping writes to
the write-mostly devices.  Once that was done, it might not be completely
pointless to enable the same filter for all writes, if someone really wanted
to slow down their raid1 for very little real gain.

NeilBrown
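
P.S. For what it's worth, such an overlapping-write filter could be sketched
roughly as below.  This is purely illustrative - write_begin()/write_done(),
the fixed-size table and the pthread locking are all made up for the example;
a real version would hook the bio submission/completion paths and use the
kernel's own locking and wait primitives.

    #include <pthread.h>
    #include <stdbool.h>

    #define MAX_INFLIGHT 128

    struct range { unsigned long long start, end; bool used; };

    static struct range inflight[MAX_INFLIGHT];
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t completed = PTHREAD_COND_INITIALIZER;

    static bool overlaps(unsigned long long s, unsigned long long e)
    {
            for (int i = 0; i < MAX_INFLIGHT; i++)
                    if (inflight[i].used &&
                        s < inflight[i].end && inflight[i].start < e)
                            return true;
            return false;
    }

    /* Call before submitting a write covering sectors [start, end);
     * blocks until no in-flight write overlaps that range, then records
     * it.  Returns a slot number to hand back to write_done(). */
    int write_begin(unsigned long long start, unsigned long long end)
    {
            int slot;

            pthread_mutex_lock(&lock);
            for (;;) {
                    if (!overlaps(start, end))
                            for (slot = 0; slot < MAX_INFLIGHT; slot++)
                                    if (!inflight[slot].used)
                                            goto claim;
                    /* overlap found (or table full): wait for a completion */
                    pthread_cond_wait(&completed, &lock);
            }
    claim:
            inflight[slot] = (struct range){ start, end, true };
            pthread_mutex_unlock(&lock);
            return slot;
    }

    /* Call from the write-completion path. */
    void write_done(int slot)
    {
            pthread_mutex_lock(&lock);
            inflight[slot].used = false;
            pthread_cond_broadcast(&completed);
            pthread_mutex_unlock(&lock);
    }

This is essentially the serialization that the raid5 stripe cache provides as
a side effect, and the lock/wait traffic is part of the extra cost discussed
above; applied only to write-mostly devices it would sit in a path that is
already slow and already copies the data.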