On Tue, 7 Jan 2014 17:01:23 +0200 Alexander Lyakas <alex.bolshoy@xxxxxxxxx> wrote:

> Hi Neil,
> Thank you for your comments. Yes, apparently in this case md/raid1 was
> not the cause. I studied the code more by adding prints and following
> the resync flow, but cannot see any obvious problem.
>
> I did find another case, in which raid1-resync can read phantom data,
> although this is not what happened to us:
> # raid1 has 3 disks A,B,C and is resyncing after an unclean shutdown.
> sync_request always selects disk=A as read_disk.
> # application reads from far sector (beyond next_resync), so
> read_balance() selects disk=A to read from (it is the first one)
> # disk A fails
> # resync aborts and restarts, now sync_request reads from B and syncs into C
> # application reads again from the same far sector, now read_balance()
> selects disk B to read from
>
> So potentially we could get a different data from these two reads. In
> our case, though, there were no disk failures.
>
> FWIW, the raid1 code I was once responsible for, treated this
> situation as follows:
> # READ comes from application
> # raid1 sees that it is resyncing, so it locks the relevant area of
> the raid1 and syncs it. Then it unlocks and proceeds to serve the READ
> normally
> # resync thread comes to appropriate area, locks it and sees that it
> has already been synced (bits are off in the bitmap), so it proceeds
> further
>
> However in md/raid1, there is no mechanism currently that can lock a
> part of the raid. We only have raise_barrier/wait_barrier that
> effectively locks the whole capacity.
>
> Is it, for example, reasonable to READ the data as you normally do,
> then to trigger a WRITE with the same data and only then to complete
> the original READ? There are a lot of inefficiencies here, I know,
> like re-writing the same data again on read_disk, and syncing this
> data again later. (I know, patches are welcome...)

Hmmm.. yes that could conceivably cause a problem.
It would apply to RAID6 too.

To "fix" it we would have to either read-and-check the replicas or parity
whenever we read from a block that is not "in-sync", and/or write them out.
This could be rather expensive for fairly little gain.

If we were doing a bitmap-based resync, then we could maybe expedite the
resync of any region before reading from it. i.e. before reading from a
block which is not known to be in-sync, we wait for it to be in-sync, but
also signal the resync process to do this 'bit' worth next.

That could be rather messy... but might not be too bad.
Rather than using the resync thread to handle the extra bits, maybe we could
have a work-queue which just handled specifically requested regions...

Patches certainly welcome :-)

NeilBrown
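
To make the "expedite the resync of a region before reading from it" idea
concrete, here is a rough toy model in Python. It is only a sketch and not md
code: ToyMirror, resync_region, run_requested and REGION_SECTORS are all
hypothetical names invented for illustration, the "work-queue" is just a
deque drained synchronously (a real implementation would block the reader
until the region is in-sync), and the resync source is assumed to be the
first replica.

```python
from collections import deque

REGION_SECTORS = 8  # hypothetical: number of sectors covered by one bitmap bit


class ToyMirror:
    """Toy model of a mirror doing bitmap-based resync (not md code)."""

    def __init__(self, disks, dirty_regions):
        self.disks = disks                # one list of sector values per replica
        self.dirty = set(dirty_regions)   # regions not known to be in-sync
        self.requested = deque()          # regions that readers asked to expedite

    def resync_region(self, region, source=0):
        """Copy one region from the source replica to the others, clear its bit."""
        lo = region * REGION_SECTORS
        hi = lo + REGION_SECTORS
        for disk in self.disks:
            if disk is not self.disks[source]:
                disk[lo:hi] = self.disks[source][lo:hi]
        self.dirty.discard(region)

    def run_requested(self):
        """Stand-in for a work-queue that handles only requested regions."""
        while self.requested:
            region = self.requested.popleft()
            if region in self.dirty:      # may already have been synced
                self.resync_region(region)

    def read(self, sector, disk):
        """Read one sector, syncing its region first if it is still dirty."""
        region = sector // REGION_SECTORS
        if region in self.dirty:
            self.requested.append(region)
            self.run_requested()          # synchronous here; would wait in reality
        return self.disks[disk][sector]


if __name__ == "__main__":
    a, b, c = [0] * 16, [0] * 16, [0] * 16
    a[12] = 1                             # replicas disagree after unclean shutdown
    mirror = ToyMirror([a, b, c], dirty_regions={0, 1})
    first = mirror.read(12, disk=0)       # read served from A
    second = mirror.read(12, disk=1)      # later read served from B
    print(first == second)
```

In this model the first read of a not-in-sync region forces that region to
be synced before any data is returned, so a later read of the same sector
from a different replica sees the same data. The cost is the extra copy on
the read path, which is the expense discussed above.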