On Tue, Nov 08, 2016 at 07:19:05AM +1100, Neil Brown wrote:
> On Tue, Nov 08 2016, Shaohua Li wrote:
>
> > On Mon, Nov 07, 2016 at 09:53:42AM +1100, Neil Brown wrote:
> >> On Sat, Nov 05 2016, Shaohua Li wrote:
> >>
> >> > On Fri, Nov 04, 2016 at 04:46:03PM +1100, Neil Brown wrote:
> >> >> As we don't wait for writes to complete in bitmap_daemon_work, they
> >> >> could still be in-flight when bitmap_unplug writes again. Or when
> >> >> bitmap_daemon_work tries to write again.
> >> >> This can be confusing and could risk the wrong data being written last.
> >> >
> >> > Applied the first 3 patches, thanks!
> >> >
> >> > This one doesn't seem to completely solve the race condition. It's still
> >> > possible that bitmap_daemon_work clears BITMAP_PAGE_NEEDWRITE but hasn't
> >> > dispatched the IO yet, so bitmap_unplug does nothing and thinks the bitmap
> >> > is updated on disk. Why don't we add locking here?
> >>
> >> Thanks for the review!
> >>
> >> BITMAP_PAGE_NEEDWRITE is set for pages that need to be written out in
> >> order to clear bits that don't need to be set any more. There is never
> >> any urgency to do this.
> >> BITMAP_PAGE_DIRTY is set for pages that need to be written out in order
> >> to set bits representing regions that are about to be written to. These
> >> have to be flushed by bitmap_unplug().
> >> Pages can have both bits set, in which case bitmap_daemon_work() will
> >> leave them for bitmap_unplug() to deal with.
> >>
> >> So if bitmap_daemon_work() clears BITMAP_PAGE_PENDING on a page, then it
> >> is a page that bitmap_unplug() doesn't need to wait for.
> >
> > Oops, I misread the code. Yes, this is very clear, thanks for the explanation.
> >
> > So I understand that the patch fixes the confusion. When is there a risk
> > that the wrong data is written?
>
> The goal is to avoid the possibility that two writes to the same
> location are in flight at the same time.
> It isn't clear that this would cause a problem, but it isn't clear that
> it is completely safe either.
> In general, this doesn't happen - certainly not from the page cache - so
> drivers will not be prepared for it.
> As an extreme example, suppose that the target device is a multi-path
> device.
> One write is sent and just after it is DMAed to one path, the in-memory
> page is changed and a second write is submitted, and the data is DMAed
> down the other path. Are we certain the two will be committed to
> storage in the intended order?
> Probably they will be. But maybe with a more complex stack, the chance
> of one IO overtaking the other (even after the first DMA) increases.
>
> So I cannot argue very strongly for this code, but it seems like a good
> idea and is reasonably simple....

I wasn't trying to argue against the patch; it actually makes things clearer.
I just wanted to know where the data corruption could come from. I think I
have an answer now. I'll add this patch.

Thanks,
Shaohua
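
For anyone skimming the thread, here is a minimal standalone sketch of the
division of labour Neil describes above. It is not the md/bitmap.c code; the
struct and helper names (bm_page, daemon_work, unplug, write_out) are
hypothetical and chosen only to illustrate which path is responsible for
writing out which kind of page: the lazy daemon skips any page that is also
DIRTY, so the unplug path stays the single writer for such pages.

    /*
     * Simplified, self-contained model of the flag handling described in
     * the thread. Names here are hypothetical, not the kernel's.
     */
    #include <stdbool.h>
    #include <stdio.h>

    struct bm_page {
        bool dirty;      /* BITMAP_PAGE_DIRTY: bits must hit disk before the data write */
        bool needwrite;  /* BITMAP_PAGE_NEEDWRITE: stale bits to clear, no urgency */
    };

    /* Stand-in for actually submitting the page to the backing device. */
    static void write_out(struct bm_page *p, int idx)
    {
        printf("writing bitmap page %d\n", idx);
        p->dirty = false;
        p->needwrite = false;
    }

    /*
     * Periodic daemon: only writes pages that merely need stale bits
     * cleared. Pages that are also DIRTY are left for unplug(), so the
     * two paths never race to write the same page.
     */
    static void daemon_work(struct bm_page *pages, int n)
    {
        for (int i = 0; i < n; i++)
            if (pages[i].needwrite && !pages[i].dirty)
                write_out(&pages[i], i);
    }

    /*
     * Unplug path: must flush every DIRTY page before the data writes
     * that set those bits are allowed to proceed.
     */
    static void unplug(struct bm_page *pages, int n)
    {
        for (int i = 0; i < n; i++)
            if (pages[i].dirty)
                write_out(&pages[i], i);
    }

    int main(void)
    {
        struct bm_page pages[3] = {
            { .dirty = true },                     /* flushed by unplug() */
            { .needwrite = true },                 /* flushed lazily by daemon_work() */
            { .dirty = true, .needwrite = true },  /* daemon skips it; unplug flushes it */
        };

        daemon_work(pages, 3);  /* writes page 1 only */
        unplug(pages, 3);       /* writes pages 0 and 2 */
        return 0;
    }

The point of the "skip if also DIRTY" test in the daemon is exactly the
property discussed above: a page that bitmap_unplug() must flush is never
picked up by the background work, so the unplug path does not need to wait
for, or risk overlapping with, an earlier in-flight write from the daemon.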