Hello Neil,
thanks for your response.

On Tue, Jun 4, 2013 at 4:49 AM, NeilBrown <neilb@xxxxxxx> wrote:
> On Sun, 2 Jun 2013 15:43:41 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> wrote:
>
>> Hello Neil,
>> I believe I have found what is causing the deadlock. It happens in two flavors:
>>
>> 1)
>> # raid1d() is called, and conf->pending_bio_list is non-empty at this point
>> # raid1d() calls md_check_recovery(), which eventually calls
>> raid1_add_disk(), which calls raise_barrier()
>> # Now raise_barrier will wait for conf->nr_pending to become 0, but it
>> cannot become 0, because there are bios sitting in
>> conf->pending_bio_list, which nobody will flush, because raid1d is the
>> one supposed to call flush_pending_writes(), either directly or via
>> handle_read_error. But it is stuck in raise_barrier.
>>
>> 2)
>> # raid1_add_disk() calls raise_barrier(), and waits for
>> conf->nr_pending to become 0, as before
>> # a new WRITE comes in and calls wait_barrier(), but this thread has a
>> non-empty current->bio_list
>> # In this case, the code allows the WRITE to go through
>> wait_barrier() and trigger WRITEs to the mirror legs, but these WRITEs
>> again end up in conf->pending_bio_list (either via raid1_unplug or
>> directly). But nobody will flush conf->pending_bio_list, because
>> raid1d is stuck in raise_barrier.
>>
>> Previously, for example in kernel 3.2, raid1_add_disk did not call
>> raise_barrier, so this problem did not happen.
>>
>> Attached is a reproduction with some prints that I added to
>> raise_barrier and wait_barrier (their code is also attached). It
>> demonstrates case 2. It shows that once raise_barrier gets called,
>> conf->nr_pending drops until it equals the number of
>> wait_barrier calls that slipped through because of a non-empty
>> current->bio_list. At this point, the array hangs.
>>
>> Can you please comment on how to fix this problem? It looks like a
>> real deadlock.
>> We can perhaps call md_check_recovery() after flush_pending_writes(),
>> but this only makes the window smaller, it does not close it entirely.
>> It also looks like we really should not be calling raise_barrier from raid1d.
>>
>> Thanks,
>> Alex.
>
> Hi Alex,
> thanks for the analysis.
>
> Does the following patch fix it? It makes raise_barrier more like
> freeze_array().
> If not, could you try making the same change to the first
> wait_event_lock_irq in raise_barrier?
>
> Thanks.
> NeilBrown
>
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 328fa2d..d34f892 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -828,9 +828,9 @@ static void raise_barrier(struct r1conf *conf)
>  	conf->barrier++;
>
>  	/* Now wait for all pending IO to complete */
> -	wait_event_lock_irq(conf->wait_barrier,
> -			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
> -			    conf->resync_lock);
> +	wait_event_lock_irq_cmd(conf->wait_barrier,
> +				!conf->nr_pending && conf->barrier < RESYNC_DEPTH,
> +				conf->resync_lock, flush_pending_writes(conf));
>
>  	spin_unlock_irq(&conf->resync_lock);
>  }

Yes, this patch fixes the problem, thanks! Actually, yesterday I attempted a
similar fix[1], and it also seemed to work.

I have several comments about this patch:

# It fully fixes case 1, and almost fully closes the race window for case 2.
I attach a reproduction in which it can be seen that, while raise_barrier is
waiting, new bios still slip through wait_barrier (because of a non-empty
current->bio_list), and raise_barrier is woken more than once to flush them.
Eventually it takes 2 seconds for raise_barrier to complete. This is still much
better than sleeping forever, though :) (A small userspace sketch of this
flush-while-waiting pattern follows below, after [1].)

# We are now calling flush_pending_writes() while mddev_lock() is held. It
doesn't seem problematic to call generic_make_request() under this lock, but
flush_pending_writes() also does bitmap_unplug(), which may wait for a
superblock update etc. Is this ok? I found one or two other places where we
wait for a superblock update under mddev_lock (ADD_NEW_DISK, for example), so
it is probably ok?

# I am concerned that raise_barrier is also called from sync_request, so that
path may now also attempt to flush_pending_writes(). Can we make a more
conservative patch, like this:

 	/* Now wait for all pending IO to complete */
-	wait_event_lock_irq(conf->wait_barrier,
-			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
-			    conf->resync_lock);
+	if (conf->mddev->thread && conf->mddev->thread->tsk == current) {
+		/*
+		 * If we are called from the management thread (raid1d), we
+		 * need to flush the bios that might be sitting in
+		 * conf->pending_bio_list; otherwise we will wait for them
+		 * here forever.
+		 */
+		wait_event_lock_irq_cmd(conf->wait_barrier,
+					!conf->nr_pending && conf->barrier < RESYNC_DEPTH,
+					conf->resync_lock, flush_pending_writes(conf));
+	} else {
+		wait_event_lock_irq(conf->wait_barrier,
+				    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
+				    conf->resync_lock);
+	}
+	spin_unlock_irq(&conf->resync_lock);

Thanks again. I will work on and reply about the other issues I mentioned;
for some of them I have already made patches.

Alex.

[1]
 	/* block any new IO from starting */
 	conf->barrier++;
+	/* if we are raising the barrier while inside raid1d (which we really shouldn't)... */
+	if (conf->mddev->thread && conf->mddev->thread->tsk == current) {
+		while (!(!conf->nr_pending && conf->barrier < RESYNC_DEPTH)) {
+			int nr_pending = conf->nr_pending;
+
+			spin_unlock_irq(&conf->resync_lock);
+
+			if (nr_pending)
+				flush_pending_writes(conf);
+			wait_event_timeout(conf->wait_barrier,
+					   !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
+					   msecs_to_jiffies(100));
+
+			spin_lock_irq(&conf->resync_lock);
+		}
+
+		spin_unlock_irq(&conf->resync_lock);
+
+		return;
+	}
 	/* Now wait for all pending IO to complete */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
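For reference, here is the tiny userspace sketch mentioned above (plain C with
pthreads, not kernel code; the names barrier_ctx, flush_pending and
raise_barrier_cmd are invented for the illustration). It only models the idea
behind wait_event_lock_irq_cmd(..., flush_pending_writes(conf)): the thread
raising the barrier is also the only one that can drain the parked writes, so
the wait loop itself has to run the flush, with the lock dropped around it.
Without the flush call inside the loop, the wait below would never be
satisfied, which is the same hang raid1d hits.

/* Userspace sketch only; not raid1 code. */
#include <pthread.h>
#include <stdio.h>

struct barrier_ctx {
	pthread_mutex_t lock;	/* stands in for conf->resync_lock */
	int nr_pending;		/* writes submitted but not yet completed */
	int pending_queue;	/* writes parked on the "pending_bio_list" */
	int barrier;
};

/* Stand-in for flush_pending_writes(): issue the parked writes.  To keep the
 * sketch short they also "complete" immediately, dropping nr_pending. */
static void flush_pending(struct barrier_ctx *c)
{
	pthread_mutex_lock(&c->lock);
	while (c->pending_queue > 0) {
		c->pending_queue--;
		c->nr_pending--;
		printf("flushed one parked write, nr_pending=%d\n", c->nr_pending);
	}
	pthread_mutex_unlock(&c->lock);
}

/* Analogue of raise_barrier() with the _cmd idea: whenever the wait condition
 * is false, drop the lock, run the flush (the "cmd"), re-take the lock and
 * re-check.  The real macro also sleeps on conf->wait_barrier between
 * iterations; that part is omitted here. */
static void raise_barrier_cmd(struct barrier_ctx *c)
{
	pthread_mutex_lock(&c->lock);
	c->barrier++;
	while (c->nr_pending != 0) {
		pthread_mutex_unlock(&c->lock);
		flush_pending(c);		/* the "cmd" argument */
		pthread_mutex_lock(&c->lock);
	}
	pthread_mutex_unlock(&c->lock);
	printf("barrier raised\n");
}

int main(void)
{
	/* Pretend we are raid1d: 4 writes were queued on our plug list and we
	 * enter raise_barrier() before having flushed them.  Without the
	 * flush inside the wait loop this would hang forever. */
	struct barrier_ctx c = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.nr_pending = 4,
		.pending_queue = 4,
	};

	raise_barrier_cmd(&c);
	return 0;
}

The more conservative variant from my comment above then just gates the flush
on "are we the management thread" (conf->mddev->thread->tsk == current); the
userspace equivalent of that check would be pthread_equal() against the
daemon thread's id.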
Attachment: repro.tgz (gzip compressed data)