Hello Neil,
thanks for your response.

On Tue, Jun 4, 2013 at 4:49 AM, NeilBrown <neilb@xxxxxxx> wrote:
> On Sun, 2 Jun 2013 15:43:41 +0300 Alexander Lyakas <alex.bolshoy@xxxxxxxxx>
> wrote:
>
>> Hello Neil,
>> I believe I have found what is causing the deadlock. It happens in two flavors:
>>
>> 1)
>> # raid1d() is called, and conf->pending_bio_list is non-empty at this point
>> # raid1d() calls md_check_recovery(), which eventually calls
>> raid1_add_disk(), which calls raise_barrier()
>> # Now raise_barrier will wait for conf->nr_pending to become 0, but it
>> cannot become 0, because there are bios sitting in
>> conf->pending_bio_list, which nobody will flush, because raid1d is the
>> one supposed to call flush_pending_writes(), either directly or via
>> handle_read_error. But it is stuck in raise_barrier.
>>
>> 2)
>> # raid1_add_disk() calls raise_barrier(), and waits for
>> conf->nr_pending to become 0, as before
>> # a new WRITE comes in and calls wait_barrier(), but this thread has a
>> non-empty current->bio_list
>> # In this case, the code allows the WRITE to go through
>> wait_barrier() and trigger WRITEs to the mirror legs, but these WRITEs
>> again end up in conf->pending_bio_list (either via raid1_unplug or
>> directly). But nobody will flush conf->pending_bio_list, because
>> raid1d is stuck in raise_barrier.
>>
>> Previously, for example in kernel 3.2, raid1_add_disk did not call
>> raise_barrier, so this problem did not happen.
>>
>> Attached is a reproduction with some prints that I added to
>> raise_barrier and wait_barrier (their code is also attached). It
>> demonstrates case 2. It shows that once raise_barrier gets called,
>> conf->nr_pending drops until it equals the number of
>> wait_barrier calls that slipped through because of a non-empty
>> current->bio_list. At this point, the array hangs.
>>
>> Can you please comment on how to fix this problem? It looks like a
>> real deadlock.
>> We can perhaps call md_check_recovery() after flush_pending_writes(),
>> but this only makes the window smaller, it does not close it entirely.
>> It also looks like we really should not be calling raise_barrier from raid1d.
>>
>> Thanks,
>> Alex.
>
> Hi Alex,
> thanks for the analysis.
>
> Does the following patch fix it? It makes raise_barrier more like
> freeze_array().
> If not, could you try making the same change to the first
> wait_event_lock_irq in raise_barrier?
>
> Thanks.
> NeilBrown
>
>
> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
> index 328fa2d..d34f892 100644
> --- a/drivers/md/raid1.c
> +++ b/drivers/md/raid1.c
> @@ -828,9 +828,9 @@ static void raise_barrier(struct r1conf *conf)
>  	conf->barrier++;
>
>  	/* Now wait for all pending IO to complete */
> -	wait_event_lock_irq(conf->wait_barrier,
> -			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
> -			    conf->resync_lock);
> +	wait_event_lock_irq_cmd(conf->wait_barrier,
> +				!conf->nr_pending && conf->barrier < RESYNC_DEPTH,
> +				conf->resync_lock, flush_pending_writes(conf));
>
>  	spin_unlock_irq(&conf->resync_lock);
>  }

Yes, this patch fixes the problem, thanks! Actually, yesterday I attempted a
similar fix[1], and it also seemed to work.

I have several comments about this patch:

# It fully fixes case 1, and almost fully closes the race window for case 2.
I attach a reproduction in which it can be seen that, while raise_barrier is
waiting, new bios still slip through wait_barrier (because of a non-empty
current->bio_list), and raise_barrier is woken more than once to flush them.
Eventually it takes 2 seconds for raise_barrier to complete. This is still much
better than sleeping forever, though :) (A small userspace sketch of this
flush-while-waiting pattern follows below, after [1].)

# We are now calling flush_pending_writes() while mddev_lock() is held. It
doesn't seem problematic to call generic_make_request() under this lock, but
flush_pending_writes() also does bitmap_unplug(), which may wait for a
superblock update etc. Is this ok? I found one or two other places where we
wait for a superblock update under mddev_lock (ADD_NEW_DISK, for example), so
it is probably ok?

# I am concerned that raise_barrier is also called from sync_request, so that
path may now also attempt to flush_pending_writes(). Can we make a more
conservative patch, like this:

 	/* Now wait for all pending IO to complete */
-	wait_event_lock_irq(conf->wait_barrier,
-			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
-			    conf->resync_lock);
+	if (conf->mddev->thread && conf->mddev->thread->tsk == current) {
+		/*
+		 * If we are called from the management thread (raid1d), we
+		 * need to flush the bios that might be sitting in
+		 * conf->pending_bio_list; otherwise we will wait for them
+		 * here forever.
+		 */
+		wait_event_lock_irq_cmd(conf->wait_barrier,
+					!conf->nr_pending && conf->barrier < RESYNC_DEPTH,
+					conf->resync_lock, flush_pending_writes(conf));
+	} else {
+		wait_event_lock_irq(conf->wait_barrier,
+				    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
+				    conf->resync_lock);
+	}
+	spin_unlock_irq(&conf->resync_lock);

Thanks again. I will work on and reply about the other issues I mentioned;
for some of them I have already made patches.

Alex.

[1]
 	/* block any new IO from starting */
 	conf->barrier++;
+	/* if we are raising the barrier while inside raid1d (which we really shouldn't)... */
+	if (conf->mddev->thread && conf->mddev->thread->tsk == current) {
+		while (!(!conf->nr_pending && conf->barrier < RESYNC_DEPTH)) {
+			int nr_pending = conf->nr_pending;
+
+			spin_unlock_irq(&conf->resync_lock);
+
+			if (nr_pending)
+				flush_pending_writes(conf);
+			wait_event_timeout(conf->wait_barrier,
+					   !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
+					   msecs_to_jiffies(100));
+
+			spin_lock_irq(&conf->resync_lock);
+		}
+
+		spin_unlock_irq(&conf->resync_lock);
+
+		return;
+	}
 	/* Now wait for all pending IO to complete */
 	wait_event_lock_irq(conf->wait_barrier,
 			    !conf->nr_pending && conf->barrier < RESYNC_DEPTH,
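For reference, here is the tiny userspace sketch mentioned above (plain C with
pthreads, not kernel code; the names barrier_ctx, flush_pending and
raise_barrier_cmd are invented for the illustration). It only models the idea
behind wait_event_lock_irq_cmd(..., flush_pending_writes(conf)): the thread
raising the barrier is also the only one that can drain the parked writes, so
the wait loop itself has to run the flush, with the lock dropped around it.
Without the flush call inside the loop, the wait below would never be
satisfied, which is the same hang raid1d hits.

/* Userspace sketch only; not raid1 code. */
#include <pthread.h>
#include <stdio.h>

struct barrier_ctx {
	pthread_mutex_t lock;	/* stands in for conf->resync_lock */
	int nr_pending;		/* writes submitted but not yet completed */
	int pending_queue;	/* writes parked on the "pending_bio_list" */
	int barrier;
};

/* Stand-in for flush_pending_writes(): issue the parked writes.  To keep the
 * sketch short they also "complete" immediately, dropping nr_pending. */
static void flush_pending(struct barrier_ctx *c)
{
	pthread_mutex_lock(&c->lock);
	while (c->pending_queue > 0) {
		c->pending_queue--;
		c->nr_pending--;
		printf("flushed one parked write, nr_pending=%d\n", c->nr_pending);
	}
	pthread_mutex_unlock(&c->lock);
}

/* Analogue of raise_barrier() with the _cmd idea: whenever the wait condition
 * is false, drop the lock, run the flush (the "cmd"), re-take the lock and
 * re-check.  The real macro also sleeps on conf->wait_barrier between
 * iterations; that part is omitted here. */
static void raise_barrier_cmd(struct barrier_ctx *c)
{
	pthread_mutex_lock(&c->lock);
	c->barrier++;
	while (c->nr_pending != 0) {
		pthread_mutex_unlock(&c->lock);
		flush_pending(c);		/* the "cmd" argument */
		pthread_mutex_lock(&c->lock);
	}
	pthread_mutex_unlock(&c->lock);
	printf("barrier raised\n");
}

int main(void)
{
	/* Pretend we are raid1d: 4 writes were queued on our plug list and we
	 * enter raise_barrier() before having flushed them.  Without the
	 * flush inside the wait loop this would hang forever. */
	struct barrier_ctx c = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.nr_pending = 4,
		.pending_queue = 4,
	};

	raise_barrier_cmd(&c);
	return 0;
}

The more conservative variant from my comment above then just gates the flush
on "are we the management thread" (conf->mddev->thread->tsk == current); the
userspace equivalent of that check would be pthread_equal() against the
daemon thread's id.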
Attachment: repro.tgz (gzip compressed data)