On Thu, 11 Sep 2014 12:12:01 -0500 Brassow Jonathan <jbrassow@xxxxxxxxxx>
wrote:

> On Sep 10, 2014, at 10:45 PM, Brassow Jonathan wrote:
>
> > On Sep 10, 2014, at 1:20 AM, NeilBrown wrote:
> >
> >> Jon: could you test with these patches on top of what you
> >> have just in case something happens to fix the problem without
> >> me realising it?
> >
> > I'm on it. The test is running. I'll know later tomorrow.
> >
> > brassow
>
> The test is still failing from here. I grabbed 3.17.0-rc4, added the
> 5 patches, and got the attached backtraces when testing. As I said,
> the hangs are not exactly the same. This set shows the mdX_raid1
> thread in the middle of handling a read failure.

Thanks.
mdX_raid1 is blocked in freeze_array.

That could be caused by conf->nr_pending not aligning properly with
conf->nr_queued.

Both normal IO and resync IO can be retried with reschedule_retry()
and so be counted into ->nr_queued, but only normal IO gets counted in
->nr_pending.

Previously we could only possibly have one or the other, and when
handling a read failure it could only be normal IO. But now that the
two types can interleave, we can have both normal and resync IO
requests queued, so we need to count them both in nr_pending.

So the following patch might help.

How complicated are your test scripts? Could you send them to me so I
can try too?

Thanks,
NeilBrown

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 888dbdfb6986..6a9c73435eb8 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -856,6 +856,7 @@ static void raise_barrier(struct r1conf *conf, sector_t sector_nr)
 			     conf->next_resync + RESYNC_SECTORS),
 			    conf->resync_lock);
 
+	conf->nr_pending++;
 	spin_unlock_irq(&conf->resync_lock);
 }
 
@@ -865,6 +866,7 @@ static void lower_barrier(struct r1conf *conf)
 	BUG_ON(conf->barrier <= 0);
 	spin_lock_irqsave(&conf->resync_lock, flags);
 	conf->barrier--;
+	conf->nr_pending--;
 	spin_unlock_irqrestore(&conf->resync_lock, flags);
 	wake_up(&conf->wait_barrier);
 }
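
The counter mismatch described in the message can be illustrated with a toy model. This is only a sketch under stated assumptions, not kernel code: struct toy_conf, can_freeze() and scenario() are hypothetical names, and the condition nr_pending == nr_queued + extra is an assumption about what freeze_array waits for, with extra = 1 standing for the request the mdX_raid1 thread is currently handling.

```c
#include <stdbool.h>

/* Toy stand-in for the two counters discussed in the message.
 * This is NOT the real struct r1conf. */
struct toy_conf {
	int nr_pending;	/* requests counted as in flight */
	int nr_queued;	/* requests parked for retry */
};

/* Assumed freeze condition: every pending request, apart from the
 * `extra` ones held by the caller, is sitting in the retry queue. */
static bool can_freeze(const struct toy_conf *c, int extra)
{
	return c->nr_pending == c->nr_queued + extra;
}

/* One normal read and one resync request both end up queued for
 * retry, then the daemon thread picks up the failed read and tries
 * to freeze the array. Returns true if the freeze could complete. */
static bool scenario(bool count_resync_in_pending)
{
	struct toy_conf c = { 0, 0 };

	c.nr_pending++;			/* normal read submitted */
	if (count_resync_in_pending)
		c.nr_pending++;		/* resync IO counted too (proposed change) */

	c.nr_queued++;			/* resync request queued for retry */
	c.nr_queued++;			/* failed normal read queued for retry */

	c.nr_queued--;			/* daemon dequeues the failed read ... */
	return can_freeze(&c, 1);	/* ... and holds it while freezing */
}
```

Under these assumptions scenario(false) returns false, because nr_pending is 1 while nr_queued + 1 is 2, so the condition can never become true; that is consistent with a thread stuck in freeze_array. scenario(true) returns true because both sides equal 2.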