Re: raid1 boot regression in 2.6.37 [bisected]

Tejun Heo <tj@xxxxxxxxxx> · Tue, 29 Mar 2011 12:07:44 +0200

On Tue, Mar 29, 2011 at 11:53:06AM +0200, Thomas Jarosch wrote:
> On Tuesday, 29. March 2011 10:25:03 Tejun Heo wrote:
> > Can you please apply the following patch and see whether it resolves
> > the problem and report the boot log?
> 
> Ok, I did the following:
> - Check out commit e804ac780e2f01cb3b914daca2fd4780d1743db1
>   (md: fix and update workqueue usage)
> - Apply your patch
> - Add small debug output on top of it:
> 
> ------------------------------
> # git diff
> diff --git a/drivers/md/md.c b/drivers/md/md.c
> index 1e6534d..d2ddef4 100644
> --- a/drivers/md/md.c
> +++ b/drivers/md/md.c
> @@ -5899,6 +5899,16 @@ static int md_open(struct block_device *bdev, fmode_t mode)
>                                 once = true;
>                         }
>                 }
> +               /* DEBUG HACK */
> +               {
> +                   static bool tomj_once = false;
> +                   if (!tomj_once)
> +                   {
> +                               printk("TOMJ %s: md_open(): RT prio, pol=%u p=%d rt_p=%u\n",
> +                                      current->comm, current->policy, current->static_prio, current->rt_priority);
> +                       tomj_once = true;
> +                   }
> +               }
>                 msleep(10);
>                 /* Wait until bdev->bd_disk is definitely gone */
>                 flush_workqueue(md_misc_wq);
...
> TOMJ blkid: md_open(): RT prio, pol=0 p=118 rt_p=0
...
> As you can see, your printk() is not triggered(). I just
> copied your printk and made it print once unconditionally.
> 
> So probably the msleep(10); does the trick. Something
> seems very racy to me as other boxes with software RAID
> can boot the exact same kernel + dracut version just fine.
> 
> I'll put the box in a reboot loop over the lunch break.

Hmmm.. interesting, so no RT task there.  I don't know why the
softlockup is triggering then.  Ah, okay, none of CONFIG_PREEMPT and
CONFIG_PREEMPT_VOLUNTARY is set, right?

Anyways, the root cause here is that md_open() -ERESTARTSYS retrying
is busy looping without giving the put path a chance to run.  When it
was using flush_scheduled_work(), there were some unrelated work items
there so it ended up sleeping by accident giving the put path a chance
to run.  With the conversion, the flush domain is reduced and there's
nothing unrelated to wait for so it just busy loops.

Neil, we can put a short unconditional sleep there or somehow ensure
work item is queued before the restart loop engages.  What do you
think?

Thanks.

-- 
tejun
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html