Re: v3.15 dm-mpath regression: cable pull test causes I/O hang

On Thu, Jul 03 2014 at  9:56am -0400,
Bart Van Assche <bvanassche@xxxxxxx> wrote:

> On 07/03/14 00:02, Mike Snitzer wrote:
> > On Fri, Jun 27 2014 at  9:33am -0400,
> > Mike Snitzer <snitzer@xxxxxxxxxx> wrote:
> > 
> >> On Fri, Jun 27 2014 at  9:02am -0400,
> >> Bart Van Assche <bvanassche@xxxxxxx> wrote:
> >>
> >>> Hello,
> >>>
> >>> While running a cable pull simulation test with dm_multipath on top of
> >>> the SRP initiator driver I noticed that after a few iterations I/O locks
> >>> up instead of dm_multipath processing the path failure properly (see also
> >>> below for a call trace). At least kernel versions 3.15 and 3.16-rc2 are
> >>> vulnerable. This issue does not occur with kernel 3.14. I have tried to
> >>> bisect this but gave up when I noticed that I/O locked up completely with
> >>> a kernel built from git commit ID e809917735ebf1b9a56c24e877ce0d320baee2ec
> >>> (dm mpath: push back requests instead of queueing). But with the bisect I
> >>> have been able to narrow down this issue to one of the patches in "Merge
> >>> tag 'dm-3.15-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/
> >>> device-mapper/linux-dm". Does anyone have a suggestion how to analyze this
> >>> further or how to fix this ?
> > 
> > I still don't have a _known_ fix for your issue but I reviewed commit
> > e809917735ebf1b9a56c24e877ce0d320baee2ec closer and identified what
> > looks to be a regression in logic for multipath_busy, it now calls
> > !pg_ready() instead of directly checking pg_init_in_progress.  I think
> > this is needed (Hannes, what do you think?):
> > 
> > diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
> > index 3f6fd9d..561ead6 100644
> > --- a/drivers/md/dm-mpath.c
> > +++ b/drivers/md/dm-mpath.c
> > @@ -373,7 +373,7 @@ static int __must_push_back(struct multipath *m)
> >  		 dm_noflush_suspending(m->ti)));
> >  }
> >  
> > -#define pg_ready(m) (!(m)->queue_io && !(m)->pg_init_required)
> > +#define pg_ready(m) (!(m)->queue_io && !(m)->pg_init_required && !(m)->pg_init_in_progress)
> >  
> >  /*
> >   * Map cloned requests
> 
> Hello Mike,
> 
> Sorry but even with this patch applied and additionally with commit IDs
> 86d56134f1b6 ("kobject: Make support for uevent_helper optional") and
> bcccff93af35 ("kobject: don't block for each kobject_uevent") reverted
> my multipath test still hangs after a few iterations. I also reran the
> same test with kernel 3.14.3 and it is still running after 30 iterations.

OK, thanks for testing though!  I still think the patch is needed.
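
To spell out the case I am worried about, here is a minimal user-space
sketch, not kernel code.  struct mpath_flags, pg_ready_old/pg_ready_new
and busy_old()/busy_new() are made-up names for illustration, and the
sketch assumes the busy check reduces to !pg_ready(); the real
multipath_busy() looks at more state than this.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the three struct multipath fields that
 * pg_ready() looks at; the real struct has many more members. */
struct mpath_flags {
	bool queue_io;
	bool pg_init_required;
	bool pg_init_in_progress;
};

/* pg_ready() as it reads before the patch above. */
#define pg_ready_old(m) (!(m)->queue_io && !(m)->pg_init_required)

/* pg_ready() as it reads with the patch above applied. */
#define pg_ready_new(m) (!(m)->queue_io && !(m)->pg_init_required && \
			 !(m)->pg_init_in_progress)

/* Simplified model of the busy check: busy whenever the PG is not ready. */
static bool busy_old(const struct mpath_flags *m)
{
	return !pg_ready_old(m);
}

static bool busy_new(const struct mpath_flags *m)
{
	return !pg_ready_new(m);
}
```

With queue_io and pg_init_required both clear but pg_init_in_progress
set, busy_old() reports "not busy" while busy_new() reports "busy",
which is the behavior the direct pg_init_in_progress check used to give.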

You are using queue_if_no_path; do you see hangs due to paths not being
restored after the "cable" is restored?  Any errors in the multipathd
userspace logging?  Or abnormal errors in the kernel log?  Basically I'm
looking for some other clue besides the hung task timeout spew.

How easy would it be to replicate your testbed?  Does it depend on FIO
hw specifically?  How are you simulating the cable pulls?

I'd love to set up a testbed that would enable me to chase this more
interactively rather than punting to you for testing.

Hannes, do you have a testbed for heavy cable pull testing?  Are you
able to replicate these hangs?

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel



