Re: v3.15 dm-mpath regression: cable pull test causes I/O hang

On 07/03/2014 04:05 PM, Mike Snitzer wrote:
On Thu, Jul 03 2014 at  9:56am -0400,
Bart Van Assche <bvanassche@xxxxxxx> wrote:

On 07/03/14 00:02, Mike Snitzer wrote:
On Fri, Jun 27 2014 at  9:33am -0400,
Mike Snitzer <snitzer@xxxxxxxxxx> wrote:

On Fri, Jun 27 2014 at  9:02am -0400,
Bart Van Assche <bvanassche@xxxxxxx> wrote:

Hello,

While running a cable pull simulation test with dm_multipath on top of
the SRP initiator driver I noticed that after a few iterations I/O locks
up instead of dm_multipath processing the path failure properly (see also
below for a call trace). At least kernel versions 3.15 and 3.16-rc2 are
vulnerable. This issue does not occur with kernel 3.14. I have tried to
bisect this but gave up when I noticed that I/O locked up completely with
a kernel built from git commit ID e809917735ebf1b9a56c24e877ce0d320baee2ec
(dm mpath: push back requests instead of queueing). But with the bisect I
have been able to narrow down this issue to one of the patches in "Merge
tag 'dm-3.15-changes' of
git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm".
Does anyone have a suggestion for how to analyze this further or how to
fix this?

I still don't have a _known_ fix for your issue, but I reviewed commit
e809917735ebf1b9a56c24e877ce0d320baee2ec more closely and identified what
looks to be a regression in the logic of multipath_busy(): it now calls
!pg_ready() instead of directly checking pg_init_in_progress.  I think
this is needed (Hannes, what do you think?):

diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index 3f6fd9d..561ead6 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -373,7 +373,7 @@ static int __must_push_back(struct multipath *m)
  		 dm_noflush_suspending(m->ti)));
  }

-#define pg_ready(m) (!(m)->queue_io && !(m)->pg_init_required)
+#define pg_ready(m) (!(m)->queue_io && !(m)->pg_init_required && !(m)->pg_init_in_progress)

  /*
   * Map cloned requests
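
For readers following the thread, the effect of the one-line change above
can be sketched in userspace C. This is only an illustration, not kernel
code: struct mp_flags, pg_ready_old, pg_ready_new, show() and demo() are
hypothetical names, and the three fields merely mirror the struct
multipath members named in the diff. It shows the case the patch targets,
where pg_init_in_progress is nonzero while the other two flags are clear.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for the three struct multipath fields in the diff. */
struct mp_flags {
	bool queue_io;
	bool pg_init_required;
	unsigned pg_init_in_progress;	/* modeled here as a count of pending inits */
};

/* The macro as it reads before the patch (the '-' line of the diff). */
#define pg_ready_old(m) (!(m)->queue_io && !(m)->pg_init_required)

/* The macro as proposed by the patch (the '+' line of the diff). */
#define pg_ready_new(m) (!(m)->queue_io && !(m)->pg_init_required && \
			 !(m)->pg_init_in_progress)

/* Print both macro results for one combination of flags. */
static void show(const char *label, const struct mp_flags *m)
{
	printf("%-20s old=%d new=%d\n", label, pg_ready_old(m), pg_ready_new(m));
}

/* Compare the two macros for an idle state and for an init in progress. */
static void demo(void)
{
	struct mp_flags idle = { false, false, 0 };
	struct mp_flags initializing = { false, false, 1 };

	show("idle", &idle);
	show("pg_init in progress", &initializing);

	/* Only the patched macro treats an in-progress pg_init as "not ready". */
	assert(pg_ready_old(&initializing) && !pg_ready_new(&initializing));
}
```

Calling demo() from a main() prints one line per state. In this sketch the
two macros agree for the idle state and differ only while a path-group
initialization is in flight.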

Hello Mike,

Sorry, but even with this patch applied, and additionally with commit IDs
86d56134f1b6 ("kobject: Make support for uevent_helper optional") and
bcccff93af35 ("kobject: don't block for each kobject_uevent") reverted,
my multipath test still hangs after a few iterations. I also reran the
same test with kernel 3.14.3 and it is still running after 30 iterations.

OK, thanks for testing though!  I still think the patch is needed.

You are using queue_if_no_path, do you see hangs due to paths not being
restored after the "cable" is restored?  Any errors in the multipathd
userspace logging?  Or abnormal errors in kernel?  Basically I'm looking
for some other clue besides the hung task timeout spew.

How easy would it be to replicate your testbed?  Is it uniquely FIO hw
dependent?  How are you simulating the cable pull tests?

I'd love to set up a testbed that would enable me to chase this more
interactively rather than punting to you for testing.

Hannes, do you have a testbed for heavy cable pull testing?  Are you
able to replicate these hangs?

Yes, I do. But sadly I've been tied up with polishing up SLES12 (the
release deadline is looming) and for some inexplicable reason management
seems to find releasing a product more important than working on mainline
issues ... But I hope to find some time soonish (i.e. start of next week)
to work on this; it's the very next thing on my to-do list.

Cheers,

Hannes
--
Dr. Hannes Reinecke		      zSeries & Storage
hare@xxxxxxx			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: J. Hawn, J. Guild, F. Imendörffer, HRB 16746 (AG Nürnberg)

--
dm-devel mailing list
dm-devel@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/dm-devel




