Hey,
I have ~10 machines running multipath-tools-0.4.4 on RHEL ES 4.1 (latest
everything). The machines are mounting multipathed filesystems from an EMC
CLARiiON and a 3PAR SAN device, over the same fabric.
At some point today, one of the machines lost one of its four 3PAR
mounts; all of its other mounts kept working fine. This has happened once
or twice before as well, but we rebooted before I had time to inspect
the issue.
multipath -v3 -l showed this status on the bad path:
params = 1 queue_if_no_path 0 1 1 round-robin 0 2 1 8:64 1000 8:176 1000
status = 1 3 0 1 1 E 0 2 0 8:64 F 3574 8:176 F 3574
exports (350002ac0005b02a4)
[size=150 GB][features="1 queue_if_no_path"][hwhandler="0"]
\_ round-robin 0 [enabled][first]
\_ 5:0:0:3 sde 8:64 [ready ][failed]
\_ 6:0:1:3 sdl 8:176 [ready ][failed]
This was being spammed into /var/log/messages once every five seconds
(the multipathd polling interval):
Aug 10 15:35:43 cc42-86 multipathd: 8:64: tur checker reports path is up
Aug 10 15:35:43 cc42-86 multipathd: devmap event (8163) on exports
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing
path 8:176.
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing
path 8:64.
Aug 10 15:35:43 cc42-86 multipathd: 8:176: tur checker reports path is up
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing
path 8:176.
Aug 10 15:35:43 cc42-86 kernel: device-mapper: dm-multipath: Failing
path 8:64.
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 multipathd: open(/dev/hdc) failed
Aug 10 15:35:43 cc42-86 multipathd: mark 8:64 as failed
Aug 10 15:35:43 cc42-86 multipathd: mark 8:176 as failed
Aug 10 15:35:43 cc42-86 multipathd: devmap event (8164) on exports
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 multipathd: open(/dev/hdc) failed
Aug 10 15:35:43 cc42-86 kernel: cdrom: open failed.
Aug 10 15:35:43 cc42-86 multipathd: open(/dev/hdc) failed
The tur checker sees the path as up, the kernel fails it again, ad infinitum.
Nothing I tried could elicit a more detailed error about why this was
happening. The filesystem on top of the map is a normal ext3 mount, and as
far as I know it wasn't being accessed at the time of the failure.
I switched off the queue_if_no_path option globally in the
multipath.conf file. Immediately the ext3 journal failed out, and
multipath brought both paths back as active:
exports (350002ac0005b02a4)
[size=150 GB][features="0"][hwhandler="0"]
\_ round-robin 0 [active][first]
\_ 5:0:0:3 sde 8:64 [ready ][active]
\_ 6:0:1:3 sdl 8:176 [ready ][active]
I was able to fsck the device and remount it without issue or reboot
after that. Since then, I've left the queue option disabled to see if the
problem creeps back.
I basically have a default multipath.conf file, with some WWN-to-alias
mappings, the queue_if_no_path option enabled (until now), and the EMC
device info added. The problem is on the 3PAR, however, and only one of
the four 3PAR mounts on the machine was having issues.
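For concreteness, the alias mappings are of this form (a sketch; the wwid
and alias here are the ones for the failing map shown above, and the other
LUNs have analogous entries):

multipaths {
        multipath {
                # the 3PAR LUN that keeps flapping
                wwid    350002ac0005b02a4
                alias   exports
        }
        # ...three more 3PAR LUNs plus the EMC LUNs, same form
}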
Is this known at all? Is there anything else I can provide so that we
can figure out why this happened? I had been running multipath-tools for
two months on a test box and never encountered this problem. It has only
crept up as we've started deploying it on more machines for
pre-production. All of the servers are identical: Red Hat ES 4.1, the same
qla2300 fibre cards, the same CPUs, etc.
We also encountered the EMC ghost LUN issue (discussed on here once),
which is especially bad if queue_if_no_path is enabled; it sometimes
causes a kernel panic and brings the machine down :(
Any assistance on the first or second issue would be appreciated!
Thanks,
-Alan