suggestions on debugging abort commands

Hello,
I have a two-node stretched cluster where the situation is as in the
attached image (I hope it is possible to attach small files...).

Both nodes have multipath installed. The node on site 2 often gets
abort commands (see the /var/log/messages excerpt below):

May 14 11:35:22 orastud2 clurgmgrd: [6961]: <notice> Getting status
May 14 11:35:23 orastud2 last message repeated 7 times
May 14 11:59:56 orastud2 kernel: qla2xxx 0000:08:00.0: scsi(0:0:6): Abort command issued -- 1 10710 2002.
May 14 11:59:56 orastud2 kernel: sd 0:0:0:6: timing out command, waited 300s
May 14 11:59:56 orastud2 multipathd: /sbin/mpath_prio_alua exitted with 5
May 14 11:59:56 orastud2 multipathd: error calling out /sbin/mpath_prio_alua 8:208
May 14 12:35:22 orastud2 clurgmgrd: [6961]: <notice> Getting status
May 14 12:35:23 orastud2 last message repeated 7 times
May 14 12:41:25 orastud2 kernel: qla2xxx 0000:08:00.0: scsi(0:1:8): Abort command issued -- 1 16ca3 2002.
May 14 13:35:22 orastud2 clurgmgrd: [6961]: <notice> Getting status
May 14 14:35:23 orastud2 last message repeated 8 times

The messages above refer to the device that now (five hours later) shows this:

mpath3 (3600507630efe0b0c0000000000000603) dm-6 IBM,1750500
[size=60G][features=1 queue_if_no_path][hwhandler=0][rw]
\_ round-robin 0 [prio=0][active]
 \_ 1:0:1:6  sdam 66:96  [active][undef]
 \_ 0:0:1:6  sdan 66:112 [active][undef]
\_ round-robin 0 [prio=0][enabled]
 \_ 1:0:0:6  sdm  8:192  [active][undef]
 \_ 0:0:0:6  sdn  8:208  [active][undef]

But in general I get these kinds of messages for several devices, not
only this one.

What is the meaning of these multipath-related messages:
multipathd: /sbin/mpath_prio_alua exitted with 5
multipathd: error calling out /sbin/mpath_prio_alua 8:208
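For reference, this is how I could run the prio callout by hand and
query the ALUA state of the suspect path directly (just a sketch:
sg_rtpg comes from sg3_utils, and it assumes 8:208 still maps to sdn
as in the multipath output above):

  # Run the callout exactly as multipathd does (8:208 is the
  # major:minor from the log, i.e. sdn above) and show its exit code:
  /sbin/mpath_prio_alua 8:208; echo "exit code: $?"

  # Query the ALUA target port group state of that path directly:
  sg_rtpg -v /dev/sdn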

Any hints on debugging lines of this kind:
qla2xxx 0000:08:00.0: scsi(0:1:8): Abort command issued -- 1 16ca3 2002.
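One thing I will probably try is turning on the qla2xxx extended error
logging, so the driver says more about what happened before the abort
(a sketch; ql2xextended_error_logging is a standard qla2xxx module
parameter, but whether it exists depends on the driver version):

  # In /etc/modprobe.conf, then reload the module (or reboot):
  options qla2xxx ql2xextended_error_logging=1

  # Or at runtime, if the parameter is writable on this kernel:
  echo 1 > /sys/module/qla2xxx/parameters/ql2xextended_error_logging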

Could it be related to BB credits, since the only node impacted is
node 2, which is one switch further away from the storage than node 1?
As the SAN configuration of the two servers is identical and they are
two HP blades, the next step would be to swap them and see whether the
problem moves with the hardware or not.
But any further hints, or debugging flags I can enable in multipath or
other components, are welcome.
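On the multipath side, this is what I know how to raise myself
(a sketch; -v3 is the maximum verbosity level of the multipath tool,
and -d keeps multipathd in the foreground instead of daemonizing):

  # Re-scan the mpath3 map with maximum verbosity, to watch each
  # path's prio callout being executed and its result:
  multipath -v3 -ll mpath3

  # Run multipathd in the foreground with debug output to stdout:
  multipathd -d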

The OS is RHEL 5.3 x86_64 and the storage is an IBM DS6800.
Thanks,
Gianluca

Attachment: san.jpg
Description: JPEG image

