Re: [dm-devel] dm-mpath-rdac.patch problem

Andrew Vasquez <andrew.vasquez@xxxxxxxxxx> · Fri, 13 Jul 2007 09:12:53 -0700

On Thu, 12 Jul 2007, Mike Anderson wrote:

> Copying this mail to linux-scsi and Ccing Andrew Vasquez to possibly
> provide input on the Qlogic behavior.
> 
> Chandra Seetharaman <sekharan@xxxxxxxxxx> wrote:
> > On Thu, 2007-07-12 at 18:35 -0700, Brian De Wolf wrote:
> > > Hello All,
> > > 
> > > I'm not sure if this is the right place for this, but it seems to be the only
> > > mailing list related to dm, multipath, and rdac, as far as I can tell.  I've
> > > been trying out the dm-mpath-rdac patch (both yesterday's and previous) with
> > > gentoo's unstable 2.6.22 kernel, on a Sun x4100 through a QLA2422 HBA (firmware
> > > ql2400_fw.bin.4.00.27) to an IBM DS4000.  I am using a version of
> > > multipath-tools that I got with git a few days ago.
> > > 
> > > I've got multipath working, it reports the hwhandler correctly ([hwhandler=1
> > > rdac]), and the volume is mountable, etc.  It also shows one link as active, the
> > > other as ghost.  However, once the active link dies, the volume becomes read
> > > only, and both connections are listed as failed.  Most importantly, something
> > > like this shows up in the logs:
> > > 
> > > Jul 12 17:11:15 jimbo kernel: device-mapper: multipath rdac: queueing
> > > MODE_SELECT command on 8:32
> > 
> > It does look like the rdac hardware handler is doing the right thing and
> > the qlogic is dying for some reason.
> > 
> > I have tested this code in both RHEL5 and SLES10 environments (qla23xx)
> > and they work fine. Can you try in one of those and see if it is any
> > different.
> > 
> > Just an FYI w.r.t multipath tools: please remove the patch
> > http://git.kernel.org/?p=linux/storage/multipath-tools/.git;a=commit;h=e1e1a1bfb2cf76bfd1a49335e3deec5360fb09db
> > from your tree for the tools to calculate the path priorities properly.
> > 
> > 
> > > Jul 12 17:11:15 jimbo kernel: qla2xxx 0000:02:01.1: ISP System Error - mbx1=0h
> > > mbx2=8012h mbx3=8002h.
> > > Jul 12 17:11:15 jimbo kernel: qla2xxx 0000:02:01.1: Firmware has been previously
> > > dumped (ffffc2000171d000) -- ignoring request...
> > > Jul 12 17:11:16 jimbo kernel: qla2xxx 0000:02:01.1: Performing ISP error
> > > recovery - ha= ffff81007e85c530.

Hmm yes, there are some real problems going on within the firmware which
we need to triage.  From the snippet above, the driver was able to
capture a firmware-dump of a failure (not sure of the timing and how
it relates to the window in which you recognized a 'problem'), but
I'll need you to 'capture' the firmware trace and forward it along to
us to inspect.

1) download the following shell script:

	ftp://ftp.qlogic.com/outgoing/linux/beta/8.x/test/qla_dmp.sh

2) copy the script to the host (/tmp) which is experiencing the
   problems.

3) reboot and load the driver with the ql2xextended_error_logging
   module parameter set to 1. e.g.:

	$ insmod qla2xxx.ko ql2xextended_error_logging=1

4) rerun your test and monitor the kernel-messages file for a message
   similar to:

        Firmware dump saved to temp buffer (1/adcdabcd)

5) To retrieve the dump, go to a console and type the following:

        # cd /tmp/
        # ./qla_dmp.sh 1

   The value passed to qla_dmp.sh should be the same as the first integer
   in the 'saved to temp buffer' string (in this example, 1).  If the
   operation was successful, a message like the following should be
   displayed:

        Firmware dumped to file fw_dump_1_20041217_023222.txt.gz

   Forward the file over.

6) forward over the /var/log/messages file of the driver load and
   failure snippet.
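
For step 5, the index to pass to qla_dmp.sh can be pulled out of the
log instead of being read by eye.  A minimal sketch, assuming the
message has exactly the form shown in step 4; the dump_index name is
made up for illustration and is not part of the driver or of
qla_dmp.sh:

```shell
#!/bin/sh
# dump_index: hypothetical helper, not part of qla2xxx or qla_dmp.sh.
# Reads kernel log lines on stdin and prints the first integer of the
# most recent "Firmware dump saved to temp buffer (N/...)" message.
# Prints nothing if no such message is present.
dump_index() {
    sed -n 's|.*Firmware dump saved to temp buffer (\([0-9][0-9]*\)/.*|\1|p' \
        | tail -n 1
}
```

Something like the following would then run the script only when a
dump was actually reported (the log path is the usual default and may
differ on your distribution):

	# idx=$(dump_index < /var/log/messages)
	# [ -n "$idx" ] && cd /tmp && ./qla_dmp.sh "$idx"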

Not sure which firmware version you are running, but an additional
datapoint which may be useful after you send the firmware-dump is to
download the latest 24xx firmware file from QLogic.com:

	ftp://ftp.qlogic.com/outgoing/linux/firmware/ql2400_fw.bin

and retry the test.  If you still see problems, and see a similar
'Firmware dump saved...' message, follow the steps above again and
forward the same datapoints.

> > > While this may be something for the maintainer of the qla2xxx module (I can't
> > > figure out where I'd send it, in that case...) I think it may be of interest
> > > that the dm_rdac module tries to push something over the HBA that causes it to
> > > bail completely and start from scratch (it starts init processes and loading
> > > firmware again).
> > > 
> > > Not to say that I'm not interested in any help getting this working, that is.
> > > If you have any suggestions on how to get this working, I'd love to hear them.
> > > I'm also willing to guinea pig some testing if you need it (This box still has a
> > > bit before it will have to be put in use).  I may use redhat to ensure that it's
> > > not just a broken HBA, but for the long run we would like it to join our gentoo
> > > environment.
> > > 
> > > Thanks!
> > > Brian De Wolf
> > > 
> > > PS- If the subject mislead you because you feel that this is just a qla2xxx
> > > problem, I'm sorry for wasting your time.

Regards,
Andrew Vasquez