Dan/All,

Sorry for the delay in responding to this, but I am indeed having similar issues. I haven't been able to consistently replicate the crashes, but I did have one today when I powered on every VM in my environment simultaneously. I have pasted below the relevant errors found when running "journalctl -p err"; I believe the crash was at 04:49 on Feb 27.

Feb 26 17:20:59 storage.example.home kernel: qla2xxx [0000:06:00.0]-0121:9: Failed to enable receiving of RSCN requests: 0x2.
Feb 26 17:21:00 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 17:21:00 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 17:21:00 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 17:21:06 storage.example.home kernel: qla2xxx [0000:06:00.1]-0121:10: Failed to enable receiving of RSCN requests: 0x2.
Feb 26 17:51:09 storage.example.home kernel: qla2xxx [0000:06:00.1]-0121:10: Failed to enable receiving of RSCN requests: 0x2.
Feb 26 17:51:32 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:51:52 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:52:17 storage.example.home kernel: qla2xxx [0000:06:00.0]-505e:9: Link is offline.
Feb 26 17:52:31 storage.example.home kernel: qla2xxx [0000:06:00.0]-505e:9: Link is offline.
Feb 26 17:52:51 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:53:11 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:53:31 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:53:51 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:54:11 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:54:32 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:54:52 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:55:12 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:55:32 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 18:45:35 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 18:45:35 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 19:12:19 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:12:25 storage.example.home kernel: qla2xxx [0000:06:00.1]-0121:10: Failed to enable receiving of RSCN requests: 0x2.
Feb 26 19:12:39 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:13:00 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:13:20 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:13:40 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:14:00 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:14:20 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:14:40 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:15:01 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:17:00 storage.example.home kernel: qla2xxx [0000:06:00.1]-f095:10: sess ffff880224459240 PRLI received, before plogi ack.
Feb 26 19:24:18 storage.example.home systemd[1]: Failed unmounting Configuration File System.
Feb 26 19:24:18 storage.example.home systemd[1]: Failed unmounting /var.
Feb 26 19:24:18 storage.example.home kernel: watchdog watchdog0: watchdog did not stop!
-- Reboot --
Feb 26 14:25:36 storage.example.home kernel: ERST: Can not request [mem 0xcff69000-0xcff69fff] for ERST.
Feb 26 19:25:39 storage.example.home kernel: kvm: disabled by bios
Feb 26 19:25:39 storage.example.home kernel: kvm: disabled by bios
Feb 26 19:31:04 storage.example.home kernel: qla2xxx [0000:06:00.0]-0121:9: Failed to enable receiving of RSCN requests: 0x2.
Feb 26 19:31:31 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 19:31:31 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 19:31:31 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 20:10:38 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 20:10:38 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 20:10:39 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 20:24:57 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 20:24:57 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 20:24:57 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
-- Reboot --
Feb 26 23:55:55 storage.example.home kernel: ERST: Can not request [mem 0xcff69000-0xcff69fff] for ERST.
-- Reboot --
Feb 27 04:49:20 storage.example.home kernel: kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3099!
-- Reboot --
Feb 27 04:55:58 storage.example.home kernel: kvm: disabled by bios
Feb 27 04:55:58 storage.example.home kernel: kvm: disabled by bios
Feb 27 04:55:58 storage.example.home kernel: kvm: disabled by bios
Feb 27 04:56:34 storage.example.home kernel: qla2xxx [0000:06:00.1]-0121:10: Failed to enable receiving of RSCN requests: 0x2.
Feb 27 04:56:47 storage.example.home kernel: qla2xxx [0000:06:00.0]-0121:9: Failed to enable receiving of RSCN requests: 0x2.
Feb 27 05:13:12 storage.example.home dbus[899]: Can't send to audit system: USER_AVC avc: received policyload notice (seqno=2) exe="/usr/bin/dbus-daemon" sauid=81 hostname=? addr=? terminal=?
Feb 27 05:15:14 storage.example.home kernel: kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3099!

I also saw some posts on a different mailing list where people referenced a firmware issue with the qla2xxx driver when it is used in target mode, but I'm not sure whether that's relevant to this discussion (they said it only appears when the initiator is another Linux machine running kernel 4.1+).

Anyway, I'll keep an eye on this and try to keep better track of exactly when and why the crashes happen. Please let me know if there's any other helpful/relevant information I can provide to help pinpoint this issue.

Thanks!
David

On Fri, Feb 26, 2016 at 9:25 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
> Okay, despite my ongoing efforts to resolve this, I am no closer to
> having a stable storage solution. Here is what I know so far; let's
> figure this out.
>
> First of all, I am NOT alone with my problems: I have a friend who is
> experiencing these same problems with FC. In fact, I would like to
> know if anyone is actually running this successfully - surely one of
> the developers must have a lab set up to validate the code, right?
>
> Kernel 4.3.4 symptoms:
> Kernel panics randomly (as reported in the past). Nicholas identified
> a bug that he believed was causing this and created a patch.
>
> Kernel 4.5rc?
> - created from the latest kernel source code with the patch from
> Nicholas about 3 weeks ago
> Runs fine for about a day, then simply stops responding. There is
> absolutely nothing in /var/log/messages when this happens, and the
> service seems to still be running, but no servers can see the storage.
> Is there anywhere else I can look for logs? Is there a way to enable
> more verbose logging? Additionally, once this happens it is
> impossible to stop the service; even running a kill -9 on the process
> never succeeds. The only thing that can be done at this point is to
> reboot the target server.
> Oddly, my FC switch still sees the target server.
> Here is what "ps aux | grep target" shows for the process after it
> crashes and I try to stop it; note the "D", i.e. "uninterruptible sleep":
> root 17055 0.0 0.0 214848 15444 ? Ds 19:35 0:00
> /usr/bin/python3 /usr/bin/targetctl clear
>
> Kernel 3.5.0 (old target server, unpatched because I'm afraid of
> breaking target!)
> Runs flawlessly all day long, never fails for any reason, no matter
> what version of ESXi is used
>
> I believe the problem is related to the LIO implementation of VAAI,
> for the following reason: when I used ESXi 5 without VAAI enabled and
> the 4.3.4 target, I didn't have any problems. When I tried ESXi 5.5
> and 6 (with VAAI enabled), LIO crashed. I also don't have any
> problems with ESXi 6 against my old target based on kernel 3.5, which
> predates the implementation of VAAI.
>
> I'm really tired of having my equipment blamed for the problem, or
> the idea that I'm using old firmware on my FC cards or FC switch. I
> actually spent a large amount of money building the server that I'm
> using for LIO because of the past suggestion that maybe the backend
> wasn't keeping up. All firmware has been updated to the latest, and I
> have seen the problem even when using a direct connection (no
> switch). I've also used three different physical servers as the
> target.
> In addition, a friend of mine has had the exact same issues, as I
> mentioned earlier. One thing that has been suggested is that my
> backend disks can't keep up. My backend is 20x 10k SAS drives in
> RAID 6 with an Intel 730 SSD acting as read and write cache using LSI
> CacheCade 2.0. Testing with this setup prior to a crash has shown
> 400MB/s reads and writes (likely limited by the single 4Gb FC
> connection) and 1-10ms latency; needless to say, I don't think the
> problem is my back end. Additionally, my old target server that
> never crashes has only 6 old Seagate 7200 RPM drives in RAID 6, which
> are good for about 200MB/s reads and writes.
>
> I'm open to doing just about any troubleshooting that could help.
> Also, as mentioned in the past, I have access to absolutely any
> version of ESXi and I have multiple available hosts, so if you would
> like to test anything related to that I would be happy to assist.
>
> Thanks
> Dan
> --
> To unsubscribe from this list: send the line "unsubscribe target-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
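In case it helps anyone else wading through logs like the dump above: here is roughly how I have been collapsing the repeated qla2xxx kernel messages into per-message counts, so the one-off errors (like the PRLI/watchdog lines) stand out from the "double plogi" noise. This is just a sketch; the summarize_qla helper name and the journal.txt file are my own invention, and in practice the input comes from something like "journalctl -k -p err > journal.txt".

```shell
# Sketch: tally distinct qla2xxx error messages in a saved journal dump,
# most frequent first. journal.txt is a hypothetical file produced with:
#   journalctl -k -p err > journal.txt
summarize_qla() {
  # Keep only the qla2xxx portion of each line, strip the per-HBA
  # "[PCI addr]-msgid:host:" prefix so identical messages group together,
  # then count duplicates.
  grep -o 'qla2xxx \[[0-9a-f:.]*\]-[0-9a-f]*:[0-9]*: .*' "$1" \
    | sed 's/^qla2xxx \[[^]]*\]-[0-9a-f]*:[0-9]*: //' \
    | sort | uniq -c | sort -rn
}

# A couple of sample lines (taken from the dump above) so this runs standalone:
printf '%s\n' \
  'Feb 26 17:51:32 h kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.' \
  'Feb 26 17:51:52 h kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.' \
  'Feb 26 17:20:59 h kernel: qla2xxx [0000:06:00.0]-0121:9: Failed to enable receiving of RSCN requests: 0x2.' \
  > journal.txt

summarize_qla journal.txt
```

Nothing fancy, but it makes it much easier to see whether a crash window introduced any new message types versus just more of the usual repeats.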