Dan/All,

Sorry for the delay in responding to this, but I am indeed having similar issues. I haven't been able to consistently replicate the crashes, but I did have one today when I powered on every VM in my environment simultaneously. I have pasted below the relevant errors found when running "journalctl -p err"; I believe the crash was at 04:49 on Feb 27.

Feb 26 17:20:59 storage.example.home kernel: qla2xxx [0000:06:00.0]-0121:9: Failed to enable receiving of RSCN requests: 0x2.
Feb 26 17:21:00 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 17:21:00 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 17:21:00 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 17:21:06 storage.example.home kernel: qla2xxx [0000:06:00.1]-0121:10: Failed to enable receiving of RSCN requests: 0x2.
Feb 26 17:51:09 storage.example.home kernel: qla2xxx [0000:06:00.1]-0121:10: Failed to enable receiving of RSCN requests: 0x2.
Feb 26 17:51:32 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:51:52 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:52:17 storage.example.home kernel: qla2xxx [0000:06:00.0]-505e:9: Link is offline.
Feb 26 17:52:31 storage.example.home kernel: qla2xxx [0000:06:00.0]-505e:9: Link is offline.
Feb 26 17:52:51 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:53:11 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:53:31 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:53:51 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:54:11 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:54:32 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:54:52 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:55:12 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 17:55:32 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 18:45:35 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 18:45:35 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 19:12:19 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:12:25 storage.example.home kernel: qla2xxx [0000:06:00.1]-0121:10: Failed to enable receiving of RSCN requests: 0x2.
Feb 26 19:12:39 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:13:00 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:13:20 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:13:40 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:14:00 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:14:20 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:14:40 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:15:01 storage.example.home kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.
Feb 26 19:17:00 storage.example.home kernel: qla2xxx [0000:06:00.1]-f095:10: sess ffff880224459240 PRLI received, before plogi ack.
Feb 26 19:24:18 storage.example.home systemd[1]: Failed unmounting Configuration File System.
Feb 26 19:24:18 storage.example.home systemd[1]: Failed unmounting /var.
Feb 26 19:24:18 storage.example.home kernel: watchdog watchdog0: watchdog did not stop!
-- Reboot --
Feb 26 14:25:36 storage.example.home kernel: ERST: Can not request [mem 0xcff69000-0xcff69fff] for ERST.
Feb 26 19:25:39 storage.example.home kernel: kvm: disabled by bios
Feb 26 19:25:39 storage.example.home kernel: kvm: disabled by bios
Feb 26 19:31:04 storage.example.home kernel: qla2xxx [0000:06:00.0]-0121:9: Failed to enable receiving of RSCN requests: 0x2.
Feb 26 19:31:31 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 19:31:31 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 19:31:31 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 20:10:38 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 20:10:38 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 20:10:39 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 20:24:57 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 20:24:57 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
Feb 26 20:24:57 storage.example.home kernel: MODE SENSE: unimplemented page/subpage: 0x1c/0x02
-- Reboot --
Feb 26 23:55:55 storage.example.home kernel: ERST: Can not request [mem 0xcff69000-0xcff69fff] for ERST.
-- Reboot --
Feb 27 04:49:20 storage.example.home kernel: kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3099!
-- Reboot --
Feb 27 04:55:58 storage.example.home kernel: kvm: disabled by bios
Feb 27 04:55:58 storage.example.home kernel: kvm: disabled by bios
Feb 27 04:55:58 storage.example.home kernel: kvm: disabled by bios
Feb 27 04:56:34 storage.example.home kernel: qla2xxx [0000:06:00.1]-0121:10: Failed to enable receiving of RSCN requests: 0x2.
Feb 27 04:56:47 storage.example.home kernel: qla2xxx [0000:06:00.0]-0121:9: Failed to enable receiving of RSCN requests: 0x2.
Feb 27 05:13:12 storage.example.home dbus[899]: Can't send to audit system: USER_AVC avc: received policyload notice (seqno=2) exe="/usr/bin/dbus-daemon" sauid=81 hostname=? addr=? terminal=?
Feb 27 05:15:14 storage.example.home kernel: kernel BUG at drivers/scsi/qla2xxx/qla_target.c:3099!

I also saw some posts on a different mailing list where people referenced a firmware issue with the qla2xxx driver when it is used in target mode, but I'm not sure whether that's relevant to this discussion (they said it only appears when the initiator is another Linux machine running kernel 4.1+).

Anyway, I'll keep an eye on this and try to keep better track of exactly when and why the crashes happen. Please let me know if there's any other helpful/relevant information I can provide to help pinpoint this issue.

Thanks!
David

On Fri, Feb 26, 2016 at 9:25 PM, Dan Lane <dracodan@xxxxxxxxx> wrote:
> Okay, despite my ongoing efforts to resolve this, I am no closer to
> having a stable storage solution. Here is what I know so far; let's
> figure this out.
>
> First of all, I am NOT alone with my problems: I have a friend who is
> experiencing these same problems with FC. In fact, I would like to
> know if anyone is actually running this successfully - surely one of
> the developers must have a lab set up to validate the code, right?
>
> Kernel 4.3.4 symptoms:
> Kernel panics randomly (as reported in the past). Nicholas identified
> a bug that he believed was causing this and created a patch.
>
> Kernel 4.5rc?
> - created from the latest kernel source code with the patch from
> Nicholas about 3 weeks ago
> Runs fine for about a day, then simply stops responding. There is
> absolutely nothing in /var/log/messages when this happens, and the
> service seems to still be running, but no servers can see the storage.
> Is there anywhere else I can look for logs? Is there a way to enable
> more verbose logging? Additionally, once this happens it is
> impossible to stop the service; even running a kill -9 on the process
> never succeeds. The only thing that can be done at this point is to
> reboot the target server.
> Oddly, my FC switch still sees the target server.
> Here is what "ps aux | grep target" shows for the process after it
> crashes and I try to stop it; note the "D", i.e. "uninterruptible sleep":
> root 17055 0.0 0.0 214848 15444 ? Ds 19:35 0:00
> /usr/bin/python3 /usr/bin/targetctl clear
>
> Kernel 3.5.0 (old target server, unpatched because I'm afraid of
> breaking target!)
> Runs flawlessly all day long, never fails for any reason, no matter
> what version of ESXi is used
>
> I believe the problem is related to the LIO implementation of VAAI,
> for the following reason: when I used ESXi 5 without VAAI enabled and
> the 4.3.4 target, I didn't have any problems. When I tried ESXi 5.5
> and 6 (with VAAI enabled), LIO crashed. I also don't have any
> problems with ESXi 6 against my old target based on kernel 3.5, which
> predates the implementation of VAAI.
>
> I'm really tired of having my equipment blamed for the problem, or
> the idea that I'm using old firmware on my FC cards or FC switch. I
> actually spent a large amount of money building the server that I'm
> using for LIO because of the past suggestion that maybe the backend
> wasn't keeping up. All firmware has been updated to the latest, and I
> have seen the problem even when using a direct connection (no
> switch). I've also used three different physical servers as the
> target.
> In addition, a friend of mine has had the exact same issues, as I
> mentioned earlier. One thing that has been suggested is that my
> backend disks can't keep up. My backend is 20x 10k SAS drives in
> RAID 6 with an Intel 730 SSD acting as read and write cache using LSI
> CacheCade 2.0. Testing with this setup prior to a crash has shown
> 400MB/s reads and writes (likely limited by the single 4Gb FC
> connection) and 1-10ms latency; needless to say, I don't think the
> problem is my back end. Additionally, my old target server that
> never crashes has only 6 old Seagate 7200 RPM drives in RAID 6, which
> are good for about 200MB/s reads and writes.
>
> I'm open to doing just about any troubleshooting that could help.
> Also, as mentioned in the past, I have access to absolutely any
> version of ESXi and I have multiple available hosts, so if you would
> like to test anything related to that I would be happy to assist.
>
> Thanks
> Dan
> --
> To unsubscribe from this list: send the line "unsubscribe target-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
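In case it helps anyone else wading through logs like the dump above: here is roughly how I have been collapsing the repeated qla2xxx kernel messages into per-message counts, so the one-off errors (like the PRLI/watchdog lines) stand out from the "double plogi" noise. This is just a sketch; the summarize_qla helper name and the journal.txt file are my own invention, and in practice the input comes from something like "journalctl -k -p err > journal.txt".

```shell
# Sketch: tally distinct qla2xxx error messages in a saved journal dump,
# most frequent first. journal.txt is a hypothetical file produced with:
#   journalctl -k -p err > journal.txt
summarize_qla() {
  # Keep only the qla2xxx portion of each line, strip the per-HBA
  # "[PCI addr]-msgid:host:" prefix so identical messages group together,
  # then count duplicates.
  grep -o 'qla2xxx \[[0-9a-f:.]*\]-[0-9a-f]*:[0-9]*: .*' "$1" \
    | sed 's/^qla2xxx \[[^]]*\]-[0-9a-f]*:[0-9]*: //' \
    | sort | uniq -c | sort -rn
}

# A couple of sample lines (taken from the dump above) so this runs standalone:
printf '%s\n' \
  'Feb 26 17:51:32 h kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.' \
  'Feb 26 17:51:52 h kernel: qla2xxx [0000:06:00.0]-f094:9: sess ffff8800ca9ca780 received double plogi.' \
  'Feb 26 17:20:59 h kernel: qla2xxx [0000:06:00.0]-0121:9: Failed to enable receiving of RSCN requests: 0x2.' \
  > journal.txt

summarize_qla journal.txt
```

Nothing fancy, but it makes it much easier to see whether a crash window introduced any new message types versus just more of the usual repeats.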