On 4/15/2019 10:26 PM, TomK wrote:
On 4/15/2019 3:35 PM, Laurence Oberman wrote:
On Mon, 2019-04-15 at 08:39 -0700, Bart Van Assche wrote:
On Mon, 2019-04-15 at 08:55 -0400, Laurence Oberman wrote:
On Sun, 2019-04-14 at 23:25 -0400, TomK wrote:

Hey All,

I'm getting a kernel panic on a Gigabyte GA-890XA-UD3 motherboard that I've got a QLE2464 card in as a target (FC). The kernel has been crashing / panicking in the last 1-2 months about once a week. Before that, it was rock solid for 4-5 years. I've upgraded to kernel 4.18.19 but that hasn't made much of a difference.

Since the message includes qla2x00_request_irqs I thought I would try here first.

Tried to get more info on this but:

1) Keyboard doesn't work and locks up when the panic occurs. No USB ports work. Tried the PS/2 port but nothing.
2) Unable to capture a kdump. Can't get to the kdump vmcore due to 1).

The two screenshots are pretty much all I can capture. Tried things like clocksource=rtc in the kernel parms and disabling hpet1 but apparently I haven't disabled it everywhere since it still shows up.

Wondering if anyone recognizes these messages or has any idea what could be the issue here? Even a hint would be appreciated.

Hello Tom

I have had similar issues and reported them to Himanshu@Cavium. I have kept all my target servers at kernel 4.5 as it has been the only version that has always been stable.

If your motherboard has an NMI (virtual or physical) set all of these in /etc/sysctl.conf
Run sysctl -a;dracut -f and reboot

kernel.nmi_watchdog = 1
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi =
kernel.unknown_nmi_panic = 1

When the issue shows up press the virtual/physical NMI.

This is with the assumption that generic kdump is properly set up, that dmesg | grep crash shows memory reserved by the crashkernel, and that you have tested kdump manually.

Another option is to use a USB serial port to capture the full log if you cannot get kdump to work.

That approach may provide further evidence about kernel bugs but it is not guaranteed that that approach will lead to a solution. It would help if either or both of you could do the following on a test system:

* Check out branch qla2xxx-for-next of my kernel repo on github (https://github.com/bvanassche/linux/tree/qla2xxx-for-next).
* Enable lockdep and KASAN in the kernel config (CONFIG_PROVE_LOCKING and CONFIG_KASAN).
* Build and install that kernel.
* Run your favorite workload.

Please note that the qla2xxx-for-next branch is based on the v5.1-rc1 kernel and hence should not be installed on any production system.

Thanks,
Bart.

Hello Bart

OK, I will get to this by Thursday, won't be able to change the target server kernel until then.

Regards
Laurence

Same. I'll try this out closer to the weekend.

Not an NMI motherboard. This is a 9-10 year old AMD board meant as a desktop or home server.

I'll have to read more about the USB serial port to capture further info. That's interesting.

For the time being, I've disabled HPET in BIOS. (Appears the kernel boot parameter method wasn't enough.)
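For anyone following the kdump/NMI suggestion quoted above, here is a rough sketch of a pre-flight check. It only reads the four sysctls named in the thread, looks for a crashkernel= argument in /proc/cmdline, and reads /sys/kernel/kexec_crash_loaded. The function names and the root parameter are made up for illustration (root just lets the check run against a copied tree), and it is not a substitute for testing kdump manually.

```python
from pathlib import Path

# The four NMI-related sysctls mentioned in the thread, as /proc/sys paths.
NMI_SYSCTLS = (
    "kernel/nmi_watchdog",
    "kernel/panic_on_io_nmi",
    "kernel/panic_on_unrecovered_nmi",
    "kernel/unknown_nmi_panic",
)


def read_text(path):
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def crash_capture_report(root="/"):
    """Collect NMI sysctl values and kdump prerequisites.

    `root` is a hypothetical convenience parameter so the check can be
    pointed at a copied directory tree instead of the live system.
    """
    root = Path(root)
    report = {}
    for name in NMI_SYSCTLS:
        report[name.replace("/", ".")] = read_text(root / "proc/sys" / name)

    # A crashkernel= boot argument is what reserves memory for the kdump kernel.
    cmdline = read_text(root / "proc/cmdline") or ""
    report["crashkernel_arg"] = next(
        (tok for tok in cmdline.split() if tok.startswith("crashkernel=")),
        None,
    )

    # "1" here means a crash kernel is currently loaded via kexec.
    report["kexec_crash_loaded"] = read_text(root / "sys/kernel/kexec_crash_loaded")
    return report


if __name__ == "__main__":
    for key, value in crash_capture_report().items():
        print(f"{key}: {value if value is not None else 'not available'}")
```

A value of "not available" for a sysctl usually just means that knob does not exist on the running kernel or architecture.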
Hey Guys,

Did some of what you suggested, including the USB serial setup:

1) One of DB9 RS232 Serial Null Modem Cable F/F
2) Two of USB to RS232 Serial Port DB9 9 Pin

However, when the kernel came down it took the USB support with it and so minicom went offline:
CTRL-A Z for help |115200 8N1 | NOR | Minicom 2.6.2 | VT102 | Offline
But I did enable full logging for the QLA module:

echo 0x7fffffff > /sys/module/qla2xxx/parameters/ql2xextended_error_logging

Did all that, minus the Kernel v5.1-rc1 implementation, and this is what was picked up from the minicom USB to Serial capture before things went south:
1235905 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd->bufflen=512, cmd->sg_cnt=1, cmd->dma_data_direction=1 se_cmd[000000009c9ea758] qp 0
1235906 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd->bufflen=4096, cmd->sg_cnt=0, cmd->dma_data_direction=2 se_cmd[0000000096ae11b7] qp 0
1235907 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd->bufflen=20480, cmd->sg_cnt=0, cmd->dma_data_direction=2 se_cmd[000000001738f793] qp 0
1235908 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd->bufflen=20480, cmd->sg_cnt=0, cmd->dma_data_direction=2 se_cmd[00000000e8160a90] qp 0
1235909 ^MDetected MISCOMPARE for addr: 0000000033045258 buf: 00000000f9849912
1235910 ^MTarget/fileio: Send MISCOMPARE check condition and sense
1235911 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd->bufflen=512, cmd->sg_cnt=0, cmd->dma_data_direction=2 se_cmd[00000000363ae214] qp 0
1235912 ^Mqla2xxx [0000:04:00.0]-e817: Skipping EXPLICIT_CONFORM and CTIO7_FLAGS_CONFORM_REQ for FCP READ w/ non GOOD status
1235913 ^Mqla2xxx [0000:04:00.0]-e874:2: qlt_free_cmd: se_cmd[000000001db805fd] ox_id 00c8
1235914 ^Mqla2xxx [0000:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 00db
1235915 ^Mqla2xxx [0000:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 00dc
1235916 ^Mqla2xxx [0000:04:00.0]-e874:2: qlt_free_cmd: se_cmd[00000000f67a701f] ox_id 00c9
1235917 ^Mqla2xxx [0000:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 00dd
1235918 ^Mqla2xxx [0000:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps: qla_target(0): type 6 ox_id 00de
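With well over a million lines in the capture, it can help to pull the e818 entries into something sortable. Below is a minimal sketch, assuming the e818 lines always have the field layout shown above; parse_e818 and summarize are made-up helper names, not part of any qla2xxx tooling. Whitespace is stripped before matching so that stray spaces from terminal wrapping do not break the parse. The direction names are the values of the kernel's enum dma_data_direction.

```python
import re

# Matched after ALL whitespace is removed from the line, so wrap damage such
# as "dma_data_directi on=1" or "se_cmd[0000 00009c9ea758]" still parses.
E818 = re.compile(
    r"qla2xxx\[(?P<pci>[0-9a-f:.]+)\]-e818:"
    r"is_send_status=(?P<is_send_status>\d+),"
    r"cmd->bufflen=(?P<bufflen>\d+),"
    r"cmd->sg_cnt=(?P<sg_cnt>\d+),"
    r"cmd->dma_data_direction=(?P<dma_dir>\d+)"
    r"se_cmd\[(?P<se_cmd>[0-9a-f]+)\]"
)

# Values of the Linux kernel's enum dma_data_direction.
DMA_DIRECTIONS = {
    0: "DMA_BIDIRECTIONAL",
    1: "DMA_TO_DEVICE",
    2: "DMA_FROM_DEVICE",
    3: "DMA_NONE",
}


def parse_e818(line):
    """Return a dict for a qla2xxx e818 log line, or None for any other line."""
    match = E818.search("".join(line.split()))
    if match is None:
        return None
    direction = int(match["dma_dir"])
    return {
        "pci": match["pci"],
        "is_send_status": int(match["is_send_status"]),
        "bufflen": int(match["bufflen"]),
        "sg_cnt": int(match["sg_cnt"]),
        "dma_dir": DMA_DIRECTIONS.get(direction, str(direction)),
        "se_cmd": match["se_cmd"],
    }


def summarize(lines):
    """Collect parsed e818 entries and count 'Detected MISCOMPARE' lines."""
    entries = []
    miscompares = 0
    for line in lines:
        if "Detected MISCOMPARE" in line:
            miscompares += 1
        entry = parse_e818(line)
        if entry is not None:
            entries.append(entry)
    return entries, miscompares
```

Feeding it the capture file line by line, for example summarize(open(path, errors="replace")) with path pointing at the minicom log, gives the parsed entries plus a count of MISCOMPARE events to compare across crashes.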
On an earlier crash, captured the attached image. This time there was nothing on the monitor and the keyboard didn't refresh it. No signal.
When looking this up, the closest I could see online is the following:

https://target-devel.vger.kernel.narkive.com/XiM5Csx8/luns-become-unavailable-with-current-git-head

They too run ESXi.

To read the file I used the AnsiEsc plugin for VIM:

https://www.vim.org/scripts/script.php?script_id=302
This started to occur once I had a VMware-based MySQL and PostgreSQL cluster configured. It takes a few days for the issue to occur so, from that perspective, it appears to be memory related.
Firmware that I'm using is:

supported_classes = "Class 3"
supported_speeds = "1 Gbit, 2 Gbit, 4 Gbit"
symbolic_name = "QLE2464 FW:v8.04.00 DVR:v10.00.00.05-k"

Targetcli, rtslib and configshell versions I'm using are:

# rpm -aq|grep -Ei "targetcli|rtslib|configshell"
python-rtslib-3.0.pre4.9~g6fd0bbf-1.el6.noarch
python-configshell-1.1.fb4-1.el6.noarch
targetcli-3.0.pre4.5~ga125182-1.el6.noarch

--
Thx,
TK.
Attachment:
IMG_1821.jpg
Description: JPEG image