Re: Kernel panic w/ message request_threaded_irq -> qla2x00_request_irqs -> qla2x00_probe_one -> mod_timer

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 5/12/2019 11:56 PM, TomK wrote:
On 5/3/2019 9:07 AM, TomK wrote:
On 5/2/2019 10:00 PM, Laurence Oberman wrote:
On Sun, 2019-04-28 at 12:11 -0400, TomK wrote:
On 4/15/2019 10:26 PM, TomK wrote:
On 4/15/2019 3:35 PM, Laurence Oberman wrote:
On Mon, 2019-04-15 at 08:39 -0700, Bart Van Assche wrote:
On Mon, 2019-04-15 at 08:55 -0400, Laurence Oberman wrote:
On Sun, 2019-04-14 at 23:25 -0400, TomK wrote:
Hey All,

I'm getting a kernel panic on an Gigabyte GA-890XA-UD3
motherboard
that
I've got a QLE2464 card in as a target (FC).  The kernel
has
been
crashing / panicking in the last 1-2 months about once a
week.  Before
that, it was rock solid for 4-5 years.  I've upgraded to
kernel
4.18.19
but that hasn't made much of a difference.  Since the
message
includes
qla2x00_request_irqs I thought I would try here first.

Tried to get more info on this but:

1) Keyboard doesn't work and locks up when the panic
occurs.  No
USB
ports work.  Tried the PS/2 port but nothing.

2) Unable to capture a kdump.  Can't get to the kdump
vmcore due
to
1).

The two screenshots is pretty much all I can capture.
Tried
things
like
clocksource=rtc in the kernel parms and disabling hpet1
but
apparently I
haven't disabled it everywhere since it still shows up.

Wondering if anyone recognizes these messages or has any
idea
what
could
be the issue here?  Even a hint would be appreciated.

Hello Tom
I have had similar issues and reported them to
Himanshu@Cavium
I have kept all my target servers at kernel 4.5 as it been
the only
version that has always been stable.
If your motherboard has an NMI (virtual or physical) set all
of
these
in /etc/sysctl.conf
Run sysctl -a;dracut -f and reboot

kernel.nmi_watchdog = 1
kernel.panic_on_io_nmi = 1
kernel.panic_on_unrecovered_nmi =
kernel.unknown_nmi_panic = 1

When the issue shows up press the virtual/physical NMI

This is with the assumption that generic kdump is properly
setup
and
dmesg | grep crash shows memory resrved by the crashkernel
and that
you
have tested kdump manually.

Other options are use a USB serial port to capture the full
log if
you
cannot get kdump to work.
That approach may provide further evidence about kernel bugs
but it
is not
guaranteed that that approach will lead to a solution. It would
help
if
either or both of you could do the following on a test system:
* Check out branch qla2xxx-for-next of my kernel repo on
github
    (https://github.com/bvanassche/linux/tree/qla2xxx-for-next).
* Enable lockdep and KASAN in the kernel config
(CONFIG_PROVE_LOCKING
and
    CONFIG_KASAN).
* Build and install that kernel.
* Run your favorite workload.

Please note that the qla2xxx-for-next branch is based on the
v5.1-rc1
kernel
and hence should not be installed on any production system.

Thanks,

Bart.
Hello Bart
OK, I will get to this by Thursday, wont be able to change the
targetserver kernel until then.
Regards
Laurence

Same.  I'll try this out closer to the weekend.

Not an NMI motherboard.  This is a 9-10 year old AMD board meant as
a desktop or home server.

I'll have to read more about the USB Serial port to capture further
info.  That's interesting.

For the time being, I've disabled HPET in BIOS.  ( Appears the
kernel boot parameter method wasn't enough. )



Hey Guy's,
Did some of what you suggested, including the USB serial setup:
1) One of DB9 RS232 Serial Null Modem Cable F/F
2) Two of USB to RS232 Serial Port DB9 9 Pin Male
however, when the kernel came down it took the USB support with it
and so minicom went offline:
  CTRL-A Z for help |115200 8N1 | NOR | Minicom 2.6.2  | VT102 |
Offline
But I did enable full logging for the QLA module:
echo 0x7fffffff >
/sys/module/qla2xxx/parameters/ql2xextended_error_logging
Did all that, minus the Kernel v5.1-rc1 implementation, and this is
what was picked up from the minicom USB to Serial capture before
things went south:
1235905 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd-
bufflen=512, cmd->sg_cnt=1, cmd-
dma_data_directi
                              on=1
se_cmd[0000
                         00009c9ea758]
qp
                 0
1235906 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd-
bufflen=4096, cmd->sg_cnt=0,
cmd-
dma_data_direct
                              ion=2
se_cmd[000
                        0000096ae11b7]
q
               p 0
1235907 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd-
bufflen=20480, cmd->sg_cnt=0,
cmd
->dma_data_direc
                              tion=2
se_cmd[00
0000001738f793]
                              qp 0
1235908 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd-
bufflen=20480, cmd->sg_cnt=0,
cmd
->dma_data_direc
                              tion=2
se_cmd[00
000000e8160a90]
                              qp 0
1235909 ^MDetected MISCOMPARE for addr: 0000000033045258 buf:
00000000f9849912
1235910 ^MTarget/fileio: Send MISCOMPARE check condition and sense
1235911 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd-
bufflen=512, cmd->sg_cnt=0, cmd-
dma_data_directi
                              on=2
se_cmd[0000
                         0000363ae214]
qp
                 0
1235912 ^Mqla2xxx [0000:04:00.0]-e817: Skipping EXPLICIT_CONFORM and
CTIO7_FLAGS_CONFORM_REQ
fo
              r FCP READ w/
no
                n GOOD status
1235913 ^Mqla2xxx [0000:04:00.0]-e874:2: qlt_free_cmd:
se_cmd[000000001db805fd] ox_id 00c8
1235914 ^Mqla2xxx [0000:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps:
qla_target(0): type 6
ox_id
                  00db
1235915 ^Mqla2xxx [0000:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps:
qla_target(0): type 6
ox_id
                  00dc
1235916 ^Mqla2xxx [0000:04:00.0]-e874:2: qlt_free_cmd:
se_cmd[00000000f67a701f] ox_id 00c9
1235917 ^Mqla2xxx [0000:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps:
qla_target(0): type 6
ox_id
                  00dd
1235918 ^Mqla2xxx [0000:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps:
qla_target(0): type 6
ox_id
                  00de

On an earlier crash, captured the attached image.  This time there
was nothing on the monitor and the keyboard didn't refresh it.  No
signal.
When looking this up, closest I could see online is the following:

https://target-devel.vger.kernel.narkive.com/XiM5Csx8/luns-become-unavailable-with-current-git-head
They too run ESXi .
To read the file I used the AnsiEsc plugin for VIM:
https://www.vim.org/scripts/script.php?script_id=302
This started to occur once had a VMware based MySQL and PostgreSQL
cluster configured.  Takes a few days for the issue to occur so from
that perspective, appears to be memory related.
Firmware that I'm using is:
     supported_classes   = "Class 3"
     supported_speeds    = "1 Gbit, 2 Gbit, 4 Gbit"
     symbolic_name       = "QLE2464 FW:v8.04.00 DVR:v10.00.00.05-k"
Targetcli, rtslib and configshell versions I'm using are:

# rpm -aq|grep -Ei "targetcli|rtslib|configshell"
python-rtslib-3.0.pre4.9~g6fd0bbf-1.el6.noarch
python-configshell-1.1.fb4-1.el6.noarch
targetcli-3.0.pre4.5~ga125182-1.el6.noarch


--
Thx,
TK.

I missed this email, Been buried in customer cases.
I also need to still run some tests.
Sorry, reading now


No worries.  Would be very interested to see what you find.

In the meantime later tonight I'll be trying to 1) find more recent firmware for the card, 2) try to use the newer kernel v5.1-rc1 outlined above 3) try the 4.5 kernel later on the weekend.




Trying with this F/W and driver:


symbolic_name       = "QLE2464 FW:v8.06.02 DVR:v10.00.00.07-k-debug"

[root@mbpc-pc 05-03-2019]# strings /lib/firmware/ql2400_fw.bin|grep -Ei copyright
  COPYRIGHT 2016 QLOGIC CORPORATION   ISP24xx Firmware   Version 8.06.02  $
[root@mbpc-pc 05-03-2019]#



The latest 'boot' firmware I can find is the following:



[root@mbpc-pc 05-03-2019]# strings Q24A7232.BIN |grep -Ei copyright
Copyright (C) QLogic Corporation 1993-2015. All rights reserved.
Copyright (C) QLogic Corporation 1993-2015. All rights reserved.
  COPYRIGHT 2015 QLOGIC CORPORATION   ISP24xx Firmware   Version 8.01.02  $
You have new mail in /var/spool/mail/root
[root@mbpc-pc 05-03-2019]#

This is what shows up as loaded during system boot.


So resolved it a few months back. What I did is simply to upgrade the memory on the storage server from 4GB to 8GB. Then later to 16GB. Since the time I upgrade the memory, the kernel panic didn't show up and everything's been stable.

--
Thx,
TK.



[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [SCSI Target Devel]     [Linux SCSI Target Infrastructure]     [Kernel Newbies]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Samba]     [Device Mapper]

  Powered by Linux