On Sun, 2019-04-28 at 12:11 -0400, TomK wrote: > On 4/15/2019 10:26 PM, TomK wrote: > > On 4/15/2019 3:35 PM, Laurence Oberman wrote: > > > On Mon, 2019-04-15 at 08:39 -0700, Bart Van Assche wrote: > > > > On Mon, 2019-04-15 at 08:55 -0400, Laurence Oberman wrote: > > > > > On Sun, 2019-04-14 at 23:25 -0400, TomK wrote: > > > > > > Hey All, > > > > > > > > > > > > I'm getting a kernel panic on an Gigabyte GA-890XA-UD3 > > > > > > motherboard > > > > > > that > > > > > > I've got a QLE2464 card in as a target (FC). The kernel > > > > > > has > > > > > > been > > > > > > crashing / panicking in the last 1-2 months about once a > > > > > > week. Before > > > > > > that, it was rock solid for 4-5 years. I've upgraded to > > > > > > kernel > > > > > > 4.18.19 > > > > > > but that hasn't made much of a difference. Since the > > > > > > message > > > > > > includes > > > > > > qla2x00_request_irqs I thought I would try here first. > > > > > > > > > > > > Tried to get more info on this but: > > > > > > > > > > > > 1) Keyboard doesn't work and locks up when the panic > > > > > > occurs. No > > > > > > USB > > > > > > ports work. Tried the PS/2 port but nothing. > > > > > > > > > > > > 2) Unable to capture a kdump. Can't get to the kdump > > > > > > vmcore due > > > > > > to > > > > > > 1). > > > > > > > > > > > > The two screenshots is pretty much all I can capture. > > > > > > Tried > > > > > > things > > > > > > like > > > > > > clocksource=rtc in the kernel parms and disabling hpet1 > > > > > > but > > > > > > apparently I > > > > > > haven't disabled it everywhere since it still shows up. > > > > > > > > > > > > Wondering if anyone recognizes these messages or has any > > > > > > idea > > > > > > what > > > > > > could > > > > > > be the issue here? Even a hint would be appreciated. > > > > > > > > > > > > > > > > Hello Tom > > > > > I have had similar issues and reported them to > > > > > Himanshu@Cavium > > > > > I have kept all my target servers at kernel 4.5 as it been > > > > > the only > > > > > version that has always been stable. > > > > > If your motherboard has an NMI (virtual or physical) set all > > > > > of > > > > > these > > > > > in /etc/sysctl.conf > > > > > Run sysctl -a;dracut -f and reboot > > > > > > > > > > kernel.nmi_watchdog = 1 > > > > > kernel.panic_on_io_nmi = 1 > > > > > kernel.panic_on_unrecovered_nmi = > > > > > kernel.unknown_nmi_panic = 1 > > > > > > > > > > When the issue shows up press the virtual/physical NMI > > > > > > > > > > This is with the assumption that generic kdump is properly > > > > > setup > > > > > and > > > > > dmesg | grep crash shows memory resrved by the crashkernel > > > > > and that > > > > > you > > > > > have tested kdump manually. > > > > > > > > > > Other options are use a USB serial port to capture the full > > > > > log if > > > > > you > > > > > cannot get kdump to work. > > > > > > > > That approach may provide further evidence about kernel bugs > > > > but it > > > > is not > > > > guaranteed that that approach will lead to a solution. It would > > > > help > > > > if > > > > either or both of you could do the following on a test system: > > > > * Check out branch qla2xxx-for-next of my kernel repo on > > > > github > > > > (https://github.com/bvanassche/linux/tree/qla2xxx-for-next). > > > > > > > > * Enable lockdep and KASAN in the kernel config > > > > (CONFIG_PROVE_LOCKING > > > > and > > > > CONFIG_KASAN). > > > > * Build and install that kernel. > > > > * Run your favorite workload. > > > > > > > > Please note that the qla2xxx-for-next branch is based on the > > > > v5.1-rc1 > > > > kernel > > > > and hence should not be installed on any production system. > > > > > > > > Thanks, > > > > > > > > Bart. > > > > > > Hello Bart > > > OK, I will get to this by Thursday, wont be able to change the > > > targetserver kernel until then. > > > Regards > > > Laurence > > > > > > > Same. I'll try this out closer to the weekend. > > > > Not an NMI motherboard. This is a 9-10 year old AMD board meant as > > a desktop or home server. > > > > I'll have to read more about the USB Serial port to capture further > > info. That's interesting. > > > > For the time being, I've disabled HPET in BIOS. ( Appears the > > kernel boot parameter method wasn't enough. ) > > > > > > Hey Guy's, > Did some of what you suggested, including the USB serial setup: > 1) One of DB9 RS232 Serial Null Modem Cable F/F > 2) Two of USB to RS232 Serial Port DB9 9 Pin Male > however, when the kernel came down it took the USB support with it > and so minicom went offline: > CTRL-A Z for help |115200 8N1 | NOR | Minicom 2.6.2 | VT102 | > Offline > But I did enable full logging for the QLA module: > echo 0x7fffffff > > /sys/module/qla2xxx/parameters/ql2xextended_error_logging > Did all that, minus the Kernel v5.1-rc1 implementation, and this is > what was picked up from the minicom USB to Serial capture before > things went south: > 1235905 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd- > >bufflen=512, cmd->sg_cnt=1, cmd- > > > > dma_data_directi > > on=1 > se_cmd[0000 > > 00009c9ea758] > qp > > 0 > 1235906 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd- > >bufflen=4096, cmd->sg_cnt=0, > cmd- > > >dma_data_direct > > ion=2 > se_cmd[000 > > 0000096ae11b7] > q > > p 0 > 1235907 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd- > >bufflen=20480, cmd->sg_cnt=0, > cmd > > ->dma_data_direc > > tion=2 > se_cmd[00 > > > 0000001738f793] > > qp 0 > 1235908 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd- > >bufflen=20480, cmd->sg_cnt=0, > cmd > > ->dma_data_direc > > tion=2 > se_cmd[00 > > > 000000e8160a90] > > qp 0 > 1235909 ^MDetected MISCOMPARE for addr: 0000000033045258 buf: > 00000000f9849912 > 1235910 ^MTarget/fileio: Send MISCOMPARE check condition and sense > 1235911 ^Mqla2xxx [0000:04:00.0]-e818: is_send_status=1, cmd- > >bufflen=512, cmd->sg_cnt=0, cmd- > > > > dma_data_directi > > on=2 > se_cmd[0000 > > 0000363ae214] > qp > > 0 > 1235912 ^Mqla2xxx [0000:04:00.0]-e817: Skipping EXPLICIT_CONFORM and > CTIO7_FLAGS_CONFORM_REQ > fo > r FCP READ w/ > no > > n GOOD status > 1235913 ^Mqla2xxx [0000:04:00.0]-e874:2: qlt_free_cmd: > se_cmd[000000001db805fd] ox_id 00c8 > 1235914 ^Mqla2xxx [0000:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps: > qla_target(0): type 6 > ox_id > 00db > 1235915 ^Mqla2xxx [0000:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps: > qla_target(0): type 6 > ox_id > 00dc > 1235916 ^Mqla2xxx [0000:04:00.0]-e874:2: qlt_free_cmd: > se_cmd[00000000f67a701f] ox_id 00c9 > 1235917 ^Mqla2xxx [0000:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps: > qla_target(0): type 6 > ox_id > 00dd > 1235918 ^Mqla2xxx [0000:04:00.0]-e872:2: qlt_24xx_atio_pkt_all_vps: > qla_target(0): type 6 > ox_id > 00de > > On an earlier crash, captured the attached image. This time there > was nothing on the monitor and the keyboard didn't refresh it. No > signal. > When looking this up, closest I could see online is the following: > https://target-devel.vger.kernel.narkive.com/XiM5Csx8/luns-become-unavailable-with-current-git-head > They too run ESXi . > To read the file I used the AnsiEsc plugin for VIM: > https://www.vim.org/scripts/script.php?script_id=302 > This started to occur once had a VMware based MySQL and PostgreSQL > cluster configured. Takes a few days for the issue to occur so from > that perspective, appears to be memory related. > Firmware that I'm using is: > supported_classes = "Class 3" > supported_speeds = "1 Gbit, 2 Gbit, 4 Gbit" > symbolic_name = "QLE2464 FW:v8.04.00 DVR:v10.00.00.05-k" > Targetcli, rtslib and configshell versions I'm using are: > > # rpm -aq|grep -Ei "targetcli|rtslib|configshell" > python-rtslib-3.0.pre4.9~g6fd0bbf-1.el6.noarch > python-configshell-1.1.fb4-1.el6.noarch > targetcli-3.0.pre4.5~ga125182-1.el6.noarch > > > -- > Thx, > TK. I missed this email, Been buried in customer cases. I also need to still run some tests. Sorry, reading now