Re: LIO crashing Fedora box, multiple versions and kernels tested

"Nicholas A. Bellinger" <nab@xxxxxxxxxxxxxxx> · Fri, 10 Apr 2015 13:15:16 -0700

Hi Dan,

Adding the QLogic folks to CC.

On Fri, 2015-04-10 at 00:08 -0400, Dan Lane wrote:
> I'm starting to get a better idea of what the trigger is for my problem...
> Something during startup/shutdown of ESXi either causes or triggers
> the failure.  I had run the storage for over a day with a VM working
> flawlessly (it was using a single hard drive as the back-end, after my
> ramdisk wasn't big enough) and it didn't cause a crash.  When I
> presented the LUN the host was already running, and everyone played
> happy.  But once I tried shutting the host down, the storage server
> crashed... I was able to get MOST of the error message, as I was
> running a tailf of messages from ssh, but it's abruptly cut off.  I
> plan to do further testing to see if I can figure out exactly what is
> triggering the failure and get better logs (I'm open to suggestions on
> this!).
> 
> Error message:
> Apr 10 00:00:19 labsan2 kernel: [90003.576341] ------------[ cut here
> ]------------
> Apr 10 00:00:19 labsan2 kernel: [90003.576341] WARNING: CPU: 1 PID: 0
> at kernel/watchdog.c:317 watchdog_overflow_callback+0x82/0xc0()
> Apr 10 00:00:19 labsan2 kernel: [90003.576341] Watchdog detected hard
> LOCKUP on cpu 1
> Apr 10 00:00:19 labsan2 kernel: [90003.576341] Modules linked in:
> tcm_qla2xxx target_core_user uio target_core_pscsi target_core_file
> target_core_iblock iscsi_target_mod target_core_mod ip6t_rpfilter
> ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute
> bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
> nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
> ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
> iptable_security iptable_raw coretemp gpio_ich iTCO_wdt
> iTCO_vendor_support ipmi_devintf kvm_intel kvm lpc_ich mfd_core
> i5000_edac edac_core ipmi_ssif serio_raw i5k_amb ipmi_si
> ipmi_msghandler ioatdma shpchp dca acpi_cpufreq nfsd auth_rpcgss
> nfs_acl lockd grace sunrpc xfs libcrc32c radeon i2c_algo_bit
> drm_kms_helper mptsas ttm scsi_transport_sas drm mptscsih qla2xxx bnx2
> mptbase usb_storage ata_generic pata_acpi scsi_transport_fc
> Apr 10 00:00:19 labsan2 kernel: [90003.576341] CPU: 1 PID: 0 Comm:
> swapper/1 Not tainted 4.0.0-0.rc2.git0.1.fc22.x86_64 #1
> Apr 10 00:00:19 labsan2 kernel: [90003.576341] Hardware name: IBM IBM
> eServer BladeCenter HS21 -[8853L6U]-/Server Blade, BIOS
> -[BCE142BUS-1.18]- 06/17/2009
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  0000000000000000
> 0f19f9d994cd1b0a ffff88042fc85a60 ffffffff81780388
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  0000000000000000
> ffff88042fc85ab8 ffff88042fc85aa0 ffffffff8109c83a
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  0000000000000000
> ffff88041d6d0000 0000000000000000 ffff88042fc85c00
> Apr 10 00:00:19 labsan2 kernel: [90003.576341] Call Trace:
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  <NMI>
> [<ffffffff81780388>] dump_stack+0x45/0x57
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8109c83a>]
> warn_slowpath_common+0x8a/0xc0
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8109c8c5>]
> warn_slowpath_fmt+0x55/0x70
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81153e92>]
> watchdog_overflow_callback+0x82/0xc0
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8119599b>]
> __perf_event_overflow+0x9b/0x250
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff811964b4>]
> perf_event_overflow+0x14/0x20
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81034f22>]
> intel_pmu_handle_irq+0x1d2/0x3e0
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8102c1db>]
> perf_event_nmi_handler+0x2b/0x50
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81019148>]
> nmi_handle+0x88/0x130
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff810196c2>]
> default_do_nmi+0x42/0x110
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81019818>]
> do_nmi+0x88/0xd0
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81788b01>]
> end_repeat_nmi+0x1e/0x2e
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81786415>] ?
> _raw_spin_lock_irqsave+0x55/0x60
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81786415>] ?
> _raw_spin_lock_irqsave+0x55/0x60
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81786415>] ?
> _raw_spin_lock_irqsave+0x55/0x60
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  <<EOE>>  <IRQ>
> [<ffffffffa00ffeb2>] qlt_fc_port_deleted+0x62/0xd0 [qla2xxx]
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffffa00a2c43>]
> qla2x00_mark_device_lost+0x153/0x2e0 [qla2xxx]
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffffa00c2339>]
> qla2x00_async_event+0xe19/0x1870 [qla2xxx]
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffffa00c3901>]
> qla24xx_intr_handler+0x1a1/0x2f0 [qla2xxx]
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff810f4c57>]
> handle_irq_event_percpu+0x77/0x1a0
> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8
> 

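Ok, it looks like a self-deadlock on qla_hw_data->hardware_lock:
qla24xx_intr_handler() is still holding the lock when
qla2x00_async_event() -> qla2x00_mark_device_lost() ->
qlt_fc_port_deleted() attempts to take the same lock on the same CPU.
The lock is not recursive and is taken with interrupts disabled (note
_raw_spin_lock_irqsave in the trace), so the CPU spins forever and the
NMI watchdog reports a hard LOCKUP.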
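
For anyone following along, the call chain can be modelled in user space.
This is only a sketch under stated assumptions: toy_spinlock, toy_lock(),
toy_unlock(), model_intr_handler() and model_port_deleted() are made-up
names for illustration, not driver code, and toy_lock() returns false at
the point where a real non-recursive spinlock would spin forever.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a non-recursive kernel spinlock (illustrative only). */
struct toy_spinlock {
	bool held;
};

/*
 * Returns true if the lock was acquired.  Returns false where a real
 * spin_lock_irqsave() would spin forever because the lock is already held
 * by the same execution context.
 */
static bool toy_lock(struct toy_spinlock *lock)
{
	if (lock->held)
		return false;
	lock->held = true;
	return true;
}

static void toy_unlock(struct toy_spinlock *lock)
{
	assert(lock->held);
	lock->held = false;
}

/* Models qlt_fc_port_deleted(): it takes hardware_lock itself. */
static bool model_port_deleted(struct toy_spinlock *hardware_lock)
{
	if (!toy_lock(hardware_lock))
		return false;	/* this is the self-deadlock in the real driver */
	/* ... work done under the lock would go here ... */
	toy_unlock(hardware_lock);
	return true;
}

/*
 * Models qla24xx_intr_handler(): hardware_lock is held across the whole
 * async-event path.  The direct call below stands in for
 * qla2x00_async_event() -> qla2x00_mark_device_lost() -> qlt_fc_port_deleted().
 */
static bool model_intr_handler(struct toy_spinlock *hardware_lock)
{
	bool ok;

	if (!toy_lock(hardware_lock))
		return false;
	ok = model_port_deleted(hardware_lock);
	toy_unlock(hardware_lock);
	return ok;
}
```

Calling model_intr_handler() on an unlocked toy_spinlock returns false,
while calling model_port_deleted() directly returns true. That is the
point: the inner function is only a problem when its caller already holds
the lock.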

AFAICT, this regression was introduced during the v3.18-rc timeframe
with the following commit:

commit ef86cb2059a14b4024c7320999ee58e938873032
Author: Chad Dupuis <chad.dupuis@xxxxxxxxxx>
Date:   Thu Sep 25 05:17:01 2014 -0400

    qla2xxx: Mark port lost when we receive an RSCN for it.
    
    Signed-off-by: Chad Dupuis <chad.dupuis@xxxxxxxxxx>
    Signed-off-by: Saurav Kashyap <saurav.kashyap@xxxxxxxxxx>
    Signed-off-by: Christoph Hellwig <hch@xxxxxx>

diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
index 696e4a2..a04a1b1 100644
--- a/drivers/scsi/qla2xxx/qla_isr.c
+++ b/drivers/scsi/qla2xxx/qla_isr.c
@@ -575,8 +575,9 @@ qla2x00_async_event(scsi_qla_host_t *vha, struct rsp_que *rsp, uint16_t *mb)
        struct device_reg_2xxx __iomem *reg = &ha->iobase->isp;
        struct device_reg_24xx __iomem *reg24 = &ha->iobase->isp24;
        struct device_reg_82xx __iomem *reg82 = &ha->iobase->isp82;
-       uint32_t        rscn_entry, host_pid;
+       uint32_t        rscn_entry, host_pid, tmp_pid;
        unsigned long   flags;
+       fc_port_t       *fcport = NULL;
 
        /* Setup to process RIO completion. */
        handle_cnt = 0;
@@ -979,6 +980,20 @@ skip_rio:
                if (qla2x00_is_a_vp_did(vha, rscn_entry))
                        break;
 
+               /*
+                * Search for the rport related to this RSCN entry and mark it
+                * as lost.
+                */
+               list_for_each_entry(fcport, &vha->vp_fcports, list) {
+                       if (atomic_read(&fcport->state) != FCS_ONLINE)
+                               continue;
+                       tmp_pid = fcport->d_id.b24;
+                       if (fcport->d_id.b24 == rscn_entry) {
+                               qla2x00_mark_device_lost(vha, fcport, 0, 0);
+                               break;
+                       }
+               }
+
                atomic_set(&vha->loop_down_timer, 0);
                vha->flags.management_server_logged_in = 0;


Chad & Co, how would you like to proceed here..?

Thanks,

--nab
