Re: LIO crashing Fedora box, multiple versions and kernels tested

Giridhar Malavali <giridhar.malavali@xxxxxxxxxx> · Sat, 11 Apr 2015 03:25:01 +0000

Hi Dan, Nicholas,

We do see the problem and are working to make sure the fix is available
upstream ASAP.

Thanks for bringing this to our attention.

-- Giri

On 4/10/15 8:21 PM, "Dan Lane" <dracodan@xxxxxxxxx> wrote:

>YES!  Finally I have an answer to this headache!  I reverted my Fedora
>21 system to the 3.17.4 kernel and everything seems to be working
>flawlessly.
>
>Thank you so much for tracking this down, hopefully the fix can be
>implemented upstream quickly...
>
>Dan
>
>On Fri, Apr 10, 2015 at 4:15 PM, Nicholas A. Bellinger
><nab@xxxxxxxxxxxxxxx> wrote:
>> Hi Dan,
>>
>> Adding the QLogic folks to CC.
>>
>> On Fri, 2015-04-10 at 00:08 -0400, Dan Lane wrote:
>>> I'm starting to get a better idea of what the trigger is for my problem...
>>> Something during startup/shutdown of ESXi is doing something that
>>> either causes or triggers the failure.  I had run the storage for over
>>> a day with a VM working flawlessly (it was using a single hard drive
>>> as the back-end after my ramdisk wasn't big enough) and it didn't cause a
>>> crash.  When I had presented the LUN the host was already running, and
>>> everyone played happy.  But once I tried shutting the host down, the
>>> storage server crashed... I was able to get MOST of the error message
>>> as I was running a tailf of messages from ssh, but it's abruptly cut
>>> off.  I plan to do further testing to see if I can figure out exactly
>>> what is triggering the failure and get better logs (I'm open to
>>> suggestions on this!).
>>>
>>> Error message:
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341] ------------[ cut here
>>> ]------------
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341] WARNING: CPU: 1 PID: 0
>>> at kernel/watchdog.c:317 watchdog_overflow_callback+0x82/0xc0()
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341] Watchdog detected hard
>>> LOCKUP on cpu 1
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341] Modules linked in:
>>> tcm_qla2xxx target_core_user uio target_core_pscsi target_core_file
>>> target_core_iblock iscsi_target_mod target_core_mod ip6t_rpfilter
>>> ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute
>>> bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6
>>> nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security
>>> ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
>>> nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle
>>> iptable_security iptable_raw coretemp gpio_ich iTCO_wdt
>>> iTCO_vendor_support ipmi_devintf kvm_intel kvm lpc_ich mfd_core
>>> i5000_edac edac_core ipmi_ssif serio_raw i5k_amb ipmi_si
>>> ipmi_msghandler ioatdma shpchp dca acpi_cpufreq nfsd auth_rpcgss
>>> nfs_acl lockd grace sunrpc xfs libcrc32c radeon i2c_algo_bit
>>> drm_kms_helper mptsas ttm scsi_transport_sas drm mptscsih qla2xxx bnx2
>>> mptbase usb_storage ata_generic pata_acpi scsi_transport_fc
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341] CPU: 1 PID: 0 Comm:
>>> swapper/1 Not tainted 4.0.0-0.rc2.git0.1.fc22.x86_64 #1
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341] Hardware name: IBM IBM
>>> eServer BladeCenter HS21 -[8853L6U]-/Server Blade, BIOS
>>> -[BCE142BUS-1.18]- 06/17/2009
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  0000000000000000
>>> 0f19f9d994cd1b0a ffff88042fc85a60 ffffffff81780388
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  0000000000000000
>>> ffff88042fc85ab8 ffff88042fc85aa0 ffffffff8109c83a
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  0000000000000000
>>> ffff88041d6d0000 0000000000000000 ffff88042fc85c00
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341] Call Trace:
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  <NMI>
>>> [<ffffffff81780388>] dump_stack+0x45/0x57
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8109c83a>]
>>> warn_slowpath_common+0x8a/0xc0
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8109c8c5>]
>>> warn_slowpath_fmt+0x55/0x70
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81153e92>]
>>> watchdog_overflow_callback+0x82/0xc0
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8119599b>]
>>> __perf_event_overflow+0x9b/0x250
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff811964b4>]
>>> perf_event_overflow+0x14/0x20
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81034f22>]
>>> intel_pmu_handle_irq+0x1d2/0x3e0
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8102c1db>]
>>> perf_event_nmi_handler+0x2b/0x50
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81019148>]
>>> nmi_handle+0x88/0x130
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff810196c2>]
>>> default_do_nmi+0x42/0x110
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81019818>]
>>> do_nmi+0x88/0xd0
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81788b01>]
>>> end_repeat_nmi+0x1e/0x2e
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81786415>] ?
>>> _raw_spin_lock_irqsave+0x55/0x60
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81786415>] ?
>>> _raw_spin_lock_irqsave+0x55/0x60
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff81786415>] ?
>>> _raw_spin_lock_irqsave+0x55/0x60
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  <<EOE>>  <IRQ>
>>> [<ffffffffa00ffeb2>] qlt_fc_port_deleted+0x62/0xd0 [qla2xxx]
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffffa00a2c43>]
>>> qla2x00_mark_device_lost+0x153/0x2e0 [qla2xxx]
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffffa00c2339>]
>>> qla2x00_async_event+0xe19/0x1870 [qla2xxx]
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffffa00c3901>]
>>> qla24xx_intr_handler+0x1a1/0x2f0 [qla2xxx]
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff810f4c57>]
>>> handle_irq_event_percpu+0x77/0x1a0
>>> Apr 10 00:00:19 labsan2 kernel: [90003.576341]  [<ffffffff8
>>>
>>
>> Ok, it looks like a deadlock on qla_hw_data->hardware_lock, because
>> qla24xx_intr_handler() is holding the lock when qla2x00_async_event() ->
>> qla2x00_mark_device_lost() -> qlt_fc_port_deleted() attempts to take the
>> same lock.
>>
>> AFAICT, this regression was introduced during the v3.18-rc timeframe
>> with the following commit:
>>
>> commit ef86cb2059a14b4024c7320999ee58e938873032
>> Author: Chad Dupuis <chad.dupuis@xxxxxxxxxx>
>> Date:   Thu Sep 25 05:17:01 2014 -0400
>>
>>     qla2xxx: Mark port lost when we receive an RSCN for it.
>>
>>     Signed-off-by: Chad Dupuis <chad.dupuis@xxxxxxxxxx>
>>     Signed-off-by: Saurav Kashyap <saurav.kashyap@xxxxxxxxxx>
>>     Signed-off-by: Christoph Hellwig <hch@xxxxxx>
>>
>> diff --git a/drivers/scsi/qla2xxx/qla_isr.c b/drivers/scsi/qla2xxx/qla_isr.c
>> index 696e4a2..a04a1b1 100644
>> --- a/drivers/scsi/qla2xxx/qla_isr.c
>> +++ b/drivers/scsi/qla2xxx/qla_isr.c
>> @@ -575,8 +575,9 @@ qla2x00_async_event(scsi_qla_host_t *vha, struct rsp_que *rsp, uint16_t *mb)
>>         struct device_reg_2xxx __iomem *reg = &ha->iobase->isp;
>>         struct device_reg_24xx __iomem *reg24 = &ha->iobase->isp24;
>>         struct device_reg_82xx __iomem *reg82 = &ha->iobase->isp82;
>> -       uint32_t        rscn_entry, host_pid;
>> +       uint32_t        rscn_entry, host_pid, tmp_pid;
>>         unsigned long   flags;
>> +       fc_port_t       *fcport = NULL;
>>
>>         /* Setup to process RIO completion. */
>>         handle_cnt = 0;
>> @@ -979,6 +980,20 @@ skip_rio:
>>                 if (qla2x00_is_a_vp_did(vha, rscn_entry))
>>                         break;
>>
>> +               /*
>> +                * Search for the rport related to this RSCN entry and mark it
>> +                * as lost.
>> +                */
>> +               list_for_each_entry(fcport, &vha->vp_fcports, list) {
>> +                       if (atomic_read(&fcport->state) != FCS_ONLINE)
>> +                               continue;
>> +                       tmp_pid = fcport->d_id.b24;
>> +                       if (fcport->d_id.b24 == rscn_entry) {
>> +                               qla2x00_mark_device_lost(vha, fcport, 0, 0);
>> +                               break;
>> +                       }
>> +               }
>> +
>>                 atomic_set(&vha->loop_down_timer, 0);
>>                 vha->flags.management_server_logged_in = 0;
>>
>>
>> Chad & Co, how would you like to proceed here..?
>>
>> Thanks,
>>
>> --nab
>>
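
For anyone following the analysis quoted above: the failure mode described is a
self-deadlock, where code that already holds a non-recursive spinlock calls
into a function that tries to take the same lock. The sketch below is a
user-space illustration only and is not qla2xxx code. It assumes C11 atomics;
hardware_lock, spin_lock_bounded(), intr_handler_sketch() and
port_deleted_sketch() are hypothetical stand-ins, and the spin is bounded so
the demo returns instead of hanging.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for qla_hw_data->hardware_lock (not driver code). */
static atomic_flag hardware_lock = ATOMIC_FLAG_INIT;

/*
 * Try to take the lock, spinning for at most max_spins attempts.
 * A real spinlock spins without a bound; the bound exists only so this
 * demo can report the failure instead of hanging.
 */
static bool spin_lock_bounded(atomic_flag *lock, unsigned long max_spins)
{
    while (max_spins--) {
        if (!atomic_flag_test_and_set_explicit(lock, memory_order_acquire))
            return true;    /* flag was clear: lock acquired */
    }
    return false;           /* still held: an unbounded spin would never end */
}

static void spin_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}

/* Stand-in for the innermost callee, which takes the lock itself. */
bool port_deleted_sketch(void)
{
    if (!spin_lock_bounded(&hardware_lock, 1000000UL))
        return false;
    spin_unlock(&hardware_lock);
    return true;
}

/* Stand-in for the interrupt handler, which holds the lock across the call. */
bool intr_handler_sketch(void)
{
    bool outer_ok = spin_lock_bounded(&hardware_lock, 1UL);
    bool inner_ok;

    assert(outer_ok);                   /* nothing else holds the lock yet */
    inner_ok = port_deleted_sketch();   /* lock is still held by the caller */
    spin_unlock(&hardware_lock);
    return inner_ok;
}
```

Calling intr_handler_sketch() returns false, because the inner acquisition
cannot succeed while the caller still holds the lock. Calling
port_deleted_sketch() on its own returns true. In the kernel the spin has no
bound and interrupts are disabled, so the CPU spins until the NMI watchdog
reports the hard lockup seen in the trace.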
