Re: Update on crash with kernel 3.19

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 06/03/2015 12:06 PM, Alex Gorbachev wrote:
> Hi Mike,
> 
> On Tue, Jun 2, 2015 at 2:45 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>> On 06/02/2015 08:11 AM, Alex Gorbachev wrote:
>>> Hi Mike,
>>>
>>> Thank you for your responses.  I have sent last night's logs in a
>>> separate email.  Basically, in absence of any ceph related warnings,
>>> there seems to be a large set of ESXi rescans, then a message such as:
>>>
>>> Jun  1 22:57:03 roc-4r-scd212 kernel: [531823.664766] iSCSI Login
>>> timeout on Network Portal 10.70.2.211:3260
>>
>> Yeah, it looks like some commands timed out, esx tried to
>> abort and that failed. It then escalated to session level recovery and
>> it could not relogin, because the login code was waiting for the
>> outstanding tasks as we see this:
> 
> Would you say there is no graceful mechanism for LIO and ESXi to just
> give up on one login and try another one?


I guess it is sort of being debated. SCST and I think tgt allow the new
login even though a cmd/task is still hung. When discussing why LIO does
not do the same, we thought there was something in the SCSI/iSCSI specs
that said we had to do it how LIO does it.



> 
>>
>>
>> Jun  1 22:56:56 roc-4r-scd212 kernel: [531817.070242] Unexpected ret: -32 send data 48
>> Jun  1 22:56:57 roc-4r-scd212 kernel: [531817.326345] Unexpected ret: -32 send data 48
>> Jun  1 22:57:01 roc-4r-scd212 kernel: [531821.680011] Unexpected ret: -32 send data 48
>> Jun  1 22:57:02 roc-4r-scd212 kernel: [531822.384283] Unexpected ret: -32 send data 48
>> Jun  1 22:57:03 roc-4r-scd212 kernel: [531823.376666] Unexpected ret: -32 send data 48
>> Jun  1 22:57:03 roc-4r-scd212 kernel: [531823.664766] iSCSI Login timeout on Network Portal 10.70.2.211:3260
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373632] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [iscsi_trx:23174]
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373680] Modules linked in: rbd libceph libcrc32c iscsi_target_mod target_core_file target_core_pscsi target_core_iblock target_core_mod configfs xt_multiport iptable_filter ip_tables x_tables enhanceio_rand(OE) enhanceio_lru(OE) enhanceio_fifo(OE) enhanceio(OE) ipmi_devintf ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd joydev sb_edac edac_core mei_me mei lpc_ich ioatdma ipmi_si ipmi_msghandler 8250_fintek 8021q garp mrp stp shpchp llc wmi bonding mac_hid lp parport mlx4_en vxlan ip6_udp_tunnel udp_tunnel hid_generic igb usbhid mpt2sas ahci hid i2c_algo_bit mlx4_core dca libahci raid_class ptp scsi_transport_sas pps_core
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373742] CPU: 4 PID: 23174 Comm: iscsi_trx Tainted: G         C OE   4.1.0-040100rc2-generic #201505032335
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373744] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373747] task: ffff88105912e4a0 ti: ffff880844a44000 task.ti: ffff880844a44000
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373749] RIP: 0010:[<ffffffff8180379d>]  [<ffffffff8180379d>] _raw_spin_lock_bh+0x1d/0x70
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373758] RSP: 0018:ffff880844a47d08  EFLAGS: 00000286
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373760] RAX: 00000000abc6abc6 RBX: ffff8808446b5410 RCX: 00000000f7b8aa20
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373762] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8808446b5224
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373764] RBP: ffff880844a47d08 R08: ffff8808446b5410 R09: 0000000000000101
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373766] R10: 0000000000000004 R11: dead000000200200 R12: 0000000000000004
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373768] R13: dead000000200200 R14: 0000000000000004 R15: ffff880844a47ce8
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373770] FS:  0000000000000000(0000) GS:ffff88085fb00000(0000) knlGS:0000000000000000
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373774] CR2: 00007f19f67ff2d0 CR3: 0000000001e0f000 CR4: 00000000001407e0
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373776] Stack:
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373778]  ffff880844a47d28 ffffffffc04194c7 ffff8808446b5140 ffff88085bfb7000
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373781]  ffff880844a47d58 ffffffffc0424665 0000000064d064d0 ffff8808446b5140
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373784]  0000000000000001 0000000000000001 ffff880844a47d98 ffffffffc042473e
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373788] Call Trace:
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373804]  [<ffffffffc04194c7>] iscsit_stop_dataout_timer+0x27/0x90 [iscsi_target_mod]
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373813]  [<ffffffffc0424665>] __iscsit_free_cmd+0x85/0xa0 [iscsi_target_mod]
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373821]  [<ffffffffc042473e>] iscsit_free_cmd+0xbe/0x160 [iscsi_target_mod]
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373829]  [<ffffffffc042d8dc>] iscsit_close_connection+0x47c/0x770 [iscsi_target_mod]
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373837]  [<ffffffffc0418c83>] iscsit_take_action_for_connection_exit+0x83/0x110 [iscsi_target_mod]
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373845]  [<ffffffffc042c947>] iscsi_target_rx_thread+0x207/0x2b0 [iscsi_target_mod]
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373848]  [<ffffffff817ff4fe>] ? __schedule+0x38e/0x8db
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373856]  [<ffffffffc042c740>] ? iscsi_target_tx_thread+0x1d0/0x1d0 [iscsi_target_mod]
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373861]  [<ffffffff8109cdc9>] kthread+0xc9/0xe0
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373865]  [<ffffffff8109cd00>] ? flush_kthread_worker+0x90/0x90
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373868]  [<ffffffff81803fe2>] ret_from_fork+0x42/0x70
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373871]  [<ffffffff8109cd00>] ? flush_kthread_worker+0x90/0x90
>> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373872] Code: 00 00 00 00 eb ce 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 65 81 05 0f 81 80 7e 00 02 00 00 48 89 e5 b8 00 00 02 00 f0 0f c1 07 <89> c2 c1 ea 10 66 39 c2 75 02 5d c3 0f b7 ca b8 00 80 00 00 44
>>
>>
>>
>>>
>>> Then more errors.  I have attached the syslog from the LIO node, as
>>> well as logs from one of the six ESXi nodes, but this one has the most
>>> storage.
>>>
>>> Do you think there is an ESXi scan storm or such that breaks LIO?  Or
>>
>> How are you detecting scans in the logs? The non existent lun messages in
>> the logs right? If so, are the scans before or after the aborts? I thought it was
>> after. In the logs I was looking at the issue at around Jun  1 22:41:23, so
>> I'm not sure if there was a event before the log started or not.
> 
> Right, the nonexistent LUN.  There are scans before and after the
> aborts (I sent you the logs via email from yesterday that show a lot
> of these).


In the log you sent before, the aborts happened first.
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux SCSI]     [Kernel Newbies]     [Linux SCSI Target Infrastructure]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Device Mapper]

  Powered by Linux