Re: Update on crash with kernel 3.19

On Wed, 2015-06-03 at 16:41 -0500, Mike Christie wrote:
> On 06/03/2015 12:06 PM, Alex Gorbachev wrote:
> > Hi Mike,
> > 
> > On Tue, Jun 2, 2015 at 2:45 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
> >> On 06/02/2015 08:11 AM, Alex Gorbachev wrote:
> >>> Hi Mike,
> >>>
> >>> Thank you for your responses.  I have sent last night's logs in a
> >>> separate email.  Basically, in absence of any ceph related warnings,
> >>> there seems to be a large set of ESXi rescans, then a message such as:
> >>>
> >>> Jun  1 22:57:03 roc-4r-scd212 kernel: [531823.664766] iSCSI Login
> >>> timeout on Network Portal 10.70.2.211:3260
> >>
> >> Yeah, it looks like some commands timed out, esx tried to
> >> abort and that failed. It then escalated to session level recovery and
> >> it could not relogin, because the login code was waiting for the
> >> outstanding tasks as we see this:
> > 
> > Would you say there is no graceful mechanism for LIO and ESXi to just
> > give up on one login and try another one?
> 
> 
> I guess it is sort of being debated. SCST and I think tgt allow the new
> login even though a cmd/task is still hung. When discussing why LIO does
> not do the same, we thought there was something in the SCSI/iSCSI specs
> that said we had to do it how LIO does it.
> 
> 

The intention is that outstanding tasks from a failed session are expected to
complete back to target-core before a new session is allowed to take the
failed one's place.

In the iSCSI spec (RFC 3720), this is touched on in Section 5.3.5:

   "Session reinstatement causes all the tasks that were
    active on the old session to be immediately terminated by the target
    without further notice to the initiator."

So the spec doesn't explicitly say a target must wait for outstanding I/O to
complete, but it's implied by the way connection recovery works: a target
must get outstanding I/Os into a proper state before it can allow TMR
TASK_REASSIGN on the failed connection's tasks.

For normal session reinstatement, it's also useful to wait for the
outstanding I/Os to finish processing before accepting new I/Os, in order to
guarantee that old I/Os are never processed after new I/Os from the new
session.

Currently, the target will wait indefinitely for the outstanding I/Os to
complete.  If a raw block driver is not returning its outstanding I/Os
back to the target (ever..?), then it's a driver bug.
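
To make that ordering concrete, here is a minimal userspace sketch of the
wait, modelled with a counter and a condition variable.  It is not the actual
LIO code; every name in it (struct session, reinstate_session(), etc.) is
hypothetical and only illustrates the "drain the old session before the new
login proceeds" rule described above.

/*
 * Userspace model of "wait for outstanding I/O before session
 * reinstatement".  NOT LIO kernel code; names are hypothetical.
 * Build with:  cc -pthread -o reinstate reinstate.c
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

struct session {
	pthread_mutex_t lock;
	pthread_cond_t  all_cmds_done;
	int             outstanding_cmds;  /* I/Os still owned by the backend */
};

/* Backend completes one command and wakes the reinstatement path. */
static void complete_cmd(struct session *s)
{
	pthread_mutex_lock(&s->lock);
	if (--s->outstanding_cmds == 0)
		pthread_cond_signal(&s->all_cmds_done);
	pthread_mutex_unlock(&s->lock);
}

/*
 * Reinstatement path: block until every outstanding command from the
 * failed session has completed back, and only then let the new login
 * proceed.  If the backend never completes an I/O, this waits forever,
 * which mirrors the indefinite wait described above.
 */
static void reinstate_session(struct session *old)
{
	pthread_mutex_lock(&old->lock);
	while (old->outstanding_cmds > 0)
		pthread_cond_wait(&old->all_cmds_done, &old->lock);
	pthread_mutex_unlock(&old->lock);
	printf("old session drained, new login may proceed\n");
}

/* Pretend three I/Os finish late in the backend. */
static void *backend_thread(void *arg)
{
	struct session *s = arg;

	for (int i = 0; i < 3; i++) {
		sleep(1);
		complete_cmd(s);
	}
	return NULL;
}

int main(void)
{
	struct session s = {
		.lock = PTHREAD_MUTEX_INITIALIZER,
		.all_cmds_done = PTHREAD_COND_INITIALIZER,
		.outstanding_cmds = 3,
	};
	pthread_t t;

	pthread_create(&t, NULL, backend_thread, &s);
	reinstate_session(&s);   /* blocks until all three I/Os complete */
	pthread_join(t, NULL);
	return 0;
}

If backend_thread() never called complete_cmd(), reinstate_session() would
block forever, which is the failure mode being discussed here.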

> 
> > 
> >>
> >>
> >> Jun  1 22:56:56 roc-4r-scd212 kernel: [531817.070242] Unexpected ret: -32 send data 48
> >> Jun  1 22:56:57 roc-4r-scd212 kernel: [531817.326345] Unexpected ret: -32 send data 48
> >> Jun  1 22:57:01 roc-4r-scd212 kernel: [531821.680011] Unexpected ret: -32 send data 48
> >> Jun  1 22:57:02 roc-4r-scd212 kernel: [531822.384283] Unexpected ret: -32 send data 48
> >> Jun  1 22:57:03 roc-4r-scd212 kernel: [531823.376666] Unexpected ret: -32 send data 48
> >> Jun  1 22:57:03 roc-4r-scd212 kernel: [531823.664766] iSCSI Login timeout on Network Portal 10.70.2.211:3260
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373632] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [iscsi_trx:23174]
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373680] Modules linked in: rbd libceph libcrc32c iscsi_target_mod target_core_file target_core_pscsi target_core_iblock target_core_mod configfs xt_multiport iptable_filter ip_tables x_tables enhanceio_rand(OE) enhanceio_lru(OE) enhanceio_fifo(OE) enhanceio(OE) ipmi_devintf ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd joydev sb_edac edac_core mei_me mei lpc_ich ioatdma ipmi_si ipmi_msghandler 8250_fintek 8021q garp mrp stp shpchp llc wmi bonding mac_hid lp parport mlx4_en vxlan ip6_udp_tunnel udp_tunnel hid_generic igb usbhid mpt2sas ahci hid i2c_algo_bit mlx4_core dca libahci raid_class ptp scsi_transport_sas pps_core
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373742] CPU: 4 PID: 23174 Comm: iscsi_trx Tainted: G         C OE   4.1.0-040100rc2-generic #201505032335
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373744] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373747] task: ffff88105912e4a0 ti: ffff880844a44000 task.ti: ffff880844a44000
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373749] RIP: 0010:[<ffffffff8180379d>]  [<ffffffff8180379d>] _raw_spin_lock_bh+0x1d/0x70
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373758] RSP: 0018:ffff880844a47d08  EFLAGS: 00000286
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373760] RAX: 00000000abc6abc6 RBX: ffff8808446b5410 RCX: 00000000f7b8aa20
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373762] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8808446b5224
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373764] RBP: ffff880844a47d08 R08: ffff8808446b5410 R09: 0000000000000101
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373766] R10: 0000000000000004 R11: dead000000200200 R12: 0000000000000004
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373768] R13: dead000000200200 R14: 0000000000000004 R15: ffff880844a47ce8
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373770] FS:  0000000000000000(0000) GS:ffff88085fb00000(0000) knlGS:0000000000000000
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373774] CR2: 00007f19f67ff2d0 CR3: 0000000001e0f000 CR4: 00000000001407e0
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373776] Stack:
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373778]  ffff880844a47d28 ffffffffc04194c7 ffff8808446b5140 ffff88085bfb7000
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373781]  ffff880844a47d58 ffffffffc0424665 0000000064d064d0 ffff8808446b5140
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373784]  0000000000000001 0000000000000001 ffff880844a47d98 ffffffffc042473e
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373788] Call Trace:
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373804]  [<ffffffffc04194c7>] iscsit_stop_dataout_timer+0x27/0x90 [iscsi_target_mod]
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373813]  [<ffffffffc0424665>] __iscsit_free_cmd+0x85/0xa0 [iscsi_target_mod]
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373821]  [<ffffffffc042473e>] iscsit_free_cmd+0xbe/0x160 [iscsi_target_mod]
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373829]  [<ffffffffc042d8dc>] iscsit_close_connection+0x47c/0x770 [iscsi_target_mod]
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373837]  [<ffffffffc0418c83>] iscsit_take_action_for_connection_exit+0x83/0x110 [iscsi_target_mod]
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373845]  [<ffffffffc042c947>] iscsi_target_rx_thread+0x207/0x2b0 [iscsi_target_mod]
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373848]  [<ffffffff817ff4fe>] ? __schedule+0x38e/0x8db
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373856]  [<ffffffffc042c740>] ? iscsi_target_tx_thread+0x1d0/0x1d0 [iscsi_target_mod]
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373861]  [<ffffffff8109cdc9>] kthread+0xc9/0xe0
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373865]  [<ffffffff8109cd00>] ? flush_kthread_worker+0x90/0x90
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373868]  [<ffffffff81803fe2>] ret_from_fork+0x42/0x70
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373871]  [<ffffffff8109cd00>] ? flush_kthread_worker+0x90/0x90
> >> Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373872] Code: 00 00 00 00 eb ce 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 65 81 05 0f 81 80 7e 00 02 00 00 48 89 e5 b8 00 00 02 00 f0 0f c1 07 <89> c2 c1 ea 10 66 39 c2 75 02 5d c3 0f b7 ca b8 00 80 00 00 44
> >>
> >>
> >>
> >>>
> >>> Then more errors.  I have attached the syslog from the LIO node, as
> >>> well as logs from one of the six ESXi nodes, but this one has the most
> >>> storage.
> >>>
> >>> Do you think there is an ESXi scan storm or such that breaks LIO?  Or
> >>
> >> How are you detecting scans in the logs? The non existent lun messages in
> >> the logs right? If so, are the scans before or after the aborts? I thought it was
> >> after. In the logs I was looking at the issue at around Jun  1 22:41:23, so
> >> I'm not sure if there was a event before the log started or not.
> > 
> > Right, the nonexistent LUN.  There are scans before and after the
> > aborts (I sent you the logs via email from yesterday that show a lot
> > of these).
> 
> 
> In the log you sent before, the aborts happened first.
> --

Out of curiosity, are RCU stalls still occurring later in the logs..?
If they are, then I'm still leaning toward a bug outside of iscsi-target
code.

Just because the NMI watchdog shows a hung task earlier than the RCU stall
itself doesn't mean the first hung task is the root cause.

So more preceding and subsequent context of the log would be helpful..

--