Re: Update on crash with kernel 3.19

On 06/02/2015 08:11 AM, Alex Gorbachev wrote:
> Hi Mike,
> 
> Thank you for your responses.  I have sent last night's logs in a
> separate email.  Basically, in the absence of any Ceph-related warnings,
> there seems to be a large set of ESXi rescans, followed by a message such as:
> 
> Jun  1 22:57:03 roc-4r-scd212 kernel: [531823.664766] iSCSI Login
> timeout on Network Portal 10.70.2.211:3260

Yeah, it looks like some commands timed out, ESXi tried to abort them and
that failed. It then escalated to session-level recovery but could not
relogin, because the login code was waiting for the outstanding tasks, as
we can see here:


Jun  1 22:56:56 roc-4r-scd212 kernel: [531817.070242] Unexpected ret: -32 send data 48
Jun  1 22:56:57 roc-4r-scd212 kernel: [531817.326345] Unexpected ret: -32 send data 48
Jun  1 22:57:01 roc-4r-scd212 kernel: [531821.680011] Unexpected ret: -32 send data 48
Jun  1 22:57:02 roc-4r-scd212 kernel: [531822.384283] Unexpected ret: -32 send data 48
Jun  1 22:57:03 roc-4r-scd212 kernel: [531823.376666] Unexpected ret: -32 send data 48
Jun  1 22:57:03 roc-4r-scd212 kernel: [531823.664766] iSCSI Login timeout on Network Portal 10.70.2.211:3260
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373632] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [iscsi_trx:23174]
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373680] Modules linked in: rbd libceph libcrc32c iscsi_target_mod target_core_file target_core_pscsi target_core_iblock target_core_mod configfs xt_multiport iptable_filter ip_tables x_tables enhanceio_rand(OE) enhanceio_lru(OE) enhanceio_fifo(OE) enhanceio(OE) ipmi_devintf ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd joydev sb_edac edac_core mei_me mei lpc_ich ioatdma ipmi_si ipmi_msghandler 8250_fintek 8021q garp mrp stp shpchp llc wmi bonding mac_hid lp parport mlx4_en vxlan ip6_udp_tunnel udp_tunnel hid_generic igb usbhid mpt2sas ahci hid i2c_algo_bit mlx4_core dca libahci raid_class ptp scsi_transport_sas pps_core
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373742] CPU: 4 PID: 23174 Comm: iscsi_trx Tainted: G         C OE   4.1.0-040100rc2-generic #201505032335
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373744] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373747] task: ffff88105912e4a0 ti: ffff880844a44000 task.ti: ffff880844a44000
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373749] RIP: 0010:[<ffffffff8180379d>]  [<ffffffff8180379d>] _raw_spin_lock_bh+0x1d/0x70
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373758] RSP: 0018:ffff880844a47d08  EFLAGS: 00000286
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373760] RAX: 00000000abc6abc6 RBX: ffff8808446b5410 RCX: 00000000f7b8aa20
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373762] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8808446b5224
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373764] RBP: ffff880844a47d08 R08: ffff8808446b5410 R09: 0000000000000101
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373766] R10: 0000000000000004 R11: dead000000200200 R12: 0000000000000004
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373768] R13: dead000000200200 R14: 0000000000000004 R15: ffff880844a47ce8
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373770] FS:  0000000000000000(0000) GS:ffff88085fb00000(0000) knlGS:0000000000000000
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373772] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373774] CR2: 00007f19f67ff2d0 CR3: 0000000001e0f000 CR4: 00000000001407e0
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373776] Stack:
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373778]  ffff880844a47d28 ffffffffc04194c7 ffff8808446b5140 ffff88085bfb7000
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373781]  ffff880844a47d58 ffffffffc0424665 0000000064d064d0 ffff8808446b5140
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373784]  0000000000000001 0000000000000001 ffff880844a47d98 ffffffffc042473e
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373788] Call Trace:
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373804]  [<ffffffffc04194c7>] iscsit_stop_dataout_timer+0x27/0x90 [iscsi_target_mod]
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373813]  [<ffffffffc0424665>] __iscsit_free_cmd+0x85/0xa0 [iscsi_target_mod]
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373821]  [<ffffffffc042473e>] iscsit_free_cmd+0xbe/0x160 [iscsi_target_mod]
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373829]  [<ffffffffc042d8dc>] iscsit_close_connection+0x47c/0x770 [iscsi_target_mod]
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373837]  [<ffffffffc0418c83>] iscsit_take_action_for_connection_exit+0x83/0x110 [iscsi_target_mod]
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373845]  [<ffffffffc042c947>] iscsi_target_rx_thread+0x207/0x2b0 [iscsi_target_mod]
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373848]  [<ffffffff817ff4fe>] ? __schedule+0x38e/0x8db
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373856]  [<ffffffffc042c740>] ? iscsi_target_tx_thread+0x1d0/0x1d0 [iscsi_target_mod]
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373861]  [<ffffffff8109cdc9>] kthread+0xc9/0xe0
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373865]  [<ffffffff8109cd00>] ? flush_kthread_worker+0x90/0x90
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373868]  [<ffffffff81803fe2>] ret_from_fork+0x42/0x70
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373871]  [<ffffffff8109cd00>] ? flush_kthread_worker+0x90/0x90
Jun  1 22:57:16 roc-4r-scd212 kernel: [531836.373872] Code: 00 00 00 00 eb ce 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 65 81 05 0f 81 80 7e 00 02 00 00 48 89 e5 b8 00 00 02 00 f0 0f c1 07 <89> c2 c1 ea 10 66 39 c2 75 02 5d c3 0f b7 ca b8 00 80 00 00 44



> 
> Then more errors.  I have attached the syslog from the LIO node, as
> well as logs from one of the six ESXi nodes; this is the one with the
> most storage.
> 
> Do you think there is an ESXi scan storm or something similar that breaks LIO?  Or

How are you detecting scans in the logs? From the non-existent LUN messages
in the logs, right? If so, do the scans come before or after the aborts? I
thought they came after. In the logs I was looking at, the issue starts at
around Jun  1 22:41:23, so I'm not sure whether there was an event before
the log started or not.
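
If it helps, here is a rough sketch for eyeballing that ordering straight
from the LIO node's syslog. The search strings below are just placeholders
rather than the exact LIO/ESXi messages, so adjust them to whatever your
logs actually print:

#!/usr/bin/env python
# Print syslog lines matching either event class. syslog is already in
# time order, so the order of the matches shows which event came first.
# NOTE: the patterns are placeholders -- adjust to your actual messages.
import re
import sys

patterns = [re.compile(p, re.IGNORECASE)
            for p in (r"abort", r"non.existent.*lun", r"login timeout")]

with open(sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog") as log:
    for line in log:
        if any(p.search(line) for p in patterns):
            print(line.rstrip())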

> is this still a Ceph-related issue?  I did change the osd op complaint
> time as you suggested; it was set to 30 and I set it to 10.
> 

OK, and nothing in the OSD logs (on the OSD boxes, those would be under
/var/log/ceph/)?
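
With osd op complaint time lowered to 10, any request taking longer than
that should show up in those logs as a "slow request" warning. A quick
sketch for pulling those out (this assumes the default log location and
message wording, which can vary a bit between Ceph releases):

#!/usr/bin/env python
# Scan the OSD logs for slow request complaints. Assumes the default
# /var/log/ceph/ceph-osd.*.log naming; adjust the glob if yours differs.
import glob

for path in sorted(glob.glob("/var/log/ceph/ceph-osd.*.log")):
    with open(path) as log:
        hits = [line.rstrip() for line in log if "slow request" in line]
    if hits:
        print("== %s ==" % path)
        for line in hits[-20:]:   # last 20 matches per OSD
            print(line)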

I cannot remember: did you say you were not using EnhanceIO for this (it
was just loaded but not used for LIO), or did you try with it turned off?

Can we replicate this on a non-production box? I can send you a patch
with a bunch of extra debugging turned on.



> Thank you,
> Alex
> 
> On Mon, Jun 1, 2015 at 10:55 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>> On 06/01/2015 09:18 AM, Alex Gorbachev wrote:
>>> Hi Mike,
>>>
>>> On Fri, May 29, 2015 at 3:11 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>>>> Just want to make sure we are on the same page. The logs you have been
>>>> posting the last week do not show crashes, do they (there are no null
>>>> pointers or kernel lockups)? They are just showing error handling
>>>> issues. I just want to make sure you are no longer seeing the crashes
>>>> with the newer kernels, like we saw before.
>>>
>>> That is true; it may be due to the newer kernels.  We also implemented
>>> a fast monitoring system which alerts us whenever there is an issue -
>>> that may be a factor, since we proactively reboot or fail over
>>> services.  I did notice that after a few timeouts the system becomes
>>> sluggish or unresponsive, and it takes a reboot to clear that.  The
>>> good news is that we are seeing fewer timeouts overall (none last
>>> week).
>>>
>>>> For the timeouts/abort tasks, the issue is either that the LIO box is
>>>> locked up in some way so that RBD IO is not executing, like we saw
>>>> with the workqueue issues in older kernels, or there is some issue on
>>>> the OSD boxes or the connection to them that is not letting IO execute
>>>> normally. The latter is the normal case, and for that you need to post
>>>> the logs on the ceph list if there is anything suspicious.
>>>
>>> Is there any way to tell which backstore device is causing the
>>> timeouts from the LIO messages?  Then we could map it to an RBD image
>>> and then to the OSDs and PGs.  As it is right now, the Ceph logs are
>>> not showing any errors when these events happen.
>>
>> On the LIO box you could try turning on more debugging, but if you just
>> look at the vmkernel log on the ESXi box, you already have the info.
>>
>> In the logs on the ESXi box you will see error messages with the device
>> ID, like:
>>
>> naa.6001405a37e54d400000000000000000
>>
>> or a message with the host, channel, target, LUN identifier, like:
>>
>>  iscsi_vmk:
>> iscsivmk_TaskMgmtIssue: vmhba33:CH:3 T:22 L:1 : Task mgmt "Abort Task"
>> with itt=0x10d4450 (refITT=0x10d444f) timed ou
>>
>> You then just map that to the device on the LIO box. For example, if
>> you know the LUN (1 in the example above), look at the LIO target
>> config and see which rbd device you exported as LUN 1.
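
For what it's worth, here is a rough sketch of scripting that mapping by
walking the LIO configfs tree. The paths assume configfs is mounted at
/sys/kernel/config and the standard iscsi fabric layout, so double check
them on your setup:

#!/usr/bin/env python
# Rough sketch: print which backstore and backing device each exported
# LUN maps to, by walking the LIO configfs tree. Assumes configfs is
# mounted at /sys/kernel/config and the standard iscsi fabric layout.
import glob
import os

for lun in glob.glob("/sys/kernel/config/target/iscsi/*/tpgt_*/lun/lun_*"):
    for entry in os.listdir(lun):
        link = os.path.join(lun, entry)
        if not os.path.islink(link):
            continue
        backstore = os.path.realpath(link)          # e.g. .../iblock_0/rbd1
        dev = ""
        udev_path = os.path.join(backstore, "udev_path")
        if os.path.exists(udev_path):
            with open(udev_path) as f:
                dev = f.read().strip()              # e.g. /dev/rbd1
        print("%s -> %s %s" % (lun, backstore, dev))

rbd showmapped on the LIO box then gets you from the /dev/rbdX device to
the pool and image.
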
>>
>> What is your value for:
>>
>> osd op complaint time
>>
>> Lower it to around 10 or 15 seconds, so it is lower than the ESXi
>> timeout that would result in abort tasks being sent.
