Hi Mike,

On Tue, Jun 2, 2015 at 2:45 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
> On 06/02/2015 08:11 AM, Alex Gorbachev wrote:
>> Hi Mike,
>>
>> Thank you for your responses. I have sent last night's logs in a
>> separate email. Basically, in absence of any ceph related warnings,
>> there seems to be a large set of ESXi rescans, then a message such as:
>>
>> Jun 1 22:57:03 roc-4r-scd212 kernel: [531823.664766] iSCSI Login
>> timeout on Network Portal 10.70.2.211:3260
>
> Yeah, it looks like some commands timed out, esx tried to
> abort and that failed. It then escalated to session level recovery and
> it could not relogin, because the login code was waiting for the
> outstanding tasks as we see this:

Would you say there is no graceful mechanism for LIO and ESXi to just
give up on one login and try another one?

>
>
> Jun 1 22:56:56 roc-4r-scd212 kernel: [531817.070242] Unexpected ret: -32 send data 48
> Jun 1 22:56:57 roc-4r-scd212 kernel: [531817.326345] Unexpected ret: -32 send data 48
> Jun 1 22:57:01 roc-4r-scd212 kernel: [531821.680011] Unexpected ret: -32 send data 48
> Jun 1 22:57:02 roc-4r-scd212 kernel: [531822.384283] Unexpected ret: -32 send data 48
> Jun 1 22:57:03 roc-4r-scd212 kernel: [531823.376666] Unexpected ret: -32 send data 48
> Jun 1 22:57:03 roc-4r-scd212 kernel: [531823.664766] iSCSI Login timeout on Network Portal 10.70.2.211:3260
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373632] NMI watchdog: BUG: soft lockup - CPU#4 stuck for 23s! [iscsi_trx:23174]
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373680] Modules linked in: rbd libceph libcrc32c iscsi_target_mod target_core_file target_core_pscsi target_core_iblock target_core_mod configfs xt_multiport iptable_filter ip_tables x_tables enhanceio_rand(OE) enhanceio_lru(OE) enhanceio_fifo(OE) enhanceio(OE) ipmi_devintf ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd joydev sb_edac edac_core mei_me mei lpc_ich ioatdma ipmi_si ipmi_msghandler 8250_fintek 8021q garp mrp stp shpchp llc wmi bonding mac_hid lp parport mlx4_en vxlan ip6_udp_tunnel udp_tunnel hid_generic igb usbhid mpt2sas ahci hid i2c_algo_bit mlx4_core dca libahci raid_class ptp scsi_transport_sas pps_core
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373742] CPU: 4 PID: 23174 Comm: iscsi_trx Tainted: G C OE 4.1.0-040100rc2-generic #201505032335
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373744] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373747] task: ffff88105912e4a0 ti: ffff880844a44000 task.ti: ffff880844a44000
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373749] RIP: 0010:[<ffffffff8180379d>] [<ffffffff8180379d>] _raw_spin_lock_bh+0x1d/0x70
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373758] RSP: 0018:ffff880844a47d08 EFLAGS: 00000286
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373760] RAX: 00000000abc6abc6 RBX: ffff8808446b5410 RCX: 00000000f7b8aa20
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373762] RDX: 0000000000000001 RSI: 0000000000000001 RDI: ffff8808446b5224
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373764] RBP: ffff880844a47d08 R08: ffff8808446b5410 R09: 0000000000000101
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373766] R10: 0000000000000004 R11: dead000000200200 R12: 0000000000000004
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373768] R13: dead000000200200 R14: 0000000000000004 R15: ffff880844a47ce8
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373770] FS: 0000000000000000(0000) GS:ffff88085fb00000(0000) knlGS:0000000000000000
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373772] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373774] CR2: 00007f19f67ff2d0 CR3: 0000000001e0f000 CR4: 00000000001407e0
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373776] Stack:
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373778] ffff880844a47d28 ffffffffc04194c7 ffff8808446b5140 ffff88085bfb7000
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373781] ffff880844a47d58 ffffffffc0424665 0000000064d064d0 ffff8808446b5140
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373784] 0000000000000001 0000000000000001 ffff880844a47d98 ffffffffc042473e
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373788] Call Trace:
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373804] [<ffffffffc04194c7>] iscsit_stop_dataout_timer+0x27/0x90 [iscsi_target_mod]
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373813] [<ffffffffc0424665>] __iscsit_free_cmd+0x85/0xa0 [iscsi_target_mod]
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373821] [<ffffffffc042473e>] iscsit_free_cmd+0xbe/0x160 [iscsi_target_mod]
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373829] [<ffffffffc042d8dc>] iscsit_close_connection+0x47c/0x770 [iscsi_target_mod]
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373837] [<ffffffffc0418c83>] iscsit_take_action_for_connection_exit+0x83/0x110 [iscsi_target_mod]
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373845] [<ffffffffc042c947>] iscsi_target_rx_thread+0x207/0x2b0 [iscsi_target_mod]
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373848] [<ffffffff817ff4fe>] ? __schedule+0x38e/0x8db
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373856] [<ffffffffc042c740>] ? iscsi_target_tx_thread+0x1d0/0x1d0 [iscsi_target_mod]
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373861] [<ffffffff8109cdc9>] kthread+0xc9/0xe0
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373865] [<ffffffff8109cd00>] ? flush_kthread_worker+0x90/0x90
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373868] [<ffffffff81803fe2>] ret_from_fork+0x42/0x70
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373871] [<ffffffff8109cd00>] ? flush_kthread_worker+0x90/0x90
> Jun 1 22:57:16 roc-4r-scd212 kernel: [531836.373872] Code: 00 00 00 00 eb ce 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 65 81 05 0f 81 80 7e 00 02 00 00 48 89 e5 b8 00 00 02 00 f0 0f c1 07 <89> c2 c1 ea 10 66 39 c2 75 02 5d c3 0f b7 ca b8 00 80 00 00 44
>
>
>
>>
>> Then more errors. I have attached the syslog from the LIO node, as
>> well as logs from one of the six ESXi nodes, but this one has the most
>> storage.
>>
>> Do you think there is an ESXi scan storm or such that breaks LIO? Or
>
> How are you detecting scans in the logs? The non existent lun messages in
> the logs, right? If so, are the scans before or after the aborts? I thought it was
> after. In the logs I was looking at the issue at around Jun 1 22:41:23, so
> I'm not sure if there was an event before the log started or not.

Right, the nonexistent LUN. There are scans before and after the aborts
(I sent you yesterday's logs via email; they show a lot of these).

>
>> is this still a ceph related issue? I did change the osd op complaint
>> time as you indicated, it was set to 30 and I set it to 10.
>>
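(For reference, the change itself is the standard one; treat the following as a
sketch rather than our exact steps, since the runtime syntax can vary a bit
between ceph releases:

    # in ceph.conf on the OSD hosts, [osd] section, then restart the OSDs:
    osd op complaint time = 10

    # or push it into the running OSDs without a restart:
    ceph tell osd.* injectargs '--osd-op-complaint-time 10'

With that in place the OSDs should start logging "slow request" complaints
once an op takes longer than 10 seconds.)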
>
> Ok, and nothing in the osd logs (on the osd boxes that would be in
> /var/log/ceph/)?

I found some oddities; I realize this is ceph related, so I will follow up appropriately:

2015-06-02 08:50:58.263988 7f3ca1c85700 10 -- 10.80.4.20:6812/3546 >> 10.80.3.25:0/4891 pipe(0xd543900 sd=241 :6812 s=4 pgs=265 cs=1 l=1 c=0x9547860).writer done
2015-06-02 08:50:58.264487 7f3ca42d9700 2 -- 10.80.3.20:6818/3546 >> 10.80.4.25:0/4933 pipe(0xda19c00 sd=26 :6818 s=2 pgs=242 cs=1 l=1 c=0x7382160).reader couldn't read tag, (0) Success
2015-06-02 08:50:58.264513 7f3ca42d9700 2 -- 10.80.3.20:6818/3546 >> 10.80.4.25:0/4933 pipe(0xda19c00 sd=26 :6818 s=2 pgs=242 cs=1 l=1 c=0x7382160).fault (0) Success
2015-06-02 08:50:58.264532 7f3ca42d9700 10 -- 10.80.3.20:6818/3546 >> 10.80.4.25:0/4933 pipe(0xda19c00 sd=26 :6818 s=2 pgs=242 cs=1 l=1 c=0x7382160).fault on lossy channel, failing
2015-06-02 08:50:58.264540 7f3ca42d9700 10 -- 10.80.3.20:6818/3546 >> 10.80.4.25:0/4933 pipe(0xda19c00 sd=26 :6818 s=2 pgs=242 cs=1 l=1 c=0x7382160).stop
2015-06-02 08:50:58.264549 7f3ca42d9700 10 -- 10.80.3.20:6818/3546 >> 10.80.4.25:0/4933 pipe(0xda19c00 sd=26 :6818 s=4 pgs=242 cs=1 l=1 c=0x7382160).unregister_pipe

> I cannot remember, did you say you were not using enhance io for this
> (it was just loaded but not used for lio) or did you try with it off?

EnhanceIO is installed, but I am not currently using any caches on any volumes.

>
> Can we replicate this in a non production box? I can send you a patch with
> a bunch of extra debugging on.

Unfortunately no, the non-production box does not see these issues.

Thank you,
Alex

>
>
>
>> Thank you,
>> Alex
>>
>> On Mon, Jun 1, 2015 at 10:55 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>>> On 06/01/2015 09:18 AM, Alex Gorbachev wrote:
>>>> Hi Mike,
>>>>
>>>> On Fri, May 29, 2015 at 3:11 PM, Mike Christie <mchristi@xxxxxxxxxx> wrote:
>>>>> Just want to make sure we are on the same page. The logs you have been
>>>>> posting the last week do not show crashes do they (there are no null
>>>>> pointers or kernel lock ups)? They are just showing error handling
>>>>> issues. I just want to make sure you are no longer seeing the crashes
>>>>> with the newer kernels like we saw before.
>>>>
>>>> That is true, it may be due to the newer kernels. We also implemented
>>>> a fast monitoring system, which alerts us whenever there is an issue -
>>>> so that may play a factor as we would proactively reboot or fail over
>>>> services. I did notice that after a few timeouts the system becomes
>>>> sluggish or unresponsive, and to clear that it takes a reboot. The
>>>> good news is that we are seeing fewer timeouts overall (none last
>>>> week).
>>>>
>>>>> For the timeouts/abort_tasks, the issue is either that the LIO box is locked
>>>>> up in some way that RBD IO is not executing like we saw with the
>>>>> workqueue issues in older kernels, or there is just some issue on the
>>>>> OSD boxes or connection to them that is not allowing IO to execute like
>>>>> normal. This is the normal case and for this, you need to post the logs
>>>>> on the ceph list if there is anything suspicious.
>>>>
>>>> Is there any way to tell which backstore device is causing the
>>>> timeouts by the LIO messages? Then we could map it to RBD and then
>>>> OSD and PG. As it is right now, ceph logs are not showing any errors
>>>> when these events happen.
>>>
>>> On the LIO box you could try turning on more debugging, but if you just
>>> look on the ESXi box vmkernel log you have the info.
>>>
>>> On the ESXi box in the logs you see error messages with the device id like:
>>>
>>> naa.6001405a37e54d400000000000000000
>>>
>>> or something with the host, channel, target, lun identifier like
>>>
>>> iscsi_vmk:
>>> iscsivmk_TaskMgmtIssue: vmhba33:CH:3 T:22 L:1 : Task mgmt "Abort Task"
>>> with itt=0x10d4450 (refITT=0x10d444f) timed ou
>>>
>>> You then just map that to the device on the LIO box. So for
>>> example, if you know the LUN (1 in the example above), then just look at
>>> the LIO target config and see what rbd device you exported as lun 1.
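Thanks, that helps. As a rough sketch of that lookup on the LIO box, assuming
the rbd devices are exported through iblock backstores (the IQN and backstore
name below are placeholders, adjust to the actual config):

    # show which backstore each LUN maps to:
    targetcli ls /iscsi/iqn.2015-01.com.example:target1/tpg1/luns
    # or follow the configfs symlink for, say, LUN 1:
    ls -l /sys/kernel/config/target/iscsi/iqn.2015-01.com.example:target1/tpgt_1/lun/lun_1/
    # the backstore's udev_path then gives the /dev/rbdX device behind it:
    cat /sys/kernel/config/target/core/iblock_0/my_backstore/udev_path

From there, rbd showmapped ties /dev/rbdX back to the pool and image, and
ceph osd map <pool> <object> narrows a given object down to its PG and OSDs.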
>>>
>>> What is your value for:
>>>
>>> osd op complaint time
>>>
>>> Lower it to around 10 or 15 secs, so it is lower than the ESXi timeout
>>> that would result in abort tasks being sent.
>> --
>> To unsubscribe from this list: send the line "unsubscribe target-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
>
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html