Re: Update on crash with kernel 3.19

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 05/04/2015 12:54 AM, Nicholas A. Bellinger wrote:
> On Sun, 2015-05-03 at 16:58 -0400, Robert Wood wrote:
>> Hi Nicholas,
>>
>> On Thu, Apr 30, 2015 at 2:47 AM, Nicholas A. Bellinger
>> <nab@xxxxxxxxxxxxxxx> wrote:
>>> Hi Robert, Alex, & Mike,
>>>
>>> On Thu, 2015-04-23 at 15:23 -0500, Mike Christie wrote:
>>>> I think I can replicate the same problem:
>>>>
>>>> I do:
>>>>
>>>> 1. run IO from the initiator
>>>> 2. force the backing store device to halt IO
>>>> 3. let the initiator's timeout fire and its error handler run
>>>> 4. wait until the initiator tries to relogin and let "iSCSI Login
>>>> timeout" fire
>>>> 5. let the backing store device execute IO again
>>>>
>>>> I am not certain if #5 is required. I think maybe it oopsed after #4. I
>>>> cannot remember. I have not had time to debug it yet. Maybe someone else
>>>> has seen it.
>>>>
>>>
>>> So looking at the original dmesg, the system appears to be hitting an
>>> rcu stall, but the NMI watchdog for the soft-lockup in iscsi_trx:13963
>>> triggers first when attempting to spin_unlock_bh() ->
>>> __local_bh_enable_ip() from somewhere in iscsit_close_connection() code.
>>>
>>> Apr 19 03:30:02 roc-4r-scd212 kernel: [74486.177488] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 9072
>>> Apr 19 03:30:10 roc-4r-scd212 kernel: [74493.812492] Unexpected ret: -32 send data 48
>>> Apr 19 03:30:12 roc-4r-scd212 kernel: [74495.813309] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000015
>>> Apr 19 03:30:12 roc-4r-scd212 kernel: [74495.987564] libceph: osd25 10.80.3.30:6809 socket closed (con state OPEN)
>>> Apr 19 03:30:15 roc-4r-scd212 kernel: [74498.326272] Unexpected ret: -32 send data 48
>>> Apr 19 03:30:17 roc-4r-scd212 kernel: [74500.983349] iSCSI Login timeout on Network Portal 10.70.2.211:3260
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.167913] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [iscsi_trx:13963]
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.167959] Modules linked
>>> in: target_core_user uio rbd libceph libcrc32c iscsi_target_mod
>>> target_core_file target_core_pscsi target_core_iblock target_core_mod
>>> configfs xt_multiport iptable_filter ip_tables x_tables
>>> enhanceio_rand(OE) enhanceio_lru(OE) enhanceio_fifo(OE) enhanceio(OE)
>>> ipmi_devintf ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp
>>> kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel
>>> aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac joydev
>>> edac_core mei_me mei lpc_ich ioatdma ipmi_si ipmi_msghandler
>>> 8250_fintek wmi mac_hid 8021q garp mrp stp llc bonding lp parport nfsd
>>> auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache mlx4_en vxlan
>>> ip6_udp_tunnel udp_tunnel hid_generic igb usbhid mpt2sas i2c_algo_bit
>>> ahci hid dca mlx4_core raid_class ptp libahci scsi_transport_sas
>>> pps_core
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168026] CPU: 9 PID: 13963 Comm: iscsi_trx Tainted: G         C OE  3.19.4-031904-generic #201504131440
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168028] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168030] task: ffff8808591a1d70 ti: ffff880859620000 task.ti: ffff880859620000
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168032] RIP: 0010:[<ffffffff8107ae70>]  [<ffffffff8107ae70>] __local_bh_enable_ip+0x0/0x90
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168041] RSP: 0018:ffff880859623d70  EFLAGS: 00000216
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168043] RAX: ffff88104ae4de80 RBX: ffff8810582a5000 RCX: ffff88104ae4de80
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168044] RDX: ffff88104ae4de80 RSI: 0000000000000200 RDI: ffffffffc07cb25b
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168046] RBP: ffff880859623d78 R08: ffff88104ae4df50 R09: 0000000000000101
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168048] R10: 0000000000000001 R11: dead000000200200 R12: ffffffffc07c0f68
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168049] R13: ffff880859623d08 R14: ffffffff817d2e30 R15: ffff880859623cd8
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168052] FS: 0000000000000000(0000) GS:ffff88107fc60000(0000) knlGS:0000000000000000
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168054] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168055] CR2: 00007f8e74ed7e60 CR3: 0000000001c15000 CR4: 00000000001407e0
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168057] Stack:
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168059]  ffffffff817d2e30 ffff880859623dd8 ffffffffc07cb25b 0000000000000000
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168062]  ffff880842942000 ffff8810582a53e0 ffff8810582a5440 ffff8808591a1d70
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168065]  ffff8810582a5000 ffff8810582a53f4 ffff880859623e4c ffff8808591a1d70
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168069] Call Trace:
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168076] [<ffffffff817d2e30>] ? _raw_spin_unlock_bh+0x20/0x50
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168094] [<ffffffffc07cb25b>] iscsit_close_connection+0x3ab/0x660 [iscsi_target_mod]
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168102] [<ffffffffc07b69b3>] iscsit_take_action_for_connection_exit+0x83/0x110 [iscsi_target_mod]
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168111] [<ffffffffc07ca33e>] iscsi_target_rx_thread+0x22e/0x320 [iscsi_target_mod]
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168119] [<ffffffffc07ca110>] ? iscsi_target_tx_thread+0x220/0x220 [iscsi_target_mod]
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168124] [<ffffffff81095e29>] kthread+0xc9/0xe0
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168128] [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168132] <ffffffff817d3718>] ret_from_fork+0x58/0x90
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168135] [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168136] Code: 1f 44 00 00
>>> 65 8b 05 70 64 f9 7e 85 c0 75 14 48 89 df 57 9d 0f 1f 44 00 00 48 83
>>> c4 08 5b 5d c3 0f 1f 00 e8 b3 a6 75 00 eb e5 90 <0f> 1f 44 00 00 55 48
>>> 89 e5 53 89 f3 48 83 ec 08 65 8b 05 19 0a
>>>
>>> and then a bit further along:
>>>
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248624] INFO: rcu_sched self-detected stall on CPU { 9}  (t=15000 jiffies g=846173 c=846172 q=0)
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248684] Task dump for CPU 9:
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248686] iscsi_trx       R running task        0 13963      2 0x0000000c
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248689]  ffffffff81c54cc0 ffff88107fc63d18 ffffffff810a3444 0000000000000009
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248692]  ffffffff81c54cc0 ffff88107fc63d38 ffffffff810a60bf 0000000000000087
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248695]  ffffffff81c54dc0 ffff88107fc63d68 ffffffff810d68fb ffffffff81c54cc0
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248699] Call Trace:
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248700]  <IRQ> [<ffffffff810a3444>] sched_show_task+0xb4/0x120
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248710] [<ffffffff810a60bf>] dump_cpu_task+0x3f/0x50
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248717] [<ffffffff810d68fb>] rcu_dump_cpu_stacks+0x8b/0xc0
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248721] [<ffffffff810d84d7>] print_cpu_stall+0xd7/0x140
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248725] [<ffffffff810d8ef5>] check_cpu_stall.isra.55+0xa5/0xb0
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248728] [<ffffffff810d8f38>] __rcu_pending+0x38/0x170
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248732] [<ffffffff810daa8d>] rcu_check_callbacks+0x9d/0x190
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248735] [<ffffffff810dfdf8>] update_process_times+0x38/0x70
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248742] [<ffffffff810efab3>] tick_sched_handle.isra.12+0x33/0x70
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248745] [<ffffffff810efbeb>] tick_sched_timer+0x4b/0x80
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248749] [<ffffffff810e08b6>] __run_hrtimer+0x76/0x230
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248752] [<ffffffff810efba0>] ? tick_nohz_handler+0xb0/0xb0
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248756] [<ffffffff810e10bf>] hrtimer_interrupt+0xff/0x260
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248763] [<ffffffff8104c319>] local_apic_timer_interrupt+0x39/0x60
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248768] [<ffffffff817d6805>] smp_apic_timer_interrupt+0x45/0x60
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248772] [<ffffffff817d487d>] apic_timer_interrupt+0x6d/0x80
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248773]  <EOI> [<ffffffffc0765d41>] ? transport_wait_for_tasks+0xc1/0x150 [target_core_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248798] [<ffffffffc0765d3b>] ? transport_wait_for_tasks+0xbb/0x150 [target_core_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248806] [<ffffffffc07c0f68>] ? iscsit_remove_cmd_from_response_queue+0xe8/0x120 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248817] [<ffffffffc0767cf5>] transport_generic_free_cmd+0xc5/0xe0 [target_core_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248825] [<ffffffffc07cb25b>] ? iscsit_close_connection+0x3ab/0x660 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248833] [<ffffffffc07c2266>] iscsit_free_cmd+0x96/0x160 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248840] [<ffffffffc07cb274>] iscsit_close_connection+0x3c4/0x660 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248847] [<ffffffffc07b69b3>] iscsit_take_action_for_connection_exit+0x83/0x110 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248855] [<ffffffffc07ca33e>] iscsi_target_rx_thread+0x22e/0x320 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248862] [<ffffffffc07ca110>] ? iscsi_target_tx_thread+0x220/0x220 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248865] [<ffffffff81095e29>] kthread+0xc9/0xe0
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248869] [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248872] [<ffffffff817d3718>] ret_from_fork+0x58/0x90
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248875] [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
>>> Apr 19 03:31:29 roc-4r-scd212 kernel: [74573.190977] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [iscsi_trx:13963]
>>> Apr 19 03:31:29 roc-4r-scd212 kernel: [74573.191023] Modules linked
>>>
>>> Note there where a couple bugs fixed last year involving iscsi-target
>>> session reinstatement + target aborted_task + iblock backend holding
>>> onto bios for a long time.  The series for v3.16-rc1 is here:
>>>
>>> [PATCH 0/3] target: Fix ABORT_TASK response + session reset hang
>>> http://www.spinics.net/lists/target-devel/msg07060.html
>>>
>>> AFAICT from Robert + Alex's log this is the same type of scenario, and
>>> I'm pretty sure I was hitting the same login timeout handler back then,
>>> and was able to survive at least with iblock + scsi_debug backend.
>>>
>>> Give that ceph and enhancedio backend are involved, it's not completely
>>> clear yet if this is a target specific issue or not..
>>
>> EnhanceIO was not active on any of the volumes at the time of this
>> crash.  We also had another crash like this just recently, and same
>> pattern, also not using eio.  The root cause was definitely a slow
>> ceph response due to an XFS issue on an OSD node, but the target node
>> showed the same symptoms.
> 
> I still don't think this is a target specific issue, but will verify
> this week using a non ceph backend.
> 
> Looking around, you're not the only person who's seeing ceph client RCU
> stall issues:
> 
> http://www.spinics.net/lists/stable/msg74074.html
> 
> I've no idea if those commits actually resolves the issue wrt ceph (they
> are present in v3.19.y btw), so I'd recommend asking around the ceph
> list to verify if there is something else required for ceph atop v3.19.y
> to address the rcu stalls your observing.
> 

Just to be clear, there are 2 issues. One is a ceph/rbd hang. The other
issue is the LIO crash when trying to recovery from the hang. For that I
can replicate it with scsi_debug.
--
To unsubscribe from this list: send the line "unsubscribe target-devel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html




[Index of Archives]     [Linux SCSI]     [Kernel Newbies]     [Linux SCSI Target Infrastructure]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Device Mapper]

  Powered by Linux