On 05/04/2015 12:54 AM, Nicholas A. Bellinger wrote:
> On Sun, 2015-05-03 at 16:58 -0400, Robert Wood wrote:
>> Hi Nicholas,
>>
>> On Thu, Apr 30, 2015 at 2:47 AM, Nicholas A. Bellinger
>> <nab@xxxxxxxxxxxxxxx> wrote:
>>> Hi Robert, Alex, & Mike,
>>>
>>> On Thu, 2015-04-23 at 15:23 -0500, Mike Christie wrote:
>>>> I think I can replicate the same problem:
>>>>
>>>> I do:
>>>>
>>>> 1. run IO from the initiator
>>>> 2. force the backing store device to halt IO
>>>> 3. let the initiator's timeout fire and its error handler run
>>>> 4. wait until the initiator tries to relogin and let "iSCSI Login
>>>> timeout" fire
>>>> 5. let the backing store device execute IO again
>>>>
>>>> I am not certain if #5 is required. I think maybe it oopsed after #4. I
>>>> cannot remember. I have not had time to debug it yet. Maybe someone else
>>>> has seen it.
>>>>
>>>
>>> So looking at the original dmesg, the system appears to be hitting an
>>> rcu stall, but the NMI watchdog for the soft-lockup in iscsi_trx:13963
>>> triggers first when attempting to spin_unlock_bh() ->
>>> __local_bh_enable_ip() from somewhere in iscsit_close_connection() code.
>>>
>>> Apr 19 03:30:02 roc-4r-scd212 kernel: [74486.177488] ABORT_TASK: Sending TMR_FUNCTION_COMPLETE for ref_tag: 9072
>>> Apr 19 03:30:10 roc-4r-scd212 kernel: [74493.812492] Unexpected ret: -32 send data 48
>>> Apr 19 03:30:12 roc-4r-scd212 kernel: [74495.813309] TARGET_CORE[iSCSI]: Detected NON_EXISTENT_LUN Access for 0x00000015
>>> Apr 19 03:30:12 roc-4r-scd212 kernel: [74495.987564] libceph: osd25 10.80.3.30:6809 socket closed (con state OPEN)
>>> Apr 19 03:30:15 roc-4r-scd212 kernel: [74498.326272] Unexpected ret: -32 send data 48
>>> Apr 19 03:30:17 roc-4r-scd212 kernel: [74500.983349] iSCSI Login timeout on Network Portal 10.70.2.211:3260
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.167913] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [iscsi_trx:13963]
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.167959] Modules linked
>>> in: target_core_user uio rbd libceph libcrc32c iscsi_target_mod
>>> target_core_file target_core_pscsi target_core_iblock target_core_mod
>>> configfs xt_multiport iptable_filter ip_tables x_tables
>>> enhanceio_rand(OE) enhanceio_lru(OE) enhanceio_fifo(OE) enhanceio(OE)
>>> ipmi_devintf ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp
>>> kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel
>>> aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac joydev
>>> edac_core mei_me mei lpc_ich ioatdma ipmi_si ipmi_msghandler
>>> 8250_fintek wmi mac_hid 8021q garp mrp stp llc bonding lp parport nfsd
>>> auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache mlx4_en vxlan
>>> ip6_udp_tunnel udp_tunnel hid_generic igb usbhid mpt2sas i2c_algo_bit
>>> ahci hid dca mlx4_core raid_class ptp libahci scsi_transport_sas
>>> pps_core
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168026] CPU: 9 PID: 13963 Comm: iscsi_trx Tainted: G C OE 3.19.4-031904-generic #201504131440
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168028] Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168030] task: ffff8808591a1d70 ti: ffff880859620000 task.ti: ffff880859620000
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168032] RIP: 0010:[<ffffffff8107ae70>] [<ffffffff8107ae70>] __local_bh_enable_ip+0x0/0x90
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168041] RSP: 0018:ffff880859623d70 EFLAGS: 00000216
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168043] RAX: ffff88104ae4de80 RBX: ffff8810582a5000 RCX: ffff88104ae4de80
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168044] RDX: ffff88104ae4de80 RSI: 0000000000000200 RDI: ffffffffc07cb25b
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168046] RBP: ffff880859623d78 R08: ffff88104ae4df50 R09: 0000000000000101
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168048] R10: 0000000000000001 R11: dead000000200200 R12: ffffffffc07c0f68
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168049] R13: ffff880859623d08 R14: ffffffff817d2e30 R15: ffff880859623cd8
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168052] FS: 0000000000000000(0000) GS:ffff88107fc60000(0000) knlGS:0000000000000000
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168054] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168055] CR2: 00007f8e74ed7e60 CR3: 0000000001c15000 CR4: 00000000001407e0
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168057] Stack:
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168059] ffffffff817d2e30 ffff880859623dd8 ffffffffc07cb25b 0000000000000000
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168062] ffff880842942000 ffff8810582a53e0 ffff8810582a5440 ffff8808591a1d70
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168065] ffff8810582a5000 ffff8810582a53f4 ffff880859623e4c ffff8808591a1d70
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168069] Call Trace:
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168076] [<ffffffff817d2e30>] ? _raw_spin_unlock_bh+0x20/0x50
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168094] [<ffffffffc07cb25b>] iscsit_close_connection+0x3ab/0x660 [iscsi_target_mod]
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168102] [<ffffffffc07b69b3>] iscsit_take_action_for_connection_exit+0x83/0x110 [iscsi_target_mod]
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168111] [<ffffffffc07ca33e>] iscsi_target_rx_thread+0x22e/0x320 [iscsi_target_mod]
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168119] [<ffffffffc07ca110>] ? iscsi_target_tx_thread+0x220/0x220 [iscsi_target_mod]
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168124] [<ffffffff81095e29>] kthread+0xc9/0xe0
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168128] [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168132] [<ffffffff817d3718>] ret_from_fork+0x58/0x90
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168135] [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
>>> Apr 19 03:30:29 roc-4r-scd212 kernel: [74513.168136] Code: 1f 44 00 00
>>> 65 8b 05 70 64 f9 7e 85 c0 75 14 48 89 df 57 9d 0f 1f 44 00 00 48 83
>>> c4 08 5b 5d c3 0f 1f 00 e8 b3 a6 75 00 eb e5 90 <0f> 1f 44 00 00 55 48
>>> 89 e5 53 89 f3 48 83 ec 08 65 8b 05 19 0a
>>>
>>> and then a bit further along:
>>>
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248624] INFO: rcu_sched self-detected stall on CPU { 9} (t=15000 jiffies g=846173 c=846172 q=0)
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248684] Task dump for CPU 9:
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248686] iscsi_trx R running task 0 13963 2 0x0000000c
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248689] ffffffff81c54cc0 ffff88107fc63d18 ffffffff810a3444 0000000000000009
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248692] ffffffff81c54cc0 ffff88107fc63d38 ffffffff810a60bf 0000000000000087
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248695] ffffffff81c54dc0 ffff88107fc63d68 ffffffff810d68fb ffffffff81c54cc0
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248699] Call Trace:
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248700] <IRQ> [<ffffffff810a3444>] sched_show_task+0xb4/0x120
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248710] [<ffffffff810a60bf>] dump_cpu_task+0x3f/0x50
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248717] [<ffffffff810d68fb>] rcu_dump_cpu_stacks+0x8b/0xc0
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248721] [<ffffffff810d84d7>] print_cpu_stall+0xd7/0x140
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248725] [<ffffffff810d8ef5>] check_cpu_stall.isra.55+0xa5/0xb0
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248728] [<ffffffff810d8f38>] __rcu_pending+0x38/0x170
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248732] [<ffffffff810daa8d>] rcu_check_callbacks+0x9d/0x190
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248735] [<ffffffff810dfdf8>] update_process_times+0x38/0x70
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248742] [<ffffffff810efab3>] tick_sched_handle.isra.12+0x33/0x70
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248745] [<ffffffff810efbeb>] tick_sched_timer+0x4b/0x80
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248749] [<ffffffff810e08b6>] __run_hrtimer+0x76/0x230
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248752] [<ffffffff810efba0>] ? tick_nohz_handler+0xb0/0xb0
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248756] [<ffffffff810e10bf>] hrtimer_interrupt+0xff/0x260
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248763] [<ffffffff8104c319>] local_apic_timer_interrupt+0x39/0x60
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248768] [<ffffffff817d6805>] smp_apic_timer_interrupt+0x45/0x60
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248772] [<ffffffff817d487d>] apic_timer_interrupt+0x6d/0x80
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248773] <EOI> [<ffffffffc0765d41>] ? transport_wait_for_tasks+0xc1/0x150 [target_core_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248798] [<ffffffffc0765d3b>] ? transport_wait_for_tasks+0xbb/0x150 [target_core_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248806] [<ffffffffc07c0f68>] ? iscsit_remove_cmd_from_response_queue+0xe8/0x120 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248817] [<ffffffffc0767cf5>] transport_generic_free_cmd+0xc5/0xe0 [target_core_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248825] [<ffffffffc07cb25b>] ? iscsit_close_connection+0x3ab/0x660 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248833] [<ffffffffc07c2266>] iscsit_free_cmd+0x96/0x160 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248840] [<ffffffffc07cb274>] iscsit_close_connection+0x3c4/0x660 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248847] [<ffffffffc07b69b3>] iscsit_take_action_for_connection_exit+0x83/0x110 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248855] [<ffffffffc07ca33e>] iscsi_target_rx_thread+0x22e/0x320 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248862] [<ffffffffc07ca110>] ? iscsi_target_tx_thread+0x220/0x220 [iscsi_target_mod]
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248865] [<ffffffff81095e29>] kthread+0xc9/0xe0
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248869] [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248872] [<ffffffff817d3718>] ret_from_fork+0x58/0x90
>>> Apr 19 03:31:02 roc-4r-scd212 kernel: [74546.248875] [<ffffffff81095d60>] ? flush_kthread_worker+0x90/0x90
>>> Apr 19 03:31:29 roc-4r-scd212 kernel: [74573.190977] NMI watchdog: BUG: soft lockup - CPU#9 stuck for 22s! [iscsi_trx:13963]
>>> Apr 19 03:31:29 roc-4r-scd212 kernel: [74573.191023] Modules linked
>>>
>>> Note there were a couple of bugs fixed last year involving iscsi-target
>>> session reinstatement + target aborted_task + iblock backend holding
>>> onto bios for a long time. The series for v3.16-rc1 is here:
>>>
>>> [PATCH 0/3] target: Fix ABORT_TASK response + session reset hang
>>> http://www.spinics.net/lists/target-devel/msg07060.html
>>>
>>> AFAICT from Robert + Alex's log this is the same type of scenario, and
>>> I'm pretty sure I was hitting the same login timeout handler back then,
>>> and was able to survive at least with iblock + scsi_debug backend.
>>>
>>> Given that the ceph and enhanceio backends are involved, it's not completely
>>> clear yet if this is a target-specific issue or not.
>>
>> EnhanceIO was not active on any of the volumes at the time of this
>> crash. We also had another crash like this just recently, with the same
>> pattern, also not using eio. The root cause was definitely a slow
>> ceph response due to an XFS issue on an OSD node, but the target node
>> showed the same symptoms.
>
> I still don't think this is a target-specific issue, but will verify
> this week using a non-ceph backend.
>
> Looking around, you're not the only person who's seeing ceph client RCU
> stall issues:
>
> http://www.spinics.net/lists/stable/msg74074.html
>
> I've no idea if those commits actually resolve the issue wrt ceph (they
> are present in v3.19.y btw), so I'd recommend asking around on the ceph
> list to verify if there is something else required for ceph atop v3.19.y
> to address the rcu stalls you're observing.
>

Just to be clear, there are 2 issues. One is a ceph/rbd hang. The other
issue is the LIO crash when trying to recover from the hang. For that I
can replicate it with scsi_debug.
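
[Editor's note: a rough sketch of one plausible way to stage the "force the
backing store to halt IO" step of the reproduction above with scsi_debug.
The backstore name, IQN, /dev/sdX device node, and the exact scsi_debug
knob values are illustrative assumptions, not taken from this thread.]

  # Back an LIO LUN with a scsi_debug disk (names below are placeholders).
  modprobe scsi_debug dev_size_mb=256
  # Find the disk scsi_debug created (check lsscsi or dmesg), e.g. /dev/sdX.
  targetcli /backstores/block create name=dbg dev=/dev/sdX
  targetcli /iscsi create iqn.2003-01.org.example:dbg
  targetcli /iscsi/iqn.2003-01.org.example:dbg/tpg1/luns create /backstores/block/dbg
  # ...plus the usual portal/ACL setup, then log in from the initiator and run IO.

  # Step 2: make scsi_debug stop completing commands. opts bit 0x4 ("timeout")
  # tells scsi_debug to ignore the commands selected by every_nth, so they
  # hang until the initiator's SCSI timeout fires; every_nth=1 selects all.
  echo 1 > /sys/bus/pseudo/drivers/scsi_debug/every_nth
  echo 4 > /sys/bus/pseudo/drivers/scsi_debug/opts

  # Steps 3-4: wait for the initiator's error handler and relogin attempts,
  # until "iSCSI Login timeout" shows up on the target.

  # Step 5: let the backing device complete IO again.
  echo 0 > /sys/bus/pseudo/drivers/scsi_debug/opts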