Hi Robert, On Wed, 2016-10-19 at 10:41 -0600, Robert LeBlanc wrote: > Nicholas, > > I didn't have high hopes for the patch because we were not seeing > TMR_ABORT_TASK (or 'abort') in dmesg or /var/log/messages, but it > seemed to help regardless. Our clients finally OOMed from the hung > sessions, so we are having to reboot them and we will do some more > testing. We haven't put the updated kernel on our clients yet. Our > clients have iSCSI root disks so I'm not sure if we can get a vmcore > on those, but we will do what we can to get you a vmcore from the > target if it happens again. > Just checking in to see if you've observed further issues with iser-target ports, and/or able to generate a crashdump with v4.4.y..? > As far as our configuration: It is a superMicro box with 6 SAMSUNG > MZ7LM3T8HCJM-00005 SSDs. Two are for root and four are in mdadm > RAID-10 for exporting via iSCSI/iSER. We have ZFS on top of the > RAID-10 for checksum and snapshots only and we export ZVols to the > clients (one or more per VM on the client). We do not persist the > export info (targetcli saveconfig), but regenerate it from scripts. > The client receives two or more of these exports and puts them in a > RAID-1 device. The exports are served by iSER one one port and also by > normal iSCSI on a different port for compatibility, but not normally > used. If you need more info about the config, please let me know. It > was kind of a vague request so I'm not sure what exactly is important > to you. Thanks for the extra details of your hardware + user-space configuration. > Thanks for helping us with this, > Robert LeBlanc > > When we have problems, we usually see this in the logs: > Oct 17 08:57:50 prv-0-12-sanstack kernel: iSCSI Login timeout on > Network Portal 0.0.0.0:3260 > Oct 17 08:57:50 prv-0-12-sanstack kernel: Unexpected ret: -104 send data 48 > Oct 17 08:57:50 prv-0-12-sanstack kernel: tx_data returned -32, expecting 48. > Oct 17 08:57:50 prv-0-12-sanstack kernel: iSCSI Login negotiation failed. > > I found some backtraces in the logs, not sure if this is helpful, this > is before your patch (your patch booted at Oct 18 10:36:59): > Oct 17 15:43:12 prv-0-12-sanstack kernel: INFO: rcu_sched > self-detected stall on CPU > Oct 17 15:43:12 prv-0-12-sanstack kernel: #0115-...: (41725 ticks this > GP) idle=b59/140000000000001/0 softirq=535/535 fqs=30992 > Oct 17 15:43:12 prv-0-12-sanstack kernel: #011 (t=42006 jiffies g=1550 > c=1549 q=0) > Oct 17 15:43:12 prv-0-12-sanstack kernel: Task dump for CPU 5: > Oct 17 15:43:12 prv-0-12-sanstack kernel: kworker/u68:2 R running > task 0 17967 2 0x00000008 > Oct 17 15:43:12 prv-0-12-sanstack kernel: Workqueue: isert_comp_wq > isert_cq_work [ib_isert] > Oct 17 15:43:12 prv-0-12-sanstack kernel: ffff883f4c0dca80 > 00000000af8ca7a4 ffff883f7fb43da8 ffffffff810ac83f > Oct 17 15:43:12 prv-0-12-sanstack kernel: 0000000000000005 > ffffffff81adb680 ffff883f7fb43dc0 ffffffff810af179 > Oct 17 15:43:12 prv-0-12-sanstack kernel: 0000000000000006 > ffff883f7fb43df0 ffffffff810e1c10 ffff883f7fb57b80 > Oct 17 15:43:12 prv-0-12-sanstack kernel: Call Trace: > Oct 17 15:43:12 prv-0-12-sanstack kernel: <IRQ> [<ffffffff810ac83f>] > sched_show_task+0xaf/0x110 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810af179>] > dump_cpu_task+0x39/0x40 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810e1c10>] > rcu_dump_cpu_stacks+0x80/0xb0 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810e6040>] > rcu_check_callbacks+0x540/0x820 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810afd51>] ? > account_system_time+0x81/0x110 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810fa9a0>] ? > tick_sched_do_timer+0x50/0x50 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810eb4d9>] > update_process_times+0x39/0x60 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810fa755>] > tick_sched_handle.isra.17+0x25/0x60 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810fa9dd>] > tick_sched_timer+0x3d/0x70 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810ec0c2>] > __hrtimer_run_queues+0x102/0x290 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810ec5a8>] > hrtimer_interrupt+0xa8/0x1a0 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff81052c65>] > local_apic_timer_interrupt+0x35/0x60 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff8172343d>] > smp_apic_timer_interrupt+0x3d/0x50 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff817216f7>] > apic_timer_interrupt+0x87/0x90 > Oct 17 15:43:12 prv-0-12-sanstack kernel: <EOI> [<ffffffff810d70fe>] > ? console_unlock+0x41e/0x4e0 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810d74bc>] > vprintk_emit+0x2fc/0x500 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810d783f>] > vprintk_default+0x1f/0x30 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff81174c2a>] printk+0x5d/0x74 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff814bc351>] > transport_lookup_cmd_lun+0x1d1/0x200 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff814edcf0>] > iscsit_setup_scsi_cmd+0x230/0x540 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffffa0890bf3>] > isert_rx_do_work+0x3f3/0x7f0 [ib_isert] > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffffa0891174>] > isert_cq_work+0x184/0x770 [ib_isert] > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff8109734f>] > process_one_work+0x14f/0x400 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff81097bc4>] > worker_thread+0x114/0x470 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff8171c55a>] ? > __schedule+0x34a/0x7f0 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff81097ab0>] ? > rescuer_thread+0x310/0x310 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff8109d708>] kthread+0xd8/0xf0 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff8109d630>] ? > kthread_park+0x60/0x60 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff81720c8f>] > ret_from_fork+0x3f/0x70 > Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff8109d630>] ? > kthread_park+0x60/0x60 > > Oct 17 16:34:03 prv-0-12-sanstack kernel: INFO: rcu_sched > self-detected stall on CPU > Oct 17 16:34:03 prv-0-12-sanstack kernel: #01128-...: (5999 ticks this > GP) idle=2f9/140000000000001/0 softirq=457/457 fqs=4830 > Oct 17 16:34:03 prv-0-12-sanstack kernel: #011 (t=6000 jiffies g=3546 > c=3545 q=0) > Oct 17 16:34:03 prv-0-12-sanstack kernel: Task dump for CPU 28: > Oct 17 16:34:03 prv-0-12-sanstack kernel: iscsi_np R running > task 0 16597 2 0x0000000c > Oct 17 16:34:03 prv-0-12-sanstack kernel: ffff887f40350000 > 00000000b98a67bb ffff887f7f503da8 ffffffff810ac8ff > Oct 17 16:34:03 prv-0-12-sanstack kernel: 000000000000001c > ffffffff81adb680 ffff887f7f503dc0 ffffffff810af239 > Oct 17 16:34:03 prv-0-12-sanstack kernel: 000000000000001d > ffff887f7f503df0 ffffffff810e1cd0 ffff887f7f517b80 > Oct 17 16:34:03 prv-0-12-sanstack kernel: Call Trace: > Oct 17 16:34:03 prv-0-12-sanstack kernel: <IRQ> [<ffffffff810ac8ff>] > sched_show_task+0xaf/0x110 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810af239>] > dump_cpu_task+0x39/0x40 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810e1cd0>] > rcu_dump_cpu_stacks+0x80/0xb0 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810e6100>] > rcu_check_callbacks+0x540/0x820 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810afe11>] ? > account_system_time+0x81/0x110 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810faa60>] ? > tick_sched_do_timer+0x50/0x50 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810eb599>] > update_process_times+0x39/0x60 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810fa815>] > tick_sched_handle.isra.17+0x25/0x60 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810faa9d>] > tick_sched_timer+0x3d/0x70 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810ec182>] > __hrtimer_run_queues+0x102/0x290 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810ec668>] > hrtimer_interrupt+0xa8/0x1a0 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff81052c65>] > local_apic_timer_interrupt+0x35/0x60 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff81723cbd>] > smp_apic_timer_interrupt+0x3d/0x50 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff81721f77>] > apic_timer_interrupt+0x87/0x90 > Oct 17 16:34:03 prv-0-12-sanstack kernel: <EOI> [<ffffffff810d71be>] > ? console_unlock+0x41e/0x4e0 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810d757c>] > vprintk_emit+0x2fc/0x500 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810d78ff>] > vprintk_default+0x1f/0x30 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff81174dde>] printk+0x5d/0x74 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff814e71ad>] > iscsi_target_locate_portal+0x62d/0x6f0 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff814e5100>] > iscsi_target_login_thread+0x6f0/0xfc0 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff814e4a10>] ? > iscsi_target_login_sess_out+0x250/0x250 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff8109d7c8>] kthread+0xd8/0xf0 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff8109d6f0>] ? > kthread_park+0x60/0x60 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff8172150f>] > ret_from_fork+0x3f/0x70 > Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff8109d6f0>] ? > kthread_park+0x60/0x60 > > I don't think this one is related, but it happened a couple of times: > Oct 17 11:46:52 prv-0-12-sanstack kernel: INFO: rcu_sched > self-detected stall on CPU > Oct 17 11:46:52 prv-0-12-sanstack kernel: #01119-...: (5999 ticks this > GP) idle=727/140000000000001/0 softirq=1346/1346 fqs=4990 > Oct 17 11:46:52 prv-0-12-sanstack kernel: #011 (t=6000 jiffies g=4295 > c=4294 q=0) > Oct 17 11:46:52 prv-0-12-sanstack kernel: Task dump for CPU 19: > Oct 17 11:46:52 prv-0-12-sanstack kernel: kworker/19:1 R running > task 0 301 2 0x00000008 > Oct 17 11:46:52 prv-0-12-sanstack kernel: Workqueue: > events_power_efficient fb_flashcursor > Oct 17 11:46:52 prv-0-12-sanstack kernel: ffff883f6009ca80 > 00000000010a7cdd ffff883f7fcc3da8 ffffffff810ac8ff > Oct 17 11:46:52 prv-0-12-sanstack kernel: 0000000000000013 > ffffffff81adb680 ffff883f7fcc3dc0 ffffffff810af239 > Oct 17 11:46:52 prv-0-12-sanstack kernel: 0000000000000014 > ffff883f7fcc3df0 ffffffff810e1cd0 ffff883f7fcd7b80 > Oct 17 11:46:52 prv-0-12-sanstack kernel: Call Trace: > Oct 17 11:46:52 prv-0-12-sanstack kernel: <IRQ> [<ffffffff810ac8ff>] > sched_show_task+0xaf/0x110 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810af239>] > dump_cpu_task+0x39/0x40 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810e1cd0>] > rcu_dump_cpu_stacks+0x80/0xb0 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810e6100>] > rcu_check_callbacks+0x540/0x820 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810afe11>] ? > account_system_time+0x81/0x110 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810faa60>] ? > tick_sched_do_timer+0x50/0x50 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810eb599>] > update_process_times+0x39/0x60 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810fa815>] > tick_sched_handle.isra.17+0x25/0x60 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810faa9d>] > tick_sched_timer+0x3d/0x70 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810ec182>] > __hrtimer_run_queues+0x102/0x290 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810ec668>] > hrtimer_interrupt+0xa8/0x1a0 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff81052c65>] > local_apic_timer_interrupt+0x35/0x60 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff81723cbd>] > smp_apic_timer_interrupt+0x3d/0x50 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff81721f77>] > apic_timer_interrupt+0x87/0x90 > Oct 17 11:46:52 prv-0-12-sanstack kernel: <EOI> [<ffffffff810d71be>] > ? console_unlock+0x41e/0x4e0 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff813866ad>] > fb_flashcursor+0x5d/0x140 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8138bc00>] ? > bit_clear+0x110/0x110 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8109740f>] > process_one_work+0x14f/0x400 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff81097c84>] > worker_thread+0x114/0x470 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8171cdda>] ? > __schedule+0x34a/0x7f0 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff81097b70>] ? > rescuer_thread+0x310/0x310 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8109d7c8>] kthread+0xd8/0xf0 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8109d6f0>] ? > kthread_park+0x60/0x60 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8172150f>] > ret_from_fork+0x3f/0x70 > Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8109d6f0>] ? > kthread_park+0x60/0x60 RCU self-detected schedule stalls typically mean some code is monopolizing execution on a specific CPU for an extended period of time (eg: endless loop), preventing normal RCU grace-period callbacks from running in a timely manner. It's hard to tell without more log context and/or crashdump what was going on here. -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html