Nicholas, I didn't have high hopes for the patch because we were not seeing TMR_ABORT_TASK (or 'abort') in dmesg or /var/log/messages, but it seemed to help regardless. Our clients finally OOMed from the hung sessions, so we are having to reboot them and we will do some more testing. We haven't put the updated kernel on our clients yet. Our clients have iSCSI root disks so I'm not sure if we can get a vmcore on those, but we will do what we can to get you a vmcore from the target if it happens again. As far as our configuration: It is a superMicro box with 6 SAMSUNG MZ7LM3T8HCJM-00005 SSDs. Two are for root and four are in mdadm RAID-10 for exporting via iSCSI/iSER. We have ZFS on top of the RAID-10 for checksum and snapshots only and we export ZVols to the clients (one or more per VM on the client). We do not persist the export info (targetcli saveconfig), but regenerate it from scripts. The client receives two or more of these exports and puts them in a RAID-1 device. The exports are served by iSER one one port and also by normal iSCSI on a different port for compatibility, but not normally used. If you need more info about the config, please let me know. It was kind of a vague request so I'm not sure what exactly is important to you. Thanks for helping us with this, Robert LeBlanc When we have problems, we usually see this in the logs: Oct 17 08:57:50 prv-0-12-sanstack kernel: iSCSI Login timeout on Network Portal 0.0.0.0:3260 Oct 17 08:57:50 prv-0-12-sanstack kernel: Unexpected ret: -104 send data 48 Oct 17 08:57:50 prv-0-12-sanstack kernel: tx_data returned -32, expecting 48. Oct 17 08:57:50 prv-0-12-sanstack kernel: iSCSI Login negotiation failed. I found some backtraces in the logs, not sure if this is helpful, this is before your patch (your patch booted at Oct 18 10:36:59): Oct 17 15:43:12 prv-0-12-sanstack kernel: INFO: rcu_sched self-detected stall on CPU Oct 17 15:43:12 prv-0-12-sanstack kernel: #0115-...: (41725 ticks this GP) idle=b59/140000000000001/0 softirq=535/535 fqs=30992 Oct 17 15:43:12 prv-0-12-sanstack kernel: #011 (t=42006 jiffies g=1550 c=1549 q=0) Oct 17 15:43:12 prv-0-12-sanstack kernel: Task dump for CPU 5: Oct 17 15:43:12 prv-0-12-sanstack kernel: kworker/u68:2 R running task 0 17967 2 0x00000008 Oct 17 15:43:12 prv-0-12-sanstack kernel: Workqueue: isert_comp_wq isert_cq_work [ib_isert] Oct 17 15:43:12 prv-0-12-sanstack kernel: ffff883f4c0dca80 00000000af8ca7a4 ffff883f7fb43da8 ffffffff810ac83f Oct 17 15:43:12 prv-0-12-sanstack kernel: 0000000000000005 ffffffff81adb680 ffff883f7fb43dc0 ffffffff810af179 Oct 17 15:43:12 prv-0-12-sanstack kernel: 0000000000000006 ffff883f7fb43df0 ffffffff810e1c10 ffff883f7fb57b80 Oct 17 15:43:12 prv-0-12-sanstack kernel: Call Trace: Oct 17 15:43:12 prv-0-12-sanstack kernel: <IRQ> [<ffffffff810ac83f>] sched_show_task+0xaf/0x110 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810af179>] dump_cpu_task+0x39/0x40 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810e1c10>] rcu_dump_cpu_stacks+0x80/0xb0 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810e6040>] rcu_check_callbacks+0x540/0x820 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810afd51>] ? account_system_time+0x81/0x110 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810fa9a0>] ? tick_sched_do_timer+0x50/0x50 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810eb4d9>] update_process_times+0x39/0x60 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810fa755>] tick_sched_handle.isra.17+0x25/0x60 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810fa9dd>] tick_sched_timer+0x3d/0x70 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810ec0c2>] __hrtimer_run_queues+0x102/0x290 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810ec5a8>] hrtimer_interrupt+0xa8/0x1a0 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff81052c65>] local_apic_timer_interrupt+0x35/0x60 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff8172343d>] smp_apic_timer_interrupt+0x3d/0x50 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff817216f7>] apic_timer_interrupt+0x87/0x90 Oct 17 15:43:12 prv-0-12-sanstack kernel: <EOI> [<ffffffff810d70fe>] ? console_unlock+0x41e/0x4e0 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810d74bc>] vprintk_emit+0x2fc/0x500 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff810d783f>] vprintk_default+0x1f/0x30 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff81174c2a>] printk+0x5d/0x74 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff814bc351>] transport_lookup_cmd_lun+0x1d1/0x200 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff814edcf0>] iscsit_setup_scsi_cmd+0x230/0x540 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffffa0890bf3>] isert_rx_do_work+0x3f3/0x7f0 [ib_isert] Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffffa0891174>] isert_cq_work+0x184/0x770 [ib_isert] Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff8109734f>] process_one_work+0x14f/0x400 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff81097bc4>] worker_thread+0x114/0x470 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff8171c55a>] ? __schedule+0x34a/0x7f0 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff81097ab0>] ? rescuer_thread+0x310/0x310 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff8109d708>] kthread+0xd8/0xf0 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff8109d630>] ? kthread_park+0x60/0x60 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff81720c8f>] ret_from_fork+0x3f/0x70 Oct 17 15:43:12 prv-0-12-sanstack kernel: [<ffffffff8109d630>] ? kthread_park+0x60/0x60 Oct 17 16:34:03 prv-0-12-sanstack kernel: INFO: rcu_sched self-detected stall on CPU Oct 17 16:34:03 prv-0-12-sanstack kernel: #01128-...: (5999 ticks this GP) idle=2f9/140000000000001/0 softirq=457/457 fqs=4830 Oct 17 16:34:03 prv-0-12-sanstack kernel: #011 (t=6000 jiffies g=3546 c=3545 q=0) Oct 17 16:34:03 prv-0-12-sanstack kernel: Task dump for CPU 28: Oct 17 16:34:03 prv-0-12-sanstack kernel: iscsi_np R running task 0 16597 2 0x0000000c Oct 17 16:34:03 prv-0-12-sanstack kernel: ffff887f40350000 00000000b98a67bb ffff887f7f503da8 ffffffff810ac8ff Oct 17 16:34:03 prv-0-12-sanstack kernel: 000000000000001c ffffffff81adb680 ffff887f7f503dc0 ffffffff810af239 Oct 17 16:34:03 prv-0-12-sanstack kernel: 000000000000001d ffff887f7f503df0 ffffffff810e1cd0 ffff887f7f517b80 Oct 17 16:34:03 prv-0-12-sanstack kernel: Call Trace: Oct 17 16:34:03 prv-0-12-sanstack kernel: <IRQ> [<ffffffff810ac8ff>] sched_show_task+0xaf/0x110 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810af239>] dump_cpu_task+0x39/0x40 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810e1cd0>] rcu_dump_cpu_stacks+0x80/0xb0 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810e6100>] rcu_check_callbacks+0x540/0x820 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810afe11>] ? account_system_time+0x81/0x110 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810faa60>] ? tick_sched_do_timer+0x50/0x50 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810eb599>] update_process_times+0x39/0x60 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810fa815>] tick_sched_handle.isra.17+0x25/0x60 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810faa9d>] tick_sched_timer+0x3d/0x70 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810ec182>] __hrtimer_run_queues+0x102/0x290 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810ec668>] hrtimer_interrupt+0xa8/0x1a0 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff81052c65>] local_apic_timer_interrupt+0x35/0x60 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff81723cbd>] smp_apic_timer_interrupt+0x3d/0x50 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff81721f77>] apic_timer_interrupt+0x87/0x90 Oct 17 16:34:03 prv-0-12-sanstack kernel: <EOI> [<ffffffff810d71be>] ? console_unlock+0x41e/0x4e0 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810d757c>] vprintk_emit+0x2fc/0x500 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff810d78ff>] vprintk_default+0x1f/0x30 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff81174dde>] printk+0x5d/0x74 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff814e71ad>] iscsi_target_locate_portal+0x62d/0x6f0 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff814e5100>] iscsi_target_login_thread+0x6f0/0xfc0 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff814e4a10>] ? iscsi_target_login_sess_out+0x250/0x250 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff8109d7c8>] kthread+0xd8/0xf0 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff8109d6f0>] ? kthread_park+0x60/0x60 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff8172150f>] ret_from_fork+0x3f/0x70 Oct 17 16:34:03 prv-0-12-sanstack kernel: [<ffffffff8109d6f0>] ? kthread_park+0x60/0x60 I don't think this one is related, but it happened a couple of times: Oct 17 11:46:52 prv-0-12-sanstack kernel: INFO: rcu_sched self-detected stall on CPU Oct 17 11:46:52 prv-0-12-sanstack kernel: #01119-...: (5999 ticks this GP) idle=727/140000000000001/0 softirq=1346/1346 fqs=4990 Oct 17 11:46:52 prv-0-12-sanstack kernel: #011 (t=6000 jiffies g=4295 c=4294 q=0) Oct 17 11:46:52 prv-0-12-sanstack kernel: Task dump for CPU 19: Oct 17 11:46:52 prv-0-12-sanstack kernel: kworker/19:1 R running task 0 301 2 0x00000008 Oct 17 11:46:52 prv-0-12-sanstack kernel: Workqueue: events_power_efficient fb_flashcursor Oct 17 11:46:52 prv-0-12-sanstack kernel: ffff883f6009ca80 00000000010a7cdd ffff883f7fcc3da8 ffffffff810ac8ff Oct 17 11:46:52 prv-0-12-sanstack kernel: 0000000000000013 ffffffff81adb680 ffff883f7fcc3dc0 ffffffff810af239 Oct 17 11:46:52 prv-0-12-sanstack kernel: 0000000000000014 ffff883f7fcc3df0 ffffffff810e1cd0 ffff883f7fcd7b80 Oct 17 11:46:52 prv-0-12-sanstack kernel: Call Trace: Oct 17 11:46:52 prv-0-12-sanstack kernel: <IRQ> [<ffffffff810ac8ff>] sched_show_task+0xaf/0x110 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810af239>] dump_cpu_task+0x39/0x40 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810e1cd0>] rcu_dump_cpu_stacks+0x80/0xb0 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810e6100>] rcu_check_callbacks+0x540/0x820 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810afe11>] ? account_system_time+0x81/0x110 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810faa60>] ? tick_sched_do_timer+0x50/0x50 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810eb599>] update_process_times+0x39/0x60 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810fa815>] tick_sched_handle.isra.17+0x25/0x60 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810faa9d>] tick_sched_timer+0x3d/0x70 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810ec182>] __hrtimer_run_queues+0x102/0x290 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff810ec668>] hrtimer_interrupt+0xa8/0x1a0 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff81052c65>] local_apic_timer_interrupt+0x35/0x60 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff81723cbd>] smp_apic_timer_interrupt+0x3d/0x50 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff81721f77>] apic_timer_interrupt+0x87/0x90 Oct 17 11:46:52 prv-0-12-sanstack kernel: <EOI> [<ffffffff810d71be>] ? console_unlock+0x41e/0x4e0 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff813866ad>] fb_flashcursor+0x5d/0x140 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8138bc00>] ? bit_clear+0x110/0x110 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8109740f>] process_one_work+0x14f/0x400 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff81097c84>] worker_thread+0x114/0x470 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8171cdda>] ? __schedule+0x34a/0x7f0 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff81097b70>] ? rescuer_thread+0x310/0x310 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8109d7c8>] kthread+0xd8/0xf0 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8109d6f0>] ? kthread_park+0x60/0x60 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8172150f>] ret_from_fork+0x3f/0x70 Oct 17 11:46:52 prv-0-12-sanstack kernel: [<ffffffff8109d6f0>] ? kthread_park+0x60/0x60 ---------------- Robert LeBlanc PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 On Wed, Oct 19, 2016 at 12:25 AM, Nicholas A. Bellinger <nab@xxxxxxxxxxxxxxx> wrote: > On Tue, 2016-10-18 at 16:13 -0600, Robert LeBlanc wrote: >> Nicholas, >> >> We patched this in and for the first time in many reboots, we didn't >> have iSCSI going straight into D state. We have had to work on a >> couple of other things, so we don't know if this is just a coincidence >> or not. We will reboot back into the old kernel and back a few times >> and do some more testing, but so far it has given us a little bit of >> hope that we may be narrowing down on the root cause. We will report >> back once we have some more info. >> >> Thank you, >> Robert LeBlanc >> ---------------- >> Robert LeBlanc >> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1 >> > > Hello Robert, > > Thanks for the update. Btw, if the original /var/log/messages > reproduction logs for iser-target are still handy, I'm happy to have > a look to confirm. Feel free to send them along here, or off-list if > necessary. > > For further reference, you can also enable Linux kernel crash dump > (LKCD) at build time using CONFIG_CRASH_DUMP=y, so it's possible to > manually generate a vmcore dumpfile of the running system via 'echo c >> /proc/sysrq-trigger', once the bug occurs. > > http://cateee.net/lkddb/web-lkddb/CRASH_DUMP.html > > Note in order to fully debug within this in a LKCD environment, it > requires the vmcore dump from /var/crash/, unstripped vmlinux, > target_core_mod, iscsi_target_mod and ib_isert modules matching the > specific particular x86_64 build setup of the running system. > > Also, can you share a bit more about the details of your particular > iser-target + backend setup..? > -- To unsubscribe from this list: send the line "unsubscribe linux-scsi" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html