Sorry, hit send too soon.

In addition, on the client we see:

# ps -aux | grep D | grep kworker
root 5583 0.0 0.0 0 0 ? D 11:55 0:03 [kworker/11:0]
root 7721 0.1 0.0 0 0 ? D 12:00 0:04 [kworker/4:25]
root 10877 0.0 0.0 0 0 ? D 09:27 0:00 [kworker/22:1]
root 11246 0.0 0.0 0 0 ? D 10:28 0:00 [kworker/30:2]
root 14034 0.0 0.0 0 0 ? D 12:20 0:02 [kworker/19:15]
root 14048 0.0 0.0 0 0 ? D 12:20 0:00 [kworker/16:0]
root 15871 0.0 0.0 0 0 ? D 12:25 0:00 [kworker/13:0]
root 17442 0.0 0.0 0 0 ? D 12:28 0:00 [kworker/9:1]
root 17816 0.0 0.0 0 0 ? D 12:30 0:00 [kworker/11:1]
root 18744 0.0 0.0 0 0 ? D 12:32 0:00 [kworker/10:2]
root 19060 0.0 0.0 0 0 ? D 12:32 0:00 [kworker/29:0]
root 21748 0.0 0.0 0 0 ? D 12:40 0:00 [kworker/21:0]
root 21967 0.0 0.0 0 0 ? D 12:40 0:00 [kworker/22:0]
root 21978 0.0 0.0 0 0 ? D 12:40 0:00 [kworker/22:2]
root 22024 0.0 0.0 0 0 ? D 12:40 0:00 [kworker/22:4]
root 22035 0.0 0.0 0 0 ? D 12:40 0:00 [kworker/22:5]
root 22060 0.0 0.0 0 0 ? D 12:40 0:00 [kworker/16:1]
root 22282 0.0 0.0 0 0 ? D 12:41 0:00 [kworker/26:0]
root 22362 0.0 0.0 0 0 ? D 12:42 0:00 [kworker/18:9]
root 22426 0.0 0.0 0 0 ? D 12:42 0:00 [kworker/16:3]
root 23298 0.0 0.0 0 0 ? D 12:43 0:00 [kworker/12:1]
root 23302 0.0 0.0 0 0 ? D 12:43 0:00 [kworker/12:5]
root 24264 0.0 0.0 0 0 ? D 12:46 0:00 [kworker/30:1]
root 24271 0.0 0.0 0 0 ? D 12:46 0:00 [kworker/14:8]
root 24441 0.0 0.0 0 0 ? D 12:47 0:00 [kworker/9:7]
root 24443 0.0 0.0 0 0 ? D 12:47 0:00 [kworker/9:9]
root 25005 0.0 0.0 0 0 ? D 12:48 0:00 [kworker/30:3]
root 25158 0.0 0.0 0 0 ? D 12:49 0:00 [kworker/9:12]
root 26382 0.0 0.0 0 0 ? D 12:52 0:00 [kworker/13:2]
root 26453 0.0 0.0 0 0 ? D 12:52 0:00 [kworker/21:2]
root 26724 0.0 0.0 0 0 ? D 12:53 0:00 [kworker/19:1]
root 28400 0.0 0.0 0 0 ? D 05:20 0:00 [kworker/25:1]
root 29552 0.0 0.0 0 0 ? D 11:40 0:00 [kworker/17:1]
root 29811 0.0 0.0 0 0 ? D 11:40 0:00 [kworker/7:10]
root 31903 0.0 0.0 0 0 ? D 11:43 0:00 [kworker/26:1]

And all of the processes have this stack:

[<ffffffffa0727ed5>] iser_release_work+0x25/0x60 [ib_iser]
[<ffffffff8109633f>] process_one_work+0x14f/0x400
[<ffffffff81096bb4>] worker_thread+0x114/0x470
[<ffffffff8109c6f8>] kthread+0xd8/0xf0
[<ffffffff8172004f>] ret_from_fork+0x3f/0x70
[<ffffffffffffffff>] 0xffffffffffffffff

We are not able to log out of the sessions in all cases, and have to
restart the box.

iscsiadm -m session will show messages like:

iscsiadm: could not read session targetname: 5
iscsiadm: could not find session info for session100
iscsiadm: could not read session targetname: 5
iscsiadm: could not find session info for session101
iscsiadm: could not read session targetname: 5
iscsiadm: could not find session info for session103
...

I can't find any way to force iscsiadm to clean up these sessions,
possibly due to tasks in D state.
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Mon, Oct 17, 2016 at 10:32 AM, Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:
> Some more info as we hit this this morning. We have volumes mirrored
> between two targets and we had one target on the kernel with the three
> patches mentioned in this thread [0][1][2] and the other was on a
> kernel without the patches. We decided that after a week and a half we
> wanted to get both targets on the same kernel so we rebooted the
> non-patched target. Within an hour we saw iSCSI in D state with the
> same stack trace so it seems that we are not hitting any of the
> WARN_ON lines. We are getting both iscsi_trx and iscsi_np in D
> state; this time we have two iscsi_trx processes in D state. I don't
> know if stale sessions on the clients could be contributing to this
> issue (the target trying to close non-existent sessions??). This is on
> 4.4.23. Any more debug info we can throw at this problem to help?
>
> Thank you,
> Robert LeBlanc
>
> # ps aux | grep D | grep iscsi
> root 16525 0.0 0.0 0 0 ? D 08:50 0:00 [iscsi_np]
> root 16614 0.0 0.0 0 0 ? D 08:50 0:00 [iscsi_trx]
> root 16674 0.0 0.0 0 0 ? D 08:50 0:00 [iscsi_trx]
>
> # for i in 16525 16614 16674; do echo $i; cat /proc/$i/stack; done
> 16525
> [<ffffffff814f0d5f>] iscsit_stop_session+0x19f/0x1d0
> [<ffffffff814e2516>] iscsi_check_for_session_reinstatement+0x1e6/0x270
> [<ffffffff814e4ed0>] iscsi_target_check_for_existing_instances+0x30/0x40
> [<ffffffff814e5020>] iscsi_target_do_login+0x140/0x640
> [<ffffffff814e63bc>] iscsi_target_start_negotiation+0x1c/0xb0
> [<ffffffff814e410b>] iscsi_target_login_thread+0xa9b/0xfc0
> [<ffffffff8109c748>] kthread+0xd8/0xf0
> [<ffffffff8172018f>] ret_from_fork+0x3f/0x70
> [<ffffffffffffffff>] 0xffffffffffffffff
> 16614
> [<ffffffff814cca79>] target_wait_for_sess_cmds+0x49/0x1a0
> [<ffffffffa064692b>] isert_wait_conn+0x1ab/0x2f0 [ib_isert]
> [<ffffffff814f0ef2>] iscsit_close_connection+0x162/0x870
> [<ffffffff814df9bf>] iscsit_take_action_for_connection_exit+0x7f/0x100
> [<ffffffff814f00a0>] iscsi_target_rx_thread+0x5a0/0xe80
> [<ffffffff8109c748>] kthread+0xd8/0xf0
> [<ffffffff8172018f>] ret_from_fork+0x3f/0x70
> [<ffffffffffffffff>] 0xffffffffffffffff
> 16674
> [<ffffffff814cca79>] target_wait_for_sess_cmds+0x49/0x1a0
> [<ffffffffa064692b>] isert_wait_conn+0x1ab/0x2f0 [ib_isert]
> [<ffffffff814f0ef2>] iscsit_close_connection+0x162/0x870
> [<ffffffff814df9bf>] iscsit_take_action_for_connection_exit+0x7f/0x100
> [<ffffffff814f00a0>] iscsi_target_rx_thread+0x5a0/0xe80
> [<ffffffff8109c748>] kthread+0xd8/0xf0
> [<ffffffff8172018f>] ret_from_fork+0x3f/0x70
> [<ffffffffffffffff>] 0xffffffffffffffff
>
>
> [0] https://www.spinics.net/lists/target-devel/msg13463.html
> [1] http://marc.info/?l=linux-scsi&m=147282568910535&w=2
> [2] http://www.spinics.net/lists/linux-scsi/msg100221.html
> ----------------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Oct 7, 2016 at 8:59 PM, Zhu Lingshan <lszhu@xxxxxxxx> wrote:
>> Hi Robert,
>>
>> I also see this issue, but this is not the only code path that can
>> trigger this problem; I think you may also see iscsi_np in D state. I
>> fixed one code path, which is still not merged to mainline. I will
>> forward you my patch later.
>> Note: my patch only fixes one code path, so you may see other call
>> stacks in D state.
>>
>> Thanks,
>> BR
>> Zhu Lingshan
>>
>>
>> On 2016/10/1 1:14, Robert LeBlanc wrote:
>>>
>>> We are having a recurring problem where iscsi_trx is going into D
>>> state. It seems like it is waiting for a session teardown to happen
>>> or something, but keeps waiting. We have to reboot these targets on
>>> occasion. This is running the 4.4.12 kernel and we have seen it on
>>> several previous 4.4.x and 4.2.x kernels. There is no message in dmesg
>>> or /var/log/messages. This seems to happen with increased frequency
>>> when we have a disruption in our InfiniBand fabric, but can happen
>>> without any changes to the fabric (other than hosts rebooting).
>>>
>>> # ps aux | grep iscsi | grep D
>>> root 4185 0.0 0.0 0 0 ? D Sep29 0:00 [iscsi_trx]
>>> root 18505 0.0 0.0 0 0 ? D Sep29 0:00 [iscsi_np]
>>>
>>> # cat /proc/4185/stack
>>> [<ffffffff814cc999>] target_wait_for_sess_cmds+0x49/0x1a0
>>> [<ffffffffa087292b>] isert_wait_conn+0x1ab/0x2f0 [ib_isert]
>>> [<ffffffff814f0de2>] iscsit_close_connection+0x162/0x840
>>> [<ffffffff814df8df>] iscsit_take_action_for_connection_exit+0x7f/0x100
>>> [<ffffffff814effc0>] iscsi_target_rx_thread+0x5a0/0xe80
>>> [<ffffffff8109c6f8>] kthread+0xd8/0xf0
>>> [<ffffffff8172004f>] ret_from_fork+0x3f/0x70
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> # cat /proc/18505/stack
>>> [<ffffffff814f0c71>] iscsit_stop_session+0x1b1/0x1c0
>>> [<ffffffff814e2436>] iscsi_check_for_session_reinstatement+0x1e6/0x270
>>> [<ffffffff814e4df0>] iscsi_target_check_for_existing_instances+0x30/0x40
>>> [<ffffffff814e4f40>] iscsi_target_do_login+0x140/0x640
>>> [<ffffffff814e62dc>] iscsi_target_start_negotiation+0x1c/0xb0
>>> [<ffffffff814e402b>] iscsi_target_login_thread+0xa9b/0xfc0
>>> [<ffffffff8109c6f8>] kthread+0xd8/0xf0
>>> [<ffffffff8172004f>] ret_from_fork+0x3f/0x70
>>> [<ffffffffffffffff>] 0xffffffffffffffff
>>>
>>> What can we do to help get this resolved?
>>>
>>> Thanks,
>>>
>>> ----------------
>>> Robert LeBlanc
>>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
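
The diagnostic steps used throughout this thread (find tasks in D state with ps, then cat /proc/<pid>/stack for each one) can be combined into a single helper. This is only a sketch and was not posted by anyone in the thread: the function names d_state_pids and dump_d_stacks are made up for illustration, and it assumes a procps-style ps plus a kernel that exposes /proc/<pid>/stack (which normally requires root). Matching on the STAT column alone avoids the false positives a bare `grep D` can pick up from other columns.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not from the thread).

# Read "PID STAT COMMAND" rows on stdin and print the PIDs whose state
# starts with D (uninterruptible sleep) and whose command matches the
# optional regex in $1 (default: match any command).
d_state_pids() {
    awk -v pat="${1:-.}" '$2 ~ /^D/ && $3 ~ pat { print $1 }'
}

# Print the kernel stack of every D-state task whose command matches $1.
# Assumes /proc/<pid>/stack is available; falls back to a note otherwise.
dump_d_stacks() {
    ps -eo pid=,stat=,comm= | d_state_pids "${1:-.}" | while read -r pid; do
        echo "== $pid"
        cat "/proc/$pid/stack" 2>/dev/null || echo "(stack unavailable)"
    done
}
```

For example, running `dump_d_stacks iscsi` as root on a target, or `dump_d_stacks kworker` on a client, should print the same kind of traces as those quoted above.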