[BUG] iscsi hangs during login attempt

Paul Dagnelie <pcd@xxxxxxxxxxx> · Wed, 14 Sep 2022 11:00:10 -0700

Hi all,

We've encountered an issue a few times now that results in a hung
iscsi session that cannot be recovered, and ultimately leads to the
system being wedged. I've done some initial analysis on what the issue
might be, but I would appreciate some more experienced eyes on it and
suggestions about the right course of action.

The machines in question are running Linux 5.4 (plus some other
changes; it's a custom kernel built on top of Ubuntu 20.04, but I
don't see any relevant patches anywhere in the stack, so I decided to
come straight here). The issue presents with iscsi login attempts
timing out repeatedly. We took a core dump of the machine, and
analysis of the stack traces showed that the iscsi_target_login_thread
is waiting for another thread to finish with the np_login_sem. That
semaphore is being held by a thread in iscsi_target_do_login_rx, which
is currently waiting for a session to be reinstated.
iscsit_cause_connection_reinstatement signals the rx and tx threads,
and then waits for the conn_wait_comp completion, which is signalled
in iscsit_close_connection. That appears to be called by the tx and rx
threads when they exit.  After puzzling through the core for a bit, I
found the kernel threads in question, and they appear to be calmly
waiting in the normal blocking path waiting for IOs to come in for
them to respond to. I would think that if they were in that state when
the SIGINT came in they would have exited properly.

My theory, after examining the code, is if two connection requests
were received from one initiator in rapid succession, it seems like
the second one would use the connection reinstatement logic. It may be
possible that if a reinstatement happens fast enough after the initial
login, the rx and tx threads would not yet have marked themselves as
able to receive the SIGINT that the connection reinstatement logic
uses to prompt them to close the connection so a new one can be
created. As I understand how signals are processed for kernel threads,
this would result in the signal being dropped. After that happens, the
kernel threads would never exit. If that is possible, then this could
explain the issue we’re seeing.

We've enabled some of the pr_debug statements on the systems in
question, but the issue hasn't recurred on any system yet. In
addition, it's worth noting that this issue did present itself shortly
after patching the initiating windows systems, which would presumably
result in one or more connection reinstatements. Does this theory seem
plausible? We haven't managed to reproduce it in-house or with
debugging statements enabled yet, but if it is the root cause it seems
to me the best fix would be to add or use an existing flag that is set
during reconnection (before the signal is sent), and have the rx and
tx threads check it after enabling signals to close the window for the
race.

-- 
Paul Dagnelie