Hi all,

We've encountered an issue a few times now that results in a hung iSCSI session that cannot be recovered, and that ultimately leaves the system wedged. I've done some initial analysis of what the issue might be, but I would appreciate some more experienced eyes on it and suggestions about the right course of action.

The machines in question are running Linux 5.4 plus some other changes: it's a custom kernel built on top of Ubuntu 20.04. I don't see any relevant patches anywhere in the stack, so I decided to come straight here.

The issue presents as iSCSI login attempts timing out repeatedly. We took a core dump of the machine, and the stack traces showed that the iscsi_target_login_thread is waiting for another thread to release np_login_sem. That semaphore is held by a thread in iscsi_target_do_login_rx, which is itself waiting for a session to be reinstated. iscsit_cause_connection_reinstatement signals the rx and tx threads and then waits for the conn_wait_comp completion, which is signalled in iscsit_close_connection. That function appears to be called by the tx and rx threads when they exit.

After puzzling through the core for a bit, I found the kernel threads in question. They appear to be calmly sitting in the normal blocking path, waiting for IOs to come in for them to respond to. I would think that if they had been in that state when the SIGINT came in, they would have exited properly.

My theory, after examining the code, is that if two connection requests are received from one initiator in rapid succession, the second one would use the connection reinstatement logic. It may be possible that, if a reinstatement happens fast enough after the initial login, the rx and tx threads have not yet marked themselves as able to receive the SIGINT that the reinstatement logic uses to prompt them to close the connection so that a new one can be created.
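
To make the window I have in mind concrete, here is a rough userspace model of it. This is not kernel code and does not mirror the real iscsi target structures: struct model_thread and the model_* functions are made up for illustration, and the model simply assumes that a signal sent to a thread that has not yet enabled it is discarded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one rx or tx thread; names are illustrative only. */
struct model_thread {
	bool signals_enabled;	/* thread has opted in to receiving SIGINT */
	bool signal_pending;	/* a SIGINT was delivered and not yet acted on */
	bool exited;		/* thread left its loop and closed the connection */
};

/* Sender side: models the reinstatement path signalling the thread. */
static void model_send_sigint(struct model_thread *t)
{
	assert(t != NULL);
	if (t->signals_enabled)
		t->signal_pending = true;
	/* Otherwise the signal is dropped (assumed behaviour, see above). */
}

/* Thread side, step 1: opt in to the signal. */
static void model_enable_signals(struct model_thread *t)
{
	assert(t != NULL);
	t->signals_enabled = true;
}

/* Thread side, step 2: the blocking wait only ends if a signal is pending. */
static void model_wait_for_io(struct model_thread *t)
{
	assert(t != NULL);
	if (t->signal_pending)
		t->exited = true;
}

/*
 * Run one interleaving. Returns true if the thread exits, false if it
 * stays blocked, in which case the waiter on the completion never wakes.
 */
static bool model_run(bool signal_before_enable)
{
	struct model_thread t = {0};

	if (signal_before_enable) {
		model_send_sigint(&t);		/* arrives too early */
		model_enable_signals(&t);
	} else {
		model_enable_signals(&t);
		model_send_sigint(&t);		/* arrives in time */
	}
	model_wait_for_io(&t);
	return t.exited;
}
```

In this model, model_run(false) is the normal ordering and the thread exits, while model_run(true) is the suspected ordering and the thread stays blocked.
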
As I understand how signals are processed for kernel threads, a signal sent before the thread has enabled it would simply be dropped, and the kernel threads would then never exit. If that is possible, it could explain the issue we're seeing.

We've enabled some of the pr_debug statements on the systems in question, but the issue hasn't recurred on any of them yet. It's also worth noting that the issue showed up shortly after the Windows initiator systems were patched, which would presumably have caused one or more connection reinstatements.

Does this theory seem plausible? We haven't yet managed to reproduce it in-house or with the debugging statements enabled. If it is the root cause, it seems to me the best fix would be to add a flag (or reuse an existing one) that is set during reconnection, before the signal is sent, and to have the rx and tx threads check it after enabling signals, which would close the window for the race.

--
Paul Dagnelie