Leak of tpg->np_login_sem, possibly due to connection interruptions

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Hey all,

I'm investigating an issue that we've been seeing on and off for about a year and a half now. We're running on Linux 5.15 with a couple custom patches, but none that I think would be relevant to this issue. A customer (usually with a somewhat flaky network) will report that they're getting constant "iSCSI Login timeout on Network Portal 0.0.0.0:3260" messages because the iSCSI Target login thread will wait on the np_login_sem  semaphore until it gets interrupted by the timer timeout.

We can see a thread blocked on the semaphore if we look at the stack traces on the system:

0xffff8bdf62f2ac80 INTERRUPTIBLE         1
                  __schedule+0x2c1
                  schedule+0x33
                  schedule_timeout+0x205
                  __down_interruptible+0xbb
                  down_interruptible+0x4b
                  iscsit_access_np+0x5a
                  iscsi_target_locate_portal+0x429
                  __iscsi_target_login_thread+0x332
                  iscsi_target_login_thread+0x6f3
                  kthread+0x120
                  ret_from_fork+0x1f

And using bpftrace, we can see that all calls to iscsit_access_np with one tpg return -1 when the timeout hits.

I've also managed to run some bpftrace when the problem first starts happening. The problem first starts when we get a LUN_RESET (XXX here are redacted, but all the lines have the same iqn in the original logs):

Aug  6 23:24:19 HOSTNAME kernel: [24805.722904] LUN_RESET: TMR caller
fabric: iSCSI initiator port iqn.1991-05.com.XXX:XXX.XXX.XXX
Aug  6 23:24:19 HOSTNAME kernel: [24805.722907] LUN_RESET: TMR starting for [iblock], tas: 1
Aug  6 23:24:19 HOSTNAME kernel: [24805.722919] LUN_RESET: ITT[0x96786300] -  pr_res_key: 0x0000000000000000

Twenty seconds later, we then start doing the iscsi login flow. The bpftrace that shows that an access_np call comes in for that tpg, but no corresponding deaccess_np call ever comes in. This system also has pr_debug printing enabled for various files, which allows us to see that iscsit_take_action_for_connection_exit was called, it looks like because connection reinstatement was triggered:

Aug  6 23:24:39 HOSTNAME kernel: [24825.690817] iSCSI security negotiation completed successfully.
Aug  6 23:24:39 HOSTNAME kernel: [24825.690826] Normal iSCSI Session SID 20 is still active for iqn.1991-05.com.XXX:XXX.XXX.XXX, performing session reinstatement.
Aug  6 23:24:39 HOSTNAME kernel: [24825.690844] Moving to TARG_CONN_STATE_CLEANUP_WAIT.
Aug  6 23:24:39 HOSTNAME kernel: [24825.690847] Performing cleanup for failed iSCSI Connection ID: 1 from iqn.1991-05.com.XXX:XXX.XXX.XXX

As a result, it seems like there is a way for the np_login_sem hold to leak.  It seems like what's happening is that after the LUN_RESET, the session somehow wasn't closed. As a result, the login process decides to try to reinstate the old session, which tries to reinstate the old connection. This kills the tx and rx thread, and one of them invokes iscsit_take_action_for_connection_exit, cancelling the login attempt and killing the delayed worker (who would usually be responsible for calling deaccess_np). As a result, the semaphore is never released.

My current best idea for how to fix this (if it is the issue I think it is) is to somehow detect in the connection exit code that we're holding the semaphore, and forcibly release it in that case. However, I'm not familiar enough with this code path to know if that idea is feasible, or if it would break another valid flow through this code. I would welcome any alternative explanation for the symptoms and logging I've been seeing, as well as any other proposals on a way to fix this. We have a customer system that reliably hits this state within a couple hours after booting, which is causing them a lot of grief.

Thanks,
Paul Dagnelie

This e-mail may contain information that is privileged or confidential. If you are not the intended recipient, please delete the e-mail and any attachments and notify us immediately.






[Index of Archives]     [Linux SCSI]     [Kernel Newbies]     [Linux SCSI Target Infrastructure]     [Share Photos]     [IDE]     [Security]     [Git]     [Netfilter]     [Bugtraq]     [Yosemite News]     [MIPS Linux]     [ARM Linux]     [Linux Security]     [Linux RAID]     [Linux ATA RAID]     [Linux IIO]     [Device Mapper]

  Powered by Linux