On 2020/9/2 10:57 AM, Michael Christie wrote:
On Jul 29, 2020, at 8:03 AM, Hou Pu <houpu@xxxxxxxxxxxxx> wrote:
The iSCSI target login thread might get stuck with the following stack:
cat /proc/`pidof iscsi_np`/stack
[<0>] down_interruptible+0x42/0x50
[<0>] iscsit_access_np+0xe3/0x167
[<0>] iscsi_target_locate_portal+0x695/0x8ac
[<0>] __iscsi_target_login_thread+0x855/0xb82
[<0>] iscsi_target_login_thread+0x2f/0x5a
[<0>] kthread+0xfa/0x130
[<0>] ret_from_fork+0x1f/0x30
This can be reproduced with the following steps:
1. Initiator A tries to log in to iqn1-tpg1 on port 3260. After the PDU
exchange in the login thread has finished, but before the negotiation
is finished, the network link goes down. This can happen in a
production environment. I emulated it by bringing the network card
down on the initiator node with ifconfig eth0 down.
(Now A can never finish this login, and tpg->np_login_sem is
held by it.)
2. Initiator B tries to log in to iqn2-tpg1 on port 3260. After the PDU
exchange in the login thread has finished, the target expects to
process the remaining login PDUs in workqueue context.
3. Initiator A' tries to log in to iqn1-tpg1 again on port 3260 from
a new socket. It waits for tpg->np_login_sem with
np->np_login_timer armed to wait for at most 15 seconds.
(Because the lock is held by A, and A never gets a chance to
release tpg->np_login_sem, A' should eventually time out.)
4. Before A' times out, initiator B fails negotiation and
calls iscsi_target_login_drop()->iscsi_target_login_sess_out().
This cancels np->np_login_timer, so initiator A' hangs
there forever. Because A' is running in the login thread, no other
login request can be serviced.
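
The four steps can be walked through with a small toy model. This is only a sketch and not kernel code: the Portal and PortalGroup classes and their methods are hypothetical stand-ins for struct iscsi_np and struct iscsi_portal_group, and the model assumes that A' can only be woken by the portal timer or by A releasing the semaphore.

```python
# Toy model of steps 1-4; NOT kernel code. All class and method names are
# hypothetical stand-ins for the structures named in the report.


class Portal:
    """Stands in for struct iscsi_np: one per ip:port, shared by TPGs."""

    def __init__(self):
        self.login_timer_armed = False  # models np->np_login_timer

    def start_login_timer(self):
        self.login_timer_armed = True

    def stop_login_timer(self):
        self.login_timer_armed = False


class PortalGroup:
    """Stands in for struct iscsi_portal_group: one per target."""

    def __init__(self, name, portal):
        self.name = name
        self.portal = portal
        self.login_sem_holder = None  # models tpg->np_login_sem

    def try_acquire(self, initiator):
        # Non-blocking acquire; returns False where the kernel would sleep.
        if self.login_sem_holder is None:
            self.login_sem_holder = initiator
            return True
        return False


def run_scenario():
    np = Portal()
    tpg1 = PortalGroup("iqn1-tpg1", np)
    tpg2 = PortalGroup("iqn2-tpg1", np)  # different target, same portal

    # Step 1: A takes tpg1's semaphore, then its link goes down for good.
    assert tpg1.try_acquire("A")

    # Step 2: B logs in to tpg2 through the same portal object.
    assert tpg2.portal is tpg1.portal

    # Step 3: A' arms the shared timer and blocks on tpg1's semaphore.
    np.start_login_timer()
    a_prime_blocked = not tpg1.try_acquire("A'")

    # Step 4: B's failed login runs cleanup that stops the shared timer.
    np.stop_login_timer()

    # A' is stuck: the semaphore is still held by A and no timer is pending.
    return (a_prime_blocked
            and not np.login_timer_armed
            and tpg1.login_sem_holder == "A")


if __name__ == "__main__":
    print("A' hangs forever:", run_scenario())
```

In this model the hang comes purely from the timer living on the shared portal object while the semaphore lives on the per-target portal group.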
iqn1 and iqn2 are different targets, right? It’s not clear to me how initiator B failing negotiation cancels the timer for the portal under a different iqn/target.
iqn1-tpg1 in step 1 and step 3 is the same one (the same target volume).
iqn2-tpg1 in step 2 is a different volume on the same host.
The configuration looks like this:
iqn1-tpg1:
root@storageXXX:/sys/kernel/config/target/iscsi# ls
iqn.2010-10.org.openstack\:volume-00e50deb-5296-4f18-xxxx-106f96a880c8/tpgt_1/np/
10.129.77.16:3260
iqn2-tpg1:
root@storageXXX:/sys/kernel/config/target/iscsi# ls
iqn.2010-10.org.openstack\:volume-86af15c6-c529-4715-xxxx-3c9ca068635d/tpgt_1/np/
10.129.77.16:3260
(I can provide more if needed.)
Is iqn2-tpg1->np1 a different struct than iqn1-tpg1->np1? I mean, would iscsit_get_tpg_from_np return a different np struct for initiator B and for A?
iscsit_get_tpg_from_np() returns a different struct iscsi_portal_group
for initiators A and B, but struct iscsi_np is shared by them
because they have the same portal (IP address and port).
Thanks,
Hou