From: Rock Li <lihongweizz@xxxxxxxxxx>

We hit a crash when the iSCSI connection was unstable:

[ 5581.778976] connection929:0: detected conn error (1020)
[ 5592.539182] scsi host17: iSCSI Initiator over TCP/IP
[ 5592.548847] connection930:0: detected conn error (1020)
[ 5592.548890] BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
[ 5592.548935] PGD 0
[ 5592.548947] Oops: 0000 [#1] SMP NOPTI
[ 5592.548966] CPU: 51 PID: 912890 Comm: kworker/u161:2 Kdump: loaded Tainted: G OE - xxxxxx #1
[ 5592.549022] Hardware name: xxxxxx
[ 5592.549053] Workqueue: iscsi_conn_cleanup iscsi_cleanup_conn_work_fn [scsi_transport_iscsi]
[ 5592.549098] RIP: 0010:iscsi_sw_tcp_release_conn+0x54/0x110 [iscsi_tcp]
[ 5592.549130] Code: fb be 02 00 00 00 48 89 0f e8 88 65 8b c9 48 8b 45 20 f0 ff 80 80 00 00 00 0f 88 e3 06 00 00 48 8b 43 08 4c 8b 70 08 49 8b 06 <48> 8b 58 20 4c 8d bb 30 02 00 00 4c 89 ff e8 49 75 a5 c9 4c 89 ff
[ 5592.549209] RSP: 0018:ff6937f4283e7e00 EFLAGS: 00010202
[ 5592.549233] RAX: 0000000000000000 RBX: ff347b03a4a4b478 RCX: 0000000000000000
[ 5592.549265] RDX: 0000000000000000 RSI: 00000000fffffe01 RDI: ffffffff8a2bc977
[ 5592.549296] RBP: ff347b063d49d600 R08: ff347b20bffeb878 R09: 00000000000003e8
[ 5592.549327] R10: 0000000000000000 R11: ff347b20bffe9b44 R12: ff347b03a4a4b7a8
[ 5592.549358] R13: ff347b03a4a4e610 R14: ff347b03a4a4b7a8 R15: ff347b03a4a4b068
[ 5592.549389] FS:  0000000000000000(0000) GS:ff347b20bffc0000(0000) knlGS:0000000000000000
[ 5592.549424] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5592.549446] CR2: 0000000000000020 CR3: 0000003a22610005 CR4: 0000000000773ee0
[ 5592.549469] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5592.549491] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5592.549510] PKRU: 55555554
[ 5592.549518] Call Trace:
[ 5592.549528]  iscsi_sw_tcp_conn_stop+0x5d/0x80 [iscsi_tcp]
[ 5592.549546]  iscsi_stop_conn+0x66/0xc0 [scsi_transport_iscsi]
[ 5592.549568]  iscsi_cleanup_conn_work_fn+0x6e/0xbe [scsi_transport_iscsi]

After digging into the vmcore, the following concurrency scenario was
found:

<netlink recv msg>          <sk_state_change triggered>      <kworker>
iscsi_if_rx                 iscsi_sw_tcp_state_change
iscsi_if_recv_msg           iscsi_sw_sk_state_check
iscsi_if_stop_conn          iscsi_conn_failure
cancel_work_sync(           iscsi_conn_error_event
  &conn->cleanup_work)        queue_work(
                                &conn->cleanup_work)
iscsi_stop_conn       <-- executed concurrently -->          iscsi_stop_conn

iscsi_stop_conn() can be executed concurrently from different paths:
the netlink STOP_CONN_TERM path cancels cleanup_work, but a socket
state change can queue the work again before the connection is
actually stopped, so both paths end up stopping the same connection.
Fix this by taking ep_mutex around iscsi_stop_conn().

Signed-off-by: Rock Li <lihongweizz@xxxxxxxxxx>
---
 drivers/scsi/scsi_transport_iscsi.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c
index b9b97300e3b3..1da1083509b6 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -2307,7 +2307,16 @@ static int iscsi_if_stop_conn(struct iscsi_cls_conn *conn, int flag)
 	 */
 	if (flag == STOP_CONN_TERM) {
 		cancel_work_sync(&conn->cleanup_work);
+
+		/* There is a race window between cancelling cleanup_work
+		 * and stopping the connection: a socket-down event may
+		 * queue cleanup_work again before the connection is
+		 * stopped, so iscsi_stop_conn() can run concurrently.
+		 * Avoid this by holding ep_mutex across iscsi_stop_conn().
+		 */
+		mutex_lock(&conn->ep_mutex);
 		iscsi_stop_conn(conn, flag);
+		mutex_unlock(&conn->ep_mutex);
 	} else {
 		/*
 		 * Figure out if it was the kernel or userspace initiating this.
-- 
2.27.0
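
Note for reviewers: the following is a minimal user-space sketch of the
serialization this patch applies, assuming nothing beyond pthreads. The
names stop_conn(), cleanup_worker() and netlink_path() are illustrative
stand-ins for iscsi_stop_conn(), the cleanup_work handler and the
netlink STOP_CONN_TERM path; it is not driver code. Both paths funnel
through one mutex before stopping the connection, so the later caller
observes the stop already done instead of touching freed state.

/* Build with: cc -pthread race_sketch.c -o race_sketch */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t ep_mutex = PTHREAD_MUTEX_INITIALIZER;
static int conn_stopped;

/* Stand-in for iscsi_stop_conn(); must not run twice in parallel. */
static void stop_conn(const char *who)
{
	pthread_mutex_lock(&ep_mutex);	/* the fix: serialize the stop path */
	if (!conn_stopped) {
		printf("%s: stopping connection\n", who);
		conn_stopped = 1;
	} else {
		printf("%s: already stopped, nothing to do\n", who);
	}
	pthread_mutex_unlock(&ep_mutex);
}

/* Models the cleanup_work queued by the socket-down event. */
static void *cleanup_worker(void *arg)
{
	(void)arg;
	stop_conn("kworker");
	return NULL;
}

/* Models the netlink STOP_CONN_TERM path: cancel_work_sync() happened
 * earlier, but the work may already have been queued again, so the
 * worker can still race with us here.
 */
static void *netlink_path(void *arg)
{
	(void)arg;
	stop_conn("netlink");
	return NULL;
}

int main(void)
{
	pthread_t t1, t2;

	pthread_create(&t1, NULL, cleanup_worker, NULL);
	pthread_create(&t2, NULL, netlink_path, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return 0;
}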