[BUG] lpfc: kernel NULL pointer dereference

"Dietmar Hahn (Fujitsu)" <dietmar.hahn@xxxxxxxxxxx> · Tue, 25 Oct 2022 12:47:16 +0000

Hi,

we see system crashes almost every day, triggered by a slightly defective peripheral device.

[21004.663296] lpfc 0000:65:00.1: 7:(0):2858 FLOGI failure Status:x3/x31420002 TMO:x14 Data x11140820 x0
[21011.733417]  rport-18:0-1: blocked FC remote port time out: removing rport
[21011.733424] **** lpfc_rport_invalid: Null vport on ndlp xffff8e25e1a4a000, DID xfffffe rport xffff8e061354b000 SID xffffffff
[21011.733432] BUG: kernel NULL pointer dereference, address: 0000000000000000
[21011.733438] #PF: supervisor read access in kernel mode
[21011.733441] #PF: error_code(0x0000) - not-present page
[21011.733444] PGD 0 P4D 0 
[21011.733448] Oops: 0000 [#1] PREEMPT SMP NOPTI
[21011.733453] CPU: 47 PID: 1303 Comm: kworker/47:4 Kdump: loaded Not tainted 5.14.21-150400.24.21-default #1 SLE15-SP4 7550826c4c7e8c258239e300508e0c8b2a69bad2
[21011.733460] Hardware name: FUJITSU SE SERVER SU320 M1/D3892-A1, BIOS V1.0.0.0 R1.13.0 for D3892-A1x            11/25/2021
[21011.733463] Workqueue: fc_wq_18 fc_rport_final_delete [scsi_transport_fc]
[21011.733475] RIP: 0010:lpfc_dev_loss_tmo_callbk+0x50/0x4d0 [lpfc]
[21011.733497] Code: 00 00 00 0f b7 8b ac 00 00 00 48 c7 c2 68 82 93 c0 44 8b 83 98 00 00 00 44 8b 8b 94 00 00 00 48 89 fd be 80 00 00 00 4c 89 e7 <4d> 8b 2c 24 e8 c7 8e 04 00 4c 8b 83 f8 00 00 00 41 8b 90 e0 02 00
[21011.733502] RSP: 0018:ffff9ecb604bbe38 EFLAGS: 00010286
[21011.733505] RAX: ffff8e061354b510 RBX: ffff8e25e1a4a000 RCX: 000000000000ffff
[21011.733508] RDX: ffffffffc0938268 RSI: 0000000000000080 RDI: 0000000000000000
[21011.733511] RBP: ffff8e061354b000 R08: 0000000000fffffe R09: 0000000000000000
[21011.733513] R10: ffff9ecb4c923d80 R11: ffff9ecb604bbc80 R12: 0000000000000000
[21011.733515] R13: ffff8e061354b000 R14: ffff8e4505b21000 R15: ffff8e4503944e40
[21011.733518] FS:  0000000000000000(0000) GS:ffff8e647fdc0000(0000) knlGS:0000000000000000
[21011.733521] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[21011.733523] CR2: 0000000000000000 CR3: 0000002a97010001 CR4: 00000000007706e0
[21011.733526] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[21011.733528] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[21011.733530] PKRU: 55555554
[21011.733532] Call Trace:
[21011.733537]  <TASK>
[21011.733541]  fc_rport_final_delete+0xec/0x1c0 [scsi_transport_fc 3bc651e7b65441f21e0602fb7ca4ac10797e0b7e]
[21011.733550]  process_one_work+0x264/0x440
[21011.733566]  worker_thread+0x2d/0x3d0
[21011.733571]  ? process_one_work+0x440/0x440
[21011.733574]  kthread+0x154/0x180
[21011.733580]  ? set_kthread_struct+0x50/0x50
[21011.733584]  ret_from_fork+0x1f/0x30
[21011.733591]  </TASK>

This is kernel 5.14.21-150400.24.21-default from SUSE, but with
lpfc_version.h: #define LPFC_DRIVER_VERSION "14.2.0.6"

The cause is that rport->dd_data->pnode->vport is NULL, where rport is the struct fc_rport * passed to the callback.

In fc_rport_final_delete():
 -> lpfc_terminate_rport_io(rport)
    -> lpfc_rport_invalid()
       -> if (!ndlp->vport) {
                pr_err("**** %s: Null vport on ndlp ...

But later in lpfc_dev_loss_tmo_callbk():
   vport = ndlp->vport;
   phba  = vport->phba;  -> Panic!
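
To make the pointer chain above concrete, here is a minimal user-space sketch. It is only a model: the model_* structs and model_rport_to_hba() are hypothetical stand-ins that I made up for illustration, not the real lpfc definitions, and they keep only the fields named in this mail. It shows a walk from rport to phba that stops at the first missing link instead of dereferencing it.

```c
/* User-space model only: simplified stand-ins, NOT the real lpfc structs. */
#include <stddef.h>

struct model_hba   { int id; };
struct model_vport { struct model_hba *phba; };
struct model_ndlp  { struct model_vport *vport; };
struct model_rdata { struct model_ndlp *pnode; };
struct model_rport { void *dd_data; };

/*
 * Walk rport -> dd_data -> pnode -> vport -> phba.
 * Returns NULL as soon as any link in the chain is missing.
 */
static struct model_hba *model_rport_to_hba(struct model_rport *rport)
{
    struct model_rdata *rdata;
    struct model_ndlp *ndlp;

    if (!rport || !rport->dd_data)
        return NULL;

    rdata = rport->dd_data;
    ndlp = rdata->pnode;
    if (!ndlp)
        return NULL;

    /* The case from the oops: ndlp is valid, but its vport is NULL. */
    if (!ndlp->vport)
        return NULL;

    return ndlp->vport->phba;
}
```

With a fully populated chain the helper returns the HBA pointer; after setting the node's vport to NULL it returns NULL instead of faulting. Whether returning early is the right reaction inside the real callback is a separate question that this sketch does not answer.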

Not being familiar with the code, I'm not sure if a simple check would do the trick:

diff --git a/drivers/scsi/lpfc/lpfc_hbadisc.c b/drivers/scsi/lpfc/lpfc_hbadisc.c
index d38ebd7281b9..5c5684909d24 100644
--- a/drivers/scsi/lpfc/lpfc_hbadisc.c
+++ b/drivers/scsi/lpfc/lpfc_hbadisc.c
@@ -160,6 +160,9 @@ lpfc_dev_loss_tmo_callbk(struct fc_rport *rport)
        if (!ndlp)
                return;
 
+       if (!ndlp->vport)
+               return;
+
        vport = ndlp->vport;
        phba  = vport->phba;
 
Thanks.
Dietmar.