Re: [bug report]concurrent blktests nvme-rdma execution lead kernel null pointer

-----"Yi Zhang" <yi.zhang@xxxxxxxxxx> wrote: -----

>To: "RDMA mailing list" <linux-rdma@xxxxxxxxxxxxxxx>
>From: "Yi Zhang" <yi.zhang@xxxxxxxxxx>
>Date: 12/03/2021 03:20AM
>Subject: [EXTERNAL] [bug report]concurrent blktests nvme-rdma
>execution lead kernel null pointer
>
>Hello
>Running blktests nvme-rdma concurrently with both rdma_rxe and siw
>leads to a kernel BUG on 5.16.0-rc3. Please help check it, thanks.
>

The RDMA core currently does not prevent us from
assigning both siw and rxe to the same netdev. I think this
is what is happening here. Such a configuration makes no sense, but
it is evidently not prohibited by the RDMA infrastructure. The behavior
is undefined, and a kernel panic is not unexpected. Shall we
prevent the privileged user from doing this type of
experiment?

A related question: should we also explicitly refuse to
add software RDMA drivers to netdevs with RDMA hardware active?
Although it is pointless and the resulting behavior is undefined,
this is currently possible as well.
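
To make the two questions above concrete, here is a minimal user-space
sketch of the kind of check being asked about. It is a toy model only,
not RDMA core code: the LinkRegistry class, its methods, the "hw"
driver label and the device names are all hypothetical and chosen for
illustration. It assumes the policy "a software driver (rxe or siw) may
only be added to a netdev that has no RDMA device bound yet".

```python
from dataclasses import dataclass, field

# Software RDMA drivers named in this thread.
SOFT_DRIVERS = {"rxe", "siw"}


@dataclass
class LinkRegistry:
    """Toy stand-in for netdev -> RDMA device bookkeeping (hypothetical)."""

    # netdev name -> list of (device name, driver type) bound to it
    bound: dict = field(default_factory=dict)

    def add_link(self, dev_name: str, drv_type: str, netdev: str) -> bool:
        """Bind dev_name to netdev; return False if the request is refused."""
        existing = self.bound.get(netdev, [])
        if drv_type in SOFT_DRIVERS and existing:
            # Refuse a software driver on a netdev that already carries
            # any RDMA device, whether software or hardware.
            return False
        self.bound.setdefault(netdev, []).append((dev_name, drv_type))
        return True

    def del_link(self, dev_name: str) -> None:
        """Drop dev_name from whichever netdev it is bound to."""
        for netdev in list(self.bound):
            remaining = [d for d in self.bound[netdev] if d[0] != dev_name]
            if remaining:
                self.bound[netdev] = remaining
            else:
                del self.bound[netdev]


if __name__ == "__main__":
    reg = LinkRegistry()
    print(reg.add_link("eno2_rxe", "rxe", "eno2"))  # True: first device
    print(reg.add_link("eno2_siw", "siw", "eno2"))  # False: rxe is there
    print(reg.add_link("hw0", "hw", "eno3"))        # True: hardware device
    print(reg.add_link("eno3_siw", "siw", "eno3"))  # False: hardware active
```

In this model the second and fourth requests are refused instead of
leaving two devices on one netdev. Whether such a check belongs in the
RDMA core or in the individual drivers is left open here.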

Thanks
Bernard.

>Reproducer:
>Run blktests nvme-rdma in two terminals at the same time
>terminal 1:
># use_siw=1 nvme_trtype=rdma ./check nvme/
>terminal 2:
># nvme_trtype=rdma ./check nvme/
>
>[ 1685.584327] run blktests nvme/013 at 2021-12-02 21:08:46
>[ 1685.669804] eno2 speed is unknown, defaulting to 1000
>[ 1685.674866] eno2 speed is unknown, defaulting to 1000
>[ 1685.679941] eno2 speed is unknown, defaulting to 1000
>[ 1685.686033] eno2 speed is unknown, defaulting to 1000
>[ 1685.691087] eno2 speed is unknown, defaulting to 1000
>[ 1685.697677] eno2 speed is unknown, defaulting to 1000
>[ 1685.703727] eno3 speed is unknown, defaulting to 1000
>[ 1685.708798] eno3 speed is unknown, defaulting to 1000
>[ 1685.713863] eno3 speed is unknown, defaulting to 1000
>[ 1685.719965] eno3 speed is unknown, defaulting to 1000
>[ 1685.725043] eno3 speed is unknown, defaulting to 1000
>[ 1685.731688] eno2 speed is unknown, defaulting to 1000
>[ 1685.736763] eno3 speed is unknown, defaulting to 1000
>[ 1685.742818] eno4 speed is unknown, defaulting to 1000
>[ 1685.747881] eno4 speed is unknown, defaulting to 1000
>[ 1685.752949] eno4 speed is unknown, defaulting to 1000
>[ 1685.759134] eno4 speed is unknown, defaulting to 1000
>[ 1685.764195] eno4 speed is unknown, defaulting to 1000
>[ 1685.770914] eno2 speed is unknown, defaulting to 1000
>[ 1685.775980] eno3 speed is unknown, defaulting to 1000
>[ 1685.781047] eno4 speed is unknown, defaulting to 1000
>[ 1686.002801] eno2 speed is unknown, defaulting to 1000
>[ 1686.007867] eno3 speed is unknown, defaulting to 1000
>[ 1686.012934] eno4 speed is unknown, defaulting to 1000
>[ 1686.022521] rdma_rxe: rxe-ah pool destroyed with unfree'd elem
>[ 1686.289384] run blktests nvme/013 at 2021-12-02 21:08:46
>[ 1686.356666] eno2 speed is unknown, defaulting to 1000
>[ 1686.361735] eno2 speed is unknown, defaulting to 1000
>[ 1686.366807] eno2 speed is unknown, defaulting to 1000
>[ 1686.371876] eno2 speed is unknown, defaulting to 1000
>[ 1686.378400] eno2 speed is unknown, defaulting to 1000
>[ 1686.384419] eno3 speed is unknown, defaulting to 1000
>[ 1686.389494] eno3 speed is unknown, defaulting to 1000
>[ 1686.394583] eno3 speed is unknown, defaulting to 1000
>[ 1686.399660] eno3 speed is unknown, defaulting to 1000
>[ 1686.406219] eno2 speed is unknown, defaulting to 1000
>[ 1686.411291] eno3 speed is unknown, defaulting to 1000
>[ 1686.417275] eno4 speed is unknown, defaulting to 1000
>[ 1686.422338] eno4 speed is unknown, defaulting to 1000
>[ 1686.427401] eno4 speed is unknown, defaulting to 1000
>[ 1686.432475] eno4 speed is unknown, defaulting to 1000
>[ 1686.439038] eno2 speed is unknown, defaulting to 1000
>[ 1686.444109] eno3 speed is unknown, defaulting to 1000
>[ 1686.449180] eno4 speed is unknown, defaulting to 1000
>[ 1686.873596] xfs filesystem being mounted at /mnt/blktests supports
>timestamps until 2038 (0x7fffffff)
>[ 1687.540606] xfs filesystem being mounted at /mnt/blktests supports
>timestamps until 2038 (0x7fffffff)
>[ 1693.658327] block nvme0n1: no available path - failing I/O
>[ 1693.663038] block nvme0n1: no available path - failing I/O
>[ 1693.663828] XFS (nvme0n1): log I/O error -5
>[ 1693.665024] block nvme0n1: no available path - failing I/O
>[ 1693.665041] XFS (nvme0n1): log I/O error -5
>[ 1693.665044] XFS (nvme0n1): Log I/O Error (0x2) detected at
>xlog_ioend_work+0x71/0x80 [xfs] (fs/xfs/xfs_log.c:1377).  Shutting
>down filesystem.
>[ 1693.665142] XFS (nvme0n1): Please unmount the filesystem and
>rectify the problem(s)
>[ 1693.720462] block nvme0n1: no available path - failing I/O
>[ 1693.728150] nvmet_rdma: post_recv cmd failed
>[ 1693.732432] nvmet_rdma: sending cmd response failed
>[ 1693.836083] eno2 speed is unknown, defaulting to 1000
>[ 1693.841152] eno3 speed is unknown, defaulting to 1000
>[ 1693.846217] eno4 speed is unknown, defaulting to 1000
>[ 1693.852280] BUG: unable to handle page fault for address:
>ffffffffc09d2680
>[ 1693.859156] #PF: supervisor instruction fetch in kernel mode
>[ 1693.864815] #PF: error_code(0x0010) - not-present page
>[ 1693.869953] PGD 2b5813067 P4D 2b5813067 PUD 2b5815067 PMD
>13a157067 PTE 0
>[ 1693.876740] Oops: 0010 [#1] PREEMPT SMP NOPTI
>[ 1693.881098] CPU: 15 PID: 16091 Comm: rdma Tainted: G S        I
>  5.16.0-rc3 #1
>[ 1693.888751] Hardware name: Dell Inc. PowerEdge R640/06NR82, BIOS
>2.11.2 004/21/2021
>[ 1693.896403] RIP: 0010:0xffffffffc09d2680
>[ 1693.900329] Code: Unable to access opcode bytes at RIP
>0xffffffffc09d2656.
>[ 1693.907202] RSP: 0018:ffffb3d5456237b0 EFLAGS: 00010286
>[ 1693.912428] RAX: ffffffffc09d2680 RBX: ffff9d4adade2000 RCX:
>0000000000000001
>[ 1693.919559] RDX: 0000000080000001 RSI: ffffb3d5456237e8 RDI:
>ffff9d4adade2000
>[ 1693.926693] RBP: ffffb3d5456237e8 R08: ffffb3d545623850 R09:
>0000000000000230
>[ 1693.933823] R10: 0000000000000002 R11: ffffb3d545623840 R12:
>ffff9d4adade2270
>[ 1693.940957] R13: ffff9d4adade21e0 R14: 0000000000000005 R15:
>ffff9d4adade2220
>[ 1693.948089] FS:  00007f2f0601c000(0000) GS:ffff9d59ffdc0000(0000)
>knlGS:0000000000000000
>[ 1693.956176] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>[ 1693.961921] CR2: ffffffffc09d2656 CR3: 0000000180578004 CR4:
>00000000007706e0
>[ 1693.969052] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>0000000000000000
>[ 1693.976177] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>0000000000000400
>[ 1693.983309] PKRU: 55555554
>[ 1693.986023] Call Trace:
>[ 1693.988474]  <TASK>
>[ 1693.990582]  ? cma_cm_event_handler+0x1d/0xd0 [rdma_cm]
>[ 1693.995817]  ? cma_process_remove+0x73/0x290 [rdma_cm]
>[ 1694.000954]  ? cma_remove_one+0x5a/0xd0 [rdma_cm]
>[ 1694.005661]  ? remove_client_context+0x88/0xd0 [ib_core]
>[ 1694.010990]  ? disable_device+0x8c/0x130 [ib_core]
>[ 1694.015790]  ? xa_load+0x73/0xa0
>[ 1694.019024]  ? __ib_unregister_device+0x40/0xa0 [ib_core]
>[ 1694.024431]  ? ib_unregister_device_and_put+0x33/0x50 [ib_core]
>[ 1694.030360]  ? nldev_dellink+0x86/0xe0 [ib_core]
>[ 1694.035000]  ? rdma_nl_rcv_msg+0x109/0x200 [ib_core]
>[ 1694.039978]  ? __alloc_skb+0x8c/0x1b0
>[ 1694.043645]  ? __kmalloc_node_track_caller+0x184/0x340
>[ 1694.048785]  ? rdma_nl_rcv+0xc8/0x110 [ib_core]
>[ 1694.053325]  ? netlink_unicast+0x1a2/0x280
>[ 1694.057424]  ? netlink_sendmsg+0x244/0x480
>[ 1694.061524]  ? sock_sendmsg+0x58/0x60
>[ 1694.065188]  ? __sys_sendto+0xee/0x160
>[ 1694.068944]  ? netlink_setsockopt+0x26e/0x3d0
>[ 1694.073300]  ? __sys_setsockopt+0xdc/0x1d0
>[ 1694.077400]  ? __x64_sys_sendto+0x24/0x30
>[ 1694.081414]  ? do_syscall_64+0x37/0x80
>[ 1694.085164]  ? entry_SYSCALL_64_after_hwframe+0x44/0xae
>[ 1694.090391]  </TASK>
>[ 1694.092584] Modules linked in: siw rpcrdma rdma_ucm ib_uverbs
>ib_srpt ib_isert iscsi_target_mod target_core_mod loop ib_iser
>libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core
>rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace
>fscache
>netfs rfkill sunrpc vfat fat dm_multipath intel_rapl_msr
>intel_rapl_common isst_if_common skx_edac x86_pkg_temp_thermal
>intel_powerclamp coretemp kvm_intel ipmi_ssif kvm mgag200
>i2c_algo_bit
>drm_kms_helper iTCO_wdt iTCO_vendor_support syscopyarea irqbypass
>sysfillrect crct10dif_pclmul sysimgblt crc32_pclmul fb_sys_fops
>ghash_clmulni_intel acpi_ipmi drm rapl ipmi_si intel_cstate mei_me
>intel_uncore i2c_i801 mei ipmi_devintf nd_pmem dax_pmem_compat
>wmi_bmof pcspkr device_dax intel_pch_thermal i2c_smbus lpc_ich
>ipmi_msghandler nd_btt dax_pmem_core acpi_power_meter xfs libcrc32c
>sd_mod t10_pi sg ahci libahci libata megaraid_sas nfit tg3
>crc32c_intel libnvdimm wmi dm_mirror dm_region_hash dm_log dm_mod
>[last unloaded: nvmet]
>[ 1694.178277] CR2: ffffffffc09d2680
>[ 1694.181596] ---[ end trace 9c234cd612cbb92a ]---
>[ 1694.217410] RIP: 0010:0xffffffffc09d2680
>[ 1694.221343] Code: Unable to access opcode bytes at RIP
>0xffffffffc09d2656.
>[ 1694.228212] RSP: 0018:ffffb3d5456237b0 EFLAGS: 00010286
>[ 1694.233437] RAX: ffffffffc09d2680 RBX: ffff9d4adade2000 RCX:
>0000000000000001
>[ 1694.240570] RDX: 0000000080000001 RSI: ffffb3d5456237e8 RDI:
>ffff9d4adade2000
>[ 1694.247702] RBP: ffffb3d5456237e8 R08: ffffb3d545623850 R09:
>0000000000000230
>[ 1694.254828] R10: 0000000000000002 R11: ffffb3d545623840 R12:
>ffff9d4adade2270
>[ 1694.261958] R13: ffff9d4adade21e0 R14: 0000000000000005 R15:
>ffff9d4adade2220
>[ 1694.269091] FS:  00007f2f0601c000(0000) GS:ffff9d59ffdc0000(0000)
>knlGS:0000000000000000
>[ 1694.277178] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>[ 1694.282922] CR2: ffffffffc09d2656 CR3: 0000000180578004 CR4:
>00000000007706e0
>[ 1694.290054] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
>0000000000000000
>[ 1694.297180] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
>0000000000000400
>[ 1694.304312] PKRU: 55555554
>[ 1694.307025] Kernel panic - not syncing: Fatal exception
>[ 1694.772244] Kernel Offset: 0x35c00000 from 0xffffffff81000000
>(relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>[ 1694.794394] ---[ end Kernel panic - not syncing: Fatal exception
>]---
>
>
>-- 
>Best Regards,
>  Yi Zhang
>
>



