rdmavt panic in long term stable linux-5.10.y

The following panic happens on the 5.10.20 long-term stable kernel when running qperf with rdmavt/hfi1:

[ 1467.730495] BUG: kernel NULL pointer dereference, address: 0000000000000268
[ 1467.738940] #PF: supervisor read access in kernel mode
[ 1467.745052] #PF: error_code(0x0000) - not-present page
[ 1467.751159] PGD 0 P4D 0 
[ 1467.754350] Oops: 0000 [#1] SMP PTI
[ 1467.758621] CPU: 43 PID: 42843 Comm: qperf Tainted: G S                5.10.17 #1
[ 1467.767370] Hardware name: Intel Corporation S2600CWR/S2600CW, BIOS SE5C610.86B.01.01.0014.121820151719 12/18/2015
[ 1467.779357] RIP: 0010:ib_umem_get+0x233/0x3d0 [ib_uverbs]
[ 1467.785811] Code: 02 00 00 48 0f 46 f5 e8 9b 67 27 ca 85 c0 0f 88 40 01 00 00 4c 63 f0 4c 89 f2 4c 29 f5 48 c1 e2 0c 89 e9 48 01 d3 49 8b 14 24 <48> 8b 92 68 02 00 00 48 85 d2 0f 85 5a ff ff ff 41 b9 00 00 01 00
[ 1467.807715] RSP: 0018:ffffb7ba87303aa8 EFLAGS: 00010206
[ 1467.814026] RAX: 0000000000000010 RBX: 000055ad89f11000 RCX: 0000000000000000
[ 1467.822457] RDX: 0000000000000000 RSI: 000000000000000f RDI: ffff8954bffd6000
[ 1467.830888] RBP: 0000000000000000 R08: 0000000000031443 R09: 0000000000000000
[ 1467.839322] R10: 0000000000031420 R11: 0000000000000022 R12: ffff894d50930000
[ 1467.847751] R13: 0000000000000000 R14: 0000000000000010 R15: ffff894d4a2fe880
[ 1467.856193] FS:  00007fb12f44c740(0000) GS:ffff89549fa40000(0000) knlGS:0000000000000000
[ 1467.865721] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1467.872657] CR2: 0000000000000268 CR3: 00000001c0534001 CR4: 00000000001706e0
[ 1467.881136] Call Trace:
[ 1467.884398]  rvt_reg_user_mr+0x70/0x200 [rdmavt]

The panic happens in the call to dma_get_max_seg_size() because the dma_device is NULL.

Here is the stable patch that causes the issue:

commit 404fa093741e15e16fd522cc76cd9f86e9ef81d2
Author: Christoph Hellwig <hch@xxxxxx>
Date:   Fri Nov 6 19:19:38 2020 +0100

    RDMA/core: remove use of dma_virt_ops
    
    [ Upstream commit 5a7a9e038b032137ae9c45d5429f18a2ffdf7d42 ]
    
    Use the ib_dma_* helpers to skip the DMA translation instead.  This
    removes the last user if dma_virt_ops and keeps the weird layering
    violation inside the RDMA core instead of burderning the DMA mapping
    subsystems with it.  This also means the software RDMA drivers now don't
    have to mess with DMA parameters that are not relevant to them at all, and
    that in the future we can use PCI P2P transfers even for software RDMA, as
    there is no first fake layer of DMA mapping that the P2P DMA support.
    
    Link: https://lore.kernel.org/r/20201106181941.1878556-8-hch@xxxxxx
    Signed-off-by: Christoph Hellwig <hch@xxxxxx>
    Tested-by: Mike Marciniszyn <mike.marciniszyn@xxxxxxxxxxxxxxxxxxxx>
    Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
    Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>

The stable backport missed a prereq patch:

commit b116c702791a9834e6485f67ca6267d9fdf59b87
Author: Christoph Hellwig <hch@xxxxxx>
Date:   Fri Nov 6 19:19:33 2020 +0100

    RDMA/umem: Use ib_dma_max_seg_size instead of dma_get_max_seg_size
    
    RDMA ULPs must not call DMA mapping APIs directly but instead use the
    ib_dma_* wrappers.
    
    Fixes: 0c16d9635e3a ("RDMA/umem: Move to allocate SG table from pages")
    Link: https://lore.kernel.org/r/20201106181941.1878556-3-hch@xxxxxx
    Reported-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
    Signed-off-by: Christoph Hellwig <hch@xxxxxx>
    Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>

The missing patch adds the necessary RDMA wrappers to handle the ib_device dma_device member being NULL.
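
To illustrate the difference, here is a minimal userspace sketch, not the kernel code. The model_* structures and function names are hypothetical stand-ins made up for this example; the 64K default and the UINT_MAX return for a NULL dma_device reflect my understanding of the kernel helpers and should be checked against the tree. The direct call dereferences the device pointer unconditionally, which is consistent with a fault at a small offset (0x268) from NULL in the trace above, while the wrapper checks for NULL first.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct model_dma_parms {
    unsigned int max_segment_size;
};

struct model_device {
    struct model_dma_parms *dma_parms;
};

struct model_ib_device {
    /* NULL for software drivers such as rdmavt after the dma_virt_ops removal. */
    struct model_device *dma_device;
};

/* Models the direct DMA helper: dereferences dev without checking it,
 * so passing NULL here faults just like the oops above. */
unsigned int model_dma_get_max_seg_size(const struct model_device *dev)
{
    if (dev->dma_parms && dev->dma_parms->max_segment_size)
        return dev->dma_parms->max_segment_size;
    return 65536; /* assumed 64K default */
}

/* Models the ib_dma_* style wrapper: a NULL dma_device means no DMA
 * translation is done, so no segment size limit is reported. */
unsigned int model_ib_dma_max_seg_size(const struct model_ib_device *ibdev)
{
    if (!ibdev->dma_device)
        return UINT_MAX;
    return model_dma_get_max_seg_size(ibdev->dma_device);
}

/* Small usage example covering the software and hardware cases. */
void model_check(void)
{
    struct model_ib_device sw = { .dma_device = NULL };
    struct model_dma_parms parms = { .max_segment_size = 4096 };
    struct model_device dev = { .dma_parms = &parms };
    struct model_ib_device hw = { .dma_device = &dev };

    assert(model_ib_dma_max_seg_size(&sw) == UINT_MAX);
    assert(model_ib_dma_max_seg_size(&hw) == 4096);
}
```

With only the dma_virt_ops removal applied, the umem code still takes the first path with a NULL device, which is the panic reported here.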

The missing patch cherry-picks cleanly and fixes the issue.

Do you want me to send the stable request?

Mike



