On Mon, Mar 15, 2021 at 01:05:43PM +0000, Marciniszyn, Mike wrote:
> The following panic happens on the 5.10.20 long term stable running qperf with rdmavt/hfi1:
>
> [ 1467.730495] BUG: kernel NULL pointer dereference, address: 0000000000000268
> [ 1467.738940] #PF: supervisor read access in kernel mode
> [ 1467.745052] #PF: error_code(0x0000) - not-present page
> [ 1467.751159] PGD 0 P4D 0
> [ 1467.754350] Oops: 0000 [#1] SMP PTI
> [ 1467.758621] CPU: 43 PID: 42843 Comm: qperf Tainted: G S 5.10.17 #1
> [ 1467.767370] Hardware name: Intel Corporation S2600CWR/S2600CW, BIOS SE5C610.86B.01.01.0014.121820151719 12/18/2015
> [ 1467.779357] RIP: 0010:ib_umem_get+0x233/0x3d0 [ib_uverbs]
> [ 1467.785811] Code: 02 00 00 48 0f 46 f5 e8 9b 67 27 ca 85 c0 0f 88 40 01 00 00 4c 63 f0 4c 89 f2 4c 29 f5 48 c1 e2 0c 89 e9 48 01 d3 49 8b 14 24 <48> 8b 92 68 02 00 00 48 85 d2 0f 85 5a ff ff ff 41 b9 00 00 01 00
> [ 1467.807715] RSP: 0018:ffffb7ba87303aa8 EFLAGS: 00010206
> [ 1467.814026] RAX: 0000000000000010 RBX: 000055ad89f11000 RCX: 0000000000000000
> [ 1467.822457] RDX: 0000000000000000 RSI: 000000000000000f RDI: ffff8954bffd6000
> [ 1467.830888] RBP: 0000000000000000 R08: 0000000000031443 R09: 0000000000000000
> [ 1467.839322] R10: 0000000000031420 R11: 0000000000000022 R12: ffff894d50930000
> [ 1467.847751] R13: 0000000000000000 R14: 0000000000000010 R15: ffff894d4a2fe880
> [ 1467.856193] FS: 00007fb12f44c740(0000) GS:ffff89549fa40000(0000) knlGS:0000000000000000
> [ 1467.865721] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1467.872657] CR2: 0000000000000268 CR3: 00000001c0534001 CR4: 00000000001706e0
> [ 1467.881136] Call Trace:
> [ 1467.884398]  rvt_reg_user_mr+0x70/0x200 [rdmavt]
>
> The panic happens in the call to dma_get_max_seg_size() because the dma_device is NULL.
>
> Here is the stable patch that causes the issue:
>
> commit 404fa093741e15e16fd522cc76cd9f86e9ef81d2
> Author: Christoph Hellwig <hch@xxxxxx>
> Date: Fri Nov 6 19:19:38 2020 +0100
>
>     RDMA/core: remove use of dma_virt_ops
>
>     [ Upstream commit 5a7a9e038b032137ae9c45d5429f18a2ffdf7d42 ]
>
>     Use the ib_dma_* helpers to skip the DMA translation instead. This
>     removes the last user if dma_virt_ops and keeps the weird layering
>     violation inside the RDMA core instead of burderning the DMA mapping
>     subsystems with it. This also means the software RDMA drivers now don't
>     have to mess with DMA parameters that are not relevant to them at all, and
>     that in the future we can use PCI P2P transfers even for software RDMA, as
>     there is no first fake layer of DMA mapping that the P2P DMA support.
>
>     Link: https://lore.kernel.org/r/20201106181941.1878556-8-hch@xxxxxx
>     Signed-off-by: Christoph Hellwig <hch@xxxxxx>
>     Tested-by: Mike Marciniszyn <mike.marciniszyn@xxxxxxxxxxxxxxxxxxxx>
>     Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
>     Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
>
> The stable backport missed a prereq patch:
>
> commit b116c702791a9834e6485f67ca6267d9fdf59b87
> Author: Christoph Hellwig <hch@xxxxxx>
> Date: Fri Nov 6 19:19:33 2020 +0100
>
>     RDMA/umem: Use ib_dma_max_seg_size instead of dma_get_max_seg_size
>
>     RDMA ULPs must not call DMA mapping APIs directly but instead use the
>     ib_dma_* wrappers.
>
>     Fixes: 0c16d9635e3a ("RDMA/umem: Move to allocate SG table from pages")
>     Link: https://lore.kernel.org/r/20201106181941.1878556-3-hch@xxxxxx
>     Reported-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
>     Signed-off-by: Christoph Hellwig <hch@xxxxxx>
>     Signed-off-by: Jason Gunthorpe <jgg@xxxxxxxxxx>
>
> The missing patch adds the necessary RDMA wrappers to handle the ib_device dma_device member being NULL.
>
> The missing patch picks clean and fixes the issue.
>
> Do you want me to send the stable request?

You just did, now queued up :)

greg k-h
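
For readers following the thread, here is a minimal user-space sketch of why the wrapper matters. It is not the kernel code: the struct layouts and the model_* function names are hypothetical stand-ins, and the UINT_MAX value returned for a device without a dma_device is an assumption about what the real ib_dma_max_seg_size() does in that case. The point is only that the direct call reads through the device pointer unconditionally, while the wrapper checks for NULL first.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct device {
	unsigned int max_segment_size;
};

struct ib_device {
	/* NULL for software RDMA drivers once dma_virt_ops is gone. */
	struct device *dma_device;
};

/*
 * Models the direct DMA API call: it dereferences dev without
 * checking it, so passing NULL faults, as in the oops above.
 */
static inline unsigned int model_dma_get_max_seg_size(const struct device *dev)
{
	return dev->max_segment_size;
}

/*
 * Models the ib_dma_* style wrapper: a device with no dma_device
 * does no real DMA mapping, so no segment size limit applies
 * (UINT_MAX is assumed here) and the pointer is never dereferenced.
 */
static inline unsigned int model_ib_dma_max_seg_size(const struct ib_device *ibdev)
{
	if (ibdev->dma_device == NULL)
		return UINT_MAX;
	return model_dma_get_max_seg_size(ibdev->dma_device);
}
```

With 404fa093741e applied but b116c702791a missing, ib_umem_get() still takes the first path with a NULL dma_device, which matches the faulting read at offset 0x268 in the trace.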