Re: [PATCH 1/1] RDMA/rxe: Fix blktests srp lead kernel panic with 64k page size

Greg Sword <gregsword0@xxxxxxxxx> · Tue, 31 Oct 2023 16:14:13 +0800

On Tue, Oct 31, 2023 at 3:53 PM Greg Sword <gregsword0@xxxxxxxxx> wrote:
>
>
>
> On Tue, Oct 31, 2023 at 2:03 PM Greg Sword <gregsword0@xxxxxxxxx> wrote:
>>
>>
>>
>>
>> On Tue, Oct 31, 2023 at 9:40 AM Zhu Yanjun <yanjun.zhu@xxxxxxxxx> wrote:
>>>
>>> 在 2023/10/26 17:05, Zhijian Li (Fujitsu) 写道:
>>> > The root cause is that
>>> >
>>> > rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
>>> > So the offset will get lost.
>>>
>>> It is interesting root cause.
>>> page_size is 0x1000, PAGE_SIZE 0x10000.
>>>
>>> Zhu Yanjun
>>>
>>> >
>>> > For example,
>>> > store process:
>>> > page_size = 0x1000;
>>> > PAGE_SIZE = 0x10000;
>>
>>
>> page_size is 0x1000, PAGE_SIZE is 0x10000.
>>
>> Is it correct? page_size should be PAGE_SIZE.
>> That is, page_size should be greater than PAGE_SIZE.
>> It is not reasonable now.
>>
>> The root cause is to find out why page_size is set 4K (0x1000). On ARM64 host withCONFIG_ARM64_64K_PAGES enabled,
>> the page_size should not be 4K, it should be 64K.
>>
>> So Zhijian's commit seems not to be correct.
>
>

linux-rdma not reached. Resend it again.

> Because rxe splits everything into PAGE_SIZE units in the xarray, so it should only support PAGE_SIZE page size. WithCONFIG_ARM64_64K_PAGES,
> PAGE_SIZE is 64K.
>
> But from NVME side, the page size is 4K bytes, it is not 64K.
>
> In the function ib_map_mr_sg is fixed SZ_4K.  So the root cause is not in rxe. Before rxe is called, the capability of rxe should be tested.
>
> static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>                 struct nvme_rdma_request *req, struct nvme_command *c,
>                 int count)
> {
>         struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>         int nr;
>
>         req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
>         if (WARN_ON_ONCE(!req->mr))
>                 return -EAGAIN;
>
>         /*
>          * Align the MR to a 4K page size to match the ctrl page size and
>          * the block virtual boundary.
>          */
>         nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL,
>                           SZ_4K);
>         if (unlikely(nr < count)) {
>                 ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
>                 req->mr = NULL;
>                 if (nr < 0)
>                         return nr;
>                 return -EINVAL;
>         }
>
>         ib_update_fast_reg_key(req->mr, ib_inc_rkey(req->mr->rkey));
>
>>
>>
>>
>>>
>>> > va0 = 0xffff000020651000;
>>> > page_offset = 0 = va & (page_size - 1);
>>> > page = va_to_page(va);
>>> > xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
>>> >
>>> > load_process:
>>> > page = xa_load(&mr->page_list, index);
>>> > page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
>>> > va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
>>> >
>>> > Obviously, *va0 != va1*, page_offset get lost.
>>> >
>>> >
>>> > How to fix:
>>> > - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
>>> > - don't allow ulp registering mr.page_size != PAGE_SIZE ?
>>> >
>>> >
>>> >
>>> > Thanks
>>> > Zhijian
>>> >
>>> >
>>> > On 24/10/2023 17:13, Li Zhijian wrote:
>>> >>
>>> >>
>>> >> On 24/10/2023 16:15, Zhijian Li (Fujitsu) wrote:
>>> >>>
>>> >>>
>>> >>> On 23/10/2023 18:45, Yi Zhang wrote:
>>> >>>> On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu)
>>> >>>> <lizhijian@xxxxxxxxxxx> wrote:
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On 20/10/2023 22:01, Jason Gunthorpe wrote:
>>> >>>>>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
>>> >>>>>>> CC Bart
>>> >>>>>>>
>>> >>>>>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
>>> >>>>>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>> >>>>>>>>> From: Zhu Yanjun<yanjun.zhu@xxxxxxxxx>
>>> >>>>>>>>>
>>> >>>>>>>>> The page_size of mr is set in infiniband core originally. In the commit
>>> >>>>>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>> >>>>>>>>> page_size is also set. Sometime this will cause conflict.
>>> >>>>>>>> I appreciate your prompt action, but I do not think this commit deals with
>>> >>>>>>>> the root cause. I agree that the problem lies in rxe driver, but what is wrong
>>> >>>>>>>> with assigning actual page size to ibmr.page_size?
>>> >>>>>>>>
>>> >>>>>>>> IMO, the problem comes from the device attribute of rxe driver, which is used
>>> >>>>>>>> in ulp/srp layer to calculate the page_size.
>>> >>>>>>>> =====
>>> >>>>>>>> static int srp_add_one(struct ib_device *device)
>>> >>>>>>>> {
>>> >>>>>>>>              struct srp_device *srp_dev;
>>> >>>>>>>>              struct ib_device_attr *attr = &device->attrs;
>>> >>>>>>>> <...>
>>> >>>>>>>>              /*
>>> >>>>>>>>               * Use the smallest page size supported by the HCA, down to a
>>> >>>>>>>>               * minimum of 4096 bytes. We're unlikely to build large sglists
>>> >>>>>>>>               * out of smaller entries.
>>> >>>>>>>>               */
>>> >>>>>>>>              mr_page_shift           = max(12, ffs(attr->page_size_cap) - 1);
>>> >>>>>>>
>>> >>>>>>>
>>> >>>>>>> You light me up.
>>> >>>>>>> RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size
>>> >>>>>>
>>> >>>>>> That doesn't seem right even in concept.>
>>> >>>>>> I think the multi-size support in the new xarray code does not work
>>> >>>>>> right, just looking at it makes me think it does not work right. It
>>> >>>>>> looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will
>>> >>>>>> explode because kmap_local_page() does only 4K.
>>> >>>>>>
>>> >>>>>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE  will everything work?
>>> >>>>>>
>>> >>>>>
>>> >>>>> Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT).
>>> >>>>
>>> >>>> Hi Zhijian
>>> >>>>
>>> >>>> Did you try blktests nvme/rdma use_rxe on your environment, it still
>>> >>>> failed on my side.
>>> >>>>
>>> >>>
>>> >>> Thanks for your testing.
>>> >>> I just did that, it also failed in my environment. After a glance debug, I found there are
>>> >>> other places still registered 4K MR. I will dig into it later.
>>> >>
>>> >>
>>> >> nvme intend to register 4K mr, but it should work IMO, at least the RXE have handled it correctly.
>>> >>
>>> >>
>>> >> 1293 static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>>> >> 1294                 struct nvme_rdma_request *req, struct nvme_command *c,
>>> >> 1295                 int count)
>>> >> 1296 {
>>> >> 1297         struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>>> >> 1298         int nr;
>>> >> 1299
>>> >> 1300         req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
>>> >> 1301         if (WARN_ON_ONCE(!req->mr))
>>> >> 1302                 return -EAGAIN;
>>> >> 1303
>>> >> 1304         /*
>>> >> 1305          * Align the MR to a 4K page size to match the ctrl page size and
>>> >> 1306          * the block virtual boundary.
>>> >> 1307          */
>>> >> 1308         nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL,
>>> >> 1309                           SZ_4K);
>>> >>
>>> >>
>>> >> Anyway, i will go ahead. if you have any thought, please let me know.
>>> >>
>>> >>
>>> >>
>>> >>>
>>> >>> Thanks
>>> >>> Zhijian
>>> >>>
>>> >>>
>>> >>>
>>> >>>
>>> >>>> # use_rxe=1 nvme_trtype=rdma  ./check nvme/003
>>> >>>> nvme/003 (test if we're sending keep-alives to a discovery controller) [failed]
>>> >>>>        runtime  12.179s  ...  11.941s
>>> >>>>        --- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400
>>> >>>>        +++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23
>>> >>>> 05:52:27.882759168 -0400
>>> >>>>        @@ -1,3 +1,3 @@
>>> >>>>         Running nvme/003
>>> >>>>        -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s)
>>> >>>>        +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s)
>>> >>>>         Test complete
>>> >>>>
>>> >>>> [ 7033.431910] rdma_rxe: loaded
>>> >>>> [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15
>>> >>>> [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024
>>> >>>> [ 7033.510969] infiniband enP2p1s0v0_rxe: set active
>>> >>>> [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0
>>> >>>> [ 7033.549301] loop0: detected capacity change from 0 to 2097152
>>> >>>> [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> >>>> [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420)
>>> >>>> [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808
>>> >>>> [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535
>>> >>>> [ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770
>>> >>>> [ 7046.317861] rdma_rxe: unloaded
>>> >>>>
>>> >>>>
>>> >>>>>
>>> >>>>>>>> import ctypes
>>> >>>>>>>> libc = ctypes.cdll.LoadLibrary('libc.so.6')
>>> >>>>>>>> hex(65536)
>>> >>>>> '0x10000'
>>> >>>>>>>> libc.ffs(0x10000) - 1
>>> >>>>> 16
>>> >>>>>>>> 1 << 16
>>> >>>>> 65536
>>> >>>>>
>>> >>>>> so
>>> >>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16;
>>> >>>>>
>>> >>>>>
>>> >>>>> So I think Daisuke's patch should work as well.
>>> >>>>>
>>> >>>>> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1
>>> >>>>>
>>> >>>>>
>>> >>>>>> Jason
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> --
>>> >>>> Best Regards,
>>> >>>>      Yi Zhang
>>>