On Tue, Oct 31, 2023 at 3:53 PM Greg Sword <gregsword0@xxxxxxxxx> wrote:
>
> On Tue, Oct 31, 2023 at 2:03 PM Greg Sword <gregsword0@xxxxxxxxx> wrote:
>>
>> On Tue, Oct 31, 2023 at 9:40 AM Zhu Yanjun <yanjun.zhu@xxxxxxxxx> wrote:
>>>
>>> On 2023/10/26 17:05, Zhijian Li (Fujitsu) wrote:
>>> > The root cause is that
>>> >
>>> > rxe_set_page() goes wrong when mr.page_size != PAGE_SIZE, because it only stores the *page in the xarray,
>>> > so the offset gets lost.
>>>
>>> That is an interesting root cause.
>>> page_size is 0x1000, PAGE_SIZE is 0x10000.
>>>
>>> Zhu Yanjun
>>>
>>> >
>>> > For example,
>>> > store process:
>>> > page_size = 0x1000;
>>> > PAGE_SIZE = 0x10000;
>>
>> page_size is 0x1000, PAGE_SIZE is 0x10000.
>>
>> Is that correct? page_size should be PAGE_SIZE.
>> That is, page_size should not be less than PAGE_SIZE.
>> It is not reasonable now.
>>
>> The root cause is to find out why page_size is set to 4K (0x1000). On an ARM64 host with CONFIG_ARM64_64K_PAGES enabled,
>> the page_size should not be 4K; it should be 64K.
>>
>> So Zhijian's commit does not seem to be correct.
>
> This did not reach linux-rdma, so I am resending it.
>
> Because rxe splits everything into PAGE_SIZE units in the xarray, it can only support a page size of PAGE_SIZE. With CONFIG_ARM64_64K_PAGES,
> PAGE_SIZE is 64K.
>
> But on the NVMe side, the page size is 4K, not 64K.
>
> The function ib_map_mr_sg() is called with a fixed SZ_4K. So the root cause is not in rxe. Before rxe is called, the capability of rxe should be checked.
>
> static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>                 struct nvme_rdma_request *req, struct nvme_command *c,
>                 int count)
> {
>         struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>         int nr;
>
>         req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
>         if (WARN_ON_ONCE(!req->mr))
>                 return -EAGAIN;
>
>         /*
>          * Align the MR to a 4K page size to match the ctrl page size and
>          * the block virtual boundary.
>          */
>         nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL,
>                         SZ_4K);
>         if (unlikely(nr < count)) {
>                 ib_mr_pool_put(queue->qp, &queue->qp->rdma_mrs, req->mr);
>                 req->mr = NULL;
>                 if (nr < 0)
>                         return nr;
>                 return -EINVAL;
>         }
>
>         ib_update_fast_reg_key(req->mr, ib_inc_rkey(req->mr->rkey));
>
>>
>>>
>>> > va0 = 0xffff000020651000;
>>> > page_offset = 0 = va & (page_size - 1);
>>> > page = va_to_page(va);
>>> > xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
>>> >
>>> > load process:
>>> > page = xa_load(&mr->page_list, index);
>>> > page_va = kmap_local_page(page); --> this must be a PAGE_SIZE-aligned value, assume it is 0xffff000020650000
>>> > va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
>>> >
>>> > Obviously, *va0 != va1*; the page_offset gets lost.
>>> >
>>> > How to fix:
>>> > - revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
>>> > - don't allow a ULP to register an MR with page_size != PAGE_SIZE?
>>> >
>>> > Thanks
>>> > Zhijian
>>> >
>>> > On 24/10/2023 17:13, Li Zhijian wrote:
>>> >>
>>> >> On 24/10/2023 16:15, Zhijian Li (Fujitsu) wrote:
>>> >>>
>>> >>> On 23/10/2023 18:45, Yi Zhang wrote:
>>> >>>> On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu)
>>> >>>> <lizhijian@xxxxxxxxxxx> wrote:
>>> >>>>>
>>> >>>>> On 20/10/2023 22:01, Jason Gunthorpe wrote:
>>> >>>>>> On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
>>> >>>>>>> CC Bart
>>> >>>>>>>
>>> >>>>>>> On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
>>> >>>>>>>> On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
>>> >>>>>>>>> From: Zhu Yanjun <yanjun.zhu@xxxxxxxxx>
>>> >>>>>>>>>
>>> >>>>>>>>> The page_size of the mr is originally set in the infiniband core. In commit
>>> >>>>>>>>> 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
>>> >>>>>>>>> page_size is also set. Sometimes this will cause a conflict.
>>> >>>>>>>> I appreciate your prompt action, but I do not think this commit deals with
>>> >>>>>>>> the root cause. I agree that the problem lies in the rxe driver, but what is wrong
>>> >>>>>>>> with assigning the actual page size to ibmr.page_size?
>>> >>>>>>>>
>>> >>>>>>>> IMO, the problem comes from the device attribute of the rxe driver, which is used
>>> >>>>>>>> in the ulp/srp layer to calculate the page_size.
>>> >>>>>>>> =====
>>> >>>>>>>> static int srp_add_one(struct ib_device *device)
>>> >>>>>>>> {
>>> >>>>>>>>         struct srp_device *srp_dev;
>>> >>>>>>>>         struct ib_device_attr *attr = &device->attrs;
>>> >>>>>>>> <...>
>>> >>>>>>>>         /*
>>> >>>>>>>>          * Use the smallest page size supported by the HCA, down to a
>>> >>>>>>>>          * minimum of 4096 bytes. We're unlikely to build large sglists
>>> >>>>>>>>          * out of smaller entries.
>>> >>>>>>>>          */
>>> >>>>>>>>         mr_page_shift = max(12, ffs(attr->page_size_cap) - 1);
>>> >>>>>>>
>>> >>>>>>> You lit me up.
>>> >>>>>>> RXE provides attr.page_size_cap (RXE_PAGE_SIZE_CAP), which means it claims to support 4K-2G page sizes.
>>> >>>>>>
>>> >>>>>> That doesn't seem right even in concept.
>>> >>>>>>
>>> >>>>>> I think the multi-size support in the new xarray code does not work
>>> >>>>>> right; just looking at it makes me think it does not work right. It
>>> >>>>>> looks like it can do less than PAGE_SIZE, but more than PAGE_SIZE will
>>> >>>>>> explode because kmap_local_page() does only 4K.
>>> >>>>>>
>>> >>>>>> If RXE_PAGE_SIZE_CAP == PAGE_SIZE, will everything work?
>>> >>>>>
>>> >>>>> Yeah, this should work (even though I only verified it by hardcoding mr_page_shift to PAGE_SHIFT).
>>> >>>>
>>> >>>> Hi Zhijian
>>> >>>>
>>> >>>> Did you try blktests nvme/rdma use_rxe in your environment? It still
>>> >>>> fails on my side.
>>> >>>
>>> >>> Thanks for your testing.
>>> >>> I just did that; it also failed in my environment. After a quick debug, I found there are
>>> >>> other places that still register a 4K MR. I will dig into it later.
>>> >>
>>> >> nvme intends to register a 4K MR, but that should work IMO; at least RXE should have handled it correctly.
>>> >>
>>> >> 1293 static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
>>> >> 1294                 struct nvme_rdma_request *req, struct nvme_command *c,
>>> >> 1295                 int count)
>>> >> 1296 {
>>> >> 1297         struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
>>> >> 1298         int nr;
>>> >> 1299
>>> >> 1300         req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
>>> >> 1301         if (WARN_ON_ONCE(!req->mr))
>>> >> 1302                 return -EAGAIN;
>>> >> 1303
>>> >> 1304         /*
>>> >> 1305          * Align the MR to a 4K page size to match the ctrl page size and
>>> >> 1306          * the block virtual boundary.
>>> >> 1307          */
>>> >> 1308         nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL,
>>> >> 1309                         SZ_4K);
>>> >>
>>> >> Anyway, I will go ahead. If you have any thoughts, please let me know.
>>> >>
>>> >>> Thanks
>>> >>> Zhijian
>>> >>>
>>> >>>> # use_rxe=1 nvme_trtype=rdma ./check nvme/003
>>> >>>> nvme/003 (test if we're sending keep-alives to a discovery controller) [failed]
>>> >>>>     runtime  12.179s  ...
>>> >>>>     11.941s
>>> >>>> --- tests/nvme/003.out          2023-10-22 10:54:43.041749537 -0400
>>> >>>> +++ /root/blktests/results/nodev/nvme/003.out.bad  2023-10-23 05:52:27.882759168 -0400
>>> >>>> @@ -1,3 +1,3 @@
>>> >>>>  Running nvme/003
>>> >>>> -NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s)
>>> >>>> +NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s)
>>> >>>>  Test complete
>>> >>>>
>>> >>>> [ 7033.431910] rdma_rxe: loaded
>>> >>>> [ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15
>>> >>>> [ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024
>>> >>>> [ 7033.510969] infiniband enP2p1s0v0_rxe: set active
>>> >>>> [ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0
>>> >>>> [ 7033.549301] loop0: detected capacity change from 0 to 2097152
>>> >>>> [ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
>>> >>>> [ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420)
>>> >>>> [ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808
>>> >>>> [ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535
>>> >>>> [ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770
>>> >>>> [ 7046.317861] rdma_rxe: unloaded
>>> >>>>
>>> >>>>> >>> import ctypes
>>> >>>>> >>> libc = ctypes.cdll.LoadLibrary('libc.so.6')
>>> >>>>> >>> hex(65536)
>>> >>>>> '0x10000'
>>> >>>>> >>> libc.ffs(0x10000) - 1
>>> >>>>> 16
>>> >>>>> >>> 1 << 16
>>> >>>>> 65536
>>> >>>>>
>>> >>>>> so
>>> >>>>> mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16;
>>> >>>>>
>>> >>>>> So I think Daisuke's patch should work as well.
>>> >>>>>
>>> >>>>> https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1
>>> >>>>>
>>> >>>>>> Jason
>>> >>>>
>>> >>>> --
>>> >>>> Best Regards,
>>> >>>> Yi Zhang