在 2023/10/26 17:05, Zhijian Li (Fujitsu) 写道:
The root cause is that
rxe:rxe_set_page() gets wrong when mr.page_size != PAGE_SIZE where it only stores the *page to xarray.
So the offset will get lost.
It is interesting root cause.
page_size is 0x1000, PAGE_SIZE 0x10000.
Zhu Yanjun
For example,
store process:
page_size = 0x1000;
PAGE_SIZE = 0x10000;
va0 = 0xffff000020651000;
page_offset = 0 = va & (page_size - 1);
page = va_to_page(va);
xa_store(&mr->page_list, mr->nbuf, page, GFP_KERNEL);
load_process:
page = xa_load(&mr->page_list, index);
page_va = kmap_local_page(page) --> it must be a PAGE_SIZE align value, assume it as 0xffff000020650000
va1 = page_va + page_offset = 0xffff000020650000 + 0 = 0xffff000020650000;
Obviously, *va0 != va1*, page_offset get lost.
How to fix:
- revert 325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c")
- don't allow ulp registering mr.page_size != PAGE_SIZE ?
Thanks
Zhijian
On 24/10/2023 17:13, Li Zhijian wrote:
On 24/10/2023 16:15, Zhijian Li (Fujitsu) wrote:
On 23/10/2023 18:45, Yi Zhang wrote:
On Mon, Oct 23, 2023 at 11:52 AM Zhijian Li (Fujitsu)
<lizhijian@xxxxxxxxxxx> wrote:
On 20/10/2023 22:01, Jason Gunthorpe wrote:
On Fri, Oct 20, 2023 at 03:47:05AM +0000, Zhijian Li (Fujitsu) wrote:
CC Bart
On 13/10/2023 20:01, Daisuke Matsuda (Fujitsu) wrote:
On Fri, Oct 13, 2023 10:18 AM Zhu Yanjun wrote:
From: Zhu Yanjun<yanjun.zhu@xxxxxxxxx>
The page_size of mr is set in infiniband core originally. In the commit
325a7eb85199 ("RDMA/rxe: Cleanup page variables in rxe_mr.c"), the
page_size is also set. Sometime this will cause conflict.
I appreciate your prompt action, but I do not think this commit deals with
the root cause. I agree that the problem lies in rxe driver, but what is wrong
with assigning actual page size to ibmr.page_size?
IMO, the problem comes from the device attribute of rxe driver, which is used
in ulp/srp layer to calculate the page_size.
=====
static int srp_add_one(struct ib_device *device)
{
struct srp_device *srp_dev;
struct ib_device_attr *attr = &device->attrs;
<...>
/*
* Use the smallest page size supported by the HCA, down to a
* minimum of 4096 bytes. We're unlikely to build large sglists
* out of smaller entries.
*/
mr_page_shift = max(12, ffs(attr->page_size_cap) - 1);
You light me up.
RXE provides attr.page_size_cap(RXE_PAGE_SIZE_CAP) which means it can support 4K-2G page size
That doesn't seem right even in concept.>
I think the multi-size support in the new xarray code does not work
right, just looking at it makes me think it does not work right. It
looks like it can do less than PAGE_SIZE but more than PAGE_SIZE will
explode because kmap_local_page() does only 4K.
If RXE_PAGE_SIZE_CAP == PAGE_SIZE will everything work?
Yeah, this should work(even though i only verified hardcoding mr_page_shift to PAGE_SHIFT).
Hi Zhijian
Did you try blktests nvme/rdma use_rxe on your environment, it still
failed on my side.
Thanks for your testing.
I just did that, it also failed in my environment. After a glance debug, I found there are
other places still registered 4K MR. I will dig into it later.
nvme intend to register 4K mr, but it should work IMO, at least the RXE have handled it correctly.
1293 static int nvme_rdma_map_sg_fr(struct nvme_rdma_queue *queue,
1294 struct nvme_rdma_request *req, struct nvme_command *c,
1295 int count)
1296 {
1297 struct nvme_keyed_sgl_desc *sg = &c->common.dptr.ksgl;
1298 int nr;
1299
1300 req->mr = ib_mr_pool_get(queue->qp, &queue->qp->rdma_mrs);
1301 if (WARN_ON_ONCE(!req->mr))
1302 return -EAGAIN;
1303
1304 /*
1305 * Align the MR to a 4K page size to match the ctrl page size and
1306 * the block virtual boundary.
1307 */
1308 nr = ib_map_mr_sg(req->mr, req->data_sgl.sg_table.sgl, count, NULL,
1309 SZ_4K);
Anyway, i will go ahead. if you have any thought, please let me know.
Thanks
Zhijian
# use_rxe=1 nvme_trtype=rdma ./check nvme/003
nvme/003 (test if we're sending keep-alives to a discovery controller) [failed]
runtime 12.179s ... 11.941s
--- tests/nvme/003.out 2023-10-22 10:54:43.041749537 -0400
+++ /root/blktests/results/nodev/nvme/003.out.bad 2023-10-23
05:52:27.882759168 -0400
@@ -1,3 +1,3 @@
Running nvme/003
-NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 1 controller(s)
+NQN:nqn.2014-08.org.nvmexpress.discovery disconnected 0 controller(s)
Test complete
[ 7033.431910] rdma_rxe: loaded
[ 7033.456341] run blktests nvme/003 at 2023-10-23 05:52:15
[ 7033.502306] (null): rxe_set_mtu: Set mtu to 1024
[ 7033.510969] infiniband enP2p1s0v0_rxe: set active
[ 7033.510980] infiniband enP2p1s0v0_rxe: added enP2p1s0v0
[ 7033.549301] loop0: detected capacity change from 0 to 2097152
[ 7033.556966] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 7033.566711] nvmet_rdma: enabling port 0 (10.19.240.81:4420)
[ 7033.588605] nvmet: connect attempt for invalid controller ID 0x808
[ 7033.594909] nvme nvme0: Connect Invalid Data Parameter, cntlid: 65535
[ 7033.601504] nvme nvme0: failed to connect queue: 0 ret=16770
[ 7046.317861] rdma_rxe: unloaded
import ctypes
libc = ctypes.cdll.LoadLibrary('libc.so.6')
hex(65536)
'0x10000'
libc.ffs(0x10000) - 1
16
1 << 16
65536
so
mr_page_shift = max(12, ffs(attr->page_size_cap) - 1) = max(12, 16) = 16;
So I think Daisuke's patch should work as well.
https://lore.kernel.org/linux-rdma/OS3PR01MB98652B2EC2E85DAEC6DDE484E5D2A@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/T/#md133060414f0ba6a3dbaf7b4ad2374c8a347cfd1
Jason
--
Best Regards,
Yi Zhang