On Wed, Aug 18, 2021 at 2:43 PM yangx.jy@xxxxxxxxxxx <yangx.jy@xxxxxxxxxxx> wrote:
>
> On 2021/8/17 10:28, Zhu Yanjun wrote:
> > On Fri, Aug 6, 2021 at 10:37 AM Olga Kornievskaia <aglo@xxxxxxxxx> wrote:
> >> On Wed, Aug 4, 2021 at 5:05 AM Zhu Yanjun <zyjzyj2000@xxxxxxxxx> wrote:
> >>> On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >>>> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
> >>>>> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@xxxxxxxxx> wrote:
> >>>>>> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@xxxxxxxxxx> wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Can you please help me to understand the RXE status in the upstream?
> >>>>>>>
> >>>>>>> Do we still have crashes/interop issues/etc.?
> >>>>>> I made some developments with the RXE in the upstream, from my usage
> >>>>>> with latest RXE,
> >>>>>> I found the following:
> >>>>>>
> >>>>>> 1. rdma-core can not work well with latest RDMA git;
> >>>>> The latest RDMA git is
> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
> >>>> "Latest" is a relative term, what SHA did you test?
> >>>> Let's focus on fixing RXE before we will continue with new features.
> >>> Thanks a lot. I agree with you.
> >> I believe simple rping still doesn't work linux-to-linux. The last
> >> working version (of rping in rxe) was 5.13 I think. I have posted a
> >> number of crashes rping encounters (gotta get that working before I
> >> can even try NFSoRDMA).
> > The following are my tests.
> >
> > 1. modprobe rdma_rxe
> > 2. modprobe -v -r rdma_rxe
> > 3. rdma link add rxe
> > 4. rdma link del rxe
> > 5. Latest rdma-core && latest kernel upstream;
> > 6. Latest kernel <------rping----> 5.10.y stable
> > 7. Latest kernel <------rping----> 5.11.y stable
> > 8. Latest kernel <------rping----> 5.12.y stable
> > 9. Latest kernel <------rping----> 5.13.y stable
> >
> > It seems that the latest kernel upstream (5.14-rc6) can rping other
> > stable kernels.
> > Can you make tests again?
> >
> > Zhu Yanjun
> Hi,
>
> I still get two similar panics with rping or rdma_client/server on latest kernel vs 5.13:
> Panic1:
> --------------------------------------------------------
> [ 268.248642] BUG: unable to handle page fault for address: ffff9ae2c07a1414
> [ 268.251049] #PF: supervisor read access in kernel mode
> [ 268.252491] #PF: error_code(0x0000) - not-present page
> [ 268.253919] PGD 1000067 P4D 1000067 PUD 0
> [ 268.255052] Oops: 0000 [#1] SMP PTI
> [ 268.256055] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1
> [ 268.257893] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
> [ 268.259995] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
> [ 268.261114] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7 <44> 8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b
> [ 268.265005] RSP: 0018:ffff9ae2404108b8 EFLAGS: 00010202
> [ 268.266145] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8f7a8bf9da76
> [ 268.267703] RDX: ffff9ae2c07a1410 RSI: 0000000000000001 RDI: ffff8f7a71874400
> [ 268.269291] RBP: ffff8f7a482f87cc R08: 0000000000000010 R09: 0000000000000000
> [ 268.270871] R10: 00000000000000cb R11: 0000000000000001 R12: ffff8f7a482f8000
> [ 268.272468] R13: ffff8f7a8c038928 R14: ffff8f7a482f8008 R15: 0000000000000010
> [ 268.274080] FS: 0000000000000000(0000) GS:ffff8f7abec00000(0000) knlGS:0000000000000000
> [ 268.275899] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 268.277205] CR2: ffff9ae2c07a1414 CR3: 000000000263c002 CR4: 0000000000060ee0
> [ 268.278825] Call Trace:
> [ 268.279358] <IRQ>
> [ 268.279747] rxe_responder+0x11b1/0x2490 [rdma_rxe]
> [ 268.280798] rxe_do_task+0x9c/0xe0 [rdma_rxe]
> [ 268.281895] rxe_rcv+0x286/0x8e0 [rdma_rxe]
> ...
> ------------------------------------------------------
>
> Panic2:
> --------------------------------------------------------
> [ 212.526854] BUG: unable to handle page fault for address: ffffbb97142acc14
> [ 212.530688] #PF: supervisor read access in kernel mode
> [ 212.533030] #PF: error_code(0x0000) - not-present page
> [ 212.535428] PGD 1000067 P4D 1000067 PUD 0
> [ 212.536970] Oops: 0000 [#1] SMP PTI
> [ 212.537748] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1
> [ 212.538984] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
> [ 212.540853] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
> [ 212.541957] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7 <44> 8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b
> [ 212.546041] RSP: 0018:ffffbb9640410898 EFLAGS: 00010202
> [ 212.547200] RAX: ffffbb97142acc00 RBX: ffff95510ee6d000 RCX: ffff95510a802076
> [ 212.548782] RDX: ffffbb97142acc10 RSI: 0000000000000001 RDI: ffff95510ca00700
> [ 212.550369] RBP: 0000000000000010 R08: 0000000000000010 R09: 0000000000000000
> [ 212.551992] R10: 0000000000000001 R11: 0000000000000001 R12: ffff95510a802076
> [ 212.553613] R13: ffff9550f29acd28 R14: ffff95510ee6d008 R15: 0000000000000010
> [ 212.555225] FS: 0000000000000000(0000) GS:ffff95513ec00000(0000) knlGS:0000000000000000
> [ 212.556749] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 212.557846] CR2: ffffbb97142acc14 CR3: 0000000003b66005 CR4: 0000000000060ee0
> [ 212.559177] Call Trace:
> [ 212.559655] <IRQ>
> [ 212.560055] send_data_in+0x55/0x73 [rdma_rxe]
> [ 212.560903] rxe_responder.cold+0xea/0x1f8 [rdma_rxe]
> [ 212.561865] rxe_do_task+0x9c/0xe0 [rdma_rxe]
> [ 212.562699] rxe_rcv+0x286/0x8e0 [rdma_rxe]
> ...
> ------------------------------------------------------
>
> Note: it is easy to reproduce the panic on the latest kernel.

Can you let me know how to reproduce the panic?

1. linux upstream <----rping----> linux upstream?
2. Just run rping?
3. How do you create rxe? With rdma link or rxe_cfg?
4. Do you perform any other operations?
5. Other operations?

Thanks.
Zhu Yanjun
>
> Best Regards,
> Xiao Yang
>
> >
> >> Thank you for working on the code.
> >>
> >> We (NFS community) do test NFSoRDMA every git pull using rxe and siw
> >> but lately have been encountering problems.
> >>
> >>> rdma-core:
> >>> 313509f8 (HEAD -> master, origin/master, origin/HEAD) Merge pull
> >>> request #1038 from selvintxavier/master
> >>> 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr
> >>> 327d45e0 tests: Add missing MAC element to args list
> >>> 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices
> >>> 8754fb51 bnxt_re/lib: Use separate indices for shadow queue
> >>> be4d8abf bnxt_re/lib: add a function to initialize software queue
> >>>
> >>> kernel rdma:
> >>> 0050a57638ca (HEAD -> for-next, origin/for-next, origin/HEAD)
> >>> RDMA/qedr: Improve error logs for rdma_alloc_tid error return
> >>> 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc
> >>> 991c4274dc17 RDMA/hfi1: Fix typo in comments
> >>> 8d7e415d5561 docs: Fix infiniband uverbs minor number
> >>> bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure
> >>> that requests are valid
> >>> bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting
> >>> e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails
> >>> a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on
> >>> hfi1_devdata->user_refcount
> >>>
> >>> With the above kernel and rdma-core, the following messages will appear.
> >>> "
> >>> [ 54.214608] rdma_rxe: loaded
> >>> [ 54.217089] infiniband rxe0: set active
> >>> [ 54.217101] infiniband rxe0: added enp0s8
> >>> [ 167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
> >>> [ 167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
> >>> [ 167.733297] rdma_rxe: cqe(32768) > max_cqe(32767)
> >>> [ 169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
> >>> [ 169.074796] rdma_rxe: qp#27 moved to error state
> >>> [ 169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
> >>> [ 169.138889] rdma_rxe: qp#30 moved to error state
> >>> [ 169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
> >>> [ 169.160601] rdma_rxe: qp#31 moved to error state
> >>> [ 169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
> >>> [ 169.182170] rdma_rxe: qp#32 moved to error state
> >>> [ 169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
> >>> [ 169.667850] rdma_rxe: qp#39 moved to error state
> >>> [ 198.872649] rdma_rxe: cqe(32768) > max_cqe(32767)
> >>> [ 198.894829] rdma_rxe: cqe(1) < current # elements in queue (6)
> >>> [ 198.981839] rdma_rxe: cqe(32768) > max_cqe(32767)
> >>> [ 200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887
> >>> [ 200.332086] rdma_rxe: qp#58 moved to error state
> >>> [ 200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d
> >>> [ 200.396514] rdma_rxe: qp#61 moved to error state
> >>> [ 200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40
> >>> [ 200.417956] rdma_rxe: qp#62 moved to error state
> >>> [ 200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24
> >>> [ 200.439654] rdma_rxe: qp#63 moved to error state
> >>> [ 200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8
> >>> [ 200.933153] rdma_rxe: qp#70 moved to error state
> >>> [ 206.880305] rdma_rxe: cqe(32768) > max_cqe(32767)
> >>> [ 206.904030] rdma_rxe: cqe(1) < current # elements in queue (6)
> >>> [ 206.991494] rdma_rxe: cqe(32768) > max_cqe(32767)
> >>> [ 208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d
> >>> [ 208.360028] rdma_rxe: qp#89 moved to error state
> >>> [ 208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136
> >>> [ 208.425675] rdma_rxe: qp#92 moved to error state
> >>> [ 208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8
> >>> [ 208.447370] rdma_rxe: qp#93 moved to error state
> >>> [ 208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a
> >>> [ 208.469550] rdma_rxe: qp#94 moved to error state
> >>> [ 208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670
> >>> [ 208.956731] rdma_rxe: qp#100 moved to error state
> >>> [ 216.879703] rdma_rxe: cqe(32768) > max_cqe(32767)
> >>> [ 216.902199] rdma_rxe: cqe(1) < current # elements in queue (6)
> >>> [ 216.989264] rdma_rxe: cqe(32768) > max_cqe(32767)
> >>> [ 218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6
> >>> [ 218.363808] rdma_rxe: qp#119 moved to error state
> >>> [ 218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4
> >>> [ 218.429513] rdma_rxe: qp#122 moved to error state
> >>> [ 218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895
> >>> [ 218.451481] rdma_rxe: qp#123 moved to error state
> >>> [ 218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910
> >>> [ 218.473908] rdma_rxe: qp#124 moved to error state
> >>> [ 218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b
> >>> [ 218.963641] rdma_rxe: qp#130 moved to error state
> >>> [ 233.855140] rdma_rxe: cqe(32768) > max_cqe(32767)
> >>> [ 233.877202] rdma_rxe: cqe(1) < current # elements in queue (6)
> >>> [ 233.963952] rdma_rxe: cqe(32768) > max_cqe(32767)
> >>> [ 235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2
> >>> [ 235.305319] rdma_rxe: qp#149 moved to error state
> >>> [ 235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8
> >>> [ 235.368838] rdma_rxe: qp#152 moved to error state
> >>> [ 235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d
> >>> [ 235.390192] rdma_rxe: qp#153 moved to error state
> >>> [ 235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c
> >>> [ 235.411374] rdma_rxe: qp#154 moved to error state
> >>> [ 235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482
> >>> [ 235.895828] rdma_rxe: qp#161 moved to error state
> >>> "
> >>> Not sure if they are problems.
> >>> IMO, we should make further investigations.
> >>>
> >>> Thanks
> >>> Zhu Yanjun
> >>>> Thanks
> >
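
For the further investigation suggested in the quoted dmesg output, it may help to tally the rdma_rxe messages by kind before reading them one by one. The following is only a minimal sketch in Python: the category names (`cqe_over_max`, `no_mw_for_rkey`, and so on) and the `summarize` helper are made up for illustration and are not kernel or rdma-core identifiers, and the sample lines are a short excerpt of the log quoted above.

```python
import re
from collections import Counter

# Message families seen in the quoted rdma_rxe log.
# The dictionary keys are labels invented for this sketch.
PATTERNS = {
    "cqe_over_max": re.compile(r"rdma_rxe: cqe\(\d+\)\s*>\s*max_cqe\(\d+\)"),
    "cqe_under_queue": re.compile(r"rdma_rxe: cqe\(\d+\)\s*<\s*current # elements in queue"),
    "no_mw_for_rkey": re.compile(r"rdma_rxe: check_rkey: no MW matches rkey 0x[0-9a-f]+"),
    "no_mr_for_rkey": re.compile(r"rdma_rxe: check_rkey: no MR matches rkey 0x[0-9a-f]+"),
    "qp_to_error": re.compile(r"rdma_rxe: qp#\d+ moved to error state"),
}


def summarize(lines):
    """Count how many log lines fall into each message family."""
    counts = Counter()
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break  # each line belongs to at most one family
    return counts


# Short excerpt of the quoted log, used as sample input.
SAMPLE = """\
[ 167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
[ 167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
[ 169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
[ 169.074796] rdma_rxe: qp#27 moved to error state
[ 169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
[ 169.667850] rdma_rxe: qp#39 moved to error state
""".splitlines()

if __name__ == "__main__":
    for name, count in summarize(SAMPLE).most_common():
        print(f"{name}: {count}")
```

Feeding it the saved output of `dmesg` instead of `SAMPLE` would show whether every "moved to error state" line is paired with a preceding `check_rkey` failure, which is the pattern the quoted log suggests.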