Hi Chuck,

Following is the complete call trace of a typical NFS-RDMA transaction while mounting a share. Calling post_send on a QP that has not been created (or is no longer connected) must be avoided, so checking the connection state while registering/deregistering FRMRs on the fly is a must: an unconnected QP means post_send/post_recv must not be called from any context.

call_start nfs4 proc GETATTR (sync)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 call_reserve (status 0)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 reserved req ffff8804678b8800 xid 53abc98d
Apr 23 20:00:34 neo03-el64 kernel: RPC: rpcrdma_event_process: event rep ffff88046230f980 status 0 opcode 7 length 48
Apr 23 20:00:34 neo03-el64 kernel: RPC: wake_up_next(ffff880465ae6190 "xprt_sending")
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 call_reserveresult (status 0)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 call_refresh (status 0)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 looking up UNIX cred
Apr 23 20:00:34 neo03-el64 kernel: RPC: looking up UNIX cred
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 refreshing UNIX cred ffff880467b2cec0
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 call_refreshresult (status 0)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 call_allocate (status 0)
Apr 23 20:00:34 neo03-el64 kernel: RPC: xprt_rdma_allocate: size 1052 too large for buffer[1024]: prog 100003 vers 4 proc 1
Apr 23 20:00:34 neo03-el64 kernel: RPC: xprt_rdma_allocate: size 1052, request 0xffff8804650e2000
------------->>>>>> The buffer from the pre-created buffer pool (1024 bytes) is too small to hold the requested size, so <<<<<<<-----------------
------------->>>>>> a new buffer is allocated instead; do the bookkeeping and create a phys_mr for the newly allocated buffer.
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 call_bind (status 0)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 call_connect xprt ffff880465ae6000 is connected
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 call_transmit (status 0)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 xprt_prepare_transmit
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 xprt_cwnd_limited cong = 0 cwnd = 4096
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 rpc_xdr_encode (status 0)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 marshaling UNIX cred ffff880467b2cec0
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 using AUTH_UNIX cred ffff880467b2cec0 to wrap rpc data
Apr 23 20:00:34 neo03-el64 kernel: encode_compound: tag=
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 xprt_transmit(120)
Apr 23 20:00:34 neo03-el64 kernel: RPC: rpcrdma_inline_pullup: pad 0 destp 0xffff8804650e37d8 len 120 hdrlen 120
Apr 23 20:00:34 neo03-el64 kernel: RPC: rpcrdma_register_frmr_external: Using frmr ffff88046230ef30 to map 1 segments
---------------->>>>>>>>> This is where post_send is called for FRMR creation. Even if the xprt is not connected, the post_send call still goes ahead with the FRMR creation.
--------------->>>>>>>>>> Proposed fix: if the QP is connected, call post_send; otherwise fail the registration call, return the buffers to the pools, and start over from call_bind() at the RPC layer (see the sketch below).
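Something along these lines is what I have in mind. This is only a minimal sketch, not the actual patch: ep->rep_connected and ia->ri_id follow the existing xprtrdma structures, but the helper name is made up.

#include <linux/errno.h>
#include <rdma/ib_verbs.h>
#include <rdma/rdma_cm.h>
#include "xprt_rdma.h"

/* Gate every FRMR registration post_send on the transport state. */
static int
rpcrdma_post_frmr_checked(struct rpcrdma_ep *ep, struct rpcrdma_ia *ia,
			  struct ib_send_wr *frmr_wr)
{
	struct ib_send_wr *bad_wr;

	/* The QP must exist and the CM must have reported ESTABLISHED. */
	if (ep->rep_connected != 1 || !ia->ri_id || !ia->ri_id->qp)
		return -ENOTCONN;

	return ib_post_send(ia->ri_id->qp, frmr_wr, &bad_wr);
}

On -ENOTCONN the registration fails cleanly, the caller returns the buffers to their pools, and the RPC restarts from call_bind().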
Apr 23 20:00:34 neo03-el64 kernel: RPC: rpcrdma_create_chunks: reply chunk elem 592@0x4650e392c:0x805f505 (last)
Apr 23 20:00:34 neo03-el64 kernel: RPC: rpcrdma_marshal_req: reply chunk: hdrlen 48 rpclen 120 padlen 0 headerp 0xffff8804650e3100 base 0xffff8804650e3760 lkey 0x0
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 xmit complete
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 sleep_on(queue "xprt_pending" time 4296808435)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 added to queue ffff880465ae6258 "xprt_pending"
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 setting alarm for 60000 ms
Apr 23 20:00:34 neo03-el64 kernel: RPC: wake_up_next(ffff880465ae6190 "xprt_sending")
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 sync task going to sleep
Apr 23 20:00:34 neo03-el64 kernel: RPC: rpcrdma_event_process: event rep ffff88046230ef30 status 0 opcode 8 length 48
Apr 23 20:00:34 neo03-el64 kernel: RPC: rpcrdma_event_process: event rep ffff88046502a000 status 0 opcode 80 length 48
---------------->>>>>>>>>> If the completion status is a flush, update the QP connection state immediately; don't wait for the tasklet to be scheduled (see the sketch below).
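The flush handling could look roughly like this. A sketch only: it assumes the completion handler can reach the owning rpcrdma_ep, and reuses rep_connected as the state being updated.

/* On a flushed completion, record the disconnect immediately so that
 * no other context posts to the dead QP while the connection-event
 * tasklet is still waiting to run.
 */
static void
rpcrdma_process_wc(struct ib_wc *wc, struct rpcrdma_ep *ep)
{
	if (wc->status == IB_WC_WR_FLUSH_ERR) {
		ep->rep_connected = -ENOTCONN;
		return;
	}

	/* ... normal send/recv completion processing ... */
}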
Apr 23 20:00:34 neo03-el64 kernel: RPC: rpcrdma_reply_handler: reply 0xffff88046502a000 completes request 0xffff8804650e2000
Apr 23 20:00:34 neo03-el64 kernel: RPC request 0xffff8804678b8800 xid 0x8dc9ab53
Apr 23 20:00:34 neo03-el64 kernel: RPC: rpcrdma_count_chunks: chunk 212@0x4650e392c:0x805f505
Apr 23 20:00:34 neo03-el64 kernel: RPC: rpcrdma_reply_handler: xprt_complete_rqst(0xffff880465ae6000, 0xffff8804678b8800, 212)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 xid 53abc98d complete (212 bytes received)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 __rpc_wake_up_task (now 4296808436)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 disabling timer
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 removed from queue ffff880465ae6258 "xprt_pending"
Apr 23 20:00:34 neo03-el64 kernel: RPC: __rpc_wake_up_task done
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 sync task resuming
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 call_status (status 212)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 call_decode (status 212)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 validating UNIX cred ffff880467b2cec0
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 using AUTH_UNIX cred ffff880467b2cec0 to unwrap rpc data
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_type: type=040000
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_change: change attribute=952326959717679104
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_size: file size=4096
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_fsid: fsid=(0x0/0x0)
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_fileid: fileid=2
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_fs_locations: fs_locations done, error = 0
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_mode: file mode=0555
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_nlink: nlink=32
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_owner: uid=0
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_group: gid=0
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_rdev: rdev=(0x0:0x0)
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_space_used: space used=8192
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_time_access: atime=1398288115
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_time_metadata: ctime=1398290189
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_time_modify: mtime=1398290189
Apr 23 20:00:34 neo03-el64 kernel: decode_attr_mounted_on_fileid: fileid=0
Apr 23 20:00:34 neo03-el64 kernel: decode_getfattr_attrs: xdr returned 0
Apr 23 20:00:34 neo03-el64 kernel: decode_getfattr_generic: xdr returned 0
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 call_decode result 0
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 return 0, status 0
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 release task
Apr 23 20:00:34 neo03-el64 kernel: RPC: wake_up_next(ffff880465ae6190 "xprt_sending")
Apr 23 20:00:34 neo03-el64 kernel: RPC: xprt_rdma_free: called on 0xffff88046502a000
--------->>>>> xprt_rdma_free calls ib_post_send irrespective of the QP connection state; apply the check here as well.
------------->>>>>>>>> xprt_rdma_free internally tries to invalidate FRMRs. If the QP is not connected, free the buffer without invalidation and set frmr.state = INVALID (see the sketch below).
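That is, something along these lines. Again a sketch only: the fr_state member and the FRMR_IS_INVALID state are proposed bookkeeping, not existing xprtrdma fields.

#include <linux/string.h>
#include <rdma/ib_verbs.h>
#include "xprt_rdma.h"

/* Proposed bookkeeping: an FRMR marked INVALID is re-registered on
 * its next use rather than invalidated over the wire.
 */
enum rpcrdma_frmr_state { FRMR_IS_VALID, FRMR_IS_INVALID };

static void
rpcrdma_deregister_frmr_checked(struct rpcrdma_mw *mw,
				struct rpcrdma_ep *ep,
				struct rpcrdma_ia *ia)
{
	struct ib_send_wr invalidate_wr, *bad_wr;

	if (ep->rep_connected != 1 || !ia->ri_id || !ia->ri_id->qp) {
		/* QP is gone: skip LOCAL_INV, just mark the FRMR stale
		 * and let the buffer go back to the pool.
		 */
		mw->r.frmr.fr_state = FRMR_IS_INVALID;
		return;
	}

	memset(&invalidate_wr, 0, sizeof(invalidate_wr));
	invalidate_wr.opcode = IB_WR_LOCAL_INV;
	invalidate_wr.send_flags = IB_SEND_SIGNALED;
	invalidate_wr.ex.invalidate_rkey = mw->r.frmr.fr_mr->rkey;
	ib_post_send(ia->ri_id->qp, &invalidate_wr, &bad_wr);
}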
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 release request ffff8804678b8800
Apr 23 20:00:34 neo03-el64 kernel: RPC: wake_up_next(ffff880465ae6320 "xprt_backlog")
Apr 23 20:00:34 neo03-el64 kernel: RPC: rpc_release_client(ffff8804651c1e00)
Apr 23 20:00:34 neo03-el64 kernel: RPC: 178 freeing task
Apr 23 20:00:34 neo03-el64 kernel: NFS: nfs_fhget(0:21/2 ct=1)
Apr 23 20:00:34 neo03-el64 kernel: <-- nfs4_get_root()
Apr 23 20:00:34 neo03-el64 kernel: RPC: looking up Generic cred
Apr 23 20:00:34 neo03-el64 kernel: RPC: rpcrdma_event_process: event rep ffff88046230ef30 status 0 opcode 7 length 48
----------->>>>> New Task Initialised <<<<<<<<<---------------
Apr 23 20:00:34 neo03-el64 kernel: RPC: new task initialized, procpid 3491

> -----Original Message-----
> From: linux-rdma-owner@xxxxxxxxxxxxxxx [mailto:linux-rdma-
> owner@xxxxxxxxxxxxxxx] On Behalf Of Devesh Sharma
> Sent: Tuesday, April 15, 2014 11:56 PM
> To: Chuck Lever
> Cc: Linux NFS Mailing List; linux-rdma@xxxxxxxxxxxxxxx; Trond Myklebust
> Subject: RE: [PATCH V1] NFS-RDMA: fix qp pointer validation checks
>
>
>
> > -----Original Message-----
> > From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
> > Sent: Tuesday, April 15, 2014 6:10 AM
> > To: Devesh Sharma
> > Cc: Linux NFS Mailing List; linux-rdma@xxxxxxxxxxxxxxx; Trond
> > Myklebust
> > Subject: Re: [PATCH V1] NFS-RDMA: fix qp pointer validation checks
> >
> >
> > On Apr 14, 2014, at 6:46 PM, Devesh Sharma
> > <devesh.sharma@xxxxxxxxxx> wrote:
> >
> > > Hi Chuck
> > >
> > >> -----Original Message-----
> > >> From: Chuck Lever [mailto:chuck.lever@xxxxxxxxxx]
> > >> Sent: Tuesday, April 15, 2014 2:24 AM
> > >> To: Devesh Sharma
> > >> Cc: Linux NFS Mailing List; linux-rdma@xxxxxxxxxxxxxxx; Trond
> > >> Myklebust
> > >> Subject: Re: [PATCH V1] NFS-RDMA: fix qp pointer validation checks
> > >>
> > >> Hi Devesh-
> > >>
> > >>
> > >> On Apr 13, 2014, at 12:01 AM, Chuck Lever <chuck.lever@xxxxxxxxxx>
> > wrote:
> > >>
> > >>>
> > >>> On Apr 11, 2014, at 7:51 PM, Devesh Sharma
> > >> <Devesh.Sharma@xxxxxxxxxx> wrote:
> > >>>
> > >>>> Hi Chuck,
> > >>>> Yes that is the case, Following is the trace I got.
> > >>>>
> > >>>> <4>RPC: 355 setting alarm for 60000 ms
> > >>>> <4>RPC: 355 sync task going to sleep
> > >>>> <4>RPC: xprt_rdma_connect_worker: reconnect
> > >>>> <4>RPC: rpcrdma_ep_disconnect: rdma_disconnect -1
> > >>>> <4>RPC: rpcrdma_ep_connect: rpcrdma_ep_disconnect status -1
> > >>>> <3>ocrdma_mbx_create_qp(0) rq_err
> > >>>> <3>ocrdma_mbx_create_qp(0) sq_err
> > >>>> <3>ocrdma_create_qp(0) error=-1
> > >>>> <4>RPC: rpcrdma_ep_connect: rdma_create_qp failed -1
> > >>>> <4>RPC: 355 __rpc_wake_up_task (now 4296956756)
> > >>>> <4>RPC: 355 disabling timer
> > >>>> <4>RPC: 355 removed from queue ffff880454578258 "xprt_pending"
> > >>>> <4>RPC: __rpc_wake_up_task done
> > >>>> <4>RPC: xprt_rdma_connect_worker: exit
> > >>>> <4>RPC: 355 sync task resuming
> > >>>> <4>RPC: 355 xprt_connect_status: error 1 connecting to server
> > >> 192.168.1.1
> > >>>
> > >>> xprtrdma's connect worker is returning "1" instead of a negative errno.
> > >>> That's the bug that triggers this chain of events.
> > >>
> > >> rdma_create_qp() has returned -EPERM. There's very little xprtrdma
> > >> can do if the provider won't even create a QP. That seems like a
> > >> rare and fatal problem.
> > >>
> > >> For the moment, I'm inclined to think that a panic is correct
> > >> behavior, since there are outstanding registered memory regions
> > >> that cannot be cleaned up without a QP (see below).
> > > Well, I think the system should still remain alive.
> >
> > Sure, in the long run. I'm not suggesting we leave it this way.
> Okay, Agreed.
> >
> > > This will definatly cause a memory leak. But QP create failure does
> > > not
> > mean system should also crash.
> >
> > It's more than leaked memory. A permanent QP creation failure can
> > leave pages in the page cache registered and pinned, as I understand it.
> Yes! true.
> >
> > > I think for the time being it is worth to put Null pointer checks to
> > > prevent
> > system from crash.
> >
> > Common practice in the Linux kernel is to avoid unnecessary NULL checks.
> > Work-around fixes are typically rejected, and not with a happy face either.
> >
> > Once the connection tear-down code is fixed, it should be clear where
> > NULL checks need to go.
> Okay.
> >
> > >>
> > >>> RPC tasks waiting for the reconnect are awoken.
> > >>> xprt_connect_status() doesn't recognize a tk_status of "1", so it
> > >>> turns it into -EIO, and kills each waiting RPC task.
> > >>
> > >>>> <4>RPC: wake_up_next(ffff880454578190 "xprt_sending")
> > >>>> <4>RPC: 355 call_connect_status (status -5)
> > >>>> <4>RPC: 355 return 0, status -5
> > >>>> <4>RPC: 355 release task
> > >>>> <4>RPC: wake_up_next(ffff880454578190 "xprt_sending")
> > >>>> <4>RPC: xprt_rdma_free: called on 0x(null)
> > >>>
> > >>> And as part of exiting, the RPC task has to free its buffer.
> > >>>
> > >>> Not exactly sure why req->rl_nchunks is not zero for an NFSv4
> GETATTR.
> > >>> This is why rpcrdma_deregister_external() is invoked here.
> > >>>
> > >>> Eventually this gets around to attempting to post a LOCAL_INV WR
> > >>> with
> > >>> ->qp set to NULL, and the panic below occurs.
> > >>
> > >> This is a somewhat different problem.
> > >>
> > >> Not only do we need to have a good ->qp here, but it has to be
> > >> connected and in the ready-to-send state before LOCAL_INV work
> > >> requests can be posted.
> > >> The implication of this is that if a server disconnects (server
> > >> crash or network partition), the client is stuck waiting for the
> > >> server to come back before it can deregister memory and retire
> > >> outstanding RPC
> > requests.
> > > This is a real problem to solve. In the existing state of xprtrdma
> > > code. Even a Server reboot will cause Client to crash.
> >
> > I don't see how that can happen if the HCA/provider manages to create
> > a fresh QP successfully and then rdma_connect() succeeds.
> Okay yes, since QP creation will still succeed.
> >
> > A soft timeout or a ^C while the server is rebooting might be a problem.
> >
> > >>
> > >> This is bad for ^C or soft timeouts or umount ... when the server
> > >> is unavailable.
> > >>
> > >> So I feel we need better clean-up when the client cannot reconnect.
> > > Unreg old frmrs with the help of new QP? Until the new QP is created
> > > with
> > same PD and FRMR is bound to PD and not to QP.
> > >> Probably deregistering RPC chunk MR's before finally tearing down
> > >> the old QP is what is necessary.
> > >
> > > We need a scheme that handles Memory registrations separately from
> > connection establishment and do book-keeping of which region is
> > Registered and which one is not.
> > > Once the new connection is back. Either start using old mem-regions
> > > as it is,
> > or invalidate old and re-register on the new QP.
> > > What is the existing scheme xprtrdma is following? Is it the same?
> >
> > This is what is going on now. Clearly, when managing its own memory
> > resources, the client should never depend on the server ever coming back.
> >
> > The proposal is to deregister _before_ the old QP is torn down, using
> > ib_dereg_mr() in the connect worker process. All RPC requests on that
> > connection should be sleeping waiting for the reconnect to complete.
> >
> > If chunks are created and marshaled during xprt_transmit(), the
> > waiting RPC requests should simply re-register when they are ready to be
> sent again.
> >
> Ok, I will try to change this and test, I may take a week's time to understand
> and rollout V3.
>
> > > I think it is possible to create FRMR on qp->qp_num = x while
> > > invalidate on qp->qp_num = y until qpx.pd == qpy.pd
> >
> > --
> > Chuck Lever
> > chuck[dot]lever[at]oracle[dot]com
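For reference, the clean-up ordering proposed above (deregister with ib_dereg_mr() in the connect worker before the old QP is torn down) might look roughly like this. A sketch under the assumption that every FRMR sits on the buffer pool's rb_mws list; the walk and field names are illustrative, not the final patch.

#include <linux/list.h>
#include <rdma/ib_verbs.h>
#include "xprt_rdma.h"

/* Called from the connect worker *before* the old QP is destroyed so
 * that no registered memory stays pinned while we wait for a server
 * that may never come back.  ib_dereg_mr() needs no live QP.
 */
static void
rpcrdma_dereg_all_frmrs(struct rpcrdma_buffer *buf)
{
	struct rpcrdma_mw *mw;

	list_for_each_entry(mw, &buf->rb_mws, mw_list) {
		ib_dereg_mr(mw->r.frmr.fr_mr);
		mw->r.frmr.fr_mr = NULL;	/* re-created on reconnect */
	}
}

The sleeping RPC requests would then simply re-register their chunks in xprt_transmit() once the reconnect completes.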