On Wed, Mar 31, 2021 at 2:51 PM Trond Myklebust <trondmy@xxxxxxxxxxxxxxx> wrote:
>
> On Wed, 2021-03-31 at 14:34 -0400, David Wysochanski wrote:
> > On Wed, Mar 31, 2021 at 2:04 PM Trond Myklebust <
> > trondmy@xxxxxxxxxxxxxxx> wrote:
> > >
> > > On Wed, 2021-03-31 at 13:49 -0400, David Wysochanski wrote:
> > > > Trond,
> > > >
> > > > I've been working on getting NFS converted to dhowells' new
> > > > fscache and netfs APIs, and I'm running into a problem with how
> > > > NFS is designed that involves the NFS pagelist.c / pgio API.
> > > > I'd appreciate it if you could review and give your thoughts on
> > > > possible approaches.  I've tried to outline some of the
> > > > possibilities below.  I tried coding option #3 and ran into some
> > > > problems, and it has a serialization limitation.  At this point
> > > > I'm leaning towards option 2, so I'll probably try that approach
> > > > if you don't have time for review or have strong thoughts on it.
> > > >
> > >
> > > I am not going through another redesign of the NFS code in order
> > > to accommodate another cachefs design.  If netfs needs a
> > > refactoring or redesign of the I/O code then it will be
> > > immediately NACKed.
> > >
> > I don't think it will require a redesign.  I was thinking more about
> > adding a flag to nfs_pageio_add_request(), for example, that would
> > return a different value if coalescing of the page being added
> > failed.  So we'd have:
> > nfs_pageio_init(): called 1 time
> > nfs_pageio_add_request(): called N times, one for each page, but
> > stop if coalescing fails
> > nfs_pageio_complete(): called 1 time
>
> > > Why does netfs need to know these details about the NFS code
> > > anyway?
> > >
> > We can probably get by without it, but it will be awkward and
> > probably not the best, though I'm not sure.
> >
> > I tried to explain below with a problem statement, but maybe it was
> > unclear.  The basic design of the netfs API, as it pertains to this
> > problem, is:
> > * issue_op(): calls into the specific netfs (NFS) to obtain the
> > data from the server (send one or more RPCs)
> > * netfs_subreq_terminated(): when the RPC(s) are completed, we need
> > to call the netfs API back to say the data is either there or that
> > there was an error
> >
> That's a problem. That means NFS and netfs need intimate knowledge of
> each other's design, and I'm not going to allow us to go there again.
> We did that with cachefs, and here we are 10 years later doing a full
> redesign. That's unacceptable.
>
I don't think it's a full redesign, and my goal all along has been to
be minimally invasive to existing NFS.

I forgot to mention this part of netfs:
* clamp_length(): netfs calls into NFS here, and we can clamp the
length to a specific size (for example 'rsize' for reads); this is
what I'm trying to see whether I can implement fully or not, but it
looks more complicated due to coalescing failing, etc.  If not, then
there are other possible approaches (NETFS_SREQ_SHORT_READ), but I'm
not sure they are ideal.

> If netfs requires these detailed changes to the NFS code, then that's
> a red flag that the whole design is broken and needs to be revised.
>
> > I would note that, assuming we can come up with something acceptable
> > to NFS, it should simplify both nfs_readpage() and
> > nfs_readpages()/nfs_readahead().
> > I hope we can find some common ground so that it's not too invasive
> > to NFS, and maybe there are some similar improvements in NFS that
> > can be done along with this interface.
> >
> >
> > Thanks.
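
To make the clamp_length() point above concrete, the easy part would
look roughly like the sketch below.  This is only a sketch against the
netfs API as currently posted: the exact ->clamp_length() signature
may change, and nfs_netfs_clamp_length() is a name I made up for
illustration.

/*
 * Sketch only: clamp each subrequest to the mount's rsize so that one
 * issue_op() call maps to at most one READ RPC.  This does not handle
 * the other ways the pgio layer can split an I/O (pNFS layout
 * segments, pg_test() failures, etc.), which is the hard part
 * described below.
 */
static bool nfs_netfs_clamp_length(struct netfs_read_subrequest *subreq)
{
	struct nfs_server *server = NFS_SERVER(subreq->rreq->inode);

	subreq->len = min_t(size_t, subreq->len, server->rsize);
	return true;
}

The hard part is that even an rsize-clamped range can still be split
again inside nfs_pageio_add_request(), which is the problem described
below.
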
> > > >
> > > >
> > > > Problem: The NFS pageio interface does not expose a max read
> > > > length that we can easily use inside the netfs clamp_length()
> > > > function.  As a result, when issue_op() is called indicating a
> > > > single netfs subrequest, this can be split into multiple NFS
> > > > subrequests / RPCs inside the guts of the NFS pageio code.
> > > > Multiple NFS subrequests lead to multiple completions, but the
> > > > netfs API expects a 1:1 mapping between issue_op() and
> > > > netfs_subreq_terminated() calls.
> > > >
> > > > Details of the NFS pageio API (see include/linux/nfs_page.h and
> > > > fs/nfs/pagelist.c)
> > > > Details of the netfs API (see include/linux/netfs.h and
> > > > fs/netfs/read_helper.c)
> > > >
> > > > The NFS pageio API's 3 main calls are as follows:
> > > > 1. nfs_pageio_init(): initialize a pageio structure (R/W IO of
> > > > N pages)
> > > > 2. nfs_pageio_add_request(): called for each page to add to an
> > > > IO
> > > > * Calls nfs_pageio_add_request_mirror -> __nfs_pageio_add_request
> > > > * __nfs_pageio_add_request may call nfs_pageio_doio(), which
> > > > actually sends an RPC over the wire if the page cannot be added
> > > > to the request ("coalesced") due to various factors.  For more
> > > > details, see nfs_pageio_do_add_request() and all the underlying
> > > > code it calls, such as nfs_coalesce_size() and subsequent
> > > > pgio->pg_ops->pg_test() calls
> > > > 3. nfs_pageio_complete(): "complete" the pageio
> > > > * Calls nfs_pageio_complete_mirror -> nfs_pageio_doio()
> > > >
> > > > The NFS pageio API thus may generate multiple over-the-wire
> > > > RPCs, and thus multiple completions, even though at the high
> > > > level only one call to nfs_pageio_complete() is made.
> > > >
> > > > Option 1: Just use the NFS pageio API as is, and deal with
> > > > possible multiple completions.
> > > > - Inconsistent with netfs design intent
> > > > - Need to keep track of the RPC completion status, and, for
> > > > example, if one completes with success and one with an error,
> > > > probably call netfs_subreq_terminated() with the error.
> > > > - There's no way for the caller of the NFS pageio API to know
> > > > how many RPCs, and thus completions, may occur.  Thus, it's
> > > > unclear how one would distinguish between a READ that resulted
> > > > in a single RPC over the wire that completed as a short read,
> > > > and a READ that resulted in multiple RPCs that would each
> > > > complete separately but would eventually complete.
> > > >
> > > > Option 2: Create a more complex clamp_length() function for
> > > > NFS, taking into account all the ways the NFS / pNFS code can
> > > > split a read.
> > > > + Consistent with netfs design intent
> > > > + Multiple "split" requests would be issued in parallel (see
> > > > the loop inside netfs_readahead(), which repeatedly calls
> > > > netfs_rreq_submit_slice)
> > > > - Looks impossible without refactoring of the NFS pgio API.  We
> > > > need to prevent nfs_pageio_add_request() from calling
> > > > nfs_pageio_doio(), and return some indication that coalescing
> > > > failed.  In addition, it may run into problems with fallback
> > > > from DS to MDS, for example (see commit
> > > > d9156f9f364897e93bdd98b4ad22138de18f7c24).
> > > >
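
Since I'm leaning towards option 2, here is a rough sketch of the
caller side of the nfs_pageio_add_request() change I floated above.
Illustration only: nfs_pageio_add_request_stop() and
nfs_issue_one_rpc() are invented names, the -EAGAIN convention is
arbitrary, and error handling is elided.  The point is just that a
coalescing failure ends the loop instead of triggering
nfs_pageio_doio():

/*
 * Sketch only: issue at most one READ RPC for a batch of pages.
 * Proposed semantics for the new call: add the page if it coalesces,
 * otherwise return -EAGAIN *without* sending an RPC, so the caller
 * controls where the subrequest ends.
 */
static void nfs_issue_one_rpc(struct nfs_open_context *ctx,
			      struct inode *inode,
			      struct page **pages, unsigned int npages)
{
	struct nfs_pageio_descriptor pgio;
	unsigned int i;

	nfs_pageio_init_read(&pgio, inode, false,
			     &nfs_async_read_completion_ops);
	for (i = 0; i < npages; i++) {
		struct nfs_page *req;

		req = nfs_create_request(ctx, pages[i], 0, PAGE_SIZE);
		if (IS_ERR(req))
			break;
		if (nfs_pageio_add_request_stop(&pgio, req) == -EAGAIN) {
			/* Coalescing failed: end the subrequest here. */
			nfs_release_request(req);
			break;
		}
	}
	nfs_pageio_complete(&pgio);	/* at most one wire RPC */
}

That would keep the 1:1 mapping between issue_op() and
netfs_subreq_terminated() without netfs needing to know anything
about pgio internals.
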
> > > > Option 3: Utilize the NETFS_SREQ_SHORT_READ flag as needed.
> > > > + Consistent with netfs design intent
> > > > - Multiple "split" requests would be serialized (see the code
> > > > paths inside netfs_subreq_terminated() that check for this
> > > > flag).
> > > > - Looks impossible without some refactoring of the NFS pgio
> > > > API.
> > > > * Notes: Terminate the NFS pageio page-based loop at the first
> > > > call to nfs_pageio_doio().  When a READ completes, NFS calls
> > > > netfs_subreq_terminated() with NETFS_SREQ_SHORT_READ set and is
> > > > prepared to have the rest of the subrequest be resubmitted.
> > > > We'd need to somehow fail early, or avoid entirely, subsequent
> > > > calls to nfs_pageio_doio() for the original request, though,
> > > > and handle error status only from the first RPC.
> > > >
> > > > Option 4: Add some final completion routine to be called near
> > > > the bottom of nfs_pageio_complete(); it would pass in at least
> > > > the netfs_read_subrequest, possibly the nfs_pageio_descriptor.
> > > > - Inconsistent with netfs design intent
> > > > - Would be a new NFS API or call on top of everything
> > > > - Need to handle the "multiple completions with different
> > > > status" problem (see option 1).
> > > >
> > >
> > > --
> > > Trond Myklebust
> > > Linux NFS client maintainer, Hammerspace
> > > trond.myklebust@xxxxxxxxxxxxxxx
> > >
> > >
>
> --
> Trond Myklebust
> CTO, Hammerspace Inc
> 4984 El Camino Real, Suite 208
> Los Altos, CA 94022
>
> www.hammer.space
>
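
For completeness, the completion side I have in mind looks roughly
like the sketch below, whichever of options 2 or 3 we pick.  Again,
illustration only: nfs_netfs_read_done() is an invented name,
hdr->netfs is an invented backpointer that would have to be stashed
when the RPC is set up, and the netfs_subreq_terminated() signature is
taken from the currently posted series, so it may change.

/*
 * Sketch only: one READ RPC completes, so terminate the one netfs
 * subrequest it maps to.
 */
static void nfs_netfs_read_done(struct nfs_pgio_header *hdr)
{
	struct netfs_read_subrequest *subreq = hdr->netfs;

	if (hdr->error) {
		netfs_subreq_terminated(subreq, hdr->error, false);
		return;
	}
	/*
	 * If pgio split the I/O and this RPC covered only part of the
	 * subrequest, reporting the partial byte count should send
	 * netfs down its short-read path (NETFS_SREQ_SHORT_READ), so
	 * the remainder gets resubmitted, at the cost of the
	 * serialization noted under option 3.
	 */
	netfs_subreq_terminated(subreq, hdr->res.count, false);
}
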