On Wed, Apr 25, 2018 at 2:21 PM, Boaz Harrosh <boazh@xxxxxxxxxx> wrote:
>
> On 03/15/2018 02:42 PM, Miklos Szeredi wrote:
>>
>> Ideally most of the complexity would be in the page cache.  Not sure
>> how ready it is to handle pmem pages?
>>
>> The general case (non-pmem) will always have to be handled
>> differently; you've just stated that it's much less latency sensitive
>> and needs async handling.  Basing the design on just trying to make
>> it use the same mechanism (userspace copy) is flawed in my opinion,
>> since it's suboptimal for either case.
>>
>> Thanks,
>> Miklos
>
> OK, so I was thinking hard about all this, and I am changing my mind
> and agreeing with all that was said.
>
> I want the usFS plugin to have all the different options and an easy
> way to tell the kernel which mode to use.
>
> Let me summarize the options:
>
> 1. Sync, userspace copy directly to app buffers (current implementation)
>
> 2. Async block device operation (non-pmem)
>    zuf owns all devices, pmem and non-pmem, at mount time and provides
>    very efficient access to both.  In the hard-disk / SSD case, as part
>    of an IO call the server returns -EWOULDBLOCK and in the background
>    issues a scatter-gather call through zuf.  The memory target for the
>    IO can be pmem, user buffers directly (DIO), or transient server
>    buffers.  On completion an upcall is made to ZUF to complete the IO
>    operation and release the waiting application.
>
> 3. Splice and R-splice
>    For the case where the IO target is not a block device but an
>    external path such as network / RDMA / some non-block device.
>    Zuf already holds an internal object describing the IO context,
>    including the GUP'd app buffers.  This internal object can be made
>    the memory target of a splice operation.
>
> 4. Get-io_map type operation (currently implemented for mmap)
>    The zus-FS returns a set of dpp_t(s) to the kernel and the kernel
>    does the memcpy to app buffers.  The server also specifies whether
>    those buffers should be cached in a per-inode radix-tree (xarray);
>    if so, on the next access to the same range the kernel does the
>    copy and never dispatches to user space.  In this mode the server
>    can also revoke a cached mapping when needed.
>
> 5. Use of the VFS page-cache
>    For a very slow backing device the FS requests the regular VFS
>    page-cache.  In the read/write_pages() vectors zuf uses option 1
>    above to read into the page-cache instead of directly into app
>    buffers.  Only cache misses dispatch back to user space.
>
> Have I forgotten anything?
>
> This way the zus-FS is in control and can do the "right thing"
> depending on target device and FS characteristics.  The interface
> gives us a rich set of tools to work with.
>
> Hope that answers your concerns.

Why keep options 1 and 2?  An io-map (4) type interface should cover
those efficiently, shouldn't it?

I don't think the page cache is just for slow backing devices, or that
it needs to be a separate interface.  Caches are and always will be
the fastest path, no matter how fast the device is.  In Linux the page
cache seems like the most convenient place to put a pmem mapping, for
example.

Of course, caches are also a big PITA when dealing with distributed
filesystems.  Fuse doesn't have a perfect solution for that; it's one
of the key areas that needs improvement.

Also, I'll add one more use case that crops up often with fuse: "whole
file data mapping".  Basically this means that the file's data in a
(virtual) userspace filesystem is equivalent to a file's data on an
underlying (physical) filesystem.  In that case we could accelerate
I/O tremendously, as well as eliminate the double caching.
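
As a rough illustration of what fuse can already do on the copy side
(a sketch along the lines of libfuse's passthrough_ll example, not
anything from the zufs code): with the low-level API a passthrough
filesystem can reply with an fd-backed buffer and let the kernel
splice the data instead of copying it through a userspace buffer.
That removes the extra copy, but not the double caching that whole
file data mapping is about.  It assumes fi->fh was set to an fd on the
underlying file in the open handler.

#define FUSE_USE_VERSION 31
#include <fuse_lowlevel.h>

static void lo_read(fuse_req_t req, fuse_ino_t ino, size_t size,
		    off_t off, struct fuse_file_info *fi)
{
	struct fuse_bufvec buf = FUSE_BUFVEC_INIT(size);

	(void) ino;

	/* Describe the data by (fd, offset) instead of copying it into
	 * a userspace buffer; the kernel splices it into the reply. */
	buf.buf[0].flags = FUSE_BUF_IS_FD | FUSE_BUF_FD_SEEK;
	buf.buf[0].fd = fi->fh;
	buf.buf[0].pos = off;

	fuse_reply_data(req, &buf, FUSE_BUF_SPLICE_MOVE);
}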
I've been undecided what to do with that use case; for some time I was
resisting, then I said I'd accept patches, and at some point I'll
probably do a patch myself.

Thanks,
Miklos