Re: [RFC 0/2] New MAP_PMEM_AWARE mmap flag

Ross Zwisler <ross.zwisler@xxxxxxxxxxxxxxx> · Tue, 23 Feb 2016 10:25:12 -0700

On Tue, Feb 23, 2016 at 04:10:50PM +0200, Boaz Harrosh wrote:
> On 02/23/2016 11:52 AM, Christoph Hellwig wrote:
> <>
> > 
> > And this is BS.  Using msync or fsync might not perform as well as not
> > actually using them, but without them you do not get persistence.  If
> > you use your pmem as a throw away cache that's fine, but for most people
> > that is not the case.
> > 
> 
> Hi Christoph
> 
> That is exactly my suggestion. My approach is *not* that we do not call
> m/fsync to let the FS clean up.
> 
> In my model we still do that, only we eliminate the m/fsync slowness
> and all the page-fault overhead, because the application instructs us
> that we do not need to track the modified data cachelines: the
> application is telling us that it will do so itself.
> 
> In my model the job is split:
>  The app takes care of data persistence by specifying MAP_PMEM_AWARE
>  and doing its own cl_flushing / movnt.
>  This is the heavy cost.
> 
>  The FS keeps track of metadata persistence as it already does, via the
>  call to m/fsync. Its cost is marginal compared to the heavy IO
>  above.
> 
> Note that the FS is still free to move blocks around, as Dave said:
> lock out page faults, unmap from user space, let the app fault again on a
> new block. This will still work as before; already in COW we flush the old
> block, so no persistence will be lost.
> 
> So this whole thread started with my patches, and my patches do not say
> "no m/fsync"; they say: make this 3-8 times faster than today if the app
> is participating in the heavy lifting.
> 
> Please tell me what you find wrong with my approach?

It seems like we are trying to solve a couple of different problems:

1) Make page faults faster by skipping any radix tree insertions, tag updates,
etc.

2) Make fsync/msync faster by not flushing data that the application says it
is already making durable from userspace.

I agree that your approach seems to help with both of these problems, but I
would argue that it is an incomplete solution for problem #2 because an
fsync/msync from the PMEM aware application would still flush any radix tree
entries from *other* threads that were writing to the same file.
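
To make that concrete, here is a toy userspace model of the situation.  It is
not kernel code: "struct file_model", write_fault() and fsync_model() are
hypothetical names, and a flat array stands in for the per-file radix tree.
The only point it shows is that dirty tracking is per file, so a sync issued
by the PMEM aware writer still pays for pages dirtied by everyone else.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8

/* Toy stand-in for the per-file dirty tracking (hypothetical, not kernel code). */
struct file_model {
    bool dirty[NPAGES];
};

/* Write fault: a PMEM-aware mapping skips dirty tracking, others record it. */
static void write_fault(struct file_model *f, size_t page, bool pmem_aware)
{
    if (!pmem_aware)
        f->dirty[page] = true;
}

/*
 * fsync/msync: flush every tracked page of the file, no matter which
 * mapping dirtied it.  Returns the number of pages flushed.
 */
static size_t fsync_model(struct file_model *f)
{
    size_t flushed = 0;

    for (size_t i = 0; i < NPAGES; i++) {
        if (f->dirty[i]) {
            f->dirty[i] = false;
            flushed++;
        }
    }
    return flushed;
}

/*
 * Scenario: a PMEM-aware writer touches pages 0-2, a regular writer touches
 * pages 3-4 of the same file, then the PMEM-aware writer calls fsync.
 */
static size_t demo_shared_file(void)
{
    struct file_model f = { { false } };

    write_fault(&f, 0, true);
    write_fault(&f, 1, true);
    write_fault(&f, 2, true);
    write_fault(&f, 3, false);
    write_fault(&f, 4, false);

    return fsync_model(&f);
}
```

In this model demo_shared_file() reports two flushed pages, neither of which
the PMEM aware writer dirtied itself.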

It seems like a more direct solution for #2 above would be to have a
metadata-only equivalent of fsync/fdatasync, say "fmetasync", which says "I'll
make the writes I do to my mmaps durable from userspace, but I need you to
sync all filesystem metadata for me, please".

This would allow a complete separation of data synchronization in userspace
from metadata synchronization in kernel space by the filesystem code.
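
A rough sketch of what the application side of that split might look like.
Everything here is an assumption for illustration: fmetasync() does not exist,
so fmetasync_standin() just calls fsync() (which also syncs data, so it shows
the call sequence, not the speedup); flush_range() uses the x86 clflush
intrinsic and is only a memory barrier elsewhere; and the flushes only mean
"durable" when the mapping is really DAX-backed persistent memory.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
#if defined(__x86_64__)
#include <emmintrin.h>
#endif

#define CACHELINE 64   /* assumed cache line size */
#define MAP_LEN   4096

/* Userspace half: push [addr, addr + len) out of the CPU caches. */
static void flush_range(const void *addr, size_t len)
{
#if defined(__x86_64__)
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    for (; p < end; p += CACHELINE)
        _mm_clflush((const void *)p);
    _mm_sfence();
#else
    /* Other architectures need their own flush primitive; barrier only. */
    (void)addr;
    (void)len;
    __sync_synchronize();
#endif
}

/* Kernel half: HYPOTHETICAL metadata-only sync; fsync() is a stand-in. */
static int fmetasync_standin(int fd)
{
    return fsync(fd);
}

/* Store to the mapping, make the data durable ourselves, then sync metadata. */
static int store_durable(int fd, char *map, size_t off,
                         const void *src, size_t len)
{
    memcpy(map + off, src, len);
    flush_range(map + off, len);
    return fmetasync_standin(fd);
}

/* Round trip on an ordinary temp file; returns 0 if the data reads back. */
static int demo_roundtrip(void)
{
    char path[] = "/tmp/pmem_sketch_XXXXXX";
    const char msg[] = "hello pmem";
    char back[sizeof(msg)] = { 0 };
    char *map;
    int rc = -1;
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, MAP_LEN) != 0)
        goto out_close;
    map = mmap(NULL, MAP_LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        goto out_close;
    if (store_durable(fd, map, 128, msg, sizeof(msg)) == 0 &&
        pread(fd, back, sizeof(back), 128) == (ssize_t)sizeof(back) &&
        memcmp(back, msg, sizeof(msg)) == 0)
        rc = 0;
    munmap(map, MAP_LEN);
out_close:
    close(fd);
    return rc;
}
```

The data path (memcpy plus flush_range) never enters the kernel; only the
final call does, and with a real metadata-only primitive that call would have
no data to walk or flush.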

By itself an fmetasync()-type solution would of course do nothing for issue
#1 - if that were a compelling issue you'd need something like the mmap flag
you're proposing to skip work on page faults.

All that being said, though, I agree with others in the thread that we should
still be focused on correctness, as we have a lot of correctness issues
remaining.  When we eventually get to the place where we are trying to do
performance optimizations, those optimizations should be measurement driven.

- Ross
