On Wed 24-02-16 12:30:59, Ross Zwisler wrote:
> On Tue, Feb 23, 2016 at 09:09:47PM -0700, Ross Zwisler wrote:
> > On Tue, Feb 23, 2016 at 03:56:17PM -0800, Dan Williams wrote:
> > > On Tue, Feb 23, 2016 at 3:43 PM, Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
> > > > Dan Williams <dan.j.williams@xxxxxxxxx> writes:
> > > >
> > > >> On Tue, Feb 23, 2016 at 3:28 PM, Jeff Moyer <jmoyer@xxxxxxxxxx> wrote:
> > > >>>> The crux of the problem, in my opinion, is that we're asking for
> > > >>>> an "I know what I'm doing" flag, and I expect that's an impossible
> > > >>>> statement for a filesystem to trust generically.
> > > >>>
> > > >>> The file system already trusts that. If an application doesn't use
> > > >>> fsync properly, guess what, it will break. This line of reasoning
> > > >>> doesn't make any sense to me.
> > > >>
> > > >> No, I'm worried about the case where an app specifies MAP_PMEM_AWARE,
> > > >> uses fsync correctly, and fails to flush the cpu cache.
> > > >
> > > > I don't think the kernel needs to put training wheels on applications.
> > > >
> > > >>>> If you can get MAP_PMEM_AWARE in, great, but I'm more and more of
> > > >>>> the opinion that the "I know what I'm doing" interface should be
> > > >>>> something separate from today's trusted filesystems.
> > > >>>
> > > >>> Just so I understand you, MAP_PMEM_AWARE isn't the "I know what I'm
> > > >>> doing" interface, right?
> > > >>
> > > >> It is the "I know what I'm doing" interface; MAP_PMEM_AWARE asserts
> > > >> "I know when to flush the cpu cache relative to an fsync()".
> > > >
> > > > I see. So I think your argument is that new file systems (such as
> > > > Nova) can have whacky new semantics, but existing file systems should
> > > > provide the more conservative semantics that they have provided since
> > > > the dawn of time (even if we add a new mmap flag to control the
> > > > behavior).
> > > >
> > > > I don't agree with that. :)
> > >
> > > Fair enough. Recall, I was pushing MAP_DAX not too long ago. It just
> > > seems like a Sisyphean effort to push an mmap flag up the XFS hill, and
> > > maybe that effort is better spent somewhere else.
> >
> > Well, for what it's worth, MAP_SYNC feels like the "right" solution to
> > me. I understand that we are a ways from having it implemented, but it
> > seems like the correct way to have applications work with persistent
> > memory in a perfect world, and worth the effort.
> >
> > MAP_PMEM_AWARE is interesting, but even in a perfect world it seems like
> > a partial solution - applications still need to call *sync to get the FS
> > metadata to be durable, and they have no reliable way of knowing which
> > of their actions will cause the metadata to be out of sync.
> >
> > Dave, is your objection to the MAP_SYNC idea a practical one about
> > complexity and time to get it implemented, or do you think it is the
> > wrong solution?
>
> Jan, I just noticed that this chain didn't CC you or linux-fsdevel, so
> you may have missed it. All the gory details are here:
>
> http://thread.gmane.org/gmane.linux.kernel.mm/146691
>
> Let me provide a little background for my question. (Everyone else on
> the thread, feel free to jump in if you feel like my summary is
> incorrect or incomplete.)
>
> There is a new persistent memory programming model outlined on pmem.io
> and implemented by the NVM Library (nvml):
>
> http://pmem.io/
> http://pmem.io/nvml/
> https://github.com/pmem/nvml/
>
> This new programming model is based on the idea that an application
> should be able to create a DAX mmap, and from then on satisfy its data
> durability requirements purely in userspace. This is done by using
> non-temporal stores or cached writes followed by flushes, the same way
> that we do things in the kernel.
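For concreteness, the userspace half of that model looks roughly like the
sketch below (illustrative only: x86 assumed, error handling omitted, the
file path and the flush_range() helper are made up for the example; nvml's
pmem_persist() wraps an equivalent flush sequence):

    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define CACHELINE 64

    /* Make [addr, addr + len) durable: flush each cache line backing the
     * range, then fence (clflush is only guaranteed ordered by mfence). */
    static void flush_range(const void *addr, size_t len)
    {
            uintptr_t p = (uintptr_t)addr & ~((uintptr_t)CACHELINE - 1);

            for (; p < (uintptr_t)addr + len; p += CACHELINE)
                    _mm_clflush((const void *)p);
            _mm_mfence();
    }

    int main(void)
    {
            int fd = open("/mnt/pmem/data", O_RDWR);  /* assumed DAX file */
            char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);

            strcpy(buf, "hello");   /* plain cached store */
            flush_range(buf, 6);    /* data durable, no syscall needed */
            return 0;
    }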
>
> Dave was concerned that this breaks down for XFS because even if the
> application were to sync all of its writes to media, the filesystem
> could be making associated metadata changes that the application
> wouldn't and couldn't know about:
>
> http://article.gmane.org/gmane.linux.kernel.mm/146699
>
> To sync these metadata changes to media, the application would still
> need to call *sync.
>
> One proposal from Christoph was that we could add a MAP_SYNC flag that
> essentially says "make all metadata operations synchronous":
>
> http://article.gmane.org/gmane.linux.kernel.mm/146753
>
> The worry is that this would be complex to implement, and that we maybe
> don't want yet another DAX special case in the FS code.
>
> Another way we could implement this would be to key off of the DAX
> mount option / inode setting for all mmaps that use DAX. This would
> preclude the need for changes to the mmap() API.
>
> My question: how far away are we from having such a metadata durability
> guarantee in ext4? Do we have cases where the metadata changes
> associated with a page fault, etc. could be out of sync with the data
> writes that are being made durable by the application in userspace?
>
> I see ext4 creating journal entries around page faults in places like
> ext4_dax_fault() - this should durably record any metadata changes
> associated with that page fault before the fault completes, correct?

That is not true. Journalling makes sure metadata changes are recorded
in the journal, but you have to commit the transaction to make the
change really durable. That happens either in response to sync / fsync
or asynchronously every couple of seconds. So with ext4 you have exactly
the same issues with durability as with XFS.

> Are there other cases you can think of with ext4 where we would need to
> call *sync for DAX just to be sure we are safely synchronizing metadata?

So I think implementing something like MAP_SYNC semantics for ext4 is
reasonably doable. Basically, we would have to make sure that we commit
a transaction already during the page fault, which is not that hard to do.

								Honza
--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR
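PS: To make the data / metadata split concrete: under today's semantics
the application-side recipe is roughly the sketch below, where the fsync()
exists only to commit the filesystem's journal transaction. The MAP_SYNC
usage in the trailing comment is hypothetical - in this thread it is only
a proposal. Illustrative only: x86 assumed, error handling omitted, file
path made up.

    #include <emmintrin.h>   /* _mm_clflush, _mm_mfence */
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            int fd = open("/mnt/pmem/log", O_RDWR);  /* assumed DAX file */
            char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);

            memcpy(p, "record", 6);  /* all 6 bytes sit in one cache line */
            _mm_clflush(p);          /* the data itself is durable here... */
            _mm_mfence();

            fsync(fd);  /* ...but the journal transaction covering block
                         * allocation etc. commits only at this point */

            /* With the proposed MAP_SYNC, metadata would be committed
             * during the page fault itself and the fsync() above could
             * be dropped:
             *
             *     mmap(NULL, 4096, PROT_READ | PROT_WRITE,
             *          MAP_SHARED | MAP_SYNC, fd, 0);
             */
            return 0;
    }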