Re: [PATCH v2 5/5] dax: handle media errors in dax_do_io

"Rudoff, Andy" <andy.rudoff@xxxxxxxxx> · Tue, 3 May 2016 01:26:46 +0000

>> The takeaway is that msync() is 9-10x slower than userspace cache management.
>
>An alternative viewpoint: that flushing clean cachelines is
>extremely expensive on Intel CPUs. ;)
>
>i.e. Same numbers, different analysis from a different PoV, and
>that gives a *completely different conclusion*.
>
>Think about it for the moment. The hardware inefficiency being
>demonstrated could be fixed/optimised in the next hardware product
>cycle(s) and so will eventually go away. OTOH, we'll be stuck with
>whatever programming model we come up with for the next 30-40 years,
>and we'll never be able to fix flaws in it because applications will
>be depending on them. Do we really want to be stuck with a pmem
>model that is designed around the flaws and deficiencies of ~1st
>generation hardware?

Hi Dave,

Not sure I agree with your completely different conclusion.  (Not sure
I completely disagree either, but please let me raise some practical
points.)

First of all, let's say you're completely right and flushing clean
cache lines is extremely expensive.  So your solution is to wait for
the chip to be fixed?  Remember the model we're putting forward (which
we're working on documenting, because I fully agree with the lack of
documentation point you keep raising) requires the application to ASK
for the file system's permission before assuming flushing from user space
to persistence is allowed.  So that doesn't stick us with 30-40 years of
a flawed model.  I don't think the model is wrong, having spent lots of
research time on it, but if I'm full of crap, all we have to do is stop
telling the app that flushing from user space is allowed, and it will
have to go back to using msync().  This is my understanding of what Dan
suggested
at LSF and this is what I'm currently writing up.  By the way, the NVM
Libraries already contain the logic to ask if flushing from user space
is allowed, falling back to msync() if not.  Currently those libraries
check for DAX mappings.  But the points you raised about metadata changes
happening during page faults made us realize we have to ask the file
system to opt in to allowing user space flushing, so that's what we're
changing the library to do.  See, we are listening :-)

Anyway, I doubt that flushing a clean cache line is extremely expensive.
Remember the code is building transactions to maintain a consistent
in-memory data structure in the face of sudden failure like powerloss.
So it is using the flushes to create store barriers: not the block-based
store barriers we're used to in the storage world, but cache-line-sized
store barriers (usually multiples of cache lines, but most commonly
smaller than 4k of them).  So I think when you turn a cache line flush
into an msync(), you're seeing some dirty stuff get flushed before it
is time to flush it.  I'm not sure, though, and we could certainly spend
more time testing & measuring.
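
To make the granularity point concrete, here is a small arithmetic
sketch in C.  It assumes 64-byte cache lines and 4096-byte pages
(neither number comes from the measurements in this thread), and the
helper names are hypothetical.  It only shows how much memory each
approach covers for the same store; it does not show where the measured
slowdown comes from.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE_BYTES 64u     /* assumed cache line size */
#define PAGE_BYTES      4096u   /* assumed page size */

/* Bytes covered when [off, off+len) is rounded out to unit boundaries. */
static size_t covered_bytes(size_t off, size_t len, size_t unit)
{
    assert(unit != 0);
    if (len == 0)
        return 0;

    size_t start = off - (off % unit);      /* round start down */
    size_t end = off + len;
    size_t rem = end % unit;

    if (rem != 0)
        end += unit - rem;                  /* round end up */
    return end - start;
}

/* Range touched by flushing the cache lines that overlap the store. */
static size_t flush_bytes(size_t off, size_t len)
{
    return covered_bytes(off, len, CACHELINE_BYTES);
}

/* Range covered by msync() on the pages that overlap the store. */
static size_t msync_bytes(size_t off, size_t len)
{
    return covered_bytes(off, len, PAGE_BYTES);
}
```

Under these assumptions, an 8-byte store at offset 100 rounds out to a
single 64-byte line for a cache line flush, but to a full 4096-byte
page for msync(), so any other dirty lines in that page are written
back at the same time.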

More importantly, I think it is interesting to decide what we want the
pmem programming model to be long-term.  I think we want applications to
just map pmem, do normal stores to it, and assume they are persistent.
This is quite different from the 30-year-old POSIX model where msync()
is required.  But I think it is cleaner, easier to understand, and less
error-prone.  So why doesn't it work that way right now?  Because we're
finding it impractical.  Using write-through caching for pmem simply
doesn't perform well, and depending on the platform to flush the CPU
caches on shutdown/powerfail is not practical yet.  But I think the day
will come when it is practical.

So given that long-term target, the idea is for an application to ask if
the msync() calls are required, or if just flushing the CPU caches is
sufficient for persistence.  Then, we're also adding an ACPI property
that allows SW to discover if the caches are flushed automatically
on shutdown/powerloss.  Initially that will only be true for custom
platforms, but hopefully it can be available more broadly in the future.
The result will be that the programming model gets simpler as more and
more hardware requires less explicit flushing.
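
The progression described above can be sketched as a three-way choice,
here in C.  Everything in this block is hypothetical: the struct, the
field names, and the ordering of the checks are one reading of the
model, not an existing interface.  In particular it assumes the file
system's answer takes priority, so msync() is still used whenever the
file system has not opted in, whatever the platform reports.

```c
#include <assert.h>
#include <stddef.h>

enum persist_method {
    PERSIST_NOTHING,        /* plain stores are enough */
    PERSIST_CPU_FLUSH,      /* flush CPU caches from user space */
    PERSIST_MSYNC           /* call msync() */
};

/* Hypothetical answers an application would collect at mapping time. */
struct persist_caps {
    int fs_allows_user_flush;           /* file system opted in */
    int caches_flushed_on_powerfail;    /* reported by the platform */
};

static enum persist_method
choose_persist_method(const struct persist_caps *caps)
{
    assert(caps != NULL);

    /* The file system has not opted in: msync() stays mandatory. */
    if (!caps->fs_allows_user_flush)
        return PERSIST_MSYNC;

    /* The platform flushes caches itself on shutdown/power failure. */
    if (caps->caches_flushed_on_powerfail)
        return PERSIST_NOTHING;

    return PERSIST_CPU_FLUSH;
}
```

The point of keeping the choice in one place is that an application
written against it needs no changes as platforms move from the last
case toward the first.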

Now I'll go back to writing up the big picture for this programming
model so I can ask you for comments on that as well...

-andy