Re: Fwd: Re: Fedora27: NFS v4 terrible write performance, is async working

On 06/02/18 21:48, J. Bruce Fields wrote:
On Tue, Feb 06, 2018 at 08:18:27PM +0000, Terry Barnaby wrote:
Well, when a program running on a system calls open(), write() etc. to the
local disk FS, the disk's contents are not actually updated. The data sits in
kernel buffers until the next sync/fsync or until some time has passed. So, in
your parlance, the OS write() call lies to the program: it is async by default
unless the "sync" mount option is used when mounting the particular
file system in question.
That's right, but note that applications are written with the knowledge that
OSes behave this way, and are given tools (sync, fsync, etc.) to manage
this behavior so that they still have some control over what survives a
crash.

(But sync & friends no longer do what they're supposed to on a Linux
server exporting with async.)
Don't fsync() and perhaps sync() work across NFS, then, when the server has an async export? I thought they did, along with file locking to some extent.
Although it is different from the current NFS settings methods, I would have
thought that this should be the same for NFS. So if a client mounts a file
system normally it is async, i.e. write() data is in buffers somewhere (client
or server) unless the client mounts the file system in sync mode.
In fact, this is pretty much how it works, for write().

It didn't use to be that way--NFSv2 writes were all synchronous.

The problem is that if a server power cycles while it still has dirty
data in its caches, what should you do?
You can't ignore it--you'd just be silently losing data.  You could
return an error at some point, but "we just lost some of your data, no
idea what" isn't an error an application can really act on.
Yes, it is tricky error handling. But what does a program do when its local hard disk or machine dies underneath it anyway? I don't think a program on a remote system is particularly worse off if the NFS server dies; it may have to die if it can't do any special recovery. If it was important to get the data to disk it would have been using fsync(), FS sync, or some other transaction-based approach; indeed it shouldn't be using network remote disk mounts anyway. It all depends on what the program is doing and its usage requirements. A cc failing once in a blue moon is not a real issue (as long as it fails and removes its created files, or at least a make clean can be run).

As I have said, I have used NFS async for about 27+ years on multiple systems with no problems when servers die, for the type of usage I use NFS for. The number of times a server has died is low in that time. Client systems have died many, many more times (user issues, experimental programs/kernels, random program usage, single cheap disks, cheaper non-ECC RAM etc.)
So NFSv3 introduced a separation of write into WRITE and COMMIT.  The
client first sends a WRITE with the data, then later sends a COMMIT
call that says "please don't return till that data I sent before is
actually on disk".

If the server reboots, there's a limited set of data that the client
needs to resend to recover (just data that's been written but not
committed).

But we only have that for file data; metadata would be more complicated,
so stuff like file creates, setattr, directory operations, etc., are
still synchronous.
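
Seen from a client application, that split looks roughly like this (a minimal sketch only; the path and sizes are made up):

    /* Each write() below just dirties pages; the client sends them to the
     * server as WRITE calls marked UNSTABLE, which the server may only
     * cache.  The fsync() turns into a COMMIT that blocks until the server
     * has the data on stable storage.  If the server rebooted in between,
     * the client sees a changed write verifier in the reply and resends
     * the uncommitted data. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        static char buf[64 * 1024];
        memset(buf, 'x', sizeof(buf));

        int fd = open("/mnt/nfs/bigfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        for (int i = 0; i < 256; i++) {          /* ~16MB of dirty data */
            if (write(fd, buf, sizeof(buf)) < 0) {
                perror("write");
                return 1;
            }
        }

        if (fsync(fd) < 0) {                     /* wait until it is on disk */
            perror("fsync");
            return 1;
        }
        close(fd);
        return 0;
    }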

The only difference from the normal FS conventions I am suggesting is to
allow the server to stipulate "sync" on its mount, forcing sync
mode for all clients on that FS.
Anyway, we don't have protocol to tell clients to do that.
As I said, NFSv4.3 :)

In the case of a /home mount, for example, or a source code build file
system, it is normally only one client that is accessing the dir, and if a
write fails due to the server going down (an unlikely occurrence) it's not
much of an issue. I have only had this happen a couple of times in 28 years,
and then with no significant issues (power outage, disk failure pre-RAID, etc.).
So if you have reliable servers and power, maybe you're comfortable with
the risk.  There's a reason that's not the default, though.
Well, it is the default for local FS mounts, so I really don't see why it should be different for network mounts. But anyway, for my usage NFS sync is completely unusable (as local sync mounts would be), so it has to be async NFS or local disks (13 secs local disk -> 3 mins NFS async -> 2 hours NFS sync). I would have thought that would go for the majority of NFS usage. No issue to me though, as long as async can be configured and works well :)

4. The 0.5ms RPC latency seems a bit high (ICMP pings 0.12ms). Maybe this
is worth investigating in the Linux kernel processing (how?)?
Yes, that'd be interesting to investigate.  With some kernel tracing I
think it should be possible to get high-resolution timings for the
processing of a single RPC call, which would make a good start.

It'd probably also be interesting to start with the simplest possible RPC
and then work our way up and see when the RTT increases the most--e.g.
does an RPC ping (an RPC with procedure 0, empty argument and reply)
already have a round-trip time closer to .5ms or .12ms?
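
Something like this would give a baseline number to compare against (a rough, untested sketch: it makes one NULL call to the NFS program on a server named on the command line, assumes rpcbind is reachable on that server, and on current Fedora needs libtirpc, e.g. "cc -I/usr/include/tirpc rpcping.c -ltirpc"):

    /* Time a single RPC "ping": procedure 0 of NFS v3, empty argument and
     * reply, so it measures RPC + network + scheduling overhead with no
     * filesystem work at all. */
    #include <rpc/rpc.h>
    #include <stdio.h>
    #include <time.h>

    #define NFS_PROG  100003
    #define NFS_VERS3 3

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <server>\n", argv[0]);
            return 1;
        }

        CLIENT *clnt = clnt_create(argv[1], NFS_PROG, NFS_VERS3, "tcp");
        if (!clnt) {
            clnt_pcreateerror("clnt_create");
            return 1;
        }

        struct timeval timeout = { 5, 0 };
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        enum clnt_stat st = clnt_call(clnt, 0 /* NULL procedure */,
                                      (xdrproc_t)xdr_void, NULL,
                                      (xdrproc_t)xdr_void, NULL, timeout);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (st != RPC_SUCCESS) {
            clnt_perror(clnt, "clnt_call");
            return 1;
        }

        printf("NULL RPC round trip: %.1f us\n",
               (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3);

        clnt_destroy(clnt);
        return 0;
    }

(rpcinfo can make the same NULL call but doesn't report the time, hence timing it by hand; repeating the call in a loop would give a steadier number than a single shot.)
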
Any pointers to trying this? I have a small amount of time as work is quiet
at the moment.
Hm.  I wonder if testing over loopback would give interesting enough
results.  That might simplify testing even if it's not as realistic.
You could start by seeing if latency is still similar.

You could start by googling around for "ftrace"; I think lwn.net's
articles were pretty good introductions.

I don't do this very often and don't have good step-by-step
instructions....

I believe the simplest way to do it was using "trace-cmd" (which is
packaged for fedora in a package of the same name).  The man page looks
skimpy, but https://lwn.net/Articles/410200/ looks good.  Maybe run it
while just stat-ing a single file on an NFS partition as a start.

I don't know if that will result in too much data.  Figuring out how to
filter it may be tricky.  Tracing everything may be prohibitive.
Several processes are involved so you don't want to restrict by process.
Maybe restricting to functions in nfsd and sunrpc modules would work,
with something like -l ':mod:nfs' -l ':mod:sunrpc'.
Thanks for the ideas, I will try and have a play.

We have also found that SSDs, or at least NAND flash, have quite a few write
latency peculiarities. We use eMMC NAND flash on a few embedded systems we
have designed, and the write latency patterns are a bit random and not well
described/defined in datasheets etc. Difficult when you have an embedded
system with small amounts of RAM doing real-time data capture!
That's one of the reasons you want the "enterprise" drives with power
loss protection--they let you just write to cache, so write-behind and
gathering of writes into erase-block-sized writes to flash should allow
the firmware to hide weird flash latency from you.

A few years ago (and I have poor notes, so take this with a grain of
salt) I tested the same untar-a-kernel workload using an external
journal on an SSD without that feature, and found it didn't offer any
improvement.

Although using a low-latency SSD drive could speed up NFS sync performance,
I don't think it would affect NFS async write performance much (it is already
50-100x slower than normal HD access).
You need to specify "file creates" here--over NFS that has very
different performance characteristics.  Ordinary file writes should
still be able to saturate the network and/or disk in most cases.
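
For example, an untar-style workload is dominated by this per-file pattern (a sketch only; the directory name is made up and would have to exist), where it is mostly the per-file create, not the few bytes of data, that costs synchronous round trips to the server:

    /* Creating lots of small files: each open(O_CREAT) is a metadata
     * operation that is a synchronous round trip to the server (and with
     * a "sync" export it also has to reach the disk before the reply),
     * and closing each file flushes its few bytes of data as well, so
     * there is almost nothing for the client to batch up.  Streaming
     * writes into one big file have none of these per-file round trips,
     * which is why the two patterns behave so differently over NFS. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char name[64];
        const char data[] = "a little bit of file content\n";

        for (int i = 0; i < 1000; i++) {
            snprintf(name, sizeof(name), "/mnt/nfs/untar-test/file%04d", i);
            int fd = open(name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0) { perror("open"); return 1; }
            if (write(fd, data, sizeof(data) - 1) < 0) {
                perror("write");
                return 1;
            }
            close(fd);
        }
        return 0;
    }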

If you want a protocol that makes no distinction between metadata and
data, and if you *really* don't do any sharing between clients, then
another option is to use a block protocol (iscsi or something).  That
will have different drawbacks.

It is the latency and the way the protocol works that is causing the
most issue.
Sure.  The protocol issues are probably more complicated than they first
appear, though!
Yes, they probably are; most things are below the surface. But I still think there are likely to be a lot of improvements that could be made that would make using NFS async more tenable to the user; if necessary, local file caching (to local disk) with delayed NFS writes. I do use fscache for the NFS - OpenVPN - FTTP mounts, but the NFS cache timeout checks probably hit the performance of this for reads, and I presume writes would be write-through rather than delayed write. I haven't actually looked at the performance of this, and I know there are other network file systems that may be more suited in that case.

--b.

_______________________________________________
devel mailing list -- devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to devel-leave@xxxxxxxxxxxxxxxxxxxxxxx



