On 05/02/18 23:06, J. Bruce Fields wrote:
On Thu, Feb 01, 2018 at 08:29:49AM +0000, Terry Barnaby wrote:
1. Have an OPEN-SETATTR-WRITE RPC call all in one and a SETATTR-CLOSE call
all in one. This would reduce the latency of a small file to 1ms rather than
3ms, thus 66% faster. It would require the client to delay the OPEN/SETATTR
until the first WRITE. Not sure how possible this is in the implementations.
Maybe READs could be improved as well, but getting the OPEN through quickly
may be better in this case?
2. Could go further with an OPEN-SETATTR-WRITE-CLOSE RPC call. (0.5ms vs
3ms).
The protocol doesn't currently let us delay the OPEN like that,
unfortunately.
Yes, I should have thought of that; too focused on network traces and not
thinking about the program/OS API :)
But maybe OPEN-SETATTR and SETATTR-CLOSE would be possible.
What we can do that might help: we can grant a write delegation in the
reply to the OPEN. In theory that should allow the following operations
to be performed asynchronously, so the untar can immediately issue the
next OPEN without waiting. (In practice I'm not sure what the current
client will do.)
I'm expecting to get to write delegations this year....
It probably wouldn't be hard to hack the server to return write
delegations even when that's not necessarily correct, just to get an
idea what kind of speedup is available here.
That sounds good. I will have to read up on NFS write delegations; I'm not
sure how they work. I guess write() errors would be returned later than
they actually occurred, etc.?
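For what it's worth, the arithmetic behind the numbers I quoted above is
just serialised round trips per file times RTT. A rough, illustrative
model in Python (the 0.5ms RTT and the per-file RPC counts come from the
traces I quoted; the file count is just an example):

  # Rough model of wire-latency cost when untarring many small files
  # over NFS, assuming each file needs a fixed number of serialised
  # RPC round trips and each round trip costs ~0.5ms.
  RTT_MS = 0.5        # assumed per-RPC round-trip time
  NUM_FILES = 10000   # assumed number of small files in the tarball

  def untar_wire_time_s(rpcs_per_file):
      return rpcs_per_file * RTT_MS * NUM_FILES / 1000.0

  print(untar_wire_time_s(6))  # today: ~6 RPCs per small file  -> ~30s
  print(untar_wire_time_s(2))  # OPEN-SETATTR-WRITE + SETATTR-CLOSE -> ~10s
  print(untar_wire_time_s(1))  # OPEN-SETATTR-WRITE-CLOSE -> ~5s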
3. On sync/async modes personally I think it would be better for the client
to request the mount in sync/async mode. The setting of sync on the server
side would just enforce sync mode for all clients. If the server is in the
default async mode clients can mount using sync or async as to their
requirements. This seems to match normal VFS semantics and usage patterns
better.
The client-side and server-side options are both named "sync", but they
aren't really related. The server-side "async" export option causes the
server to lie to clients, telling them that data has reached disk even
when it hasn't. This affects all clients, whether they mounted with
"sync" or "async". It violates the NFS specs, so it is not the default.
I don't understand your proposal. It sounds like you believe that
mounting on the client side with the "sync" option will make your data
safe even if the "async" option is set on the server side?
Unfortunately that's not how it works.
Well, when a program running on a system calls open(), write() etc. on a
local disk FS, the disk's contents are not actually updated straight away.
The data sits in kernel buffers until the next sync/fsync or until some
time has passed. So, in your parlance, the OS write() call lies to the
program. So it is async by default unless the "sync" mount option is used
when mounting the particular file system in question.
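To illustrate the local behaviour I mean, a minimal Python sketch (the
path is just an example): write() returns as soon as the data is in the
page cache, and only fsync() (or a "sync" mount) forces it to disk:

  import os

  # write() completes once the data is in the kernel's page cache;
  # nothing guarantees it has reached the platter yet.
  fd = os.open("/tmp/example.dat", os.O_WRONLY | os.O_CREAT, 0o644)
  os.write(fd, b"hello\n")  # returns immediately, data may only be in RAM

  # Only an explicit fsync() (or a "sync" mount of the filesystem)
  # forces the data out to stable storage before returning.
  os.fsync(fd)
  os.close(fd)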
Although it is different from the current NFS settings methods, I would
have thought that this should be the same for NFS. So if a client mounts
a file system normally it is async, i.e. write() data is in buffers
somewhere (client or server) unless the client mounts the file system in
sync mode. The only difference from the normal FS conventions I am
suggesting is to allow the server to stipulate "sync" on its export, which
would force sync mode for all clients on that FS. I know it is different
from the standard NFS config, but it just seems more logical to me :) The
sync/async option and its ramifications are really dependent on the
client's usage in most cases.
In the case of a /home mount, for example, or a source code build file
system, it is normally only one client that is accessing the directory,
and if a write fails due to the server going down (an unlikely
occurrence), it's not much of an issue. I have only had this happen a
couple of times in 28 years, and then with no significant issues (power
outage, disk failure pre-RAID etc.).
I know that is not how NFS currently "works"; it just seems illogical to
me the way it currently does work :)
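For reference, my understanding of the current knobs, which are
independent of each other (the export paths and hostname below are just
examples): the server-side option lives in /etc/exports, while the
client-side one is a mount option.

  # /etc/exports on the server ("sync" is the default; "async" lets the
  # server acknowledge writes before they have reached stable storage):
  /export/home     *(rw,sync)
  /export/scratch  *(rw,async)

  # On the client ("async" is the default; "sync" makes the client flush
  # each write to the server before write() returns to the application):
  mount -t nfs -o sync server:/export/home /home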
4. The 0.5ms RPC latency seems a bit high (ICMP pings are 0.12ms). Maybe
this is worth investigating in the Linux kernel processing (how?)?
Yes, that'd be interesting to investigate. With some kernel tracing I
think it should be possible to get high-resolution timings for the
processing of a single RPC call, which would make a good start.
It'd probably also be interesting to start with the simplest possible RPC
and then work our way up and see where the RTT increases the most--e.g.
does an RPC ping (an RPC with procedure 0, empty argument and reply)
already have a round-trip time closer to 0.5ms or 0.12ms?
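Something like the following might do as a crude first measurement; it
times a bare RPC ping (procedure 0, empty argument and reply) of the NFS
program over TCP. A sketch only--the server name, iteration count and
NFS version are assumptions to adjust:

  import socket, struct, time

  SERVER = "nfs-server"              # example hostname
  PROG, VERS, PROC = 100003, 3, 0    # NFS program, NULL procedure

  def null_call(sock, xid):
      body = struct.pack(">10I",
                         xid, 0, 2,      # xid, CALL, RPC version 2
                         PROG, VERS, PROC,
                         0, 0, 0, 0)     # AUTH_NONE cred and verf
      # TCP record marking: last-fragment bit + fragment length
      sock.sendall(struct.pack(">I", 0x80000000 | len(body)) + body)
      hdr = sock.recv(4, socket.MSG_WAITALL)
      (mark,) = struct.unpack(">I", hdr)
      reply = sock.recv(mark & 0x7fffffff, socket.MSG_WAITALL)
      assert struct.unpack(">I", reply[:4])[0] == xid

  sock = socket.create_connection((SERVER, 2049))
  sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
  times = []
  for i in range(1000):
      t0 = time.perf_counter()
      null_call(sock, i + 1)
      times.append((time.perf_counter() - t0) * 1000.0)
  times.sort()
  print("min %.3fms  median %.3fms  99%% %.3fms"
        % (times[0], times[len(times) // 2], times[990]))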
Any pointers to trying this? I have a small amount of time as work is
quiet at the moment.
5. The 20ms RPC latency I see in sync mode needs a look on my system,
although async mode is fine for my usage. Maybe this ends up as 2 x 10ms
drive seeks on ext4 and is thus expected.
Yes, this is why dedicated file servers have hardware designed to lower
that latency.
As long as you're exporting with "async" and don't care about data
safety across crashes or power outages, I guess you could go all the way
and mount your ext4 export with "nobarrier"; I *think* that will let the
system acknowledge writes as soon as they reach the disk's write cache.
I don't recommend that.
Just for fun I dug around a little for cheap options to get safe
low-latency storage:
For Intel you can cross-reference this list:
https://ark.intel.com/Search/FeatureFilter?productType=solidstatedrives&EPLDP=true
of SSD's with "enhanced power loss data protection" (EPLDP) with
shopping sites and I find e.g. this for US $121:
https://www.newegg.com/Product/Product.aspx?Item=9SIABVR66R5680
See the "device=" option in the ext4 man pages--you can use that to give
your existing ext4 filesystem an external journal on that device. I
think you want "data=journal" as well, then writes should normally be
acknowledged once they hit that SSD's write cache, which should be quite
quick.
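Roughly, the steps would be something like this (device names are
examples, I haven't tested it, and the filesystem has to be unmounted and
clean while the journal is changed):

  # Turn the low-latency SSD partition into an external journal device:
  mke2fs -O journal_dev /dev/nvme0n1p1

  # Drop the internal journal from the exported filesystem, then attach
  # the external one:
  tune2fs -O ^has_journal /dev/sdb1
  tune2fs -J device=/dev/nvme0n1p1 /dev/sdb1

  # Mount with full data journalling so writes are acknowledged once
  # they are safe in the (fast) external journal:
  mount -o data=journal /dev/sdb1 /export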
I was also curious whether there were PCI SSDs, but the cheapest Intel
SSD with EPLDP is the P4800X, at US $1600.
Intel Optane Memory is interesting as it starts at $70. It doesn't have
EPLDP but latency of the underlying storage might be better even without
that?
I haven't figured out how to get a similar list for other brands.
Just searching for "SSD power loss protection" on newegg:
This also claims "power loss protection" at $53, but I can't find any
reviews:
https://www.newegg.com/Product/Product.aspx?Item=9SIA1K642V2376&cm_re=ssd_power_loss_protection-_-9SIA1K642V2376-_-Product
Or this?:
https://www.newegg.com/Product/Product.aspx?Item=N82E16820156153&cm_re=ssd_power_loss_protection-_-20-156-153-_-Product
This is another interesting discussion of the problem:
https://blogs.technet.microsoft.com/filecab/2016/11/18/dont-do-it-consumer-ssd/
--b.
We have also found that SSDs, or at least NAND flash, have quite a few
write latency peculiarities. We use eMMC NAND flash on a few embedded
systems we have designed, and the write latency patterns are a bit random
and not well described/defined in datasheets etc. Difficult when you
have an embedded system with small amounts of RAM doing real-time data
capture!
Although using a low-latency SSD drive could speed up NFS sync
performance, I don't think it would affect NFS async write performance
that much (it is already 50 - 100x slower than normal HD access). It is
the latency and the way the protocol works that is causing the most
issues. Changing the NFS file system protocol/performance has much more
scope for improvement in async mode, and async mode, I think, is fine for
most usage.