Re: Fedora27: NFS v4 terrible write performance, is async working

"J. Bruce Fields" <bfields@xxxxxxxxxx> · Mon, 12 Feb 2018 16:55:22 -0500

On Mon, Feb 12, 2018 at 05:35:49PM +0000, Terry Barnaby wrote:
> Well that seems like a major drop off, I always thought that fsync() would
> work in this case.

No, it never has.

> I don't understand why fsync() should not operate as
> intended ? Sounds like this NFS async thing needs some work !

By "NFS async" I assume you mean the export option.  Believe me, I'd
remove it entirely if I thought I could get away with it.

> I still do not understand why NFS doesn't operate in the same way as a
> standard mount on this. The use for async is only for improved performance
> due to disk write latency and speed (or are there other reasons ?)

Reasons for the async export option?  Historically I believe it was a
workaround for the fact that NFSv2 didn't have COMMIT, so even writes of
ordinary file data suffered from the problem that metadata-modifying
operations still have today: the server has to get the change onto
stable storage before it can reply.
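
To make the COMMIT point concrete, here is a toy Python model of the
NFSv3-style scheme: the client sends unstable writes, then a COMMIT, and
compares the write verifier it gets back to detect a server reboot in
between.  This is only a sketch, not real NFS code; ToyServer,
client_write and the integer verifier are made-up names for
illustration.

```python
import itertools


class ToyServer:
    """Toy stand-in for an NFSv3-style server (hypothetical, not real NFS)."""

    _boot_ids = itertools.count(1)

    def __init__(self):
        self.disk = {}    # offset -> data that survives a crash
        self.cache = {}   # offset -> data held only in memory
        self.verifier = next(self._boot_ids)

    def write_unstable(self, offset, data):
        # Reply before the data is on disk; report the current verifier.
        self.cache[offset] = data
        return self.verifier

    def commit(self):
        # Flush cached data to "disk", then report the current verifier.
        self.disk.update(self.cache)
        self.cache.clear()
        return self.verifier

    def crash_and_reboot(self):
        # Uncommitted data is lost and the verifier changes.
        self.cache.clear()
        self.verifier = next(self._boot_ids)


def client_write(server, chunks, crash_before_commit=False):
    """Send chunks unstable, COMMIT, and resend if the verifier changed."""
    while True:
        seen = {server.write_unstable(off, data)
                for off, data in chunks.items()}
        if crash_before_commit:
            server.crash_and_reboot()   # simulate a reboot mid-transfer
            crash_before_commit = False
        seen.add(server.commit())
        if len(seen) == 1:
            return  # one verifier throughout: data reached stable storage
        # Verifier changed: the server rebooted, so resend the writes.
```

Without COMMIT (the NFSv2 situation) every write_unstable() above would
have to behave like a write followed immediately by commit().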

> So with a local system mount:
> 
> async: normal mode: All system calls manipulate in buffer memory disk
> structure (inodes etc). Data/Metadata is flushed to disk on fsync(), sync()
> and occasionally by kernel. Processes data is not actually stored until
> fsync(), sync() etc.
> 
> sync: with sync option. Data/metadata is written to disk before system calls
> return (all FS system calls ?).
> 
> With an NFS mount I would have thought it should be the same.

As a distributed filesystem which aims to survive server reboots, it's
more complicated.
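
For comparison, the local "async" behavior described in the quoted text
looks roughly like this from an application's point of view.  This is a
minimal sketch assuming a Linux/POSIX system; durable_create is a
made-up helper name.  The data is only known to be stable once fsync()
on the file returns, and the new directory entry only once the parent
directory has been fsync'd as well.

```python
import os


def durable_create(path, data):
    """Write data to path; return after file and directory entry are flushed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)          # flush file data and inode to stable storage
    finally:
        os.close(fd)

    # Flush the directory so the new name itself survives a crash.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```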

> async: normal mode: All system calls manipulate in buffer memory disk
> structure (inodes etc) this would normally be on the server (so multiple
> clients can work with the same data) but with some options (particular
> usage) maybe client side write buffering/caching could be used (ie. data
> would not actually pass to server during every FS system call).

Client-side write caching is definitely required if you want to, for
example, be able to use the full network bandwidth when writing data to
a file.

> Data/Metadata is flushed to server disk on fsync(), sync() and occasionally
> by kernel (If client side write caching is used flushes across network and
> then flushes server buffers). Processes data is not actually stored until
> fsync(), sync() etc.

I'd be nervous about the idea of a lot of unsync'd metadata changes sitting
around in server memory.  On server crash/restart that's a bunch of
files and directories that are visible to every client, and that vanish
without anyone actually deleting them.  I wonder what the consequences
would be?

This is something that can only happen on a distributed filesystem: on
a local filesystem such as ext4, a crash takes down all the users of
the filesystem too.

(Thinking about this: don't we already have a tiny window during the rpc
processing, after a change has been made but before it's been committed,
when a server crash could make the change vanish?  But, no, actually, I
believe we hold a lock on the parent directory in every such case,
preventing anyone from seeing the change till the commit has finished.)

Also, delegations potentially hide both network and disk latency,
whereas your proposal only hides disk latency.  Disk latency is the more
important of the two in your case.  I'm not sure what the ratio is for
higher-end setups, actually; probably disk latency is still the larger
of the two, just by a smaller margin.

> sync: with client side sync option. Data/metadata is written across NFS and
> to Server disk before system calls return (all FS system calls ?).
> 
> I really don't understand why the async option is implemented on the server
> export although a sync option here could force sync for all clients for that
> mount. What am I missing ? Is there some good reason (rather than history)
> it is done this way ?

So, again, Linux knfsd's "async" export behavior is just incorrect, and
I'd be happier if we didn't have to support it.

See above for why I don't think what you describe as async-like behavior
would fly.

As for adding protocol to allow the server to tell all clients that they
should do "sync" mounts: I don't know, I suppose it's possible, but a) I
don't know how much use it would actually get (I suspect "sync" mounts
are pretty rare), and b) that's meddling with client implementation
behavior a little more than we normally would in the protocol.  The
difference between "sync" and "async" mounts is purely a matter of
client behavior, after all; it's not really visible to the protocol at
all.
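
As an illustration that this is a client-side choice: an application
can ask for sync-like behavior on a single file by opening it with
O_SYNC, without any "sync" mount option.  A minimal sketch, assuming a
Linux client (write_sync is a made-up helper name):

```python
import os


def write_sync(path, data):
    """Write data with O_SYNC: each write() returns only once the
    data has been reported as stable, rather than merely cached."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested.
            written = os.write(fd, view)
            view = view[written:]
    finally:
        os.close(fd)
```

Nothing in the requests the server sees says which mount option, if any,
was in effect on the client.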

--b.