On Tue, Feb 06, 2018 at 08:18:27PM +0000, Terry Barnaby wrote:
> Well, when a program running on a system calls open(), write() etc. to the
> local disk FS, the disk's contents are not actually updated. The data is in
> server buffers until the next sync/fsync or some time has passed. So, in
> your parlance, the OS write() call lies to the program. So it is by default
> async unless the "sync" mount option is used when mounting the particular
> file system in question.

That's right, but note applications are written with the knowledge that OSes
behave this way, and are given tools (sync, fsync, etc.) to manage this
behavior so that they still have some control over what survives a crash.
(But sync & friends no longer do what they're supposed to on a Linux server
exporting with async.)

> Although it is different from the current NFS settings methods, I would have
> thought that this should be the same for NFS. So if a client mounts a file
> system normally it is async, i.e. write() data is in buffers somewhere
> (client or server) unless the client mounts the file system in sync mode.

In fact, this is pretty much how it works, for write().  It didn't used to be
that way--NFSv2 writes were all synchronous.

The problem is that if a server power cycles while it still had dirty data in
its caches, what should you do?  You can't ignore it--you'd just be silently
losing data.  You could return an error at some point, but "we just lost some
of your data, no idea what" isn't an error an application can really act on.

So NFSv3 introduced a separation of write into WRITE and COMMIT.  The client
first sends a WRITE with the data, then later sends a COMMIT call that says
"please don't return till that data I sent before is actually on disk".  If
the server reboots, there's a limited set of data that the client needs to
resend to recover (just the data that's been written but not committed).

But we only have that for file data; metadata would be more complicated, so
stuff like file creates, setattr, directory operations, etc., are still
synchronous.

> Only difference from the normal FS conventions I am suggesting is to
> allow the server to stipulate "sync" on its mount that forces sync
> mode for all clients on that FS.

Anyway, we don't have a protocol to tell clients to do that.

> In the case of a /home mount for example, or a source code build file
> system, it is normally only one client that is accessing the dir, and if a
> write fails due to the server going down (an unlikely occurrence), it's not
> much of an issue. I have only had this happen a couple of times in 28 years
> and then with no significant issues (power outage, disk fail pre-raid etc.).

So if you have reliable servers and power, maybe you're comfortable with the
risk.  There's a reason that's not the default, though.

> > > 4. The 0.5ms RPC latency seems a bit high (ICMP pings 0.12ms). Maybe
> > > this is worth investigating in the Linux kernel processing (how?)?
> >
> > Yes, that'd be interesting to investigate. With some kernel tracing I
> > think it should be possible to get high-resolution timings for the
> > processing of a single RPC call, which would make a good start.
> >
> > It'd probably also be interesting to start with the simplest possible RPC
> > and then work our way up and see when the RTT increases the most--e.g.
> > does an RPC ping (an RPC with procedure 0, empty argument and reply)
> > already have a round-trip time closer to .5ms or .12ms?
>
> Any pointers to trying this? I have a small amount of time as work is quiet
> at the moment.

Hm.  I wonder if testing over loopback would give interesting enough results.
That might simplify testing even if it's not as realistic.  You could start by
seeing if latency is still similar.

A good place to start is googling around for "ftrace"; I think lwn.net's
articles were pretty good introductions.  I don't do this very often and don't
have good step-by-step instructions....  I believe the simplest way to do it
was using "trace-cmd" (which is packaged for Fedora in a package of the same
name).  The man page looks skimpy, but https://lwn.net/Articles/410200/ looks
good.

Maybe run it while just stat-ing a single file on an NFS partition as a start.
I don't know if that will result in too much data.  Figuring out how to filter
it may be tricky: tracing everything may be prohibitive, and several processes
are involved, so you don't want to restrict by process.  Maybe restricting to
functions in the nfsd and sunrpc modules would work, with something like
-l ':mod:nfsd' -l ':mod:sunrpc'.
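For concreteness, here's a rough, untested sketch of what that could look
like, run as root on the server (or on a single box when testing over
loopback).  The module filters follow the suggestion above; the mount point
and file name are just placeholders:

    # Trace nfsd/sunrpc kernel functions for ~10 seconds while the workload
    # runs; trace-cmd record writes its results to trace.dat by default.
    trace-cmd record -p function_graph \
        -l ':mod:nfsd' -l ':mod:sunrpc' \
        sleep 10

    # Meanwhile, from the client (path is just an example):
    #     stat /mnt/nfs/testfile

    # Then look at the per-function timings:
    trace-cmd report | less

The function_graph tracer records function entry and exit with timestamps, so
that should give a first idea of where inside the server a single RPC's half
millisecond is going.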
> We have also found that SSD's, or at least NAND flash, has quite a few write
> latency peculiarities. We use eMMC NAND flash on a few embedded systems we
> have designed and the write latency patterns are a bit random and not well
> described/defined in datasheets etc. Difficult when you have an embedded
> system with small amounts of RAM doing real-time data capture!

That's one of the reasons you want the "enterprise" drives with power loss
protection--they let you just write to cache, so write-behind and gathering of
writes into erase-block-sized writes to flash should allow the firmware to
hide weird flash latency from you.

A few years ago (and I have poor notes, so take this with a grain of salt) I
tested the same untar-a-kernel workload using an external journal on an SSD
without that feature, and found it didn't offer any improvement.

> Although using a low latency SSD drive could speed up NFS sync performance,
> I don't think it would affect NFS async write performance that much (already
> 50 - 100 x slower than normal HD access).

You need to specify "file creates" here--over NFS those have very different
performance characteristics.  Ordinary file writes should still be able to
saturate the network and/or disk in most cases.

If you want a protocol that makes no distinction between metadata and data,
and if you *really* don't do any sharing between clients, then another option
is to use a block protocol (iSCSI or something).  That will have different
drawbacks.

> It is the latency and the way the protocol works that is causing the
> most issue.

Sure.  The protocol issues are probably more complicated than they first
appear, though!

--b.