Please ignore my previous patch (which this is in reply to); it was flawed in various ways. I now understand the code a bit better and have a somewhat simpler patch which appears to address the same problem.

The problem is that writeback to NFS often produces lots of small writes (tens of kilobytes) rather than fewer large writes (1MB). This pattern can often hurt throughput, but in certain circumstances it can hurt NFS throughput more than expected.

Each nfs_writepages() call results in an NFS COMMIT being sent to the server. If writeback triggers lots of smaller nfs_writepages() calls, this means lots of COMMITs. If the server is slow to handle the COMMIT (I've seen the Ganesha NFS server take over 200ms per COMMIT), these COMMITs can overlap, queue up, choke the NFS server, and cause an order-of-magnitude drop in throughput.

So we really want to call nfs_writepages() only when there are a largish number of pages to be written - i.e. that are 'dirty'.

For historical reasons that I didn't thoroughly research but I'm confident are no longer relevant, pages that have been written to the NFS server but have not yet been the subject of a COMMIT - so-called "unstable" pages - are effectively accounted the same as "dirty" pages (sometimes called "reclaimable"). This can result in writeback thinking there are lots of "dirty" pages to reclaim, while nfs_writepages() can only find a few that it can write out.

The second patch following changes the accounting for these "unstable" pages. They are now always accounted exactly the same as writeback pages. Conceptually they can be thought of as still in writeback, but the writeback is now happening on the server. A COMMIT will always automatically follow the writes generated by nfs_writepages(), so from the perspective of the VM there really is no difference: it has scheduled the write and there is nothing else it can do except wait.

Testing this patch showed that loop-back NFS is prone to deadlocks again. I cannot see exactly how the change to 'unstable' accounting affected this, but I can see that the old +25% heuristic can no longer be justified given the complexity of writeback calculations. So the first patch following changes how writeback is handled for NFS servers handling loop-back requests (and other similar services) so that it is more obviously safe against excessive dirty pages scheduled for other devices.

Thanks,
NeilBrown