On Tue, Dec 14, 2010 at 04:53:46AM +0800, Trond Myklebust wrote:
> On Mon, 2010-12-13 at 22:47 +0800, Wu Fengguang wrote:
> > plain text document attachment (writeback-nfs-should-commit.patch)
> > The heuristics introduced by commit 420e3646 ("NFS: Reduce the number of
> > unnecessary COMMIT calls") do not work well for large inodes being
> > actively written to.
> >
> > Refine the criterion to
> > - it has gone quiet (all data transferred to server)
> > - has accumulated >= 4MB data to commit (so it will be large IO)
> > - too few active commits (hence active IO) in the server
>
> Where does the number 4MB come from? If I'm writing a 4GB file, I
> certainly do not want to commit every 4MB; that would make for a total
> of 1000 commit requests in addition to the writes. On a 64-bit
> client+server both having loads of memory and connected by a decently
> fast network, that can be a significant slowdown...

Sorry, the description omits too many details. Let me show you the
behavior in a real workload first.

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/writeback-inode.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-1dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-04/nfs-commit-300.png

On a 3GB client writing 50MB/s to the NFS server, the write chunk size
and commit size are mostly 32MB and 64MB. The ->writepages() size and
the later commit size actually scale up to the available write bandwidth
("[PATCH 20/35] writeback: scale IO chunk size up to device bandwidth").

So the "4MB" here is merely the minimal threshold. I chose it mainly by
the rule of thumb that "it's not too bad an IO size". It mainly matters
in these cases:

1) low client=>server write bandwidth

In this case the VFS will call ->writepages() with a small (but always
>= 4MB, see patch 20/35) nr_to_write, and the 4MB threshold helps
accumulate to-be-committed pages over multiple ->write_inode() calls.
As you said, it would help to further scale this 4MB threshold up to the
client's memory size. But complexity arises in the next case.

2) bandwidth/memory is high, but there are lots of concurrent dd's

When doing 10 dd's with mem=3G, it still achieves a 20-30MB write/commit
size:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-10dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-13/writeback-300.png
http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-10dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-13/nfs-commit-300.png

However, when there are 100 dd's, you cannot wait for each inode to
accumulate much more than 4MB of pages to commit, because 4*100MB is
approaching the client's dirty limit. So you'll see commit sizes of
around 4-5MB in this graph:

http://www.kernel.org/pub/linux/kernel/people/wfg/writeback/tests/3G/nfs-100dd-1M-8p-2952M-2.6.37-rc5+-2010-12-09-03-23/nfs-commit-300.png

Then you see the problem: how to decide on one auto-scaled threshold at
which to start the commit for the current inode? It's easy for the 1-dd
case. However, when there are N dd's (admittedly NFS clients rarely do
large N), we don't readily know the number N by which to scale down the
threshold that suits the 1-dd case.

So I gave up the scale-to-memory commit threshold idea that could help
case (1) and just do it in a dumb way that should be good enough. But
I'm open to better ideas :)

> Most of the time, we really want the server to be managing its dirty
> cache entirely independently of the client. The latter should only be
> sending the commit when it really needs to free up those pages.

Agreed. And that is the one major conflict I'm fighting with: make the
commit size large, but not so large that it causes unacceptable
fluctuations in the data flow. That led to the decision to include patch
20/35 in this series. It magically reduces the frequency of
->writepages()/write_inode() calls and makes the number of pages written
in each ->writepages() (and the later commit) semi-adaptive to the
number of concurrent dd's.
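
To put rough numbers on the sizes discussed above, here is a
back-of-the-envelope sketch in Python. It is only arithmetic, not code
from the patch; the helper names commits_needed() and
uncommitted_total() are made up for illustration, and "4GB" is taken as
4 GiB.

```python
MIB = 1024 * 1024


def commits_needed(file_bytes, commit_bytes):
    """COMMIT calls issued if one is sent for every commit_bytes written."""
    return -(-file_bytes // commit_bytes)  # ceiling division


def uncommitted_total(n_writers, per_inode_bytes):
    """Data left uncommitted on the client if each of n_writers inodes
    waits until it has accumulated per_inode_bytes before committing."""
    return n_writers * per_inode_bytes


if __name__ == "__main__":
    four_gib = 4 * 1024 * MIB

    # Commit count for a 4 GiB file at the 4MB floor and at the
    # 32MB/64MB sizes seen in the 1-dd graphs.
    for size_mib in (4, 32, 64):
        print(size_mib, "MiB commits:", commits_needed(four_gib, size_mib * MIB))

    # Uncommitted data for 1, 10 and 100 concurrent dd's that each
    # accumulate only the 4MB floor.
    for n in (1, 10, 100):
        print(n, "dd's:", uncommitted_total(n, 4 * MIB) // MIB, "MiB")
```

At the 4MB floor this gives 1024 commits for the 4 GiB file, against 64
at a 64MB commit size, which is why the floor is only a lower bound. The
second loop shows the other side: 100 writers at 4MB each already hold
400 MiB uncommitted, so the per-inode size cannot grow much further in
that case.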

Thanks,
Fengguang