On Thu, 2009-12-24 at 09:21 +0800, Wu Fengguang wrote:
> > Commits and writes on the same inode need to be serialized for
> > consistency (write can change the data and metadata; commit [fsync]
> > needs to provide guarantees that the written data are stable). The
> > performance problem arises because NFS writes are fast (they generally
> > just deposit data into the server's page cache), but commits can take a
>
> Right.
>
> > long time, especially if there is a lot of cached data to flush to
> > stable storage.
>
> "a lot of cached data to flush" is not likely with pdflush, since it
> roughly send one COMMIT per 4MB WRITEs. So in average each COMMIT
> syncs 4MB at the server side.

Maybe on paper, but empirically I see anywhere from one commit per 8MB
to one commit per 64MB.

>
> Your patch adds another pre-pdlush async write logic, which greatly
> reduced the number of COMMITs by pdflush. Can this be the major factor
> of the performance gain?

My patch removes pdflush from the picture almost entirely. See my
comments below.

>
> Jan has been proposing to change the pdflush logic from
>
>         loop over dirty files {
>                 writeback 4MB
>                 write_inode
>         }
> to
>         loop over dirty files {
>                 writeback all its dirty pages
>                 write_inode
>         }
>
> This should also be able to reduce the COMMIT numbers. I wonder if
> this (more general) approach can achieve the same performance gain.

The pdflush mechanism is fine for random writes and small sequential
writes, because it promotes concurrency -- instead of the application
blocking while it tries to write and commit its data, the application
can go on doing other more useful things, and the data gets flushed in
the background. There is also a benefit if the application makes
another modification to a page that is already dirty, because then
multiple modifications are coalesced into a single write.

However, the pdflush mechanism is wrong for large sequential writes
(like a backup stream, for example).
First, there is no concurrency to exploit -- the application is only
going to dirty more pages, so removing the need for it to block writing
the pages out only adds to the problem of memory pressure. Second, the
application is not going to go back and modify a page it has already
written, so leaving it in the cache for someone else to write provides
no additional benefit.

Note that this assumes the application actually cares about the
consistency of its data and will call fsync() when it is done. If the
application doesn't call fsync(), then it doesn't matter when the pages
are written to backing store, because the interface makes no guarantees
in this case.

Thanks,
Steve
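
For illustration only, here is a minimal sketch of the kind of
application the argument above assumes: one that writes a large stream
strictly sequentially, never revisits a page it has already written,
and calls fsync() once when it is done. The helper name write_stream
and its arguments are hypothetical, not taken from any patch in this
thread, and the sketch says nothing about how the kernel or the NFS
client schedules the resulting WRITEs and COMMITs.

```python
import os


def write_stream(path, chunks):
    """Write an iterable of byte chunks sequentially, then fsync once.

    Hypothetical helper: models a backup-style writer that only ever
    appends and wants its data stable before it reports success.
    Returns the total number of bytes written.
    """
    total = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for chunk in chunks:
            view = memoryview(chunk)
            # os.write() may write fewer bytes than requested, so loop
            # until the whole chunk has been handed to the kernel.
            while view:
                written = os.write(fd, view)
                view = view[written:]
                total += written
        # The single point at which the application asks for stability.
        os.fsync(fd)
    finally:
        os.close(fd)
    return total
```

A caller would pass something like a generator yielding fixed-size
blocks read from the source being backed up. Without the os.fsync()
call, the function would still return after the last write, but, as
noted above, the interface would then make no guarantee about when the
data reaches backing store.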