On Wed, Jun 29, 2011 at 11:00 AM, Christoph Hellwig <hch@xxxxxxxxxxxxx> wrote:
> On Wed, Jun 29, 2011 at 10:26:33AM -0700, Curt Wohlgemuth wrote:
>> The semantics did change between 2.6.34 and 2.6.35, though. When the
>> work item queue was introduced in 2.6.32, the semantics changed from
>> what you describe above to what was present through 2.6.34:
>> writeback_inodes_sb() would enqueue a work item and return. Your
>> commit 83ba7b07 ("writeback: simplify the write back thread queue")
>> added the wait_for_completion() call, putting the semantics back to
>> where they were pre-2.6.32.
>
> Yes. The kernels in between had those nasty writeback vs. umount races
> that we could trigger quite often.

My apologies for not being aware of these races (I tried to search the
mailing lists) -- how did they manifest themselves? I don't see anything
about them in your commit message for 83ba7b07. I do see that we no
longer take s_umount for sys_sync (only in sync_filesystem(), e.g., for
sys_syncfs)...

>> Though there is one additional change between the old way (pre-2.6.32)
>> and today: with the old kernel, the pdflush thread would operate
>> concurrently with the first (and second?) sync path through writeback.
>> Today, of course, they're *all* serialized. So a call to sys_sync()
>> will really enqueue 3 work items -- one from wakeup_flusher_threads(),
>> one from writeback_inodes_sb(), and one from sync_inodes_sb().
>
> Yes. And having WB_SYNC_NONE items from both wakeup_flusher_threads
> and writeback_inodes_sb is plain stupid. The initial conversion to the
> new sync_filesystem scheme had removed the wakeup_flusher_threads
> call, but that caused a huge regression in some benchmark.
>
> As mentioned before, Wu was working on some code to introduce tagging
> so that the WB_SYNC_ALL call won't start writing pages dirtied after
> the sync call, which should help with your issue. Although to
> completely solve it we really need to get down to just two passes.

I can definitely see how tagging pages at the start of sync would be
helpful, and going from three passes to two seems like a no-brainer.
But what's the real point of doing even two passes -- one SYNC_NONE
followed by one SYNC_ALL? Is it to try to get through as many inodes as
possible before we potentially lock up waiting on an inode on an
unavailable device?

Thanks,
Curt
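
For anyone following along, here is a minimal sketch of the sys_sync()
path under discussion, simplified from the 2.6.35-era fs/sync.c. The
function names match the kernel; the body is abbreviated from memory,
and quota sync, block device flushing, and laptop_mode handling are
omitted:

    /* Simplified sketch of the 2.6.35-era sys_sync() path. */
    SYSCALL_DEFINE0(sync)
    {
            /* Pass 1: kick every bdi flusher thread -- this enqueues a
             * WB_SYNC_NONE work item per backing device and returns
             * without waiting. */
            wakeup_flusher_threads(0);

            /* Pass 2: for each superblock, writeback_inodes_sb()
             * enqueues another WB_SYNC_NONE item and, since commit
             * 83ba7b07, waits for its completion before returning. */
            sync_filesystems(0);

            /* Pass 3: for each superblock, sync_inodes_sb() enqueues a
             * WB_SYNC_ALL item and waits until every inode has been
             * written out. */
            sync_filesystems(1);

            return 0;
    }

This is where the "three work items" count above comes from: passes 1
and 2 both queue WB_SYNC_NONE work, and pass 3 queues the WB_SYNC_ALL
work that actually guarantees completion.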
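And for the tagging Christoph mentions: if I'm reading the tree right,
Wu's idea landed in 2.6.38 as tag_pages_for_writeback() plus a new
PAGECACHE_TAG_TOWRITE radix-tree tag, used from write_cache_pages() in
mm/page-writeback.c. A rough sketch of the relevant logic (abbreviated;
the lookup loop and cyclic-writeback details are elided):

    /* Before a data-integrity pass starts, retag every page that is
     * currently DIRTY as TOWRITE, then have the pass visit only
     * TOWRITE pages. Pages dirtied after the sync began keep only the
     * DIRTY tag, so the WB_SYNC_ALL pass never picks them up and
     * cannot be livelocked by a concurrent dirtier. */
    int write_cache_pages(struct address_space *mapping,
                          struct writeback_control *wbc,
                          writepage_t writepage, void *data)
    {
            pgoff_t index = 0;      /* simplified: whole mapping */
            pgoff_t end = -1;
            int tag;

            if (wbc->sync_mode == WB_SYNC_ALL) {
                    tag_pages_for_writeback(mapping, index, end);
                    tag = PAGECACHE_TAG_TOWRITE;
            } else {
                    tag = PAGECACHE_TAG_DIRTY;
            }

            /* ... pagevec_lookup_tag(..., tag, ...) loop calls
             * writepage() on each page found ... */
            return 0;
    }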