On Thu, Oct 30, 2008 at 07:50:20PM +1100, Dave Chinner wrote: > [*] Starting pdflush to sync data in the background when we are > about to start flushing ourselves is self-defeating. instead of > having a single thread doing optimal writeout patterns, we now > have two threads trying to sync the same filesystems and > competing with each other to write out dirty inodes. This > actually causes bugs in sync because pdflush is doing async > flushes. Hence if pdflush races and wins during the sync flush > part of the sync process, sync_inodes(1) will return before all > the data/metadata is on disk because it can't be found to be > waited on. Note that this is an XFS special. Every other filesystem (at least the major ones) rely on the VFS to write out data. > Now the sync is _supposedly_ complete. But we still have a dirty > log and superblock thanks to delayed allocation that may have > occurred after the sync_supers() call. Hence we can immediately > see that we cannot *ever* do a proper sync of an XFS filesystem > in Linux without modifying do_sync() to do more callouts. > > Worse, XFS can also still have *dirty inodes* because sync_inodes(1) > will remove inodes from the dirty list in the async pass, but they > can get dirtied a short time later (if they had dirty data) when the > data I/O completes. Hence if the second sync pass completes before > the inode is dirtied again we'll miss flushing it. This will mean we > don't write inode size updates during sync. This is the same race > that pdflush running in the background can trigger. This is a common problem that would hit any filesystem trying to have some sort of ordered data or journaled data mode. > However, I have a problem - I'm an expert in XFS, not the other tens > of Linux filesystems so I can't begin to guess what the impact of > changing do_sync() would be on those many filesystems. How many > filesystems would such a change break? Indeed - how many are broken > right now by having dirty inodes and superblocks slip through > sync(1)? I would guess more are broken now then a change in order would break. Then again purely a change in order would still leave this code fragile as hell. > What are the alternatives? do_sync() operates above any particular > filesystem, so it's hard to provide a filesystem specific ->do_sync > method to avoid changing sync order for all filesystems. Do we > change do_sync() to completely sync a superblock at a time instead > of doing each operation across all superblocks before moving onto > the next operation? Is there any particular reason (e.g. performance, locking) for the current > method that would prevent changing to completely-sync-a-superblock > iteration algorithm so we can provide a custom ->do_sync method? Locking can't be the reason as we should never hold locks while returning from one of the VFS operations. I think it's performance or rather alledged performance as I think it doesn't really matter. If it matters however there is an easy method to make it perform just as well with a proper callout - just spawn a thread for every filesystem to perform it in parallel. > Are there any other ways that we can get a custom ->do_sync > method for XFS? I'd prefer a custom method so we don't have to > revalidate every linux filesystem, especially as XFS already has > everything it needs to provide it's own sync method (used for > freezing) and a test suite to validate it is working correctly..... I think having a method for this would be useful. And given that a proper sync should be exactly the same as a filesysytem freeze we should maybe use one method for both of those and use the chance to give filesystem better control over the freeze process? -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html