On Tue, Feb 17, 2009 at 11:05:16AM +0100, Peter Zijlstra wrote: > On Tue, 2009-02-17 at 10:38 +0100, Nick Piggin wrote: > > > It is a great shame that filesystems are not properly notified > > that a page may become dirty before the actual set_page_dirty > > event (which is not allowed to fail and is called after the > > page is already dirty). > > Not quite true, for example the set_page_dirty() done by the write fault > code is done _before_ the page becomes dirty. > > This before/after thing was the reason for that horrid file corruption > bug that dragged on for a few weeks back in .19 (IIRC). Yeah, there are actually races though. The page can become cleaned before set_page_dirty is reached, and there are also nasty races with truncate. > > This is a big problem I have with fsblock simply in trying to > > make the memory allocation robust. page_mkwrite unfortunately > > is racy and I've fixed problems there... the big problem though > > is get_user_pages. Fixing that properly seems to require fixing > > callers so it is not really realistic in the short term. > > Right, I'm just not sure what we can do, even with a > prepage_page_dirty() function, what are you going to do, fail the fault? Oh, for regular page fault functions using page_mkwrite, they definitely want to fail the fault with a SIGBUS, and actually XFS already does that (for fsblock robust memory allocations you would also want to fail OOM on metadata allocation failure). What is the other option? Silently fail the write? For XFS purpose (ie. -ENOSPC handling), the current code is reasonable although there could be some truncate races with block allocation. But mostly probably works. For something like fsblock it can be much more common to have the metadata refcount reach 0 and freed before spd is called. In that case the code actually goes into a bug situation so it is a bit more critical. But no that's the "easy" part. The hard part is get_user_pages because the caller can hold onto the page indefinitely simply with a refcount, and go along happily dirtying it at any stage (actually writing to the page memory) before actually calling set_page_dirty. The "cleanest" way to fix this from VM point of view is probably to force gup callers to hold the page locked for the duration to prevent truncation or writeout after the filesystem notification. Don't know if that would be very popular, however. -- To unsubscribe from this list: send the line "unsubscribe reiserfs-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html