On Tue, 2009-02-17 at 11:24 +0100, Nick Piggin wrote: > On Tue, Feb 17, 2009 at 11:05:16AM +0100, Peter Zijlstra wrote: > > On Tue, 2009-02-17 at 10:38 +0100, Nick Piggin wrote: > > > > > It is a great shame that filesystems are not properly notified > > > that a page may become dirty before the actual set_page_dirty > > > event (which is not allowed to fail and is called after the > > > page is already dirty). > > > > Not quite true, for example the set_page_dirty() done by the write fault > > code is done _before_ the page becomes dirty. > > > > This before/after thing was the reason for that horrid file corruption > > bug that dragged on for a few weeks back in .19 (IIRC). > > Yeah, there are actually races though. The page can become cleaned > before set_page_dirty is reached, and there are also nasty races with > truncate. Hmm, so you're saying that never got properly fixed? > > > This is a big problem I have with fsblock simply in trying to > > > make the memory allocation robust. page_mkwrite unfortunately > > > is racy and I've fixed problems there... the big problem though > > > is get_user_pages. Fixing that properly seems to require fixing > > > callers so it is not really realistic in the short term. > > > > Right, I'm just not sure what we can do, even with a > > prepage_page_dirty() function, what are you going to do, fail the fault? > > Oh, for regular page fault functions using page_mkwrite, they > definitely want to fail the fault with a SIGBUS, and actually XFS > already does that (for fsblock robust memory allocations you > would also want to fail OOM on metadata allocation failure). What > is the other option? Silently fail the write? OK, agreed. > For XFS purpose (ie. -ENOSPC handling), the current code is reasonable > although there could be some truncate races with block allocation. But > mostly probably works. For something like fsblock it can be much more > common to have the metadata refcount reach 0 and freed before spd is > called. In that case the code actually goes into a bug situation so it > is a bit more critical. > > But no that's the "easy" part. The hard part is get_user_pages > because the caller can hold onto the page indefinitely simply with a > refcount, and go along happily dirtying it at any stage (actually > writing to the page memory) before actually calling set_page_dirty. Should a gup user not specify .write=1 if it wants to dirty the page, at which point the follow_page() will do the dirty-fault thingy. Ah, but then we can clean it because we're not holding the page-lock. I see. > The "cleanest" way to fix this from VM point of view is probably to > force gup callers to hold the page locked for the duration to > prevent truncation or writeout after the filesystem notification. > Don't know if that would be very popular, however. Right, so you'd want to keep the page locked over gup(.write=1) sections. So should we extend the gup() with put_user_page()? -- To unsubscribe from this list: send the line "unsubscribe reiserfs-devel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html