Re: [patch][rfc] mm: hold page lock over page_mkwrite

On Mon, Mar 02, 2009 at 10:26:21AM -0500, jim owens wrote:
> Nick Piggin wrote:
> >
> >So assuming there is no reasonable way to do out-of-core algorithms
> >on the filesystem metadata (and likely you don't want to anyway,
> >because it would mean a significant slowdown or divergence of code
> >paths), you still only need to reserve one set of those 30-40 pages
> >for the entire kernel.
> >
> >You only ever need to reserve enough memory for a *single* page
> >to be processed. In the worst case, where multiple pages are under
> >writeout but memory can't be allocated, only one will be allowed
> >access to the reserves and the others will block until it is
> >finished and can unpin them all.
> 
> Sure, nobody will mind seeing lots of extra pinned memory ;)

40 pages (160k) isn't a huge amount. You could always have a
boot option to disable the memory reserve if it is a big
deal.
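
Something like this, roughly (all names here are made up for
illustration, not taken from any posted patch; the point is that one
writer at a time dips into the preallocated pages, so the reserve
only has to cover the worst case for a single writeout):

#include <linux/mutex.h>
#include <linux/mm.h>

#define RESERVE_NR_PAGES 40	/* worst case for one writeout */

static DEFINE_MUTEX(reserve_lock);
static struct page *reserve_pages[RESERVE_NR_PAGES]; /* allocated at boot */

static void clean_page_from_reserve(struct page *page)
{
	/*
	 * Normal GFP allocations have failed.  Only one task at a
	 * time gets the reserve; everyone else sleeps here until the
	 * current user completes and the reserve is whole again.
	 */
	mutex_lock(&reserve_lock);

	/* ... do the writeout, using reserve_pages[] for metadata ... */

	mutex_unlock(&reserve_lock);
}

The "unpin" in the quoted text corresponds to the mutex_unlock():
once the reserve user finishes, the next blocked writer proceeds.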

 
> Don't forget to add the space for data transforms and raid
> driver operations in the write stack, and whatever else we
> may not have thought of.  With good engineering we can make

The block layer below the filesystem should be robust. Well
actually the core block layer is (except maybe for the new
bio integrity stuff that looks pretty nasty). Not sure about
md/dm, but they really should be safe (they use mempools etc).
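
For reference, the mempool pattern they use looks roughly like this
(the pool size and element size here are arbitrary, just a sketch):

#include <linux/init.h>
#include <linux/mempool.h>
#include <linux/gfp.h>

static mempool_t *ctx_pool;

static int __init my_driver_init(void)
{
	/* Preallocate 16 x 256-byte elements only this driver can use. */
	ctx_pool = mempool_create_kmalloc_pool(16, 256);
	return ctx_pool ? 0 : -ENOMEM;
}

static void *get_write_context(void)
{
	/*
	 * With GFP_NOIO this cannot return NULL: if kmalloc() is
	 * exhausted, mempool_alloc() sleeps until someone calls
	 * mempool_free(), so writeout keeps making progress as long
	 * as in-flight IO eventually completes.
	 */
	return mempool_alloc(ctx_pool, GFP_NOIO);
}

That blocking-until-an-element-returns behaviour is exactly the
forward-progress guarantee the write path needs.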


> it so "we can always make forward progress".  But it won't
> matter because once a real user drives the system off this
> cliff there is no difference between "hung" and "really slow
> progress".  They are going to crash it and report a hang.

I don't think that is the case. These situations would be
*really* rare and transient. It is not like thrashing, where the
working set exceeds physical RAM; it is just a combination of
conditions that causes an unusual spike in the memory required to
clean some dirty pages (e.g. Dave's example of several IOs
requiring btree splits across several allocation groups). That
could cause a resource deadlock.


> >Well I'm not saying it is an immediate problem or it would be a
> >good use of anybody's time to rush out and try to redesign their
> >fs code to fix it ;) But at least for any new core/generic library
> >functionality like fsblock, it would be silly not to close the hole
> >there (not least because the problem is simpler here than in a
> >complex fs).
> 
> Hey, I appreciate anything you do in VM to make the ugly
> dance with filesystems (my area) a little less ugly.

Well thanks.


> I'm sure you also appreciate that every time VM tries to
> save 32 bytes, someone else tries to take 32 K-bytes.
> As they say... memory is cheap :)

Well that's OK. If core vm/fs code saves a little memory and that
lets something else, like a filesystem, cache a tiny bit more
useful data, then I think that is a good result :)
