Re: [patch][rfc] mm: hold page lock over page_mkwrite

On Tue, Mar 03, 2009 at 05:25:36PM +0000, Jamie Lokier wrote:
> > > it so "we can always make forward progress".  But it won't
> > > matter because once a real user drives the system off this
> > > cliff there is no difference between "hung" and "really slow
> > > progress".  They are going to crash it and report a hang.
> > 
> > I don't think that is the case. These are situations that
> > would be *really* rare and transient. It is not like thrashing,
> > where your working set size exceeds physical RAM; it is just
> > a combination of conditions that causes an unusual spike in the
> > memory required to clean some dirty pages (eg. Dave's example
> > of several IOs requiring btree splits over several AGs), which
> > could cause a resource deadlock.
> 
> Suppose the system has two pages to be written.  The first must
> _reserve_ 40 pages of scratch space just in case the operation will
> need them.  If the second page write is initiated concurrently with
> the first, the second must reserve another 40 pages concurrently.
> 
> If 10 page writes are concurrent, that's 400 pages of scratch space
> needed in reserve...

Therein lies the problem: XFS can do this in every AG at the same
time, i.e. the reserve would have to be per AG. The maximum number
of AGs in XFS is 2^32, and I know of filesystems out there that
have thousands of AGs in them. Hence reserving 40 pages per AG is
definitely unreasonable. ;)

Even if we take concurrent allocations as the upper bound, I've
seen an 8p machine with several hundred allocation transactions
in progress at once. That is still unreasonable on machines with
64k pages - it amounts to hundreds of megabytes of RAM that will
mostly sit unused.
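
To put rough numbers on that scaling (a purely illustrative sketch -
the 40-page figure comes from Jamie's example above, and the AG and
transaction counts here are assumptions, not measurements):

	#include <stdio.h>

	int main(void)
	{
		const unsigned long pages_per_op = 40;

		/* per-AG reserve: say 4096 AGs, 4KiB pages */
		unsigned long per_ag = 4096UL * pages_per_op * 4096;

		/* several hundred concurrent transactions, 64KiB pages */
		unsigned long per_txn = 300UL * pages_per_op * 65536;

		printf("per-AG reserve:     %lu MiB\n", per_ag >> 20);
		printf("concurrent reserve: %lu MiB\n", per_txn >> 20);
		return 0;
	}

That prints 640 MiB and 750 MiB respectively - hundreds of megabytes
pinned for a worst case that almost never occurs.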

Specifying a pool of pages is not a guaranteed solution either:
someone will always exhaust it, because we can't guarantee that
any given transaction will complete before the pool runs dry.
i.e. the mempool design as it stands can't be used.
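
To sketch that pattern concretely (hypothetical code, not anything
that exists in XFS - scratch_pool and alloc_scratch() are made-up
names): mempool's guarantee is per-element, so a transaction that
needs N pages can sleep mid-allocation while holding part of the
pool, and two such transactions can starve each other:

	#include <linux/mempool.h>
	#include <linux/gfp.h>
	#include <linux/mm_types.h>

	/* e.g. created with mempool_create_page_pool(min_nr, 0) */
	static mempool_t *scratch_pool;

	static void alloc_scratch(struct page **pages, int nr)
	{
		int i;

		for (i = 0; i < nr; i++) {
			/*
			 * With a sleeping gfp mask, mempool_alloc()
			 * waits until an element is freed. If two
			 * transactions each hold half the pool here
			 * and both need more, neither ever frees and
			 * neither makes progress.
			 */
			pages[i] = mempool_alloc(scratch_pool, GFP_NOFS);
		}
	}

The mempool contract is "one element will eventually be allocatable",
not "N elements atomically or fail", and the latter is what a
transaction would need for a real forward-progress guarantee.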

AFAIC, "should never allocate during writeback" is a great goal, but
it is one that we will never be able to reach without throwing
everything away and starting again. Minimising allocation is
something we can do, but we can't avoid it entirely. The higher
layers need to understand this, not assert that the lower layers
must conform to an impossible constraint and break when they don't...

Cheers,

Dave.
-- 
Dave Chinner
david@xxxxxxxxxxxxx
