On Tue, Mar 03, 2009 at 05:25:36PM +0000, Jamie Lokier wrote:
> > > it so "we can always make forward progress". But it won't
> > > matter because once a real user drives the system off this
> > > cliff there is no difference between "hung" and "really slow
> > > progress". They are going to crash it and report a hang.
> >
> > I don't think that is the case. These are situations that
> > would be *really* rare and transient. It is not like thrashing
> > in that your working set size exceeds physical RAM, but just
> > a combination of conditions that causes an unusual spike in the
> > required memory to clean some dirty pages (eg. Dave's example
> > of several IOs requiring btree splits over several AGs). Could
> > cause a resource deadlock.
>
> Suppose the system has two pages to be written. The first must
> _reserve_ 40 pages of scratch space just in case the operation will
> need them. If the second page write is initiated concurrently with
> the first, the second must reserve another 40 pages concurrently.
>
> If 10 page writes are concurrent, that's 400 pages of scratch space
> needed in reserve...

Therein lies the problem. XFS can do this in parallel in every AG
at the same time, i.e. the reserve is per AG. The maximum number of
AGs in XFS is 2^32, and I know of filesystems out there that have
thousands of AGs in them. Hence reserving 40 pages per AG is
definitely unreasonable. ;)

Even if we look at concurrent allocations as the upper bound, I've
seen an 8p machine with several hundred concurrent allocation
transactions in progress. Even that is unreasonable if you consider
machines with 64k pages - it's hundreds of megabytes of RAM that
are mostly going to be unused (rough numbers are sketched below).

Specifying a pool of pages is not a guaranteed solution, either:
someone will always exhaust it, because we can't guarantee that any
given transaction will complete before the pool runs dry. i.e. the
mempool design as it stands can't be used.

AFAIC, "should never allocate during writeback" is a great goal,
but it is one that we will never be able to reach without throwing
everything away and starting again. Minimising allocation is
something we can do, but we can't avoid it entirely. The higher
layers need to understand this, not assert that the lower layers
must conform to an impossible constraint and break if they
don't.....

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx
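
P.S. For scale, here is the back-of-the-envelope arithmetic behind the
"hundreds of megabytes" figure above. This is just a throwaway
userspace calculation, not kernel code; the 40-page reserve is Jamie's
number, and the AG count (a few thousand) and concurrent transaction
count (several hundred) are the ballpark figures quoted above, not
measurements.

/*
 * Throwaway userspace arithmetic: what a 40-pages-per-reservation
 * scheme costs on a machine with 64k pages.
 *
 * Assumptions (not measurements): 40 pages per reserve (Jamie's
 * example), ~4000 AGs for the "thousands of AGs" case, ~300
 * concurrent allocation transactions for the 8p machine case.
 */
#include <stdio.h>

int main(void)
{
	const unsigned long long pages_per_reserve = 40;
	const unsigned long long page_size = 64 * 1024;	/* 64k pages */

	unsigned long long ags = 4000;		/* "thousands of AGs" */
	unsigned long long trans = 300;		/* "several hundred" txns */

	unsigned long long per_ag_bytes =
			ags * pages_per_reserve * page_size;
	unsigned long long per_trans_bytes =
			trans * pages_per_reserve * page_size;

	printf("per-AG reserve:          %llu MB\n", per_ag_bytes >> 20);
	printf("per-transaction reserve: %llu MB\n", per_trans_bytes >> 20);
	return 0;
}

That prints roughly 10000 MB for a per-AG reserve and 750 MB even for
the more modest per-transaction bound - which is where "hundreds of
megabytes of mostly unused RAM" comes from.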