On Mon, 23 Feb 2015 11:45:21 +1100 Dave Chinner <david@xxxxxxxxxxxxx> wrote: > > > I really don't care about the OOM Killer corner cases - it's > > > completely the wrong way line of development to be spending time on > > > and you aren't going to convince me otherwise. The OOM killer a > > > crutch used to justify having a memory allocation subsystem that > > > can't provide forward progress guarantee mechanisms to callers that > > > need it. > > > > We can provide this. Are all these callers able to preallocate? > > Anything that allocates in transaction context (and therefor is > GFP_NOFS by definition) can preallocate at transaction reservation > time. However, preallocation is dumb, complex, CPU and memory > intensive and will have a *massive* impact on performance. > Allocating 10-100 pages to a reserve which we will almost *never > use* and then free them again *on every single transaction* is a lot > of unnecessary additional fast path overhead. Hence a "preallocate > for every context" reserve pool is not a viable solution. Yup. > Reservations are simply an *accounting* of the maximum amount of a > reserve required by an operation to guarantee forwards progress. In > filesystems, we do this for log space (transactions) and some do it > for filesystem space (e.g. delayed allocation needs correct ENOSPC > detection so we don't overcommit disk space). The VM already has > such concepts (e.g. watermarks and things like min_free_kbytes) that > it uses to ensure that there are sufficient reserves for certain > types of allocations to succeed. Yes, as we do for __GFP_HIGH and PF_MEMALLOC etc. Add a dynamic reserve. So to reserve N pages we increase the page allocator dynamic reserve by N, do some reclaim if necessary then deposit N tokens into the caller's task_struct (it'll be a set of zone/nr-pages tuples I suppose). When allocating pages the caller should drain its reserves in preference to dipping into the regular freelist. This guy has already done his reclaim and shouldn't be penalised a second time. I guess Johannes's preallocation code should switch to doing this for the same reason, plus the fact that snipping a page off task_struct.prealloc_pages is super-fast and needs to be done sometime anyway so why not do it by default. Both reservation and preallocation are vulnerable to deadlocks - 10,000 tasks all trying to reserve/prealloc 100 pages, they all have 50 pages and we ran out of memory. Whoops. We can undeadlock by returning ENOMEM but I suspect there will still be problematic situations where massive numbers of pages are temporarily AWOL. Perhaps some form of queuing and throttling will be needed, to limit the peak number of reserved pages. Per zone, I guess. And it'll be a huge pain handling order>0 pages. I'd be inclined to make it order-0 only, and tell the lamer callers that vmap-is-thattaway. Alas, one lame caller is slub. But the biggest issue is how the heck does a caller work out how many pages to reserve/prealloc? Even a single sb_bread() - it's sitting on loop on a sparse NTFS file on loop on a five-deep DM stack on a six-deep MD stack on loop on NFS on an eleventy-deep networking stack. And then there will be an unknown number of slab allocations of unknown size with unknown slabs-per-page rules - how many pages needed for them? And to make it much worse, how many pages of which orders? Bless its heart, slub will go and use a 1-order page for allocations which should have been in 0-order pages.. _______________________________________________ xfs mailing list xfs@xxxxxxxxxxx http://oss.sgi.com/mailman/listinfo/xfs