On Thu, 2017-03-09 at 13:05 -0500, Johannes Weiner wrote:
> On Tue, Mar 07, 2017 at 02:52:36PM -0500, Rik van Riel wrote:
> >
> > It only does this to some extent.  If reclaim made
> > no progress, for example due to immediately bailing
> > out because the number of already isolated pages is
> > too high (due to many parallel reclaimers), the code
> > could hit the "no_progress_loops > MAX_RECLAIM_RETRIES"
> > test without ever looking at the number of reclaimable
> > pages.
>
> Hm, there is no early return there, actually. We bump the loop counter
> every time it happens, but then *do* look at the reclaimable pages.

Am I looking at an old tree?  I see this code
before we look at the reclaimable pages.

        /*
         * Make sure we converge to OOM if we cannot make any progress
         * several times in the row.
         */
        if (*no_progress_loops > MAX_RECLAIM_RETRIES) {
                /* Before OOM, exhaust highatomic_reserve */
                return unreserve_highatomic_pageblock(ac, true);
        }

> > Could that create problems if we have many concurrent
> > reclaimers?
>
> With increased concurrency, the likelihood of OOM will go up if we
> remove the unlimited wait for isolated pages, that much is true.
>
> I'm not sure that's a bad thing, however, because we want the OOM
> killer to be predictable and timely. So a reasonable wait time in
> between 0 and forever before an allocating thread gives up under
> extreme concurrency makes sense to me.

That is a fair point, a faster OOM kill is preferable
to a system that is livelocked.

> Unless I'm mistaken, there doesn't seem to be a whole lot of urgency
> behind this patch. Can we think about a general model to deal with
> allocation concurrency? Unlimited parallel direct reclaim is kinda
> bonkers in the first place. How about checking for excessive isolation
> counts from the page allocator and putting allocations on a waitqueue?

The (limited) number of reclaimers can still do a
relatively fast OOM kill, if none of them manage
to make progress.

That should avoid the potential issue you and I
both pointed out, and, as a bonus, it might actually
be faster than letting all the tasks in the system
into the direct reclaim code simultaneously.

-- 
All rights reversed
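
To make the waitqueue idea discussed above a bit more concrete, here is a rough user-space sketch, not kernel code. It is a single-threaded model that only tracks counts: at most MAX_RECLAIMERS tasks are admitted to "direct reclaim" at once, the rest are parked, and each task that finishes hands its slot to one waiter. The names reclaim_gate, gate_enter, gate_exit and the limit of 4 are all made up for illustration; nothing here corresponds to an existing kernel interface.

```c
/* Illustrative model only; every name below is hypothetical. */
#include <assert.h>
#include <stdbool.h>

#define MAX_RECLAIMERS 4	/* assumed cap on concurrent direct reclaimers */

struct reclaim_gate {
	int active;	/* tasks currently inside direct reclaim */
	int waiting;	/* tasks parked on the simulated waitqueue */
	int peak;	/* highest value 'active' has reached */
};

/*
 * Called by an allocating task before entering direct reclaim.
 * Returns true if the task may reclaim now, false if it was parked.
 */
static bool gate_enter(struct reclaim_gate *g)
{
	if (g->active >= MAX_RECLAIMERS) {
		g->waiting++;
		return false;
	}
	g->active++;
	if (g->active > g->peak)
		g->peak = g->active;
	return true;
}

/*
 * Called when a reclaimer is done. If anyone is parked, one waiter
 * takes over the freed slot. Returns true if a waiter was woken.
 */
static bool gate_exit(struct reclaim_gate *g)
{
	assert(g->active > 0);
	g->active--;
	if (g->waiting > 0) {
		g->waiting--;
		g->active++;	/* the woken task takes the freed slot */
		return true;
	}
	return false;
}
```

With ten tasks arriving at once, four would be admitted and six parked; the number of tasks in reclaim never exceeds the cap, so the admitted ones can still converge on an OOM kill quickly if none of them makes progress.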