On 7/31/19 5:06 AM, Vlastimil Babka wrote:
> On 7/24/19 7:50 PM, Mike Kravetz wrote:
>> For PAGE_ALLOC_COSTLY_ORDER allocations, MIN_COMPACT_COSTLY_PRIORITY is
>> the minimum (highest priority).  Other places in the compaction code key
>> off of MIN_COMPACT_PRIORITY.  Costly order allocations will never get to
>> MIN_COMPACT_PRIORITY.  Therefore, some conditions will never be met for
>> costly order allocations.
>>
>> This was observed when hugetlb allocations could stall for minutes or
>> hours when should_compact_retry() would return true more often than it
>> should.  Specifically, this was in the case where compact_result was
>> COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being
>> made.
>
> Hmm, the point of MIN_COMPACT_COSTLY_PRIORITY was that costly
> allocations will not reach the priority where compaction becomes too
> expensive.  With your patch, they still don't reach that priority value,
> but are allowed to be thorough anyway, even sooner.  That just seems like
> a wrong way to fix the problem.

Thanks Vlastimil, here is why I took the approach I did.

I instrumented some of the long stalls.  Here is one common example:
should_compact_retry returned true 5000000 consecutive times.  However,
the variable compaction_retries remained zero.  We never get to the code
that increments the compaction_retries count because
compaction_made_progress is false and compaction_withdrawn is true.

As suggested earlier, I noted why compaction_withdrawn is true.  Of the
5000000 calls,
	4921875 were COMPACT_DEFERRED
	  78125 were COMPACT_PARTIAL_SKIPPED
Note that 5000000/64 (1 << COMPACT_MAX_DEFER_SHIFT) == 78125

I then started looking into why COMPACT_DEFERRED and
COMPACT_PARTIAL_SKIPPED were being set/returned so often.

COMPACT_DEFERRED is set/returned in try_to_compact_pages.  Specifically,

		if (prio > MIN_COMPACT_PRIORITY
					&& compaction_deferred(zone, order)) {
			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
			continue;
		}

COMPACT_PARTIAL_SKIPPED is set/returned in __compact_finished.  Specifically,

	if (compact_scanners_met(cc)) {
		/* Let the next compaction start anew. */
		reset_cached_positions(cc->zone);

		/* ... */
		if (cc->direct_compaction)
			cc->zone->compact_blockskip_flush = true;

		if (cc->whole_zone)
			return COMPACT_COMPLETE;
		else
			return COMPACT_PARTIAL_SKIPPED;
	}

In both cases, compact_priority being MIN_COMPACT_COSTLY_PRIORITY and not
being able to go to MIN_COMPACT_PRIORITY caused the 'compaction_withdrawn'
result to be set/returned.

I do not know the subtleties of the compaction code, but it seems like
retrying in this manner does not make sense.

> If should_compact_retry() returns
> misleading results for costly allocations, then that should be fixed
> instead?
>
> Alternatively, you might want to say that hugetlb allocations are not
> like other random costly allocations, because the admin setting
> nr_hugepages is prepared to take the cost (I thought that was indicated
> by the __GFP_RETRY_MAYFAIL flag, but seeing all the other users of it,
> I'm not sure anymore).

The example above resulted in a stall of a little over 5 minutes.
However, I have seen them last for hours.  Sure, the caller (admin for
hugetlbfs) knows there may be high costs.  But, I think minutes/hours to
try to allocate a single huge page is too much.  We should fail sooner
than that.

> In that case should_compact_retry() could take
> __GFP_RETRY_MAYFAIL into account and allow MIN_COMPACT_PRIORITY even for
> costly allocations.

I'll put something like this together to test.

--
Mike Kravetz
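
A minimal sketch of that direction, for illustration only: the check_priority
step in should_compact_retry() would stop capping __GFP_RETRY_MAYFAIL requests
at MIN_COMPACT_COSTLY_PRIORITY.  This is untested and assumes gfp_mask is
plumbed into should_compact_retry() (it is not a parameter today); the
surrounding variables are the ones already used in that function.

	/*
	 * Untested sketch.  Costly allocations are normally capped at
	 * MIN_COMPACT_COSTLY_PRIORITY so compaction does not become too
	 * expensive.  A caller that passed __GFP_RETRY_MAYFAIL has said it
	 * is willing to pay that cost, so let it escalate through the full
	 * priority range instead of looping forever on
	 * COMPACT_DEFERRED/COMPACT_PARTIAL_SKIPPED results.
	 */
	if (order > PAGE_ALLOC_COSTLY_ORDER &&
			!(gfp_mask & __GFP_RETRY_MAYFAIL))
		min_priority = MIN_COMPACT_COSTLY_PRIORITY;
	else
		min_priority = MIN_COMPACT_PRIORITY;

	if (*compact_priority > min_priority) {
		/* Raise priority and restart the retry count at that level. */
		(*compact_priority)--;
		*compaction_retries = 0;
		ret = true;
	}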