On Mon, Jan 24, 2011 at 7:58 AM, David Rientjes <rientjes@xxxxxxxxxx> wrote:
> 0e093d99763e (writeback: do not sleep on the congestion queue if there
> are no congested BDIs or if significant congestion is not being
> encountered in the current zone) uncovered a livelock in the page
> allocator that resulted in tasks infinitely looping trying to find
> memory and kswapd running at 100% cpu.
>
> The issue occurs because drain_all_pages() is called immediately
> following direct reclaim when no memory is freed and
> try_to_free_pages() returns non-zero because not all zones in the
> zonelist have their all_unreclaimable flag set.
>
> When draining the per-cpu pagesets back to the buddy allocator for
> each zone, the zone->pages_scanned counter is cleared to avoid
> erroneously setting zone->all_unreclaimable later.  The problem is
> that no pages may actually be drained and, thus, the unreclaimable
> logic never fails direct reclaim so that the oom killer may be
> invoked.
>
> This apparently only manifested after wait_iff_congested() was
> introduced and the zone was full of anonymous memory that would not
> congest the backing store.  The page allocator would infinitely loop
> if there were no other tasks waiting to be scheduled, and would clear
> zone->pages_scanned because of drain_all_pages() as the result of this
> change before kswapd could scan enough pages to trigger the reclaim
> logic.  Additionally, with every loop of the page allocator and in the
> reclaim path, kswapd would be kicked and would end up running at 100%
> cpu.  In this scenario, current and kswapd are both running
> continuously, with kswapd incrementing zone->pages_scanned and current
> clearing it.
>
> The problem is even more pronounced when current swaps some of its
> memory to swap cache and the reclaimable logic then considers all
> active anonymous memory in the all_unreclaimable logic, which requires
> a much higher zone->pages_scanned value for try_to_free_pages() to
> return zero; that value is never attainable in this scenario.
>
> Before wait_iff_congested(), the page allocator would incur an
> unconditional timeout and allow kswapd to elevate zone->pages_scanned
> to a level at which the oom killer would be called the next time it
> loops.
>
> The fix is to only attempt to drain pcp pages if there is actually a
> quantity to be drained.  The unconditional clearing of
> zone->pages_scanned in free_pcppages_bulk() need not be changed since
> other callers already ensure that draining will occur.  This patch
> ensures that free_pcppages_bulk() will actually free memory before
> drain_all_pages() calls into it, so zone->pages_scanned is only
> cleared when appropriate.
>
> Signed-off-by: David Rientjes <rientjes@xxxxxxxxxx>

Good catch! Too late, but

Reviewed-by: Minchan Kim <minchan.kim@xxxxxxxxx>

-- 
Kind regards,
Minchan Kim