Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update

Vlastimil Babka <vbabka@xxxxxxx> · Thu, 18 May 2017 12:03:50 +0200

On 05/17/2017 04:48 PM, Christoph Lameter wrote:
> On Wed, 17 May 2017, Michal Hocko wrote:
> 
>>>> So how are you going to distinguish VM_FAULT_OOM from an empty mempolicy
>>>> case in a raceless way?
>>>
>>> You dont have to do that if you do not create an empty mempolicy in the
>>> first place. The current kernel code avoids that by first allowing access
>>> to the new set of nodes and removing the old ones from the set when done.
>>
>> which is racy and as Vlastimil pointed out. If we simply fail such an
>> allocation the failure will go up the call chain until we hit the OOM
>> killer due to VM_FAULT_OOM. How would you want to handle that?
> 
> The race is where? If you expand the node set during the move of the
> application then you are safe in terms of the legacy apps that did not
> include static bindings.

No, that expand/shrink by itself doesn't work against parallel
get_page_from_freelist going through a zonelist. Moving from node 0 to
1, with zonelist containing nodes 1 and 0 in that order:

- mempolicy mask is 0
- zonelist iteration checks node 1, it's not allowed, skip
- mempolicy mask is 0,1 (expand)
- mempolicy mask is 1 (shrink)
- zonelist iteration checks node 0, it's not allowed, skip
- OOM

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@xxxxxxxxx.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@xxxxxxxxx";> email@xxxxxxxxx </a>