On 12.08.20 11:46, Charan Teja Kalla wrote:
> 
> Thanks David for the inputs.
> 
> On 8/12/2020 2:35 AM, David Hildenbrand wrote:
>> On 11.08.20 14:58, Charan Teja Reddy wrote:
>>> The following race is observed with repeated online, offline and a
>>> delay between two successive onlines of memory blocks of the movable
>>> zone.
>>>
>>> P1                                    P2
>>>
>>> Online the first memory block in
>>> the movable zone. The pcp struct
>>> values are initialized to default
>>> values, i.e., pcp->high = 0 &
>>> pcp->batch = 1.
>>>
>>>                                       Allocate the pages from the
>>>                                       movable zone.
>>>
>>> Try to online the second memory
>>> block in the movable zone, thus
>>> entering online_pages(), but
>>> zone_pcp_update() is not yet
>>> called.
>>>                                       This process enters the exit
>>>                                       path and tries to release the
>>>                                       order-0 pages to the pcp lists
>>>                                       through free_unref_page_commit().
>>>                                       As pcp->high = 0 and
>>>                                       pcp->count = 1, it proceeds to
>>>                                       call free_pcppages_bulk().
>>> Update the pcp values, so the new
>>> pcp values are, say,
>>> pcp->high = 378, pcp->batch = 63.
>>>                                       Read the pcp's batch value using
>>>                                       READ_ONCE() and pass the same to
>>>                                       free_pcppages_bulk(); the pcp
>>>                                       values passed here are
>>>                                       batch = 63, count = 1.
>>>
>>>                                       Since the number of pages in
>>>                                       the pcp lists is less than
>>>                                       ->batch, it gets stuck in the
>>>                                       while (list_empty(list)) loop
>>>                                       with interrupts disabled, thus
>>>                                       hanging the core.
>>>
>>> Avoid this by ensuring free_pcppages_bulk() is called with a proper
>>> count of pcp list pages.
>>>
>>> The mentioned race is somewhat easily reproducible without [1] because
>>> the pcp's are not updated for the first memory block online, and thus
>>> there is enough of a race window for P2 between alloc+free and the pcp
>>> struct values update through onlining of the second memory block.
>>>
>>> With [1], the race still exists but the window is much narrower, as we
>>> update the pcp struct values for the first memory block online itself.
>>>
>>> [1]: https://patchwork.kernel.org/patch/11696389/
>>>
>>
>> IIUC, this is not limited to the movable zone, it could also happen in
>> corner cases with the normal zone (e.g., hotplug to a node that only has
>> DMA memory, or no other memory yet).
> 
> Yes, this is my understanding too. I explained the above race in terms
> of just the movable zone, for which it is observed. We can add the below
> line at the end of the patch commit message:
> "This is not limited to the movable zone, it could also happen in cases
> with the normal zone (e.g., hotplug to a node that only has DMA memory,
> or no other memory yet)."

Yeah, that makes sense!

> 
> Just curious, do there exist systems where just a DMA zone is present
> and we hot add the normal zone? I am not aware of such a thing in the
> embedded world.

You can easily create such setups using QEMU. IIRC, just specify a QEMU
guest with 2G initial memory and a single NUMA node, or 4G initial
memory and two NUMA nodes. Then hotplug memory.
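Something along these lines should do (from memory, so treat the exact
flags and ids as an example only):

    # start the guest with 2G initial memory and room for hotplug
    qemu-system-x86_64 -m 2G,slots=2,maxmem=4G ...

    # then, in the QEMU monitor, plug a DIMM
    (qemu) object_add memory-backend-ram,id=mem1,size=1G
    (qemu) device_add pc-dimm,id=dimm1,memdev=mem1

Assuming the new blocks are not auto-onlined, you can then online them
from inside the guest as movable (memoryN is whichever new block shows
up under /sys/devices/system/memory/):

    echo online_movable > /sys/devices/system/memory/memoryN/state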
(IIRC kata containers always start a VM with 2G and then hotplug memory)

>>
>>> Signed-off-by: Charan Teja Reddy <charante@xxxxxxxxxxxxxx>
>>> ---
>>>
>>> v1: https://patchwork.kernel.org/patch/11707637/
>>>
>>>  mm/page_alloc.c | 5 +++++
>>>  1 file changed, 5 insertions(+)
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index e4896e6..839039f 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>>>  	struct page *page, *tmp;
>>>  	LIST_HEAD(head);
>>>  
>>> +	/*
>>> +	 * Ensure a proper count is passed; otherwise we would get
>>> +	 * stuck in the while (list_empty(list)) loop below.
>>> +	 */
>>> +	count = min(pcp->count, count);
>>>  	while (count) {
>>>  		struct list_head *list;
>>>  
>>>
>>
>> Fixes: and Cc: stable... tags?
> 
> Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into
> one-list-per-migrate-type")
> Cc: <stable@xxxxxxxxxxxxxxx> [2.6+]

Did we have memory hotplug support then already?

> 
> I am not sure if I should raise a V3 including these?

Maybe Andrew can fix that up when applying.

-- 
Thanks,

David / dhildenb