On 12.08.20 12:11, Charan Teja Kalla wrote: > > > On 8/12/2020 3:30 PM, David Hildenbrand wrote: >> On 12.08.20 11:46, Charan Teja Kalla wrote: >>> >>> Thanks David for the inputs. >>> >>> On 8/12/2020 2:35 AM, David Hildenbrand wrote: >>>> On 11.08.20 14:58, Charan Teja Reddy wrote: >>>>> The following race is observed with the repeated online, offline and a >>>>> delay between two successive online of memory blocks of movable zone. >>>>> >>>>> P1 P2 >>>>> >>>>> Online the first memory block in >>>>> the movable zone. The pcp struct >>>>> values are initialized to default >>>>> values,i.e., pcp->high = 0 & >>>>> pcp->batch = 1. >>>>> >>>>> Allocate the pages from the >>>>> movable zone. >>>>> >>>>> Try to Online the second memory >>>>> block in the movable zone thus it >>>>> entered the online_pages() but yet >>>>> to call zone_pcp_update(). >>>>> This process is entered into >>>>> the exit path thus it tries >>>>> to release the order-0 pages >>>>> to pcp lists through >>>>> free_unref_page_commit(). >>>>> As pcp->high = 0, pcp->count = 1 >>>>> proceed to call the function >>>>> free_pcppages_bulk(). >>>>> Update the pcp values thus the >>>>> new pcp values are like, say, >>>>> pcp->high = 378, pcp->batch = 63. >>>>> Read the pcp's batch value using >>>>> READ_ONCE() and pass the same to >>>>> free_pcppages_bulk(), pcp values >>>>> passed here are, batch = 63, >>>>> count = 1. >>>>> >>>>> Since num of pages in the pcp >>>>> lists are less than ->batch, >>>>> then it will stuck in >>>>> while(list_empty(list)) loop >>>>> with interrupts disabled thus >>>>> a core hung. >>>>> >>>>> Avoid this by ensuring free_pcppages_bulk() is called with proper count >>>>> of pcp list pages. >>>>> >>>>> The mentioned race is some what easily reproducible without [1] because >>>>> pcp's are not updated for the first memory block online and thus there >>>>> is a enough race window for P2 between alloc+free and pcp struct values >>>>> update through onlining of second memory block. >>>>> >>>>> With [1], the race is still exists but it is very much narrow as we >>>>> update the pcp struct values for the first memory block online itself. >>>>> >>>>> [1]: https://patchwork.kernel.org/patch/11696389/ >>>>> >>>> >>>> IIUC, this is not limited to the movable zone, it could also happen in >>>> corner cases with the normal zone (e.g., hotplug to a node that only has >>>> DMA memory, or no other memory yet). >>> >>> Yes, this is my understanding too. I explained the above race in terms >>> of just movable zone for which it is observed. We can add the below line >>> in the end in patch commit message: >>> "This is not limited to the movable zone, it could also happen in cases >>> with the normal zone (e.g., hotplug to a node that only has DMA memory, >>> or no other memory yet)." >> >> Yeah, that makes sense! >> >>> >>> Just curious, there exists such systems where just a dma zone present >>> and we hot add the normal zone? I am not aware such thing in the >>> embedded world. >> >> You can easily create such setups using QEMU. >> >> IIRC, just specify a QEMU guest with 2G initial memory and a single NUMA >> node, or 4G initial memory and two NUMA nodes. Then hotplug memory. >> >> (IIRC kata containers always start a VM with 2G and then hotplug memory) >> > I see. Thanks for letting me know this. > >>>> >>>>> Signed-off-by: Charan Teja Reddy <charante@xxxxxxxxxxxxxx> >>>>> --- >>>>> >>>>> v1: https://patchwork.kernel.org/patch/11707637/ >>>>> >>>>> mm/page_alloc.c | 5 +++++ >>>>> 1 file changed, 5 insertions(+) >>>>> >>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>>> index e4896e6..839039f 100644 >>>>> --- a/mm/page_alloc.c >>>>> +++ b/mm/page_alloc.c >>>>> @@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, int count, >>>>> struct page *page, *tmp; >>>>> LIST_HEAD(head); >>>>> >>>>> + /* >>>>> + * Ensure proper count is passed which otherwise would stuck in the >>>>> + * below while (list_empty(list)) loop. >>>>> + */ >>>>> + count = min(pcp->count, count); >>>>> while (count) { >>>>> struct list_head *list; >>>>> >>>>> >>>> >>>> Fixes: and Cc: stable... tags? >>> >>> Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into >>> one-list-per-migrate-type") >>> Cc: <stable@xxxxxxxxxxxxxxx> [2.6+] >> >> Did we have memory hotplug support then already? > > Yes, it exist. > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/mm/memory_hotplug.c?h=v2.6.39 Okay, so I guess these tags make sense. > >> >>> >>> I am not sure If I should have to raise V3 including these? >> >> >> Maybe Andrew can fixup when applying. > > Okay, let Andrew decide on this. Meanwhile If you find that this patch > looks correct, ACK from you helps here. Sure, I think this is good enough as a simple fix. Acked-by: David Hildenbrand <david@xxxxxxxxxx> -- Thanks, David / dhildenb