On 22/07/2024 17:28, Vlastimil Babka (SUSE) wrote: > On 7/22/24 11:15 AM, Zhijian Li (Fujitsu) wrote: >> Hi David >> >> Thanks for you quickly reply. >> >> >> On 22/07/2024 14:44, Vlastimil Babka (SUSE) wrote: >>> On 7/22/24 4:10 AM, Li Zhijian wrote: >>>> It's expected that no page should be left in pcp_list after calling >>>> zone_pcp_disable() in offline_pages(). Previously, it's observed that >>>> offline_pages() gets stuck [1] due to some pages remaining in pcp_list. >>>> >>>> Cause: >>>> There is a race condition between drain_pages_zone() and __rmqueue_pcplist() >>>> involving the pcp->count variable. See below scenario: >>>> >>>> CPU0 CPU1 >>>> ---------------- --------------- >>>> spin_lock(&pcp->lock); >>>> __rmqueue_pcplist() { >>>> zone_pcp_disable() { >>>> /* list is empty */ >>>> if (list_empty(list)) { >>>> /* add pages to pcp_list */ >>>> alloced = rmqueue_bulk() >>>> mutex_lock(&pcp_batch_high_lock) >>>> ... >>>> __drain_all_pages() { >>>> drain_pages_zone() { >>>> /* read pcp->count, it's 0 here */ >>>> count = READ_ONCE(pcp->count) >>>> /* 0 means nothing to drain */ >>>> /* update pcp->count */ >>>> pcp->count += alloced << order; >>>> ... >>>> ... >>>> spin_unlock(&pcp->lock); >>>> >>>> In this case, after calling zone_pcp_disable() though, there are still some >>>> pages in pcp_list. And these pages in pcp_list are neither movable nor >>>> isolated, offline_pages() gets stuck as a result. >>>> >>>> Solution: >>>> Expand the scope of the pcp->lock to also protect pcp->count in >>>> drain_pages_zone(), to ensure no pages are left in the pcp list after >>>> zone_pcp_disable() >>>> >>>> [1] https://lore.kernel.org/linux-mm/6a07125f-e720-404c-b2f9-e55f3f166e85@xxxxxxxxxxx/ >>>> >>>> Cc: David Hildenbrand <david@xxxxxxxxxx> >>>> Cc: Vlastimil Babka (SUSE) <vbabka@xxxxxxxxxx> >>>> Reported-by: Yao Xingtao <yaoxt.fnst@xxxxxxxxxxx> >>>> Signed-off-by: Li Zhijian <lizhijian@xxxxxxxxxxx> >>> >>> Can we find a breaking commit for Fixes: ? >> >> I haven't confirmed the FBC because my reproducer is not fit to run in the old kernel for some reasons. >> but I noticed it didn't read the count without lock held since below commit >> >> 4b23a68f9536 mm/page_alloc: protect PCP lists with a spinlock >> >> >> >>> >>>> --- >>>> V2: >>>> - Narrow down the scope of the spin_lock() to limit the draining latency. # Vlastimil and David >>>> - In above scenario, it's sufficient to read pcp->count once with lock held, and it fully fixed >>>> my issue[1] in thounds runs(It happened in more than 5% before). >>> >>> That should be ok indeed, but... >>> >>>> RFC: >>>> https://lore.kernel.org/linux-mm/20240716073929.843277-1-lizhijian@xxxxxxxxxxx/ >>>> --- >>>> mm/page_alloc.c | 5 ++++- >>>> 1 file changed, 4 insertions(+), 1 deletion(-) >>>> >>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>> index 9ecf99190ea2..5388a35c4e9c 100644 >>>> --- a/mm/page_alloc.c >>>> +++ b/mm/page_alloc.c >>>> @@ -2323,8 +2323,11 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp) >>>> static void drain_pages_zone(unsigned int cpu, struct zone *zone) >>>> { >>>> struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); >>>> - int count = READ_ONCE(pcp->count); >>>> + int count; >>>> >>>> + spin_lock(&pcp->lock); >>>> + count = pcp->count; >>>> + spin_unlock(&pcp->lock); >>>> while (count) { >>>> int to_drain = min(count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX); >>>> count -= to_drain; >>> >>> It's wasteful to do a lock/unlock cycle just to read the count. >> >> How about, >> >> static void drain_pages_zone(unsigned int cpu, struct zone *zone) >> { >> struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu); >> int count, to_drain; >> >> do { >> spin_lock(&pcp->lock); >> to_drain = min(pcp->count, pcp->batch << CONFIG_PCP_BATCH_SCALE_MAX); >> free_pcppages_bulk(zone, to_drain, pcp, 0); >> spin_unlock(&pcp->lock); >> } while (to_drain); > > Yeah better than break. But I think you still should use Okay, I will update it in V3 Thanks Zhijian > count = pcp->count; > ... > count -= to_drain; > } while(count); > > or you make one extra wasteful iteration to find to_drain is zero. > (assuming "it's sufficient to read pcp->count once with lock held", that I > agree with)> >> } >>> It could >>> rather look something like this: >>> >> >> Sorry, I don't follow your code... >> >>> while (true) >>> spin_lock(&pcp->lock); >>> count = pcp->count; >>> ... >>> count -= to_drain; >>> if (to_drain) >>> drain_zone_pages(...) >> >> Which subroutine does this code belong to, why it involves drain_zone_pages > > Yeah sorry I meant free_pcppages_bulk() > >>> ... >>> spin_unlock(&pcp->lock); >>> if (count) >>> break; >> >> Thanks >> Zhijian >