David && ALL,

Some progress updates.

On 04/07/2024 21:07, Zhijian Li (Fujitsu) wrote:
>
>
>> -----Original Message-----
>> From: David Hildenbrand <david@xxxxxxxxxx>
>> Sent: Thursday, July 4, 2024 4:15 PM
>
>
>>
>> On 04.07.24 09:43, Zhijian Li (Fujitsu) wrote:
>>> All,
>>>
>>> Some progress updates
>>>
>>> When the issue occurs, calling __drain_all_pages() can make
>>> offline_pages() escape from the loop.
>>>
>>>> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
>>>> Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
>>>> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
>>>> Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
>>>> Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
>>>>
>>>
>>> With this problematic page structure's contents, it seems that the
>>> list_head = {ffffdfbd9e603788, ffffd4f0ffd97ef0} is valid.
>>>
>>> I guess it was linked into the pcp_list, so I dumped
>>> per_cpu_pages[cpu].count at every critical timing.
>>
>> So, is your reproducer getting fixed when you call __drain_all_pages()
>> in the loop? (not that it's the right fix, but it would be a good data
>> point :) )
>
> Yeah, it works for my reproducer.
>
>
>
>>
>>>
>>> An example is as below,
>>> offline_pages()
>>> {
>>>     // per_cpu_pages[1].count = 0
>>>     zone_pcp_disable() // will call __drain_all_pages()
>>>     // per_cpu_pages[1].count = 188
>>>     do {
>>>         do {
>>>             scan_movable_pages()
>>>             ret = do_migrate_range()
>>>         } while (!ret)
>>>
>>>         ret = test_pages_isolated()
>>>
>>>         if (is the 1st iteration)
>>>             // per_cpu_pages[1].count = 182
>>>
>>>         if (issue occurs) { /* if the loop takes beyond 10 seconds */
>>>             // per_cpu_pages[1].count = 61
>>>             __drain_all_pages()
>>>             // per_cpu_pages[1].count = 0
>>>             /* will escape from the outer loop in later iterations */
>>>         }
>>>     } while (ret)
>>> }
>>>
>>> Some interesting points:
>>> - After the 1st __drain_all_pages(), per_cpu_pages[1].count increased
>>>   from 0 to 188; does that mean it's racing with something...?
>>> - per_cpu_pages[1].count decreases during the iterations, but never
>>>   drops to 0.
>>> - When the issue occurs, calling __drain_all_pages() decreases
>>>   per_cpu_pages[1].count to 0.
>>
>> That's indeed weird. Maybe there is a race, or zone_pcp_disable() is
>> not fully effective for a zone?
>
> I often see that there are still pages in the PCP after zone_pcp_disable().
>
>>
>>>
>>> So I wonder if it's fine to call __drain_all_pages() in the loop?
>>>
>>> Looking forward to your insights.
>>
>> So, in free_unref_page(), we make sure to never place MIGRATE_ISOLATE
>> onto the PCP. All pageblocks we are going to offline should be isolated
>> at this point, so no page that is getting freed and is part of the
>> to-be-offlined range should end up on the PCP. So far the theory.
>>
>>
>> In the offlining code we do
>>
>> 1) Set MIGRATE_ISOLATE
>> 2) zone_pcp_disable() -> set high-and-batch to 0 and drain
>>
>> Could there be a race in free_unref_page(), such that although
>> zone_pcp_disable() succeeds, we would still end up with a page in the
>> pcp? (especially one that has MIGRATE_ISOLATE set for its pageblock?)
>
> Thanks for your idea, I will further investigate in this direction.
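For reference, the check mentioned above looks roughly like this (a
simplified sketch of the free_unref_page() logic in mm/page_alloc.c, not
the verbatim 6.10 code; names kept close to the source):

	/* simplified sketch, not the exact kernel source */
	migratetype = get_pfnblock_migratetype(page, pfn);
	if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
		if (is_migrate_isolate(migratetype)) {
			/* isolated pageblock: bypass the PCP, free to buddy directly */
			free_one_page(zone, page, pfn, order, FPI_NONE);
			return;
		}
		migratetype = MIGRATE_MOVABLE;
	}
	/* otherwise the page is queued on this CPU's pcp_list */
	free_unref_page_commit(zone, pcp, page, migratetype, order);

If the pageblock's migratetype were sampled before MIGRATE_ISOLATE is set,
and the page were queued after zone_pcp_disable()'s drain, a freed page
could still end up on the pcp_list despite the isolation - that is the
window being asked about.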
Some updates:

CPU0                                      CPU1
-----------                               ---------
// erase pcp_list
zone_pcp_disable()  // pcp->count = 0
lru_cache_disable()
                                          __rmqueue_pcplist() // re-adds pages to pcp_list
                                          __rmqueue_pcplist() // drops pages from pcp_list
                                          decay_pcp_high()    // drops pages from pcp_list
loop {
    ...
                                          __rmqueue_pcplist() // drops pages from pcp_list;
                                                              // only called a few times during the loop
    scan_movable_pages()
    ...
    migrate_pages()
                                          decay_pcp_high()    // drops pages from pcp_list; called
                                                              // periodically by a worker during the loop
    // wait for the pcp_list to become empty
} while (test_pages_isolated())

We noticed that the re-adding of pages to the pcp_list in __rmqueue_pcplist()
only happens once: pcp->count jumps from 0 to 200, for example. Later calls
to __rmqueue_pcplist() then drop pcp->count by 1 per call, e.g.
199->198->197->196..., but __rmqueue_pcplist() stops being called after a few
times, before pcp->count has dropped to 0. In the normal/good case, we see
__rmqueue_pcplist() drop pcp->count all the way to 0.

Here is the __rmqueue_pcplist() call trace:

CPU: 1 PID: 3615 Comm: consume_std_pag Not tainted 6.10.0-rc2+ #147
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x64/0x80
 __rmqueue_pcplist+0xd55/0xdf0
 get_page_from_freelist+0x2a1/0x1770
 __alloc_pages_noprof+0x1a0/0x380
 alloc_pages_mpol_noprof+0xe3/0x1f0
 vma_alloc_folio_noprof+0x5c/0xb0
 folio_prealloc+0x21/0x80
 do_pte_missing+0x695/0xa20
 ? __pte_offset_map+0x1b/0x180
 __handle_mm_fault+0x65f/0xc10
 ? kmem_cache_free+0x370/0x410
 handle_mm_fault+0x128/0x360
 do_user_addr_fault+0x309/0x810
 exc_page_fault+0x7e/0x180
 asm_exc_page_fault+0x26/0x30
RIP: 0033:0x7f3d8729028a

In the meantime, decay_pcp_high() is called periodically to drop pcp->count:

[  145.117256]  decay_pcp_high+0x68/0x90
[  145.117256]  refresh_cpu_vm_stats+0x149/0x2a0
[  145.117256]  vmstat_update+0x13/0x50
[  145.117256]  process_scheduled_works+0xa6/0x420
[  145.117256]  worker_thread+0x117/0x270
[  145.117256]  ? __pfx_worker_thread+0x10/0x10
[  145.117256]  kthread+0xe5/0x120
[  145.117256]  ? __pfx_kthread+0x10/0x10
[  145.117256]  ret_from_fork+0x34/0x40
[  145.117256]  ? __pfx_kthread+0x10/0x10
[  145.117256]  ret_from_fork_asm+0x1a/0x30

But decay_pcp_high() stops dropping pcp->count after a few moments. In other
words, decay_pcp_high() keeps being called periodically, yet it no longer
drops pcp->count. A piece of the pcp content shows as below:

  ...
  count = 7,
  high = 7,
  high_min = 0,
  high_max = 0,
  batch = 1,
  flags = 0 '\000',
  alloc_factor = 3 '\003',
  expire = 0 '\000',
  free_count = 0,
  lists = {{
  ...

The corresponding code in decay_pcp_high() (mm/page_alloc.c):

2244  * Called from the vmstat counter updater to decay the PCP high.
2245  * Return whether there are addition works to do.
2246  */
2247 int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
2248 {
2249 	int high_min, to_drain, batch;
2250 	int todo = 0;
2251 
2252 	high_min = READ_ONCE(pcp->high_min);
2253 	batch = READ_ONCE(pcp->batch);
2254 	/*
2255 	 * Decrease pcp->high periodically to try to free possible
2256 	 * idle PCP pages. And, avoid to free too many pages to
2257 	 * control latency. This caps pcp->high decrement too.
2258 	 */
2259 	if (pcp->high > high_min) {
2260 		pcp->high = max3(pcp->count - (batch << CONFIG_PCP_BATCH_SCALE_MAX),
2261 				 pcp->high - (pcp->high >> 3), high_min);
2262 		if (pcp->high > high_min)
2263 			todo++;
2264 	}
2265 
2266 	to_drain = pcp->count - pcp->high; // to_drain will be 0 (when count == high),
					   // so no pages can be dropped from the pcp_list
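	/*
	 * Annotation, not in the kernel source: with the pcp dump above
	 * (count = 7, high = 7, high_min = 0, batch = 1), the decay above
	 * computes max3(7 - (1 << CONFIG_PCP_BATCH_SCALE_MAX), 7 - (7 >> 3), 0).
	 * The first term is negative and 7 >> 3 == 0, so pcp->high stays at 7
	 * and to_drain = 7 - 7 = 0. Once pcp->high has decayed down to
	 * pcp->count, nothing is freed anymore even though zone_pcp_disable()
	 * wants the list empty - matching the logs below, which stop at
	 * "to_drain 1, new 7".
	 */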
2267 	if (to_drain > 0) {
2268 		spin_lock(&pcp->lock);
2269 		free_pcppages_bulk(zone, to_drain, pcp, 0);
2270 		spin_unlock(&pcp->lock);
2271 		todo++;
2272 	}
2273 
2274 	if (mutex_is_locked(&pcp_batch_high_lock) && pcp->high_max == 0 && to_drain > 0 && pcp->count >= 0)
2275 		pr_info("lizhijian:%s,%d: cpu%d, to_drain %d, new %d\n", __func__, __LINE__, smp_processor_id(), to_drain, pcp->count);
2276 
2277 	return todo;
2278 }

I'm wondering if we can fix this in decay_pcp_high(): let it drop pcp->count
all the way to 0 while zone_pcp_disable() is in effect.

The following logs show pcp->count in the bad case ("new" is pcp->count at
the end of the function).

=====================
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a0159, old 71, new 70, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015a, old 70, new 69, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015b, old 69, new 68, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015c, old 68, new 67, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015d, old 67, new 66, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015e, old 66, new 65, add 0, drop 1
Jul 11 18:34:14 linux kernel: lizhijian:__rmqueue_pcplist,2945: cpu1, pfn 6a015f, old 65, new 64, add 0, drop 1
...
Jul 11 18:34:18 linux kernel: lizhijian: offline_pages,2087: cpu0: [6a0000-6a8000] get trouble: pcplist[1]: 0->63, batch 1, high 154
...
Jul 11 18:34:25 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 7, new 56
Jul 11 18:34:26 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 7, new 49
Jul 11 18:34:27 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 6, new 43
...
Jul 11 18:34:40 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 1, new 10
Jul 11 18:34:41 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 1, new 9
Jul 11 18:34:42 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 1, new 8
Jul 11 18:34:43 linux kernel: lizhijian:decay_pcp_high,2275: cpu1, to_drain 1, new 7
=====================

Thanks
Zhijian

>
>
> Thanks
> Zhijian
>
>>
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>