On 7/1/2024 3:14 PM, David Hildenbrand wrote:
> On 01.07.24 03:25, Zhijian Li (Fujitsu) wrote:
>> Hi all
>>
>> Overview:
>> During testing the CXL memory hotremove, we noticed that `daxctl offline-memory dax0.0`
>> would get stuck forever sometimes. daxctl offline-memory dax0.0 will write "offline" to
>> /sys/devices/system/memory/memoryNNN/state.
>
> Hi,
>
> See
>
> Documentation/admin-guide/mm/memory-hotplug.rst

Many thanks for the pointer. It reminds me that we sometimes hit OOM during the test.

> "
> Further, when running into out of memory situations while migrating
> pages, or when still encountering permanently unmovable pages within
> ZONE_MOVABLE (-> BUG), memory offlining will keep retrying until it
> eventually succeeds.
>
> When offlining is triggered from user space, the offlining context can
> be terminated by sending a signal. A timeout based offlining can easily
> be implemented via::
>
>         % timeout $TIMEOUT offline_block | failure_handling
> "
>
>> Workaround:
>> When it happens, we can type Ctrl-C to abort it and then retry again.
>> Then the CXL memory is able to offline successfully.
>>
>> Where the kernel gets stuck:
>> After digging into the kernel, we found that when the issue occurs, the kernel
>> is stuck in the outer loop of offline_pages().
>> Below is a piece of the highlighted offline_pages():
>> ```
>> int __ref offline_pages()
>> {
>>   do { // outer loop
>>     pfn = start_pfn;
>>     do {
>>       ret = scan_movable_pages(pfn, end_pfn, &pfn); // It returns -ENOENT
>>       if (!ret)
>>         do_migrate_range(pfn, end_pfn); // Not reach here
>>     } while (!ret);
>>     ret = test_pages_isolated(start_pfn, end_pfn, MEMORY_OFFLINE);
>>   } while (ret); // ret is -EBUSY
>> }
>> ```
>>
>> In this case, we dumped the first page that cannot be isolated (see dump_page below), it's
>> content does not change in each iteration.:
>> ```
>> Jun 28 15:29:26 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
>> Jun 28 15:29:26 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
>> Jun 28 15:29:26 linux kernel: raw: 009fffffc0000000 ffffdfbd9e603788 ffffd4f0ffd97ef0 0000000000000000
>> Jun 28 15:29:26 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
>> Jun 28 15:29:26 linux kernel: page dumped because: trouble page...
>
> Are you sure that's the problematic page?

Yes, I dumped the page in the `else` branch of __test_page_isolated_in_pageblock(), see below:

  __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
                                    int flags)
  {
          struct page *page;

          while (pfn < end_pfn) {
                  page = pfn_to_page(pfn);
                  if (PageBuddy(page))
                          /*
                           * If the page is on a free list, it has to be on
                           * the correct MIGRATE_ISOLATE freelist. There is no
                           * simple way to verify that as VM_BUG_ON(), though.
                           */
                          pfn += 1 << buddy_order(page);
                  else if ((flags & MEMORY_OFFLINE) && PageHWPoison(page))
                          /* A HWPoisoned page cannot be also PageBuddy */
                          pfn++;
                  else if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
                           !page_count(page))
                          /*
                           * The responsible driver agreed to skip PageOffline()
                           * pages when offlining memory by dropping its
                           * reference in MEM_GOING_OFFLINE.
                           */
                          pfn++;
                  else /****************** dump_page(page) here ****************/
                          break;
          }

          return pfn;
  }

We also dumped that page at the beginning of offline_pages(), and it had the same
struct page content. IOW, this page was already problematic before the loop.

> refcount:0
>
> Indicates that the page is free. But maybe it does not have PageBuddy() set.
>
> It could also be that this is a "tail" page of a PageBuddy() page,

It doesn't seem to be a tail page of a PageBuddy() page. I also checked that it is
not covered by the buddy_order(page) range of the preceding PageBuddy() page.

> and
> somehow we always end up on the tail in test_pages_isolated().
>
> Which kernel + architecture are you testing with?

This test runs in a QEMU/TCG x86_64 guest with kernel v6.10-rc2; the host is x86_64.

  /home/lizhijian/qemu/build/qemu-system-x86_64 \
      -name guest=fedora-37-client \
      -nographic \
      -machine pc-q35-3.1,accel=tcg,nvdimm=on,cxl=on \
      -cpu qemu64 \
      -smp 4,sockets=4,cores=1,threads=1 \
      -m size=8G,slots=8,maxmem=19922944k \
      -hda ./Fedora-Server-1.qcow2 \
      -object memory-backend-ram,size=4G,id=m0 \
      -object memory-backend-ram,size=4G,id=m1 \
      -numa node,nodeid=0,cpus=0-1,memdev=m0 \
      -numa node,nodeid=1,cpus=2-3,memdev=m1 \
      -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
      -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
      -object memory-backend-ram,size=2G,share=on,id=vmem0 \
      -device cxl-type3,bus=root_port13,volatile-memdev=vmem0,id=type3-cxl-vmem0 \
      -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=32G,cxl-fmw.0.interleave-granularity=4k

>
>> ```
>>
>> Every time the issue occurs, the content of the page structure is
>> similar.
>>
>> Questions:
>> Q1. Is this behavior expected? At least for an OS administrator, it should return
>>     promptly (success or failure) instead of hanging indefinitely.
>
> It's expected that it might take a long time (possibly forever) in
> corner cases. See documentation.
>
> But it's likely unexpected that we have some problematic page here.
>
>> Q2. Regarding the offline_pages() function, encountering such a page indeed causes
>>     an endless loop. Shouldn't another part of the kernel timely changed the state
>>     of this page?
>
> There are various things that can go wrong. One issue might be that we
> try migrating a page but continuously fail to allocate memory to be
> used as a migration target. It seems unlikely with the page you dumped
> above, though.
>
> Do you maybe have that CXL memory be on a separate "fake" NUMA node,

Yes, it's a memory-only (CPU-less) node.

```
[root@localhost guest]# numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 3927 MB
node 0 free: 3430 MB
node 1 cpus: 2 3
node 1 size: 4028 MB
node 1 free: 3620 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node distances:
node   0   1   2
  0:  10  20  20
  1:  20  10  20
  2:  20  20  10
[root@localhost guest]# daxctl online-memory dax0.0 --movable
onlined memory for 1 device
[root@localhost guest]# numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 3927 MB
node 0 free: 3449 MB
node 1 cpus: 2 3
node 1 size: 4028 MB
node 1 free: 3614 MB
node 2 cpus:
node 2 size: 2048 MB
node 2 free: 2048 MB
node distances:
node   0   1   2
  0:  10  20  20
  1:  20  10  20
  2:  20  20  10
```

> and your workload mbind() itself to that NUMA node, possibly refusing
> to migrate somewhere else?

In most test runs, we do see the pages migrate to other nodes when we trigger a
memory offline.
>
>>
>> When I use the workaround mentioned above (Ctrl-C and try offline again), I find
>> that the page state changes (see dump_page below):
>> ```
>> Jun 28 15:33:12 linux kernel: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x7980dd
>> Jun 28 15:33:12 linux kernel: flags: 0x9fffffc0000000(node=2|zone=3|lastcpupid=0x1fffff)
>> Jun 28 15:33:12 linux kernel: raw: 009fffffc0000000 dead000000000100 dead000000000122 0000000000000000
>> Jun 28 15:33:12 linux kernel: raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
>> Jun 28 15:33:12 linux kernel: page dumped because: previous trouble page
>> ```
>>
>> What our test does:
>> We have a CXL memory device, which is configured as kmem and online into the MOVABLE
>> zone as NUMA node2. We run two processes, consume-memory and offline-memory, in parallel,
>> see the pseudo code below:
>>
>> ```
>> main()
>> {
>>     if (fork() == 0)
>>         numactl -m 2 ./consume-memory
>
> What exactly does "consume-memory" do? Does it involve hugetlb maybe?

No, it only allocates pages with malloc() and touches them; see the code below.
We also tried a 2M hugetlb pattern; in that case the offline either succeeds or
fails with EBUSY promptly.

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	unsigned long long mem_size = 0;

	if (argc < 2) {
		printf("please specify the mem size in MB!\n");
		return -1;
	}

	mem_size = strtoull(argv[1], NULL, 10);
	if (mem_size == 0) {	/* unsigned, so only 0 is invalid */
		printf("invalid mem size '%s'\n", argv[1]);
		return -1;
	}
	printf("the mem size is %llu MB\n", mem_size);
	mem_size *= 1024 * 1024;

	char *a = (char *)malloc(mem_size);
	if (!a) {
		printf("malloc failed\n");
		return -1;
	}

	/* Touch every page so that the memory is really allocated. */
	memset(a, 0, mem_size);

	return 0;
}
```

Feel free to let me know if you would like me to add some trace/debug code for a
further check.

Thanks
Zhijian