On 13.02.22 12:07, Yang Yanchao wrote:
> Hello,
>

Hi,

> I find a hanging issue during memory-hotplug on kernel-4.18.

you actually mean memory hotunplug / memory offlining IIUC.

> Repetition steps:
> 1. malloc for all system memory, write 'x', then free
> 2. for each removable memory block:

Note that "removable=yes" was always racy, and upstream Linux nowadays
only keeps that property around to not break older user space --
upstream Linux always reports "removable=yes" if memory offlining is
supported.

> echo offline > /sys/devices/system/memory/memoryXXX/state
> Then during the offline process, there is a high probability of being stuck for more than 20 minutes to five hours.
> cat /sys/devices/system/memory/memoryXXX/state
> The status is "going-offline"
>
> I try to understand it by adding some print to the kernel. The discovery process can't exit in this loop:
> __offline_pages
> do_migrate_range
> migrate_pages
> unmap_and_move
> move_to_new_page
> fallback_migrate_page --> return EAGAIN
>
> I try to clear the cache, but it don't seems to solve the problem.
> echo 3 > /proc/sys/vm/drop_caches
>
> Can I fix this problem with other Settings? Or can I see why it's stuck?

There are no real guarantees about what will happen when trying to
offline a memory block that's not onlined to ZONE_MOVABLE. You can check
the zone, e.g., via

$ cat /sys/devices/system/memory/memory40/valid_zones
Normal

Even with ZONE_MOVABLE, it can take quite a while (and in corner cases
possibly forever) until offlining succeeds. Now, 20 minutes is a bit
extreme.

User space can always cancel offlining -- in your example, by killing
the "echo offline > /sys/devices/system/memory/memoryXXX/state" process.

That said, as raised by Matthew, a lot has changed since 4.18, so you
should try reproducing this upstream. But even there, you can just
cancel offlining if it takes too long.
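
To illustrate the "check the zone, then cancel if it takes too long" idea, here is a rough sketch rather than a tested recipe. The function name offline_block, the MEM_SYSFS variable and the 60 second default are made up for this example. It assumes timeout(1) from coreutils is available and that the SIGTERM it sends is enough to interrupt the pending write, in the same way as killing the echo process by hand.

```shell
#!/bin/sh
# Sketch: try to offline one memory block, but give up after a deadline.
# MEM_SYSFS, offline_block and the 60 s default are illustrative assumptions.
MEM_SYSFS="${MEM_SYSFS:-/sys/devices/system/memory}"

offline_block() {
    block="$1"          # e.g. memory40
    limit="${2:-60}"    # seconds to wait before cancelling
    dir="$MEM_SYSFS/$block"

    # A block that is not onlined to ZONE_MOVABLE has no offlining guarantees.
    zones="$(cat "$dir/valid_zones" 2>/dev/null)"
    case "$zones" in
        Movable*) ;;
        *) echo "$block: valid_zones='$zones', offlining may never finish" >&2 ;;
    esac

    # The inner sh performs the write itself; when the limit expires,
    # timeout(1) signals it, which is how the request gets cancelled.
    if timeout "$limit" sh -c 'echo offline > "$1"' sh "$dir/state"; then
        echo "$block: $(cat "$dir/state")"
    else
        echo "$block: write failed or gave up after ${limit}s" >&2
        return 1
    fi
}
```

Calling e.g. "offline_block memory40 120" as root would then bound each attempt to two minutes instead of leaving the block in "going-offline" for hours.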

If you observe similar behavior on ZONE_MOVABLE, it would be interesting
to find out how to better handle that to make offlining succeed faster.

-- 
Thanks,

David / dhildenb