On 27.04.22 06:28, Naoya Horiguchi wrote:
> Hi,
>
> This patchset addresses some issues on the workload related to hwpoison,
> hugetlb, and memory_hotplug. The problem in memory hotremove reported by
> Miaohe Lin [1] is mentioned in 2/4. This patch depends on "storing raw
> error info" functionality provided by 1/4. This patch also provide delayed
> dissolve function too.
>
> Patch 3/4 is to adjust unpoison to new semantics of HPageMigratable for
> hwpoisoned hugepage. And 4/4 is the fix for the inconsistent counter issue.
>
> [1] https://lore.kernel.org/linux-mm/20220421135129.19767-1-linmiaohe@xxxxxxxxxx/
>
> Please let me know if you have any suggestions and comments.
>

Hi,

I raised some time ago already that I don't quite see the value of
allowing memory offlining with poisoned pages.

1) It overcomplicates the offlining code and seems to be partially
   broken.
2) It happens rarely (ever?), so do we even care?
3) Once the memory is offline, we can re-online it and lose the
   HWPoison state. The memory can then be happily used.

3) can happen easily if our DIMM consists of multiple memory blocks and
offlining of some memory block fails -> we'll re-online all already
offlined ones. We'll happily reuse previously HWPoisoned pages, which
feels more dangerous to me than just leaving the DIMM around (and
eventually hwpoisoning all pages on it such that it won't get used
anymore?).

So maybe we should just fail offlining once we stumble over a hwpoisoned
page?

Yes, we would disallow removing a semi-broken DIMM from the system that
was onlined MOVABLE. I wonder if we really need that and how often it
happens in real life. Most systems I am aware of don't allow for
replacing individual DIMMs, but only complete NUMA nodes. Hm.

-- 
Thanks,

David / dhildenb