On Wed, Apr 27, 2022 at 12:48:16PM +0200, David Hildenbrand wrote:
> I raised some time ago already that I don't quite see the value of
> allowing memory offlining with poisoned pages.
>
> 1) It overcomplicates the offlining code and seems to be partially
>    broken
> 2) It happens rarely (ever?), so do we even care?
> 3) Once the memory is offline, we can re-online it and lose HWPoison.
>    The memory can be happily used.
>
> 3) can happen easily if our DIMM consists of multiple memory blocks and
> offlining of some memory block fails -> we'll re-online all already
> offlined ones. We'll happily reuse previously HWPoisoned pages, which
> feels more dangerous to me than just leaving the DIMM around (and
> eventually hwpoisoning all pages on it such that it won't get used
> anymore?).
>
> So maybe we should just fail offlining once we stumble over a hwpoisoned
> page?
>
> Yes, we would disallow removing a semi-broken DIMM from the system that
> was onlined MOVABLE. I wonder if we really need that and how often it
> happens in real life. Most systems I am aware of don't allow for
> replacing individual DIMMs, but only complete NUMA nodes. Hm.

I tend to agree with all you said.
And to be honest, the mechanism of making a semi-broken DIMM healthy
again has always been a mystery to me.

One would think that:

1- you hot-remove the memory
2- you fix/remove it
3- you hotplug memory again

but I am not sure how many times this came to be.

And there is also the thing about losing the hwpoison information
between offline<->online transitions, so the thing is unreliable.

And for that to work, we would have to add a bunch of code to keep
track of "offlined" pages that are hwpoisoned, so we flag them again
once they get onlined, and that means more room for errors.

So, I would lean towards not allowing offlining of memory that contains
such pages in the first place, unless that proves to be a no-go.

-- 
Oscar Salvador
SUSE Labs