On Tue 27-06-23 15:14:11, David Hildenbrand wrote: > On 27.06.23 14:40, Michal Hocko wrote: > > On Tue 27-06-23 13:22:18, David Hildenbrand wrote: > > > John Hubbard writes [1]: > > > > > > Some device drivers add memory to the system via memory hotplug. > > > When the driver is unloaded, that memory is hot-unplugged. > > > > > > However, memory hot unplug can fail. And these days, it fails a > > > little too easily, with respect to the above case. Specifically, if > > > a signal is pending on the process, hot unplug fails. > > > > > > [...] > > > > > > So in this case, other things (unmovable pages, un-splittable huge > > > pages) can also cause the above problem. However, those are > > > demonstrably less common than simply having a pending signal. I've > > > got bug reports from users who can trivially reproduce this by > > > killing their process with a "kill -9", for example. > > > > This looks like a bug of the said driver no? If the tear down process is > > killed it could very well happen right before offlining so you end up in > > the very same state. Or what am I missing? > > IIUC (John can correct me if I am wrong): > > 1) The process holds the device node open > 2) The process gets killed or quits > 3) As the process gets torn down, it closes the device node > 4) Closing the device node results in the driver removing the device and > calling offline_and_remove_memory() > > So it's not a "tear down process" that triggers that offlining_removal > somehow explicitly, it's just a side-product of it letting go of the device > node as the process gets torn down. Isn't that just fragile? The operation might fail for other reasons. Why cannot there be a hold on the resource to control the tear down explicitly? > > > Especially with ZONE_MOVABLE, offlining is supposed to work in most > > > cases when offlining actually hotplugged (not boot) memory, and only fail > > > in rare corner cases (e.g., some driver holds a reference to a page in > > > ZONE_MOVABLE, turning it unmovable). > > > > > > In these corner cases we really don't want to be stuck forever in > > > offline_and_remove_memory(). But in the general cases, we really want to > > > do our best to make memory offlining succeed -- in a reasonable > > > timeframe. > > > > > > Reliably failing in the described case when there is a fatal signal pending > > > is sub-optimal. The pending signal check is mostly only relevant when user > > > space explicitly triggers offlining of memory using sysfs device attributes > > > ("state" or "online" attribute), but not when coming via > > > offline_and_remove_memory(). > > > > > > So let's use a timer instead and ignore fatal signals, because they are > > > not really expressive for offline_and_remove_memory() users. Let's default > > > to 30 seconds if no timeout was specified, and limit the timeout to 120 > > > seconds. > > > > I really hate having timeouts back. They just proven to be hard to get > > right and it is essentially a policy implemented in the kernel. They > > simply do not belong to the kernel space IMHO. > > As much as I agree with you in terms of offlining triggered from user space > (e.g., write "state" or "online" attribute) where user-space is actually in > charge and can do something reasonable (timeout, retry, whatever), in these > the offline_and_remove_memory() case it's the driver that wants a > best-effort memory offlining+removal. > > If it times out, virtio-mem will simply try another block or retry later. > Right now, it could get stuck forever in offline_and_remove_memory(), which > is obviously "not great". Fortunately, for virtio-mem it's configurable and > we use the alloc_contig_range()-method for now as default. It seems that offline_and_remove_memory is using a wrong operation then. If it wants an opportunistic offlining with some sort of policy. Timeout might be just one policy to use but failure mode or a retry count might be a better fit for some users. So rather than (ab)using offline_pages, would be make more sense to extract basic offlining steps and allow drivers like virtio-mem to reuse them and define their own policy? > If it would time out for John's driver, we most certainly don't want to be > stuck in offline_and_remove_memory(), blocking device/driver unloading (and > even a reboot IIRC) possibly forever. Now I am confused. John driver wants to tear down the device now? There is no way to release that memory otherwise AFAIU from the initial problem description. -- Michal Hocko SUSE Labs