On Thu 12-07-18 19:34:16, Wei Wang wrote:
> On 07/12/2018 04:13 PM, Michal Hocko wrote:
> > On Thu 12-07-18 10:52:08, Wei Wang wrote:
> > > On 07/12/2018 10:30 AM, Linus Torvalds wrote:
> > > > On Wed, Jul 11, 2018 at 7:17 PM Wei Wang <wei.w.wang@xxxxxxxxx> wrote:
> > > > > Would it be better to remove __GFP_THISNODE? We actually want to get all
> > > > > the guest free pages (from all the nodes).
> > > > Maybe. Or maybe it would be better to have the memory balloon logic be
> > > > per-node? Maybe you don't want to remove too much memory from one
> > > > node? I think it's one of those "play with it" things.
> > > >
> > > > I don't think that's the big issue, actually. I think the real issue
> > > > is how to react quickly and gracefully to "oops, I'm trying to give
> > > > memory away, but now the guest wants it back" while you're in the
> > > > middle of trying to create that 2TB list of pages.
> > > OK. virtio-balloon has already registered an oom notifier
> > > (virtballoon_oom_notify). I plan to add some control there. If oom happens,
> > > - stop the page allocation;
> > > - immediately give back the allocated pages to mm.
> > Please don't. Oom notifier is an absolutely hideous interface which
> > should go away sooner or later (I would much rather like the former) so
> > do not build a new logic on top of it. I would appreciate if you
> > actually remove the notifier much more.
> >
> > You can give memory back from the standard shrinker interface. If we are
> > reaching low reclaim priorities then we are struggling to reclaim memory
> > and then you can start returning pages back.
>
> OK. Just curious why oom notifier is thought to be hideous, and has it been
> a consensus?

Because it is a completely non-transparent callout from the OOM context
which is really subtle on its own. It is just too easy to end up in
weird corner cases. We really have to be careful and be as swift as
possible.
Any potential sleep would make the OOM situation much worse because
nobody would be able to make forward progress, and any (in)direct
dependency on the MM subsystem can easily deadlock. Those are really
hard to track down, and defining the notifier as blockable by design
just asks for bad implementations because most people simply do not
realize how subtle the OOM context is.

Another thing is that it happens way too late, when we have basically
reclaimed the world and still didn't get out of the memory pressure, so
you can expect that any workload is suffering already. Anybody sitting
on a large amount of reclaimable memory should have released that
memory by that time, ideally in proportion to the reclaim pressure.

The notifier API is completely unaware of OOM constraints. Just imagine
you are OOM in a subset of NUMA nodes. The callback has no idea about
that.

Moreover, we do have a proper reclaim mechanism that has a feedback
loop, and that should always be preferable to an abrupt reclaim.
--
Michal Hocko
SUSE Labs
_______________________________________________
Virtualization mailing list
Virtualization@xxxxxxxxxxxxxxxxxxxxxxxxxx
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
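
For illustration only, here is a toy user-space model (Python, not
kernel code) of the point made in the reply above: memory should be
handed back in proportion to reclaim pressure rather than all at once
from an OOM callout. DEF_PRIORITY = 12 matches the kernel constant of
that name; everything else (Balloon, scan_target(), shrink(), and the
right-shift policy itself) is an assumption made for this sketch and is
not the actual logic of the virtio-balloon driver or the shrinker core.

```python
"""Toy model of pressure-proportional balloon deflation (not kernel code)."""

DEF_PRIORITY = 12  # reclaim starts here; lower values mean higher pressure


def scan_target(freeable: int, priority: int) -> int:
    """Pages to hand back in one pass.

    Assumed policy: release a share of `freeable` that doubles each time
    the reclaim priority drops by one, and release everything at 0.
    """
    if freeable < 0:
        raise ValueError("freeable must be non-negative")
    if not 0 <= priority <= DEF_PRIORITY:
        raise ValueError("priority out of range")
    return freeable >> priority


class Balloon:
    """Hypothetical stand-in for a driver holding reclaimable pages."""

    def __init__(self, pages: int) -> None:
        self.pages = pages

    def shrink(self, priority: int) -> int:
        """Give back a pressure-proportional number of pages."""
        released = scan_target(self.pages, priority)
        self.pages -= released
        return released


if __name__ == "__main__":
    balloon = Balloon(1 << 20)
    # Walk from light pressure to the last pass before OOM would be declared.
    for priority in range(DEF_PRIORITY, -1, -1):
        released = balloon.shrink(priority)
        print(f"priority {priority:2d}: released {released}, held {balloon.pages}")
```

In this model a balloon of 4096 pages gives back a single page at
DEF_PRIORITY and everything it still holds at priority 0, so most of
the memory is returned before the system ever reaches OOM. A real
driver would hook this into the shrinker interface mentioned in the
thread; the details of that API are outside this sketch.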