On Fri 14-07-17 13:43:21, Michal Hocko wrote: > On Fri 14-07-17 13:29:14, Vlastimil Babka wrote: > > On 07/14/2017 10:00 AM, Michal Hocko wrote: > > > From: Michal Hocko <mhocko@xxxxxxxx> > > > > > > build_all_zonelists has been (ab)using stop_machine to make sure that > > > zonelists do not change while somebody is looking at them. This is > > > is just a gross hack because a) it complicates the context from which > > > we can call build_all_zonelists (see 3f906ba23689 ("mm/memory-hotplug: > > > switch locking to a percpu rwsem")) and b) is is not really necessary > > > especially after "mm, page_alloc: simplify zonelist initialization". > > > > > > Updates of the zonelists happen very seldom, basically only when a zone > > > becomes populated during memory online or when it loses all the memory > > > during offline. A racing iteration over zonelists could either miss a > > > zone or try to work on one zone twice. Both of these are something we > > > can live with occasionally because there will always be at least one > > > zone visible so we are not likely to fail allocation too easily for > > > example. > > > > Given the experience with with cpusets and mempolicies, I would rather > > avoid the risk of allocation not seeing the only zone(s) that are > > allowed by its nodemask, and triggering premature OOM. > > I would argue, those are a different beast because they are directly > under control of not fully priviledged user and change between the empty > nodemask and cpusets very often. For this one to trigger we > would have to online/offline the last memory block in the zone very > often and that doesn't resemble a sensible usecase even remotely. > > > So maybe the > > updates could be done in a way to avoid that, e.g. first append a copy > > of the old zonelist to the end, then overwrite and terminate with NULL. > > But if this requires any barriers or something similar on the iteration > > site, which is performance critical, then it's bad. > > Maybe a seqcount, that the iteration side only starts checking in the > > slowpath? Like we have with cpusets now. > > I know that Mel noted that stop_machine() also never had such guarantees > > to prevent this, but it could have made the chances smaller. > > I think we can come up with some scheme but is this really worth it > considering how unlikely the whole thing is? Well, if somebody hits a > premature OOM killer or allocations failures it would have to be along > with a heavy memory hotplug operations and then it would be quite easy > to spot what is going on and try to fix it. I would rather not > overcomplicate it, to be honest. And one more thing, Mel has already brought this up in his response. stop_machine haven't is very roughly same strenght wrt. double zone visit or a missed zone because we do not restart zonelist iteration. -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@xxxxxxxxx. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@xxxxxxxxx"> email@xxxxxxxxx </a>