On Thu, May 28, 2020 at 1:30 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> On Thu, May 28, 2020 at 08:48:31PM +0100, Chris Down wrote:
> > Shakeel Butt writes:
> > > What was the initial reason to have different behavior in the first place?
> >
> > This differing behaviour is simply a mistake, it was never intended to
> > deviate this much from what happens elsewhere. To that extent this patch
> > is as much a bug fix as it is an improvement.
>
> Yes, it was an oversight.
>
> > > >  static void high_work_func(struct work_struct *work)
> > > > @@ -2378,16 +2384,20 @@ void mem_cgroup_handle_over_high(void)
> > > >  {
> > > >  	unsigned long penalty_jiffies;
> > > >  	unsigned long pflags;
> > > > +	unsigned long nr_reclaimed;
> > > >  	unsigned int nr_pages = current->memcg_nr_pages_over_high;
> > >
> > > Is there any benefit to keep current->memcg_nr_pages_over_high after
> > > this change? Why not just use SWAP_CLUSTER_MAX?
>
> It's there for the same reason why try_to_free_pages() takes a reclaim
> argument in the first place: we want to make the thread allocating the
> most also do the most reclaim work. Consider a thread allocating THPs
> in a loop with another thread allocating regular pages.
>
> Remember that all callers loop. They could theoretically all just ask
> for SWAP_CLUSTER_MAX pages over and over again.
>
> The idea is to have fairness in most cases, and avoid allocation
> failure, premature OOM, and containment failure in the edge cases that
> are caused by the inherent raciness of page reclaim.

Thanks for the explanation.

> > I don't feel strongly either way, but current->memcg_nr_pages_over_high
> > can be very large for large allocations.
> >
> > That said, maybe we should just reclaim `max(SWAP_CLUSTER_MAX, current -
> > high)` for each loop? I agree that with this design it looks like perhaps
> > we don't need it any more.
> >
> > Johannes, what do you think?
>
> How about this:
>
> Reclaim memcg_nr_pages_over_high in the first iteration, then switch
> to SWAP_CLUSTER_MAX in the retries.
>
> This acknowledges that while the page allocator and memory.max reclaim
> every time an allocation is made, memory.high is currently batched and
> can have larger targets. We want the allocating thread to reclaim at
> least the batch size, but beyond that only what's necessary to prevent
> premature OOM or failing containment.
>
> Add a comment stating as much.
>
> Once we reclaim memory.high synchronously instead of batched, this
> exceptional handling is no longer needed and can be deleted again.
>
> Does that sound reasonable?

SGTM. It does not seem controversial to me to let the task do the work
to resolve the condition for which it is being throttled.