On 10/14/21 12:47, Mel Gorman wrote:
> Thanks Vlastimil
>
> On Wed, Oct 13, 2021 at 05:39:36PM +0200, Vlastimil Babka wrote:
>> > +/*
>> > + * Account for pages written if tasks are throttled waiting on dirty
>> > + * pages to clean. If enough pages have been cleaned since throttling
>> > + * started then wakeup the throttled tasks.
>> > + */
>> > +void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
>> > +							int nr_throttled)
>> > +{
>> > +	unsigned long nr_written;
>> > +
>> > +	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);
>>
>> Is this intentionally using the __ version that normally expects irqs to be
>> disabled (AFAIK they are not in this path)? I think this is rarely used cold
>> path so it doesn't seem worth to trade off speed for accuracy.
>>
>
> It was intentional because IRQs can be disabled and if it's race-prone,
> it's not overly problematic but you're right, better to be safe. I changed
> it to the safe type as it's mostly free on x86, arm64 and s390 and for
> other architectures, this is a slow path.

Great, thanks.

>> > +	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
>> > +		READ_ONCE(pgdat->nr_reclaim_start);
>>
>> Even if the inc above was safe, node_page_state() will return only the
>> global counter, so the value we read here will only actually increment when
>> some cpu's counter overflows, so it will be "bursty". Maybe it's ok, just
>> worth documenting?
>>
>
> I didn't think the penalty of doing an accurate read while writeback
> throttled is worth it. I'll add a comment.
>
>> > +
>> > +	if (nr_written > SWAP_CLUSTER_MAX * nr_throttled)
>> > +		wake_up_all(&pgdat->reclaim_wait);
>>
>> Hm it seems a bit weird that the more tasks are throttled, the more we wait,
>> and then wake up all. Theoretically this will lead to even more
>> bursty/staggering herd behavior. Could be better to wake up single task each
>> SWAP_CLUSTER_MAX, and bump nr_reclaim_start? But maybe it's not a problem in
>> practice due to HZ/10 timeouts being short enough?
>>
>
> Yes, the more tasks are throttled the longer tasks wait because tasks are
> allocating faster than writeback can complete so I wanted to reduce the
> allocation pressure. I considered waking one task at a time but there is
> no prioritisation of tasks on the waitqueue and it's not clear that the
> additional complexity is justified. With inaccurate counters, a light
> allocator could get throttled for the full timeout unnecessarily.
>
> Even if we were to wake one task at a time, I would prefer it was done
> as a potential optimisation on top.

Fair enough.

> Diff on top based on review feedback;

Thanks, with that you can add

Acked-by: Vlastimil Babka <vbabka@xxxxxxx>

to the updated version

> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index bcd22e53795f..735b1f2b5d9e 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1048,7 +1048,15 @@ void __acct_reclaim_writeback(pg_data_t *pgdat, struct page *page,
>  {
>  	unsigned long nr_written;
>
> -	__inc_node_page_state(page, NR_THROTTLED_WRITTEN);
> +	inc_node_page_state(page, NR_THROTTLED_WRITTEN);
> +
> +	/*
> +	 * This is an inaccurate read as the per-cpu deltas may not
> +	 * be synchronised. However, given that the system is
> +	 * writeback throttled, it is not worth taking the penalty
> +	 * of getting an accurate count. At worst, the throttle
> +	 * timeout guarantees forward progress.
> +	 */
>  	nr_written = node_page_state(pgdat, NR_THROTTLED_WRITTEN) -
>  		READ_ONCE(pgdat->nr_reclaim_start);
>
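
For anyone following the thread, here is a rough userspace model of why the
cheap read is "bursty" and how that interacts with the wakeup threshold. It
is only a sketch and not the kernel code: struct node_model, the model_*()
helpers, NR_CPUS and STAT_THRESHOLD are hypothetical names and values made up
for illustration, and the fold logic is simplified compared to what vmstat
really does. Only the comparison against SWAP_CLUSTER_MAX * nr_throttled is
taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4		/* hypothetical CPU count for the model */
#define STAT_THRESHOLD 32	/* hypothetical per-cpu fold threshold */
#define SWAP_CLUSTER_MAX 32UL	/* same value the kernel defines */

/* Hypothetical stand-in for the per-node counter state. */
struct node_model {
	unsigned long global_written;	/* the only part a cheap read sees */
	int cpu_delta[NR_CPUS];		/* per-cpu deltas not yet folded */
	unsigned long nr_reclaim_start;	/* snapshot from when throttling began */
};

/* Bump one CPU's delta; fold it into the global count once it overflows. */
static void model_inc(struct node_model *n, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	if (++n->cpu_delta[cpu] > STAT_THRESHOLD) {
		n->global_written += n->cpu_delta[cpu];
		n->cpu_delta[cpu] = 0;
	}
}

/* Cheap read: global counter only, so it moves in steps. */
static unsigned long model_read_cheap(const struct node_model *n)
{
	return n->global_written;
}

/* Accurate read: also walks every per-cpu delta, which costs more. */
static unsigned long model_read_accurate(const struct node_model *n)
{
	unsigned long sum = n->global_written;

	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		sum += n->cpu_delta[cpu];
	return sum;
}

/* Wakeup condition from the patch, evaluated on the cheap read. */
static bool model_should_wake(const struct node_model *n, int nr_throttled)
{
	unsigned long nr_written = model_read_cheap(n) - n->nr_reclaim_start;

	return nr_written > SWAP_CLUSTER_MAX * nr_throttled;
}
```

In this model, 32 increments on one CPU leave the cheap read at zero while
the accurate read already reports 32. The next increment folds the delta and
the cheap read jumps in a single step, so a lone throttled task is woken at
that point while two throttled tasks keep waiting for more writeback or for
the timeout.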