Sorry for delayed reply. This has fallen through the cracks... On Thu 28-04-22 13:16:19, Stefan Roesch wrote: > On 4/28/22 10:47 AM, Jan Kara wrote: > > On Tue 26-04-22 10:43:32, Stefan Roesch wrote: > >> This change adds support for async write throttling in the function > >> balance_dirty_pages(). So far if throttling was required, the code was > >> waiting synchronously as long as the writes were throttled. This change > >> introduces asynchronous throttling. Instead of waiting in the function > >> balance_dirty_pages(), the timeout is set in the task_struct field > >> bdp_pause. Once the timeout has expired, the writes are no longer > >> throttled. > >> > >> - Add a new parameter to the balance_dirty_pages() function > >> - This allows the caller to pass in the nowait flag > >> - When the nowait flag is specified, the code does not wait in > >> balance_dirty_pages(), but instead stores the wait expiration in the > >> new task_struct field bdp_pause. > >> > >> - The function balance_dirty_pages_ratelimited() resets the new values > >> in the task_struct, once the timeout has expired > >> > >> This change is required to support write throttling for the async > >> buffered writes. While the writes are throttled, io_uring still can make > >> progress with processing other requests. > >> > >> Signed-off-by: Stefan Roesch <shr@xxxxxx> > > > > Maybe I miss something but I don't think this will throttle writers enough. > > For three reasons: > > > > 1) The calculated throttling pauses should accumulate for the task so that > > if we compute that say it takes 0.1s to write 100 pages and the task writes > > 300 pages, the delay adds up to 0.3s properly. Otherwise the task would not > > be throttled as long as we expect the writeback to take. > > > > 2) We must not allow the amount of dirty pages to exceed the dirty limit. > > That can easily lead to page reclaim getting into trouble reclaiming pages > > and thus machine stalls, oom kills etc. So if we are coming close to dirty > > limit and we cannot sleep, we must just fail the nowait write. > > > > 3) Even with above two problems fixed I suspect results will be suboptimal > > because balance_dirty_pages() heuristics assume they get called reasonably > > often and throttle writes so if amount of dirty pages is coming close to > > dirty limit, they think we are overestimating writeback speed and update > > throttling parameters accordingly. So if io_uring code does not throttle > > writers often enough, I think dirty throttling parameters will be jumping > > wildly resulting in poor behavior. > > > > So what I'd probably suggest is that if balance_dirty_pages() is called in > > "async" mode, we'd give tasks a pass until dirty_freerun_ceiling(). If > > balance_dirty_pages() decides the task needs to wait, we store the pause > > and bail all the way up into the place where we can sleep (io_uring code I > > assume), sleep there, and then continue doing write. > > > > Jan, thanks for the feedback. Are you suggesting to change the following > check in the function balance_dirty_pages(): > > /* > * Throttle it only when the background writeback cannot > * catch-up. This avoids (excessively) small writeouts > * when the wb limits are ramping up in case of !strictlimit. > * > * In strictlimit case make decision based on the wb counters > * and limits. Small writeouts when the wb limits are ramping > * up are the price we consciously pay for strictlimit-ing. > * > * If memcg domain is in effect, @dirty should be under > * both global and memcg freerun ceilings. > */ > if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) && > (!mdtc || > m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) { > unsigned long intv; > unsigned long m_intv; > > to include if we are in async mode? Actually no. This condition is the one that gives any task a free pass until dirty_freerun_ceiling(). So there's no need to do any modification for that. Sorry, I've probably formulated my suggestion in a bit confusing way. > There is no direct way to return that the process should sleep. Instead > two new fields are introduced in the proc structure. These two fields are > then used in io_uring to determine if the writes for a task need to be > throttled. > > In case the writes need to be throttled, the writes are not issued, but > instead inserted on a wait queue. We cannot sleep in the general io_uring > code path as we still want to process other requests which are affected > by the throttling. Probably you wanted to say "are not affected by the throttling" in the above. I know that you're using fields in task_struct to propagate the delay info. But IMHO that is unnecessary (although I don't care too much). Instead we could factor out a variant of balance_dirty_pages() that returns 'pause' to sleep, 0 if no sleeping needed. Normal balance_dirty_pages() would use this for pause calculation, places wanting async throttling would only get the pause to sleep. So e.g. iomap_write_iter() would then check and if returned pause is > 0, it would abort the loop similary as we'd abort it for any other reason when NOWAIT write is aborted because we need to sleep. Iouring code then detects short write / EAGAIN and offloads the write to the workqueue where normal balance_dirty_pages() can sleep as needed. This will make sure dirty limits are properly observed and we don't need that much special handling for it. Honza -- Jan Kara <jack@xxxxxxxx> SUSE Labs, CR