On 5/10/22 2:50 AM, Jan Kara wrote:
> Sorry for the delayed reply. This has fallen through the cracks...
>
> On Thu 28-04-22 13:16:19, Stefan Roesch wrote:
>> On 4/28/22 10:47 AM, Jan Kara wrote:
>>> On Tue 26-04-22 10:43:32, Stefan Roesch wrote:
>>>> This change adds support for async write throttling in the function
>>>> balance_dirty_pages(). So far, if throttling was required, the code
>>>> waited synchronously for as long as the writes were throttled. This
>>>> change introduces asynchronous throttling: instead of waiting in
>>>> balance_dirty_pages(), the timeout is set in the task_struct field
>>>> bdp_pause. Once the timeout has expired, the writes are no longer
>>>> throttled.
>>>>
>>>> - Add a new parameter to the balance_dirty_pages() function that
>>>>   allows the caller to pass in the nowait flag. When the nowait flag
>>>>   is specified, the code does not wait in balance_dirty_pages(), but
>>>>   instead stores the wait expiration in the new task_struct field
>>>>   bdp_pause.
>>>>
>>>> - The function balance_dirty_pages_ratelimited() resets the new
>>>>   values in the task_struct once the timeout has expired.
>>>>
>>>> This change is required to support write throttling for async
>>>> buffered writes. While the writes are throttled, io_uring can still
>>>> make progress with processing other requests.
>>>>
>>>> Signed-off-by: Stefan Roesch <shr@xxxxxx>
>>>
>>> Maybe I'm missing something, but I don't think this will throttle
>>> writers enough, for three reasons:
>>>
>>> 1) The calculated throttling pauses should accumulate for the task so
>>> that if we compute that, say, it takes 0.1s to write 100 pages and the
>>> task writes 300 pages, the delay adds up to 0.3s properly. Otherwise
>>> the task would not be throttled as long as we expect the writeback to
>>> take.
>>>
>>> 2) We must not allow the amount of dirty pages to exceed the dirty
>>> limit. That can easily lead to page reclaim getting into trouble
>>> reclaiming pages and thus machine stalls, OOM kills etc. So if we are
>>> coming close to the dirty limit and we cannot sleep, we must just fail
>>> the nowait write.
>>>
>>> 3) Even with the above two problems fixed, I suspect the results will
>>> be suboptimal because the balance_dirty_pages() heuristics assume they
>>> get called reasonably often and throttle writes, so if the amount of
>>> dirty pages comes close to the dirty limit, they conclude we are
>>> overestimating writeback speed and update the throttling parameters
>>> accordingly. So if the io_uring code does not throttle writers often
>>> enough, I think the dirty throttling parameters will be jumping
>>> wildly, resulting in poor behavior.
>>>
>>> So what I'd probably suggest is that if balance_dirty_pages() is
>>> called in "async" mode, we'd give tasks a pass until
>>> dirty_freerun_ceiling(). If balance_dirty_pages() decides the task
>>> needs to wait, we store the pause and bail all the way up into the
>>> place where we can sleep (io_uring code I assume), sleep there, and
>>> then continue doing the write.
>>
>> Jan, thanks for the feedback. Are you suggesting to change the
>> following check in the function balance_dirty_pages():
>>
>> 	/*
>> 	 * Throttle it only when the background writeback cannot
>> 	 * catch-up. This avoids (excessively) small writeouts
>> 	 * when the wb limits are ramping up in case of !strictlimit.
>> 	 *
>> 	 * In strictlimit case make decision based on the wb counters
>> 	 * and limits. Small writeouts when the wb limits are ramping
>> 	 * up are the price we consciously pay for strictlimit-ing.
>> 	 *
>> 	 * If memcg domain is in effect, @dirty should be under
>> 	 * both global and memcg freerun ceilings.
>> 	 */
>> 	if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh) &&
>> 	    (!mdtc ||
>> 	     m_dirty <= dirty_freerun_ceiling(m_thresh, m_bg_thresh))) {
>> 		unsigned long intv;
>> 		unsigned long m_intv;
>>
>> to include if we are in async mode?
>
> Actually no. This condition is the one that gives any task a free pass
> until dirty_freerun_ceiling(), so there's no need to modify it. Sorry,
> I've probably formulated my suggestion in a bit of a confusing way.
>
>> There is no direct way to return that the process should sleep.
>> Instead, two new fields are introduced in the task_struct. These two
>> fields are then used in io_uring to determine if the writes for a task
>> need to be throttled.
>>
>> In case the writes need to be throttled, the writes are not issued,
>> but instead inserted on a wait queue. We cannot sleep in the general
>> io_uring code path as we still want to process other requests which
>> are affected by the throttling.
>
> Probably you wanted to say "are not affected by the throttling" in the
> above.

Yes, that's correct.

> I know that you're using fields in task_struct to propagate the delay
> info. But IMHO that is unnecessary (although I don't care too much).
> Instead, we could factor out a variant of balance_dirty_pages() that
> returns 'pause' to sleep, or 0 if no sleeping is needed. The normal
> balance_dirty_pages() would use this for its pause calculation; places
> wanting async throttling would only get the pause to sleep. So e.g.
> iomap_write_iter() would then check, and if the returned pause is > 0,
> abort the loop similarly to how we'd abort it for any other reason when
> a NOWAIT write needs to sleep. The io_uring code then detects the short
> write / EAGAIN and offloads the write to the workqueue, where the
> normal balance_dirty_pages() can sleep as needed.
>
> This will make sure dirty limits are properly observed and we don't
> need that much special handling for it.
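Just to check that I understand the proposal, is it roughly the
following? (Untested sketch only; balance_dirty_pages_pause() is a name
I made up for illustration, not existing kernel API, and the actual
throttling logic is elided.)

#include <linux/sched.h>
#include <linux/writeback.h>
#include <linux/backing-dev.h>

/*
 * Compute how long the current task should be throttled, in jiffies.
 * Returns 0 if we are below dirty_freerun_ceiling() and no sleeping
 * is needed. Never sleeps itself.
 */
static long balance_dirty_pages_pause(struct bdi_writeback *wb,
				      unsigned long pages_dirtied)
{
	long pause = 0;

	/*
	 * The existing position/rate control logic from
	 * balance_dirty_pages() that computes 'pause' would live
	 * here, minus the io_schedule_timeout() call.
	 */

	return pause;
}

/* The sync path keeps today's behavior and sleeps as needed. */
static void balance_dirty_pages(struct bdi_writeback *wb,
				unsigned long pages_dirtied)
{
	long pause;

	while ((pause = balance_dirty_pages_pause(wb, pages_dirtied)) > 0) {
		__set_current_state(TASK_KILLABLE);
		io_schedule_timeout(pause);
	}
}

An async caller would call balance_dirty_pages_pause() directly and
treat a non-zero return as "abort and retry from a context that can
sleep".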
I like the idea of factoring a function out of balance_dirty_pages().
However, I see two challenges:

- the write operation has already completed at this point, so we can't
  really sleep on its completion in the io-worker in io_uring
- we don't know how long to sleep in io_uring

Currently balance_dirty_pages_ratelimited() is called at the end of
iomap_write_iter(). If balance_dirty_pages_ratelimited() were instead
called at the beginning of iomap_write_iter(), we could return -EAGAIN
and then complete the write in the io-worker. I'm not sure what the
implications are of moving the call to the beginning of
iomap_write_iter(); a rough sketch of what I have in mind follows below.

> 								Honza
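Here is the sketch I mentioned above (again untested; iomap_write_iter()
and IOMAP_NOWAIT exist today, but balance_dirty_pages_ratelimited_async()
is a hypothetical helper that would return the computed pause in jiffies
instead of sleeping, and the body of the copy loop is elided):

static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
{
	struct address_space *mapping = iter->inode->i_mapping;
	loff_t written = 0;

	do {
		/*
		 * Check throttling before copying data, so that a
		 * NOWAIT write can bail out with -EAGAIN and be
		 * retried from the io-worker, where sleeping is fine.
		 */
		if ((iter->flags & IOMAP_NOWAIT) &&
		    balance_dirty_pages_ratelimited_async(mapping) > 0)
			return written ? written : -EAGAIN;

		/* ... existing copy and page-dirtying logic ... */

	} while (iov_iter_count(i));

	return written;
}

Returning the bytes written so far would keep the usual short-write
semantics, so io_uring would see a short write / -EAGAIN and punt the
remainder to the io-worker.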