On Wed, 7 Jul 2021 11:51:38 +0200 Jan Kara wrote: >On Wed 07-07-21 15:40:17, Hillf Danton wrote: >> On Mon, 5 Jul 2021 18:23:17 +0200 Jan Kara wrote: >> > >> >Michael Stapelberg has reported that for workload with short big spikes >> >of writes (GCC linker seem to trigger this frequently) the write >> >throughput is heavily underestimated and tends to steadily sink until it >> >reaches zero. This has rather bad impact on writeback throttling >> >(causing stalls). The problem is that writeback throughput estimate gets >> >updated at most once per 200 ms. One update happens early after we >> >submit pages for writeback (at that point writeout of only small >> >fraction of pages is completed and thus observed throughput is tiny). >> >Next update happens only during the next write spike (updates happen >> >only from inode writeback and dirty throttling code) and if that is >> >more than 1s after previous spike, we decide system was idle and just >> >ignore whatever was written until this moment. >> > >> >Fix the problem by making sure writeback throughput estimate is also >> >updated shortly after writeback completes to get reasonable estimate of >> >throughput for spiky workloads. >> > >> >Link: https://lore.kernel.org/lkml/20210617095309.3542373-1-stapelberg+li>nux@xxxxxxxxxx >> >Reported-by: Michael Stapelberg <stapelberg+linux@xxxxxxxxxx> >> >Signed-off-by: Jan Kara <jack@xxxxxxx> >... >> >diff --git a/mm/page-writeback.c b/mm/page-writeback.c >> >index 1fecf8ebadb0..6a99ddca95c0 100644 >> >--- a/mm/page-writeback.c >> >+++ b/mm/page-writeback.c >> >@@ -1346,14 +1346,7 @@ static void __wb_update_bandwidth(struct dirty_thr>ottle_control *gdtc, >> > unsigned long dirtied; >> > unsigned long written; >> > >> >- lockdep_assert_held(&wb->list_lock); >> >- >> >- /* >> >- * rate-limit, only update once every 200ms. >> >- */ >> >- if (elapsed < BANDWIDTH_INTERVAL) >> >- return; >> >> Please leave it as it is if you are not dumping the 200ms rule. > >Well, that could break the delayed updated scheduled after the end of >writeback and for no good reason. The problematic ordering is like: After another look at 2/5, you are cutting the rule, which is worth a seperate patch. > >end writeback on inode1 > queue_delayed_work() - queues delayed work after BANDWIDTH_INTERVAL > >__wb_update_bandwidth() called e.g. from balance_dirty_pages() > wb->bw_time_stamp = now; > >end writeback on inode2 > queue_delayed_work() - does nothing since work is already queued > >delayed work calls __wb_update_bandwidth() - nothing is done since elapsed >< BANDWIDTH_INTERVAL and we may thus miss reflecting writeback of inode2 in >our estimates. Your example says the estimate based on inode2 is torpedoed by a random update, and you are looking to make that estimate meaningful at the cost of breaking the rule - how differet is it to the current one if the estimate is derived from 20ms-elapsed interval at inode2? Is it likely to see another palpablely different result at inode3 from 50ms-elapsed interval? > >> >@@ -2742,6 +2737,11 @@ static void wb_inode_writeback_start(struct bdi_wr>iteback *wb) >> > static void wb_inode_writeback_end(struct bdi_writeback *wb) >> > { >> > atomic_dec(&wb->writeback_inodes); >> >+ /* >> >+ * Make sure estimate of writeback throughput gets >> >+ * updated after writeback completed. >> >+ */ >> >+ queue_delayed_work(bdi_wq, &wb->bw_dwork, BANDWIDTH_INTERVAL); >> > } >> >> This is a bogus estimate - it does not break the 200ms rule but walks >> around it without specifying why 300ms is not good. > >Well, you're right that BANDWIDTH_INTERVAL is somewhat arbitrary here. We >do want some batching of bandwidth updates after writeback completes for >the cases where lots of inodes end their writeback in a quick succession. >I've picked BANDWIDTH_INTERVAL here as that's the batching of other >bandwidth updates as well so it kind of makes sense. I'll add a comment why >BANDWIDTH_INTERVAL is picked here. > > Honza >-- >Jan Kara <jack@xxxxxxxx> >SUSE Labs, CR > >