Thanks for taking a look! Comments inline:

On Tue, 22 Jun 2021 at 14:12, Jan Kara <jack@xxxxxxx> wrote:
>
> On Mon 21-06-21 11:20:10, Michael Stapelberg wrote:
> > Hey Miklos
> >
> > On Fri, 18 Jun 2021 at 16:42, Miklos Szeredi <miklos@xxxxxxxxxx> wrote:
> > >
> > > On Fri, 18 Jun 2021 at 10:31, Michael Stapelberg
> > > <stapelberg+linux@xxxxxxxxxx> wrote:
> > >
> > > > Maybe, but I don’t have the expertise, motivation or time to
> > > > investigate this any further, let alone commit to get it done.
> > > > During our previous discussion I got the impression that nobody else
> > > > had any cycles for this either:
> > > > https://lore.kernel.org/linux-fsdevel/CANnVG6n=ySfe1gOr=0ituQidp56idGARDKHzP0hv=ERedeMrMA@xxxxxxxxxxxxxx/
> > > >
> > > > Have you had a look at the China LSF report at
> > > > http://bardofschool.blogspot.com/2011/?
> > > > The author of the heuristic has spent significant effort and time
> > > > coming up with what we currently have in the kernel:
> > > >
> > > > """
> > > > Fengguang said he draw more than 10K performance graphs and read even
> > > > more in the past year.
> > > > """
> > > >
> > > > This implies that making changes to the heuristic will not be a quick fix.
> > >
> > > Having a piece of kernel code sitting there that nobody is willing to
> > > fix is certainly not a great situation to be in.
> >
> > Agreed.
> >
> > > And introducing band aids is not going improve the above situation,
> > > more likely it will prolong it even further.
> >
> > Sounds like “Perfect is the enemy of good” to me: you’re looking for a
> > perfect hypothetical solution,
> > whereas we have a known-working low risk fix for a real problem.
> >
> > Could we find a solution where medium-/long-term, the code in question
> > is improved,
> > perhaps via a Summer Of Code project or similar community efforts,
> > but until then, we apply the patch at hand?
> >
> > As I mentioned, I think adding min/max limits can be useful regardless
> > of how the heuristic itself changes.
> >
> > If that turns out to be incorrect or undesired, we can still turn the
> > knobs into a no-op, if removal isn’t an option.
>
> Well, removal of added knobs is more or less out of question as it can
> break some userspace. Similarly making them no-op is problematic unless we
> are pretty certain it cannot break some existing setup. That's why we have
> to think twice (or better three times ;) before adding any knobs. Also
> honestly the knobs you suggest will be pretty hard to tune when there are
> multiple cgroups with writeback control involved (which can be affected by
> the same problems you observe as well). So I agree with Miklos that this is
> not the right way to go. Speaking of tunables, did you try tuning
> /sys/devices/virtual/bdi/<fuse-bdi>/min_ratio? I suspect that may
> workaround your problems...

Back then, I did try the various tunables (vm.dirty_ratio and
vm.dirty_background_ratio on the global level,
/sys/class/bdi/<bdi>/{min,max}_ratio on the file system level), and they
had no observable effect on the problem at all in my tests.
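
For reference, all of these knobs are plain sysctl/sysfs files, so they are
easy to poke at from a small program. A minimal sketch of reading them back
and raising a BDI's min_ratio could look like the following; the BDI name
"0:93" is only a placeholder (a FUSE mount's BDI shows up under
/sys/class/bdi/ named after the mount's anonymous device number), and
writing the ratio files requires root:

/* Sketch only: dump the writeback knobs discussed above and bump the
 * per-BDI min_ratio. Replace "0:93" with the real BDI name of the
 * FUSE mount in question. */
#include <stdio.h>

static void show(const char *path)
{
        char buf[64] = "";
        FILE *f = fopen(path, "r");

        if (!f) {
                perror(path);
                return;
        }
        if (fgets(buf, sizeof(buf), f))
                printf("%-40s %s", path, buf);
        fclose(f);
}

int main(void)
{
        FILE *f;

        show("/proc/sys/vm/dirty_ratio");             /* global limits */
        show("/proc/sys/vm/dirty_background_ratio");
        show("/sys/class/bdi/0:93/min_ratio");        /* per-BDI limits */
        show("/sys/class/bdi/0:93/max_ratio");

        /* Reserve ~10% of the global dirty threshold for this BDI. */
        f = fopen("/sys/class/bdi/0:93/min_ratio", "w");
        if (!f) {
                perror("min_ratio");
                return 1;
        }
        fprintf(f, "10\n");
        fclose(f);
        return 0;
}
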
>
> Looking into your original report and tracing you did (thanks for that,
> really useful), it seems that the problem is that writeback bandwidth is
> updated at most every 200ms (more frequent calls are just ignored) and are
> triggered only from balance_dirty_pages() (happen when pages are dirtied) and
> inode writeback code so if the workload tends to have short spikes of activity
> and extended periods of quiet time, then writeback bandwidth may indeed be
> seriously miscomputed because we just won't update writeback throughput
> after most of writeback has happened as you observed.
>
> I think the fix for this can be relatively simple. We just need to make
> sure we update writeback bandwidth reasonably quickly after the IO
> finishes. I'll write a patch and see if it helps.

Thank you! Please keep us posted.
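
In case it is useful to anyone else hitting this, here is a rough userspace
model of the rate limiting described above. It is purely illustrative (the
real logic lives in mm/page-writeback.c and is more involved), but it shows
why a burst that completes within a single ~200ms window is only folded
into the estimate at the next dirtying event, over a mostly idle period:

/* Toy model only -- NOT the kernel code: bandwidth samples are folded in
 * only when at least 200ms have passed since the last accepted update. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define BANDWIDTH_INTERVAL_MS 200       /* the ~200ms interval mentioned above */

struct bw_estimator {
        uint64_t last_update_ms;        /* time of the last accepted update */
        uint64_t pending_bytes;         /* writeback completed since then */
        uint64_t bandwidth_bps;         /* current estimate */
};

/* Returns true if the estimate was refreshed, false if the call came
 * less than 200ms after the previous accepted update. */
static bool bw_maybe_update(struct bw_estimator *e, uint64_t now_ms,
                            uint64_t newly_written)
{
        uint64_t elapsed_ms;

        e->pending_bytes += newly_written;
        if (now_ms - e->last_update_ms < BANDWIDTH_INTERVAL_MS)
                return false;           /* too soon: estimate left stale */

        elapsed_ms = now_ms - e->last_update_ms;
        e->bandwidth_bps = e->pending_bytes * 1000 / elapsed_ms;
        e->pending_bytes = 0;
        e->last_update_ms = now_ms;
        return true;
}

int main(void)
{
        struct bw_estimator e = { 0 };
        uint64_t t;

        /* A burst of 1 MiB completions every 10ms for 150ms: every call
         * lands inside one 200ms window, so the estimate never moves
         * while writeback is actually happening... */
        for (t = 10; t <= 150; t += 10)
                bw_maybe_update(&e, t, 1 << 20);
        printf("estimate right after burst:   %llu B/s\n",
               (unsigned long long)e.bandwidth_bps);

        /* ...only the next event long after the burst (e.g. the next time
         * pages are dirtied) refreshes it, computed over a mostly idle
         * window, so the result is far too low. */
        bw_maybe_update(&e, 5000, 0);
        printf("estimate at next dirty event: %llu B/s\n",
               (unsigned long long)e.bandwidth_bps);
        return 0;
}

Compiled and run, this reports roughly 3 MB/s for a burst that actually
wrote 15 MiB in 150ms (about 100 MB/s), which is the shape of the
underestimate described in the report.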