On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> On Thu, Jan 16, 2025 at 3:01 AM Jan Kara <jack@xxxxxxx> wrote:
> > On Tue 14-01-25 16:50:53, Joanne Koong wrote:
> > > I would like to propose a discussion topic about improving large folio
> > > writeback performance. As more filesystems adopt large folios, it
> > > becomes increasingly important that writeback is made to be as
> > > performant as possible. There are two areas I'd like to discuss:
> > >
> > > == Granularity of dirty pages writeback ==
> > > Currently, the granularity of writeback is at the folio level. If one
> > > byte in a folio is dirty, the entire folio will be written back. This
> > > becomes unscalable for larger folios and significantly degrades
> > > performance, especially for workloads that employ random writes.
> > >
> > > One idea is to track dirty pages at a smaller granularity using a
> > > 64-bit bitmap stored inside the folio struct, where each bit tracks a
> > > smaller chunk of the folio (e.g. for 2 MB folios, each bit would track
> > > 32 KB worth of pages), and only write back dirty chunks rather than
> > > the entire folio.
> >
> > Yes, this is a known problem and, as Dave pointed out, currently it is
> > up to the lower layer to handle dirtiness at a finer granularity. You
> > can take inspiration from the iomap layer that already does this, or
> > you can convert your filesystem to use iomap (the preferred way).
> >
> > > == Balancing dirty pages ==
> > > It was observed that the dirty page balancing logic used in
> > > balance_dirty_pages() fails to scale for large folios [1]. For
> > > example, fuse saw around a 125% drop in throughput for writes when
> > > using large folios vs small folios on 1MB block sizes, which was
> > > attributed to scheduled I/O waits in the dirty page balancing logic.
> > > In generic_perform_write(), dirty pages are balanced after every write
> > > to the page cache by the filesystem. With large folios, each write
> > > dirties a larger number of pages which can grossly exceed the
> > > ratelimit, whereas with small folios each write is one page and so
> > > pages are balanced more incrementally and adhere more closely to the
> > > ratelimit. In order to accommodate large folios, the logic for
> > > balancing dirty pages likely needs to be reworked.
> >
> > I think there are several separate issues here. One is that
> > folio_account_dirtied() will consider the whole folio as needing
> > writeback, which is not necessarily the case (as e.g. iomap will write
> > back only the dirty blocks in it). This was OK-ish when pages were 4k
> > and you were using 1k blocks (which was an uncommon configuration
> > anyway, usually you had a 4k block size), but it starts to hurt a lot
> > with 2M folios, so we might need to find a way to propagate the
> > information about the really dirty bits into writeback accounting.
>
> Agreed. The only workable solution I see is to have some sort of API
> similar to filemap_dirty_folio() that takes in the number of pages
> dirtied as an arg, but maybe there's a better solution.

Yes, something like that I suppose.
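Just to make the granularity concrete, a rough userspace sketch of the kind
of bookkeeping being discussed could look like the code below. This is not
existing kernel code and all the names (toy_folio, toy_folio_mark_dirty,
the constants) are made up; the point is only the arithmetic of one bit per
32 KB chunk of a 2 MB folio, and that the count of newly set bits is the
delta a filemap_dirty_folio()-style hook could report to the accounting:

/*
 * Toy illustration of a per-folio dirty bitmap. Not kernel code; all
 * names are hypothetical. A 2 MB folio split across a 64-bit bitmap
 * gives one bit per 32 KB chunk, and the number of bits that flip from
 * clean to dirty is the delta that dirty accounting would care about.
 */
#include <stdint.h>
#include <stdio.h>

#define FOLIO_SIZE	(2UL << 20)			/* 2 MB folio for the example */
#define DIRTY_BITS	64				/* one 64-bit bitmap per folio */
#define CHUNK_SIZE	(FOLIO_SIZE / DIRTY_BITS)	/* 32 KB per bit */

struct toy_folio {
	uint64_t dirty_bitmap;				/* bit n set == chunk n is dirty */
};

/*
 * Mark [off, off + len) within the folio dirty. Returns how many chunks
 * went from clean to dirty, i.e. what an accounting hook that takes the
 * amount dirtied as an argument would be told (in 32 KB chunks, not
 * whole folios).
 */
static unsigned int toy_folio_mark_dirty(struct toy_folio *folio,
					 size_t off, size_t len)
{
	unsigned int first = off / CHUNK_SIZE;
	unsigned int last = (off + len - 1) / CHUNK_SIZE;
	unsigned int newly_dirty = 0;

	for (unsigned int i = first; i <= last; i++) {
		uint64_t bit = 1ULL << i;

		if (!(folio->dirty_bitmap & bit)) {
			folio->dirty_bitmap |= bit;
			newly_dirty++;
		}
	}
	return newly_dirty;
}

int main(void)
{
	struct toy_folio folio = { 0 };

	/* A 100-byte write at offset 1 MB dirties exactly one 32 KB chunk. */
	printf("newly dirty chunks: %u\n",
	       toy_folio_mark_dirty(&folio, 1UL << 20, 100));

	/* A 64 KB write starting mid-chunk spans three 32 KB chunks. */
	printf("newly dirty chunks: %u\n",
	       toy_folio_mark_dirty(&folio, CHUNK_SIZE / 2, 64 * 1024));

	/* Rewriting an already dirty range adds nothing to the accounting. */
	printf("newly dirty chunks: %u\n",
	       toy_folio_mark_dirty(&folio, 1UL << 20, 100));
	return 0;
}

Obviously the real thing would have to live with the folio / iomap state,
deal with locking, block sizes, etc., so take it only as an illustration of
the granularity.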
> > Another problem *may* be that fast increments to dirtied pages (as we
> > dirty 512 pages at once instead of the 16 we did in the past) cause an
> > over-reaction in the dirtiness balancing logic and we throttle the task
> > too much. The heuristics there try to find the right amount of time to
> > block a task so that the dirtying speed matches the writeback speed, and
> > it's plausible that the large increments make this logic oscillate
> > between two extremes, leading to suboptimal throughput. Also, since this
> > was observed with FUSE, I believe a significant factor is that FUSE
> > enables the "strictlimit" feature of the BDI, which makes dirty
> > throttling more aggressive (generally the amount of allowed dirty pages
> > is lower). Anyway, these are mostly speculations from my end. This needs
> > more data to decide what exactly (if anything) needs tweaking in the
> > dirty throttling logic.
>
> I tested this experimentally and you're right, on FUSE this is
> impacted a lot by the "strictlimit". I didn't see any bottlenecks when
> strictlimit wasn't enabled on FUSE. AFAICT, the strictlimit affects
> the dirty throttle control freerun flag (which gets used to determine
> whether throttling can be skipped) in the balance_dirty_pages() logic.
> For FUSE, we can't turn off strictlimit for unprivileged servers, but
> maybe we can make the throttling check more permissive by upping the
> value of the min_pause calculation in wb_min_pause() for writes that
> support large folios? As of right now, the current logic makes writing
> large folios unfeasible in FUSE (estimates show around a 75% drop in
> throughput).

I think tweaking min_pause is the wrong way to do this. I think that is
just a symptom. Can you run something like:

  while true; do
    cat /sys/kernel/debug/bdi/<fuse-bdi>/stats
    echo "---------"
    sleep 1
  done >bdi-debug.txt

while you are writing to the FUSE filesystem and share the output file?
That should tell us a bit more about what's happening inside the
writeback throttling.

Also, do you somehow configure min/max_ratio for the FUSE bdi? You can
check in /sys/block/<fuse-bdi>/bdi/{min,max}_ratio. I suspect the
problem is that the BDI dirty limit does not ramp up properly when we
increase dirtied pages in large chunks. Actually, there's a patch queued
in the mm tree that improves the ramping up of the bdi dirty limit for
strictlimit bdis [1]. It would be nice if you could test whether it
changes something in the behavior you observe. Thanks!

								Honza

[1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch

--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR