On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> On Thu, Jan 16, 2025 at 3:01 AM Jan Kara <jack@xxxxxxx> wrote:
> > On Tue 14-01-25 16:50:53, Joanne Koong wrote:
> > > I would like to propose a discussion topic about improving large folio
> > > writeback performance. As more filesystems adopt large folios, it
> > > becomes increasingly important that writeback is made to be as
> > > performant as possible. There are two areas I'd like to discuss:
> > >
> > > == Granularity of dirty pages writeback ==
> > > Currently, the granularity of writeback is at the folio level. If one
> > > byte in a folio is dirty, the entire folio will be written back. This
> > > becomes unscalable for larger folios and significantly degrades
> > > performance, especially for workloads that employ random writes.
> > >
> > > One idea is to track dirty pages at a smaller granularity using a
> > > 64-bit bitmap stored inside the folio struct, where each bit tracks a
> > > smaller chunk of the folio (e.g. for 2 MB folios, each bit would track
> > > 32 KB worth of pages), and only write back dirty chunks rather than
> > > the entire folio.
> >
> > Yes, this is a known problem and, as Dave pointed out, currently it is
> > up to the lower layer to handle dirtiness at a finer granularity. You
> > can take inspiration from the iomap layer that already does this, or
> > you can convert your filesystem to use iomap (the preferred way).
> >
> > > == Balancing dirty pages ==
> > > It was observed that the dirty page balancing logic used in
> > > balance_dirty_pages() fails to scale for large folios [1]. For
> > > example, fuse saw around a 125% drop in throughput for writes when
> > > using large folios vs small folios on 1MB block sizes, which was
> > > attributed to scheduled I/O waits in the dirty page balancing logic.
> > > In generic_perform_write(), dirty pages are balanced after every write
> > > to the page cache by the filesystem. With large folios, each write
> > > dirties a larger number of pages which can grossly exceed the
> > > ratelimit, whereas with small folios each write is one page and so
> > > pages are balanced more incrementally and adhere more closely to the
> > > ratelimit. In order to accommodate large folios, the logic for
> > > balancing dirty pages likely needs to be reworked.
> >
> > I think there are several separate issues here. One is that
> > folio_account_dirtied() will consider the whole folio as needing
> > writeback, which is not necessarily the case (as e.g. iomap will write
> > back only the dirty blocks in it). This was OK-ish when pages were 4k
> > and you were using 1k blocks (which was an uncommon configuration
> > anyway, usually you had a 4k block size), but it starts to hurt a lot
> > with 2M folios, so we might need to find a way to propagate the
> > information about the really dirty bits into writeback accounting.
>
> Agreed. The only workable solution I see is to have some sort of API
> similar to filemap_dirty_folio() that takes in the number of pages
> dirtied as an arg, but maybe there's a better solution.

Yes, something like that I suppose.
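Just to make the granularity concrete, a rough userspace sketch of the kind
of bookkeeping being discussed could look like the code below. This is not
existing kernel code and all the names (toy_folio, toy_folio_mark_dirty,
the constants) are made up; the point is only the arithmetic of one bit per
32 KB chunk of a 2 MB folio, and that the count of newly set bits is the
delta a filemap_dirty_folio()-style hook could report to the accounting:

/*
 * Toy illustration of a per-folio dirty bitmap. Not kernel code; all
 * names are hypothetical. A 2 MB folio split across a 64-bit bitmap
 * gives one bit per 32 KB chunk, and the number of bits that flip from
 * clean to dirty is the delta that dirty accounting would care about.
 */
#include <stdint.h>
#include <stdio.h>

#define FOLIO_SIZE	(2UL << 20)			/* 2 MB folio for the example */
#define DIRTY_BITS	64				/* one 64-bit bitmap per folio */
#define CHUNK_SIZE	(FOLIO_SIZE / DIRTY_BITS)	/* 32 KB per bit */

struct toy_folio {
	uint64_t dirty_bitmap;				/* bit n set == chunk n is dirty */
};

/*
 * Mark [off, off + len) within the folio dirty. Returns how many chunks
 * went from clean to dirty, i.e. what an accounting hook that takes the
 * amount dirtied as an argument would be told (in 32 KB chunks, not
 * whole folios).
 */
static unsigned int toy_folio_mark_dirty(struct toy_folio *folio,
					 size_t off, size_t len)
{
	unsigned int first = off / CHUNK_SIZE;
	unsigned int last = (off + len - 1) / CHUNK_SIZE;
	unsigned int newly_dirty = 0;

	for (unsigned int i = first; i <= last; i++) {
		uint64_t bit = 1ULL << i;

		if (!(folio->dirty_bitmap & bit)) {
			folio->dirty_bitmap |= bit;
			newly_dirty++;
		}
	}
	return newly_dirty;
}

int main(void)
{
	struct toy_folio folio = { 0 };

	/* A 100-byte write at offset 1 MB dirties exactly one 32 KB chunk. */
	printf("newly dirty chunks: %u\n",
	       toy_folio_mark_dirty(&folio, 1UL << 20, 100));

	/* A 64 KB write starting mid-chunk spans three 32 KB chunks. */
	printf("newly dirty chunks: %u\n",
	       toy_folio_mark_dirty(&folio, CHUNK_SIZE / 2, 64 * 1024));

	/* Rewriting an already dirty range adds nothing to the accounting. */
	printf("newly dirty chunks: %u\n",
	       toy_folio_mark_dirty(&folio, 1UL << 20, 100));
	return 0;
}

Obviously the real thing would have to live with the folio / iomap state,
deal with locking, block sizes, etc., so take it only as an illustration of
the granularity.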
> > Another problem *may* be that fast increments to dirtied pages (as we
> > dirty 512 pages at once instead of the 16 we did in the past) cause an
> > over-reaction in the dirtiness balancing logic and we throttle the task
> > too much. The heuristics there try to find the right amount of time to
> > block a task so that the dirtying speed matches the writeback speed, and
> > it's plausible that the large increments make this logic oscillate
> > between two extremes, leading to suboptimal throughput. Also, since this
> > was observed with FUSE, I believe a significant factor is that FUSE
> > enables the "strictlimit" feature of the BDI, which makes dirty
> > throttling more aggressive (generally the amount of allowed dirty pages
> > is lower). Anyway, these are mostly speculations from my end. This needs
> > more data to decide what exactly (if anything) needs tweaking in the
> > dirty throttling logic.
>
> I tested this experimentally and you're right, on FUSE this is
> impacted a lot by the "strictlimit". I didn't see any bottlenecks when
> strictlimit wasn't enabled on FUSE. AFAICT, the strictlimit affects
> the dirty throttle control freerun flag (which gets used to determine
> whether throttling can be skipped) in the balance_dirty_pages() logic.
> For FUSE, we can't turn off strictlimit for unprivileged servers, but
> maybe we can make the throttling check more permissive by upping the
> value of the min_pause calculation in wb_min_pause() for writes that
> support large folios? As of right now, the current logic makes writing
> large folios unfeasible in FUSE (estimates show around a 75% drop in
> throughput).

I think tweaking min_pause is the wrong way to do this. I think that is
just a symptom. Can you run something like:

  while true; do
    cat /sys/kernel/debug/bdi/<fuse-bdi>/stats
    echo "---------"
    sleep 1
  done >bdi-debug.txt

while you are writing to the FUSE filesystem and share the output file?
That should tell us a bit more about what's happening inside the
writeback throttling.

Also, do you somehow configure min/max_ratio for the FUSE bdi? You can
check in /sys/block/<fuse-bdi>/bdi/{min,max}_ratio. I suspect the
problem is that the BDI dirty limit does not ramp up properly when we
increase dirtied pages in large chunks. Actually, there's a patch queued
in the mm tree that improves the ramping up of the bdi dirty limit for
strictlimit bdis [1]. It would be nice if you could test whether it
changes something in the behavior you observe. Thanks!

								Honza

[1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch

--
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR