On Fri, Jan 17, 2025 at 3:53 AM Jan Kara <jack@xxxxxxx> wrote:
>
> On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> > On Thu, Jan 16, 2025 at 3:01 AM Jan Kara <jack@xxxxxxx> wrote:
> > > On Tue 14-01-25 16:50:53, Joanne Koong wrote:
> > > > I would like to propose a discussion topic about improving large
> > > > folio writeback performance. As more filesystems adopt large folios,
> > > > it becomes increasingly important that writeback is made to be as
> > > > performant as possible. There are two areas I'd like to discuss:
> > > >
> > > > == Granularity of dirty pages writeback ==
> > > > Currently, the granularity of writeback is at the folio level. If
> > > > one byte in a folio is dirty, the entire folio will be written back.
> > > > This becomes unscalable for larger folios and significantly degrades
> > > > performance, especially for workloads that employ random writes.
> > > >
> > > > One idea is to track dirty pages at a smaller granularity using a
> > > > 64-bit bitmap stored inside the folio struct, where each bit tracks
> > > > a smaller chunk of the folio (eg for 2 MB folios, each bit would
> > > > track a 32k chunk), and only write back dirty chunks rather than the
> > > > entire folio.
> > >
> > > Yes, this is a known problem and, as Dave pointed out, currently it is
> > > up to the lower layer to handle finer-grained dirtiness tracking. You
> > > can take inspiration from the iomap layer, which already does this, or
> > > you can convert your filesystem to use iomap (the preferred way).
> > >
> > > > == Balancing dirty pages ==
> > > > It was observed that the dirty page balancing logic used in
> > > > balance_dirty_pages() fails to scale for large folios [1]. For
> > > > example, fuse saw write throughput drop by more than half (small
> > > > folios were around 125% faster than large folios) on 1MB block
> > > > sizes, which was attributed to scheduled io waits in the dirty page
> > > > balancing logic. In generic_perform_write(), dirty pages are
> > > > balanced after every write to the page cache by the filesystem. With
> > > > large folios, each write dirties a larger number of pages, which can
> > > > grossly exceed the ratelimit, whereas with small folios each write
> > > > is one page, so pages are balanced more incrementally and adhere
> > > > more closely to the ratelimit. In order to accommodate large folios,
> > > > the dirty page balancing logic likely needs to be reworked.
> > >
> > > I think there are several separate issues here. One is that
> > > folio_account_dirtied() will consider the whole folio as needing
> > > writeback, which is not necessarily the case (e.g. iomap will write
> > > back only the dirty blocks in it). This was OK-ish when pages were 4k
> > > and you were using 1k blocks (an uncommon configuration anyway,
> > > usually you had a 4k block size), but it starts to hurt a lot with 2M
> > > folios, so we might need to find a way to propagate the information
> > > about the really dirty bits into the writeback accounting.
> >
> > Agreed. The only workable solution I see is to have some sort of api
> > similar to filemap_dirty_folio() that takes in the number of pages
> > dirtied as an arg, but maybe there's a better solution.
>
> Yes, something like that I suppose.
>
> > > Another problem *may* be that fast increments to dirtied pages (as we
> > > dirty 512 pages at once instead of the 16 we did in the past) cause
> > > over-reaction in the dirtiness balancing logic and we throttle the
> > > task too much.
> > > The heuristics there try to find the right amount of time to block a
> > > task so that the dirtying speed matches the writeback speed, and it's
> > > plausible that the large increments make this logic oscillate between
> > > two extremes, leading to suboptimal throughput. Also, since this was
> > > observed with FUSE, I believe a significant factor is that FUSE
> > > enables the "strictlimit" feature of the BDI, which makes dirty
> > > throttling more aggressive (generally the amount of allowed dirty
> > > pages is lower). Anyway, these are mostly speculations from my end.
> > > This needs more data to decide what exactly (if anything) needs
> > > tweaking in the dirty throttling logic.
> >
> > I tested this experimentally and you're right, on FUSE this is
> > impacted a lot by the "strictlimit". I didn't see any bottlenecks when
> > strictlimit wasn't enabled on FUSE. AFAICT, the strictlimit affects
> > the dirty throttle control freerun flag (which gets used to determine
> > whether throttling can be skipped) in the balance_dirty_pages() logic.
> > For FUSE, we can't turn off strictlimit for unprivileged servers, but
> > maybe we can make the throttling check more permissive by upping the
> > value of the min_pause calculation in wb_min_pause() for writes that
> > support large folios? As of right now, the current logic makes writing
> > large folios infeasible in FUSE (estimates show around a 75% drop in
> > throughput).
>
> I think tweaking min_pause is the wrong way to do this; I think that is
> just a symptom. Can you run something like:
>
> while true; do
>     cat /sys/kernel/debug/bdi/<fuse-bdi>/stats
>     echo "---------"
>     sleep 1
> done >bdi-debug.txt
>
> while you are writing to the FUSE filesystem and share the output file?
> That should tell us a bit more about what's happening inside the
> writeback throttling. Also, do you somehow configure min/max_ratio for
> the FUSE bdi? You can check in /sys/block/<fuse-bdi>/bdi/{min,max}_ratio .
> I suspect the problem is that the BDI dirty limit does not ramp up
> properly when we increase dirtied pages in large chunks.
This is the debug info I see for FUSE large folio writes where bs=1M and
size=1G:

BdiWriteback: 0 kB
BdiReclaimable: 0 kB
BdiDirtyThresh: 896 kB
DirtyThresh: 359824 kB
BackgroundThresh: 179692 kB
BdiDirtied: 1071104 kB
BdiWritten: 4096 kB
BdiWriteBandwidth: 0 kBps
b_dirty: 0
b_io: 0
b_more_io: 0
b_dirty_time: 0
bdi_list: 1
state: 1
---------
BdiWriteback: 0 kB
BdiReclaimable: 0 kB
BdiDirtyThresh: 3596 kB
DirtyThresh: 359824 kB
BackgroundThresh: 179692 kB
BdiDirtied: 1290240 kB
BdiWritten: 4992 kB
BdiWriteBandwidth: 0 kBps
b_dirty: 0
b_io: 0
b_more_io: 0
b_dirty_time: 0
bdi_list: 1
state: 1
---------
BdiWriteback: 0 kB
BdiReclaimable: 0 kB
BdiDirtyThresh: 3596 kB
DirtyThresh: 359824 kB
BackgroundThresh: 179692 kB
BdiDirtied: 1517568 kB
BdiWritten: 5824 kB
BdiWriteBandwidth: 25692 kBps
b_dirty: 0
b_io: 1
b_more_io: 0
b_dirty_time: 0
bdi_list: 1
state: 7
---------
BdiWriteback: 0 kB
BdiReclaimable: 0 kB
BdiDirtyThresh: 3596 kB
DirtyThresh: 359824 kB
BackgroundThresh: 179692 kB
BdiDirtied: 1747968 kB
BdiWritten: 6720 kB
BdiWriteBandwidth: 0 kBps
b_dirty: 0
b_io: 0
b_more_io: 0
b_dirty_time: 0
bdi_list: 1
state: 1
---------
BdiWriteback: 0 kB
BdiReclaimable: 0 kB
BdiDirtyThresh: 896 kB
DirtyThresh: 359824 kB
BackgroundThresh: 179692 kB
BdiDirtied: 1949696 kB
BdiWritten: 7552 kB
BdiWriteBandwidth: 0 kBps
b_dirty: 0
b_io: 0
b_more_io: 0
b_dirty_time: 0
bdi_list: 1
state: 1
---------
BdiWriteback: 0 kB
BdiReclaimable: 0 kB
BdiDirtyThresh: 3612 kB
DirtyThresh: 361300 kB
BackgroundThresh: 180428 kB
BdiDirtied: 2097152 kB
BdiWritten: 8128 kB
BdiWriteBandwidth: 0 kBps
b_dirty: 0
b_io: 0
b_more_io: 0
b_dirty_time: 0
bdi_list: 1
state: 1
---------

I didn't do anything to configure/change the FUSE bdi min/max_ratio.
This is what I see on my system:

cat /sys/class/bdi/0:52/min_ratio
0
cat /sys/class/bdi/0:52/max_ratio
1

>
> Actually, there's a patch queued in the mm tree that improves the ramping
> up of the bdi dirty limit for strictlimit bdis [1]. It would be nice if
> you could test whether it changes something in the behavior you observe.
> Thanks!
>
>                                                                 Honza
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch

I still see the same results (~230 MiB/s throughput using fio) with this
patch applied, unfortunately.
Here's the debug info I see with this patch (same test scenario as above
on FUSE large folio writes where bs=1M and size=1G):

BdiWriteback: 0 kB
BdiReclaimable: 2048 kB
BdiDirtyThresh: 3588 kB
DirtyThresh: 359132 kB
BackgroundThresh: 179348 kB
BdiDirtied: 51200 kB
BdiWritten: 128 kB
BdiWriteBandwidth: 102400 kBps
b_dirty: 1
b_io: 0
b_more_io: 0
b_dirty_time: 0
bdi_list: 1
state: 5
---------
BdiWriteback: 0 kB
BdiReclaimable: 0 kB
BdiDirtyThresh: 3588 kB
DirtyThresh: 359144 kB
BackgroundThresh: 179352 kB
BdiDirtied: 331776 kB
BdiWritten: 1216 kB
BdiWriteBandwidth: 0 kBps
b_dirty: 0
b_io: 0
b_more_io: 0
b_dirty_time: 0
bdi_list: 1
state: 1
---------
BdiWriteback: 0 kB
BdiReclaimable: 0 kB
BdiDirtyThresh: 3588 kB
DirtyThresh: 359144 kB
BackgroundThresh: 179352 kB
BdiDirtied: 562176 kB
BdiWritten: 2176 kB
BdiWriteBandwidth: 0 kBps
b_dirty: 0
b_io: 0
b_more_io: 0
b_dirty_time: 0
bdi_list: 1
state: 1
---------
BdiWriteback: 0 kB
BdiReclaimable: 0 kB
BdiDirtyThresh: 3588 kB
DirtyThresh: 359144 kB
BackgroundThresh: 179352 kB
BdiDirtied: 792576 kB
BdiWritten: 3072 kB
BdiWriteBandwidth: 0 kBps
b_dirty: 0
b_io: 0
b_more_io: 0
b_dirty_time: 0
bdi_list: 1
state: 1
---------
BdiWriteback: 64 kB
BdiReclaimable: 0 kB
BdiDirtyThresh: 3588 kB
DirtyThresh: 359144 kB
BackgroundThresh: 179352 kB
BdiDirtied: 1026048 kB
BdiWritten: 3904 kB
BdiWriteBandwidth: 0 kBps
b_dirty: 0
b_io: 0
b_more_io: 0
b_dirty_time: 0
bdi_list: 1
state: 1
---------

Thanks,
Joanne

>
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR
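
P.S. To make the bitmap idea from the original proposal a bit more
concrete, here is a rough userspace sketch of the chunk math I have in
mind. This is just an illustration (plain C, not kernel code, and the
helper names are made up): for a 2M folio, a 64-bit bitmap works out to
one bit per 32k chunk, and writeback only needs to touch the chunks whose
bits are set.

/*
 * Illustrative sketch of per-chunk dirty tracking for a 2M folio.
 * Not kernel code: mark_dirty()/writeback_dirty_chunks() are made-up
 * helpers that just demonstrate the offset -> bit mapping.
 */
#include <stdint.h>
#include <stdio.h>

#define FOLIO_SIZE  ((size_t)2 << 20)   /* 2M folio */
#define CHUNK_SIZE  (FOLIO_SIZE / 64)   /* 64-bit bitmap -> 32k per bit */

/* Set the bit for every chunk overlapping [off, off + len). */
static void mark_dirty(uint64_t *bitmap, size_t off, size_t len)
{
        size_t first = off / CHUNK_SIZE;
        size_t last = (off + len - 1) / CHUNK_SIZE;

        for (size_t i = first; i <= last; i++)
                *bitmap |= 1ULL << i;
}

/* Walk the bitmap and "write back" only the dirty chunks. */
static void writeback_dirty_chunks(uint64_t bitmap)
{
        for (size_t i = 0; i < 64; i++) {
                if (bitmap & (1ULL << i))
                        printf("writeback chunk %2zu: bytes [%zu, %zu)\n",
                               i, i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE);
        }
}

int main(void)
{
        uint64_t bitmap = 0;

        /* A 4k random write at offset 1M dirties a single 32k chunk... */
        mark_dirty(&bitmap, (size_t)1 << 20, 4096);
        /* ...and a ~100k write spanning chunk boundaries dirties four. */
        mark_dirty(&bitmap, 60000, 100000);

        writeback_dirty_chunks(bitmap);
        return 0;
}

So a 4k random write into an otherwise clean 2M folio would end up
writing back one 32k chunk instead of the full 2M, which is the win we're
after for random write workloads.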