Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance

On Fri, Jan 17, 2025 at 3:53 AM Jan Kara <jack@xxxxxxx> wrote:
>
> On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> > On Thu, Jan 16, 2025 at 3:01 AM Jan Kara <jack@xxxxxxx> wrote:
> > > On Tue 14-01-25 16:50:53, Joanne Koong wrote:
> > > > I would like to propose a discussion topic about improving large folio
> > > > writeback performance. As more filesystems adopt large folios, it
> > > > becomes increasingly important that writeback is made to be as
> > > > performant as possible. There are two areas I'd like to discuss:
> > > >
> > > > == Granularity of dirty pages writeback ==
> > > > Currently, the granularity of writeback is at the folio level. If one
> > > > byte in a folio is dirty, the entire folio will be written back. This
> > > > becomes unscalable for larger folios and significantly degrades
> > > > performance, especially for workloads that employ random writes.
> > > >
> > > > One idea is to track dirty pages at a smaller granularity using a
> > > > 64-bit bitmap stored inside the folio struct, where each bit tracks
> > > > a smaller chunk of the folio (e.g. for 2 MB folios, each bit would
> > > > track a 32 KB chunk), and only write back dirty chunks rather than
> > > > the entire folio.
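
To make this concrete, here is a rough userspace sketch of the kind of
per-folio dirty bitmap I have in mind (the struct and helper names are made
up for illustration; this is not an existing kernel API):

#include <stdint.h>
#include <stdio.h>
#include <stddef.h>

#define FOLIO_SIZE   (2UL * 1024 * 1024)   /* example: 2 MB folio */
#define CHUNK_SIZE   (FOLIO_SIZE / 64)     /* 64-bit bitmap -> 32 KB per bit */

/* Hypothetical per-folio dirty state, not the real struct folio. */
struct folio_dirty_state {
        uint64_t dirty_bitmap;
};

/* Mark the chunks covering [off, off + len) dirty. */
static void mark_range_dirty(struct folio_dirty_state *fds,
                             size_t off, size_t len)
{
        size_t first = off / CHUNK_SIZE;
        size_t last = (off + len - 1) / CHUNK_SIZE;

        for (size_t i = first; i <= last; i++)
                fds->dirty_bitmap |= 1ULL << i;
}

/* Writeback would walk only the dirty chunks instead of the whole folio. */
static void writeback_dirty_chunks(const struct folio_dirty_state *fds)
{
        for (size_t i = 0; i < 64; i++) {
                if (fds->dirty_bitmap & (1ULL << i))
                        printf("write back chunk %zu: bytes [%zu, %zu)\n",
                               i, i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE);
        }
}

int main(void)
{
        struct folio_dirty_state fds = { 0 };

        /* A 1-byte write at offset 1 MB dirties a single 32 KB chunk. */
        mark_range_dirty(&fds, 1024 * 1024, 1);
        writeback_dirty_chunks(&fds);
        return 0;
}
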
> > >
> > > Yes, this is a known problem and, as Dave pointed out, it is currently
> > > up to the lower layer to handle finer-grained dirtiness tracking. You
> > > can take inspiration from the iomap layer, which already does this, or
> > > you can convert your filesystem to use iomap (the preferred way).
> > >
> > > > == Balancing dirty pages ==
> > > > It was observed that the dirty page balancing logic used in
> > > > balance_dirty_pages() fails to scale for large folios [1]. For
> > > > example, fuse writes were roughly 125% slower when using large
> > > > folios vs small folios at a 1MB block size, which was attributed to
> > > > scheduled io waits in the dirty page balancing logic. In
> > > > generic_perform_write(), dirty pages are balanced after every write
> > > > to the page cache by the filesystem. With large folios, each write
> > > > dirties a larger number of pages, which can grossly exceed the
> > > > ratelimit, whereas with small folios each write dirties one page, so
> > > > balancing happens more incrementally and adheres more closely to the
> > > > ratelimit. To accommodate large folios, the dirty page balancing
> > > > logic likely needs to be reworked.
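
As a back-of-the-envelope illustration of the overshoot (the 32-page pause
budget below is a made-up stand-in for the per-task ratelimit, which the
kernel computes dynamically):

#include <stdio.h>

int main(void)
{
        const unsigned long page_size = 4096;
        /* Made-up per-task "dirty this many pages, then pause" budget. */
        const unsigned long pause_budget_pages = 32;
        const unsigned long folio_sizes[] = { 4096, 2UL * 1024 * 1024 };

        for (int i = 0; i < 2; i++) {
                unsigned long pages_per_write = folio_sizes[i] / page_size;

                printf("folio size %7lu bytes: one write dirties %3lu page(s), "
                       "%4.0f%% of the pause budget\n",
                       folio_sizes[i], pages_per_write,
                       100.0 * pages_per_write / pause_budget_pages);
        }
        return 0;
}

Under this assumed budget a small-folio write stays comfortably within the
limit, while a single 2 MB folio write overshoots it 16x before the balancing
code runs again.
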
> > >
> > > I think there are several separate issues here. One is that
> > > folio_account_dirtied() will account the whole folio as needing
> > > writeback, which is not necessarily the case (e.g. iomap will write
> > > back only the dirty blocks in it). This was OKish when pages were 4k
> > > and you were using 1k blocks (an uncommon configuration anyway, usually
> > > you had a 4k block size), but it starts to hurt a lot with 2M folios,
> > > so we might need to find a way to propagate the information about which
> > > blocks are really dirty into the writeback accounting.
> >
> > Agreed. The only workable solution I see is to have some sort of API
> > similar to filemap_dirty_folio() that takes the number of pages dirtied
> > as an argument, but maybe there's a better solution.
>
> Yes, something like that I suppose.
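
For concreteness, the kind of signature I'm imagining is something along
these lines (filemap_dirty_folio_pages() is a hypothetical name, not an
existing function):

/*
 * Hypothetical variant of filemap_dirty_folio() that lets the filesystem
 * report how many pages in the folio actually became dirty, so the
 * writeback accounting can charge only those pages instead of
 * folio_nr_pages(folio).
 */
bool filemap_dirty_folio_pages(struct address_space *mapping,
                               struct folio *folio,
                               unsigned int nr_dirtied);

/*
 * Example: a filesystem that dirtied a single 32 KB chunk of a 2 MB folio
 * would pass nr_dirtied = 8 rather than having all 512 pages accounted.
 */
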
>
> > > Another problem *may* be that fast increments to the dirtied page
> > > counts (as we now dirty 512 pages at once instead of the 16 we did in
> > > the past) cause an over-reaction in the dirtiness balancing logic and
> > > we throttle the task too much. The heuristics there try to find the
> > > right amount of time to block a task so that the dirtying speed matches
> > > the writeback speed, and it's plausible that the large increments make
> > > this logic oscillate between two extremes, leading to suboptimal
> > > throughput. Also, since this was observed with FUSE, I believe a
> > > significant factor is that FUSE enables the "strictlimit" feature of
> > > the BDI, which makes dirty throttling more aggressive (generally the
> > > amount of allowed dirty pages is lower). Anyway, these are mostly
> > > speculations on my end. This needs more data to decide what exactly
> > > (if anything) needs tweaking in the dirty throttling logic.
> >
> > I tested this experimentally and you're right, on FUSE this is heavily
> > impacted by the "strictlimit". I didn't see any bottlenecks when
> > strictlimit wasn't enabled on FUSE. AFAICT, strictlimit affects the
> > dirty throttle control freerun flag (which is used to determine whether
> > throttling can be skipped) in the balance_dirty_pages() logic. For FUSE,
> > we can't turn off strictlimit for unprivileged servers, but maybe we can
> > make the throttling check more permissive by upping the value of the
> > min_pause calculation in wb_min_pause() for writes that support large
> > folios? As of right now, the current logic makes writing large folios
> > infeasible in FUSE (estimates show around a 75% drop in throughput).
>
> I think tweaking min_pause is the wrong way to do this; that is just a
> symptom. Can you run something like:
>
> while true; do
>         cat /sys/kernel/debug/bdi/<fuse-bdi>/stats
>         echo "---------"
>         sleep 1
> done >bdi-debug.txt
>
> while you are writing to the FUSE filesystem and share the output file?
> That should tell us a bit more about what's happening inside the writeback
> throttling. Also, do you somehow configure min/max_ratio for the FUSE bdi?
> You can check /sys/class/bdi/<fuse-bdi>/{min,max}_ratio. I suspect the
> problem is that the BDI dirty limit does not ramp up properly when we
> increase dirtied pages in large chunks.

This is the debug info I see for FUSE large folio writes where bs=1M
and size=1G:


BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:            896 kB
DirtyThresh:            359824 kB
BackgroundThresh:       179692 kB
BdiDirtied:            1071104 kB
BdiWritten:               4096 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3596 kB
DirtyThresh:            359824 kB
BackgroundThresh:       179692 kB
BdiDirtied:            1290240 kB
BdiWritten:               4992 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3596 kB
DirtyThresh:            359824 kB
BackgroundThresh:       179692 kB
BdiDirtied:            1517568 kB
BdiWritten:               5824 kB
BdiWriteBandwidth:       25692 kBps
b_dirty:                     0
b_io:                        1
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       7
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3596 kB
DirtyThresh:            359824 kB
BackgroundThresh:       179692 kB
BdiDirtied:            1747968 kB
BdiWritten:               6720 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:            896 kB
DirtyThresh:            359824 kB
BackgroundThresh:       179692 kB
BdiDirtied:            1949696 kB
BdiWritten:               7552 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3612 kB
DirtyThresh:            361300 kB
BackgroundThresh:       180428 kB
BdiDirtied:            2097152 kB
BdiWritten:               8128 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------


I didn't do anything to configure/change the FUSE bdi min/max_ratio.
This is what I see on my system:

cat /sys/class/bdi/0:52/min_ratio
0
cat /sys/class/bdi/0:52/max_ratio
1
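
To connect this back to the strictlimit/freerun discussion above: as far as I
can tell, dirty_freerun_ceiling() is just the midpoint of the dirty and
background thresholds, and with strictlimit the per-bdi thresholds are checked
as well. BdiDirtyThresh (~3596 kB) also looks like roughly 1% of DirtyThresh,
which I assume comes from max_ratio=1. Plugging the numbers from the dumps
above into a quick sketch (the per-bdi background threshold below is an
assumption, scaled by the global background/dirty ratio):

#include <stdio.h>

int main(void)
{
        /* Numbers taken from the bdi debug output above (in kB). */
        const unsigned long dirty_thresh = 359824;
        const unsigned long background_thresh = 179692;
        const unsigned long bdi_dirty_thresh = 3596;
        /*
         * Assumption: the per-bdi background threshold scales with the
         * global background/dirty ratio (roughly one half here).
         */
        const unsigned long bdi_bg_thresh =
                bdi_dirty_thresh * background_thresh / dirty_thresh;

        /* freerun ceiling = midpoint of dirty and background thresholds */
        unsigned long global_freerun = (dirty_thresh + background_thresh) / 2;
        unsigned long bdi_freerun = (bdi_dirty_thresh + bdi_bg_thresh) / 2;

        printf("global freerun ceiling: %lu kB\n", global_freerun);
        printf("per-bdi freerun ceiling: %lu kB (one 2 MB folio = 2048 kB)\n",
               bdi_freerun);
        return 0;
}

If that reading is right, a single 2 MB folio write consumes most of the
per-bdi freerun headroom, so nearly every write falls through to the
throttling path.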


>
> Actually, there's a patch queued in the mm tree that improves the ramping up
> of the bdi dirty limit for strictlimit bdis [1]. It would be nice if you
> could test whether it changes the behavior you observe. Thanks!
>
>                                                                 Honza
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch

I still see the same results (~230 MiB/s throughput using fio) with
this patch applied, unfortunately. Here's the debug info I see with
this patch (same test scenario as above on FUSE large folio writes
where bs=1M and size=1G):

BdiWriteback:                0 kB
BdiReclaimable:           2048 kB
BdiDirtyThresh:           3588 kB
DirtyThresh:            359132 kB
BackgroundThresh:       179348 kB
BdiDirtied:              51200 kB
BdiWritten:                128 kB
BdiWriteBandwidth:      102400 kBps
b_dirty:                     1
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       5
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3588 kB
DirtyThresh:            359144 kB
BackgroundThresh:       179352 kB
BdiDirtied:             331776 kB
BdiWritten:               1216 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3588 kB
DirtyThresh:            359144 kB
BackgroundThresh:       179352 kB
BdiDirtied:             562176 kB
BdiWritten:               2176 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:                0 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3588 kB
DirtyThresh:            359144 kB
BackgroundThresh:       179352 kB
BdiDirtied:             792576 kB
BdiWritten:               3072 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------
BdiWriteback:               64 kB
BdiReclaimable:              0 kB
BdiDirtyThresh:           3588 kB
DirtyThresh:            359144 kB
BackgroundThresh:       179352 kB
BdiDirtied:            1026048 kB
BdiWritten:               3904 kB
BdiWriteBandwidth:           0 kBps
b_dirty:                     0
b_io:                        0
b_more_io:                   0
b_dirty_time:                0
bdi_list:                    1
state:                       1
---------


Thanks,
Joanne
>
> --
> Jan Kara <jack@xxxxxxxx>
> SUSE Labs, CR




