Re: [Lsf-pc] [LSF/MM/BPF TOPIC] Improving large folio writeback performance

On Tue 21-01-25 16:29:57, Joanne Koong wrote:
> On Mon, Jan 20, 2025 at 2:42 PM Jan Kara <jack@xxxxxxx> wrote:
> > On Fri 17-01-25 14:45:01, Joanne Koong wrote:
> > > On Fri, Jan 17, 2025 at 3:53 AM Jan Kara <jack@xxxxxxx> wrote:
> > > > On Thu 16-01-25 15:38:54, Joanne Koong wrote:
> > > > I think tweaking min_pause is a wrong way to do this. I think that is just a
> > > > symptom. Can you run something like:
> > > >
> > > > while true; do
> > > >         cat /sys/kernel/debug/bdi/<fuse-bdi>/stats
> > > >         echo "---------"
> > > >         sleep 1
> > > > done >bdi-debug.txt
> > > >
> > > > while you are writing to the FUSE filesystem and share the output file?
> > > > That should tell us a bit more about what's happening inside the writeback
> > > > throttling. Also do you somehow configure min/max_ratio for the FUSE bdi?
> > > > You can check in /sys/block/<fuse-bdi>/bdi/{min,max}_ratio . I suspect the
> > > > problem is that the BDI dirty limit does not ramp up properly when we
> > > > increase dirtied pages in large chunks.
> > >
> > > This is the debug info I see for FUSE large folio writes where bs=1M
> > > and size=1G:
> > >
> > >
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:            896 kB
> > > DirtyThresh:            359824 kB
> > > BackgroundThresh:       179692 kB
> > > BdiDirtied:            1071104 kB
> > > BdiWritten:               4096 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3596 kB
> > > DirtyThresh:            359824 kB
> > > BackgroundThresh:       179692 kB
> > > BdiDirtied:            1290240 kB
> > > BdiWritten:               4992 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3596 kB
> > > DirtyThresh:            359824 kB
> > > BackgroundThresh:       179692 kB
> > > BdiDirtied:            1517568 kB
> > > BdiWritten:               5824 kB
> > > BdiWriteBandwidth:       25692 kBps
> > > b_dirty:                     0
> > > b_io:                        1
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       7
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3596 kB
> > > DirtyThresh:            359824 kB
> > > BackgroundThresh:       179692 kB
> > > BdiDirtied:            1747968 kB
> > > BdiWritten:               6720 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:            896 kB
> > > DirtyThresh:            359824 kB
> > > BackgroundThresh:       179692 kB
> > > BdiDirtied:            1949696 kB
> > > BdiWritten:               7552 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3612 kB
> > > DirtyThresh:            361300 kB
> > > BackgroundThresh:       180428 kB
> > > BdiDirtied:            2097152 kB
> > > BdiWritten:               8128 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > >
> > >
> > > I didn't do anything to configure/change the FUSE bdi min/max_ratio.
> > > This is what I see on my system:
> > >
> > > cat /sys/class/bdi/0:52/min_ratio
> > > 0
> > > cat /sys/class/bdi/0:52/max_ratio
> > > 1
> >
> > OK, we can see that BdiDirtyThresh stabilized more or less at 3.6MB.
> > Checking the code, this shows we are hitting __wb_calc_thresh() logic:
> >
> >         if (unlikely(wb->bdi->capabilities & BDI_CAP_STRICTLIMIT)) {
> >                 unsigned long limit = hard_dirty_limit(dom, dtc->thresh);
> >                 u64 wb_scale_thresh = 0;
> >
> >                 if (limit > dtc->dirty)
> >                         wb_scale_thresh = (limit - dtc->dirty) / 100;
> >                 wb_thresh = max(wb_thresh, min(wb_scale_thresh, wb_max_thresh /
> >         }
> >
> > so BdiDirtyThresh is set to DirtyThresh/100. This also shows the bdi never
> > generates enough throughput to ramp up its share from this initial value.
> >
> > > > Actually, there's a patch queued in mm tree that improves the ramping up of
> > > > bdi dirty limit for strictlimit bdis [1]. It would be nice if you could
> > > > test whether it changes something in the behavior you observe. Thanks!
> > > >
> > > >                                                                 Honza
> > > >
> > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/akpm/25-new.git/tree/patches/mm-page-writeback-consolidate-wb_thresh-bumping-logic-into-__wb_calc_thresh.patch
> > >
> > > I still see the same results (~230 MiB/s throughput using fio) with
> > > this patch applied, unfortunately. Here's the debug info I see with
> > > this patch (same test scenario as above on FUSE large folio writes
> > > where bs=1M and size=1G):
> > >
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:           2048 kB
> > > BdiDirtyThresh:           3588 kB
> > > DirtyThresh:            359132 kB
> > > BackgroundThresh:       179348 kB
> > > BdiDirtied:              51200 kB
> > > BdiWritten:                128 kB
> > > BdiWriteBandwidth:      102400 kBps
> > > b_dirty:                     1
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       5
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3588 kB
> > > DirtyThresh:            359144 kB
> > > BackgroundThresh:       179352 kB
> > > BdiDirtied:             331776 kB
> > > BdiWritten:               1216 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3588 kB
> > > DirtyThresh:            359144 kB
> > > BackgroundThresh:       179352 kB
> > > BdiDirtied:             562176 kB
> > > BdiWritten:               2176 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:                0 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3588 kB
> > > DirtyThresh:            359144 kB
> > > BackgroundThresh:       179352 kB
> > > BdiDirtied:             792576 kB
> > > BdiWritten:               3072 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> > > BdiWriteback:               64 kB
> > > BdiReclaimable:              0 kB
> > > BdiDirtyThresh:           3588 kB
> > > DirtyThresh:            359144 kB
> > > BackgroundThresh:       179352 kB
> > > BdiDirtied:            1026048 kB
> > > BdiWritten:               3904 kB
> > > BdiWriteBandwidth:           0 kBps
> > > b_dirty:                     0
> > > b_io:                        0
> > > b_more_io:                   0
> > > b_dirty_time:                0
> > > bdi_list:                    1
> > > state:                       1
> > > ---------
> >
> > Yeah, here the situation is really the same. As an experiment, can you
> > try setting min_ratio for the FUSE bdi to 1, 2, 3, ..., 10 (I
> > don't expect you to need to go past 10) and figure out when there's
> > enough slack space for the writeback bandwidth to ramp up to full speed?
> > Thanks!
> >
> >                                                                 Honza
> 
> When locally testing this, I'm seeing that the max_ratio affects the
> bandwidth more so than min_ratio (eg the different min_ratios have
> roughly the same bandwidth per max_ratio). I'm also seeing somewhat
> high variance across runs which makes it hard to gauge what's
> accurate, but on average this is what I'm seeing:
> 
> max_ratio=1 --- bandwidth= ~230 MiB/s
> max_ratio=2 --- bandwidth= ~420 MiB/s
> max_ratio=3 --- bandwidth= ~550 MiB/s
> max_ratio=4 --- bandwidth= ~653 MiB/s
> max_ratio=5 --- bandwidth= ~700 MiB/s
> max_ratio=6 --- bandwidth= ~810 MiB/s
> max_ratio=7 --- bandwidth= ~1040 MiB/s (and then a lot of times, 561
> MiB/s on subsequent runs)

Ah, sorry. I actually misinterpreted this part of your previous email:

> > > cat /sys/class/bdi/0:52/max_ratio
> > > 1

This means the amount of dirty pages for the FUSE filesystem is indeed
hard-capped at 1% of the dirty limit, which happens to be ~3MB on your
machine. Checking where this comes from, I can see that fuse_bdi_init()
does this by:

	bdi_set_max_ratio(sb->s_bdi, 1);

So FUSE restricts itself, and with only a 3MB dirty limit and 2MB dirtying
granularity it is not surprising that dirty throttling doesn't work well.
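
To make the arithmetic concrete, here is a rough userspace sketch (not
kernel code; the helper names and the kB units are assumptions made only
for this illustration) of how a max_ratio percentage translates into a
per-bdi cap and how many folios fit under it:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: approximate the per-bdi dirty cap implied by
 * max_ratio, taken as a whole percentage of the global dirty threshold.
 */
static uint64_t bdi_cap_kb(uint64_t dirty_thresh_kb, unsigned int max_ratio_pct)
{
	return dirty_thresh_kb * max_ratio_pct / 100;
}

/* Number of whole folios of folio_kb that fit under cap_kb. */
static uint64_t folios_under_cap(uint64_t cap_kb, uint64_t folio_kb)
{
	assert(folio_kb > 0);
	return cap_kb / folio_kb;
}
```

With DirtyThresh at 359824 kB and max_ratio=1, bdi_cap_kb() gives 3598 kB,
close to the ~3596 kB BdiDirtyThresh in the stats above, and
folios_under_cap() with 2048 kB folios gives room for a single folio.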

I'd say there needs to be some better heuristic within FUSE that balances
maximum folio size and maximum dirty limit setting for the filesystem to a
sensible compromise (so that there's space for at least say 10 dirty
max-sized folios within the dirty limit).
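
One possible shape for such a heuristic, purely as a sketch: pick the
smallest whole-percent ratio that leaves room for a given number of
max-sized folios. The helper name, the kB units and the clamping to
1..100 are all assumptions for illustration; nothing like this exists in
FUSE today.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: smallest whole-percent ratio such that nr_folios
 * folios of folio_kb each fit under dirty_thresh_kb * ratio / 100.
 * The result is clamped to the range 1..100.
 */
static unsigned int min_ratio_for_folios(uint64_t dirty_thresh_kb,
					 uint64_t folio_kb,
					 unsigned int nr_folios)
{
	uint64_t need_kb = folio_kb * nr_folios;
	uint64_t pct;

	assert(dirty_thresh_kb > 0);

	/* Round up so the resulting cap is never below need_kb. */
	pct = (need_kb * 100 + dirty_thresh_kb - 1) / dirty_thresh_kb;
	if (pct < 1)
		pct = 1;
	return pct > 100 ? 100 : (unsigned int)pct;
}
```

For the numbers in this thread (DirtyThresh of 359824 kB, 2048 kB folios,
10 folios) this works out to 6%.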

But I guess this is just a shorter-term workaround. Long-term, finer
grained dirtiness tracking within FUSE (and writeback counters tracking in
MM) is going to be a more effective solution.

								Honza
-- 
Jan Kara <jack@xxxxxxxx>
SUSE Labs, CR



