Re: [LSF/MM/BPF TOPIC] Improving large folio writeback performance

On Tue, Jan 14, 2025 at 5:21 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Tue, Jan 14, 2025 at 04:50:53PM -0800, Joanne Koong wrote:
> > Hi all,
> >
> > I would like to propose a discussion topic about improving large folio
> > writeback performance. As more filesystems adopt large folios, it
> > becomes increasingly important that writeback be as performant as
> > possible. There are two areas I'd like to discuss:
> >
> >
> > == Granularity of dirty pages writeback ==
> > Currently, the granularity of writeback is at the folio level. If one
> > byte in a folio is dirty, the entire folio will be written back. This
> > becomes unscalable for larger folios and significantly degrades
> > performance, especially for workloads that employ random writes.
>
> This sounds familiar, probably because we fixed this exact issue in
> the iomap infrastructure some while ago.
>
> commit 4ce02c67972211be488408c275c8fbf19faf29b3
> Author: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
> Date:   Mon Jul 10 14:12:43 2023 -0700
>
>     iomap: Add per-block dirty state tracking to improve performance
>
>     When filesystem blocksize is less than folio size (either with
>     mapping_large_folio_support() or with blocksize < pagesize) and when the
>     folio is uptodate in pagecache, then even a byte write can cause
>     an entire folio to be written to disk during writeback. This happens
>     because we currently don't have a mechanism to track per-block dirty
>     state within struct iomap_folio_state. We currently only track uptodate
>     state.
>
>     This patch implements support for tracking per-block dirty state in
>     iomap_folio_state->state bitmap. This should help improve the filesystem
>     write performance and help reduce write amplification.
>
>     Performance testing of the fio workload below reveals a ~16x performance
>     improvement using nvme with XFS (4k blocksize) on Power (64K pagesize).
>     FIO-reported write bw scores improved from around ~28 MBps to ~452 MBps.
>
>     1. <test_randwrite.fio>
>     [global]
>             ioengine=psync
>             rw=randwrite
>             overwrite=1
>             pre_read=1
>             direct=0
>             bs=4k
>             size=1G
>             dir=./
>             numjobs=8
>             fdatasync=1
>             runtime=60
>             iodepth=64
>             group_reporting=1
>
>     [fio-run]
>
>     2. Also our internal performance team reported that this patch improves
>        their database workload performance by around ~83% (with XFS on Power)
>
>     Reported-by: Aravinda Herle <araherle@xxxxxxxxxx>
>     Reported-by: Brian Foster <bfoster@xxxxxxxxxx>
>     Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
>     Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>
>
>
> > One idea is to track dirty pages at a smaller granularity using a
> > 64-bit bitmap stored inside the folio struct, where each bit tracks a
> > smaller chunk of pages (e.g. for 2 MB folios, each bit would track a
> > 32 KB chunk), and only write back dirty chunks rather than the entire folio.
>
> Have a look at how sub-folio state is tracked via the
> folio->iomap_folio_state->state{} bitmaps.
>
> Essentially it is up to the subsystem to track sub-folio state if
> it requires it; there is some generic filesystem infrastructure
> support already in place (like iomap), but if that doesn't fit a
> filesystem then it will need to provide its own dirty/uptodate
> tracking....

Great, thanks for the info. I'll take a look at how the iomap layer does this.
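
For anyone else following the thread, my rough reading of how iomap does
this today (simplified from fs/iomap/buffered-io.c; treat it as a sketch
rather than the exact current code):

struct iomap_folio_state {
	spinlock_t	state_lock;
	unsigned int	read_bytes_pending;
	atomic_t	write_bytes_pending;
	/*
	 * Two bits per filesystem block: bits [0, blocks_per_folio)
	 * track uptodate state, bits [blocks_per_folio,
	 * 2 * blocks_per_folio) track dirty state.
	 */
	unsigned long	state[];
};

/* Writeback can then skip clean blocks inside a dirty folio: */
static bool ifs_block_is_dirty(struct folio *folio,
		struct iomap_folio_state *ifs, int block)
{
	struct inode *inode = folio->mapping->host;
	unsigned int blks_per_folio = i_blocks_per_folio(inode, folio);

	return test_bit(block + blks_per_folio, ifs->state);
}

That is essentially the per-chunk bitmap idea above, except the tracking
granularity is the filesystem block size rather than a fixed 64-bit bitmap
per folio, and the state lives in a separately allocated structure hung off
folio->private rather than in struct folio itself.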

>
> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx




