On Tue, Jan 14, 2025 at 04:50:53PM -0800, Joanne Koong wrote:
> Hi all,
>
> I would like to propose a discussion topic about improving large folio
> writeback performance. As more filesystems adopt large folios, it
> becomes increasingly important that writeback is made to be as
> performant as possible. There are two areas I'd like to discuss:
>
>
> == Granularity of dirty pages writeback ==
> Currently, the granularity of writeback is at the folio level. If one
> byte in a folio is dirty, the entire folio will be written back. This
> becomes unscalable for larger folios and significantly degrades
> performance, especially for workloads that employ random writes.

This sounds familiar, probably because we fixed this exact issue in
the iomap infrastructure a while ago.

commit 4ce02c67972211be488408c275c8fbf19faf29b3
Author: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
Date:   Mon Jul 10 14:12:43 2023 -0700

    iomap: Add per-block dirty state tracking to improve performance

    When filesystem blocksize is less than folio size (either with
    mapping_large_folio_support() or with blocksize < pagesize) and when
    the folio is uptodate in pagecache, then even a byte write can cause
    an entire folio to be written to disk during writeback. This happens
    because we currently don't have a mechanism to track per-block dirty
    state within struct iomap_folio_state. We currently only track
    uptodate state.

    This patch implements support for tracking per-block dirty state in
    iomap_folio_state->state bitmap. This should help improve the
    filesystem write performance and help reduce write amplification.

    Performance testing of below fio workload reveals ~16x performance
    improvement using nvme with XFS (4k blocksize) on Power (64K
    pagesize). FIO reported write bw scores improved from around
    ~28 MBps to ~452 MBps.

    1. <test_randwrite.fio>
        [global]
            ioengine=psync
            rw=randwrite
            overwrite=1
            pre_read=1
            direct=0
            bs=4k
            size=1G
            dir=./
            numjobs=8
            fdatasync=1
            runtime=60
            iodepth=64
            group_reporting=1

        [fio-run]

    2. Also our internal performance team reported that this patch
       improves their database workload performance by around ~83%
       (with XFS on Power)

    Reported-by: Aravinda Herle <araherle@xxxxxxxxxx>
    Reported-by: Brian Foster <bfoster@xxxxxxxxxx>
    Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@xxxxxxxxx>
    Reviewed-by: Darrick J. Wong <djwong@xxxxxxxxxx>

> One idea is to track dirty pages at a smaller granularity using a
> 64-bit bitmap stored inside the folio struct where each bit tracks a
> smaller chunk of pages (eg for 2 MB folios, each bit would track 32k
> pages), and only write back dirty chunks rather than the entire folio.

Have a look at how sub-folio state is tracked via the
folio->iomap_folio_state->state{} bitmaps.

Essentially it is up to the subsystem to track sub-folio state if
they require it; there is some generic filesystem infrastructure
support already in place (like iomap), but if that doesn't fit a
filesystem then it will need to provide its own dirty/uptodate
tracking....

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx