Re: [Cluster-devel] gfs2 write bandwidth regression on 6.4-rc3 compared to 5.15.y

Andreas Gruenbacher <agruenba@xxxxxxxxxx> · Mon, 10 Jul 2023 15:19:54 +0200

Hi Wang Yugui,

On Sun, May 28, 2023 at 5:53 PM Wang Yugui <wangyugui@xxxxxxxxxxxx> wrote:
> Hi,
>
> > Hi,
> >
> > gfs2 write bandwidth regression on 6.4-rc3 compared to 5.15.y.
> >
> > We added linux-xfs@ and linux-fsdevel@ because of a related problem[1]
> > and related patches[2].
> >
> > We compared 6.4-rc3 (rather than 6.1.y) to 5.15.y because the related patches[2]
> > work only on 6.4 for now.
> >
> > [1] https://lore.kernel.org/linux-xfs/20230508172406.1CF3.409509F4@xxxxxxxxxxxx/
> > [2] https://lore.kernel.org/linux-xfs/20230520163603.1794256-1-willy@xxxxxxxxxxxxx/
> >
> >
> > test case:
> > 1) PCIe3 SSD *4 with LVM
> > 2) gfs2 lock_nolock
> >     gfs2 attr(T) GFS2_AF_ORLOV
> >    # chattr +T /mnt/test
> > 3) fio
> > fio --name=global --rw=write -bs=1024Ki -size=32Gi -runtime=30 -iodepth 1 \
> >       -ioengine sync -zero_buffers=1 -direct=0 -end_fsync=1 -numjobs=1 \
> >       -name write-bandwidth-1 -filename=/mnt/test/sub1/1.txt \
> >       -name write-bandwidth-2 -filename=/mnt/test/sub2/1.txt \
> >       -name write-bandwidth-3 -filename=/mnt/test/sub3/1.txt \
> >       -name write-bandwidth-4 -filename=/mnt/test/sub4/1.txt
> > 4) patches[2] are applied to 6.4-rc3.
> >
> >
> > 5.15.y result
> >       fio WRITE: bw=5139MiB/s (5389MB/s),
> > 6.4-rc3 result
> >       fio  WRITE: bw=2599MiB/s (2725MB/s)
>
> More test results:
>
> 5.17.0  WRITE: bw=4988MiB/s (5231MB/s)
> 5.18.0  WRITE: bw=5165MiB/s (5416MB/s)
> 5.19.0  WRITE: bw=5511MiB/s (5779MB/s)
> 6.0.5   WRITE: bw=3055MiB/s (3203MB/s), WRITE: bw=3225MiB/s (3382MB/s)
> 6.1.30  WRITE: bw=2579MiB/s (2705MB/s)
>
> So this regression happened in code introduced in 6.0,
> and maybe there is a minor regression in 6.1 too?

Thanks for this bug report. Bob has noticed a similar-looking
performance regression recently, and it turned out that commit
e1fa9ea85ce8 ("gfs2: Stop using glock holder auto-demotion for now")
inadvertently caused buffered writes to fall back to writing single
pages instead of multiple pages at once. That patch was added in
v5.18, so it doesn't perfectly align with the regression history
you're reporting, but maybe there's something else going on that we're
not aware of.
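
To put numbers on the history quoted above, here is a minimal sketch
that computes the relative change between two of the reported
bandwidths. The pct_change helper is a hypothetical name used only for
illustration; it is not part of fio or gfs2, and the inputs are the
MiB/s figures from your report:

```shell
# pct_change OLD NEW: print the relative change from OLD to NEW in percent,
# rounded to one decimal place. Hypothetical helper, for illustration only.
pct_change() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - old) * 100 / old }'
}

# Bandwidths in MiB/s as reported in this thread.
pct_change 5139 2599   # 5.15.y -> 6.4-rc3, prints -49.4
pct_change 5511 3055   # 5.19.0 -> 6.0.5 (first run), prints -44.6
pct_change 3055 2579   # 6.0.5 (first run) -> 6.1.30, prints -15.6
```

By these figures the bulk of the drop appears between 5.19.0 and 6.0.5,
with a smaller additional drop by 6.1.30.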

In any case, the regression introduced by commit e1fa9ea85ce8 should
be fixed by commit c8ed1b359312 ("gfs2: Fix duplicate
should_fault_in_pages() call"), which ended up in v6.5-rc1.

Could you please check where we end up with that fix?
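
In case it helps, one way to confirm that the tree you build from
already contains the fix is sketched below. It assumes you are working
from a git clone of the mainline kernel; the has_commit helper and the
~/src/linux path are hypothetical names chosen for this example:

```shell
# has_commit REPO COMMIT [REV]: succeed if COMMIT is an ancestor of REV
# (REV defaults to HEAD). Hypothetical helper built on git merge-base.
has_commit() {
    git -C "$1" merge-base --is-ancestor "$2" "${3:-HEAD}"
}

# Example usage (commented out; ~/src/linux is an assumed clone location):
# has_commit ~/src/linux c8ed1b359312 && echo "fix is in the checked-out tree"
# has_commit ~/src/linux e1fa9ea85ce8 v5.18 && echo "e1fa9ea85ce8 is in v5.18"
```

If the fix is not there, testing v6.5-rc1 or later, or cherry-picking
c8ed1b359312 onto the tree under test, should be enough for a comparison.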

Thank you very much,
Andreas