Regression in NFS probably due to very large amounts of readahead

Anders Blomdell <anders.blomdell@xxxxxxxxx> · Sat, 23 Nov 2024 23:32:41 +0100

When we (re)started one of our servers with 6.11.3-200.fc40.x86_64,
we got terrible performance (lots of "nfs: server x.x.x.x not responding"
messages). The problem was triggered by virtual machines with NFS-mounted
qcow2 disks, which often caused large readaheads that generate long streaks
of disk I/O at 150-600 MB/s (4 ordinary HDDs) and fill up the buffer/cache
area of the machine.
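
For anyone trying to reproduce this, it may help to compare the configured
readahead window on a good and a bad kernel. The following is only a minimal
sketch: it assumes the per-device read_ahead_kb files under /sys/class/bdi are
present (NFS mounts normally show up there under a 0:NN name), and the function
name read_bdi_readahead is made up for illustration.

```python
from pathlib import Path


def read_bdi_readahead(sysfs_root="/sys/class/bdi"):
    """Return {bdi_name: read_ahead_kb} for every BDI that exposes the setting.

    Assumption: each backing device has a directory under sysfs_root that
    contains a read_ahead_kb file holding an integer number of KiB.
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # Not a Linux system, or sysfs is not mounted: nothing to report.
        return result
    for entry in sorted(root.iterdir()):
        knob = entry / "read_ahead_kb"
        try:
            result[entry.name] = int(knob.read_text().strip())
        except (OSError, ValueError):
            # Entry without a readable, numeric read_ahead_kb: skip it.
            continue
    return result


if __name__ == "__main__":
    for name, kb in read_bdi_readahead().items():
        print(f"{name}: {kb} KiB")
```

Running this on both kernels and diffing the output would show whether the
configured window itself changed, or only how the kernel uses it.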

A git bisect gave the following suspect:

git bisect start
# status: waiting for both good and bad commits
# bad: [8e24a758d14c0b1cd42ab0aea980a1030eea811f] Linux 6.11.3
git bisect bad 8e24a758d14c0b1cd42ab0aea980a1030eea811f
# status: waiting for good commit(s), bad commit known
# good: [8a886bee7aa574611df83a028ab435aeee071e00] Linux 6.10.11
git bisect good 8a886bee7aa574611df83a028ab435aeee071e00
# good: [0c3836482481200ead7b416ca80c68a29cfdaabd] Linux 6.10
git bisect good 0c3836482481200ead7b416ca80c68a29cfdaabd
# good: [f669aac34c5f76b58e6cad1fef0643e5ae16d413] Merge tag 'trace-v6.11-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace
git bisect good f669aac34c5f76b58e6cad1fef0643e5ae16d413
# bad: [78eb4ea25cd5fdbdae7eb9fdf87b99195ff67508] sysctl: treewide: constify the ctl_table argument of proc_handlers
git bisect bad 78eb4ea25cd5fdbdae7eb9fdf87b99195ff67508
# good: [acc5965b9ff8a1889f5b51466562896d59c6e1b9] Merge tag 'char-misc-6.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect good acc5965b9ff8a1889f5b51466562896d59c6e1b9
# good: [8e313211f7d46d42b6aa7601b972fe89dcc4a076] Merge tag 'pinctrl-v6.11-1' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl
git bisect good 8e313211f7d46d42b6aa7601b972fe89dcc4a076
# bad: [fbc90c042cd1dc7258ebfebe6d226017e5b5ac8c] Merge tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
git bisect bad fbc90c042cd1dc7258ebfebe6d226017e5b5ac8c
# good: [f416817197e102b9bc6118101c3be652dac01a44] kmsan: support SLAB_POISON
git bisect good f416817197e102b9bc6118101c3be652dac01a44
# bad: [f6a6de245fdb1dfb4307b0a80ce7fa35ba2c35a6] Docs/mm/damon/index: add links to admin-guide doc
git bisect bad f6a6de245fdb1dfb4307b0a80ce7fa35ba2c35a6
# bad: [a0b856b617c585b86a077aae5176c946e1462b7d] mm/ksm: optimize the chain()/chain_prune() interfaces
git bisect bad a0b856b617c585b86a077aae5176c946e1462b7d
# good: [b1a80f4be7691a1ea007e24ebb3c8ca2e4a20f00] kmsan: do not pass NULL pointers as 0
git bisect good b1a80f4be7691a1ea007e24ebb3c8ca2e4a20f00
# bad: [58540f5cde404f512c80fb7b868b12005f0e2747] readahead: simplify gotos in page_cache_sync_ra()
git bisect bad 58540f5cde404f512c80fb7b868b12005f0e2747
# bad: [7c877586da3178974a8a94577b6045a48377ff25] readahead: properly shorten readahead when falling back to do_page_cache_ra()
git bisect bad 7c877586da3178974a8a94577b6045a48377ff25
# good: [ee86814b0562f18255b55c5e6a01a022895994cf] mm/migrate: move NUMA hinting fault folio isolation + checks under PTL
git bisect good ee86814b0562f18255b55c5e6a01a022895994cf
# good: [901a269ff3d59c9ee0e6be35c6044dc4bf2c0fdf] filemap: fix page_cache_next_miss() when no hole found
git bisect good 901a269ff3d59c9ee0e6be35c6044dc4bf2c0fdf
# first bad commit: [7c877586da3178974a8a94577b6045a48377ff25] readahead: properly shorten readahead when falling back to do_page_cache_ra()
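
The log above is easier to double-check when it is summarized mechanically.
The following is a rough sketch, not part of git itself: the helper name
summarize_bisect_log is hypothetical, and it only assumes the comment format
that "git bisect log" prints ("# good: [sha] subject", "# bad: [sha] subject",
"# first bad commit: [sha] subject").

```python
import re

# Matches the comment lines written by "git bisect log".
LINE_RE = re.compile(
    r"^# (good|bad|first bad commit): \[([0-9a-f]{40})\] (.*)$"
)


def summarize_bisect_log(text):
    """Count good/bad verdicts and extract the first bad commit, if any."""
    good, bad, first_bad = [], [], None
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            # "git bisect ..." command lines and status comments are ignored.
            continue
        kind, sha, subject = match.groups()
        if kind == "good":
            good.append(sha)
        elif kind == "bad":
            bad.append(sha)
        else:
            first_bad = (sha, subject)
    return {"good": len(good), "bad": len(bad), "first_bad": first_bad}
```

Feeding it the saved output of "git bisect log" returns the number of good and
bad verdicts plus the (sha, subject) pair of the first bad commit, or None if
the bisect has not finished.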

I would much appreciate some guidance on how to proceed to track down what goes wrong.

Best regards

Anders Blomdell