On Wed, 27 Nov 2024, Jan Kara wrote:
> On Tue 26-11-24 11:37:19, Jan Kara wrote:
> > On Tue 26-11-24 09:01:35, Anders Blomdell wrote:
> > > On 2024-11-26 02:48, Philippe Troin wrote:
> > > > On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote:
> > > > > When we (re)started one of our servers with 6.11.3-200.fc40.x86_64,
> > > > > we got terrible performance (lots of "nfs: server x.x.x.x not
> > > > > responding"). What triggered this problem was virtual machines with
> > > > > NFS-mounted qcow2 disks that often triggered large readaheads,
> > > > > generating long streaks of disk I/O of 150-600 MB/s (4 ordinary
> > > > > HDDs) that filled up the buffer/cache area of the machine.
> > > > >
> > > > > A git bisect gave the following suspect:
> > > > >
> > > > > git bisect start
> > > >
> > > > 8< snip >8
> > > >
> > > > > # first bad commit: [7c877586da3178974a8a94577b6045a48377ff25]
> > > > > readahead: properly shorten readahead when falling back to
> > > > > do_page_cache_ra()
> > > >
> > > > Thank you for taking the time to bisect; this issue has been bugging
> > > > me, but it's been non-deterministic and hence hard to bisect.
> > > >
> > > > I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels)
> > > > in slightly different setups:
> > > >
> > > > (1) On machines mounting NFSv3 shared drives. The symptom here is a
> > > > "nfs: server XXX not responding, still trying" that never recovers
> > > > (while the server remains pingable and other NFSv3 volumes on the
> > > > hanging server can be mounted).
> > > >
> > > > (2) On VMs running over qemu-kvm, I see very long stalls (up to
> > > > several minutes) on random I/O. These stalls eventually recover.
> > > >
> > > > I've built a 6.11.10 kernel with
> > > > 7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to
> > > > normal (no more NFS hangs, no more VM stalls).
> > >
> > > Some printk debugging seems to indicate that the problem is that the
> > > quantity 'ra->size - (index - start)' goes negative and is then cast
> > > to a very large unsigned 'nr_to_read' when calling
> > > 'do_page_cache_ra'. Where the true bug is still eludes me, though.
> >
> > Thanks for the report, bisection and debugging! I think I see what's
> > going on. read_pages() can go and reduce ra->size when the
> > ->readahead() callback failed to read all the folios prepared for
> > reading, and apparently that's what happens with NFS and what can lead
> > to a negative argument to do_page_cache_ra(). At this point I'm of the
> > opinion that updating ra->size / ra->async_size does more harm than
> > good (because those values show the *desired* readahead, not the exact
> > number of pages read). Furthermore it is problematic because ra can be
> > shared by multiple processes, so updates are inherently racy. If we
> > indeed need to store the number of pages read, we could do it through
> > ractl, which is call-site local and used for communication between the
> > generic readahead functions and their callers. But I have to do some
> > more history digging and code reading to understand what is using this
> > logic in read_pages().
>
> Hum, checking the history, the update of ra->size was added by Neil two
> years ago in 9fd472af84ab ("mm: improve cleanup when ->readpages doesn't
> process all pages"). Neil, the changelog reads as if there was some real
> motivation behind updating ra->size in read_pages(). What was it?
>
> Now I somewhat disagree with reducing ra->size in read_pages() because
> it seems like the wrong place to do that, and if we do need something
> like it, the readahead window sizing logic should rather be changed to
> take it into account. But it all depends on what the real rationale
> behind reducing ra->size in read_pages() was...
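[To make the arithmetic above concrete, here is a minimal userspace
sketch, not the kernel code: the names echo mm/readahead.c but the
values are invented. It shows how, once read_pages() has shrunk
ra->size below the number of pages already submitted, the subtraction
wraps around and reaches do_page_cache_ra() as a huge nr_to_read.]

	#include <stdio.h>

	/* Stand-in for do_page_cache_ra(ractl, nr_to_read, lookahead_size). */
	static void do_page_cache_ra_stub(unsigned long nr_to_read)
	{
		printf("nr_to_read = %lu\n", nr_to_read);
	}

	int main(void)
	{
		unsigned int ra_size = 4;  /* ra->size after read_pages() shrank it */
		unsigned long start = 100; /* index where this readahead started   */
		unsigned long index = 116; /* 16 pages already submitted           */

		/* 4 - 16 in unsigned arithmetic wraps to 2^64 - 12 on LP64 */
		do_page_cache_ra_stub(ra_size - (index - start));
		return 0;
	}
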
I cannot tell you much more than what the commit itself says. If there
are any pages still in the rac, then we didn't attempt read-ahead on
them and shouldn't pretend that we did; otherwise the numbers will be
wrong.

I think the important part of the patch was the delete_from_page_cache().
Leaving pages in the page cache that we never tried to read will cause a
future read-ahead to skip those pages, and then they can only be read
synchronously.

But maybe you are right that ra, being shared, shouldn't be modified
like this.

Thanks,
NeilBrown
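[For illustration, a userspace toy model of the skip-if-cached behaviour
described above; this is hypothetical code, not taken from
mm/readahead.c. Readahead ignores indices that already have a page in
the cache, so pages left behind by a failed ->readahead() are never
re-submitted and can only be read synchronously at fault time.]

	#include <stdbool.h>
	#include <stdio.h>

	#define NR_PAGES 16

	static bool cached[NR_PAGES];   /* is a page present in the cache?   */
	static bool uptodate[NR_PAGES]; /* does the page actually hold data? */

	/* Toy readahead: fill empty slots, skip anything already cached. */
	static void toy_readahead(unsigned int start, unsigned int nr)
	{
		unsigned int i;

		for (i = start; i < start + nr && i < NR_PAGES; i++) {
			if (cached[i])
				continue; /* already present: assume it was read */
			cached[i] = true;
			uptodate[i] = true; /* pretend the async read succeeded */
		}
	}

	int main(void)
	{
		unsigned int i;

		/* First pass: pages 0-7 get added to the cache, but the
		 * ->readahead() callback only manages to read 0-3, and
		 * the leftovers are not deleted from the cache. */
		for (i = 0; i < 8; i++)
			cached[i] = true;
		for (i = 0; i < 4; i++)
			uptodate[i] = true;

		/* Second pass over the same range skips stale pages 4-7. */
		toy_readahead(0, 8);

		for (i = 0; i < 8; i++)
			if (cached[i] && !uptodate[i])
				printf("page %u: cached but never read -> "
				       "synchronous read at fault time\n", i);
		return 0;
	}
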