Re: Regression in NFS probably due to very large amounts of readahead

On 2024-11-26 13:49, Anders Blomdell wrote:


On 2024-11-26 11:37, Jan Kara wrote:
On Tue 26-11-24 09:01:35, Anders Blomdell wrote:
On 2024-11-26 02:48, Philippe Troin wrote:
On Sat, 2024-11-23 at 23:32 +0100, Anders Blomdell wrote:
When we (re)started one of our servers with 6.11.3-200.fc40.x86_64,
we got terrible performance (lots of "nfs: server x.x.x.x not
responding").
What triggered this problem was virtual machines with NFS-mounted
qcow2 disks that often triggered large readaheads, which generated long
streaks of disk I/O of 150-600 MB/s (4 ordinary HDDs) that filled up
the buffer/cache area of the machine.

A git bisect gave the following suspect:

git bisect start

8< snip >8

# first bad commit: [7c877586da3178974a8a94577b6045a48377ff25]
readahead: properly shorten readahead when falling back to
do_page_cache_ra()

Thank you for taking the time to bisect, this issue has been bugging
me, but it's been non-deterministic, and hence hard to bisect.

I'm seeing the same problem on 6.11.10 (and earlier 6.11.x kernels) in
slightly different setups:

(1) On machines mounting NFSv3 shared drives. The symptom here is a
"nfs server XXX not responding, still trying" that never recovers
(while the server remains pingable and other NFSv3 volumes from the
hanging server can be mounted).

(2) On VMs running over qemu-kvm, I see very long stalls (can be up to
several minutes) on random I/O. These stalls eventually recover.

I've built a 6.11.10 kernel with
7c877586da3178974a8a94577b6045a48377ff25 reverted and I'm back to
normal (no more NFS hangs, no more VM stalls).

Some printk debugging seems to indicate that the problem
is that the expression 'ra->size - (index - start)' goes
negative, which then gets cast to a very large unsigned
'nr_to_read' when calling 'do_page_cache_ra'. Where the true
bug is still eludes me, though.

Thanks for the report, bisection and debugging! I think I see what's going
on. read_pages() can go and reduce ra->size when the ->readahead() callback
fails to read all folios prepared for reading, and apparently that's what
happens with NFS and what can lead to a negative argument to
do_page_cache_ra(). Now at this point I'm of the opinion that updating
ra->size / ra->async_size does more harm than good (because those values
show the *desired* readahead, not the exact number of pages read).
Furthermore, it is problematic because ra can be shared by multiple
processes, so updates are inherently racy. If we indeed need to store the
number of pages read, we could do it through ractl, which is call-site local
and used for communication between generic readahead functions and callers.
But I have to do some more history digging and code reading to understand
what is using this logic in read_pages().

                                Honza
Good, I look forward to a quick revert, and don't forget to CC GKH, so I get recent kernels that work ASAP.
BTW, here is the output of the problematic reads from my printk-modified kernel, with all the good ones omitted:

nov 13:49:11 fay-02 kernel: mm/readahead.c:490 000000002cdf0a09: nr_to_read=-3 size=8 index=173952 mark=173947 start=173941 async=5 err=-17
nov 13:49:12 fay-02 kernel: mm/readahead.c:490 000000002cdf0a09: nr_to_read=-7 size=20 index=4158252 mark=4158225 start=4158225 async=20 err=-17
nov 13:49:16 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-8 size=4 index=17978832 mark=17978820 start=17978820 async=4 err=-17
nov 13:49:19 fay-02 kernel: mm/readahead.c:490 00000000ce741f0d: nr_to_read=-5 size=8 index=3074784 mark=3074771 start=3074771 async=8 err=-17
nov 13:49:21 fay-02 kernel: mm/readahead.c:490 00000000ce741f0d: nr_to_read=-4 size=6 index=3087040 mark=3087030 start=3087030 async=6 err=-17
nov 13:49:23 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-2 size=16 index=16118408 mark=16118405 start=16118390 async=10 err=-17
nov 13:49:24 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-10 size=16 index=20781128 mark=20781118 start=20781102 async=16 err=-17
nov 13:49:24 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-13 size=16 index=20679424 mark=20679411 start=20679395 async=10 err=-17
nov 13:49:25 fay-02 kernel: mm/readahead.c:490 0000000036189388: nr_to_read=-9 size=4 index=20792116 mark=20792103 start=20792103 async=4 err=-17
nov 13:50:22 fay-02 kernel: mm/readahead.c:490 000000009b8f0763: nr_to_read=-7 size=4 index=4172 mark=4167 start=4161 async=1 err=-17
nov 13:50:24 fay-02 kernel: mm/readahead.c:490 00000000295f3a99: nr_to_read=-7 size=4 index=4108 mark=4097 start=4097 async=1 err=-17
nov 13:50:24 fay-02 kernel: mm/readahead.c:490 00000000295f3a99: nr_to_read=-7 size=4 index=4428 mark=4417 start=4417 async=4 err=-17
nov 13:56:48 fay-02 kernel: mm/readahead.c:490 000000009b8f0763: nr_to_read=-10 size=18 index=85071484 mark=85071456 start=85071456 async=18 err=-17

--- a/mm/readahead.c
+++ b/mm/readahead.c
@@ -485,7 +485,21 @@ void page_cache_ra_order(struct readahead_control *ractl,
        if (!err)
                return;
 fallback:
-       do_page_cache_ra(ractl, ra->size - (index - start), ra->async_size);
+       long nr_to_read = ra->size - (index - start);
+       if (index > mark) {
+         printk("%s:%d %p: "
+                "nr_to_read=%ld "
+                "size=%d index=%ld mark=%ld start=%ld async=%d err=%d",
+                __FILE__, __LINE__,
+                ractl->mapping->host,
+                nr_to_read,
+                ra->size, index, mark, start, ra->async_size, err);
+       }
+       if (nr_to_read < 0) {
+         printk("SKIP");
+         return;
+       }
+       do_page_cache_ra(ractl, nr_to_read, ra->async_size);
 }
 
 static unsigned long ractl_max_pages(struct readahead_control *ractl,

Regards

/Anders



