On 4/29/22 9:37 AM, Chuck Lever III wrote:
> 
> 
>> On Apr 29, 2022, at 8:54 AM, Dennis Dalessandro <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
>> 
>> On 4/28/22 3:56 PM, Trond Myklebust wrote:
>>> On Thu, 2022-04-28 at 15:47 -0400, Dennis Dalessandro wrote:
>>>> On 4/28/22 11:42 AM, Dennis Dalessandro wrote:
>>>>> On 4/28/22 10:57 AM, Chuck Lever III wrote:
>>>>>>> On Apr 28, 2022, at 9:05 AM, Dennis Dalessandro
>>>>>>> <dennis.dalessandro@xxxxxxxxxxxxxxxxxxxx> wrote:
>>>>>>> 
>>>>>>> Hi NFS folks,
>>>>>>> 
>>>>>>> I've noticed a pretty nasty regression in our NFS capability
>>>>>>> between 5.17 and 5.18-rc1. I've tried to bisect but am not
>>>>>>> having any luck. The problem I'm seeing is that it takes
>>>>>>> 3 minutes to copy a file from NFS to the local disk, when it
>>>>>>> should take less than half a second, which it did up
>>>>>>> through 5.17.
>>>>>>> 
>>>>>>> It doesn't seem to be network related, but I can't rule that
>>>>>>> out completely.
>>>>>>> 
>>>>>>> I tried to bisect, but the problem is intermittent. In some
>>>>>>> runs I'll see the problem in 3 out of 100 cycles, sometimes
>>>>>>> in 0 out of 100, and sometimes in 100 out of 100.
>>>>>> 
>>>>>> It's not clear from your problem report whether the problem
>>>>>> appears when it's the server running v5.18-rc or the client.
>>>>> 
>>>>> That's because I don't know which it is. I'll do a quick test
>>>>> and find out. I was testing the same kernel across both nodes.
>>>> 
>>>> Looks like it is the client.
>>>> 
>>>> server  client  result
>>>> ------  ------  ------
>>>> 5.17    5.17    Pass
>>>> 5.17    5.18    Fail
>>>> 5.18    5.18    Fail
>>>> 5.18    5.17    Pass
>>>> 
>>>> Is there a patch for the client issue you mentioned that I could
>>>> try?
>>>> 
>>>> -Denny
>>> 
>>> Try this one
>> 
>> Thanks for the patch. Unfortunately it doesn't seem to solve the
>> issue; I still see intermittent hangs. I applied it on top of -rc4:
>> 
>> copy /mnt/nfs_test/run_nfs_test.junk to /dev/shm/run_nfs_test.tmp...
>> 
>> real    2m6.072s
>> user    0m0.002s
>> sys     0m0.263s
>> Done
>> 
>> While it was hung I checked the memory usage on the machine:
>> 
>> # free -h
>>           total     used     free   shared  buff/cache  available
>> Mem:       62Gi    871Mi     61Gi    342Mi       889Mi       61Gi
>> Swap:     4.0Gi       0B    4.0Gi
>> 
>> It doesn't appear to be under memory pressure.
> 
> Hi, since you know now that it is the client, perhaps a bisect
> would be more successful?

I've been testing all week. I pulled the nfs-rdma tree that was sent to
Linus for 5.18 and tested. I see the problem on pretty much all the
patches; what changes is how frequently it hits. I see 1-5 cycles out
of 2500 where the copy takes minutes, up to:

"NFS: Convert readdir page cache to use a cookie based index"

After this patch I start seeing it around 10 times in 500, and by the
last patch, 10 times in fewer than 100.

Is there any kind of tracing/debugging I could turn on to get more
insight into what is taking so long when it does go bad?

-Denny
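
On the tracing question, one option on the client is the kernel's NFS
and SUNRPC tracepoints (plus the rpcrdma events, since this setup uses
the RDMA transport). A minimal sketch, assuming trace-cmd is installed;
which event subsystems are available can vary by kernel version:

  # record client-side RPC activity around a single copy;
  # trace-cmd stops recording when the command exits
  trace-cmd record -e nfs4 -e sunrpc -e rpcrdma \
      cp /mnt/nfs_test/run_nfs_test.junk /dev/shm/run_nfs_test.tmp
  trace-cmd report > trace.txt

  # alternatively, the legacy dprintk interface logs to the kernel
  # ring buffer (dmesg/syslog); very verbose, so turn it off after
  rpcdebug -m nfs -s all
  rpcdebug -m rpc -s all
  # ... reproduce the slow copy, then clear the flags:
  rpcdebug -m nfs -c all
  rpcdebug -m rpc -c all

Comparing timestamps in the trace around the stall should show whether
the client is waiting on the transport, the server, or its own page
cache handling.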
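
Given how intermittent the failure is, the bisect itself can be
scripted so that "git bisect run" drives the reproducer. A sketch of
such a script; the paths, cycle count, and 5-second threshold are
illustrative, not taken from the report:

  #!/bin/sh
  # run the copy repeatedly; declare this kernel "bad" if any single
  # run blows past the threshold
  SRC=/mnt/nfs_test/run_nfs_test.junk
  DST=/dev/shm/run_nfs_test.tmp
  for i in $(seq 1 100); do
      start=$(date +%s)
      cp "$SRC" "$DST"
      elapsed=$(( $(date +%s) - start ))
      if [ "$elapsed" -gt 5 ]; then
          echo "cycle $i took ${elapsed}s"
          exit 1    # git bisect run treats a nonzero exit as "bad"
      fi
  done
  exit 0

Invoked as "git bisect run ./copy_test.sh" (a hypothetical script
name). Since a genuinely bad commit can still pass 100 cycles by luck,
a higher cycle count trades bisect time for confidence in each step.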