On 23.03.23 13:02, Daire Byrne wrote: > Yes, sorry, I did see that patch was posted not long after I reported > the excessive repeat ACCESS calls but I have not had a chance to test > it. > > It certainly looks promising. I will try to get to it next week and > update my original email. Hmm, no update afaics (or I missed it). No worries, but I for now will assume this was fixed. Please correct me if it later turns out that I've been wrong. #regzbot fix: 21fd9e8700de86d1169 #regzbot ignore-activity Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) -- Everything you wanna know about Linux kernel regression tracking: https://linux-regtracking.leemhuis.info/about/#tldr That page also explains what to do if mails like this annoy you. > On Thu, 23 Mar 2023 at 11:54, Thorsten Leemhuis > <regressions@xxxxxxxxxxxxx> wrote: >> >> On 20.03.23 13:34, Thorsten Leemhuis wrote: >>> On 10.03.23 11:43, Linux regression tracking (Thorsten Leemhuis) wrote: >>>> [CCing the regression list, as it should be in the loop for regressions: >>>> https://docs.kernel.org/admin-guide/reporting-regressions.html] >>>> >>>> [also adding the author of the culprit (Trond) and the second NFS client >>>> maintainer (Anna) to the list of recipients] >>> >>> Trond, sorry to bother you, but do you still have this regression report >>> on your radar? It looks a bit like it fall through the cracks, as I >>> don't see any reply since it was posted nearly two weeks ago. Or did >>> some progress to address this happen and I just missed it? >> >> Daire, it seems the NFS developers ignore my inquiries, sorry. >> >> But well, I noticed there is a patch that references the culprit you >> found in the bisection: >> >> https://lore.kernel.org/all/20230308080327.33906-1-chengen.du@xxxxxxxxxxxxx/ >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=master&id=21fd9e8700de86d1169f6336e97d7a74916ed04a >> >> I wonder if you are aware of it or maybe even tested it already. To me >> it sounds like it could fix your problem, but this is not my area of >> expertise, so I might be totally wrong. >> >> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) >> -- >> Everything you wanna know about Linux kernel regression tracking: >> https://linux-regtracking.leemhuis.info/about/#tldr >> If I did something stupid, please tell me, as explained on that page. >> >> #regzbot ignore-activity >> >>>> [TLDR: I'm adding this report to the list of tracked Linux kernel >>>> regressions; the text you find below is based on a few templates >>>> paragraphs you might have encountered already in similar form. >>>> See link in footer if these mails annoy you.] >>>> >>>> On 07.03.23 18:38, Daire Byrne wrote: >>>>> I noticed a change in behaviour in the v6.2.x client versus v6.1.12 (and below). >>>>> >>>>> We have some servers that mount Netapps from different locations many >>>>> milliseconds away, and these contain apps and libs that get added to >>>>> the LD_LIBRARY_PATH and PATH on remote login. >>>>> >>>>> I then noticed that when I ssh'd into a remote server that had these >>>>> mounts and the shell was starting, the first login was normal and I >>>>> observed an expected flurry of lookups,getattrs and access calls for a >>>>> grand total of only ~120 packets to the Netapp. >>>>> >>>>> But when I disconnect and reconnect (ssh), now I see a flood of access >>>>> calls to the netapp for a handful of repeating filehandles which look >>>>> something like: >>>>> >>>>> 2700 85.942563180 10.23.112.10 → 10.23.21.11 NFS 254 V3 ACCESS Call, >>>>> FH: 0x7f36addc, [Check: RD LU MD XT DL] >>>>> 2701 85.999838796 10.23.21.11 → 10.23.112.10 NFS 190 V3 ACCESS Reply >>>>> (Call In 2700), [Allowed: RD LU MD XT DL] >>>>> 2702 85.999970825 10.23.112.10 → 10.23.21.11 NFS 254 V3 ACCESS Call, >>>>> FH: 0x7f36addc, [Check: RD LU MD XT DL] >>>>> 2703 86.055340946 10.23.21.11 → 10.23.112.10 NFS 190 V3 ACCESS Reply >>>>> (Call In 2702), [Allowed: RD LU MD XT DL] >>>>> 2704 86.056865308 10.23.112.10 → 10.23.21.11 NFS 254 V3 ACCESS Call, >>>>> FH: 0x7f36addc, [Check: RD LU MD XT DL] >>>>> 2705 86.112233415 10.23.21.11 → 10.23.112.10 NFS 190 V3 ACCESS Reply >>>>> (Call In 2704), [Allowed: RD LU MD XT DL] >>>>> >>>>> This time we total 5000+ packets for this login which becomes very >>>>> noticeable when the Netapp is 50ms away. >>>>> >>>>> I didn't understand why the first login was fine but the second goes >>>>> into this repeating access pattern. I set actimeo=3600 (long) but it >>>>> does not seem to affect it. >>>>> >>>>> I do not see this prior to v6.2 where repeated logins are equally fast >>>>> and we don't see the repeating access calls. >>>>> >>>>> So a bit of digging through the v6.2 changes and this looked like the >>>>> relevant change: >>>>> >>>>> commit 0eb43812c027 ("NFS: Clear the file access cache upon login”) >>>>> [PATCH] NFS: Judge the file access cache's timestamp in rcu path? >>>>> >>>>> I reverted those and got the prior (v6.1) performance. >>>>> >>>>> What constitutes a login exactly? I also have services like "sysstat" >>>>> or pcp that cause a systemd-logind to trigger regularly on our >>>>> machines.... does that count and invalidate the cache? >>>>> >>>>> Do the repeated access calls on the same handful of filehandles make >>>>> sense? Even prior to those patches (or v6.1) there are only a couple >>>>> of ACCESS calls to the Netapp on login. >>>>> >>>>> We are a bit unique in that we run quite a few WAN high latency NFS >>>>> workflows so are happy to trade long lived caches (e.g. actimeo and >>>>> even nocto on occasion) for lower ops at the expense of total >>>>> correctness. >>>> >>>> Thanks for the report. To be sure the issue doesn't fall through the >>>> cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression >>>> tracking bot: >>>> >>>> #regzbot ^introduced 0eb43812c027 >>>> #regzbot title nfs: flood of access on second log-in (first is fine) >>>> #regzbot ignore-activity >>>> >>>> This isn't a regression? This issue or a fix for it are already >>>> discussed somewhere else? It was fixed already? You want to clarify when >>>> the regression started to happen? Or point out I got the title or >>>> something else totally wrong? Then just reply and tell me -- ideally >>>> while also telling regzbot about it, as explained by the page listed in >>>> the footer of this mail. >>>> >>>> Developers: When fixing the issue, remember to add 'Link:' tags pointing >>>> to the report (the parent of this mail). See page linked in footer for >>>> details. >>>> >>>> Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) >>>> -- >>>> Everything you wanna know about Linux kernel regression tracking: >>>> https://linux-regtracking.leemhuis.info/about/#tldr >>>> That page also explains what to do if mails like this annoy you. > >