Re: NFS mount lockups since about a month ago

Terry Barnaby <terry1@xxxxxxxxxxx> · Thu, 30 Sep 2021 09:35:18 +0100

Thanks for the feedback everyone.

This is a very lightly loaded system with just 3 users ATM and very 
little going on across the network (just editing code files etc). The 
problem occurred again yesterday. For about 10 minutes my KDE desktop 
locked up in 20 second bursts and then the problem went away for the 
rest of the day. During that time the desktop and server were 98.5% idle
and pings continued fine. A Konsole window doing an "ls /home" every 5
seconds was locked up in the ls. I had Konsole windows open running ping,
top and ls, and although I couldn't operate the desktop (move virtual
desktops etc) the ping and top windows were updating fine. There were no
error messages in /var/log/messages on either system and the sar stats
showed nothing out of the ordinary.

I am pretty sure the Ethernet network is fine, including cables, switches,
Ethernet adapters etc. Pings are fine. It just appears that the client
programs get a hugely delayed (> 20 secs) response to accesses to /home
every now and then, which points to NFS issues. Most of the system stats
counters just give the amount of access, not the latency of an access,
and latency is what I need to track down the problem as there are few
disk and network accesses going on.

As I said, all has been fine on this system until about a month ago and
the only obvious changes are the Fedora updates, so I wondered if anyone
knew whether there had been changes to the NFS stack recently and/or how
to log peak NFS latencies?
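
To make the "ls /home every 5 seconds" test record latency rather than just hang visibly, something like the following could be left running on the client. It is only a rough sketch: the function names, the 5 second interval and the 1 second threshold are my own assumptions, and a directory listing can be answered from the client's cache, so it may not hit the server on every sample.

```python
import os
import time


def time_listing(path):
    """Return the seconds taken to list the directory at `path`."""
    start = time.monotonic()
    os.listdir(path)
    return time.monotonic() - start


def probe(path, samples, interval, threshold, sleep=time.sleep):
    """Time `samples` listings of `path`, pausing `interval` seconds between them.

    Returns (peak_seconds, slow), where `slow` is a list of
    (timestamp, seconds) for every listing that took at least `threshold`.
    """
    peak = 0.0
    slow = []
    for _ in range(samples):
        elapsed = time_listing(path)
        peak = max(peak, elapsed)
        if elapsed >= threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            slow.append((stamp, elapsed))
            print(f"{stamp} slow listing of {path}: {elapsed:.3f}s")
        sleep(interval)
    return peak, slow


# Example (assumed values): one hour of samples against the NFS mount.
# peak, slow = probe("/home", samples=720, interval=5, threshold=1.0)
# print(f"peak latency: {peak:.3f}s, slow samples: {len(slow)}")
```

The timestamps of the slow samples could then be lined up against the sar data for the same minutes.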

Terry

On 26/09/2021 18:06, Roger Heflin wrote:
Make sure you have sar/sysstat enabled and changed to do 1 minute samples.

sar -d will show disk perf.  If one of the disks "blips" at the
firmware level (working on a hard to read block maybe), the util% on
that device will be significantly higher than all other disks so will
stand out.  Then you can look deeper at the smart data.

sar generically will show your cpu/system time and sar -n DEV will
show detailed network traffic, sar -n EDEV will show network errors.

With it set to 1 minute you should be able to detect most blips.

On Sun, Sep 26, 2021 at 10:26 AM Jamie Fargen <jamie@xxxxxxxxxxxxxx> wrote:
Are there network switches under your control? It sounds similar to what happens when the MTUs on the systems do not match, or when one system's MTU is set above the value on the switch ports.

Next time the issue occurs use ping with the do not fragment flag.
ex $ ping -M do -s 8972 ip.address

This example should be the highest value to work in the case of MTU size 9000: there are 28 bytes of overhead for an IPv4 ping (20 byte IPv4 header plus 8 byte ICMP header).
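
The arithmetic behind that -s value can be sketched as below, in case the MTU in use is not 9000. The function name is just for illustration, and it assumes an IPv4 header with no options.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options (assumption)
ICMP_HEADER = 8   # bytes, ICMP echo request header


def max_ping_payload(mtu):
    """Largest ping -s value that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


for mtu in (1500, 9000):
    print(f"MTU {mtu}: ping -M do -s {max_ping_payload(mtu)}")
```

A payload one byte larger than the computed value should fail with the do not fragment flag set if the path MTU really is what you expect.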

Second, are you sure no one is attaching to the network and duplicating the MAC address of your NFS server or perhaps the system that is stalled? If the switches are manageable you would have to ensure that the MAC addresses are being learned on the correct ports.

-Jamie

On Sun, Sep 26, 2021 at 10:24 AM Tom Horsley <horsley1953@xxxxxxxxx> wrote:
On Sun, 26 Sep 2021 10:26:19 -0300
George N. White III wrote:

If you have cron jobs that use a lot of network bandwidth it may work
fine until some network issue causing lots of retransmits bogs it down.
Which is why you should check the dumb stuff first! Has a critter
chewed on the ethernet cable to the server?
_______________________________________________
users mailing list -- users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/users@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure