Re: NFS mount lockups since about a month ago

On 01/10/2021 19:05, Roger Heflin wrote:
It will show latency.  await is the average I/O time in ms, and %util
is calculated from await and iops/sec.  As long as you turn sar down to
1-minute samples it should tell you which of the 2 disks had the higher
await/%util.  With a 10-minute sample the 40-second pause may get
spread out across enough iops that you cannot see it.
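
For example, a minimal way to watch the two disks live (sda and sdb are
placeholders; substitute your raid members):

  # report per-disk stats, including await and %util, every 60 seconds;
  # -p resolves dev8-0 style names to sda/sdb
  sar -d -p 60

sar can also read the historical files under /var/log/sa, but only at
whatever interval the sysstat collector was configured to sample.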

If one disk pauses, that disk's utilization will be significantly
higher than the other disk's, and if utilization is much higher for the
same or fewer IOPS that is generally a bad sign.  2 similar disks with
similar iops will generally have similar util.  The math is close to
(iops * await / 10), which returns a percent.
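
A quick worked example with made-up numbers: a disk doing 100 iops at
5 ms average await works out to 100 * 5 / 10 = 50% util, while its
mirror doing the same 100 iops at 25 ms await works out to
100 * 25 / 10 = 250, i.e. pinned at 100% util -- the "much higher util
for the same iops" signature.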

Are you using MDraid or hardware raid?  Doing a "grep mddevice
/var/log/messages" (substituting your md device name for mddevice) will
show if md forced a rewrite and/or logged a slow event.
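
A sketch of that check, assuming the array is md0:

  # look for resync/rewrite or error messages mentioning the array
  grep md0 /var/log/messages
  # on a systemd box the kernel ring buffer works too
  journalctl -k | grep md0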

you can do this on those disks:
  smartctl -l scterc,20,20 /dev/<device>

I believe 20 (2.0 seconds) is as low as a WD Red lets you go, according
to my tests.  If the disk hangs it will hang for 2 seconds versus the
current default (it should be 7 seconds, and it really depends on how
many bad blocks there are together that it tries to read).  Setting it
to 2.0 seconds makes the overall timeout about 3.5x smaller, so if that
reduces the hang time by about that factor, that is confirmation that
it is a disk issue.
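
To confirm the drive accepted it, the ERC values can be read back
(the device name is a placeholder):

  # show the current SCT error recovery control read/write timeouts
  smartctl -l scterc /dev/sdb

Note that on most drives the setting does not survive a power cycle,
so it would need to be reapplied at boot (from a udev rule or
rc.local, for instance).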

and do this on the disks:
  smartctl --all /dev/sdb | grep -E '(Reallocated|Current_Pend|Offline_Unc)'


If any of those 3 is nonzero in the last column, that may be the
issue.  The smart firmware will fail disks that are perfectly fine,
and it will fail to fail horribly bad disks.  The PASS/FAIL
absolutely cannot be trusted no matter what it says.  FAIL is more
often right, but PASS is often unreliable.

So if any are nonzero, note the numbers, and at the next pause look
again and see if the numbers changed.
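
One way to snapshot those counters with a timestamp so they can be
compared after the next pause (the device list and log path are
placeholders):

  # append a timestamped snapshot of the three counters for each disk
  for d in /dev/sda /dev/sdb; do
      echo "$(date -Is) $d" >> /root/smart-counters.log
      smartctl -A "$d" | grep -E '(Reallocated|Current_Pend|Offline_Unc)' \
          >> /root/smart-counters.log
  done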
_______________________________________________

Thanks for the info, I am using MDraid. There are no "mddevice" messages in /var/log/messages, and smartctl -a lists no errors on any of the disks. The disks are about 3 years old; I replace the disks in my servers when they are between 3 and 4 years old.

I will create a program to parse sar's output and detect any discrepancies, since this problem only occurs now and then, and will also measure I/O latency on NFS accesses from the clients, to see if I can track down whether it is a server disk issue or an NFS issue. Thanks again for the info.
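
Something like this could be a starting point for the sar side (sda/sdb
and the 100 ms threshold are just placeholders, and the await column
position varies between sysstat versions -- in sysstat 12 it is the
next-to-last field of "sar -d" output):

  # print any 1-minute sample where a disk's await exceeds 100 ms
  sar -d -p 60 | awk '/sd[ab]/ {
      await = $(NF-1)          # next-to-last field: await (ms)
      if (await + 0 > 100)     # numeric compare; threshold is arbitrary
          print "high await:", $0
  }'

For the client side, nfsiostat (from nfs-utils) reports per-mount
average RTT and execution times, which could serve as the NFS-side
measurement.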


