Hello,

I'm back with these NFS problems. The server and the client have both been updated, but the problem still arises from time to time.

Server:
Linux robin.legi.grenoble-inp.fr 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Client:
Linux grivola.legi.grenoble-inp.fr 3.10.0-1127.18.2.el7.x86_64 #1 SMP Sun Jul 26 15:27:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

Both run CentOS Linux release 7.8.2003 (Core).

It seems related to an scp session: the NFS client downloads a large data set from a remote server and stores the files on its NFS file system. On the client I get messages like these in /var/log/messages:

Aug 28 10:03:08 grivola kernel: INFO: task scp:78495 blocked for more than 120 seconds.
Aug 28 10:03:08 grivola kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 28 10:03:08 grivola kernel: scp D ffff97e37fa9acc0 0 78495 147369 0x00000084
Aug 28 10:03:08 grivola kernel: Call Trace:
Aug 28 10:03:08 grivola kernel: [<ffffffff92783ef0>] ? bit_wait+0x50/0x50
Aug 28 10:03:08 grivola kernel: [<ffffffff92785da9>] schedule+0x29/0x70
Aug 28 10:03:08 grivola kernel: [<ffffffff927838b1>] schedule_timeout+0x221/0x2d0
Aug 28 10:03:08 grivola kernel: [<ffffffffc132e7e6>] ? rpc_run_task+0xf6/0x150 [sunrpc]
Aug 28 10:03:08 grivola kernel: [<ffffffffc133d850>] ? rpc_put_task+0x10/0x20 [sunrpc]
Aug 28 10:03:08 grivola kernel: [<ffffffff92783ef0>] ? bit_wait+0x50/0x50
Aug 28 10:03:08 grivola kernel: [<ffffffff9278549d>] io_schedule_timeout+0xad/0x130
Aug 28 10:03:08 grivola kernel: [<ffffffff92785538>] io_schedule+0x18/0x20
Aug 28 10:03:08 grivola kernel: [<ffffffff92783f01>] bit_wait_io+0x11/0x50
Aug 28 10:03:08 grivola kernel: [<ffffffff92783a27>] __wait_on_bit+0x67/0x90
Aug 28 10:03:08 grivola kernel: [<ffffffff921bd741>] wait_on_page_bit+0x81/0xa0
Aug 28 10:03:08 grivola kernel: [<ffffffff920c7840>] ? wake_bit_function+0x40/0x40
Aug 28 10:03:08 grivola kernel: [<ffffffff921bd871>] __filemap_fdatawait_range+0x111/0x190
Aug 28 10:03:08 grivola kernel: [<ffffffff921bd904>] filemap_fdatawait_range+0x14/0x30
Aug 28 10:03:08 grivola kernel: [<ffffffff921bd947>] filemap_fdatawait+0x27/0x30
Aug 28 10:03:08 grivola kernel: [<ffffffff921bfd1c>] filemap_write_and_wait+0x4c/0x80
Aug 28 10:03:08 grivola kernel: [<ffffffffc097ddd0>] nfs_wb_all+0x20/0x100 [nfs]
Aug 28 10:03:08 grivola kernel: [<ffffffffc09700e0>] nfs_setattr+0x1f0/0x210 [nfs]
Aug 28 10:03:08 grivola kernel: [<ffffffff9226cecc>] notify_change+0x30c/0x4d0
Aug 28 10:03:08 grivola kernel: [<ffffffff9224af05>] do_truncate+0x75/0xc0
Aug 28 10:03:08 grivola kernel: [<ffffffff92250118>] ? __sb_start_write+0x58/0x120
Aug 28 10:03:08 grivola kernel: [<ffffffff9224b329>] do_sys_ftruncate.constprop.14+0x139/0x1a0
Aug 28 10:03:08 grivola kernel: [<ffffffff9224b3ce>] SyS_ftruncate+0xe/0x10
Aug 28 10:03:08 grivola kernel: [<ffffffff92792ed2>] system_call_fastpath+0x25/0x2a

At this point the NFS server freezes. Even an ssh session or the local console (via iDRAC, or a screen and keyboard physically plugged into the server) does not respond. I see no special messages on the NFS server.

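To line these client-side warnings up with whatever the server logs around the same time, it may help to pull the hung-task timestamps out of /var/log/messages first. The following is only a minimal sketch in Python, assuming the log lines have exactly the shape shown above. The name find_hung_tasks and its year argument are made up for illustration (syslog timestamps carry no year), and month names are assumed to be in the C/English locale.

```python
import re
from datetime import datetime

# Matches the kernel hung-task warning as it appears in /var/log/messages,
# e.g. "Aug 28 10:03:08 grivola kernel: INFO: task scp:78495 blocked for
# more than 120 seconds."
HUNG_TASK = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) kernel: "
    r"INFO: task (?P<task>[^:]+):(?P<pid>\d+) blocked for more than "
    r"(?P<secs>\d+) seconds"
)


def find_hung_tasks(lines, year):
    """Return (datetime, host, task, pid, seconds) for each hung-task warning.

    `year` is supplied by the caller because syslog timestamps omit it
    (an assumption of this sketch, not something the log provides).
    """
    events = []
    for line in lines:
        m = HUNG_TASK.match(line)
        if m is None:
            continue  # not a hung-task warning, skip it
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        events.append(
            (when, m["host"], m["task"], int(m["pid"]), int(m["secs"]))
        )
    return events


if __name__ == "__main__":
    # Reads the client log; the year is passed by hand.
    with open("/var/log/messages", errors="replace") as log:
        for event in find_hung_tasks(log, 2020):
            print(event)
```

The resulting timestamps can then be compared by hand with the server's log for the same minutes.
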
The freeze period ends with the following messages.

On the server:

Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID
Aug 28 10:20:26 robin kernel: NFSD: client 194.254.66.26 testing state ID with incorrect client ID

And on the client:

Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK
Aug 28 10:20:26 grivola kernel: nfs: server robin.legi.grenoble-inp.fr OK

I do not know how to investigate this.

Patrick

On 09/07/2020 at 12:11, Patrick Bégou wrote:
> Hi Orion,
>
> no, I still have this problem. I delayed working on it because the latest
> updates had not been installed on the server and on the client. I'll
> work on this problem again as soon as possible.
>
> Thanks Charles for your detailed information on how to track this
> problem. I'll check all these metrics.
>
> I have several clients for this nfs server and the problem seems to occur
> only from the client using nfs 4.1 on CentOS Linux release 7.7.1908 (Core).
> The default options used are:
> rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=194.254.xx.xx,local_lock=none,addr=194.254.yy.yy
>
> On older clients (Red Hat Enterprise Linux Server release 6.7
> (Santiago)) the default options are:
> rw,intr,hard,sloppy,vers=4,addr=194.254.xx.xx,clientaddr=194.254.yy.yy
>
> The server is on CentOS 7.6.1810.
>
> I will see if the latest updates help to solve the problem.
>
> Patrick
>
> On 03/07/2020 at 00:05, Orion Poplawski wrote:
>> On 6/1/20 3:08 AM, Patrick Bégou wrote:
>>> On 13/05/2020 at 02:13, Orion Poplawski wrote:
>>>> On 5/12/20 2:46 AM, Patrick Bégou wrote:
>>>>> Hi,
>>>>>
>>>>> I need some help with NFSv4 setup/tuning. I have a dedicated nfs
>>>>> server (2 x E5-2620, 8 cores/16 threads each, 64GB RAM, 1x10Gb
>>>>> ethernet and 16x 8TB HDD) used by two servers and a small cluster
>>>>> (400 cores). All the servers are running CentOS 7, the cluster is
>>>>> running CentOS 6.
>>>>>
>>>>> From time to time on the server I get:
>>>>>
>>>>> kernel: NFSD: client xxx.xxx.xxx.xxx testing state ID with incorrect client ID
>>>>>
>>>>> And the client xxx.xxx.xxx.xxx freezes with:
>>>>>
>>>>> kernel: nfs: server xxxxx.legi.grenoble-inp.fr not responding, still trying
>>>>> kernel: nfs: server xxxxx.legi.grenoble-inp.fr OK
>>>>> kernel: nfs: server xxxxx.legi.grenoble-inp.fr not responding, still trying
>>>>> kernel: nfs: server xxxxx.legi.grenoble-inp.fr OK
>>>>>
>>>>> There is a discussion on Red Hat 7 support about this but it is only
>>>>> open to subscribers. Other searches with Google do not provide
>>>>> useful information.
>>>> FYI - you can get access to such info with a free RHEL developers
>>>> account.
>>>>
>>>>
>>> Thanks for your suggestion. As the problem is back I've subscribed to
>>> access the full content of this discussion.
>>>
>>> The answer was "do not use antivirus" :-(. I do not use antivirus as I
>>> am CentOS only.
>>>
>>> Patrick
>>>
>> Just curious to see if you have had any luck resolving these issues?
>> I'm afraid that NFS on EL 7 has become much less stable for us
>> recently as well with lots more client access hangs.
>>
>> Orion
>>
> _______________________________________________
> CentOS mailing list
> CentOS@xxxxxxxxxx
> https://lists.centos.org/mailman/listinfo/centos