High I/O Wait Rates - RHEL 6.1 + GFS2 + NFS

Hi everyone,
 
I have an Active/Passive RHCS 6.1 cluster running 8TB of GFS2 with NFS on top, exporting 26 mount points to 250 NFS clients. The GFS2 filesystems are mounted with the noatime, nodiratime, data="" and localflocks options, and the SAN and servers are fast (4Gbps and 8Gbps, dual controllers in load balancing, H.A... quad-core, 48GB of memory...). The cluster has been doing its job (failover works fine...), but unfortunately I'm seeing high I/O wait rates, sometimes around 60-70% (which is very bad), plus a couple of glock_workqueue jobs, so I get a bunch of gfs2_quotad and nfsd errors and qdisk latency warnings. The debugfs glock dump didn't show me any "W" entries, only "G" and "H".
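
For reference, the fstab entries for the GFS2 mounts look roughly like this (the device path and mount point below are placeholders, not my real ones):

  /dev/mapper/san-vol01  /export/vol01  gfs2  noatime,nodiratime,localflocks  0 0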
 
Have you seen this before?
Does it look like glock contention?
What does it mean, and how could I get it fixed?
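
For what it's worth, this is how I've been checking the glocks (the cluster:fsname directory below is a placeholder for my real one):

  mount -t debugfs none /sys/kernel/debug
  cat /sys/kernel/debug/gfs2/mycluster:vol01/glocks

and then looking for holder lines carrying the W (wait) flag, e.g.:

  grep 'f:.*W' /sys/kernel/debug/gfs2/mycluster:vol01/glocks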
 
Thank you very much
 
 
Jun 27 18:48:05  kernel: INFO: task gfs2_quotad:19066 blocked for more than 120 seconds.
Jun 27 18:48:05  kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 27 18:48:05  kernel: gfs2_quotad   D 0000000000000004     0 19066      2 0x00000080
Jun 27 18:48:05  kernel: ffff880bb01e1c20 0000000000000046 0000000000000000 ffffffffa045ec6d
Jun 27 18:48:05  kernel: 0000000000000000 ffff880be6e2b000 ffff880bb01e1c50 00000001051d8b46
Jun 27 18:48:05  kernel: ffff880be4865af8 ffff880bb01e1fd8 000000000000f598 ffff880be4865af8
Jun 27 18:48:05  kernel: Call Trace:
Jun 27 18:48:05  kernel: [<ffffffffa045ec6d>] ? dlm_put_lockspace+0x1d/0x40 [dlm]
Jun 27 18:48:05  kernel: [<ffffffffa0525c50>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffffa0525c5e>] gfs2_glock_holder_wait+0xe/0x20 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffff814db87f>] __wait_on_bit+0x5f/0x90
Jun 27 18:48:05  kernel: [<ffffffffa0525c50>] ? gfs2_glock_holder_wait+0x0/0x20 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffff814db928>] out_of_line_wait_on_bit+0x78/0x90
Jun 27 18:48:05  kernel: [<ffffffff8108e140>] ? wake_bit_function+0x0/0x50
Jun 27 18:48:05  kernel: [<ffffffffa0526816>] gfs2_glock_wait+0x36/0x40 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffffa0529011>] gfs2_glock_nq+0x191/0x370 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffff8107a11b>] ? try_to_del_timer_sync+0x7b/0xe0
Jun 27 18:48:05  kernel: [<ffffffffa05427f8>] gfs2_statfs_sync+0x58/0x1b0 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffff814db52a>] ? schedule_timeout+0x19a/0x2e0
Jun 27 18:48:05  kernel: [<ffffffffa05427f0>] ? gfs2_statfs_sync+0x50/0x1b0 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffffa053a787>] quotad_check_timeo+0x57/0xb0 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffffa053aa14>] gfs2_quotad+0x234/0x2b0 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffff8108e100>] ? autoremove_wake_function+0x0/0x40
Jun 27 18:48:05  kernel: [<ffffffffa053a7e0>] ? gfs2_quotad+0x0/0x2b0 [gfs2]
Jun 27 18:48:05  kernel: [<ffffffff8108dd96>] kthread+0x96/0xa0
Jun 27 18:48:05  kernel: [<ffffffff8100c1ca>] child_rip+0xa/0x20
Jun 27 18:48:05  kernel: [<ffffffff8108dd00>] ? kthread+0x0/0xa0
Jun 27 18:48:05  kernel: [<ffffffff8100c1c0>] ? child_rip+0x0/0x20
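
From the trace it looks like gfs2_quotad is blocked in gfs2_statfs_sync, waiting on a glock that something else is holding. If the periodic statfs/quota syncs turn out to be the contention point, one thing I'm thinking of trying is relaxing their intervals via the GFS2 mount options (the values below are just examples, not tested recommendations):

  /dev/mapper/san-vol01  /export/vol01  gfs2  noatime,nodiratime,localflocks,statfs_quantum=60,quota_quantum=120  0 0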

Jun 27 19:49:07  kernel: __ratelimit: 57 callbacks suppressed
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 19:49:07  kernel: nfsd: peername failed (err 107)!
Jun 27 20:00:58  kernel: rpc-srv/tcp: nfsd: got error -104 when sending 140 bytes - shutting down socket
Jun 27 20:00:58  kernel: __ratelimit: 40 callbacks suppressed
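
For what it's worth, err 107 is ENOTCONN and -104 is ECONNRESET, i.e. the clients are dropping or resetting their TCP connections, presumably after timing out while the server sits in I/O wait. One mitigation I'm considering is raising the nfsd thread count from the RHEL 6 default of 8 (the value below is just an example):

  # /etc/sysconfig/nfs
  RPCNFSDCOUNT=64

  service nfs restart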
qdiskd[10078]: qdisk cycle took more than 1 second to complete (1.170000)
qdiskd[10078]: qdisk cycle took more than 1 second to complete (1.120000)
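
And the qdisk warnings suggest qdiskd's own reads to the quorum disk are being delayed by the same I/O load. If needed, the interval/tko could be loosened in cluster.conf to give it more headroom (the attribute values below are examples only, and the totem token timeout would need adjusting to match):

  <quorumd device="/dev/mapper/qdisk" interval="2" tko="10"/>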
 
Thanks
James S.
--
Linux-cluster mailing list
Linux-cluster@xxxxxxxxxx
https://www.redhat.com/mailman/listinfo/linux-cluster
