Hi all,
There's something strange with one of our clusters running glusterfs version 6.8: it's quite slow and one node is overloaded.
This is a distributed-replicate cluster of four servers with the same specs/OS/versions. st2a is a replica of st2b, st2c is a replica of st2d, and op.version for this cluster remains at 50400:
Volume Name: st2
Type: Distributed-Replicate
Volume ID: 4755753b-37c4-403b-b1c8-93099bfc4c45
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: st2a:/vol3/st2
Brick2: st2b:/vol3/st2
Brick3: st2c:/vol3/st2
Brick4: st2d:/vol3/st2
Options Reconfigured:
cluster.rebal-throttle: aggressive
nfs.disable: on
performance.readdir-ahead: off
transport.address-family: inet6
performance.quick-read: off
performance.cache-size: 1GB
performance.io-cache: on
performance.io-thread-count: 16
cluster.data-self-heal-algorithm: full
network.ping-timeout: 20
server.event-threads: 2
client.event-threads: 2
cluster.readdir-optimize: on
performance.read-ahead: off
performance.parallel-readdir: on
cluster.self-heal-daemon: enable
storage.health-check-timeout: 20
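For reference, the layout above is taken from `gluster volume info st2`; if the full set of effective options (including defaults not listed under "Options Reconfigured") would help, I can dump it as well, roughly like this:
# volume layout and reconfigured options (shown above)
gluster volume info st2
# every option with its effective value, defaults included
gluster volume get st2 all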
All 50 of our clients mount this volume over FUSE, and in contrast with our other clusters this one is terribly slow.
The interesting thing is that HDD and network utilization are very low, yet one server is heavily overloaded.
Also, there are no files pending heal according to `gluster volume heal st2 info`.
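To be precise, the heal state was checked with the usual commands (the summary variant should be available on 6.x, as far as I know):
gluster volume heal st2 info
# shorter, per-brick counters only
gluster volume heal st2 info summary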
Load average across servers:
st2a:
load average: 28,73, 26,39, 27,44
st2b:
load average: 0,24, 0,46, 0,76
st2c:
load average: 0,13, 0,20, 0,27
st2d:
load average: 2,93, 2,11, 1,50
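The load averages above are simply `uptime` on each node, collected roughly like this (assuming passwordless ssh between the nodes):
for h in st2a st2b st2c st2d; do
    echo -n "$h: "; ssh "$h" uptime
done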
If we stop glusterfs on the st2a server, the cluster works as fast as expected.
Previously the cluster ran version 5.x and there were no such problems.
Interestingly, almost all of the CPU usage on st2a is "system" load.
The most CPU-intensive process is glusterfsd.
`top -H` for the glusterfsd process shows this:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13894 root 20 0 2172892 96488 9056 R 74,0 0,1 122:09.14 glfs_iotwr00a
13888 root 20 0 2172892 96488 9056 R 73,7 0,1 121:38.26 glfs_iotwr004
13891 root 20 0 2172892 96488 9056 R 73,7 0,1 121:53.83 glfs_iotwr007
13920 root 20 0 2172892 96488 9056 R 73,0 0,1 122:11.27 glfs_iotwr00f
13897 root 20 0 2172892 96488 9056 R 68,3 0,1 121:09.82 glfs_iotwr00d
13896 root 20 0 2172892 96488 9056 R 68,0 0,1 122:03.99 glfs_iotwr00c
13868 root 20 0 2172892 96488 9056 R 67,7 0,1 122:42.55 glfs_iotwr000
13889 root 20 0 2172892 96488 9056 R 67,3 0,1 122:17.02 glfs_iotwr005
13887 root 20 0 2172892 96488 9056 R 67,0 0,1 122:29.88 glfs_iotwr003
13885 root 20 0 2172892 96488 9056 R 65,0 0,1 122:04.85 glfs_iotwr001
13892 root 20 0 2172892 96488 9056 R 55,0 0,1 121:15.23 glfs_iotwr008
13890 root 20 0 2172892 96488 9056 R 54,7 0,1 121:27.88 glfs_iotwr006
13895 root 20 0 2172892 96488 9056 R 54,0 0,1 121:28.35 glfs_iotwr00b
13893 root 20 0 2172892 96488 9056 R 53,0 0,1 122:23.12 glfs_iotwr009
13898 root 20 0 2172892 96488 9056 R 52,0 0,1 122:30.67 glfs_iotwr00e
13886 root 20 0 2172892 96488 9056 R 41,3 0,1 121:26.97 glfs_iotwr002
13878 root 20 0 2172892 96488 9056 S 1,0 0,1 1:20.34 glfs_rpcrqhnd
13840 root 20 0 2172892 96488 9056 S 0,7 0,1 0:51.54 glfs_epoll000
13841 root 20 0 2172892 96488 9056 S 0,7 0,1 0:51.14 glfs_epoll001
13877 root 20 0 2172892 96488 9056 S 0,3 0,1 1:20.02 glfs_rpcrqhnd
13833 root 20 0 2172892 96488 9056 S 0,0 0,1 0:00.00 glusterfsd
13834 root 20 0 2172892 96488 9056 S 0,0 0,1 0:00.14 glfs_timer
13835 root 20 0 2172892 96488 9056 S 0,0 0,1 0:00.00 glfs_sigwait
13836 root 20 0 2172892 96488 9056 S 0,0 0,1 0:00.16 glfs_memsweep
13837 root 20 0 2172892 96488 9056 S 0,0 0,1 0:00.05 glfs_sproc0
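If it helps, I can collect more data from st2a; this is roughly what I plan to run next (assuming perf and strace are available there, and using 13833 as the glusterfsd PID from the top output above):
# where the io worker threads spend their system time (Ctrl-C prints the summary)
strace -c -f -p 13833
# live view of the kernel/user symbols the process is burning CPU in
perf top -p 13833
# per-brick FOP counts and latencies from gluster itself
gluster volume profile st2 start
gluster volume profile st2 info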
Also, I didn't find any relevant messages in the log files.
Honestly, I don't know what to do. Does anyone know how to debug or fix this behaviour?
Best regards,
Pavel