Hi all,
There's something strange with one of our clusters running glusterfs version 6.8: it's quite slow and one node is overloaded.
This is a distributed-replicate cluster of four servers with the same specs/OS/versions. st2a is a replica of st2b, st2c is a replica of st2d, and op.version for this cluster remains at 50400:
Volume Name: st2
Type: Distributed-Replicate
Volume ID: 4755753b-37c4-403b-b1c8-93099bfc4c45
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: st2a:/vol3/st2
Brick2: st2b:/vol3/st2
Brick3: st2c:/vol3/st2
Brick4: st2d:/vol3/st2
Options Reconfigured:
cluster.rebal-throttle: aggressive
nfs.disable: on
performance.readdir-ahead: off
transport.address-family: inet6
performance.quick-read: off
performance.cache-size: 1GB
performance.io-cache: on
performance.io-thread-count: 16
cluster.data-self-heal-algorithm: full
network.ping-timeout: 20
server.event-threads: 2
client.event-threads: 2
cluster.readdir-optimize: on
performance.read-ahead: off
performance.parallel-readdir: on
cluster.self-heal-daemon: enable
storage.health-check-timeout: 20
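For reference, the layout above is taken from `gluster volume info st2`; if the full set of effective options (including defaults not listed under "Options Reconfigured") would help, I can dump it as well, roughly like this:
# volume layout and reconfigured options (shown above)
gluster volume info st2
# every option with its effective value, defaults included
gluster volume get st2 all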
All 50 of our clients mount this volume over FUSE, and in contrast with our other clusters this one is terribly slow.
The interesting thing is that HDD and network utilization are very low, yet one server is heavily overloaded.
Also, there are no files pending heal according to `gluster volume heal st2 info`.
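To be precise, the heal state was checked with the usual commands (the summary variant should be available on 6.x, as far as I know):
gluster volume heal st2 info
# shorter, per-brick counters only
gluster volume heal st2 info summary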
Load average across servers:
st2a:
load average: 28,73, 26,39, 27,44
st2b:
load average: 0,24, 0,46, 0,76
st2c:
load average: 0,13, 0,20, 0,27
st2d:
load average: 2,93, 2,11, 1,50
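The load averages above are simply `uptime` on each node, collected roughly like this (assuming passwordless ssh between the nodes):
for h in st2a st2b st2c st2d; do
    echo -n "$h: "; ssh "$h" uptime
done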
If we stop glusterfs on the st2a server, the cluster works as fast as expected.
Previously the cluster ran version 5.x and there were no such problems.
Interestingly, almost all of the CPU usage on st2a is "system" load.
The most CPU-intensive process is glusterfsd.
`top -H` for the glusterfsd process shows this:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
13894 root 20 0 2172892 96488 9056 R 74,0 0,1 122:09.14 glfs_iotwr00a
13888 root 20 0 2172892 96488 9056 R 73,7 0,1 121:38.26 glfs_iotwr004
13891 root 20 0 2172892 96488 9056 R 73,7 0,1 121:53.83 glfs_iotwr007
13920 root 20 0 2172892 96488 9056 R 73,0 0,1 122:11.27 glfs_iotwr00f
13897 root 20 0 2172892 96488 9056 R 68,3 0,1 121:09.82 glfs_iotwr00d
13896 root 20 0 2172892 96488 9056 R 68,0 0,1 122:03.99 glfs_iotwr00c
13868 root 20 0 2172892 96488 9056 R 67,7 0,1 122:42.55 glfs_iotwr000
13889 root 20 0 2172892 96488 9056 R 67,3 0,1 122:17.02 glfs_iotwr005
13887 root 20 0 2172892 96488 9056 R 67,0 0,1 122:29.88 glfs_iotwr003
13885 root 20 0 2172892 96488 9056 R 65,0 0,1 122:04.85 glfs_iotwr001
13892 root 20 0 2172892 96488 9056 R 55,0 0,1 121:15.23 glfs_iotwr008
13890 root 20 0 2172892 96488 9056 R 54,7 0,1 121:27.88 glfs_iotwr006
13895 root 20 0 2172892 96488 9056 R 54,0 0,1 121:28.35 glfs_iotwr00b
13893 root 20 0 2172892 96488 9056 R 53,0 0,1 122:23.12 glfs_iotwr009
13898 root 20 0 2172892 96488 9056 R 52,0 0,1 122:30.67 glfs_iotwr00e
13886 root 20 0 2172892 96488 9056 R 41,3 0,1 121:26.97 glfs_iotwr002
13878 root 20 0 2172892 96488 9056 S 1,0 0,1 1:20.34 glfs_rpcrqhnd
13840 root 20 0 2172892 96488 9056 S 0,7 0,1 0:51.54 glfs_epoll000
13841 root 20 0 2172892 96488 9056 S 0,7 0,1 0:51.14 glfs_epoll001
13877 root 20 0 2172892 96488 9056 S 0,3 0,1 1:20.02 glfs_rpcrqhnd
13833 root 20 0 2172892 96488 9056 S 0,0 0,1 0:00.00 glusterfsd
13834 root 20 0 2172892 96488 9056 S 0,0 0,1 0:00.14 glfs_timer
13835 root 20 0 2172892 96488 9056 S 0,0 0,1 0:00.00 glfs_sigwait
13836 root 20 0 2172892 96488 9056 S 0,0 0,1 0:00.16 glfs_memsweep
13837 root 20 0 2172892 96488 9056 S 0,0 0,1 0:00.05 glfs_sproc0
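If it helps, I can collect more data from st2a; this is roughly what I plan to run next (assuming perf and strace are available there, and using 13833 as the glusterfsd PID from the top output above):
# where the io worker threads spend their system time (Ctrl-C prints the summary)
strace -c -f -p 13833
# live view of the kernel/user symbols the process is burning CPU in
perf top -p 13833
# per-brick FOP counts and latencies from gluster itself
gluster volume profile st2 start
gluster volume profile st2 info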
Also, I didn't find any relevant messages in the log files.
Honestly, I don't know what to do. Does anyone know how to debug or fix this behaviour?
Best regards,
Pavel