We have a GlusterFS cluster which consists of 4 nodes with one brick each and a distributed-replicated volume of 72 TB.
Today I extended the cluster to 8 machines and added new bricks to the volume, so it now contains 8 bricks.
I didn’t start the rebalance yet, to limit the impact during the day, but to my surprise the CPU usage of all glusterfsd processes went sky-high and performance was really, really bad. So I effectively caused downtime on our storage service, which I hadn’t anticipated, since I hadn’t even started a rebalance.
Can somebody explain why adding bricks to a volume causes this high CPU usage? I can imagine metadata needs to be synced, but if that is so heavy, why can’t I tune it?
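For context, here is roughly what I ran (brick order reconstructed from the volume info below; `cluster.rebal-throttle` is the only rebalance-related tuning knob I’m aware of, and I assume it only takes effect during an actual rebalance, not during add-brick itself):

```shell
# Expand the replica-2 volume: bricks are added in replica-sized pairs,
# so -01/-02 and -03/-04 each form a new replica set.
gluster volume add-brick gv0 \
    v39-app-01:/glusterfs/bricks/brick1/gv0 \
    v39-app-02:/glusterfs/bricks/brick1/gv0 \
    v39-app-03:/glusterfs/bricks/brick1/gv0 \
    v39-app-04:/glusterfs/bricks/brick1/gv0

# The step I expected to be expensive -- deliberately NOT run yet:
# gluster volume rebalance gv0 start

# The only throttle option I could find (values: lazy, normal, aggressive):
gluster volume set gv0 cluster.rebal-throttle lazy
```

If the load after add-brick comes from something other than rebalance (e.g. self-heal or layout changes), I don’t see which option would throttle that.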
Volume Name: gv0
Type: Distributed-Replicate
Volume ID: 0322f20f-e507-492b-91db-cb4c953a24eb
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: s-s35-06:/glusterfs/bricks/brick1/brick
Brick2: s-s35-07:/glusterfs/bricks/brick1/brick
Brick3: s-s35-08:/glusterfs/bricks/brick1/brick
Brick4: s-s35-09:/glusterfs/bricks/brick1/brick
Brick5: v39-app-01:/glusterfs/bricks/brick1/gv0
Brick6: v39-app-02:/glusterfs/bricks/brick1/gv0
Brick7: v39-app-03:/glusterfs/bricks/brick1/gv0
Brick8: v39-app-04:/glusterfs/bricks/brick1/gv0
Options Reconfigured:
performance.cache-size: 256MB
nfs.disable: on
geo-replication.indexing: off
geo-replication.ignore-pid-check: on
changelog.changelog: on
performance.io-thread-count: 32
performance.write-behind-window-size: 5MB