Re: 3.8.2 : Node not healing


 



On 20/08/2016 9:28 PM, Pranith Kumar Karampuri wrote:
Lindsay,
Please do "gluster volume set <volname> data-self-heal-algorithm full" (if you haven't already) to prevent diff self-heals (checksum computations on the files), which use a lot of CPU.

I'll give that a spin and see how it works out - it's a toss-up as to which is the bigger resource problem, CPU or bandwidth :)
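
For reference, I assume that boils down to the following on my volume (the "volume get" line is my guess at the full option key, cluster.data-self-heal-algorithm - correct me if that's wrong):

gluster volume set datastore4 data-self-heal-algorithm full
gluster volume get datastore4 cluster.data-self-heal-algorithm   # confirm it took effect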

One more thing that could have led to a lot of CPU is full directory heals on .shard. Krutika recently implemented a feature called granular entry self-heal which should address this issue in future. We have a throttling feature coming along in future as well to play nice with the rest of the system.

I already have "cluster.granular-entry-heal: on" and "cluster.locking-scheme: granular" set, or are you saying that feature has improvements yet to come?
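
(FWIW, I checked the current values with "gluster volume get", which as far as I know reports the effective setting per option:)

gluster volume get datastore4 cluster.granular-entry-heal
gluster volume get datastore4 cluster.locking-scheme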


Anyway, I'm not really looking at CPU hogging (well, not much anyway :)); rather I was trying to find out why heals were not starting. With my first test I had 25,000 shards needing healing and nothing happened for over 3 hours until I shut down all the VMs on the node and restarted it.
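
(The numbers above come from the usual heal status commands - I believe these are the relevant ones:)

gluster volume heal datastore4 statistics heal-count   # per-brick count of entries pending heal
gluster volume heal datastore4 info                    # list the pending entries themselves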


I did the same test yesterday (roughly the commands sketched below):
- killed all gluster processes on a node
- waited until heal-count rose to 1500
- restarted gluster on that node
- nothing happened for 45 minutes (heal-count stayed at 1500)
- I shut down all VMs on that node
- healing started within several minutes and completed in under half an hour
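
Roughly what that looked like on the command line (service names are from my Proxmox/Debian boxes, and the kill step is deliberately blunt - adjust for your environment):

# on the node under test: kill management, brick and self-heal daemon processes
pkill glusterd; pkill glusterfsd; pkill glusterfs

# from another node: watch the pending heal count climb
watch -n 60 'gluster volume heal datastore4 statistics heal-count'

# once it hit ~1500, restart gluster on the test node
systemctl start glusterfs-server    # or glusterd, depending on packaging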

Which leads me to wonder whether having active local I/O on a gluster node when you crash and restart the gluster processes (as opposed to rebooting the node) blocks the heals from starting.

If so, it's not a huge issue for me - typically that will never happen, as gluster never actually crashes on me :) The most likely scenarios are rolling upgrades or hard reboots.



gluster v info

Volume Name: datastore4
Type: Replicate
Volume ID: 0ba131ef-311d-4bb1-be46-596e83b2f6ce
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: vnb.proxmox.softlog:/tank/vmdata/datastore4
Brick2: vng.proxmox.softlog:/tank/vmdata/datastore4
Brick3: vna.proxmox.softlog:/tank/vmdata/datastore4
Options Reconfigured:
cluster.locking-scheme: granular
cluster.granular-entry-heal: on
performance.readdir-ahead: on
cluster.self-heal-window-size: 1024
cluster.data-self-heal: on
features.shard: on
cluster.quorum-type: auto
cluster.server-quorum-type: server
nfs.disable: on
nfs.addr-namelookup: off
nfs.enable-ino32: off
performance.strict-write-ordering: off
performance.stat-prefetch: on
performance.quick-read: off
performance.read-ahead: off
performance.io-cache: off
cluster.eager-lock: enable
network.remote-dio: enable
features.shard-block-size: 64MB
cluster.background-self-heal-count: 16


--
Lindsay Mathieson




