Response inline.

----- Original Message -----
> From: "Krutika Dhananjay" <kdhananj@xxxxxxxxxx>
> To: "David Gossage" <dgossage@xxxxxxxxxxxxxxxxxx>
> Cc: "gluster-users@xxxxxxxxxxx List" <Gluster-users@xxxxxxxxxxx>
> Sent: Monday, August 29, 2016 3:55:04 PM
> Subject: Re: 3.8.3 Shards Healing Glacier Slow
>
> Could you attach both client and brick logs? Meanwhile I will try these
> steps out on my machines and see if it is easily reproducible.
>
> -Krutika
>
> On Mon, Aug 29, 2016 at 2:31 PM, David Gossage
> <dgossage@xxxxxxxxxxxxxxxxxx> wrote:
>
> > Centos 7 Gluster 3.8.3
> >
> > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> > Options Reconfigured:
> > cluster.data-self-heal-algorithm: full
> > cluster.self-heal-daemon: on
> > cluster.locking-scheme: granular
> > features.shard-block-size: 64MB
> > features.shard: on
> > performance.readdir-ahead: on
> > storage.owner-uid: 36
> > storage.owner-gid: 36
> > performance.quick-read: off
> > performance.read-ahead: off
> > performance.io-cache: off
> > performance.stat-prefetch: on
> > cluster.eager-lock: enable
> > network.remote-dio: enable
> > cluster.quorum-type: auto
> > cluster.server-quorum-type: server
> > server.allow-insecure: on
> > cluster.self-heal-window-size: 1024
> > cluster.background-self-heal-count: 16
> > performance.strict-write-ordering: off
> > nfs.disable: on
> > nfs.addr-namelookup: off
> > nfs.enable-ino32: off
> > cluster.granular-entry-heal: on
> >
> > Friday did rolling upgrade from 3.8.3->3.8.3, no issues.
> > Following the steps detailed in previous recommendations, began the
> > process of replacing and healing bricks one node at a time:
> >
> > 1) kill pid of brick
> > 2) reconfigure brick from raid6 to raid10
> > 3) recreate directory of brick
> > 4) gluster volume start <> force
> > 5) gluster volume heal <> full

Hi,

I'd suggest not using full heal; there are a few bugs in full heal.
Better safe than sorry ;) Instead I'd suggest the following steps:

1) kill pid of brick
2) reconfigure the brick as you need
3) recreate the brick dir
4) while the brick is still down, from the mount point:
   a) create a dummy, non-existent dir under / of the mount.
   b) set a dummy, non-existent extended attribute on / of the mount.
Doing these steps will ensure that heal happens only from the updated
brick to the down brick.
5) gluster v start <> force
6) gluster v heal <>
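For example, using the volume name GLUSTER1 that appears in your brick
logs, and assuming the volume is mounted at /mnt/glusterfs (a made-up
path, adjust it to your setup), steps 4a and 4b could look like:

    # from a client mount, while the replaced brick is still down
    mkdir /mnt/glusterfs/nonexistent-dummy-dir
    setfattr -n user.nonexistent-dummy-attr -v 1 /mnt/glusterfs

The dummy names are arbitrary. The point is just to perform one entry
operation and one metadata operation on the root of the volume while
the brick is down, so that the surviving bricks record pending changes
against it and heal is forced to run from them to the new brick.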
> > 1st node worked as expected: took 12 hours to heal 1TB of data.
> > Load was a little heavy but nothing shocking.
> >
> > About an hour after node 1 finished I began the same process on
> > node 2. The heal process kicked in as before, and the files in
> > directories visible from the mount and in .glusterfs healed in a
> > short time. Then it began the crawl of .shard, adding those files
> > to the heal count, at which point the entire process basically
> > ground to a halt. After 48 hours, out of 19k shards it has added
> > 5900 to the heal list. Load on all 3 machines is negligible. It was
> > suggested to change cluster.data-self-heal-algorithm to full and
> > restart the volume, which I did. No effect. Tried relaunching heal,
> > no effect, no matter which node I picked. I started each VM and
> > performed a stat of all files from within it, or a full virus scan,
> > and that seemed to cause short small spikes in shards added, but
> > not by much. Logs are showing no real messages indicating anything
> > is going on. I get hits in the brick log on occasion for null
> > lookups, making me think it's not really crawling the .shard
> > directory but waiting for a shard lookup to add it.
> >
> > I'll get the following in the brick log, but not constantly, and
> > sometimes multiple times for the same shard:
> >
> > [2016-08-29 08:31:57.478125] W [MSGID: 115009]
> > [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no
> > resolution type for (null) (LOOKUP)
> > [2016-08-29 08:31:57.478170] E [MSGID: 115050]
> > [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server:
> > 12591783: LOOKUP (null)
> > (00000000-0000-0000-0000-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221)
> > ==> (Invalid argument) [Invalid argument]
> >
> > This one repeated about 30 times in a row, then nothing for 10
> > minutes, then one hit for a different shard by itself.
> >
> > How can I determine if heal is actually running? How can I kill it
> > or force a restart? Does the node I start it from determine which
> > directory gets crawled to determine heals?
> >
> > David Gossage
> > Carousel Checks Inc. | System Administrator
> > Office 708.613.2284

--
Thanks,
Anuradha.

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
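For what it's worth, on the quoted question of determining whether heal
is actually running: in 3.8.x the usual checks are the heal
sub-commands below (volume name again taken from the brick log above;
exact output format varies by release), plus the self-heal daemon's own
log at /var/log/glusterfs/glustershd.log on each node, which is
generally more informative than the brick logs:

    gluster volume heal GLUSTER1 info                   # entries still pending, per brick
    gluster volume heal GLUSTER1 statistics heal-count  # just the pending counts
    gluster volume heal GLUSTER1 statistics             # crawl history of the self-heal daemon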