Re: canceling full heal 3.8

David Gossage <dgossage@xxxxxxxxxxxxxxxxxx> · Sat, 27 Aug 2016 15:07:32 -0500

On Sat, Aug 27, 2016 at 9:58 AM, David Gossage <dgossage@xxxxxxxxxxxxxxxxxx> wrote:
On Fri, Aug 26, 2016 at 8:40 PM, David Gossage <dgossage@xxxxxxxxxxxxxxxxxx> wrote:
I was in the process of redoing the underlying disk layout for a brick and triggered a full heal, then realized I had skipped the step of applying zfs set xattr=sa, which is fairly important when running ZFS under Linux.
Rather than wait however many hours until my TB of data heals, is there a command in 3.8 to cancel a heal begun by gluster volume heal GLUSTER1 full?  If not, it won't be the end of the world, just a waste of time to wait and then have to redo it after writing out a TB of data.

Does the heal process crawl from any particular node when invoked?  I have 3 nodes.  I ran the command from node 3; node 2 is the one with files needing to be healed; node 1 is the brick I healed yesterday but forgot to set xattr=sa on, which usually gives bad performance on zfsonlinux.  I did set it about 30 minutes into the heal, figuring better some than none until I could redo it again.

12 hours later the 1TB of data was healed, so I figured I'd move on to node 2, then 3.  Assuming 12-hour windows for each node, I could then redo node 1 with the correct settings before Monday.  When node 1 healed, it first found all the visible files from the mount point and .glusterfs; then the numbers jumped back up after those were done and it started finding shards.  That happened fairly quickly.  The 2nd time around, with node 2, it is crawling to a standstill while finding all the shards to heal.  I'm wondering if it's doing the crawl from node 1, and the poor settings that existed for the first 30 minutes of file heals are slowing it down.  If so, I would hope that once it finds the files that were created/healed while the settings weren't correct and moves past them, the rest should go faster.

The only errors in any logs are in the brick logs:

[2016-08-27 14:25:10.022786] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 3251237: LOOKUP (null) (00000000-0000-0000-0000-000000000000/4c7d44fc-a0c1-413b-8dc4-2abbbe1d4d4f.423) ==> (Invalid argument) [Invalid argument]
[2016-08-27 14:36:59.234073] W [MSGID: 115009] [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution type for (null) (LOOKUP)
[2016-08-27 14:36:59.234128] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 3288322: LOOKUP (null) (00000000-0000-0000-0000-000000000000/4c7d44fc-a0c1-413b-8dc4-2abbbe1d4d4f.328) ==> (Invalid argument) [Invalid argument]

And I would hope that it's just related to the heal process, or that when a shard is looked up and found not to exist here, it errors out as expected.
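One way to check whether those errors are all shard lookups against the same base file is to tally them per GFID. The following is only a rough sketch: it assumes the failed entries always look like the three lines above, with the looked-up name in the form <gfid>.<shard index>, and the function name summarize_failed_shard_lookups is made up for illustration.

```python
import re
from collections import Counter

# Matches failed-LOOKUP error lines shaped like the brick log excerpt above:
# "... LOOKUP (null) (<parent-gfid>/<base-gfid>.<index>) ==> (<error>) ..."
# Assumption: the name after the slash is "<gfid>.<shard index>".
LOOKUP_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] E \[MSGID: 115050\].*?"
    r"LOOKUP \(null\) "
    r"\((?P<parent>[0-9a-f-]{36})/(?P<base>[0-9a-f-]{36})\.(?P<index>\d+)\)"
    r" ==> \((?P<error>[^)]*)\)"
)


def summarize_failed_shard_lookups(lines):
    """Count failed shard LOOKUPs per base GFID and collect their shard indexes."""
    counts = Counter()
    indexes = {}
    for line in lines:
        match = LOOKUP_RE.search(line)
        if match is None:
            continue  # warnings and unrelated lines are skipped
        base = match.group("base")
        counts[base] += 1
        indexes.setdefault(base, []).append(int(match.group("index")))
    return counts, indexes


# The three lines quoted from the brick log above.
SAMPLE = [
    "[2016-08-27 14:25:10.022786] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 3251237: LOOKUP (null) (00000000-0000-0000-0000-000000000000/4c7d44fc-a0c1-413b-8dc4-2abbbe1d4d4f.423) ==> (Invalid argument) [Invalid argument]",
    "[2016-08-27 14:36:59.234073] W [MSGID: 115009] [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution type for (null) (LOOKUP)",
    "[2016-08-27 14:36:59.234128] E [MSGID: 115050] [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 3288322: LOOKUP (null) (00000000-0000-0000-0000-000000000000/4c7d44fc-a0c1-413b-8dc4-2abbbe1d4d4f.328) ==> (Invalid argument) [Invalid argument]",
]

counts, indexes = summarize_failed_shard_lookups(SAMPLE)
for base, n in counts.items():
    print(base, n, sorted(indexes[base]))
```

Run against a full brick log (read line by line instead of SAMPLE), this would show whether the failures cluster on a few sharded files or are spread across all of them.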

7 hours after starting the full heal, shards still haven't started healing, and the count from heal statistics heal-count has only reached 1800 out of 19000 shards.  The shards dir hasn't even been recreated yet.  Creation of the non-sharded stubs (do they have a more official term?) in the visible mount point was as speedy as expected.  Shards are painfully slow.
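To put those numbers in perspective, here is a back-of-the-envelope extrapolation. It is a sketch only: it assumes the crawl keeps finding shards at its average rate so far, which may well not hold if the slow part is limited to the files written before xattr=sa was set. The helper name estimate_remaining_hours is hypothetical.

```python
def estimate_remaining_hours(found, total, elapsed_hours):
    """Linear extrapolation of the time left, assuming a constant discovery rate."""
    if found <= 0 or elapsed_hours <= 0:
        raise ValueError("need some progress and elapsed time to extrapolate")
    rate = found / elapsed_hours  # shards found per hour so far
    return (total - found) / rate


# Figures from the message: 1800 of 19000 shards counted after 7 hours.
found, total, elapsed = 1800, 19000, 7.0
rate = found / elapsed
remaining = estimate_remaining_hours(found, total, elapsed)
print(
    f"{rate:.0f} shards/hour found, {found / total:.1%} of total, "
    f"~{remaining:.0f} hours left at this rate"
)
```

At that pace the crawl alone would need roughly 67 more hours just to find the shards, compared with the 12 hours the whole heal took on node 1.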

David Gossage

Carousel Checks Inc. | System Administrator

Office 708.613.2284
