----- Original Message -----
> From: "David Gossage" <dgossage@xxxxxxxxxxxxxxxxxx>
> To: "Anuradha Talur" <atalur@xxxxxxxxxx>
> Cc: "gluster-users@xxxxxxxxxxx List" <Gluster-users@xxxxxxxxxxx>, "Krutika Dhananjay" <kdhananj@xxxxxxxxxx>
> Sent: Monday, August 29, 2016 5:12:42 PM
> Subject: Re: 3.8.3 Shards Healing Glacier Slow
>
> On Mon, Aug 29, 2016 at 5:39 AM, Anuradha Talur <atalur@xxxxxxxxxx> wrote:
>
> > Response inline.
> >
> > ----- Original Message -----
> > > From: "Krutika Dhananjay" <kdhananj@xxxxxxxxxx>
> > > To: "David Gossage" <dgossage@xxxxxxxxxxxxxxxxxx>
> > > Cc: "gluster-users@xxxxxxxxxxx List" <Gluster-users@xxxxxxxxxxx>
> > > Sent: Monday, August 29, 2016 3:55:04 PM
> > > Subject: Re: 3.8.3 Shards Healing Glacier Slow
> > >
> > > Could you attach both client and brick logs? Meanwhile I will try these
> > > steps out on my machines and see if it is easily recreatable.
> > >
> > > -Krutika
> > >
> > > On Mon, Aug 29, 2016 at 2:31 PM, David Gossage
> > > <dgossage@xxxxxxxxxxxxxxxxxx> wrote:
> > >
> > > CentOS 7, Gluster 3.8.3
> > >
> > > Brick1: ccgl1.gl.local:/gluster1/BRICK1/1
> > > Brick2: ccgl2.gl.local:/gluster1/BRICK1/1
> > > Brick3: ccgl4.gl.local:/gluster1/BRICK1/1
> > > Options Reconfigured:
> > > cluster.data-self-heal-algorithm: full
> > > cluster.self-heal-daemon: on
> > > cluster.locking-scheme: granular
> > > features.shard-block-size: 64MB
> > > features.shard: on
> > > performance.readdir-ahead: on
> > > storage.owner-uid: 36
> > > storage.owner-gid: 36
> > > performance.quick-read: off
> > > performance.read-ahead: off
> > > performance.io-cache: off
> > > performance.stat-prefetch: on
> > > cluster.eager-lock: enable
> > > network.remote-dio: enable
> > > cluster.quorum-type: auto
> > > cluster.server-quorum-type: server
> > > server.allow-insecure: on
> > > cluster.self-heal-window-size: 1024
> > > cluster.background-self-heal-count: 16
> > > performance.strict-write-ordering: off
> > > nfs.disable: on
> > > nfs.addr-namelookup: off
> > > nfs.enable-ino32: off
> > > cluster.granular-entry-heal: on
> > >
> > > Friday I did a rolling upgrade from 3.8.3->3.8.3 with no issues.
> > > Following the steps detailed in previous recommendations, I began the
> > > process of replacing and healing bricks one node at a time.
> > >
> > > 1) kill pid of brick
> > > 2) reconfigure brick from raid6 to raid10
> > > 3) recreate directory of brick
> > > 4) gluster volume start <> force
> > > 5) gluster volume heal <> full
> >
> > Hi,
> >
> > I'd suggest that full heal not be used. There are a few bugs in full heal.
> > Better safe than sorry ;)
> > Instead I'd suggest the following steps:
>
> Currently I brought the node down with systemctl stop glusterd, as I was
> getting sporadic I/O issues and a few VMs paused, so I'm hoping that will
> help. I may wait to do this till around 4 PM, when most work is done, in
> case it shoots the load up.
>
> > 1) kill pid of brick
> > 2) do whatever reconfiguration of the brick you need
> > 3) recreate brick dir
> > 4) while the brick is still down, from the mount point:
> >    a) create a dummy non-existent dir under / of mount.
>
> So if node 2 is the down brick, do I pick another node, for example 3, and
> make a test dir under its brick directory that doesn't exist on 2, or
> should I be doing this over a gluster mount?

You should be doing this over the gluster mount.

> >    b) set a non-existent extended attribute on / of mount.
>
> Could you give me an example of an attribute to set? I've read a tad on
> this, and looked up attributes, but haven't set any yet myself.

Sure.

setfattr -n "user.some-name" -v "some-value" <path-to-mount>
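Putting the whole sequence together, it would look roughly like this. The
volume name GLUSTER1 is taken from the brick log messages further down, the
brick path from the volume layout above; the mount point /mnt/GLUSTER1, the
dummy directory name, and <brick-pid> are only placeholders to adjust for
your setup:

    # on the node whose brick is being replaced: find and kill the brick process
    gluster volume status GLUSTER1        # note the PID of the local brick
    kill <brick-pid>

    # reconfigure the underlying storage, then recreate the now-empty brick directory
    mkdir -p /gluster1/BRICK1/1

    # while the brick is still down, from a FUSE mount of the volume, create a
    # dummy entry and set a dummy xattr so the surviving bricks are marked as
    # the heal sources
    mkdir /mnt/GLUSTER1/heal-dummy-dir
    setfattr -n "user.some-name" -v "some-value" /mnt/GLUSTER1

    # bring the brick back and trigger an index heal (not a full heal)
    gluster volume start GLUSTER1 force
    gluster volume heal GLUSTER1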
> > Doing these steps will ensure that heal happens only from the updated
> > brick to the down brick.
> > 5) gluster v start <> force
> > 6) gluster v heal <>
>
> Will it matter if somewhere in gluster the full heal command was run the
> other day? Not sure if it eventually stops or times out.

full heal will stop once the crawl is done. So if you want to trigger heal
again, run gluster v heal <>. Actually, even bringing the brick up or
volume start force should trigger the heal. (A quick way to check whether
heal is actually making progress is sketched at the end of this mail.)

> > > The 1st node worked as expected; it took 12 hours to heal 1TB of data.
> > > Load was a little heavy but nothing shocking.
> > >
> > > About an hour after node 1 finished I began the same process on node 2.
> > > The heal process kicked in as before, and the files in directories
> > > visible from the mount and in .glusterfs healed in a short time. Then it
> > > began the crawl of .shard, adding those files to the heal count, at
> > > which point the entire process basically ground to a halt. After 48
> > > hours, out of 19k shards it has added 5900 to the heal list. Load on all
> > > 3 machines is negligible. It was suggested to change
> > > cluster.data-self-heal-algorithm to full and restart the volume, which I
> > > did. No effect. Tried relaunching heal; no effect, regardless of which
> > > node I picked. I started each VM and performed a stat of all files from
> > > within it, or a full virus scan, and that seemed to cause short small
> > > spikes in shards added, but not by much. Logs are showing no real
> > > messages indicating anything is going on. I get occasional hits in the
> > > brick log for null lookups, making me think it's not really crawling the
> > > .shard directory but waiting for a shard lookup to add it. I'll get the
> > > following in the brick log, but not constantly, and sometimes multiple
> > > times for the same shard.
> > >
> > > [2016-08-29 08:31:57.478125] W [MSGID: 115009]
> > > [server-resolve.c:569:server_resolve] 0-GLUSTER1-server: no resolution
> > > type for (null) (LOOKUP)
> > > [2016-08-29 08:31:57.478170] E [MSGID: 115050]
> > > [server-rpc-fops.c:156:server_lookup_cbk] 0-GLUSTER1-server: 12591783:
> > > LOOKUP (null) (00000000-0000-0000-0000-000000000000/241a55ed-f0d5-4dbc-a6ce-ab784a0ba6ff.221)
> > > ==> (Invalid argument) [Invalid argument]
> > >
> > > This one repeated about 30 times in a row, then nothing for 10 minutes,
> > > then one hit for a different shard by itself.
> > >
> > > How can I determine if heal is actually running? How can I kill it or
> > > force a restart? Does the node I start it from determine which directory
> > > gets crawled to determine heals?
> > >
> > > David Gossage
> > > Carousel Checks Inc. | System Administrator
> > > Office 708.613.2284
> >
> > --
> > Thanks,
> > Anuradha.
>

--
Thanks,
Anuradha.

_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
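Sketch referenced above, for checking whether heal is actually making
progress. These can be run from any node; the volume name GLUSTER1 comes
from the brick log messages in the thread, and the log path is only the
usual default:

    gluster volume heal GLUSTER1 statistics heal-count   # per-brick count of entries still waiting to be healed
    gluster volume heal GLUSTER1 info                     # list the entries currently pending heal (shards appear here as they are queued)
    gluster volume heal GLUSTER1                          # re-trigger an index heal if the counts stop moving
    tail -f /var/log/glusterfs/glustershd.log             # default location of the self-heal daemon's log

If the heal-count figures drop between runs, the self-heal daemon is working
through the backlog; if they stay flat, re-triggering the heal and checking
glustershd.log for errors are the next things to try.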