The rebalance failures appear to be due to the loss of the connection to subvolume bigdata2-client-8. Rebalance will stop if any dht subvolume goes down. From the logs:

[2015-06-04 23:24:36.714719] I [client.c:2215:client_rpc_notify] 0-bigdata2-client-8: disconnected from bigdata2-client-8. Client process will keep trying to connect to glusterd until brick's port is available
[2015-06-04 23:24:36.714734] W [dht-common.c:5953:dht_notify] 0-bigdata2-dht: Received CHILD_DOWN. Exiting
[2015-06-04 23:24:36.714745] I [MSGID: 109029] [dht-rebalance.c:2136:gf_defrag_stop] 0-: Received stop command on rebalance

Did anything happen to the brick process for 0-bigdata2-client-8 that would cause this? The brick logs might help here.

I need to look into why the rebalance never proceeded on gluster-6. The logs show the following:

[2015-06-03 15:18:17.905569] W [client-handshake.c:1109:client_setvolume_cbk] 0-bigdata2-client-1: failed to set the volume (Permission denied)
[2015-06-03 15:18:17.905583] W [client-handshake.c:1135:client_setvolume_cbk] 0-bigdata2-client-1: failed to get 'process-uuid' from reply dict
[2015-06-03 15:18:17.905592] E [client-handshake.c:1141:client_setvolume_cbk] 0-bigdata2-client-1: SETVOLUME on remote-host failed: Authentication failed

The same authentication failure shows up for all subvols on gluster-6. Can you send us the brick logs for those as well?

Thanks,
Nithya

----- Original Message -----
> From: "Branden Timm" <btimm@xxxxxxxx>
> To: "Nithya Balachandran" <nbalacha@xxxxxxxxxx>
> Cc: gluster-users@xxxxxxxxxxx
> Sent: Saturday, 6 June, 2015 12:20:53 AM
> Subject: Re: One host won't rebalance
>
> Update on this. After two out of three servers entered a failed state during
> the rebalance, and the third hadn't done anything yet, I cancelled the
> rebalance. I then stopped/started the volume and ran rebalance fix-layout.
> As of this point, it is running on all three servers successfully.
>
> Once fix-layout is done I will attempt another data rebalance and update this
> list with the results.
>
> ________________________________________
> From: gluster-users-bounces@xxxxxxxxxxx <gluster-users-bounces@xxxxxxxxxxx>
> on behalf of Branden Timm <btimm@xxxxxxxx>
> Sent: Friday, June 5, 2015 10:38 AM
> To: Nithya Balachandran
> Cc: gluster-users@xxxxxxxxxxx
> Subject: Re: One host won't rebalance
>
> Sure, here is gluster volume info:
>
> Volume Name: bigdata2
> Type: Distribute
> Volume ID: 2cd214fa-6fa4-49d0-93f6-de2c510d4dd4
> Status: Started
> Number of Bricks: 15
> Transport-type: tcp
> Bricks:
> Brick1: gluster-6.redacted:/gluster/brick1/data
> Brick2: gluster-6.redacted:/gluster/brick2/data
> Brick3: gluster-6.redacted:/gluster/brick3/data
> Brick4: gluster-6.redacted:/gluster/brick4/data
> Brick5: gluster-7.redacted:/gluster/brick1/data
> Brick6: gluster-7.redacted:/gluster/brick2/data
> Brick7: gluster-7.redacted:/gluster/brick3/data
> Brick8: gluster-7.redacted:/gluster/brick4/data
> Brick9: gluster-8.redacted:/gluster/brick1/data
> Brick10: gluster-8.redacted:/gluster/brick2/data
> Brick11: gluster-8.redacted:/gluster/brick3/data
> Brick12: gluster-8.redacted:/gluster/brick4/data
> Brick13: gluster-7.redacted:/gluster-sata/brick1/data
> Brick14: gluster-8.redacted:/gluster-sata/brick1/data
> Brick15: gluster-6.redacted:/gluster-sata/brick1/data
> Options Reconfigured:
> cluster.readdir-optimize: on
> performance.enable-least-priority: off
>
> Attached is a tarball containing logs for gluster-6, 7 and 8.
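For the CHILD_DOWN diagnosis above, a minimal sketch of how to check the brick behind bigdata2-client-8. Assuming the usual zero-based client numbering for a plain distribute volume, client-8 should correspond to Brick9 (gluster-8.redacted:/gluster/brick1/data); the log filename below follows the default naming and is only illustrative.

    # Confirm every brick process is online and note its PID and port
    gluster volume status bigdata2

    # Brick logs normally live under /var/log/glusterfs/bricks/ on the server
    # hosting the brick, named after the brick path with slashes turned into dashes
    less /var/log/glusterfs/bricks/gluster-brick1-data.log

    # Look for a crash or shutdown around the time of the CHILD_DOWN message
    grep -iE 'crash|shutting down|disconnect' /var/log/glusterfs/bricks/gluster-brick1-data.log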
> I should also note that as of this morning, the two hosts that were
> successfully running the rebalance show as failed, while the affected host
> is still sitting at 0 secs progress:
>
>                     Node Rebalanced-files          size       scanned      failures       skipped         status   run time in secs
>                ---------      -----------   -----------   -----------   -----------   -----------   ------------   --------------
>                localhost                0        0Bytes             0             0             0    in progress             0.00
>      gluster-7.glbrc.org             3020        19.4TB         12730             4             0         failed        105165.00
>      gluster-8.glbrc.org                0        0Bytes             0             0             0         failed             0.00
> volume rebalance: bigdata2: success:
>
> Thanks!
>
> ________________________________________
> From: Nithya Balachandran <nbalacha@xxxxxxxxxx>
> Sent: Friday, June 5, 2015 4:46 AM
> To: Branden Timm
> Cc: Atin Mukherjee; gluster-users@xxxxxxxxxxx
> Subject: Re: One host won't rebalance
>
> Hi,
>
> Can you send us the gluster volume info for the volume and the rebalance log
> for the nodes? What is the pid of the process which does not proceed?
>
> Thanks,
> Nithya
>
> ----- Original Message -----
> > From: "Atin Mukherjee" <amukherj@xxxxxxxxxx>
> > To: "Branden Timm" <btimm@xxxxxxxx>, "Atin Mukherjee" <atin.mukherjee83@xxxxxxxxx>
> > Cc: gluster-users@xxxxxxxxxxx
> > Sent: Friday, June 5, 2015 9:26:44 AM
> > Subject: Re: One host won't rebalance
> >
> > On 06/05/2015 12:05 AM, Branden Timm wrote:
> > > I should add that there are additional errors as well in the brick logs.
> > > I've posted them to a gist at
> > > https://gist.github.com/brandentimm/576432ddabd70184d257
> > As I mentioned earlier, the DHT team can answer all your questions on this
> > failure.
> >
> > ~Atin
> >
> > > ________________________________
> > > From: gluster-users-bounces@xxxxxxxxxxx <gluster-users-bounces@xxxxxxxxxxx>
> > > on behalf of Branden Timm <btimm@xxxxxxxx>
> > > Sent: Thursday, June 4, 2015 1:31 PM
> > > To: Atin Mukherjee
> > > Cc: gluster-users@xxxxxxxxxxx
> > > Subject: Re: One host won't rebalance
> > >
> > > I have stopped and restarted the rebalance several times, with no
> > > difference in results. I have restarted all gluster services several
> > > times, and completely rebooted the affected system.
> > >
> > > Yes, gluster volume status does show an active rebalance task for volume
> > > bigdata2.
> > >
> > > I just noticed something else in the brick logs. I am seeing tons of
> > > messages similar to these two:
> > >
> > > [2015-06-04 16:22:26.179797] E [posix-helpers.c:938:posix_handle_pair]
> > > 0-bigdata2-posix: /<redacted path>: key:glusterfs-internal-fop flags: 1
> > > length:4 error:Operation not supported
> > > [2015-06-04 16:22:26.179874] E [posix.c:2325:posix_create]
> > > 0-bigdata2-posix: setting xattrs on /<path redacted> failed (Operation not
> > > supported)
> > >
> > > Note that both messages were referring to the same file. I have confirmed
> > > that xattr support is on in the underlying system. Additionally, these
> > > messages are NOT appearing on the other cluster members that seem to be
> > > unaffected by whatever is going on.
> > >
> > > I found this bug which seems to be similar, but it was theoretically closed
> > > for the 3.6.1 release:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1098794
> > >
> > > Thanks again for your help.
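On the "Operation not supported" xattr errors quoted above, a quick sketch of how to sanity-check extended attribute support directly on the affected brick. The paths are illustrative; run this on the brick filesystem itself (as root), not through the Gluster client mount.

    # How is the brick filesystem mounted? xattr support depends on this
    mount | grep /gluster/brick1

    # Try setting and reading back a trusted.* xattr, the namespace Gluster
    # uses internally, then clean up the test file
    touch /gluster/brick1/data/.xattr-test
    setfattr -n trusted.test -v working /gluster/brick1/data/.xattr-test
    getfattr -n trusted.test /gluster/brick1/data/.xattr-test
    rm /gluster/brick1/data/.xattr-test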
> > >
> > >
> > > ________________________________
> > > From: Atin Mukherjee <atin.mukherjee83@xxxxxxxxx>
> > > Sent: Thursday, June 4, 2015 1:25 PM
> > > To: Branden Timm
> > > Cc: Shyamsundar Ranganathan; Susant Palai; gluster-users@xxxxxxxxxxx; Atin Mukherjee; Nithya Balachandran
> > > Subject: Re: One host won't rebalance
> > >
> > > Sent from Samsung Galaxy S4
> > > On 4 Jun 2015 22:18, "Branden Timm" <btimm@xxxxxxxx> wrote:
> > >>
> > >> Atin, thank you for the response. Indeed I have investigated the locks on
> > >> that file, and it is a glusterfs process with an exclusive read/write
> > >> lock on the entire file:
> > >>
> > >> lsof /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> > >> COMMAND     PID USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
> > >> glusterfs 12776 root    6uW  REG  253,1        6 15730814 /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> > >>
> > >> That process was invoked with the following options:
> > >>
> > >> ps -ef | grep 12776
> > >> root     12776     1  0 Jun03 ?  00:00:03 /usr/sbin/glusterfs -s localhost
> > >> --volfile-id rebalance/bigdata2 --xlator-option *dht.use-readdirp=yes
> > >> --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes
> > >> --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off
> > >> --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off
> > >> --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=1
> > >> --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db
> > >> --socket-file /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> > >> --pid-file /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> > >> -l /var/log/glusterfs/bigdata2-rebalance.log
> > > This means there is already a rebalance process alive. Could you help me
> > > with the following:
> > > 1. What does bigdata2-rebalance.log say? Don't you see a shutting-down log
> > > somewhere?
> > > 2. Does the output of gluster volume status show bigdata2 as rebalancing?
> > >
> > > As a workaround, can you kill this process and start a fresh rebalance
> > > process?
> > >>
> > >> Not sure if this information is helpful, but thanks for your reply.
> > >>
> > >> ________________________________________
> > >> From: Atin Mukherjee <amukherj@xxxxxxxxxx>
> > >> Sent: Thursday, June 4, 2015 9:24 AM
> > >> To: Branden Timm; gluster-users@xxxxxxxxxxx; Nithya Balachandran; Susant Palai; Shyamsundar Ranganathan
> > >> Subject: Re: One host won't rebalance
> > >>
> > >> On 06/04/2015 06:30 PM, Branden Timm wrote:
> > >>> I'm really hoping somebody can at least point me in the right direction
> > >>> on how to diagnose this.
> > >>> This morning, roughly 24 hours after initiating the rebalance, one host
> > >>> of three in the cluster still hasn't done anything:
> > >>>
> > >>>                  Node Rebalanced-files          size       scanned      failures       skipped         status   run time in secs
> > >>>             ---------      -----------   -----------   -----------   -----------   -----------   ------------   --------------
> > >>>             localhost             2543        14.2TB         11162             0             0    in progress         60946.00
> > >>>             gluster-8             1358         6.7TB          9298             0             0    in progress         60946.00
> > >>>             gluster-6                0        0Bytes             0             0             0    in progress             0.00
> > >>>
> > >>> The only error showing up in the rebalance log is this:
> > >>>
> > >>> [2015-06-03 19:59:58.314100] E [MSGID: 100018]
> > >>> [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile
> > >>> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> > >>> lock failed [Resource temporarily unavailable]
> > >> This looks like acquiring the posix file lock failed, which suggests the
> > >> rebalance is *actually not* running. I would leave it to the dht folks to
> > >> comment on it.
> > >>
> > >> ~Atin
> > >>>
> > >>> Any help would be greatly appreciated!
> > >>>
> > >>> ________________________________
> > >>> From: gluster-users-bounces@xxxxxxxxxxx <gluster-users-bounces@xxxxxxxxxxx>
> > >>> on behalf of Branden Timm <btimm@xxxxxxxx>
> > >>> Sent: Wednesday, June 3, 2015 11:52 AM
> > >>> To: gluster-users@xxxxxxxxxxx
> > >>> Subject: One host won't rebalance
> > >>>
> > >>> Greetings Gluster Users,
> > >>>
> > >>> I started a rebalance operation on my distributed volume today (CentOS
> > >>> 6.6/GlusterFS 3.6.3), and one of the three hosts comprising the cluster
> > >>> is just sitting at 0.00 for 'run time in secs', and shows 0 files
> > >>> scanned, failed, or skipped.
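Regarding the pidfile lock failure and the workaround Atin suggests above, a rough sequence for clearing a stale rebalance process and starting a fresh one; the pid and pidfile path are the ones already quoted in this thread.

    # Who holds the lock on the rebalance pidfile?
    lsof /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid

    # Stop the rebalance cleanly first; if the old glusterfs process (12776
    # in the lsof output above) survives, kill it
    gluster volume rebalance bigdata2 stop
    kill 12776

    # Then start a fresh rebalance and watch its progress
    gluster volume rebalance bigdata2 start
    gluster volume rebalance bigdata2 status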
> > >>>
> > >>> I've reviewed the rebalance log for the affected server, and I'm seeing
> > >>> these messages:
> > >>>
> > >>> [2015-06-03 15:34:32.703692] I [MSGID: 100030] [glusterfsd.c:2018:main]
> > >>> 0-/usr/sbin/glusterfs: Started running /usr/sbin/glusterfs version 3.6.3
> > >>> (args: /usr/sbin/glusterfs -s localhost --volfile-id rebalance/bigdata2
> > >>> --xlator-option *dht.use-readdirp=yes --xlator-option *dht.lookup-unhashed=yes
> > >>> --xlator-option *dht.assert-no-child-down=yes --xlator-option *replicate*.data-self-heal=off
> > >>> --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off
> > >>> --xlator-option *replicate*.readdir-failover=off --xlator-option *dht.readdir-optimize=on
> > >>> --xlator-option *dht.rebalance-cmd=1 --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db
> > >>> --socket-file /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> > >>> --pid-file /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> > >>> -l /var/log/glusterfs/bigdata2-rebalance.log)
> > >>> [2015-06-03 15:34:32.704217] E [MSGID: 100018]
> > >>> [glusterfsd.c:1677:glusterfs_pidfile_update] 0-glusterfsd: pidfile
> > >>> /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> > >>> lock failed [Resource temporarily unavailable]
> > >>>
> > >>> I initially investigated the first warning, readv on 127.0.0.1:24007
> > >>> failed. netstat shows that ip/port belonging to a glusterd process.
> > >>> Beyond that I wasn't able to tell why there would be a problem.
> > >>>
> > >>> Next, I checked out what was up with the lock file that reported resource
> > >>> temporarily unavailable. The file is present and contains the pid of a
> > >>> running glusterfs process:
> > >>>
> > >>> root     12776     1  0 10:18 ?  00:00:00 /usr/sbin/glusterfs -s localhost
> > >>> --volfile-id rebalance/bigdata2 --xlator-option *dht.use-readdirp=yes
> > >>> --xlator-option *dht.lookup-unhashed=yes --xlator-option *dht.assert-no-child-down=yes
> > >>> --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off
> > >>> --xlator-option *replicate*.entry-self-heal=off --xlator-option *replicate*.readdir-failover=off
> > >>> --xlator-option *dht.readdir-optimize=on --xlator-option *dht.rebalance-cmd=1
> > >>> --xlator-option *dht.node-uuid=3b5025d4-3230-4914-ad0d-32f78587c4db
> > >>> --socket-file /var/run/gluster/gluster-rebalance-2cd214fa-6fa4-49d0-93f6-de2c510d4dd4.sock
> > >>> --pid-file /var/lib/glusterd/vols/bigdata2/rebalance/3b5025d4-3230-4914-ad0d-32f78587c4db.pid
> > >>> -l /var/log/glusterfs/bigdata2-rebalance.log
> > >>>
> > >>> Finally, one other thing I saw from running 'gluster volume status
> > >>> <volname> clients' is that the affected server is the only one of the
> > >>> three that lists a 127.0.0.1:<port> client for each of its bricks.
> > >>> I don't know why there would be a client coming from loopback on the
> > >>> server, but it seems strange. Additionally, it makes me wonder if the
> > >>> fact that I have auth.allow set to a single subnet (that doesn't include
> > >>> 127.0.0.1) is causing this problem for some reason, or if loopback is
> > >>> implicitly allowed to connect.
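On the auth.allow question above, a hedged way to test the theory: the current value is listed in gluster volume info, and the option accepts a comma-separated list of addresses/wildcards, so loopback can be added alongside the existing subnet. The subnet below is only a placeholder for the real one.

    # Check the current auth.allow setting (listed under "Options Reconfigured")
    gluster volume info bigdata2

    # Temporarily allow loopback in addition to the existing subnet, then retry
    gluster volume set bigdata2 auth.allow "192.168.1.*,127.0.0.1"
    gluster volume rebalance bigdata2 start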
> > >>>
> > >>> Any tips or suggestions would be much appreciated. Thanks!
> > >>
> > >> --
> > >> ~Atin
> >
> > --
> > ~Atin
>
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users