Re: gluster remove-brick


 



Hi Nithya

I tried attaching the logs but they were too big, so I have put them on a drive accessible by everyone:

https://drive.google.com/drive/folders/1744WcOfrqe_e3lRPxLpQ-CBuXHp_o44T?usp=sharing 


I am attaching the rebalance logs, which cover the period when I ran fix-layout after adding the new disks and then started the remove-brick operation.
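
(In case it is useful when skimming the log: the failure reasons can be pulled out with something like the following, assuming the default log location of /var/log/glusterfs/<volname>-rebalance.log.)

# count the no-space failures and look at the most recent error lines
grep -c "No space left on device" /var/log/glusterfs/atlasglust-rebalance.log
grep " E " /var/log/glusterfs/atlasglust-rebalance.log | tail -20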

All of the nodes have at least 8 TB of disk space available:

/dev/sdb                  73T   65T  8.0T  90% /glusteratlas/brick001
/dev/sdb                  73T   65T  8.0T  90% /glusteratlas/brick002
/dev/sdb                  73T   65T  8.0T  90% /glusteratlas/brick003
/dev/sdb                  73T   65T  8.0T  90% /glusteratlas/brick004
/dev/sdb                  73T   65T  8.0T  90% /glusteratlas/brick005
/dev/sdb                  80T   67T   14T  83% /glusteratlas/brick006
/dev/sdb                  37T  1.6T   35T   5% /glusteratlas/brick007
/dev/sdb                  89T   15T   75T  17% /glusteratlas/brick008
/dev/sdb                  89T   14T   76T  16% /glusteratlas/brick009

brick007 is the one I am removing



gluster volume info

Volume Name: atlasglust
Type: Distribute
Volume ID: fbf0ebb8-deab-4388-9d8a-f722618a624b
Status: Started
Snapshot Count: 0
Number of Bricks: 9
Transport-type: tcp
Bricks:
Brick1: pplxgluster01**:/glusteratlas/brick001/gv0
Brick2: pplxgluster02.**:/glusteratlas/brick002/gv0
Brick3: pplxgluster03.**:/glusteratlas/brick003/gv0
Brick4: pplxgluster04.**:/glusteratlas/brick004/gv0
Brick5: pplxgluster05.**:/glusteratlas/brick005/gv0
Brick6: pplxgluster06.**:/glusteratlas/brick006/gv0
Brick7: pplxgluster07.**:/glusteratlas/brick007/gv0
Brick8: pplxgluster08.**:/glusteratlas/brick008/gv0
Brick9: pplxgluster09.**:/glusteratlas/brick009/gv0
Options Reconfigured:
nfs.disable: on
performance.readdir-ahead: on
transport.address-family: inet
auth.allow: ***
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.stat-prefetch: on
performance.md-cache-timeout: 600
performance.parallel-readdir: off
performance.cache-size: 1GB
performance.client-io-threads: on
cluster.lookup-optimize: on
client.event-threads: 4
server.event-threads: 4
performance.cache-invalidation: on
diagnostics.brick-log-level: WARNING
diagnostics.client-log-level: WARNING


Thanks



On Mon, Feb 4, 2019 at 11:37 AM Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:
Hi,


On Mon, 4 Feb 2019 at 16:39, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:
Hi Nithya

Thanks for replying so quickly. It is very much appreciated.

There are lots of "[No space left on device]" errors, which I cannot understand as there is plenty of space on all of the nodes.
 
This means that Gluster could not find sufficient space for the file. Would you be willing to share your rebalance log file? 
Please provide the following information (a sketch of the commands for gathering it follows the list):
  • The gluster version 
  • The gluster volume info for the volume
  • How full are the individual bricks for the volume?
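
A rough sketch of commands for gathering these (the df would need to be run on each server, since each brick is local to its node):

gluster --version
gluster volume info atlasglust
gluster volume status atlasglust detail    # also reports free space per brick
df -h /glusteratlas/brick*                 # on each node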
 
A little bit of background will be useful in this case. I had a cluster of seven nodes of varying capacity (73, 73, 73, 46, 46, 46, 46 TB). The cluster was almost 90% full, so every node had roughly 8 to 15 TB of free space. I added two new nodes with 100 TB each and ran fix-layout, which completed successfully.

After that I started the remove-brick operation. I don't think that at any point any of the nodes were 100% full; looking at my Ganglia graphs, there was always a minimum of 5 TB available on every node.
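
(For reference, the standard CLI sequence for that expand-then-drain workflow looks roughly like the following; the hostnames are placeholders rather than the exact ones used here.)

gluster volume add-brick atlasglust <newnode08>:/glusteratlas/brick008/gv0 <newnode09>:/glusteratlas/brick009/gv0
gluster volume rebalance atlasglust fix-layout start     # spread the directory layout onto the new bricks
gluster volume rebalance atlasglust status               # wait until fix-layout reports completed
gluster volume remove-brick atlasglust <node07>:/glusteratlas/brick007/gv0 start   # then start draining the old brick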

I was keeping an eye on the remove-brick status, and for a very long time there were no failures; then at some point these 17000 failures appeared and the count stayed like that.
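
(The status being watched here is the standard remove-brick status output, which can be polled with something like the following; the hostname is again a placeholder.)

watch -n 600 gluster volume remove-brick atlasglust <node07>:/glusteratlas/brick007/gv0 status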

 Thanks

Kashif
  


 



On Mon, Feb 4, 2019 at 5:09 AM Nithya Balachandran <nbalacha@xxxxxxxxxx> wrote:
Hi,

The status shows quite a few failures. Please check the rebalance logs to see why that happened. We can decide what to do based on the errors.
Once you run a commit, the brick will no longer be part of the volume and you will not be able to access those files via the client.
Do you have sufficient space on the remaining bricks for the files on the removed brick?

Regards,
Nithya

On Mon, 4 Feb 2019 at 03:50, mohammad kashif <kashif.alig@xxxxxxxxx> wrote:
Hi

I have a pure distributed Gluster volume with nine nodes and am trying to remove one node. I ran:
gluster volume remove-brick atlasglust nodename:/glusteratlas/brick007/gv0 start

It completed but with around 17000 failures

    Node          Rebalanced-files          size        scanned       failures       skipped         status    run time in h:m:s
    --------      ----------------        ------      ---------       --------       -------      ---------    -----------------
    nodename               4185858        27.5TB        6746030          17488             0      completed            405:15:34

I can see that there is still 1.5 TB of data on the node which I was trying to remove.
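
(One rough way to confirm how much real data is left on that brick, skipping Gluster's internal .glusterfs directory; the path matches the brick being removed.)

du -sh /glusteratlas/brick007/gv0
find /glusteratlas/brick007/gv0 -path '*/.glusterfs' -prune -o -type f -print | wc -l   # count of remaining files on the brick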

I am not sure what to do now. Should I run the remove-brick command again so that the files which failed can be tried again?

Or should I run commit first and then try to remove the node again?

Please advise, as I don't want to lose any files.

Thanks

Kashif



_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
https://lists.gluster.org/mailman/listinfo/gluster-users