Hey there,

I've been madly hacking on cool new puppet-gluster features... In my sleep-deprived state, I've put together some comments about the gluster add/remove brick features. Hopefully they are useful and make sense. These are sort of "bugs". Have a look, and let me know if I should formally report any of these...

Cheers...

James

PS: this is also mirrored here: http://paste.fedoraproject.org/50402/12956713 because email has destroyed formatting :P

All tests are done on gluster 3.4.1, using CentOS 6.4 on VMs. The firewall has been disabled for testing purposes.

# gluster --version
glusterfs 3.4.1 built on Sep 27 2013 13:13:58

### 1) simple operations shouldn't fail

# running the following commands in succession, without any files on the bricks:
# gluster volume add-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9
# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 start
# ...the remove-brick status then shows a failure:

[root@vmx1 ~]# gluster volume add-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9
volume add-brick: success
[root@vmx1 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 status
                Node  Rebalanced-files    size   scanned  failures  skipped       status  run-time in secs
           ---------       -----------  ------   -------  --------  -------  -----------  ----------------
           localhost                 0  0Bytes         0         0           not started              0.00
    vmx2.example.com                 0  0Bytes         0         0           not started              0.00
[root@vmx1 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 start
volume remove-brick start: success
ID: ecbcc2b6-4351-468a-8f53-3a09159e4059
[root@vmx1 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 status
                Node  Rebalanced-files    size   scanned  failures  skipped       status  run-time in secs
           ---------       -----------  ------   -------  --------  -------  -----------  ----------------
           localhost                 0  0Bytes         8         0             completed              0.00
    vmx2.example.com                 0  0Bytes         0         1                failed              0.00
[root@vmx1 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success
[root@vmx1 ~]#

### 1b) on the other node, the output shows an extra row (also including the failure)

[root@vmx2 ~]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9 status
                Node  Rebalanced-files    size   scanned  failures  skipped       status  run-time in secs
           ---------       -----------  ------   -------  --------  -------  -----------  ----------------
           localhost                 0  0Bytes         0         0             completed              0.00
           localhost                 0  0Bytes         0         0             completed              0.00
    vmx1.example.com                 0  0Bytes         0         1                failed              0.00
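As an aside, here's a rough sketch of how this can be sanity-checked from a script instead of trusting the exit codes: look at the aggregate fields in the --xml status before committing. The XPath expressions come from the volRemoveBrick XML shown under ### 2) below; VOL and BRICKS are obviously placeholders, and this assumes an xmllint (from libxml2) new enough to support --xpath:

    VOL='examplevol'
    BRICKS='vmx1.example.com:/tmp/foo9 vmx2.example.com:/tmp/foo9'
    # grab the --xml status once, then pull out the aggregate failure count and status string
    XML=$(gluster volume remove-brick ${VOL} ${BRICKS} status --xml)
    FAILURES=$(echo "${XML}" | xmllint --xpath 'string(/cliOutput/volRemoveBrick/aggregate/failures)' -)
    STATUS=$(echo "${XML}" | xmllint --xpath 'string(/cliOutput/volRemoveBrick/aggregate/statusStr)' -)
    if [ "${STATUS}" = 'completed' ] && [ "${FAILURES}" = '0' ]; then
        # everything migrated cleanly, so committing should be safe;
        # --mode=script should skip the interactive y/n prompt, iirc
        gluster --mode=script volume remove-brick ${VOL} ${BRICKS} commit
    fi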
### 2) formatting:

# the "skipped" column doesn't seem to have any data; as a result, the formatting is broken...
# this problem is obviously not seen in the more useful --xml output below; the 'skipped' column isn't present there either.

[root@vmx1 examplevol]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo3 vmx2.example.com:/tmp/foo3 status
                Node  Rebalanced-files    size   scanned  failures  skipped       status  run-time in secs
           ---------       -----------  ------   -------  --------  -------  -----------  ----------------
           localhost                 0  0Bytes         8         0             completed              0.00
    vmx2.example.com                 0  0Bytes         8         0             completed              0.00

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cliOutput>
  <opRet>0</opRet>
  <opErrno>115</opErrno>
  <opErrstr/>
  <volRemoveBrick>
    <task-id>d99cab76-cd7d-4579-80ae-c1e6faff3d1d</task-id>
    <nodeCount>2</nodeCount>
    <node>
      <nodeName>localhost</nodeName>
      <files>0</files>
      <size>0</size>
      <lookups>8</lookups>
      <failures>0</failures>
      <status>3</status>
      <statusStr>completed</statusStr>
    </node>
    <node>
      <nodeName>vmx2.example.com</nodeName>
      <files>0</files>
      <size>0</size>
      <lookups>8</lookups>
      <failures>0</failures>
      <status>3</status>
      <statusStr>completed</statusStr>
    </node>
    <aggregate>
      <files>0</files>
      <size>0</size>
      <lookups>16</lookups>
      <failures>0</failures>
      <status>3</status>
      <statusStr>completed</statusStr>
    </aggregate>
  </volRemoveBrick>
</cliOutput>

### 3)

[root@vmx1 examplevol]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo3 vmx2.example.com:/tmp/foo3 status
                Node  Rebalanced-files    size   scanned  failures  skipped       status  run-time in secs
           ---------       -----------  ------   -------  --------  -------  -----------  ----------------
           localhost                 0  0Bytes         8         0             completed              0.00
    vmx2.example.com                 0  0Bytes         8         0             completed              0.00
[root@vmx1 examplevol]# gluster volume remove-brick examplevol vmx1.example.com:/tmp/foo3 vmx2.example.com:/tmp/foo3 commit
Removing brick(s) can result in data loss. Do you want to Continue? (y/n) y
volume remove-brick commit: success

This shouldn't warn you that you might experience data loss. If the rebalance has completed successfully and the bricks are no longer accepting new files, then gluster should know this, and just let you commit safely. I guess you can consider this a UI bug, as long as gluster checks beforehand that the commit really is safe.

### 4) Aggregate "totals", aka the <aggregate> </aggregate> data, aren't shown in the normal command line output.

### 5) the volume shouldn't have to be "started" for a rebalance to work... we might want to do a rebalance, but keep the volume "stopped" so that clients can't mount it. This is probably because gluster needs the volume "online" to rebalance, but nonetheless, it doesn't match what users/sysadmins expect.
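One possible workaround for ### 5) that I've been toying with (just a sketch, completely untested; the volume name and IP addresses are placeholders for the storage servers' own addresses, and as far as I know auth.allow only covers the native protocol, so NFS access would have to be restricted separately):

    # leave the volume started so the rebalance can run, but only allow the
    # servers themselves to connect, so regular clients can't mount it:
    gluster volume set examplevol auth.allow '192.168.123.101,192.168.123.102'
    gluster volume rebalance examplevol start
    # ...wait for the rebalance to finish, then let clients back in...
    gluster volume reset examplevol auth.allow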
### 6)

In the commands:

gluster volume rebalance myvolume status ; gluster volume rebalance myvolume status --xml && echo t

nowhere does the output mention the volume, or the specific bricks which are being [re-]balanced. In particular, a volume name would be especially useful in the --xml output. This would be useful if multiple rebalances are going on... I realize this is because the rebalance command only allows you to specify one volume at a time, but to be consistent with other commands, a "volume rebalance status" command should let you get info on many volumes (see the rough loop sketch after the output below). Also, per-brick information is still missing.

                Node  Rebalanced-files    size   scanned  failures  skipped       status  run-time in secs
           ---------       -----------  ------   -------  --------  -------  -----------  ----------------
           localhost                 0  0Bytes         2         0        0  in progress              1.00
    vmx2.example.com                 0  0Bytes         7         0        0  in progress              1.00
volume rebalance: examplevol: success:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<cliOutput>
  <opRet>0</opRet>
  <opErrno>115</opErrno>
  <opErrstr/>
  <volRebalance>
    <task-id>c5e9970b-f96a-4a28-af14-5477cf90d638</task-id>
    <op>3</op>
    <nodeCount>2</nodeCount>
    <node>
      <nodeName>localhost</nodeName>
      <files>0</files>
      <size>0</size>
      <lookups>2</lookups>
      <failures>0</failures>
      <status>1</status>
      <statusStr>in progress</statusStr>
    </node>
    <node>
      <nodeName>vmx2.example.com</nodeName>
      <files>0</files>
      <size>0</size>
      <lookups>7</lookups>
      <failures>0</failures>
      <status>1</status>
      <statusStr>in progress</statusStr>
    </node>
    <aggregate>
      <files>0</files>
      <size>0</size>
      <lookups>9</lookups>
      <failures>0</failures>
      <status>1</status>
      <statusStr>in progress</statusStr>
    </aggregate>
  </volRebalance>
</cliOutput>
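In the meantime, the closest thing I can think of to a multi-volume rebalance status is just looping over the volumes yourself. A rough sketch (this assumes `gluster volume list` is available in your build; otherwise you'd have to parse `gluster volume info` instead):

    # print rebalance status for every volume, tagged with the volume name
    for vol in $(gluster volume list); do
        echo "== ${vol} =="
        gluster volume rebalance "${vol}" status
    done

It's not a substitute for real per-volume (and per-brick) data in a single --xml document, but it at least associates each block of output with a volume name.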