Thanks for your replies. The vols seem to match the bricks OK. FWIW, gfs-node01:/sda is the first brick; perhaps it is getting the lion's share of the pointers? The results of a log search seem even more confusing: sdb drained rather than sda, but an error in rebalancing shows up in sdd. I include excerpts from scratch-rebalance.log, from ls -l and getfattr, and from the bricks/sda log. Does any of this suggest anything?

One failure is an apparent duplicate. This seems to refer to the relevant brick, and the date is correct:

[2013-09-15 19:10:07.620881] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-scratch-client-0: remote operation failed: File exists. Path: /nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118906_Qtot1500.h5.out (00000000-0000-0000-0000-000000000000)

On the array that actually drained (mostly):

[2013-09-15 19:10:19.483040] W [client3_1-fops.c:647:client3_1_unlink_cbk] 0-scratch-client-12: remote operation failed: No such file or directory
[2013-09-15 19:10:19.483122] W [client3_1-fops.c:647:client3_1_unlink_cbk] 0-scratch-client-12: remote operation failed: No such file or directory
[2013-09-15 19:10:19.494585] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-scratch-client-12: remote operation failed: File exists. Path: /nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118429_Qtot1500.h5 (00000000-0000-0000-0000-000000000000)
[2013-09-15 19:10:19.494701] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-scratch-client-12: remote operation failed: File exists. Path: /nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118429_Qtot1500.h5 (00000000-0000-0000-0000-000000000000)
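In case raw counts help, I can tally which errors dominate the 208878 failures with something along these lines (a rough sketch only; it counts just the "remote operation failed" warnings shown above and the pattern may miss other message formats):

  # rough count of rebalance warnings, grouped by error string
  grep -o 'remote operation failed: [^.]*' scratch-rebalance.log | sort | uniq -c | sort -rn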
An example failure where I can trace the files is an apparent duplicate:

gfs-node01 # grep -A2 -B2 Level2_IC86.2011_data_Run00118218_Qtot1500.h5 scratch-rebalance.log
[2013-09-15 19:10:30.164409] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-scratch-client-3: remote operation failed: File exists. Path: /nwhitehorn/vetoblast/data/Level2a_IC79_data_Run00117874_Qtot1500.h5.out (00000000-0000-0000-0000-000000000000)
[2013-09-15 19:10:30.164473] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-scratch-client-3: remote operation failed: File exists. Path: /nwhitehorn/vetoblast/data/Level2a_IC79_data_Run00117874_Qtot1500.h5.out (00000000-0000-0000-0000-000000000000)
[2013-09-15 19:10:30.176606] I [dht-common.c:956:dht_lookup_everywhere_cbk] 0-scratch-dht: deleting stale linkfile /nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118218_Qtot1500.h5 on scratch-client-2
[2013-09-15 19:10:30.176717] I [dht-common.c:956:dht_lookup_everywhere_cbk] 0-scratch-dht: deleting stale linkfile /nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118218_Qtot1500.h5 on scratch-client-2
[2013-09-15 19:10:30.176856] I [dht-common.c:956:dht_lookup_everywhere_cbk] 0-scratch-dht: deleting stale linkfile /nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118218_Qtot1500.h5 on scratch-client-2
[2013-09-15 19:10:30.177232] W [client3_1-fops.c:647:client3_1_unlink_cbk] 0-scratch-client-2: remote operation failed: No such file or directory
[2013-09-15 19:10:30.177303] W [client3_1-fops.c:647:client3_1_unlink_cbk] 0-scratch-client-2: remote operation failed: No such file or directory
[2013-09-15 19:10:30.178101] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-scratch-client-3: remote operation failed: File exists. Path: /nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118218_Qtot1500.h5 (00000000-0000-0000-0000-000000000000)
[2013-09-15 19:10:30.178150] W [client3_1-fops.c:258:client3_1_mknod_cbk] 0-scratch-client-3: remote operation failed: File exists. Path: /nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118218_Qtot1500.h5 (00000000-0000-0000-0000-000000000000)
[2013-09-15 19:10:30.192605] W [client3_1-fops.c:2566:client3_1_opendir_cbk] 0-scratch-client-7: remote operation failed: No such file or directory. Path: /nwhitehorn/vetoblast/data (00000000-0000-0000-0000-000000000000)
[2013-09-15 19:10:30.192830] W [client3_1-fops.c:2566:client3_1_opendir_cbk] 0-scratch-client-7: remote operation failed: No such file or directory. Path: /nwhitehorn/vetoblast/data (00000000-0000-0000-0000-000000000000)

gfs-node01 # ls -l /sdd/nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118218_Qtot1500.h5
---------T 2 34037 40978 0 Sep 15 14:10 /sdd/nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118218_Qtot1500.h5

gfs-node01 # ssh i3admin@gfs-node06 sudo ls -l /sdb/nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118218_Qtot1500.h5
-rw-r--r-- 2 34037 40978 715359 May 1 22:28 /sdb/nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118218_Qtot1500.h5

gfs-node01 # getfattr -d -m . -e hex /sdd/nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118218_Qtot1500.h5
# file: sdd/nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118218_Qtot1500.h5
trusted.gfid=0x11fb3ffd87be4ce3a88576466279819f
trusted.glusterfs.dht.linkto=0x736372617463682d636c69656e742d313200

gfs-node01 # ssh i3admin@gfs-node06 sudo getfattr -d -m . -e hex /sdb/nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118218_Qtot1500.h5
# file: sdb/nwhitehorn/vetoblast/data/Level2_IC86.2011_data_Run00118218_Qtot1500.h5
trusted.gfid=0x11fb3ffd87be4ce3a88576466279819f

Further, the xattrs on the brick roots themselves:

gfs-node01 # getfattr -d -m . -e hex /sdd
# file: sdd
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000bffffffdd5555551
trusted.glusterfs.volume-id=0xde1fbb473e5a45dc8df804f7f73a3ecc

gfs-node01 # getfattr -d -m . -e hex /sdc
# file: sdc
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000aaaaaaa8bffffffc
trusted.glusterfs.volume-id=0xde1fbb473e5a45dc8df804f7f73a3ecc

gfs-node01 # getfattr -d -m . -e hex /sdb
# file: sdb
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000000000000000000000
trusted.glusterfs.volume-id=0xde1fbb473e5a45dc8df804f7f73a3ecc

gfs-node01 # getfattr -d -m . -e hex /sda
# file: sda
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000555555546aaaaaa8
trusted.glusterfs.volume-id=0xde1fbb473e5a45dc8df804f7f73a3ecc

gfs-node01 # ssh i3admin@gfs-node04 sudo getfattr -d -m . -e hex /sdb
# file: sdb
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000002aaaaaaa3ffffffe
trusted.glusterfs.volume-id=0xde1fbb473e5a45dc8df804f7f73a3ecc
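In case it is useful, here is how I have been reading those xattrs. This rests on my own assumptions about the format rather than anything documented: the linkto value looks like a NUL-terminated client name, and the last two 32-bit words of trusted.glusterfs.dht look like the start and end of the hash range the brick claims.

  # decode the linkto value quoted above; prints "scratch-client-12"
  echo 736372617463682d636c69656e742d313200 | xxd -r -p; echo

  # bash sketch: print the hash range each local brick root claims
  for b in /sda /sdb /sdc /sdd; do
      v=$(getfattr --absolute-names -n trusted.glusterfs.dht -e hex "$b" 2>/dev/null \
          | awk -F= '/trusted.glusterfs.dht=/ {print $2}')
      echo "$b  start=0x${v:18:8}  end=0x${v:26:8}"
  done

If that reading is right, gfs-node01:/sdb now claims an empty 0x00000000-0x00000000 range while /sda still claims 0x55555554-0x6aaaaaa8, which would at least fit sdb being the brick that actually drained.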
The bricks/sda (etc.) logs have a rather monotonous run of:

[2013-09-16 22:23:01.723146] I [server-handshake.c:571:server_setvolume] 0-scratch-server: accepted client from node086-11928-2013/09/16-22:22:57:696729-scratch-client-0-0 (version: 3.3.2)
[2013-09-16 22:23:01.769154] I [server.c:703:server_rpc_notify] 0-scratch-server: disconnecting connectionfrom node086-11928-2013/09/16-22:22:57:696729-scratch-client-0-0
[2013-09-16 22:23:01.769211] I [server-helpers.c:741:server_connection_put] 0-scratch-server: Shutting down connection node086-11928-2013/09/16-22:22:57:696729-scratch-client-0-0
[2013-09-16 22:23:01.769253] I [server-helpers.c:629:server_connection_destroy] 0-scratch-server: destroyed connection of node086-11928-2013/09/16-22:22:57:696729-scratch-client-0-0

> On 09/17/2013 03:26 AM, james.bellinger at icecube.wisc.edu wrote:
>> I inherited a system with a wide mix of array sizes (no replication) in
>> 3.2.2, and wanted to drain data from a failing array.
>>
>> I upgraded to 3.3.2, and began a
>> gluster volume remove-brick scratch "gfs-node01:/sda" start
>>
>> After some time I got this:
>> gluster volume remove-brick scratch "gfs-node01:/sda" status
>>        Node  Rebalanced-files        size     scanned    failures       status
>>  ----------  ----------------  ----------  ----------  ----------  -----------
>>   localhost                 0      0Bytes           0           0  not started
>>  gfs-node06                 0      0Bytes           0           0  not started
>>  gfs-node03                 0      0Bytes           0           0  not started
>>  gfs-node05                 0      0Bytes           0           0  not started
>>  gfs-node01        2257394624       2.8TB     5161640      208878    completed
>>
>> Two things jump instantly to mind:
>> 1) The number of failures is rather large
> Can you see the rebalance logs (/var/log/scratch-rebalance.log) to
> figure out what the error messages are?
>> 2) A _different_ disk seems to have been _partially_ drained.
>> /dev/sda        2.8T  2.7T   12G 100% /sda
>> /dev/sdb        2.8T  769G  2.0T  28% /sdb
>> /dev/sdc        2.8T  2.1T  698G  75% /sdc
>> /dev/sdd        2.8T  2.2T  589G  79% /sdd
>>
> I know this sounds silly, but just to be sure, is /dev/sda actually
> mounted on "gfs-node01:sda"?
> If yes, the files that _were_ successfully rebalanced should have been
> moved from gfs-node01:sda to one of the other bricks. Is that the case?
>
>> When I mount the system it is read-only (another problem I want to fix
> Again, the mount logs could shed some information ..
> (btw a successful rebalance start/status sequence should be followed by
> the rebalance 'commit' command to ensure the volume information gets
> updated)
>
>> ASAP) so I'm pretty sure the failures aren't due to users changing the
>> system underneath me.
>>
>> Thanks for any pointers.
>>
>> James Bellinger
>>
>> _______________________________________________
>> Gluster-users mailing list
>> Gluster-users at gluster.org
>> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>
> _______________________________________________
> Gluster-users mailing list
> Gluster-users at gluster.org
> http://supercolony.gluster.org/mailman/listinfo/gluster-users
>