As a final(?) follow-up to my problem: after restarting the rebalance with

   gluster volume rebalance [vol-name] fix-layout start

it finished up last night after plowing through the entirety of the filesystem, fixing ~1M files (apparently ~2.2TB), all while the fs remained live (though probably a bit slower than users would have liked). That's a strong '+' in the gluster column for resiliency.

I started the rebalance without waiting for any advice to the contrary. 3.3 is supposed to have a built-in rebalance operator, but I saw no evidence of it, and other info from gluster.org suggested that running it by hand wouldn't do any harm, so I went ahead and started it.

Do the gluster wizards have any final words on this before I write it up in our trouble report? (The bare command sequence I used is appended at the end of this message, for the record.)

best wishes
harry

On Thu, Aug 2, 2012 at 4:37 PM, Harry Mangalam <hjmangalam at gmail.com> wrote:
> Further to what I wrote before:
> gluster server overload; recovers, now "Transport endpoint is not
> connected" for some files
> <http://goo.gl/CN6ud>
>
> I'm getting conflicting info here. On one hand, the peer that had its
> glusterfsd lock up appears to be back in the gluster system, according
> to the frequently referenced 'gluster peer status':
>
> Thu Aug 02 15:48:46 [1.00 0.89 0.92] root@pbs1:~
> 729 $ gluster peer status
> Number of Peers: 3
>
> Hostname: pbs4ib
> Uuid: 2a593581-bf45-446c-8f7c-212c53297803
> State: Peer in Cluster (Connected)
>
> Hostname: pbs2ib
> Uuid: 26de63bd-c5b7-48ba-b81d-5d77a533d077
> State: Peer in Cluster (Connected)
>
> Hostname: pbs3ib
> Uuid: c79c4084-d6b9-4af9-b975-40dd6aa99b42
> State: Peer in Cluster (Connected)
>
> On the other hand, there are the errors I reported yesterday:
> ===================================================
> [2012-08-01 18:07:26.104910] W
> [dht-selfheal.c:875:dht_selfheal_directory] 0-gli-dht: 1 subvolumes
> down -- not fixing
> ===================================================
>
> as well as this information:
> $ gluster volume status all detail
>
> [top 2 brick stanzas trimmed; they're online]
>
> ------------------------------------------------------------------------------
> Brick : Brick pbs3ib:/bducgl
> Port : 24018
> Online : N    <<=====================
> Pid : 20953
> File System : xfs
> Device : /dev/md127
> Mount Options : rw
> Inode Size : 256
> Disk Space Free : 6.1TB
> Total Disk Space : 8.2TB
> Inode Count : 1758158080
> Free Inodes : 1752326373
>
> ------------------------------------------------------------------------------
> Brick : Brick pbs4ib:/bducgl
> Port : 24009
> Online : Y
> Pid : 20948
> File System : xfs
> Device : /dev/sda
> Mount Options : rw
> Inode Size : 256
> Disk Space Free : 4.6TB
> Total Disk Space : 6.4TB
> Inode Count : 1367187392
> Free Inodes : 1361305613
>
> The above implies fairly strongly that the pbs3ib brick never re-established
> its connection to the volume, although the peer status says it rejoined the
> cluster.
>
> Strangely enough, when I RE-restarted glusterd on that node, the brick DID
> come back and rejoin the gluster volume, and now the (restarted) fix-layout
> job is proceeding without those "subvolumes down -- not fixing" errors, just
> a steady stream of 'found anomalies / fixing the layout' messages, though at
> the rate it's going it looks like it will take several days.
>
> Still, better to spend several days fixing the data on-disk with the fs live
> than to have to tell users that their data is gone and then rebuild from
> zero. Luckily, it's officially a /scratch filesystem.
>
> Harry

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[m/c 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
415 South Circle View Dr, Irvine, CA, 92697 [shipping]
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
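PS, for the trouble report: the bare command sequence, boiled down from the above. This is a sketch, not a recipe; [vol-name] is a placeholder for your volume name, and the glusterd restart line is just the usual sysvinit form, so adjust it to whatever your distro's init system expects.

   # 1. Check that all peers are connected (in my case they were,
   #    even while the brick was effectively dead)
   gluster peer status

   # 2. Check the bricks themselves; this is what actually showed the
   #    problem (look for "Online : N" in the per-brick stanzas)
   gluster volume status all detail

   # 3. On the server with the offline brick, restart the gluster daemon
   #    (command varies by init system; this is the sysvinit form)
   service glusterd restart

   # 4. Once all bricks show "Online : Y", start the layout fix and
   #    check on its progress
   gluster volume rebalance [vol-name] fix-layout start
   gluster volume rebalance [vol-name] status

While it runs, the rebalance log under /var/log/glusterfs/ (on my systems it's named after the volume, something like [vol-name]-rebalance.log) is where the 'found anomalies / fixing the layout' messages show up.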