Hi again Pranith,

On 30/06/14 11:58, Pranith Kumar Karampuri wrote:
> Oops, I see you are the same user who posted about VM files self-heal.
> Sorry I couldn't get back in time. So you are using 3.4.2.
> Could you post logfiles of mount, bricks please. That should help us
> to find more information about any issues.
>
When you say the log for the mount, which log is that? There are none
that I can identify with the mount. (I've put my best guess at what you
might mean in a PS at the end of this message.)

> gluster volume heal <volname> info heal-failed records the last 1024
> failures. It also prints the timestamp of when the failures occurred.
> Even after the heal is successful it keeps showing the errors. So
> timestamp of when the heal failed is important. Because some of these
> commands are causing such confusion we deprecated these commands in
> upcoming releases (3.6).
>
So far I've been focusing on the heal-failed count, which I fully, and I
believe understandably, expect to show zero when there are no errors.
Now that I look at the timestamps of those errors I realise they are all
from *before* the slave brick was added back in. May I assume then that
in reality there are no unhealed files?

If this is correct, I must point out that reporting errors when there
are none is a massive design flaw. It means things like Nagios checks,
such as the one we use, are useless, which makes monitoring near enough
to impossible. (See the PS at the end of this message for what I think
we should be monitoring instead.)

> This is probably a stupid question but let me ask it anyway. When a
> brick contents are erased from backend
> we need to make sure about the following two things:
> 1) Extended attributes of the root brick is showing pending operations
> on the brick that is erased
> 2) Execute "gluster volume heal <volname> full"

1) While gluster was stopped I merely did an rm -rf on both the data
sub-directory and the .gluster sub-directory. How do I show that there
are pending operations? (My guess at the command for this is also in the
PS at the end of this message - please correct me if it's wrong.)

2) Yes, I did run that.

>
> Did you do the steps above?
>
> Since you are on 3.4.2 I think best way to check what files are healed
> is using extended attributes in the backend. Could you please post
> them again.

I don't quite understand what you're asking for. I understand attributes
as belonging to files and directories, not operations. Please elaborate.

>
> Pranith
>
> On 06/30/2014 07:12 AM, Pranith Kumar Karampuri wrote:
>>
>> On 06/30/2014 04:03 AM, John Gardeniers wrote:
>>> Hi All,
>>>
>>> We have 2 servers, each with one 5TB brick, configured as replica 2.
>>> After a series of events that caused the 2 bricks to become way out
>>> of step, gluster was turned off on one server and its brick was wiped
>>> of everything, but the attributes were untouched.
>>>
>>> This weekend we stopped the client and gluster and made a backup of
>>> the remaining brick, just to play safe. Gluster was then turned back
>>> on, first on the "master" and then on the "slave". Self-heal kicked
>>> in and started rebuilding the second brick. However, after 2 full
>>> days all files in the volume are still showing heal failed errors.
>>>
>>> The rebuild was, in my opinion at least, very slow, taking most of a
>>> day even though the system is on a 10Gb LAN. The data is a little
>>> under 1.4TB committed, 2TB allocated.
>> How much more to be healed? 0.6TB?
>>>
>>> Once the 2 bricks were very close to having the same amount of space
>>> used things slowed right down. For the last day both bricks show a
>>> very slow increase in used space, even though there are no changes
>>> being written by the client. By slow I mean just a few KB per minute.
>> Is the I/O still in progress on the mount? Self-heal doesn't happen
>> on files where I/O is going on mounts in 3.4.x. So that could be the
>> reason if I/O is going on.
>>>
>>> The logs are confusing, to say the least. In
>>> etc-glusterfs-glusterd.vol.log on both servers there are thousands of
>>> entries such as (possibly because I was using watch to monitor
>>> self-heal progress):
>>>
>>> [2014-06-29 21:41:11.289742] I
>>> [glusterd-volume-ops.c:478:__glusterd_handle_cli_heal_volume]
>>> 0-management: Received heal vol req for volume gluster-rhev
>> What version of gluster are you using?
>>> That timestamp is the latest on either server; that's about 9 hours
>>> ago as I type this. I find that a bit disconcerting. I have requested
>>> volume heal-failed info since then.
>>>
>>> The brick log on the "master" server (the one from which we are
>>> rebuilding the new brick) contains no entries since before the
>>> rebuild started.
>>>
>>> On the "slave" server the brick log shows a lot of entries such as:
>>>
>>> [2014-06-28 08:49:47.887353] E [marker.c:2140:marker_removexattr_cbk]
>>> 0-gluster-rhev-marker: Numerical result out of range occurred while
>>> creating symlinks
>>> [2014-06-28 08:49:47.887382] I
>>> [server-rpc-fops.c:745:server_removexattr_cbk] 0-gluster-rhev-server:
>>> 10311315: REMOVEXATTR
>>> /44d30b24-1ed7-48a0-b905-818dc0a006a2/images/02d4bd3c-b057-4f04-ada5-838f83d0b761/d962466d-1894-4716-b5d0-3a10979145ec
>>> (1c1f53ac-afe2-420d-8c93-b1eb53ffe8b1) of key ==> (Numerical result
>>> out of range)
>> CC Raghavendra who knows about marker translator.
>>>
>>> Those entries are around the time the rebuild was starting. The final
>>> entries in that same log (immediately after those listed above) are:
>>>
>>> [2014-06-29 12:47:28.473999] I
>>> [server-rpc-fops.c:243:server_inodelk_cbk] 0-gluster-rhev-server: 2869:
>>> INODELK (null) (c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file
>>> or directory)
>>> [2014-06-29 12:47:28.489527] I [server-rpc-fops.c:1572:server_open_cbk]
>>> 0-gluster-rhev-server: 2870: OPEN (null)
>>> (c67e9bbe-5956-4c61-b650-2cd5df4c4df0) ==> (No such file or directory)
>> These logs are harmless and were fixed in 3.5 I think. Are you on 3.4.x?
>>
>>>
>>> As I type it's 2014-06-30 08:31.
>>>
>>> What do they mean and how can I rectify it?
>>>
>>> regards,
>>> John
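
PS: Re the mount log - am I right in thinking you mean the client log
that glusterfs writes under /var/log/glusterfs/ on the machine where the
volume is mounted, named after the mount point with the slashes turned
into dashes? For example (the mount point below is only a guess on my
part; ours may well differ):

    # FUSE client log for a volume mounted at /mnt/gluster-rhev
    less /var/log/glusterfs/mnt-gluster-rhev.log

If that's the file you mean I'll dig it out and post it.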
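
Also, re showing pending operations and "extended attributes in the
backend": is something like the following what you're after? This is
just my guess from the docs, and the brick path is a placeholder for our
real one:

    # run on each server against the brick root (or against a file that
    # is reported as failing to heal); non-zero trusted.afr.* values
    # should mean there are operations pending for that client
    getfattr -d -m . -e hex /path/to/brick
    # I'd expect attributes along the lines of:
    #   trusted.afr.gluster-rhev-client-0=0x000000000000000000000000
    #   trusted.afr.gluster-rhev-client-1=0x000000000000000000000000

If that is what you mean I'll collect the output from both bricks and
post it.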
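
Finally, on the monitoring point: given that heal-failed apparently
holds on to old entries, I take it the thing we should really be
watching (and pointing Nagios at) is plain "heal info", which as I
understand it lists only the entries still waiting to be healed:

    # should list no entries under either brick once everything is healed
    gluster volume heal gluster-rhev info

Please correct me if that's not the right check either.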