On Thu, Mar 3, 2016 at 4:10 PM, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:
Hi,
On 03/03/2016 11:14 AM, ABHISHEK PALIWAL wrote:
Hi Ravi,
As discussed earlier, I investigated this issue and found that healing is not triggered because the "gluster volume heal c_glusterfs info split-brain" command shows no entries in its output, even though the file is in split-brain.
Couple of observations from the 'commands_output' file.
getfattr -d -m . -e hex opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
The afr xattrs do not indicate that the file is in split brain:
# file: opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
trusted.afr.c_glusterfs-client-1=0x000000000000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x000000000000000b56d6dd1d000ec7a9
trusted.gfid=0x9f5e354ecfda40149ddce7d5ffe760ae
getfattr -d -m . -e hex opt/lvmdir/c2/brick/logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
trusted.afr.c_glusterfs-client-0=0x000000080000000000000000
trusted.afr.c_glusterfs-client-2=0x000000020000000000000000
trusted.afr.c_glusterfs-client-4=0x000000020000000000000000
trusted.afr.c_glusterfs-client-6=0x000000020000000000000000
trusted.afr.dirty=0x000000000000000000000000
trusted.bit-rot.version=0x000000000000000b56d6dcb7000c87e7
trusted.gfid=0x9f5e354ecfda40149ddce7d5ffe760ae
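For clarity, here is a minimal Python sketch (my own illustration, not part of the original exchange) of how the trusted.afr values above decode, assuming the standard AFR layout of three big-endian 32-bit pending counters for data, metadata and entry operations:

```python
import struct

def decode_afr(hex_value: str):
    """Decode a trusted.afr.* xattr value into its three big-endian
    32-bit pending counters: (data, metadata, entry)."""
    raw = bytes.fromhex(hex_value[2:] if hex_value.startswith("0x") else hex_value)
    return struct.unpack(">III", raw[:12])

# Values taken from the getfattr output above.
clean = decode_afr("0x000000000000000000000000")   # first brick's view
blame = decode_afr("0x000000080000000000000000")   # second brick blames client-0

print(clean)  # (0, 0, 0) -> nothing pending
print(blame)  # (8, 0, 0) -> 8 pending data operations, no metadata/entry
```

A split-brain would require both bricks to blame each other; here only one side shows non-zero pending data counts, which is consistent with Ravi's observation that this is a pending heal rather than a split-brain.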
1. There doesn't seem to be a split-brain going by the trusted.afr* xattrs.
If it is not a split-brain problem, then how can I resolve this?
2. You seem to have re-used the bricks from another volume/setup. For replica 2, only trusted.afr.c_glusterfs-client-0 and trusted.afr.c_glusterfs-client-1 should be present, but I see four xattrs: client-0, 2, 4 and 6.
Could you please suggest why these entries are there? I am not able to figure out the scenario. I am rebooting one board multiple times to reproduce the issue, and after every reboot I do a remove-brick and add-brick on the same volume for the second board.
3. On the rebooted node, do you have ssl enabled by any chance? There is a bug for "Not able to fetch volfile' when ssl is enabled: https://bugzilla.redhat.com/show_bug.cgi?id=1258931
Btw, for data and metadata split-brains you can use the gluster CLI https://github.com/gluster/glusterfs-specs/blob/master/done/Features/heal-info-and-split-brain-resolution.md instead of modifying the file from the back end.
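For reference, the CLI-based resolution described in that document looks roughly like the following. This is a sketch using the volume and file names from this thread; <hostname> is a placeholder, and the file path is given relative to the volume root:

```shell
# List files currently in split-brain (the command already run in this thread):
gluster volume heal c_glusterfs info split-brain

# Resolve a data/metadata split-brain by keeping the larger copy:
gluster volume heal c_glusterfs split-brain bigger-file \
    /logfiles/availability/CELLO_AVAILABILITY2_LOG.xml

# Or keep the copy on a specific brick as the heal source:
gluster volume heal c_glusterfs split-brain source-brick \
    <hostname>:/opt/lvmdir/c2/brick \
    /logfiles/availability/CELLO_AVAILABILITY2_LOG.xml
```

Since the log file in this thread is fixed at 2MB on both bricks, bigger-file would not help; source-brick is the relevant variant if a split-brain were actually reported.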
But you are saying it is not a split-brain problem, and even the split-brain command is not showing any file, so how can I find the file that is bigger in size? Also, in my case the file size is fixed at 2MB; it is overwritten every time.
-Ravi
Abhishek
But my question is: why is the split-brain command not showing any file in its output? What I have done is manually delete the gfid entry of that file from the .glusterfs directory and follow the instructions mentioned in the following link to do the heal, and this works fine for me.
https://github.com/gluster/glusterfs/blob/master/doc/debugging/split-brain.md
Here I am attaching all the logs I got from the node, and also the output of the commands from both boards.
In this tar file two directories are present:
000300 - logs for the board which is running continuously
002500 - logs for the board which was rebooted
I am waiting for your reply; please help me out with this issue.
Thanks in advance.
Regards,
On Fri, Feb 26, 2016 at 1:21 PM, ABHISHEK PALIWAL <abhishpaliwal@xxxxxxxxx> wrote:
On Fri, Feb 26, 2016 at 10:28 AM, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:
On 02/26/2016 10:10 AM, ABHISHEK PALIWAL wrote:
Yes correct
Okay, so when you say the files are not in sync until some time, are you getting stale data when accessing from the mount?
I'm not able to figure out why heal info shows zero when the files are not in sync, despite all IO happening from the mounts. Could you provide the output of getfattr -d -m . -e hex /brick/file-name from both bricks when you hit this issue?
I'll provide the logs once I get them. Here, the delay means we are powering on the second board after 10 minutes.
On Feb 26, 2016 9:57 AM, "Ravishankar N" <ravishankar@xxxxxxxxxx> wrote:
Hello,
On 02/26/2016 08:29 AM, ABHISHEK PALIWAL wrote:
Hi Ravi,
Thanks for the response. We are using GlusterFS 3.7.8.
Here is the use case:
We have a logging file which saves event logs for every board of a node, and these files are kept in sync using glusterfs. The system is in replica 2 mode, which means that when one brick in a replicated volume goes offline, the glusterd daemons on the other nodes keep track of all the files that are not replicated to the offline brick. When the offline brick becomes available again, the cluster initiates a healing process, replicating the updated files to that brick. But in our case, we see that the log file of one board is not in sync and its format is corrupted.
Just to understand you correctly, you have mounted the 2 node replica-2 volume on both these nodes and writing to a logging file from the mounts right?
Regards,
Abhishek

Solution: when we tried to put a delay of more than 5 minutes before the healing, everything works fine. Even the outcome of "gluster volume heal c_glusterfs info" shows that there are no pending heals.
Also, the logging file which is updated is of fixed size, and new entries wrap around, overwriting the old entries.
This way we have seen that after a few restarts the contents of the same file on the two bricks are different, but volume heal info shows zero entries.
On Fri, Feb 26, 2016 at 6:35 AM, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:
On 02/25/2016 06:01 PM, ABHISHEK PALIWAL wrote:
Hi,
I have one query regarding the time taken by the healing process. In our current two-node setup, when we rebooted one node, the self-healing process started within a 5-minute interval on the board, which resulted in the corruption of some files' data.
Heal should start immediately after the brick process comes up. What version of gluster are you using? What do you mean by corruption of data? Also, how did you observe that the heal started after 5 minutes?
-Ravi
And to resolve it I have search on google and found the following link:
https://support.rackspace.com/how-to/glusterfs-troubleshooting/
It mentions that the healing process can take up to 10 minutes to start.
Here is the statement from the link:
"Healing replicated volumes
When any brick in a replicated volume goes offline, the glusterd daemons on the remaining nodes keep track of all the files that are not replicated to the offline brick. When the offline brick becomes available again, the cluster initiates a healing process, replicating the updated files to that brick. The start of this process can take up to 10 minutes, based on observation."
After allowing more than 5 minutes, the file corruption problem was resolved.
So, my question is: is there any way to reduce the time the healing process takes to start?
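One setting that may be relevant here (an assumption on my part, not something confirmed in this thread): the self-heal daemon's periodic crawl interval is governed by the cluster.heal-timeout volume option, which defaults to 600 seconds (10 minutes) and matches the delay described above. A sketch:

```shell
# Check the current value (default is 600 seconds):
gluster volume get c_glusterfs cluster.heal-timeout

# Lower the self-heal daemon's crawl interval to 60 seconds:
gluster volume set c_glusterfs cluster.heal-timeout 60

# Or trigger an index heal immediately instead of waiting for the next crawl:
gluster volume heal c_glusterfs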
Regards,
Abhishek Paliwal
_______________________________________________ Gluster-devel mailing list Gluster-devel@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-devel
--
Regards
Abhishek Paliwal
_______________________________________________ Gluster-users mailing list Gluster-users@xxxxxxxxxxx http://www.gluster.org/mailman/listinfo/gluster-users