Re: Automated split-brain resolution

On 08/09/2014 12:18 AM, Harshavardhana wrote:

While we could extend the existing heal command, we also need to provide a
policy flag. Entering "y/n" for 1000 files does not make the process any
easier.

What I meant was not a complete solution, just suggestions; of
course there should be improvements on that too. Look at e2fsck's
output when it fixes corruption issues, for example.

I don't follow this part completely. If `info split-brain` gives you the
gfid instead of the file path, you could just go to the .glusterfs/<gfid
hardlink> and do a setfattr there.
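(For concreteness, a minimal sketch of what that would look like on a
brick - the brick path, volume name and gfid below are made up, and the
exact trusted.afr.* xattr names depend on the volume's client indices:)

  # .glusterfs keeps a hardlink to every file under
  # <first two hex chars of gfid>/<next two>/<full gfid>
  GFID=01234567-89ab-cdef-0123-456789abcdef
  BRICK=/export/brick1
  GFID_PATH=$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID

  # inspect the AFR changelog xattrs on the gfid hardlink
  getfattr -d -m trusted.afr -e hex "$GFID_PATH"

  # on the brick whose copy is to be discarded, zero the entry that
  # accuses the brick chosen as source, so self-heal takes the other copy
  setfattr -n trusted.afr.myvol-client-0 \
           -v 0x000000000000000000000000 "$GFID_PATH"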

It isn't just about the setfattr; one needs to validate which file the
gfid points to for it to make any sense. Are you saying you can know
the contents of a file just by looking at its canonical gfid form?
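(For reference, validating which path a gfid actually corresponds to
means an inode match against the whole brick - reusing the made-up
variables from the sketch above - and this scan is exactly the slow part:)

  # find the user-visible path(s) sharing an inode with the gfid file
  find "$BRICK" -samefile "$GFID_PATH" -not -path '*/.glusterfs/*'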

command for each entry in the file. It also makes it easy to integrate with a
GUI: click 'get files in sb' and you have a scrollable list of files with
policies against each file. Select a file, tick the policy, click
'resolve-sb' and done!

I agree with the policy style, but the inherent problem is never fixed:
you are still asking someone to write scripts using "info split-brain".

Here is a breakdown of how it happens today (a sketch of the first two
steps follows below):

- grep /var/log/glusterfs/glustershd.log | awk ... (collect the gfids)
- Run the script to see which files are really in split-brain
  (gfid-to-file.sh) - thanks Joe Julian!
  Do this on all servers and gather the output.

  On a large enough cluster, for example a 250TB volume with
  60 million files, this takes about 4 hours - assuming
  no new split-brains appeared in the meantime.
- Next, gather the getfattr/setfattr output.
- Figure out which copies are to be deleted - then delete them.

This whole cycle is a 2-3 day activity on bigger clusters.
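As a rough illustration of the first two steps - the exact
glustershd.log message format varies between releases, so treat this
purely as a sketch:

  # pull gfids of suspected split-brain entries out of the shd log
  grep -i 'split-brain' /var/log/glusterfs/glustershd.log \
      | grep -oE 'gfid:[0-9a-f-]{36}' | sort -u > sb-gfids.txt
  # gfid-to-file.sh then maps each gfid back to a path on every
  # brick - that is the 4-hour pass mentioned above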

With your approach, after having a policy:

- grep /var/log/glusterfs/glustershd.log | awk ... (collect the gfids)
- Run the script to see which files are really in split-brain
  (gfid-to-file.sh) - thanks Joe Julian!
  Do this on all servers and gather the output.

  On a large enough cluster, for example a 250TB volume with
  60 million files, this takes about 4 hours - assuming
  no new split-brains appeared in the meantime.
- Figure out which copies are to be deleted and provide a policy based on
  source-brick or bigger-file. (In fact, this seems like just a
  replacement for `rm -rf`.)
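(The manual deletion this replaces typically means removing both the
bad copy and its gfid hardlink on the chosen brick - a sketch with
made-up paths:)

  # on the brick holding the copy to discard:
  rm -f "$BRICK/path/to/file"
  rm -f "$BRICK/.glusterfs/${GFID:0:2}/${GFID:2:2}/$GFID"
  # the next heal copies the surviving replica back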

Now, what would be ideal:

- Figure out which file should be deleted based on a policy (name your policy)

A 250TB cluster is simply a POC cluster in the case of GlusterFS, not
production, so you should think of scales orders of magnitude higher
when there is a problem.

Questions that occur here are:

- Why does one have to write a script at all, when we ought to be
responsible for this information and even for providing valid suggestions?
This is a standard problem wherever there are split-brains in distributed systems. For example, even git gives up in some cases and asks users to fix the files themselves, i.e. merge conflicts. If users don't want split-brains, they should move to replica-3 and enable client-quorum. But if a user made a conscious decision to live with split-brain problems, favouring availability by using replica-2, then split-brains do happen and they need user intervention. All we are trying to do is make this process a bit less painful by coming up with meaningful policies.

If the user knows the workload is append-only and there are split-brains, the only command to execute is:
'gluster volume heal <volname> split-brain bigger-file'
No grep, no finding file paths, nothing.

Every time I had to interact with users about fixing split-brains, I found that they needed to know afr internals to fix the split-brain themselves, which is a tedious process, and there is still the possibility of making mistakes while clearing the xattrs (it happened once or twice :-( ). That is the reason for implementing the second version of the command, which chooses the file from a brick: 'gluster volume heal <VOLNAME> split-brain source-brick <brick_name> <file>'.

There were also instances where the user knew which brick he/she would like to be the source, but was worried that the old brick coming back up would cause split-brains, so he/she had to erase the whole brick that was down before bringing it back up. Instead we can suggest using 'gluster volume heal <VOLNAME> split-brain source-brick <brick_name>' after bringing the brick back up, so that not all of the contents need to be healed.

The next step for this solution is to implement something similar to what Joe/Emmanuel suggested, where the stat/pending-matrix etc. info is presented to the user, who then either picks the files or writes the decisions to a file, and then we resolve the split-brains.

Thanks Harsha/Joe/Emmanuel for providing inputs. Very good inputs :-).

Some initial thoughts on the solution based on Harsha/Joe/Emmanuel's inputs:
1) 'gluster volume heal <volname> info split-brain' should give output in some 'format' giving stat/pending-matrix etc. for all the files in split-brain. (Unfortunately we still don't have a way to provide file paths without doing a 'find' on the bricks.)
2) The user saves this output to a file and edits it, marking his choices of files he wants as sources.
3) 'gluster volume heal <volname> split-brain input-file <file>' takes those inputs and fixes the files.
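Purely as a hypothetical illustration of step 2 - no such format exists
yet, and the gfids and bricks below are invented - the edited input file
might look something like:

  # gfid                                   chosen source brick
  01234567-89ab-cdef-0123-456789abcdef     server1:/export/brick1
  fedcba98-7654-3210-fedc-ba9876543210     server2:/export/brick2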

The question is: is it worthwhile to implement the two commands proposed by Ravi to begin with, and implement the solution above in subsequent releases? These are easier to implement, and I feel they definitely address some of the pain points I have observed while dealing with users.

Pranith
- If you are saying that 'info split-brain' should print gfids, what
purpose does that serve anyway? I would even get rid of 'info
split-brain' - why would anyone need to see which files are in split-brain
when all we are printing is a gfid?
- Trust is placed in us when a user copies their data into GlusterFS; we
are solely responsible for it. If we cannot make valid decisions about
the files we are supposed to manage, how do you expect a normal
user to make better decisions than us?

Here is an example we came across: there was a suggestion I made to
Pranithk, based on Avati's idea, that even a file in metadata split-brain
can be made readable, which is not the case today. This came out
of the fact that there are some important details which we know wholly
as a system but which are not available to the user.

Since this has been a perpetual misery for years, I would like to
see it fixed in a more convincing manner.

Excuse me for being blunt about it!

So we now have the command:
# gluster volume heal <VOLNAME> [full | info [split-brain] | split-brain
{bigger-file  |  source-brick <brick_name>} [<file>] ]

The relevant new extension being:
gluster volume heal  <VOLNAME> split-brain {bigger-file  | source-brick
<brick_name>} [<file>]

This looks good.
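For concreteness, usage of the new forms might look like this (volume,
brick and file names here are made up):

  # keep the larger copy of one file:
  gluster volume heal myvol split-brain bigger-file /data/file1
  # take one brick as the source, for a single file or for everything:
  gluster volume heal myvol split-brain source-brick server1:/export/brick1 /data/file1
  gluster volume heal myvol split-brain source-brick server1:/export/brick1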


_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel



