Re: Automated split-brain resolution

Ravishankar N <ravishankar@xxxxxxxxxx> · Fri, 08 Aug 2014 15:32:45 +0530

On 08/08/2014 01:09 PM, Harshavardhana wrote:
On Thu, Aug 7, 2014 at 1:35 AM, Ravishankar N <ravishankar@xxxxxxxxxx> wrote:
Manual resolution of split-brains [1] has been a tedious task involving
understanding and modifying AFR's changelog extended attributes. To simplify
and to an extent automate this task, we are proposing a new CLI command with
which the user can specify what the source brick/file is, and
automatically heal the files in the appropriate direction.

Command: gluster volume resolve-split-brain <VOLNAME> {<bigger_file>  |
source-brick <brick_name> [<file>] }

Breaking up the command into its possible options, we have:

a) gluster volume resolve-split-brain <VOLNAME> <bigger_file>
When this command is executed, AFR will consider the brick having the
highest file size as the source and heal it to all other bricks (including
all other sources and sinks) in that replica subvolume. If the file size is
the same on all the bricks, it does *not* heal the file.
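
A rough sketch of the bigger-file rule described above, in Python. This is
not AFR code: the function name and the brick-to-size mapping are
hypothetical, and treating a tie for the largest size among only some of the
bricks as "no clear source" is my assumption, since the proposal only covers
the case where all sizes are equal.

```python
from typing import Dict, Optional


def pick_bigger_file_source(sizes: Dict[str, int]) -> Optional[str]:
    """Return the brick holding the largest copy, or None if nothing should heal.

    `sizes` maps a (hypothetical) brick name to the size in bytes of its copy.
    """
    if not sizes:
        return None
    largest = max(sizes.values())
    candidates = [brick for brick, size in sizes.items() if size == largest]
    # All copies equal in size: the proposal says the file is *not* healed.
    # Assumption: a tie for the largest size among some bricks is refused too.
    if len(candidates) != 1:
        return None
    return candidates[0]
```

The returned brick would then be healed to every other brick of the replica
subvolume; a None result means the file is left untouched.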

b) gluster volume resolve-split-brain <VOLNAME> source-brick <brick_name>
[<file>]

When this command is executed, if <file> is specified, AFR heals the file
from the source-brick <brick_name> to all other bricks of that replica
subvolume. For resolving multiple files, the command must be run
iteratively, once per file.
If <file> is not specified, AFR heals all the files that have an entry in
.glusterfs/indices/xattrop *and* are in split-brain. As before, heals happen
from source-brick <brick_name> to all other bricks.

Future work could also include extending the command to add other policies
like choosing the file having the latest mtime as the source, integration
with trash xlator wherein the files deleted from the sink are moved to the
trash dir etc.

I have a few queries regarding the overall design itself.

Here are the caveats

    - Adding a new option rather than extending an existing option
'gluster volume heal'.

This does make sense.

    - Asking the user to input the filename, which should not be necessary
      by default, since such files are already available through
      'gluster volume heal <volname> info split-brain'

As of today, `info split-brain` is not 100% accurate. It does not list
entries that are in gfid split-brain (but we are not attempting to heal
those using a gluster CLI for now anyway), and for files that are in
(meta)data split-brain, it lists only the last 1024 entries and
sometimes contains stale entries. But this will be fixed soon with a
gfapi based implementation, much like the `heal info` command (glfs-heal.c)
in the 3.5 release.

What would be ideal is the following making it seamless and much more
user friendly

Extend the existing CLI as follows

  - 'gluster volume heal <volname> split-brain'

Agreed.

Healing split-brained files is more palpable and has a rather more
convincing tone for a sys-admin IMHO.

An example version of this extension would be.

'gluster volume heal <volname> split-brain [<file>|<gfid as canonical form>]'

In fact, since we already know the list of split-brained files, we can
just loop through them and ask interactive questions

# gluster volume heal <volname> split-brain
WARNING: About to start fixing split brained files on an active
GlusterFS volume, do you wish to proceed? y

WARNING: files removed would be actively backed up in '.trash' under
your brick path for future recovery.
...
WARNING: Found 1000 files in split brain
...
File on pair 'host1:host2' is in split brain, file with latest
time-stamp found on host1 - Fix? y
File on pair 'host3:host5' is in split brain. file with biggest size
found on host5 - Fix? y
....
....
....
....
************ Fixed (1000 split brain files) ************

While we could extend the existing heal command, we also need to provide 
a policy flag. Entering "y/n" for 1000 files does not make the process 
any easier.

# gluster volume heal <volname> split-brain
INFO: no split brains present on this <volume>

The real pain point of fixing the split brain is not taking getfattr
outputs and figuring out what is the file under conflict, the real
pain point is doing the gfid to the actual file translation when there
are millions of files. Gathering this list takes more time than
actually fixing the split brain and I have personally spent countless
hours doing this.

I don't follow this part completely. If `info split-brain` gives you the
gfid instead of the file path, you could just go to the .glusterfs/<gfid
hardlink> and do a setfattr there.
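
To illustrate what I mean by going to the gfid hardlink, here is a small
Python sketch that builds the backend path for a gfid. It assumes the usual
brick layout, where the entry lives under .glusterfs/<first two hex
chars>/<next two hex chars>/<gfid>; the function name and the brick root in
the example are made up.

```python
import posixpath
import uuid


def gfid_backend_path(brick_root: str, gfid: str) -> str:
    """Build the .glusterfs path for a gfid on one brick (no filesystem access)."""
    # uuid.UUID validates the string and str() gives the lowercase
    # 8-4-4-4-12 canonical form.
    canonical = str(uuid.UUID(gfid))
    # Assumed layout: .glusterfs/<chars 0-1>/<chars 2-3>/<full gfid>
    return posixpath.join(brick_root, ".glusterfs",
                          canonical[0:2], canonical[2:4], canonical)
```

For example, with a hypothetical brick root /export/brick1 and gfid
0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0, this gives
/export/brick1/.glusterfs/0f/1e/0f1e2d3c-4b5a-6978-8796-a5b4c3d2e1f0, which
is the path you would run getfattr/setfattr on.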

Now this list is easily available to GlusterFS and also its gfid to
path translation - why isn't it simple enough for us to ask the user
what we think is the right choice - we do certainly know which is the
bigger file too.

My general contention is when we know what is the right thing to do
under certain conditions we should be making it easier for example:
Directory metadata split brains - we just fix them automatically today,
but that certainly wasn't the case in the past. We learnt to do the right
thing when it's necessary from experience.

Sure, we have info on which the bigger file is or the one with the 
latest ctime but the bigger file need not always be the source (a 
truncated file could be the pristine copy). So the choice has to be 
given to the user via a policy.  To make automation easier, it makes 
more sense to apply the policy to all files as a whole, or run the 
command once per file, with a policy of choice. Running the command only 
once and asking for the policy (per file) midway through execution is
not amenable to automation. The user can redirect `info split-brain` to
a file and then script something to run the command for each entry in
the file. This also makes it easy to integrate with a GUI: click 'get files
in sb' and you have a scroll-down list of files with policies against
each file. Select a file, tick the policy and click 'resolve-sb' and done!

A better UI experience makes it really 'automated' as you intend: we
make the larger decisions ourselves and users are left with simple
choices to make, so that it's not confusing.

So we now have the command:
# gluster volume heal <VOLNAME> [full | info [split-brain] | split-brain 
{bigger-file  |  source-brick <brick_name>} [<file>] ]

The relevant new extension being:
gluster volume heal <VOLNAME> split-brain {bigger-file | source-brick
<brick_name>} [<file>]
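
As a sketch of how the per-file scripting could look with this syntax, the
Python below only builds the argument lists and does not run anything. Note
that the split-brain extension is still just the proposal in this thread,
the helper names are hypothetical, and the listing is assumed to have
already been reduced to one file path per line (the raw `info split-brain`
output would need some filtering first).

```python
from typing import List, Optional


def build_heal_command(volname: str, policy: str,
                       brick: Optional[str] = None,
                       path: Optional[str] = None) -> List[str]:
    """Build argv for the *proposed* 'heal <VOLNAME> split-brain ...' command."""
    cmd = ["gluster", "volume", "heal", volname, "split-brain"]
    if policy == "bigger-file":
        cmd.append("bigger-file")
    elif policy == "source-brick":
        if not brick:
            raise ValueError("source-brick needs a brick name")
        cmd += ["source-brick", brick]
    else:
        raise ValueError(f"unknown policy: {policy}")
    if path:
        # <file> is optional in the proposed grammar
        cmd.append(path)
    return cmd


def commands_for_listing(volname: str, policy: str, listing: str,
                         brick: Optional[str] = None) -> List[List[str]]:
    """One command per non-empty line of a saved listing (one path per line)."""
    return [build_heal_command(volname, policy, brick, line.strip())
            for line in listing.splitlines() if line.strip()]
```

Each returned list could be passed to subprocess.run() once the command
exists; keeping the policy as an argument means the same script works
whether it is applied to all files or chosen per file.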

Does this look good? Thanks for your feedback :)

-Ravi
_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://supercolony.gluster.org/mailman/listinfo/gluster-devel