--- On Wed, 7/30/08, Łukasz Osipiuk <lukasz@xxxxxxxxxxx> wrote:

>> Step 1: Client1: cp test_file.txt /mnt/gluster/
>> Step 2: Brick1 and Brick4: have test_file.txt in the
>> /mnt/gluster/ directory
>> Step 3: Client1: ls /mnt/gluster - test_file.txt is present
>>
>> Step 4: Brick1: rm /mnt/gluster/test_file.txt
>> Step 5: Client1: cat /mnt/gluster/test_file.txt -> we get the
>> contents of the file from Brick4
>>
>> Step 6: Brick1: ls /home/export is empty. Self-heal
>> did not recover the file.
>
> > I suspect that this is normal; you are not supposed to modify the
> > bricks manually from underneath AFR. AFR uses extended attributes
> > to keep file version metadata. When you manually deleted the file
> > in step 4, the directory version metadata was not updated, so I
> > suspect that caused the mismatch to go undetected. The self-heal
> > would have occurred if the brick node had been down, the file had
> > been deleted by a client, and then the brick node had returned to
> > operation.
> >
> > -Martin
> ------
>
> Martin, it is obvious that one normally should not modify the AFR
> backend directly. The experiment Tomáš (and I) made was a
> simulation of a real-life problem where you lose some data on one
> of the data bricks.

I understand. I am not sure that AFR is equipped to handle all of
these types of failures: some of them, yes, but not all. Mostly, the
versioning mechanisms are aimed at healing from network/node outages,
not from disk corruption. If you want that, you will probably have to
put RAID under your local filesystems. Although, someone did mention
in a post a while ago an alpha-stage translator that will do
checksumming on a local device.

> The more extreme example is: one of the data bricks explodes and
> you replace it with a new one, configured as the one that went off,
> but with an empty HD. This is the same as the above experiment, but
> all data is gone, not just one file.

AFR should actually handle this case fine.
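To make the "mismatch goes undetected" part concrete, here is a toy model of AFR's version metadata (all names here are made up for illustration; real AFR keeps this state in extended attributes on the bricks, not in Python objects):

```python
# Toy model: why a manual rm on a brick goes undetected by self-heal.
# AFR bumps version metadata only on operations made *through* it.

class Brick:
    def __init__(self):
        self.files = {}
        self.dir_version = 0   # bumped on every change made through AFR

def afr_create(bricks, name, data):
    # A write through the client updates every brick and its version.
    for b in bricks:
        b.files[name] = data
        b.dir_version += 1

def afr_needs_heal(b1, b2):
    # Self-heal compares only the version metadata, not the contents.
    return b1.dir_version != b2.dir_version

brick1, brick4 = Brick(), Brick()
afr_create([brick1, brick4], "test_file.txt", "hello")

# Step 4: file removed directly on brick1, bypassing AFR, so the
# directory version is NOT bumped and the mismatch is invisible.
del brick1.files["test_file.txt"]

print(afr_needs_heal(brick1, brick4))   # False: no heal is triggered
```

Both bricks still claim the same version, so AFR has no reason to compare contents; that is the whole failure mode Tomáš observed.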
When you install a new brick and it is empty, there is no metadata
for any files or directories on it, so it will self-heal (lazily).
The problem you described above occurs because you have metadata
saying that your files (the directory, actually) are up to date, but
the directory is not, since it was modified manually under the hood.
AFR cannot detect this (yet); it trusts its metadata.

> Is there a way to make GlusterFS "heal" so the new node contains
> replicated data from its mirror?

In your case, yes: if you either delete the attributes for the
out-of-date files/directories or set them to a lower version than
their peer's, they should heal on the next find/access.

> I tried the find-head pattern but it doesn't help :(

See above; AFR does not know they are out of date, so this won't
help.

It does seem like it would be fairly easy to add another metadata
attribute to each file/directory that would hold a checksum for it.
This way, AFR itself could be configured to check/compute the
checksum any time the file is read/written. Since this would slow AFR
down, I would suggest a configuration option to turn it on. If the
checksum is wrong, AFR could heal to the version on the other brick,
provided the other brick's checksum is correct.

Another alternative would be to create an offline checksummer that
updates such an attribute if it does not exist, and checks the
checksum if it does. If the check fails, it would simply delete the
file and its attributes (and potentially the directory attributes up
the tree) so that AFR will then heal it. The only modification needed
in AFR to support this would be to delete the checksum attribute any
time the file/directory is updated, so that the offline checksummer
will recreate it instead of thinking the file is corrupt.
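The "set them to a lower version" repair can be sketched the same way, as a toy model (not AFR's actual code; on a real brick you would reset the trusted.afr extended attributes with setfattr and then trigger a lookup):

```python
# Toy model of the manual repair: mark the stale brick's version
# lower than its peer's so the next lookup triggers a lazy self-heal.

class Brick:
    def __init__(self, files, dir_version):
        self.files = dict(files)
        self.dir_version = dir_version

def afr_lookup_and_heal(stale, fresh):
    # On access, AFR compares versions; the lower-versioned side is
    # overwritten from the higher-versioned peer.
    if stale.dir_version < fresh.dir_version:
        stale.files = dict(fresh.files)
        stale.dir_version = fresh.dir_version

# brick1 lost the file out-of-band but still claims version 1,
# so by itself nothing would heal.
brick1 = Brick({}, dir_version=1)
brick4 = Brick({"test_file.txt": "hello"}, dir_version=1)

brick1.dir_version = 0              # the manual "mark stale" step
afr_lookup_and_heal(brick1, brick4)
print(sorted(brick1.files))         # ['test_file.txt']: healed
```

Once the version is lowered (or the attributes deleted entirely, which amounts to the same thing), the usual find-triggered access pattern does the rest.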
In fact, even this could be eliminated so that the offline
checksummer is completely "self-powered": any time it calculates a
checksum, it could copy the GlusterFS version and timestamp
attributes into two new "checksummer" attributes. If these become out
of date, the checksummer will know to recompute the checksum instead
of assuming that the file has been corrupted.

The one risk with this is that if a file gets corrupted on both
nodes, it will get deleted on both nodes, so you will not have a
corrupted file to at least look at. This too could be overcome by
saving any deleted files in a separate "trash can" and cleaning the
trash can once the files in it have been healed; sort of a
self-cleaning lost+found directory.

I know this may not be the answer you were looking for, but I hope it
helps clarify things a little.

-Martin
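P.S. A rough sketch of such an offline checksummer (entirely hypothetical, not part of GlusterFS; for portability this sketch keeps its records in a sidecar JSON file, where a real tool would use an extended attribute plus copies of AFR's version/timestamp attributes to tell "updated" from "corrupted"):

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path

def checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def scan(brick: Path, trash: Path) -> list:
    """Record checksums for unseen files; quarantine files whose
    contents changed while their recorded checksum did not, so AFR
    can heal them from the peer brick.  Returns quarantined names."""
    meta_file = brick / ".checksums.json"
    meta = json.loads(meta_file.read_text()) if meta_file.exists() else {}
    trash.mkdir(exist_ok=True)
    corrupted = []
    for f in sorted(brick.iterdir()):
        if f.name.startswith("."):     # skip our own bookkeeping/trash
            continue
        digest = checksum(f)
        if f.name not in meta:
            meta[f.name] = digest      # first visit: just record it
        elif meta[f.name] != digest:
            # Self-cleaning lost+found: keep the bad copy around.
            shutil.move(str(f), str(trash / f.name))
            del meta[f.name]
            corrupted.append(f.name)
    meta_file.write_text(json.dumps(meta))
    return corrupted

# Quick demo on a throwaway directory standing in for a brick.
brick = Path(tempfile.mkdtemp())
(brick / "a.txt").write_text("good data")
scan(brick, brick / ".trash")            # first pass: record checksums
(brick / "a.txt").write_text("corrupt")  # simulate silent bit rot
print(scan(brick, brick / ".trash"))     # ['a.txt'] moved to .trash
```

The quarantined copy survives in the trash directory for inspection, which addresses the both-nodes-corrupted risk mentioned above.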