[List-hacking] [bug #25207] an rm of a file should not cause that file to be replicated with afr self-heal.

john at brightbox.co.uk (John Leach) · Mon, 05 Jan 2009 16:15:28 +0000

On Mon, 2009-01-05 at 02:30 -0800, Anand Babu Periasamy wrote:
> Christopher, main issue with self-heal is its complexity. Handling self-healing
> logic in a non-blocking asynchronous code path is difficult. Replicating a missing
> file sounds simple, but holding off a lookup call and initiating a new series of calls
> to heal the file and then resuming normal operation is tricky. Many of the
> bugs we faced in 1.3 are related to self-heal. We have handled most of these cases
> over a period of time. Self-healing is decent now, but not good enough. We feel that
> it has only complicated the code base. It is hard to test and maintain this part of
> the code base.
> 
> The plan is to drop the self-heal code altogether once the active healing tool is ready.
> Unlike self-healing, this active healing can be run by the user on a mounted file system
> (online) any time. By moving the code out of the file system, into a tool (that is
> synchronous and linear), we can implement sophisticated healing techniques.

Hi Anand,

the active healing tool sounds good - I'm hoping the more sophisticated
healing techniques might include rsync style sync :)

the dropping of self-heal looks to be worrying a few people - maybe you
can elaborate a little (I'm assuming it's not as bad as it sounds).

For example, with afr/replicate but without self-healing, what will the
behaviour of the cluster be when a brick is stopped, a file updated and
then the brick restarted?  Will Gluster serve the most "recent"
file available (from the other bricks) until the active healing tool is
run to update the first brick? (then allowing full read balancing)
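
To make the question concrete, here is a rough sketch of the behaviour
I'm hoping for. It is only a toy model: the Brick class, the per-file
version counter and the function names are all made up for illustration
and are not how afr actually tracks which copy is newest.

```python
from dataclasses import dataclass, field


@dataclass
class Brick:
    """Hypothetical stand-in for one replica; not a real Gluster type."""
    name: str
    online: bool = True
    # path -> (version, data); the version counter is an assumption
    files: dict = field(default_factory=dict)


def write(bricks, path, data):
    """Write to every online brick, bumping a per-file version."""
    latest = max((b.files[path][0] for b in bricks if path in b.files),
                 default=0)
    for b in bricks:
        if b.online:
            b.files[path] = (latest + 1, data)


def read(bricks, path):
    """Serve the copy with the highest version among online bricks."""
    copies = [b.files[path] for b in bricks if b.online and path in b.files]
    if not copies:
        raise FileNotFoundError(path)
    return max(copies, key=lambda c: c[0])[1]


def stale_bricks(bricks, path):
    """Names of bricks an active healing tool would still need to update."""
    newest = max(b.files[path][0] for b in bricks if path in b.files)
    return [b.name for b in bricks
            if b.files.get(path, (0, None))[0] < newest]


# The scenario from the question: stop a brick, update a file, restart it.
a, b = Brick("a"), Brick("b")
write([a, b], "f", "old")
a.online = False
write([a, b], "f", "new")
a.online = True
print(read([a, b], "f"))          # prints "new", served from brick b
print(stale_bricks([a, b], "f"))  # prints ['a'], still waiting to be healed
```

In this model reads stay correct while brick "a" is stale, but they can
only be balanced across all bricks again once "a" has been healed.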

Thanks,

John.
-- 
Serious Rails Hosting: http://www.brightbox.co.uk