On Mon, 2009-01-05 at 02:30 -0800, Anand Babu Periasamy wrote:
> Christopher, main issue with self-heal is its complexity. Handling self-healing
> logic in a non-blocking asynchronous code path is difficult. Replicating a missing
> file sounds simple, but holding off a lookup call and initiating a new series of calls
> to heal the file and then resuming back normal operation is tricky. Much of the
> bugs we faced in 1.3 is related to self-heal. We have handled most of these cases
> over a period of time. Self-healing is decent now, but not good enough. We feel that
> it has only complicated the code base. It is hard to test and maintain this part of
> the code base.
>
> Plan is to drop self-heal code altogether once the active healing tool gets ready.
> Unlike self-healing, this active healing can be run by the user on a mounted file system
> (online) any time. By moving the code out of the file system, into a tool (that is
> synchronous and linear), we can implement sophisticated healing techniques.

Hi Anand,

the active healing tool sounds good - I'm hoping the more sophisticated
healing techniques might include rsync-style sync :)

The dropping of self-heal looks to be worrying a few people - maybe you
can elaborate a little (I'm assuming it's not as bad as it sounds).

For example, with afr/replicate but without self-healing, what will the
behaviour of the cluster be when a brick is stopped, a file is updated and
then the brick is restarted? Will Gluster serve the most "recent" file
available (from the other bricks) until the active healing tool is run to
update the first brick (which would then allow full read balancing)?

Thanks,

John.
--
Serious Rails Hosting: http://www.brightbox.co.uk