At 02:30 AM 1/5/2009, Anand Babu Periasamy wrote:
>Christopher, main issue with self-heal is its complexity. Handling self-healing
>logic in a non-blocking asynchronous code path is difficult. Replicating a missing
>sounds simple, but holding off a lookup call and initiating a new series of calls
>to heal the file and then resuming back normal operation is tricky. Much of the
>bugs we faced in 1.3 is related to self-heal. We have handled most of these cases
>over a period of time. Self-healing is decent now, but not good enough. We feel that
>it has only complicated the code base. It is hard to test and maintain this part of
>the code base.
>
>Plan is to drop self-heal code all together once the active healing tool gets ready.
>Unlike self-healing, this active healing can be run by the user on a mounted file system
>(online) any time. By moving the code out of the file system, into a tool (that is
>synchronous and linear), we can implement sophisticated healing techniques.
>
>Code is not in the repository yet. Hopefully in a month, it will be ready for use.
>You can simply turn off self-heal and run this utility while the file system is mounted.

I realize this is perhaps a bit premature, but am I to understand you'll be doing away with automatic self-healing in replicate? This seems to eliminate much of the value of Gluster's AFR component. If we have to heal manually with some tool, there's always a risk of a data integrity problem while that healing process is being executed after a server interruption. If it's going to be optional to turn on or off, that's fine, I suppose, but please, if you're considering removing this feature altogether, reconsider, unless this active healing tool is something that would be run automatically any time there's a disconnect between AFR servers.
While I certainly do realize that the self-heal code is a HUGE performance issue as it's currently written (at least that's what I'm noticing on my servers), its function is necessary to make AFR useful.