Re: A healing translator

Xavier Hernandez <xhernandez@xxxxxxxxxx> · Tue, 22 May 2012 10:51:22 +0200



    On 05/22/2012 09:48 AM, Anand Avati wrote:
    
      
            I've tried to understand how AFR works and, in some way,
            some of the ideas have been taken from it. However it is
            very complex and a lot of changes have been carried out in
            the master branch over the latest months. It's hard for me
            to follow them while actively working on my translator.
            Nevertheless, the main reason to take a separate path was
            that AFR is strongly bound to replication (at least from
            what I saw when I analyzed it more deeply. Maybe things have
            changed now, but I haven't had time to review them).

          
        Have you reviewed the proactive self-heal daemon (+
          changelog indexing translator) which is a potential functional
          replacement for what you might be attempting?
        

        Avati
      
    
    I must admit that I've read something about it but I haven't had
    time to explore it in detail.

    
    If I understand it correctly, the self-heal daemon works as a client
    process but can be executed on server nodes. I suppose that multiple
    self-heal daemons can be running on different nodes. Then, each
    daemon detects invalid files (not sure exactly how) and replicates
    the changes from one good node to the bad nodes.
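
    As a rough illustration of that replication-style healing, here is a
    minimal Python sketch. It is not how AFR or the self-heal daemon
    actually detect bad files (I don't know that in detail); the
    per-copy "version" counter and the heal_replica name are assumptions
    made only for this example. The point is that a single good copy is
    enough to repair all the others.

```python
def heal_replica(copies):
    """Repair stale replicas in place from the single newest copy.

    copies: list of dicts {'version': int, 'data': bytes}.
    The 'version' field is a hypothetical stand-in for whatever the
    real daemon uses to tell good copies from bad ones.
    Returns the indexes of the copies that were healed.
    """
    # One good node is a sufficient source: it holds the whole data.
    source = max(copies, key=lambda c: c["version"])
    healed = []
    for i, copy in enumerate(copies):
        if copy["version"] < source["version"]:
            copy["data"] = source["data"]
            copy["version"] = source["version"]
            healed.append(i)
    return healed
```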

    
    The problem is that in the translator I'm working on, the
    information is dispersed among multiple nodes, so there isn't a
    single server node that contains the whole data. To repair a node,
    data must be read from at least two other nodes (it depends on
    configuration). From what I've read about AFR and the self-healing
    daemon, it's not straightforward to adapt them to this mechanism
    because they would need to identify a subset of nodes with
    consistent data, not just a single good node. Each daemon would
    have to contact all other nodes, read data from each one, determine
    which ones are valid, rebuild the data and send it to the bad nodes.
    This means that the daemon would have to be as complex as the
    clients.
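
    To make the difference concrete, here is a small Python sketch of
    the steps listed above for a dispersed layout. It is only an
    illustration under stated assumptions: it uses a simple 2+1 XOR
    parity encoding (not the encoding of the translator), a hypothetical
    per-fragment "version" counter to decide which fragments are
    consistent, and made-up names (disperse, heal, K). No single node
    holds the whole data, so a bad fragment can only be rebuilt from K
    consistent fragments on other nodes.

```python
from collections import Counter

K = 2  # fragments needed to rebuild one (assumed 2+1 configuration)


def xor_bytes(a, b):
    """Byte-wise XOR of two equally sized fragments."""
    return bytes(x ^ y for x, y in zip(a, b))


def disperse(data):
    """Split data into two halves plus an XOR parity fragment."""
    if len(data) % 2:
        data += b"\0"  # pad to an even length
    half = len(data) // 2
    a, b = data[:half], data[half:]
    return [a, b, xor_bytes(a, b)]


def heal(nodes):
    """Repair bad fragments in place; return the healed node indexes.

    nodes: list of three dicts {'version': int, 'fragment': bytes}.
    """
    # Read from every node and find the most common version.
    counts = Counter(n["version"] for n in nodes)
    good_version, votes = counts.most_common(1)[0]
    if votes < K:
        raise RuntimeError("not enough consistent fragments to rebuild")
    good = [i for i, n in enumerate(nodes) if n["version"] == good_version]
    bad = [i for i in range(len(nodes)) if i not in good]
    for i in bad:
        # With XOR parity, any fragment is the XOR of the other two.
        first, second = (nodes[j]["fragment"] for j in good[:K])
        nodes[i]["fragment"] = xor_bytes(first, second)
        nodes[i]["version"] = good_version
    return bad
```

    For example, after dispersing a 12-byte buffer over three nodes and
    replacing the second fragment with stale content at an older
    version, heal() reads the two consistent fragments, rebuilds the
    second one and reports index 1 as healed.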

    
    My impression (but I may be wrong) is that AFR and the self-healing
    daemon are closely bound to the replication scheme, so it is very
    hard to use them for other purposes. The healing translator I'm
    writing tries to offer generic server-side helpers for the healing
    process, but it is the client side that really manages the healing
    operation (though heavily simplified), and it could use the helpers
    to replicate data, to disperse data, or for some other scheme.

    
    Xavi