Replies inline.

On Sun, Apr 29, 2007 at 02:10:04PM +0200, Danson Michael Joseph wrote:
> Hi all,
>
> To come back to the issue of "self-healing" in an AFR situation.
> Consider the rather complex situation below where A=AFR, S=stripe and
> U=unify:
>
>                       /---- Server1---\           Client3
>                      S---- Server2 ----S         /
>             /-------U---<                 >---U--------\
>            /         S---- Server3 ----S                \
>           /           \---- Server4 ---/                 \
> Client1 ------A--<                                  >---A------Client2
>           \           /---- Server5---\                  /
>            \         S---- Server6 ----S                /
>             \-------U---<                 >---U--------/
>            /         S---- Server7 ----S
>     Client4           \---- Server8 ---/
>
> Client1 and 2: AFR of two separate unions of two separate stripes
> Client3 and 4: Union of two separate stripes
>
> I think this is quite a complex arrangement and could probably account
> for 80% of large installation cases.  The obvious question here is what
> the method of healing would be for a server failure.

In the above scenario, you are using the storage cluster with different
'views' from different clients. This is basically 'not supported' by
glusterfs; you are expected to use the same spec file on all clients. The
configuration described above would work ONLY if you are extremely careful
and you know what you are doing (which is why the possibility is retained).

> Some thoughts:
>
> 1) As mentioned later on in this thread, the flexibility of gluster is
> great, but it is somewhat ridiculous to imagine that this flexibility
> frees one from using good cluster design.  For instance, the following
> configuration is probably of little use; the clients must have a useful
> configuration, possibly like the larger one above:
>
>                /---Server1---\
> Client1---S---<               >---U---Client2
>                \---Server2---/

We the developers are trying to push the image of glusterfs as a
'programmable filesystem', where the user is given a bunch of
functionalities as translators, and some glue code to mount and daemonize.
Having a programmable system also implies having responsibility.
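To make the 'same spec file on all clients' point concrete, the
AFR-over-unify-over-stripe view of Client1/Client2 would be declared
bottom-up in a single client spec, roughly like this. (This is only a
sketch: the volume names are invented, the server volumes are elided after
the first, and most per-translator options are omitted -- a real spec needs
more, e.g. a scheduler option for unify.)

```
# sketch of a client-side spec for the diagram above (names invented)

volume server1
  type protocol/client
  option transport-type tcp/client
  option remote-host server1.example.com
  option remote-subvolume brick
end-volume

# ... server2 through server8 defined the same way ...

volume stripe-a
  type cluster/stripe
  subvolumes server1 server2
end-volume

volume stripe-b
  type cluster/stripe
  subvolumes server3 server4
end-volume

volume unify-top
  type cluster/unify
  subvolumes stripe-a stripe-b
end-volume

# unify-bottom built likewise from server5..server8, then:

volume afr0
  type cluster/afr
  subvolumes unify-top unify-bottom
end-volume
```

Every client mounting with this same spec sees the same composed view, which
is what makes a coherent self-heal possible in the first place.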
Along with the strength you get from building the server and client
independently, you also get the other side of the coin: if they are built
differently and inconsistently, you risk ending up with a useless setup. (A
rough example would be to complain that the C programming language lets you
dereference pointers without a validity check. While it gives you
tremendous power, it also comes with the risk that if you do a (*NULL),
your code segfaults.)

> 2) If a server is replaced, healing must take place from any or all
> clients, otherwise the distributed nature of the system is lost.

If each client has a different view of the cluster (and you have somehow
managed to keep the overall system sane), then yes, each client (at least
one per view-type) should run a consistency check of its view. Otherwise,
just one client is sufficient.

> 3) No client should exist below a striping such as:
>
>          Client2
>                 \      /---- Server1...
>                  U---- Server2...
> ...Client1---S---<
>                  U---- Server3....
>                   \---- Server4....
>
> Correct me if I'm wrong, but trying to read striped data as the above
> drawing shows for client2 would not be very useful to client2.

The above configuration *could* exist. If for some strange reason you want
client2 to see only certain stripes of a file (with the rest of the file
seen as 'holes'), the above configuration works. Of course, the assumption
is that client2 knows it is seeing just a few stripes of the entire file
and conforms its usage of the file to that fact.

> 4) A suggestion here is to have each AFR client with a self-heal
> filter/translator.  ONLY AFR clients should have self-healing for
> replication.  Other clients such as the union clients can have
> self-healing filters, but for different filesystem health checks.  When a
> server fails and is replaced, all AFR clients get stuck in and attempt
> to reconstruct the data.  Thus in this situation, Clients 1 and 2 will
> heal the system.  Clients 3 and 4 cannot, because they don't have a full
> set of data from which to work.

Of course, self-heal is not a 'single entity'. Each translator
_contributes_ a chunk of sanity check (from its level of view) to the
overall filesystem check. AFR only checks for proper replication; unify
checks for a uniform directory structure and that each file resides on
only one child, etc.

> 5) Who is the dominant reconstruction client?  A simple possible
> solution is to have a "pre-healing" lock for each file to be
> reconstructed.  For instance, Client1 finds "hello.c" in bad shape
> because of the failure.  Client1 places a lock file in the directory,
> identifying itself with a timestamp.  Client2 also notices that
> "hello.c" is in bad shape and moves to fix it, but notices a lock file
> with a timestamp on it, and so will move on to another file/folder.  If
> Client2 notices that the timestamp has not been updated in 20s or
> something reasonable, that means that Client1 has crashed or failed in
> some manner and is no longer healing "hello.c".  Therefore Client2 will
> continue to heal "hello.c".  Obviously, during healing, nothing else
> should access the file for fear of further corruption.  Comments on that
> may run far, but so be it.

Your suggestion is valid. Noted.

> 6) What it all comes down to is: do not make the system's distributed
> nature worthless; let all clients get stuck in as if they were all
> trying to make breakfast.  If someone is making the eggs, don't make
> eggs, go make the toast.  If the eggs start burning because the cook
> went to the toilet, take over and finish the eggs.  Soon enough, with
> clever co-operation, the breakfast will be done.

It's breakfast time for me now :)

Thanks!
avati

--
ultimate_answer_t
deep_thought (void)
{
  sleep (years2secs (7500000));
  return 42;
}
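P.S. The timestamp-lock takeover from point 5 could look roughly like this.
This is only a sketch, not glusterfs code: the lock-file naming, the
`HEAL_TIMEOUT` constant (the 20s from the suggestion), and the helper names
are all made up, and the stale-lock takeover below still has a race between
two simultaneous takers that a real implementation would have to close.

```python
import os
import time

HEAL_TIMEOUT = 20.0  # seconds before a lock is considered stale (point 5's "20s")

def lock_path(path):
    """Pre-healing lock file for a given file (naming convention invented)."""
    d, f = os.path.split(path)
    return os.path.join(d, ".heal-lock." + f)

def try_acquire_heal_lock(path, client_id, now=None):
    """Try to become the healer of `path`.

    Returns True if this client may heal the file now, False if another
    client holds a fresh lock and we should move on to the next file.
    """
    now = time.time() if now is None else now
    lp = lock_path(path)
    try:
        # O_EXCL makes creation atomic: only one client wins the race.
        fd = os.open(lp, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        # Lock exists: is its owner still bumping the timestamp?
        if now - os.stat(lp).st_mtime < HEAL_TIMEOUT:
            return False   # healer is alive; skip this file for now
        os.unlink(lp)      # stale lock: owner crashed, take over
        return try_acquire_heal_lock(path, client_id, now)
    with os.fdopen(fd, "w") as f:
        f.write(client_id)  # identify ourselves, as point 5 suggests
    return True

def refresh_heal_lock(path):
    """Bump the timestamp periodically while healing is in progress."""
    os.utime(lock_path(path), None)

def release_heal_lock(path):
    """Healing finished: let other clients at the file again."""
    os.unlink(lock_path(path))
```

The important property is exactly the "breakfast" one from point 6: every
client may attempt every file, but the atomic lock plus the timeout means
work is neither duplicated nor abandoned when a healer dies.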