Hi All,

There has been a lot of mail recently about AFR and HA. I guess the reason is the lack of documentation on self-heal, which has been pending for a long time. If you think your mail was not answered, or that the answer was incomplete, please follow it up on the mailing list or on IRC.

There is a change in AFR's functionality regarding the "option replicate" feature. We realised that, design-wise, it is not good to have this inside AFR and that it is better to have it outside AFR as a separate translator. This will not affect users who have been using "option replicate *:n" where n is the number of subvols. To people who use the option in other ways, we regret the inconvenience. The pattern matching translator will be available in the 1.4 release, so the overall functionality is not being compromised.

The reasons for taking the "option replicate" feature out of AFR:

* It does not belong there. The "option replicate" setting has to be exactly the same across all the AFRs. If it must be the same everywhere, it should be specified in one place common to all AFRs instead of being repeated in each AFR.
* "option replicate" was making the working of self-heal more complicated. We had to come up with workarounds to make it work, and in the long run workarounds are not good.

Here is the document which will be put up in the wiki. Any feedback on what should be added, or any other suggestions, will be appreciated.

=========

AFR provides RAID-1 like functionality. AFR replicates files and directories across the subvolumes. Hence if AFR has four subvolumes, there will be four copies of all files and directories. AFR provides HA, i.e. if one of the subvolumes goes down (e.g. server crash, network disconnection), AFR will still service requests from the redundant copies.
AFR also provides self-heal functionality, i.e. when the crashed servers come back up, the outdated files and directories are updated to the latest versions. AFR uses extended attributes of the backend file system to track the versions of files and directories in order to provide the self-heal feature.

* Note that the previously supported "option replicate *html:2,*txt:1" pattern matching feature has been moved out of AFR. It will be provided as a separate translator in 1.4.

  volume afr-example
    type cluster/afr
    subvolumes brick1 brick2 brick3
  end-volume

This sample configuration will replicate all directories and files on brick1, brick2 and brick3. Each subvolume can be another translator (storage/posix or protocol/client).

All read() operations are served by the first child that is alive. If all three subvols are up, read() is done on brick1; if brick1 is down, read() is done on brick2. If read() was being done on brick1 and brick1 goes down, we fall back to brick2, and this is completely transparent to the user applications.

In 1.4 we will have:

* a feature where the user can specify the subvol from which AFR has to do read() operations (this will help users who have one of the subvols as a local storage/posix)
* a feature to allow scheduling of read() operations amongst the subvols in round-robin fashion.

The order of the subvolumes list should be the same across all the AFRs, because the subvolumes are used as lock servers.

TODO: details on working of locking.

Self-Heal

AFR has a self-heal feature, which replaces outdated file and directory copies with the most recent versions. For example, consider the following config:

  volume afr-example
    type cluster/afr
    subvolumes brick1 brick2
  end-volume

File self-heal

Now if we create a file foo.txt on afr-example, the file will be created on brick1 and brick2. The file will have two extended attributes associated with it in the backend filesystem: trusted.afr.createtime and trusted.afr.version.
The trusted.afr.createtime xattr holds the create time (in seconds since the epoch), and trusted.afr.version is a number that is incremented each time the file is modified. The increment happens during close() (only if a write was done before the close).

Suppose brick1 goes down and we edit foo.txt; the version on brick2 gets incremented. When brick1 comes back up and we open() foo.txt, AFR checks whether the versions of the copies are the same. If they are not, the outdated copy is replaced by the latest copy and its version is updated. After the sync, the open() proceeds in the usual manner and the application calling open() can continue with its access to the file.

Now suppose brick1 goes down, we delete foo.txt and then create a file with the same name again, i.e. foo.txt. When brick1 comes back up, there is clearly a chance that the version on brick1 is higher than the version on brick2. This is where the createtime extended attribute helps in deciding which copy is outdated. Hence we need to consider both createtime and version to decide on the latest copy.

The version attribute is incremented during the close() call. The version is not incremented if no write() was done. If the fd passed to close() was obtained from a create() call, we also create the createtime extended attribute.

Directory self-heal

Suppose brick1 goes down, we delete foo.txt, and brick1 comes back up. Now we should not create foo.txt on brick2; instead we should delete foo.txt on brick1. We handle this situation by keeping the createtime and version attributes on the directory, similar to the file. When lookup() is done on the directory, we compare the createtime/version attributes of the copies, work out which files need to be deleted, delete those files, and update the extended attributes of the outdated directory copy. Each time a directory is modified (a file or a subdirectory is created or deleted inside the directory) while one of the subvols is down, we increment the directory's version.
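
To illustrate the createtime/version comparison described above, here is a minimal Python sketch. It is not the actual AFR code. The names CopyAttrs, pick_latest and outdated_subvols are made up for this example. The ordering rule (the newer createtime wins, and version only breaks ties between copies with the same createtime) is an assumption drawn from the delete-and-recreate scenario, not a statement of the implementation.

```python
from typing import List, NamedTuple


class CopyAttrs(NamedTuple):
    """Hypothetical in-memory view of one subvolume's copy of a file."""
    subvol: str
    createtime: int  # value of trusted.afr.createtime (seconds since epoch)
    version: int     # value of trusted.afr.version


def pick_latest(copies: List[CopyAttrs]) -> CopyAttrs:
    # Assumption: a newer createtime means the file was re-created, so it
    # wins regardless of version; version decides only among equal createtimes.
    return max(copies, key=lambda c: (c.createtime, c.version))


def outdated_subvols(copies: List[CopyAttrs]) -> List[str]:
    """Return the subvolumes whose copy differs from the latest one."""
    latest = pick_latest(copies)
    return [
        c.subvol
        for c in copies
        if (c.createtime, c.version) != (latest.createtime, latest.version)
    ]


# Case 1: brick1 was down while foo.txt was edited on brick2.
edited = [CopyAttrs("brick1", 100, 3), CopyAttrs("brick2", 100, 4)]

# Case 2: brick1 was down while foo.txt was deleted and re-created on brick2.
# brick1 has the higher version, but its createtime is older.
recreated = [CopyAttrs("brick1", 100, 7), CopyAttrs("brick2", 200, 1)]

print(outdated_subvols(edited))
print(outdated_subvols(recreated))
```

In both cases the sketch marks brick1 as the copy to be healed. The second case shows why version alone is not enough.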
lookup() is a call initiated by the kernel on a file or directory just before any access to that file or directory. In glusterfs, by default, lookup() is not called again if it was already called on that particular file or directory within the past one second.

The extended attributes can be seen in the backend filesystem using the getfattr command:

  getfattr -n trusted.afr.version <file>

========