John Mark Walker asked me to re-post this here as well as on cloudfs-devel. Feedback is most welcome, but please be aware that some discussion has already occurred there. Here's the archive link for the early discussion: https://fedorahosted.org/pipermail/cloudfs-devel/2011-September/000148.html

= HekaFS Improved Replication =

== Background and Requirements ==

One of the most serious internal complaints about GlusterFS is performance for small synchronous requests when using its filesystem-level replication (AFR). This problem particularly afflicts virtual-machine-image and database workloads, reducing performance to about a third of what it "should" be (compared on a per-server basis to NFS on the same hardware). The fundamental problem is that the AFR approach to making writes crash-proof involves the following operations:

1. Lock on the primary (first) server
2. Record operation-pending state (using extended attributes) on all servers
3. Issue the write to all servers
4. As writes complete, update operation-pending state on the other servers
5. Unlock on the primary server

Even with some operations done in parallel, this requires a minimum of five network round trips to/from the primary server - possibly more, since step 4 might be repeated if there are more than two replicas. Even with pending changes to AFR, such as coalescing step-4 updates, AFR's per-request latency is likely to remain terrible.

Externally, users seem to focus on a different problem: the timeliness and observability of replica repair after a server has failed and been restored[1][2]. AFR was built on the assumption that on-demand repair of individual files or directories, as they're accessed, would be sufficient. The message from users ever since has been unequivocal: leaving an unknown number of unrepaired files vulnerable to a second failure for an indefinite period is unacceptable.
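Returning to the latency problem for a moment, the five-step AFR write sequence can be sketched as follows, counting network round trips. This is purely illustrative: the function names (lock, set_pending, and so on) are hypothetical stand-ins, not real GlusterFS calls.

```python
# Illustrative no-op stubs standing in for the real replicated RPCs.
def lock(server, req): pass
def unlock(server, req): pass
def set_pending(server, req): pass
def write(server, req): pass
def clear_pending_on_peers(server, req): pass

def afr_write(replicas, request):
    """Return the number of network round trips for one AFR-style write."""
    round_trips = 0

    # 1. Lock on the primary (first) server.
    lock(replicas[0], request)
    round_trips += 1

    # 2. Record operation-pending state (xattrs) on all servers, in parallel.
    for server in replicas:
        set_pending(server, request)
    round_trips += 1

    # 3. Issue the write to all servers, in parallel.
    for server in replicas:
        write(server, request)
    round_trips += 1

    # 4. As writes complete, clear pending state on the other servers;
    #    with more than two replicas this step may repeat.
    for server in replicas:
        clear_pending_on_peers(server, request)
    round_trips += 1

    # 5. Unlock on the primary server.
    unlock(replicas[0], request)
    round_trips += 1

    return round_trips  # a minimum of five, even with parallelism
```

Even treating each parallel fan-out as a single round trip, the count never drops below five, which is the source of the latency complaint above.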
These users require immediate repair with explicit notification of return to a fully protected state, but here they run into a second snag: the time required to do a full xattr scan of a multi-terabyte filesystem through a single node is also unacceptable. Patches were submitted almost a year ago[3] to implement precise recovery by maintaining a list of files that are partially written and might therefore require repair, but those have never been adopted. The recently introduced "proactive self heal" functionality is only slightly better. It is triggered automatically and runs inside one of the server daemons - avoiding many machine-to-machine and user-to-kernel round trips - but it's still single-threaded and drags all data through one server that might be neither source nor destination. Worse, if a second failure occurs while the lengthy repair process for a previous failure is still ongoing, a new repair cycle will be scheduled but might not even start for days while the previous repair scans millions of perfectly healthy files.

The primary requirements, therefore, are:

* Improve performance for small synchronous requests
* Provide efficient "minimal" replica repair with a positive indication of replica status

In addition to these requirements, compatibility with planned enhancements to distribution and wide-area replication would also be highly desirable.

== Proposed Solution ==

The origin of AFR's performance problems is that it requires extra operations (beyond the necessary N writes) in the non-failure case to ensure correct operation in the failure case. The basis of the proposed solution is therefore to be optimistic instead of pessimistic, expending minimal resources in the normal case and taking extra steps only after a failure. The basic write algorithm becomes:

1. Forward the write to all N replicas
2. If all N replicas indicate success, we're done
3. If any replica fails, add information about the failed request (e.g. file, offset, length) to journals on the replicas where it succeeded
4. As part of the startup process, defer completion of startup until brought up to date by replaying peers' journals

Because the process relies on a journal, there's no need to maintain a separate list of files in need of repair; journal contents can be examined at any time, and if they're empty (the normal case) that serves as a positive indication that the volume is in a fully protected state.

Doing repair as part of the startup process means that, if the failure is a network partition rather than a server failure[4], neither side will go through the startup process. Each server must therefore initiate repair upon being notified of another server coming up, as well as during startup. Journal entries are pushed rather than pulled, from the servers that have them to the newly booted or reconnected server. Each server must also be a client, both to receive peer-status notifications (which currently go only to clients) and to issue journal-related requests.

In the case of a network partition, a second problem also arises: split brain. Writes might continue to be received and entered into the journal on both sides of the partition. When journal entries are being propagated in both directions between two servers, establishing the correct combined order for writes that overlap would require additional information (e.g. version vectors) not currently present in the GlusterFS network protocol. This is a problem we will have to solve when we get to wide-area replication, but not right now. To keep things simpler in this release, we can instead enforce quorum, as has already been suggested[5] and implemented for AFR.

Although the description so far has mostly concentrated on writes, other modifications - e.g. create, symlink, setxattr - mostly work the same way. In the case of namespace operations followed by data operations - e.g. rename followed by write - ordinary care must be taken to ensure that the second operation is applied to the correct object. In the worst case, we might need to store UUIDs in the journal and use a UUID-to-path mapping maintained on each server (which would be useful for other reasons).

[1] "Experience with GlusterFS", http://www.devco.net/archives/2010/09/22/experience_with_glusterfs.php
[2] "Why GlusterFS is Glusterfsck'd Too", http://chip.typepad.com/weblog/2011/09/why-glusterfs-is-glusterfsckd-too.html
[3] http://bugs.gluster.com/show_bug.cgi?id=2088
[4] Yes, partitions do occur even in a network environment.
[5] http://bugs.gluster.com/show_bug.cgi?id=3533
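As a postscript, the optimistic write-plus-journal scheme proposed above can be sketched in a few lines. Everything here is hypothetical illustration - the Server class, its fields, and the function names are invented for this sketch and are not HekaFS or GlusterFS code.

```python
# Sketch of the optimistic algorithm: write everywhere, journal only on
# failure, and replay peers' journals at startup. Illustrative only.

class Server:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}       # (file, offset) -> payload
        self.journal = []    # entries recorded for failed peers

    def write(self, entry):
        if not self.up:
            return False
        file, offset, payload = entry
        self.data[(file, offset)] = payload
        return True

def replicated_write(replicas, entry):
    """Steps 1-3: forward the write to all N replicas; if any replica
    fails, journal the request (file, offset, length) on the replicas
    where it succeeded."""
    results = {server: server.write(entry) for server in replicas}
    failed = [server for server, ok in results.items() if not ok]
    if failed:
        for server in replicas:
            if results[server]:
                server.journal.append(entry)
    return not failed  # True only when all N replicas succeeded

def replay_journals(recovering, peers):
    """Step 4: on startup (or reconnection), the recovering server is
    brought up to date by peers *pushing* their journal entries to it;
    empty journals afterward are the positive 'fully protected' signal."""
    recovering.up = True
    for peer in peers:
        for entry in peer.journal:
            recovering.write(entry)
        peer.journal.clear()
```

In the normal case nothing beyond the N writes happens and all journals stay empty; only after a failure does the journal grow, and only the affected entries - not millions of healthy files - are scanned during repair.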