John Mark Walker asked me to re-post this here as well as on cloudfs-devel. Feedback is most welcome, but please be aware that some discussion has already occurred there. Here's the archive link for the early discussion: https://fedorahosted.org/pipermail/cloudfs-devel/2011-September/000148.html

= HekaFS Improved Replication =

== Background and Requirements ==

One of the most serious internal complaints about GlusterFS is performance for small synchronous requests when using its filesystem-level replication (AFR). This problem particularly afflicts virtual-machine-image and database workloads, reducing performance to about a third of what it "should" be (compared on a per-server basis to NFS on the same hardware). The fundamental problem is that the AFR approach to making writes crash-proof involves the following operations:

1. Lock on the primary (first) server
2. Record operation-pending state (using extended attributes) on all servers
3. Issue the write to all servers
4. As writes complete, update operation-pending state on the other servers
5. Unlock on the primary server

Even with some operations done in parallel, this requires a minimum of five network round trips to/from the primary server - possibly more, since step 4 might be repeated if there are more than two replicas. Even with pending changes to AFR, such as coalescing step-4 updates, AFR's per-request latency is likely to remain terrible.

Externally, users seem to focus on a different problem: the timeliness and observability of replica repair after a server has failed and been restored[1][2]. AFR was built on the assumption that on-demand repair of individual files or directories, as they're accessed, would be sufficient. The message from users ever since has been unequivocal: leaving an unknown number of unrepaired files vulnerable to a second failure for an indefinite period is unacceptable.
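Returning to the latency problem for a moment, the five-step AFR write sequence can be sketched as follows, counting network round trips. This is purely illustrative: the function names (lock, set_pending, and so on) are hypothetical stand-ins, not real GlusterFS calls.

```python
# Illustrative no-op stubs standing in for the real replicated RPCs.
def lock(server, req): pass
def unlock(server, req): pass
def set_pending(server, req): pass
def write(server, req): pass
def clear_pending_on_peers(server, req): pass

def afr_write(replicas, request):
    """Return the number of network round trips for one AFR-style write."""
    round_trips = 0

    # 1. Lock on the primary (first) server.
    lock(replicas[0], request)
    round_trips += 1

    # 2. Record operation-pending state (xattrs) on all servers, in parallel.
    for server in replicas:
        set_pending(server, request)
    round_trips += 1

    # 3. Issue the write to all servers, in parallel.
    for server in replicas:
        write(server, request)
    round_trips += 1

    # 4. As writes complete, clear pending state on the other servers;
    #    with more than two replicas this step may repeat.
    for server in replicas:
        clear_pending_on_peers(server, request)
    round_trips += 1

    # 5. Unlock on the primary server.
    unlock(replicas[0], request)
    round_trips += 1

    return round_trips  # a minimum of five, even with parallelism
```

Even treating each parallel fan-out as a single round trip, the count never drops below five, which is the source of the latency complaint above.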
These users require immediate repair with explicit notification of return to a fully protected state, but here they run into a second snag: the time required to do a full xattr scan of a multi-terabyte filesystem through a single node is also unacceptable. Patches were submitted almost a year ago[3] to implement precise recovery by maintaining a list of files that are partially written and might therefore require repair, but those have never been adopted. The recently introduced "proactive self heal" functionality is only slightly better. It is triggered automatically and runs inside one of the server daemons - avoiding many machine-to-machine and user-to-kernel round trips - but it's still single-threaded and drags all data through one server that might be neither source nor destination. Worse, if a second failure occurs while the lengthy repair process for a previous failure is still ongoing, a new repair cycle will be scheduled but might not even start for days while the previous repair scans millions of perfectly healthy files.

The primary requirements, therefore, are:

* Improve performance for small synchronous requests
* Provide efficient "minimal" replica repair with a positive indication of replica status

In addition to these requirements, compatibility with planned enhancements to distribution and wide-area replication would also be highly desirable.

== Proposed Solution ==

The origin of AFR's performance problems is that it requires extra operations (beyond the necessary N writes) in the non-failure case to ensure correct operation in the failure case. The basis of the proposed solution is therefore to be optimistic instead of pessimistic, expending minimal resources in the normal case and taking extra steps only after a failure. The basic write algorithm becomes:

1. Forward the write to all N replicas
2. If all N replicas indicate success, we're done
3. If any replica fails, add information about the failed request (e.g. file, offset, length) to journals on the replicas where it succeeded
4. As part of the startup process, defer completion of startup until brought up to date by replaying peers' journals

Because the process relies on a journal, there's no need to maintain a separate list of files in need of repair; journal contents can be examined at any time, and if they're empty (the normal case) that serves as a positive indication that the volume is in a fully protected state.

Doing repair as part of the startup process means that, if the failure is a network partition rather than a server failure[4], neither side will go through the startup process. Each server must therefore initiate repair upon being notified of another server coming up, as well as during startup. Journal entries are pushed rather than pulled, from the servers that have them to the newly booted or reconnected server. Each server must also be a client, both to receive peer-status notifications (which currently go only to clients) and to issue journal-related requests.

In the case of a network partition, a second problem also arises: split brain. Writes might continue to be received and entered into the journal on both sides of the partition. When journal entries are being propagated in both directions between two servers, establishing the correct combined order for writes that overlap would require additional information (e.g. version vectors) not currently present in the GlusterFS network protocol. This is a problem we will have to solve when we get to wide-area replication, but not right now. To keep things simpler in this release, we can instead enforce quorum, as has already been suggested[5] and implemented for AFR.

Although the description so far has mostly concentrated on writes, other modifications - e.g. create, symlink, setxattr - mostly work the same way. In the case of namespace operations followed by data operations - e.g. rename followed by write - ordinary care must be taken to ensure that the second operation is applied to the correct object. In the worst case, we might need to store UUIDs in the journal and use a UUID-to-path mapping maintained on each server (which would be useful for other reasons).

[1] "Experience with GlusterFS", http://www.devco.net/archives/2010/09/22/experience_with_glusterfs.php
[2] "Why GlusterFS is Glusterfsck'd Too", http://chip.typepad.com/weblog/2011/09/why-glusterfs-is-glusterfsckd-too.html
[3] http://bugs.gluster.com/show_bug.cgi?id=2088
[4] Yes, partitions do occur even in a network environment.
[5] http://bugs.gluster.com/show_bug.cgi?id=3533
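As a postscript, the optimistic write-plus-journal scheme proposed above can be sketched in a few lines. Everything here is hypothetical illustration - the Server class, its fields, and the function names are invented for this sketch and are not HekaFS or GlusterFS code.

```python
# Sketch of the optimistic algorithm: write everywhere, journal only on
# failure, and replay peers' journals at startup. Illustrative only.

class Server:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}       # (file, offset) -> payload
        self.journal = []    # entries recorded for failed peers

    def write(self, entry):
        if not self.up:
            return False
        file, offset, payload = entry
        self.data[(file, offset)] = payload
        return True

def replicated_write(replicas, entry):
    """Steps 1-3: forward the write to all N replicas; if any replica
    fails, journal the request (file, offset, length) on the replicas
    where it succeeded."""
    results = {server: server.write(entry) for server in replicas}
    failed = [server for server, ok in results.items() if not ok]
    if failed:
        for server in replicas:
            if results[server]:
                server.journal.append(entry)
    return not failed  # True only when all N replicas succeeded

def replay_journals(recovering, peers):
    """Step 4: on startup (or reconnection), the recovering server is
    brought up to date by peers *pushing* their journal entries to it;
    empty journals afterward are the positive 'fully protected' signal."""
    recovering.up = True
    for peer in peers:
        for entry in peer.journal:
            recovering.write(entry)
        peer.journal.clear()
```

In the normal case nothing beyond the N writes happens and all journals stay empty; only after a failure does the journal grow, and only the affected entries - not millions of healthy files - are scanned during repair.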