Hello developers,
I would like to present some ideas we are working on for
a new kind of translator that should unify and, to some
extent, simplify the healing procedures of complex
translators.
Currently, the only translator with complex healing
capabilities that we are aware of is AFR. We are developing
another translator that will also need healing capabilities,
so we thought that it would be interesting to create a new
translator able to handle the common part of the healing
process and hence to simplify and avoid duplicated code in
other translators.
The basic idea of the new translator is to handle healing
tasks nearer the storage translator on the server nodes
instead of controlling everything from a translator on the
client nodes. Of course, the heal translator is not able to
handle healing entirely by itself; it needs a client
translator to coordinate all the tasks. The heal
translator is intended to be used by translators that work
with multiple subvolumes.
I will try to explain how it works without going into too
much detail.
There is an important requirement for all client translators
that use healing: they must have exactly the same list of
subvolumes and in the same order. Currently, I think this is
not a problem.
The heal translator treats each file as an independent
entity, and each one can be in one of 3 modes:
1. Normal mode
This is the normal mode for a copy or fragment
of a file when it is synchronized and consistent with the
same file on other nodes (for example, with other replicas).
It is the client translator that decides whether it is
synchronized or not.
2. Healing mode
This is the mode used when a client detects an
inconsistency in the copy or fragment of the file stored
on this node and initiates the healing procedures.
3. Provider mode (I don't like this name very much, though)
This is the mode used by client translators when
an inconsistency is detected in this file, but the copy or
fragment stored in this node is considered good and it
will be used as a source to repair the contents of this
file on other nodes.
Initially, when a file is created, it is set in normal mode.
Client translators that make changes must guarantee that
they send the modification requests in the same order to all
the servers. This should be done using inodelk/entrylk.
When a change is sent to a server, the client must include a
bitmap mask of the clients to which the request is being
sent. Normally this is a bitmap containing all the clients;
however, when a server fails for some reason, some bits will
be cleared. The heal translator uses this bitmap to detect
failures on other nodes early, from the point of view of
each client. When this condition is detected, the request is
aborted with an error and the client is notified of the
remaining list of valid nodes. If the client considers that
the request can be successfully served with the remaining
list of nodes, it can resend the request with the updated
bitmap.
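
As an illustration, here is a minimal Python sketch of one
possible reading of the bitmap check. The class and method
names are hypothetical, and I am assuming that bit i of the
mask stands for the i-th subvolume and that a server simply
remembers every failure reported so far by any client.

```python
class NodeMaskTracker:
    """Per-file view, on one server, of which nodes are still valid.

    Hypothetical helper: bit i of a mask stands for subvolume i.
    """

    def __init__(self, node_count):
        # Initially every node is considered valid (all bits set).
        self.valid_mask = (1 << node_count) - 1

    def check_request(self, request_mask):
        """Return (accepted, valid_mask) for a change request.

        A cleared bit in request_mask means that the sending client
        considers that node to have failed.
        """
        # Remember every failure reported so far by any client.
        self.valid_mask &= request_mask
        if request_mask != self.valid_mask:
            # The sender still counts on a node that another client
            # has already seen fail: abort and report the valid nodes.
            return False, self.valid_mask
        return True, self.valid_mask


tracker = NodeMaskTracker(3)
first = tracker.check_request(0b111)   # client A, all nodes up
second = tracker.check_request(0b011)  # client A lost node 2
third = tracker.check_request(0b111)   # client B has not noticed yet
retry = tracker.check_request(0b011)   # client B resends
```

With this reading, the third request is the one that gets
aborted, and client B learns the remaining valid nodes from
the reply before it decides whether to resend.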
The heal translator also updates two file attributes for
each change request to maintain the "version" of the data
and metadata contents of the file. A similar task is
currently done by AFR using xattrop. This would not be
needed anymore, speeding up write requests.
The version of data and metadata is returned to the client
for each read request, allowing it to detect inconsistent
data.
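
To make the versioning idea more concrete, here is a small
Python sketch. It is only a model: the class, the counters
and the function name are hypothetical, and I am assuming
that each version is a simple counter incremented once per
change.

```python
class FileCopy:
    """Model of one copy (or fragment) of a file on a server."""

    def __init__(self):
        self.data = bytearray()
        self.data_version = 0
        self.metadata_version = 0

    def write(self, offset, payload):
        end = offset + len(payload)
        if len(self.data) < end:
            # Extend the file with zeros up to the end of the write.
            self.data.extend(bytes(end - len(self.data)))
        self.data[offset:end] = payload
        # Every data change bumps the data version.
        self.data_version += 1

    def read(self, offset, size):
        # Each read reply carries the versions with the contents.
        versions = (self.data_version, self.metadata_version)
        return bytes(self.data[offset:offset + size]), versions


def versions_agree(replies):
    """True if all read replies report the same versions."""
    return len({versions for _, versions in replies}) == 1


a, b = FileCopy(), FileCopy()
a.write(0, b"abc")
b.write(0, b"abc")
in_sync = versions_agree([a.read(0, 3), b.read(0, 3)])
a.write(0, b"x")  # this write never reached b
out_of_sync = versions_agree([a.read(0, 3), b.read(0, 3)])
```

Only the versions are compared, not the contents, so the
same check works when each node stores a different fragment
of the file.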
When a client detects an inconsistency, it initiates
healing. First of all, it must lock the entry and inode
(when necessary). Then, from the data collected from each
node, it must decide which nodes have good data and which
ones have bad data and hence need to be healed. There are
two possible cases:
1. File is not a regular file
In this case the reconstruction is very fast and
requires few requests, so it is done while the file is
locked. In this case, the heal translator does nothing
relevant.
2. File is a regular file
For regular files, the first step is to
synchronize the metadata to the bad nodes, including the
version information. Once this is done, the file is set in
healing mode on bad nodes, and provider mode on good
nodes. Then the entry and inode are unlocked.
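
The decision about good and bad nodes belongs to the client
translator, so the following Python sketch only shows one
hypothetical policy: the nodes with the highest (data,
metadata) version pair are considered good. The function
name and the mode strings are made up for the example.

```python
def plan_heal(versions):
    """Map each node to the mode it should be set to.

    versions: dict of node name -> (data_version, metadata_version).
    Assumed policy (not part of the proposal): the highest version
    pair wins, comparing the data version first.
    """
    newest = max(versions.values())
    modes = {}
    for node, version in versions.items():
        # Good nodes become providers, bad nodes are healed.
        modes[node] = "provider" if version == newest else "healing"
    return modes


plan = plan_heal({"node-a": (5, 2), "node-b": (5, 2), "node-c": (3, 2)})
```

A real client translator would probably need a more careful
policy, for example to deal with split brain, before
setting the modes and releasing the locks.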
When a file is in provider mode, it works as in normal mode,
but refuses to start another healing. Only one client can be
healing a file.
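
A minimal Python sketch of the mode handling on one server
could look like this. All names are hypothetical, and the
way a heal is finished is my own assumption, since it is
not described here.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = 1
    HEALING = 2
    PROVIDER = 3


class HealInProgress(Exception):
    """Raised when a second healing is requested for a file."""


class FileHealState:
    def __init__(self):
        # Files are created in normal mode.
        self.mode = Mode.NORMAL
        self.healer = None

    def start_heal(self, client_id, good_copy):
        # A file already in healing or provider mode refuses to
        # start another healing: only one client heals a file.
        if self.mode is not Mode.NORMAL:
            raise HealInProgress(self.healer)
        self.mode = Mode.PROVIDER if good_copy else Mode.HEALING
        self.healer = client_id

    def finish_heal(self, client_id):
        # Assumption: only the healing client can end the heal.
        if client_id != self.healer:
            raise PermissionError("not the healing client")
        self.mode = Mode.NORMAL
        self.healer = None


state = FileHealState()
state.start_heal("client-1", good_copy=True)
try:
    state.start_heal("client-2", good_copy=True)
    second_heal_refused = False
except HealInProgress:
    second_heal_refused = True
```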
When a file is in healing mode, each normal write request
from any client is handled as if the file were in normal
mode, updating the version information and detecting
possible inconsistencies with the bitmap. Additionally, the
healing translator marks the written region of the file as
"good".
Each write request from the healing client intended to
repair the file must be marked with a special flag. In this
case, the area to be written is filtered by the list of
"good" ranges (if there is any intersection with a good
range, it is removed from the request). The resulting
set of ranges is propagated to the lower translator and
added to the list of "good" ranges, but the version
information is not updated.
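
The filtering of a healing write can be sketched in Python
as follows. This is only an illustration: the function
names are hypothetical, and I am assuming that the "good"
list is kept as sorted, non-overlapping half-open ranges
(start, end).

```python
def subtract_ranges(start, end, good):
    """Return the parts of [start, end) not covered by good ranges."""
    pending = []
    cursor = start
    for g_start, g_end in good:
        if g_end <= cursor:
            continue  # good range lies entirely before the cursor
        if g_start >= end:
            break     # good range lies entirely after the request
        if g_start > cursor:
            pending.append((cursor, g_start))
        cursor = max(cursor, g_end)
        if cursor >= end:
            break
    if cursor < end:
        pending.append((cursor, end))
    return pending


def add_range(good, start, end):
    """Add [start, end) to good, merging touching ranges."""
    merged = []
    for g_start, g_end in good:
        if g_end < start or g_start > end:
            merged.append((g_start, g_end))
        else:
            start, end = min(start, g_start), max(end, g_end)
    merged.append((start, end))
    merged.sort()
    return merged


def heal_write(good, start, end):
    """Return (ranges to pass down, updated good list)."""
    to_write = subtract_ranges(start, end, good)
    for w_start, w_end in to_write:
        good = add_range(good, w_start, w_end)
    return to_write, good


# A normal write already marked bytes 10..19 as good.
to_write, good = heal_write([(10, 20)], 0, 30)
```

In this example only the ranges (0, 10) and (20, 30) reach
the lower translator, so the healing client never
overwrites data that a normal write has just updated.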
Read requests are only served if the requested range is
entirely contained in the list of "good" regions.
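
The read check can be sketched in a few lines of Python.
The function name is hypothetical, and I am assuming that
the "good" list holds half-open ranges (start, end) that
are already merged, so that touching ranges appear as a
single entry.

```python
def can_serve_read(good, offset, size):
    """True if [offset, offset + size) lies inside one good range.

    Assumes good contains merged, non-overlapping ranges, so a
    fully good request can never span two separate entries.
    """
    end = offset + size
    for g_start, g_end in good:
        if g_start <= offset and end <= g_end:
            return True
    return False


good = [(0, 10), (20, 30)]
inside = can_serve_read(good, 2, 5)     # bytes 2..6
crossing = can_serve_read(good, 8, 5)   # bytes 8..12 leave the range
exact = can_serve_read(good, 20, 10)    # bytes 20..29
```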
There are some additional details, but I think this is
enough to have a general idea of its purpose and how it
works.
The main advantages of this translator are:
1. Avoid duplicated code in client translators
2. Simplify and unify healing methods in client translators
3. xattrop is not needed anymore in client translators to
keep track of changes
4. Full file contents are repaired without locking the file
5. Better detection and prevention of some split brain
situations as soon as possible
I think it would be very useful. It seems to me that it
works correctly in all situations; however, I don't have all
the experience that other developers have with the healing
functions of AFR, so I will be happy to answer any questions
and to hear any suggestions to solve problems it may have or
to improve it.
What do you think about it?