On 06/24/2010 08:31 PM, Pete Zaitcev wrote:
> I worked on fixing the metadata replication in tabled. There were some
> difficulties in the existing code; in particular, the aliasing between the
> hostname used to identify nodes and the hostname used in bind() for
> listening was impossible to work around in repmgr. In the end I gave
> up on repmgr and switched tabled to the "Base" API. So, the replication
> works now... for some values of "works", which is still progress.
>
> We essentially have a tabled that can really be considered replicated.
> Before, it was only data replication, which was great and all but
> useless against disk failures in the tabled's database. I think it's
> a major threshold for tabled.
>
> Unfortunately, the code is rather ugly. I tried to create a kind
> of optional replication layer, so that tdbadm could be built
> without it. Although I succeeded, the result is a hideous mess of
> methods and callbacks, functions with side effects, and a bunch
> of poorly laid out state machines. In places I cannot wrap my own
> head around what's going on without the help of pencil and paper.
>
> So, while it works, it's not ready to go in. Still, I'm going
> to throw it here in case I get hit by a bus, or if anyone wants
> an example of using db4 replication early.

Interesting stuff. I have two questions/issues.

First, it seems that trying to do the work "under" BDB replication, letting
it control the flow, is proving rather painful - over a thousand lines in
metarep.c plus other bits elsewhere, all constrained by its expectations
with respect to blocking, dropping messages, and so on. Might it not be
simpler to handle the replication *above* BDB instead, shipping the
operations ourselves to single-node BDB instances? Simpler still might be
to let a general framework like Gizzard handle the N-way replication and
failover, or to switch to a data store designed from the ground up around
that need. It's not as though we make very advanced use of BDB's
transaction or query functionality, which would otherwise tie us to its
data/operational model. Anyway, that's probably a whole different
discussion.

Second, I have a few concerns about the specific implementation and use of
cldu_get_master in the patch.

(a) The minor problem is that if the second (inner) check for
"nrp->length < 3" fails, we return directly, leaking *nrp. Perhaps we
should jump to the ncld_read_free at the end instead.

(b) I'd also question whether checking nrp->length this way is necessary
at all, since cldu_parse_master should fail in those cases anyway. Why not
just rearrange the loop so such errors are caught there?
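To make (a) and (b) concrete, here is a rough sketch of the shape I have
in mind. Apart from ncld_read_free() and cldu_parse_master(), everything -
the struct layout, the helper, the argument lists, the return convention -
is a guess for illustration rather than what the patch actually has:

    /*
     * Rough sketch only; the real cldu_get_master() in the patch differs.
     * struct ncld_read's fields, struct cld_session, and the
     * get_master_record() helper are assumptions for illustration.
     */
    struct cld_session;

    struct ncld_read {
        const void *ptr;
        long length;
    };

    void ncld_read_free(struct ncld_read *nrp);
    int cldu_parse_master(struct cld_session *cs, const void *buf, long len);
    struct ncld_read *get_master_record(struct cld_session *cs);

    static int cldu_get_master(struct cld_session *cs)
    {
        struct ncld_read *nrp;
        int rc = -1;

        nrp = get_master_record(cs);
        if (!nrp)
            return -1;

        /*
         * No explicit nrp->length check: a record that is too short
         * simply makes cldu_parse_master() fail, and every failure
         * takes the same exit path, so nrp is always freed.
         */
        if (cldu_parse_master(cs, nrp->ptr, nrp->length))
            goto out;

        rc = 0;
    out:
        ncld_read_free(nrp);
        return rc;
    }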
(c) Lastly, regarding the comment about the gap between lock and write: I
don't think a single retry of only the read buys us much. Instead,
retrying the entire trylock/read sequence multiple times would seem like
the way to go. It's not too hard to imagine a node taking the lock and
then taking more than two seconds to do the write, e.g. due to a transient
network glitch, in which case another node that only waited two seconds
would abort. At the other end of the scale, it's also not hard to imagine
a node managing to take the lock and then itself aborting before the
write, again causing other nodes to fail. What should happen in this
second case, I'd argue, is that CLD should eventually detect the failure
and break the lock, allowing another waiting node to take it.

While I hesitate to suggest making any such loop truly infinite, if its
total duration is greater than the CLD session-failure time, that should
assure correct behavior in the cases of interest. It's also still possible
that two or more nodes could, in succession, take the lock and then fail
before writing, so making the total duration N times the session timeout
might make even more sense.
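For (c), the shape I'm imagining is roughly the following. The helper
names, the session type, and the timing constants are all invented for
illustration; only the overall trylock/read/write sequence and the
reasoning about the session timeout come from the discussion above:

    #include <unistd.h>

    struct cld_session;

    /* All three helpers are hypothetical stand-ins for whatever the
     * patch actually uses: take the CLD master lock, write our address
     * as the master record, and read+parse the current master record,
     * each returning 0 on success. */
    int try_take_master_lock(struct cld_session *cs);
    int write_master_record(struct cld_session *cs);
    int read_and_parse_master(struct cld_session *cs);

    /* Placeholder values; the point is that MAX_TRIES * RETRY_SEC should
     * exceed the CLD session-failure time (or N times it), so a holder
     * that died between locking and writing gets its lock broken by CLD
     * before we give up. */
    #define RETRY_SEC   2
    #define MAX_TRIES   60

    static int become_or_find_master(struct cld_session *cs)
    {
        int i;

        for (i = 0; i < MAX_TRIES; i++) {
            if (try_take_master_lock(cs) == 0) {
                /* We hold the lock: publish ourselves as master. */
                return write_master_record(cs);
            }

            /* Lock is held by someone else; if they have already
             * written the master record, just use it. */
            if (read_and_parse_master(cs) == 0)
                return 0;

            /* Lock taken but record not written yet: the holder is
             * either slow or dead.  If dead, CLD will eventually break
             * the lock, so wait and retry the whole sequence. */
            sleep(RETRY_SEC);
        }
        return -1;
    }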