On Mon, 28 Jun 2010 08:37:51 -0400
Jeff Darcy <jdarcy@xxxxxxxxxx> wrote:

> First, it seems like trying to do stuff "under" BDB replication, letting
> them control the flow, is proving to be rather painful - over a thousand
> lines in metarep.c plus other bits elsewhere, all constrained by their
> expectations wrt blocking, dropping messages, etc. Might it not be
> simpler to handle the replication *above* BDB instead, shipping the
> operations ourselves to single-node BDB instances? Simpler still might
> be to let a general framework like Gizzard handle the N-way replication
> and failover, or switch to a data store that's designed from the ground
> up around that need.
[]

I thought of it a little and decided to try the base API approach
first, for a couple of reasons. First, I am ignorant of things like
Gizzard, so when I started imagining how the update forwarding and
leases would actually work, it started looking way longer than
1000 lines of C. Second, I am afraid that people will point and ask
"why didn't you use rep_start()". We already reap NIH critique with
Zookeeper. Now if I tried, found bugs in db4/BDB, and documented that,
it would be different and my conscience would be clear.

Getting all of the replication exposed in tabled is really tempting.
For one thing, if we do it, we can replace db4 with TC or anything
else. But it's just... too much. I don't have the balls to tackle it
now. Honestly I expected to finish it all in 1 week, but it actually
took 3+. The roll-my-own replication would take me forever (how about
6 months?). Do you want tabled working for you or always in progress?

> (a) The minor problem is that if the second (inner) check for
> "nrp->length < 3" fails, then we return directly - leaking *nrp.
> Perhaps we should jump to the ncld_read_free at the end instead.

Awww, that was silly. Thanks.

> (b) I'd also question whether checking nrp->length this way is necessary
> at all, since cldu_parse_master should fail in those cases anyway. Why
> not just rearrange the loop to catch such errors that way?

The idea was to special-case the "empty" so I can see a printout.
A syntax error is different - maybe a version mismatch. I even wanted
the would-be masters to try and truncate the MASTER file before trying
to lock it.

> (c) Lastly, regarding the comment about the gap between lock and write,
> I think single retry of only the read doesn't buy us much. [...]
> At the other end of the scale, it's also not hard
> to imagine a node managing to take the lock and then itself aborting
> before the write, again causing other nodes to fail. What should happen
> in this second case, I'd argue, is that CLD should eventually detect the
> failure and break the lock, which would allow another waiting node to
> take it.

Well, yeah... I guess I was too lazy and reluctant to create yet
another state machine for this. Maybe I should just bite the bullet
and make tabled fully multi-threaded. It was likely to come next
anyway since you complained about the abysmal performance (I do not
know yet what the issues with performance are, but threads are likely
to be involved). But if so, a thread may just as easily loop, as
the ncld API intends.

-- Pete