Re: A dumb question about Hail

Pete Zaitcev <zaitcev@xxxxxxxxxx> · Fri, 4 Dec 2009 10:10:11 -0700

On Fri, 4 Dec 2009 11:19:42 -0500
Jeff Garzik <jgarzik@xxxxxxxxxx> wrote:
> On Thu, Dec 03, 2009 at 11:24:00PM -0700, Pete Zaitcev wrote:

I'm adding hail-devel to cc, because I'm going to explain where
the scalability screw-up comes from. Greg, feel free to drop off
the cc when you reply.

> > I'm going to showcase a rather limited version (<= 10 nodes and
> > up to 1 million keys) by January. This is on par with what Eucalyptus
> > has _and_ has the data redundancy. So, someone could use it as
> > a replacement for a WebDAV server, I guess. So far, the best
> > market seems to be people who want to test their S3 applications
> > without setting up actual S3 accounts. That's about all it could do.

> Well, I think that is a degraded vision of what will be available.
> 
> tabled can already do high availability w/ failover of the front-end
> and database (ie metadata).  With your data replication patches,
> that gives object data high availability, too.
> 
> Nobody outside of Amazon themselves can claim that...  :)

The main issue is, we don't have a reverse index: there's no way
to know, given a node ID, what keys are affected by the node
going down. Therefore, in order to determine what keys have to
be re-replicated, we have to scan the whole database of keys.
Which is still not too bad if it can fit into RAM, but once
it grows bigger, it's a problem. So, to an extent you can trade
keys for nodes (and the total size, since we consider, say, 1TB
commodity disks per node). You can go with fatter nodes, too,
but that sort of defeats the purpose of cloud.
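To make the cost of the missing reverse index concrete, here is a minimal
Python sketch of the full scan described above. It is not tabled code: the
dict `key_db`, the function `keys_affected_by`, and the sample keys and NIDs
are all made up for illustration, and a real key database would live on disk
rather than in a dict.

```python
# Hypothetical in-memory stand-in for the key database:
# each key maps to the list of node IDs (NIDs) holding a replica.
key_db = {
    "bucket/a": [1, 2, 3],
    "bucket/b": [2, 4],
    "bucket/c": [1, 4, 5],
}


def keys_affected_by(failed_nid, db):
    """Return the keys that lost a replica when failed_nid went down.

    Without a reverse index there is no shortcut: every record in the
    database is visited, so the cost grows with the total number of
    keys, not with the number of keys on the failed node.
    """
    return [key for key, nids in db.items() if failed_nid in nids]


# Keys that would need re-replication if node 4 went down.
affected = keys_affected_by(4, key_db)
print(affected)
```

The scan is cheap while the whole database sits in RAM; once it does not,
each node failure turns into a pass over the on-disk data.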

The naive solution is, let's add a secondary index. I foresee a problem
with it: the index is going to be bigger than the database itself.
NIDs are very small, 4 bytes each, and each key has up to 3 of them.
So, a secondary index will push us out of RAM earlier, and I have
no idea what effects updates to it are going to have. It's something
to try once someone has a big enough deployment (say, 50 chunk nodes
and 200,000+ keys, or some other ratio).
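A rough Python sketch of what such a secondary index could look like, under
the assumption that it simply maps each NID to the set of keys stored on that
node. The class `ReverseIndex` and its methods are hypothetical names, not
anything in tabled.

```python
from collections import defaultdict


class ReverseIndex:
    """Hypothetical NID -> keys index, kept next to the key database."""

    def __init__(self):
        self.by_nid = defaultdict(set)

    def add(self, key, nids):
        # One index entry per replica: the key is referenced once
        # for every node that holds a copy of the object.
        for nid in nids:
            self.by_nid[nid].add(key)

    def remove(self, key, nids):
        for nid in nids:
            self.by_nid[nid].discard(key)

    def keys_on(self, nid):
        # Direct lookup, no scan of the whole key database.
        return set(self.by_nid.get(nid, ()))

    def entries(self):
        # Total number of (NID, key) pairs held by the index.
        return sum(len(keys) for keys in self.by_nid.values())


idx = ReverseIndex()
idx.add("bucket/a", [1, 2, 3])
idx.add("bucket/b", [2, 4])
print(idx.keys_on(2), idx.entries())
```

The lookup becomes direct, but the index carries one key reference per
replica, so with up to 3 NIDs per key it can hold up to three entries for
every record in the main database, and each of them has to be updated
whenever a replica is added or moved.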

My plan to tackle this was to split the OID<->NID database away
from the KEY-->OID database, and use a compact RAM-based database
for OID<->NID. Now you see why OIDs are small and why tabled does
not use keys as keys in Chunk.
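The split can be sketched in Python as follows. This is only an illustration
of the idea: the names `key_to_oid`, `oid_nids`, `nid_oids`, `store` and
`oids_on` are invented, the 64-bit OID width is an assumption (the text only
says OIDs are small), and the `array` module stands in for whatever compact
RAM-based database would actually be used.

```python
from array import array

# Stand-in for the on-disk KEY-->OID database (long, variable-size keys).
key_to_oid = {}

# Compact in-RAM OID<->NID mapping, fixed-size integers only.
oid_nids = {}   # OID -> array of NIDs (unsigned int, typically 4 bytes)
nid_oids = {}   # NID -> array of OIDs (unsigned 64-bit, an assumed width)


def store(key, oid, nids):
    """Record an object: the key goes to one database, placement to the other."""
    key_to_oid[key] = oid
    oid_nids[oid] = array("I", nids)
    for nid in nids:
        nid_oids.setdefault(nid, array("Q")).append(oid)


def oids_on(nid):
    """OIDs that need re-replication if nid fails; no keys are touched."""
    return list(nid_oids.get(nid, array("Q")))


store("bucket/a", 100, [1, 2, 3])
store("bucket/b", 101, [2, 4])
print(oids_on(2))
```

Because the reverse direction holds only small fixed-size OIDs instead of
full keys, it stays much smaller than an index keyed by object name would.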

-- Pete

P.S. Actually, I may be able to compress keys in RAM with radix encoding,
if applications use filesystem-like key structure. If they use something
like SHA256 for keys, it won't work.
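The P.S. does not say which encoding is meant, so the following Python sketch
swaps in a simple component-wise trie (a simplified relative of a radix
tree) just to show why filesystem-like keys compress and hash-like keys do
not. The functions `build_trie`, `stored_chars` and `raw_chars` and the
sample keys are hypothetical.

```python
def build_trie(keys):
    """Build a nested-dict trie with one level per '/'-separated component."""
    root = {}
    for key in keys:
        node = root
        for part in key.split("/"):
            node = node.setdefault(part, {})
        node[None] = True  # end-of-key marker
    return root


def stored_chars(node):
    """Count component characters stored in the trie (shared parts once)."""
    total = 0
    for part, child in node.items():
        if part is None:
            continue  # skip the end-of-key marker
        total += len(part) + stored_chars(child)
    return total


def raw_chars(keys):
    """Count component characters when every key is stored in full."""
    return sum(len(part) for key in keys for part in key.split("/"))


path_keys = ["photos/2009/a.jpg", "photos/2009/b.jpg", "photos/2008/c.jpg"]
hash_keys = ["9f86d081", "60303ae2", "fd61a03a"]  # stand-ins for SHA256 keys

print(raw_chars(path_keys), stored_chars(build_trie(path_keys)))
print(raw_chars(hash_keys), stored_chars(build_trie(hash_keys)))
```

Path-like keys share their leading components, which the trie stores once;
hash-like keys share nothing, so the trie stores exactly as many characters
as the raw keys.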