You wrote this insightful and pointed comment on IRC:

> Comparing with "every k/v service out there" assumes that you're
> growing a generic key/value service out of Chunk. You're essentially
> admitting it openly.
This is an excellent point to raise. So let me "begin at the
beginning", cover the chunkd design thought process, and hope to
explain how it all matches up.
Let us consider storage technology, at the level I'm used to: ATA,
SCSI, and nbd protocols.
For decades, storage has been a run of fixed-length records (sectors and
blocks), with the following API:
	key = offset + data length
	      <-- the "key" is the minimum amount of data
	          required to uniquely describe a run of data

	PUT key, data
	data = GET key
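In code, that sector-style API can be sketched as a flat byte array
addressed only by (offset, length). This is a toy in-memory model, not
any real ATA/SCSI/nbd implementation; the names are invented:

```c
#include <assert.h>
#include <string.h>

/* Toy "block device": a flat run of bytes, addressed purely by
 * offset + length, as in ATA, SCSI, and nbd. */
#define DEV_SIZE 4096
static unsigned char dev[DEV_SIZE];

/* PUT key, data -- the key is (offset, len) */
static int blk_put(size_t offset, const void *data, size_t len)
{
	if (offset + len > DEV_SIZE)
		return -1;
	memcpy(dev + offset, data, len);
	return 0;
}

/* data = GET key */
static int blk_get(size_t offset, void *data, size_t len)
{
	if (offset + len > DEV_SIZE)
		return -1;
	memcpy(data, dev + offset, len);
	return 0;
}
```

Note what is missing: nothing here keeps two applications' runs of
data apart. Carving up the offset space is entirely the application's
problem, which is exactly the pain point described below.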
Now the world has figured out that giving a storage device the
flexibility to manage data at per-object granularity simplifies
applications, and gives the underlying storage more ability to
optimize. Thus was born the object-based storage device (SCSI OSD),
with the API
	key = 64-bit object id

	PUT key, data, data length
	data, data length = GET key
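Reshaping the same toy model around the OSD API shows the difference:
the device now tracks each object's data and length itself, keyed by a
64-bit object id. Again a sketch with invented names, not real OSD
code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy object-based store: the key is a bare 64-bit object id; the
 * device, not the application, tracks each object's length. */
struct obj {
	uint64_t	id;
	unsigned char	*data;
	size_t		len;
	int		used;
};

#define MAX_OBJS 64
static struct obj objs[MAX_OBJS];

/* PUT key, data, data length */
static int osd_put(uint64_t id, const void *data, size_t len)
{
	struct obj *o = NULL;

	/* prefer an existing object with this id; else first free slot */
	for (int i = 0; i < MAX_OBJS; i++) {
		if (objs[i].used && objs[i].id == id) {
			o = &objs[i];
			break;
		}
		if (!objs[i].used && !o)
			o = &objs[i];
	}
	if (!o)
		return -1;

	free(o->data);
	o->data = malloc(len);
	if (!o->data)
		return -1;
	memcpy(o->data, data, len);
	o->id = id;
	o->len = len;
	o->used = 1;
	return 0;
}

/* data, data length = GET key; returns length, or -1 if not found */
static long osd_get(uint64_t id, void *buf, size_t maxlen)
{
	for (int i = 0; i < MAX_OBJS; i++) {
		if (objs[i].used && objs[i].id == id) {
			size_t n = objs[i].len < maxlen ?
				objs[i].len : maxlen;
			memcpy(buf, objs[i].data, n);
			return (long)n;
		}
	}
	return -1;
}
```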
A key design decision of Project Hail was to follow this object-based
storage model, when considering the two alternatives:
1) Build cloud apps on top of multiple block devices. My conclusion:
this is undesirable for the same reason sector-based storage is
undesirable: applications want finer granularity, and with
sector-based systems they must build their own filesystem-like data
structures just to keep their own objects separated from one another.
2) Build cloud apps on top of filesystems. I think(?) GlusterFS is
taking this route. This approach is workable, but may create a lot of
unnecessary overhead; in particular, filesystem protocols are much
more complicated than storage protocols.
Object-based storage devices sit in the middle: not as complex as
filesystems, but more useful than sector-based storage.
chunkd is thus designed to be a simple, straightforward, easy-to-use
replacement for SCSI OSD, which has already been proven useful in
distributed storage (Lustre, pNFS).
That is why chunkd originally used fixed-length hexadecimal keys: it
was modelled on the SCSI OSD object id. However, it quickly became
evident in practice that EVERY chunkd application would create its own
scheme to map internal_object_id to chunkd_object_id.
Thus, moving to generic key/value storage actually simplified
applications, by eliminating that mapping.
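The mapping layer that the old fixed-key scheme forced on every
application could look roughly like this (names and sizes invented for
illustration). With generic key/value storage, this entire table
disappears, because the application's own name is used directly as the
chunkd key:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The per-application bookkeeping that fixed-width object ids
 * forced into existence: internal name -> 64-bit chunkd object id. */
#define MAX_MAP 32

static struct {
	char		name[64];	/* internal_object_id */
	uint64_t	id;		/* chunkd_object_id */
} map[MAX_MAP];
static int map_n;
static uint64_t next_id = 1;

/* Look up -- or assign -- the object id for an application name. */
static uint64_t name_to_id(const char *name)
{
	for (int i = 0; i < map_n; i++)
		if (strcmp(map[i].name, name) == 0)
			return map[i].id;

	if (map_n == MAX_MAP)
		return 0;	/* table full */

	strncpy(map[map_n].name, name, sizeof(map[map_n].name) - 1);
	map[map_n].id = next_id++;
	return map[map_n++].id;
}
```

Every chunkd application was reinventing some variant of this, and
each variant had to be made durable and consistent with the store
itself; letting the key be an arbitrary string deletes the whole
layer.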
However, one glaring difference from SCSI OSD was chunkd's lack of
administrative partitions. SCSI OSDs provide "partitions" within each
logical unit (LUN), each of which contains a set of objects within a
single object id namespace. Therefore, if you consider the SCSI OSD
object id as the key, then SCSI OSD definitely has multiple key/value
tables.
As you pointed out on IRC, it is possible to create administrative
partitioning by running multiple chunkd instances.
But I think the Real World(tm) has shown that in-protocol partitioning
of object namespace is the way to go. Being able to create and destroy
partitions within the protocol, on-demand, has a lot of value.
So, just as SCSI OSD has
[ target + logical unit + ] partition + object
With chunkd we can have
[ host + port + ] table + object
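As an illustration of that tuple only -- the rendering below is
invented for this sketch, not chunkd's actual wire format -- a
fully-qualified chunkd address could be flattened like so:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Render the [ host + port + ] table + object tuple as one string.
 * Purely illustrative formatting; returns -1 on truncation. */
static int chunk_addr(char *buf, size_t buflen,
		      const char *host, int port,
		      const char *table, const char *object)
{
	int n = snprintf(buf, buflen, "%s:%d/%s/%s",
			 host, port, table, object);
	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```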
Amazon S3 has buckets. Pretty much every protocol in production tends
to have some sort of administrative separation ability.
Jeff