chunkd design genesis, storage tech, and support for multiple key/value tables

You wrote this insightful and pointed comment on IRC...
Comparing with "every k/v service out there" assumes that you're
growing a generic key/value service out of Chunk. You're essentially
admitting it openly.


This is an excellent point to raise. So let me "begin at the beginning", cover the chunkd design thought process, and hopefully explain how this matches up.


Let us consider storage technology, at the level I'm used to: ATA, SCSI, and nbd protocols.

For decades, storage has been a run of fixed-length records (sectors and blocks), with the following API:

	key = offset + data length
		<-- "key" is minimum amount of data required to
		    uniquely describe a run of data
	PUT key, data
	data = GET key
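
A rough sketch of that model in Python, purely for illustration. This is a toy in-memory device, not any real ATA, SCSI, or nbd interface; the 512-byte sector size and the names BlockDevice, put, and get are my assumptions.

```python
SECTOR_SIZE = 512  # assumed sector size, for illustration only


class BlockDevice:
    """Toy in-memory model of sector-addressed storage."""

    def __init__(self, num_sectors):
        self.data = bytearray(num_sectors * SECTOR_SIZE)

    def _check(self, offset, length):
        # The "key" is (offset, length); both must be whole sectors
        # and the run must lie inside the device.
        if offset < 0 or offset % SECTOR_SIZE or length % SECTOR_SIZE:
            raise ValueError("offset and length must be sector multiples")
        if offset + length > len(self.data):
            raise ValueError("run extends past end of device")

    def put(self, offset, payload):
        self._check(offset, len(payload))
        self.data[offset:offset + len(payload)] = payload

    def get(self, offset, length):
        self._check(offset, length)
        return bytes(self.data[offset:offset + length])
```

Note that nothing here knows where one application object ends and the next begins; that bookkeeping is left entirely to the caller.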

Now the world has figured out that giving a storage device the flexibility to manage data on a per-object basis simplifies applications, and gives the underlying storage more room to optimize. Thus was born the object-based storage device (SCSI OSD), with the API

	key = 64-bit object id
	PUT key, data, data length
	data, data length = GET key
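
The same kind of sketch for the object model, again a toy in-memory stand-in rather than the real SCSI OSD command set. The class name ObjectDevice and its methods are assumptions made for illustration.

```python
class ObjectDevice:
    """Toy in-memory model of object-addressed storage."""

    def __init__(self):
        self.objects = {}

    def put(self, object_id, data):
        # The key is a 64-bit object id; the device, not the caller,
        # decides where and how the bytes are laid out.
        if not 0 <= object_id < 2 ** 64:
            raise ValueError("object id must fit in 64 bits")
        self.objects[object_id] = bytes(data)

    def get(self, object_id):
        # Returns the data together with its length, as in the API above.
        data = self.objects[object_id]
        return data, len(data)
```

Objects can be any length, and the caller no longer has to carve up a flat sector range by hand.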

A key design decision of Project Hail was to follow this object-based storage model, when considering the two alternatives:

1) Build cloud apps on top of multiple block devices. My conclusion: this is undesirable for the same reason that sector-based storage is undesirable: applications want more granularity, and with sector-based systems, must build their own filesystem-like data structures just to keep their own objects separated from one another.

2) Build cloud apps on top of filesystems. I think(?) GlusterFS is taking this route. This approach is workable, but may create a lot of unnecessary overhead. In particular, filesystem protocols are much more complicated than storage protocols.

Object-based storage devices sit in the middle: not as complex as filesystems, but more useful than sector-based storage.

chunkd is thus designed to be a simple, straightforward, easy-to-use replacement for SCSI OSD, which has already been proven useful in distributed storage (Lustre, pNFS).

That is why chunkd originally used fixed-length hexadecimal keys: It was modelled on the SCSI OSD object id. However, it quickly became evident in practice that EVERY chunkd application would create its own scheme to map internal_object_id to chunkd_object_id.

Thus, moving to generic key/value storage actually simplified applications, by eliminating that mapping.
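
To make the difference concrete, here is a small Python sketch of the two styles. It is not chunkd client code: the 16-digit key width, the SHA-256 mapping, and the function names are all hypothetical, picked only to show the kind of per-application mapping layer that goes away.

```python
import hashlib

HEX_DIGITS = 16  # assumed key width, for illustration only


def fixed_hex_key(internal_id):
    """Old style: squeeze the application's name into a fixed-length hex id."""
    digest = hashlib.sha256(internal_id.encode("utf-8")).hexdigest()
    return digest[:HEX_DIGITS]


def generic_key(internal_id):
    """New style: the application's own identifier is the key."""
    return internal_id.encode("utf-8")


name = "users/42/avatar.png"

old_store = {fixed_hex_key(name): b"image bytes"}
new_store = {generic_key(name): b"image bytes"}
```

With the fixed-length scheme the application also has to remember which name produced which id if it ever wants to list or audit its objects; with generic keys the stored key already is the name.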

However, one glaring difference from SCSI OSD was chunkd's lack of administrative partitions. SCSI OSDs provide "partitions" within each logical unit (LUN), each of which contains a set of objects within a single object id namespace. Therefore, if you consider SCSI OSD object id as the key, then SCSI OSD definitely has multiple key/value tables.

As you pointed out on IRC, it is possible to create administrative partitioning by running multiple chunkd instances.

But I think the Real World(tm) has shown that in-protocol partitioning of object namespace is the way to go. Being able to create and destroy partitions within the protocol, on-demand, has a lot of value.

So, just as SCSI OSD has

	[ target + logical unit + ] partition + object

with chunkd we can have

	[ host + port + ] table + object

Amazon S3 has buckets. Pretty much every protocol in production tends to have some sort of administrative separation ability.
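
A minimal Python sketch of what "table + object" addressing with in-protocol create and destroy could look like. This is an in-memory toy, not the chunkd wire protocol or its client library; TableStore and its method names are assumptions.

```python
class TableStore:
    """Toy model of one server holding several independent key/value tables."""

    def __init__(self):
        self.tables = {}

    def create_table(self, table):
        # Tables are created on demand, without starting another server.
        if table in self.tables:
            raise KeyError("table already exists: %r" % (table,))
        self.tables[table] = {}

    def destroy_table(self, table):
        # Dropping a table removes every object in its namespace at once.
        del self.tables[table]

    def put(self, table, key, data):
        self.tables[table][key] = bytes(data)

    def get(self, table, key):
        return self.tables[table][key]
```

The same key can live in two tables without colliding, and destroying one table leaves the other untouched, which is the administrative separation that partitions and buckets provide.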

	Jeff



