You wrote this insightful and pointed comment on IRC:

> Comparing with "every k/v service out there" assumes that you're
> growing a generic key/value service out of Chunk. You're essentially
> admitting it openly.
This is an excellent point to raise. So let me "begin at the
beginning", cover the chunkd design thought process, and hope to
explain how it all matches up.
Let us consider storage technology, at the level I'm used to: ATA,
SCSI, and nbd protocols.
For decades, storage has been a run of fixed-length records (sectors and
blocks), with the following API:
	key = offset + data length
	      <-- the "key" is the minimum amount of data
	          required to uniquely describe a run of data

	PUT key, data
	data = GET key
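In code, that sector-style API can be sketched as a flat byte array
addressed only by (offset, length). This is a toy in-memory model, not
any real ATA/SCSI/nbd implementation; the names are invented:

```c
#include <assert.h>
#include <string.h>

/* Toy "block device": a flat run of bytes, addressed purely by
 * offset + length, as in ATA, SCSI, and nbd. */
#define DEV_SIZE 4096
static unsigned char dev[DEV_SIZE];

/* PUT key, data -- the key is (offset, len) */
static int blk_put(size_t offset, const void *data, size_t len)
{
	if (offset + len > DEV_SIZE)
		return -1;
	memcpy(dev + offset, data, len);
	return 0;
}

/* data = GET key */
static int blk_get(size_t offset, void *data, size_t len)
{
	if (offset + len > DEV_SIZE)
		return -1;
	memcpy(data, dev + offset, len);
	return 0;
}
```

Note what is missing: nothing here keeps two applications' runs of
data apart. Carving up the offset space is entirely the application's
problem, which is exactly the pain point described below.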
Now the world has figured out that giving a storage device the
flexibility to manage data at per-object granularity simplifies
applications, and gives the underlying storage more ability to
optimize. Thus was born the object-based storage device (SCSI OSD),
with the API
	key = 64-bit object id

	PUT key, data, data length
	data, data length = GET key
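Reshaping the same toy model around the OSD API shows the difference:
the device now tracks each object's data and length itself, keyed by a
64-bit object id. Again a sketch with invented names, not real OSD
code:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Toy object-based store: the key is a bare 64-bit object id; the
 * device, not the application, tracks each object's length. */
struct obj {
	uint64_t	id;
	unsigned char	*data;
	size_t		len;
	int		used;
};

#define MAX_OBJS 64
static struct obj objs[MAX_OBJS];

/* PUT key, data, data length */
static int osd_put(uint64_t id, const void *data, size_t len)
{
	struct obj *o = NULL;

	/* prefer an existing object with this id; else first free slot */
	for (int i = 0; i < MAX_OBJS; i++) {
		if (objs[i].used && objs[i].id == id) {
			o = &objs[i];
			break;
		}
		if (!objs[i].used && !o)
			o = &objs[i];
	}
	if (!o)
		return -1;

	free(o->data);
	o->data = malloc(len);
	if (!o->data)
		return -1;
	memcpy(o->data, data, len);
	o->id = id;
	o->len = len;
	o->used = 1;
	return 0;
}

/* data, data length = GET key; returns length, or -1 if not found */
static long osd_get(uint64_t id, void *buf, size_t maxlen)
{
	for (int i = 0; i < MAX_OBJS; i++) {
		if (objs[i].used && objs[i].id == id) {
			size_t n = objs[i].len < maxlen ?
				objs[i].len : maxlen;
			memcpy(buf, objs[i].data, n);
			return (long)n;
		}
	}
	return -1;
}
```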
A key design decision of Project Hail was to follow this object-based
storage model, when considering the two alternatives:
1) Build cloud apps on top of multiple block devices. My conclusion:
this is undesirable for the same reason sector-based storage is
undesirable: applications want finer granularity, and with
sector-based systems they must build their own filesystem-like data
structures just to keep their own objects separated from one another.
2) Build cloud apps on top of filesystems. I think(?) GlusterFS is
taking this route. This approach is workable, but may create a lot of
unnecessary overhead; in particular, filesystem protocols are much
more complicated than storage protocols.
Object-based storage devices sit in the middle: not as complex as
filesystems, but more useful than sector-based storage.
chunkd is thus designed to be a simple, straightforward, easy-to-use
replacement for SCSI OSD, which has already been proven useful in
distributed storage (Lustre, pNFS).
That is why chunkd originally used fixed-length hexadecimal keys: it
was modelled on the SCSI OSD object id. However, it quickly became
evident in practice that EVERY chunkd application would create its own
scheme to map internal_object_id to chunkd_object_id.
Thus, moving to generic key/value storage actually simplified
applications, by eliminating that mapping.
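The mapping layer that the old fixed-key scheme forced on every
application could look roughly like this (names and sizes invented for
illustration). With generic key/value storage, this entire table
disappears, because the application's own name is used directly as the
chunkd key:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* The per-application bookkeeping that fixed-width object ids
 * forced into existence: internal name -> 64-bit chunkd object id. */
#define MAX_MAP 32

static struct {
	char		name[64];	/* internal_object_id */
	uint64_t	id;		/* chunkd_object_id */
} map[MAX_MAP];
static int map_n;
static uint64_t next_id = 1;

/* Look up -- or assign -- the object id for an application name. */
static uint64_t name_to_id(const char *name)
{
	for (int i = 0; i < map_n; i++)
		if (strcmp(map[i].name, name) == 0)
			return map[i].id;

	if (map_n == MAX_MAP)
		return 0;	/* table full */

	strncpy(map[map_n].name, name, sizeof(map[map_n].name) - 1);
	map[map_n].id = next_id++;
	return map[map_n++].id;
}
```

Every chunkd application was reinventing some variant of this, and
each variant had to be made durable and consistent with the store
itself; letting the key be an arbitrary string deletes the whole
layer.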
However, one glaring difference from SCSI OSD was chunkd's lack of
administrative partitions. SCSI OSDs provide "partitions" within each
logical unit (LUN), each of which contains a set of objects within a
single object id namespace. Therefore, if you consider the SCSI OSD
object id as the key, then SCSI OSD definitely has multiple key/value
tables.
As you pointed out on IRC, it is possible to create administrative
partitioning by running multiple chunkd instances.
But I think the Real World(tm) has shown that in-protocol partitioning
of object namespace is the way to go. Being able to create and destroy
partitions within the protocol, on-demand, has a lot of value.
So, just as SCSI OSD has
[ target + logical unit + ] partition + object
With chunkd we can have
[ host + port + ] table + object
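As an illustration of that tuple only -- the rendering below is
invented for this sketch, not chunkd's actual wire format -- a
fully-qualified chunkd address could be flattened like so:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Render the [ host + port + ] table + object tuple as one string.
 * Purely illustrative formatting; returns -1 on truncation. */
static int chunk_addr(char *buf, size_t buflen,
		      const char *host, int port,
		      const char *table, const char *object)
{
	int n = snprintf(buf, buflen, "%s:%d/%s/%s",
			 host, port, table, object);
	return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```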
Amazon S3 has buckets. Pretty much every protocol in production tends
to have some sort of administrative separation ability.
Jeff