Re: Radosgw - bucket index

Sage Weil <sage@xxxxxxxxxxx> · Sun, 18 May 2014 16:05:11 -0700 (PDT)

On Sun, 18 May 2014, Guang wrote:
> >> radosgw is using the omap key/value API for objects, which is more or less
> >> equivalent to what swift is doing with sqlite.  This data passes straight
> >> into leveldb on the backend (or whatever other backend you are using).
> >> Using something like rocksdb in its place is pretty simple and there are
> >> unmerged patches to do that; the user would just need to adjust their
> >> crush map so that the rgw index pool is mapped to a different set of OSDs
> >> with the better k/v backend.
> Not sure if I am missing anything, but the key difference with Swift's 
> implementation is that they are using a table for the bucket index and it 
> can actually be updated in parallel, which makes it more scalable for 
> writes, though at a certain point the sql table would result in 
> performance degradation as well.

As I understand it the same limitation is present there too: the index is 
in a single sqlite table.

> > My more well-formed opinion is that we need to come up with a good
> > design. It needs to be flexible enough to be able to grow (and maybe
> > shrink), and I assume there would be some kind of background operation
> > that will enable that. I also believe that making it hash based is the
> > way to go. It looks like the more complicated issue here is
> > how to handle the transition in which we shard buckets.
> Yeah I agree. I think the conflicting goals here are, we want a sorted 
> list (so that it enables prefix scans for listing purposes) and we want to 
> shard at the very beginning (the problem we are facing is parallel 
> writes updating the same bucket index object will need to be 
> serialized).

Given how infrequent container listings are, pre-sharding containers 
across several objects makes some sense.  Doing listings in parallel 
across N shards (where N is not too big) is not a big price to pay.  
However, there will always need to be a way to re-shard further 
when containers/buckets get extremely big.  Perhaps a starting point would 
be support for static sharding where the number of shards is specified at 
container/bucket creation time...
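
To make the trade-off concrete, here is a minimal in-memory sketch of static
hash sharding with a merged, sorted listing. It is an illustration only, not
radosgw code: the class name ShardedBucketIndex, its methods, and the choice
of MD5 as the hash are all assumptions, and each Python list merely stands in
for one index object. Writes to different shards touch different objects, while
a listing does a prefix scan on every shard and merges the sorted results.

```python
import hashlib
import heapq
import itertools
from bisect import bisect_left


class ShardedBucketIndex:
    """Toy model of a bucket index split across a fixed number of shards.

    Hypothetical illustration: each shard is a sorted list standing in for
    one index object; the shard count is fixed at creation time.
    """

    def __init__(self, num_shards):
        if num_shards < 1:
            raise ValueError("num_shards must be >= 1")
        self.num_shards = num_shards
        self.shards = [[] for _ in range(num_shards)]

    def shard_for(self, key):
        # Use a stable hash (assumed: MD5) so placement does not depend on
        # Python's per-process randomized hash().
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self.num_shards

    def add(self, key):
        # Only the one shard owning this key is touched, so writers that
        # land on different shards would not serialize on a single object.
        shard = self.shards[self.shard_for(key)]
        i = bisect_left(shard, key)
        if i == len(shard) or shard[i] != key:
            shard.insert(i, key)

    @staticmethod
    def _scan(shard, prefix):
        # Keys sharing a prefix are contiguous in sorted order, starting at
        # the first key >= prefix.
        for key in itertools.islice(shard, bisect_left(shard, prefix), None):
            if not key.startswith(prefix):
                break
            yield key

    def list_keys(self, prefix="", limit=None):
        # Every shard must be consulted; an N-way merge restores the global
        # sort order that hashing destroyed.
        scans = [self._scan(shard, prefix) for shard in self.shards]
        return list(itertools.islice(heapq.merge(*scans), limit))
```

The cost of a listing grows with the number of shards, since each one has to
be scanned, which is why N should stay modest. Changing num_shards later
changes where every key hashes, so this sketch says nothing about how to
re-shard an existing bucket.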

sage