Re: Radosgw - bucket index

Guang Yang <yguang11@xxxxxxxxxxx> · Mon, 19 May 2014 14:18:06 +0800

On May 19, 2014, at 7:05 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:

> On Sun, 18 May 2014, Guang wrote:
>>>> radosgw is using the omap key/value API for objects, which is more or less
>>>> equivalent to what swift is doing with sqlite.  This data passes straight
>>>> into leveldb on the backend (or whatever other backend you are using).
>>>> Using something like rocksdb in its place is pretty simple and there are
>>>> unmerged patches to do that; the user would just need to adjust their
>>>> crush map so that the rgw index pool is mapped to a different set of OSDs
>>>> with the better k/v backend.
>> Not sure if I am missing anything, but the key difference from Swift's 
>> implementation is that they use a table for the bucket index, which 
>> can be updated in parallel, making it more scalable for writes, 
>> though at a certain point the SQL table would suffer performance 
>> degradation as well.
> 
> As I understand it the same limitation is present there too: the index is 
> in a single sqlite table.
> 
>>> My more well-formed opinion is that we need to come up with a good
>>> design. It needs to be flexible enough to be able to grow (and maybe
>>> shrink), and I assume there would be some kind of background operation
>>> that will enable that. I also believe that making it hash based is the
>>> way to go. It looks like the more complicated issue here is
>>> how to handle the transition in which we shard buckets.
>> Yeah, I agree. I think the conflicting goals here are that we want a 
>> sorted list (so that it enables prefix scans for listing) and we want 
>> to shard from the very beginning (the problem we are facing is that 
>> parallel writes updating the same bucket index object need to be 
>> serialized).
> 
> Given how infrequent container listings are, pre-sharding containers 
> across several objects makes some sense.  Paying the cost of doing 
> listings in parallel across N (where N is not too big) is not a big price 
> to pay. However, there will always need to be a way to re-shard further 
> when containers/buckets get extremely big.  Perhaps a starting point would 
> be support for static sharding where the number of shards is specified at 
> container/bucket creation time…
Considering the scope of the change, I also think this is a good starting point for making bucket index updates more scalable.
Yehuda,
What do you think?
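
To make the static-sharding idea concrete, here is a rough Python sketch. It is not radosgw code: the class ShardedBucketIndex, its methods, and the choice of crc32 as the hash are all assumptions made for illustration, and each in-memory dict merely stands in for the omap of one index object. The shard count is fixed at creation time, writes touch only the shard the key hashes to, and a listing reads every shard and merges the per-shard sorted results so that prefix scans still return keys in order.

```python
import heapq
import zlib
from itertools import islice


class ShardedBucketIndex:
    """Toy model of a bucket index split across a fixed number of shards.

    Hypothetical illustration only; each dict stands in for the
    key/value (omap) contents of one index object.
    """

    def __init__(self, num_shards):
        if num_shards < 1:
            raise ValueError("num_shards must be >= 1")
        # Static sharding: the count is chosen once, at creation time.
        self.num_shards = num_shards
        self.shards = [dict() for _ in range(num_shards)]

    def shard_for(self, key):
        # crc32 is an assumed hash; it is stable across runs, unlike hash().
        return zlib.crc32(key.encode("utf-8")) % self.num_shards

    def put(self, key, meta):
        # A write touches exactly one shard, so writes to keys in
        # different shards no longer serialize on a single index object.
        self.shards[self.shard_for(key)][key] = meta

    def list(self, prefix="", limit=None):
        # Each shard yields its matching keys in sorted order; in a real
        # system these N reads could be issued in parallel.
        streams = (
            sorted(k for k in shard if k.startswith(prefix))
            for shard in self.shards
        )
        # Merge the N sorted streams into one globally sorted listing.
        return list(islice(heapq.merge(*streams), limit))
```

For example, after putting "photos/b.jpg" and "photos/a.jpg" into an index created with ShardedBucketIndex(4), idx.list(prefix="photos/") returns the two keys in sorted order regardless of which shards they landed in. The sketch does not address re-sharding: changing num_shards would remap most keys, which is the transition problem raised earlier in the thread.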
> 
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
