Hi Yehuda,

I opened an issue here: http://tracker.ceph.com/issues/8473, please help to review and comment.

Thanks,
Guang

On May 19, 2014, at 2:47 PM, Yehuda Sadeh <yehuda@xxxxxxxxxxx> wrote:

> On Sun, May 18, 2014 at 11:18 PM, Guang Yang <yguang11@xxxxxxxxxxx> wrote:
>> On May 19, 2014, at 7:05 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>>
>>> On Sun, 18 May 2014, Guang wrote:
>>>>>> radosgw is using the omap key/value API for objects, which is more or less
>>>>>> equivalent to what swift is doing with sqlite. This data passes straight
>>>>>> into leveldb on the backend (or whatever other backend you are using).
>>>>>> Using something like rocksdb in its place is pretty simple and there are
>>>>>> unmerged patches to do that; the user would just need to adjust their
>>>>>> crush map so that the rgw index pool is mapped to a different set of OSDs
>>>>>> with the better k/v backend.
>>>> Not sure if I am missing anything, but the key difference from Swift's
>>>> implementation is that they use a table for the bucket index, which can be
>>>> updated in parallel and is therefore more scalable for writes, though at
>>>> some point the SQL table runs into performance degradation as well.
>>>
>>> As I understand it the same limitation is present there too: the index is
>>> in a single sqlite table.
>>>
>>>>> My more well-formed opinion is that we need to come up with a good
>>>>> design. It needs to be flexible enough to be able to grow (and maybe
>>>>> shrink), and I assume there would be some kind of background operation
>>>>> that will enable that. I also believe that making it hash based is the
>>>>> way to go. It looks like the more complicated issue here is
>>>>> how to handle the transition in which we shard buckets.
>>>> Yeah, I agree. I think the conflicting goals here are that we want a sorted
>>>> list (so that it enables prefix scans for listing purposes), and we want to
>>>> shard from the very beginning (the problem we are facing is that parallel
>>>> writes updating the same bucket index object have to be serialized).
>>>
>>> Given how infrequent container listings are, pre-sharding containers
>>> across several objects makes some sense. Paying the cost of doing
>>> listings in parallel across N (where N is not too big) is not a big price
>>> to pay. However, there will always need to be a way to re-shard further
>>> when containers/buckets get extremely big. Perhaps a starting point would
>>> be support for static sharding where the number of shards is specified at
>>> container/bucket creation time…
>> Considering the scope of the change, I also think this is a good starting
>> point for making bucket index updates more scalable.
>> Yehuda, what do you think?
>
> Sharding will help with scaling up to a certain point. As Sage
> mentioned, we can start with a static setting as a first, simpler
> approach, and move to a dynamic approach later on.
>
> Yehuda
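
For concreteness, here is a minimal sketch of the static-sharding idea discussed above. It assumes the shard count is fixed at bucket creation time, picks an index shard by hashing the object name, and merges the N per-shard sorted key sets back into one ordered listing. The names, hash function, and shard-object naming scheme below are hypothetical illustrations, not the actual RGW implementation:

// Hypothetical sketch of a statically sharded bucket index (not RGW code).
// Writes for different objects usually hit different index shard objects,
// so they no longer serialize on a single bucket index object; a bucket
// listing merges the per-shard sorted key sets into one ordered result.

#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Pick the index shard for an object by hashing its name.
// num_shards is fixed at bucket creation time in this sketch.
static uint32_t shard_for(const std::string& obj_name, uint32_t num_shards) {
  return static_cast<uint32_t>(std::hash<std::string>{}(obj_name)) % num_shards;
}

// Name of the per-shard index object (naming scheme is illustrative only).
static std::string shard_oid(const std::string& bucket_id, uint32_t shard) {
  return ".dir." + bucket_id + "." + std::to_string(shard);
}

int main() {
  const std::string bucket_id = "default.1234.1";  // hypothetical bucket id
  const uint32_t num_shards = 8;                    // chosen at bucket creation

  // Simulate the per-shard omap key sets (each shard keeps its keys sorted).
  std::vector<std::set<std::string>> shards(num_shards);
  for (const std::string& name : {"a/1", "a/2", "b/1", "c/7", "c/8"}) {
    uint32_t s = shard_for(name, num_shards);
    shards[s].insert(name);
    std::cout << name << " -> " << shard_oid(bucket_id, s) << "\n";
  }

  // Bucket listing: merge the N sorted shard listings into one sorted view,
  // which preserves prefix scans at the cost of reading N index objects.
  std::map<std::string, uint32_t> merged;  // key -> shard it came from
  for (uint32_t s = 0; s < num_shards; ++s)
    for (const auto& key : shards[s])
      merged.emplace(key, s);

  std::cout << "listing:\n";
  for (const auto& [key, s] : merged)
    std::cout << "  " << key << " (shard " << s << ")\n";
  return 0;
}

Re-sharding a very large bucket later would mean rehashing every key into a new set of shard objects, which is the transition problem raised in the thread; fixing the shard count at creation time sidesteps that for a first version.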