Re: Radosgw - bucket index

Guang Yang <yguang11@xxxxxxxxxxx> · Mon, 2 Jun 2014 21:37:30 +0800

Hi Yehuda and Sage,
Could you comment on the ticket? I would like to send out a pull request for you to review some time this week, but before that it would be good to get your comments on the interface and any other concerns you may have.
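
To make the discussion concrete, here is a rough sketch (Python, in-memory only) of the static sharding idea from the thread quoted below: the shard count is fixed at bucket creation, each write goes to one shard chosen by hashing the object name, and a listing merges the sorted per-shard results. The names shard_for_key, ShardedBucketIndex and list_objects, and the choice of CRC32 as the hash, are illustrative assumptions only; they are not the interface proposed in the ticket and not anything in the radosgw code.

```python
import heapq
import zlib
from itertools import islice


def shard_for_key(object_name: str, num_shards: int) -> int:
    """Pick a shard for an object name using a stable hash (CRC32 here)."""
    return zlib.crc32(object_name.encode("utf-8")) % num_shards


class ShardedBucketIndex:
    """Toy stand-in for a bucket index split across N index objects."""

    def __init__(self, num_shards: int):
        if num_shards < 1:
            raise ValueError("num_shards must be >= 1")
        # Static sharding: the shard count is fixed at creation time.
        self.num_shards = num_shards
        self.shards = [{} for _ in range(num_shards)]

    def put(self, object_name: str, meta: dict) -> None:
        # Writes to different shards touch different index objects,
        # so they would not need to be serialized against each other.
        shard = self.shards[shard_for_key(object_name, self.num_shards)]
        shard[object_name] = meta

    def list_objects(self, prefix: str = "", max_keys: int = 1000) -> list:
        # Each shard yields its matching keys in sorted order; a k-way
        # merge then restores one globally sorted listing.
        per_shard = (
            sorted(k for k in shard if k.startswith(prefix))
            for shard in self.shards
        )
        return list(islice(heapq.merge(*per_shard), max_keys))
```

Hashing spreads writes evenly but destroys key ordering across shards, which is why the listing has to visit every shard and merge; that is the cost that seems acceptable as long as N is not too big.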

Thanks,
Guang

On May 30, 2014, at 8:35 AM, Guang Yang <yguang11@xxxxxxxxxxx> wrote:

> Hi Yehuda,
> I opened an issue here: http://tracker.ceph.com/issues/8473. Please review and comment.
> 
> Thanks,
> Guang
> 
> On May 19, 2014, at 2:47 PM, Yehuda Sadeh <yehuda@xxxxxxxxxxx> wrote:
> 
>> On Sun, May 18, 2014 at 11:18 PM, Guang Yang <yguang11@xxxxxxxxxxx> wrote:
>>> On May 19, 2014, at 7:05 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>>> 
>>>> On Sun, 18 May 2014, Guang wrote:
>>>>>>> radosgw is using the omap key/value API for objects, which is more or less
>>>>>>> equivalent to what swift is doing with sqlite.  This data passes straight
>>>>>>> into leveldb on the backend (or whatever other backend you are using).
>>>>>>> Using something like rocksdb in its place is pretty simple and there are
>>>>>>> unmerged patches to do that; the user would just need to adjust their
>>>>>>> crush map so that the rgw index pool is mapped to a different set of OSDs
>>>>>>> with the better k/v backend.
>>>>> Not sure if I am missing anything, but the key difference with Swift's
>>>>> implementation is that they are using a table for the bucket index and it
>>>>> can actually be updated in parallel, which makes it more scalable for writes,
>>>>> though at a certain point the SQL table would result in performance
>>>>> degradation as well.
>>>> 
>>>> As I understand it the same limitation is present there too: the index is
>>>> in a single sqlite table.
>>>> 
>>>>>> My more well-formed opinion is that we need to come up with a good
>>>>>> design. It needs to be flexible enough to be able to grow (and maybe
>>>>>> shrink), and I assume there would be some kind of background operation
>>>>>> that will enable that. I also believe that making it hash based is the
>>>>>> way to go. It looks like the more complicated issue here is
>>>>>> how to handle the transition in which we shard buckets.
>>>>> Yeah, I agree. I think the conflicting goals here are that we want a sorted
>>>>> list (so that it enables prefix scans for listing) and we want to
>>>>> shard from the very beginning (the problem we are facing is that parallel
>>>>> writes updating the same bucket index object need to be
>>>>> serialized).
>>>> 
>>>> Given how infrequent container listings are, pre-sharding containers
>>>> across several objects makes some sense.  Paying the cost of doing
>>>> listings in parallel across N (where N is not too big) is not a big price
>>>> to pay. However, there will always need to be a way to re-shard further
>>>> when containers/buckets get extremely big.  Perhaps a starting point would
>>>> be support for static sharding where the number of shards is specified at
>>>> container/bucket creation time…
>>> Considering the scope of the change, I also think this is a good starting point to make bucket index updates more scalable.
>>> Yehuda,
>>> What do you think?
>> 
>> Sharding it will help with scaling it up to a certain point. As Sage
>> mentioned we can start with a static setting as a first simpler
>> approach, and move into a dynamic approach later on.
>> 
>> Yehuda
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
