On Sun, May 18, 2014 at 11:18 PM, Guang Yang <yguang11@xxxxxxxxxxx> wrote:
> On May 19, 2014, at 7:05 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>
>> On Sun, 18 May 2014, Guang wrote:
>>>>> radosgw is using the omap key/value API for objects, which is more
>>>>> or less equivalent to what swift is doing with sqlite. This data
>>>>> passes straight into leveldb on the backend (or whatever other
>>>>> backend you are using). Using something like rocksdb in its place
>>>>> is pretty simple and there are unmerged patches to do that; the
>>>>> user would just need to adjust their crush map so that the rgw
>>>>> index pool is mapped to a different set of OSDs with the better
>>>>> k/v backend.
>>> Not sure if I'm missing anything, but the key difference from
>>> Swift's implementation is that they use a table for the bucket
>>> index, which can be updated in parallel and is therefore more
>>> scalable for writes, though at a certain point the SQL table would
>>> suffer performance degradation as well.
>>
>> As I understand it, the same limitation is present there too: the
>> index is in a single sqlite table.
>>
>>>> My more well-formed opinion is that we need to come up with a good
>>>> design. It needs to be flexible enough to be able to grow (and
>>>> maybe shrink), and I assume there would be some kind of background
>>>> operation that will enable that. I also believe that making it
>>>> hash based is the way to go. It looks like the more complicated
>>>> issue here is how to handle the transition in which we shard
>>>> buckets.
>>> Yeah, I agree. I think the conflicting goals here are that we want
>>> a sorted list (so that it enables prefix scans for listing
>>> purposes) and we want to shard from the very beginning (the problem
>>> we are facing is that parallel writes updating the same bucket
>>> index object need to be serialized).
>>
>> Given how infrequent container listings are, pre-sharding containers
>> across several objects makes some sense. Doing listings in parallel
>> across N shards (where N is not too big) is not a big price to pay.
>> However, there will always need to be a way to re-shard further when
>> containers/buckets get extremely big. Perhaps a starting point would
>> be support for static sharding, where the number of shards is
>> specified at container/bucket creation time…
> Considering the scope of the change, I also think this is a good
> starting point for making bucket index updates more scalable.
> Yehuda,
> What do you think?

Sharding will help with scaling up to a certain point. As Sage
mentioned, we can start with a static setting as a first, simpler
approach, and move to a dynamic approach later on.

Yehuda
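
To make the static-sharding idea being discussed concrete, here is a
minimal C++ sketch of the write path, assuming a hypothetical per-shard
object naming scheme and a shard count fixed at bucket creation time.
The names, the ".dir." prefix, and the hash are illustrative
assumptions, not radosgw's actual code:

  #include <cstdint>
  #include <string>

  // Stable string hash (djb2). std::hash is not guaranteed to be
  // stable across processes, which matters when the result decides
  // on-disk placement.
  static uint32_t stable_hash(const std::string& s) {
    uint32_t h = 5381;
    for (unsigned char c : s)
      h = h * 33 + c;
    return h;
  }

  // The number of shards is fixed at bucket creation time
  // (static sharding).
  struct BucketIndex {
    std::string bucket_id;
    uint32_t num_shards;

    // Pick the shard object that would hold the index entry for this
    // key. Writes to different keys then spread across num_shards
    // objects instead of serializing on a single bucket index object.
    std::string shard_object_for(const std::string& object_name) const {
      uint32_t shard = stable_hash(object_name) % num_shards;
      return ".dir." + bucket_id + "." + std::to_string(shard);
    }
  };

With a layout like this, parallel PUTs on different keys usually land
on different shard objects, so their omap updates no longer contend on
one index object.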
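The flip side is the listing cost Sage mentions: each shard holds its
own sorted slice of the key space, so a bucket listing has to merge
the N per-shard listings back into one sorted stream. A minimal merge
sketch using a min-heap, again purely illustrative:

  #include <functional>
  #include <queue>
  #include <string>
  #include <tuple>
  #include <vector>

  // Merge N sorted per-shard key listings into one sorted result.
  std::vector<std::string> merge_shard_listings(
      const std::vector<std::vector<std::string>>& shards) {
    // (key, shard index, position within that shard), min-ordered
    // by key.
    using Entry = std::tuple<std::string, size_t, size_t>;
    std::priority_queue<Entry, std::vector<Entry>,
                        std::greater<Entry>> heap;

    // Seed the heap with the first key of every non-empty shard.
    for (size_t s = 0; s < shards.size(); ++s)
      if (!shards[s].empty())
        heap.emplace(shards[s][0], s, 0);

    std::vector<std::string> out;
    while (!heap.empty()) {
      auto [key, s, i] = heap.top();
      heap.pop();
      out.push_back(std::move(key));
      // Advance within the shard the smallest key came from.
      if (i + 1 < shards[s].size())
        heap.emplace(shards[s][i + 1], s, i + 1);
    }
    return out;
  }

The per-entry overhead grows with log N, which is why listings stay
cheap for a small static N; growing N later is exactly the re-sharding
transition identified above as the hard part.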