On Sun, May 18, 2014 at 11:18 PM, Guang Yang <yguang11@xxxxxxxxxxx> wrote:
> On May 19, 2014, at 7:05 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>
>> On Sun, 18 May 2014, Guang wrote:
>>>>> radosgw is using the omap key/value API for objects, which is more
>>>>> or less equivalent to what swift is doing with sqlite. This data
>>>>> passes straight into leveldb on the backend (or whatever other
>>>>> backend you are using). Using something like rocksdb in its place
>>>>> is pretty simple and there are unmerged patches to do that; the
>>>>> user would just need to adjust their crush map so that the rgw
>>>>> index pool is mapped to a different set of OSDs with the better
>>>>> k/v backend.
>>> Not sure if I'm missing anything, but the key difference from
>>> Swift's implementation is that they use a table for the bucket
>>> index, which can be updated in parallel and is therefore more
>>> scalable for writes, though at a certain point the SQL table would
>>> suffer performance degradation as well.
>>
>> As I understand it, the same limitation is present there too: the
>> index is in a single sqlite table.
>>
>>>> My more well-formed opinion is that we need to come up with a good
>>>> design. It needs to be flexible enough to be able to grow (and
>>>> maybe shrink), and I assume there would be some kind of background
>>>> operation that will enable that. I also believe that making it
>>>> hash based is the way to go. It looks like the more complicated
>>>> issue here is how to handle the transition in which we shard
>>>> buckets.
>>> Yeah, I agree. I think the conflicting goals here are that we want
>>> a sorted list (so that it enables prefix scans for listing
>>> purposes) and we want to shard from the very beginning (the problem
>>> we are facing is that parallel writes updating the same bucket
>>> index object need to be serialized).
>>
>> Given how infrequent container listings are, pre-sharding containers
>> across several objects makes some sense. Doing listings in parallel
>> across N shards (where N is not too big) is not a big price to pay.
>> However, there will always need to be a way to re-shard further when
>> containers/buckets get extremely big. Perhaps a starting point would
>> be support for static sharding, where the number of shards is
>> specified at container/bucket creation time…
> Considering the scope of the change, I also think this is a good
> starting point for making bucket index updates more scalable.
> Yehuda,
> What do you think?

Sharding will help with scaling up to a certain point. As Sage
mentioned, we can start with a static setting as a first, simpler
approach, and move to a dynamic approach later on.

Yehuda
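
To make the static-sharding idea being discussed concrete, here is a
minimal C++ sketch of the write path, assuming a hypothetical per-shard
object naming scheme and a shard count fixed at bucket creation time.
The names, the ".dir." prefix, and the hash are illustrative
assumptions, not radosgw's actual code:

  #include <cstdint>
  #include <string>

  // Stable string hash (djb2). std::hash is not guaranteed to be
  // stable across processes, which matters when the result decides
  // on-disk placement.
  static uint32_t stable_hash(const std::string& s) {
    uint32_t h = 5381;
    for (unsigned char c : s)
      h = h * 33 + c;
    return h;
  }

  // The number of shards is fixed at bucket creation time
  // (static sharding).
  struct BucketIndex {
    std::string bucket_id;
    uint32_t num_shards;

    // Pick the shard object that would hold the index entry for this
    // key. Writes to different keys then spread across num_shards
    // objects instead of serializing on a single bucket index object.
    std::string shard_object_for(const std::string& object_name) const {
      uint32_t shard = stable_hash(object_name) % num_shards;
      return ".dir." + bucket_id + "." + std::to_string(shard);
    }
  };

With a layout like this, parallel PUTs on different keys usually land
on different shard objects, so their omap updates no longer contend on
one index object.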
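The flip side is the listing cost Sage mentions: each shard holds its
own sorted slice of the key space, so a bucket listing has to merge
the N per-shard listings back into one sorted stream. A minimal merge
sketch using a min-heap, again purely illustrative:

  #include <functional>
  #include <queue>
  #include <string>
  #include <tuple>
  #include <vector>

  // Merge N sorted per-shard key listings into one sorted result.
  std::vector<std::string> merge_shard_listings(
      const std::vector<std::vector<std::string>>& shards) {
    // (key, shard index, position within that shard), min-ordered
    // by key.
    using Entry = std::tuple<std::string, size_t, size_t>;
    std::priority_queue<Entry, std::vector<Entry>,
                        std::greater<Entry>> heap;

    // Seed the heap with the first key of every non-empty shard.
    for (size_t s = 0; s < shards.size(); ++s)
      if (!shards[s].empty())
        heap.emplace(shards[s][0], s, 0);

    std::vector<std::string> out;
    while (!heap.empty()) {
      auto [key, s, i] = heap.top();
      heap.pop();
      out.push_back(std::move(key));
      // Advance within the shard the smallest key came from.
      if (i + 1 < shards[s].size())
        heap.emplace(shards[s][i + 1], s, i + 1);
    }
    return out;
  }

The per-entry overhead grows with log N, which is why listings stay
cheap for a small static N; growing N later is exactly the re-sharding
transition identified above as the hard part.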