On May 19, 2014, at 7:05 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:

> On Sun, 18 May 2014, Guang wrote:
>>>> radosgw is using the omap key/value API for objects, which is more or less
>>>> equivalent to what swift is doing with sqlite. This data passes straight
>>>> into leveldb on the backend (or whatever other backend you are using).
>>>> Using something like rocksdb in its place is pretty simple and there are
>>>> unmerged patches to do that; the user would just need to adjust their
>>>> crush map so that the rgw index pool is mapped to a different set of OSDs
>>>> with the better k/v backend.
>> Not sure if I'm missing anything, but the key difference with Swift's
>> implementation is that they are using a table for the bucket index and it
>> actually can be updated in parallel, which makes it more scalable for writes,
>> though at a certain point the sql table would result in performance
>> degradation as well.
>
> As I understand it the same limitation is present there too: the index is
> in a single sqlite table.
>
>>> My more well-formed opinion is that we need to come up with a good
>>> design. It needs to be flexible enough to be able to grow (and maybe
>>> shrink), and I assume there would be some kind of background operation
>>> that will enable that. I also believe that making it hash based is the
>>> way to go. It looks like the more complicated issue here is
>>> how to handle the transition in which we shard buckets.
>> Yeah I agree. I think the conflicting goals here are, we want a sorted
>> list (so that it enables prefix scans for listing purposes) and we want to
>> shard at the very beginning (the problem we are facing is that parallel
>> writes updating the same bucket index object will need to be
>> serialized).
>
> Given how infrequent container listings are, pre-sharding containers
> across several objects makes some sense. Paying the cost of doing
> listings in parallel across N (where N is not too big) is not a big price
> to pay.
> However, there will always need to be a way to re-shard further
> when containers/buckets get extremely big. Perhaps a starting point would
> be support for static sharding where the number of shards is specified at
> container/bucket creation time…

Considering the scope of the change, I also think this is a good starting
point to make bucket index updates more scalable. Yehuda, what do you think?

>
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
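
To make the static-sharding idea from this thread concrete, here is a rough sketch in Python. It is only an illustration under stated assumptions, not radosgw code: the class and method names (ShardedBucketIndex, shard_for, put, list) are hypothetical, each in-memory dict stands in for one index object's omap, and crc32 is an arbitrary choice of stable hash. It shows the two pieces discussed above: writes are spread over N shards by hashing the object name, and a sorted (optionally prefix-filtered) listing is recovered by merging the per-shard sorted key streams.

```python
import heapq
import zlib
from itertools import islice


class ShardedBucketIndex:
    """Toy in-memory model of a statically sharded bucket index (hypothetical)."""

    def __init__(self, num_shards):
        # The shard count is fixed at creation time ("static sharding").
        if num_shards < 1:
            raise ValueError("num_shards must be >= 1")
        self.num_shards = num_shards
        # One dict per shard; each stands in for one index object's omap.
        self.shards = [dict() for _ in range(num_shards)]

    def shard_for(self, key):
        # crc32 is stable across processes (the built-in hash() is salted),
        # so the same key always maps to the same shard.
        return zlib.crc32(key.encode("utf-8")) % self.num_shards

    def put(self, key, meta):
        # Writes to different shards touch different index objects, so they
        # would not need to be serialized against each other.
        self.shards[self.shard_for(key)][key] = meta

    def list(self, prefix="", limit=None):
        # Each shard contributes its matching keys in sorted order; the
        # N-way merge restores a single globally sorted listing.
        streams = (
            sorted(k for k in shard if k.startswith(prefix))
            for shard in self.shards
        )
        return list(islice(heapq.merge(*streams), limit))
```

The trade-off is the one Sage describes: a listing has to consult all N shards, which is acceptable while N stays small and listings are infrequent. Changing num_shards later would remap most keys, which is why re-sharding remains a separate problem.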