Hi Yehuda,

I opened an issue here: http://tracker.ceph.com/issues/8473, please help to review and comment.

Thanks,
Guang

On May 19, 2014, at 2:47 PM, Yehuda Sadeh <yehuda@xxxxxxxxxxx> wrote:

> On Sun, May 18, 2014 at 11:18 PM, Guang Yang <yguang11@xxxxxxxxxxx> wrote:
>> On May 19, 2014, at 7:05 AM, Sage Weil <sage@xxxxxxxxxxx> wrote:
>>
>>> On Sun, 18 May 2014, Guang wrote:
>>>>>> radosgw is using the omap key/value API for objects, which is more or less
>>>>>> equivalent to what swift is doing with sqlite. This data passes straight
>>>>>> into leveldb on the backend (or whatever other backend you are using).
>>>>>> Using something like rocksdb in its place is pretty simple and there are
>>>>>> unmerged patches to do that; the user would just need to adjust their
>>>>>> crush map so that the rgw index pool is mapped to a different set of OSDs
>>>>>> with the better k/v backend.
>>>> Not sure if I am missing anything, but the key difference from Swift's
>>>> implementation is that they use a table for the bucket index, which can be
>>>> updated in parallel and is therefore more scalable for writes, though at
>>>> some point the SQL table runs into performance degradation as well.
>>>
>>> As I understand it the same limitation is present there too: the index is
>>> in a single sqlite table.
>>>
>>>>> My more well-formed opinion is that we need to come up with a good
>>>>> design. It needs to be flexible enough to be able to grow (and maybe
>>>>> shrink), and I assume there would be some kind of background operation
>>>>> that will enable that. I also believe that making it hash based is the
>>>>> way to go. It looks like the more complicated issue here is
>>>>> how to handle the transition in which we shard buckets.
>>>> Yeah, I agree. I think the conflicting goals here are that we want a sorted
>>>> list (so that it enables prefix scans for listing purposes), and we want to
>>>> shard from the very beginning (the problem we are facing is that parallel
>>>> writes updating the same bucket index object have to be serialized).
>>>
>>> Given how infrequent container listings are, pre-sharding containers
>>> across several objects makes some sense. Paying the cost of doing
>>> listings in parallel across N (where N is not too big) is not a big price
>>> to pay. However, there will always need to be a way to re-shard further
>>> when containers/buckets get extremely big. Perhaps a starting point would
>>> be support for static sharding where the number of shards is specified at
>>> container/bucket creation time…
>> Considering the scope of the change, I also think this is a good starting
>> point for making bucket index updates more scalable.
>> Yehuda, what do you think?
>
> Sharding will help with scaling up to a certain point. As Sage
> mentioned, we can start with a static setting as a first, simpler
> approach, and move to a dynamic approach later on.
>
> Yehuda
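
For concreteness, here is a minimal sketch of the static-sharding idea discussed above. It assumes the shard count is fixed at bucket creation time, picks an index shard by hashing the object name, and merges the N per-shard sorted key sets back into one ordered listing. The names, hash function, and shard-object naming scheme below are hypothetical illustrations, not the actual RGW implementation:

// Hypothetical sketch of a statically sharded bucket index (not RGW code).
// Writes for different objects usually hit different index shard objects,
// so they no longer serialize on a single bucket index object; a bucket
// listing merges the per-shard sorted key sets into one ordered result.

#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <set>
#include <string>
#include <vector>

// Pick the index shard for an object by hashing its name.
// num_shards is fixed at bucket creation time in this sketch.
static uint32_t shard_for(const std::string& obj_name, uint32_t num_shards) {
  return static_cast<uint32_t>(std::hash<std::string>{}(obj_name)) % num_shards;
}

// Name of the per-shard index object (naming scheme is illustrative only).
static std::string shard_oid(const std::string& bucket_id, uint32_t shard) {
  return ".dir." + bucket_id + "." + std::to_string(shard);
}

int main() {
  const std::string bucket_id = "default.1234.1";  // hypothetical bucket id
  const uint32_t num_shards = 8;                    // chosen at bucket creation

  // Simulate the per-shard omap key sets (each shard keeps its keys sorted).
  std::vector<std::set<std::string>> shards(num_shards);
  for (const std::string& name : {"a/1", "a/2", "b/1", "c/7", "c/8"}) {
    uint32_t s = shard_for(name, num_shards);
    shards[s].insert(name);
    std::cout << name << " -> " << shard_oid(bucket_id, s) << "\n";
  }

  // Bucket listing: merge the N sorted shard listings into one sorted view,
  // which preserves prefix scans at the cost of reading N index objects.
  std::map<std::string, uint32_t> merged;  // key -> shard it came from
  for (uint32_t s = 0; s < num_shards; ++s)
    for (const auto& key : shards[s])
      merged.emplace(key, s);

  std::cout << "listing:\n";
  for (const auto& [key, s] : merged)
    std::cout << "  " << key << " (shard " << s << ")\n";
  return 0;
}

Re-sharding a very large bucket later would mean rehashing every key into a new set of shard objects, which is the transition problem raised in the thread; fixing the shard count at creation time sidesteps that for a first version.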