Blind-bucket would be perfect for us, as we don't need to list the objects. We only need to list the bucket when doing a bucket deletion. If we could clean out/delete all objects in a bucket (without iterating/listing them), that would be ideal.

On Mon, Mar 2, 2015 at 7:34 PM, GuangYang <yguang11@xxxxxxxxxxx> wrote:
> We have had good experience so far keeping each bucket to less than 0.5 million objects, by client side sharding. But I think it would be nice if you can test at your scale, with your hardware configuration, as well as your expectation over the tail latency.
>
> Generally the bucket sharding should help, both for write throughput and *stalls during recovery/scrubbing*, but it comes with a price - with X shards for each bucket, listing/trimming is X times as heavy from the OSD load's point of view. There was discussion to implement:
> 1) blind bucket (for use cases where bucket listing is not needed);
> 2) un-ordered listing, which could improve the problem I mentioned above.
> They are on the roadmap...
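>
> To be clear about what I mean by client side sharding: the client hashes the object name and spreads writes over N real buckets that together form one logical bucket, so no single bucket index grows too large. A rough, untested sketch - the shard count and bucket naming below are only examples:
>
> import hashlib
>
> NUM_SHARDS = 64  # chosen so each shard stays well under ~500k objects
>
> def shard_bucket(logical_bucket, object_key, num_shards=NUM_SHARDS):
>     # a stable hash of the key picks which real bucket this object lives in
>     h = int(hashlib.md5(object_key.encode()).hexdigest(), 16)
>     return "%s-%d" % (logical_bucket, h % num_shards)
>
> # e.g. "user123/img_0042.jpg" always maps to the same photos-NN bucket
> print(shard_bucket("photos", "user123/img_0042.jpg"))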
>
> Thanks,
> Guang
>
>
> ----------------------------------------
>> From: bhines@xxxxxxxxx
>> Date: Mon, 2 Mar 2015 18:13:25 -0800
>> To: erdem.agaoglu@xxxxxxxxx
>> CC: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: Some long running ops may lock osd
>>
>> We're seeing a lot of this as well (as I mentioned to Sage at SCALE). Is there a rule of thumb at all for how big it is safe to let an RGW bucket get?
>>
>> Also, is this theoretically resolved by the new bucket-sharding feature in the latest dev release?
>>
>> -Ben
>>
>> On Mon, Mar 2, 2015 at 11:08 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> wrote:
>>> Hi Gregory,
>>>
>>> We are not using listomapkeys that way, or in any way to be precise. I used it here just to reproduce the behavior/issue.
>>>
>>> What I am really interested in is whether deep-scrubbing actually mitigates the problem and/or whether there is something that can be further improved.
>>>
>>> Or I guess we should go upgrade now and hope for the best :)
>>>
>>> On Mon, Mar 2, 2015 at 8:10 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>
>>>> On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> wrote:
>>>>> Hi all, especially devs,
>>>>>
>>>>> We have recently pinpointed one of the causes of slow requests in our cluster. It seems deep-scrubs on pg's that contain the index file for a large radosgw bucket lock the osds. Increasing op threads and/or disk threads helps a little bit, but we need to increase them beyond reason in order to completely get rid of the problem. A somewhat similar (and more severe) version of the issue occurs when we call listomapkeys for the index file, and since the logs for deep-scrubbing were much harder to read, this inspection was based on listomapkeys.
>>>>>
>>>>> In this example osd.121 is the primary of pg 10.c91, which contains the file .dir.5926.3 in the .rgw.buckets pool. The OSD has 2 op threads. The bucket contains ~500k objects. A standard listomapkeys call takes about 3 seconds:
>>>>>
>>>>> time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
>>>>> real 0m2.983s
>>>>> user 0m0.760s
>>>>> sys 0m0.148s
>>>>>
>>>>> In order to lock the osd we request 2 of them simultaneously with something like:
>>>>>
>>>>> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>>>>> sleep 1
>>>>> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>>>>>
>>>>> 'debug_osd=30' logs show the flow like:
>>>>>
>>>>> At t0 some thread enqueue_op's my omap-get-keys request.
>>>>> Op-Thread A locks pg 10.c91, dequeue_op's it and starts reading ~500k keys.
>>>>> Op-Thread B responds to several other requests during that 1 second sleep. They're generally extremely fast subops on other pgs.
>>>>> At t1 (about a second later) my second omap-get-keys request gets enqueue_op'ed, but it does not start, probably because of the lock held by Op-Thread A.
>>>>> After that point other threads enqueue_op other requests on other pgs too, but none of them starts processing, which is when I consider the osd locked.
>>>>> At t2 (about another second later) my first omap-get-keys request is finished.
>>>>> Op-Thread B locks pg 10.c91, dequeue_op's my second request and starts reading ~500k keys again.
>>>>> Op-Thread A continues to process the requests enqueued in t1-t2.
>>>>>
>>>>> It seems Op-Thread B is waiting on the lock held by Op-Thread A even though it could be processing other requests for other pg's just fine.
>>>>>
>>>>> My guess is that a somewhat larger version of this scenario happens in deep-scrubbing, e.g. on the pg containing the index for a bucket of >20M objects. A disk/op thread starts reading through the omap, which will take, say, 60 seconds. During the first seconds other requests for other pgs pass just fine, but within those 60 seconds there are bound to be other requests for the same pg, especially since it holds the index file. Each of these requests locks another disk/op thread, to the point where there are no free threads left to process requests for any pg, causing slow requests.
>>>>>
>>>>> So first of all, thanks if you made it this far, and sorry for the involved mail; I'm exploring the problem as I go.
>>>>> Now, is the deep-scrubbing situation I tried to theorize even possible? If not, can you point us to where to look further?
>>>>> We are currently running 0.72.2 and know about the newer ioprio settings in Firefly and such. We are planning to upgrade in a few weeks, but I don't think those options will help us in any way. Am I correct?
>>>>> Are there any other improvements that we are not aware of?
>>>>
>>>> This is all basically correct; it's one of the reasons you don't want
>>>> to let individual buckets get too large.
>>>>
>>>> That said, I'm a little confused about why you're running listomapkeys
>>>> that way. RGW throttles itself by getting only a certain number of
>>>> entries at a time (1000?) and any system you're also building should
>>>> do the same. That would reduce the frequency of any issues, and I
>>>> *think* that scrubbing has some mitigating factors to help (although
>>>> maybe not; it's been a while since I looked at any of that stuff).
>>>>
>>>> Although I just realized that my vague memory of deep scrubbing
>>>> working better might be based on improvements that only got in for
>>>> firefly... not sure.
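>>>>
>>>> To spell out the paging I mean: something along these lines, fetching
>>>> the omap keys in chunks instead of all ~500k in one op, so the pg lock
>>>> is only held briefly each time. This is an untested sketch against the
>>>> python-rados bindings (check that the read-op omap calls exist in the
>>>> version you have; the 1000-key chunk size is arbitrary):
>>>>
>>>> import rados
>>>>
>>>> cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>>>> cluster.connect()
>>>> ioctx = cluster.open_ioctx('.rgw.buckets')
>>>>
>>>> CHUNK = 1000
>>>> start_after = ""
>>>> while True:
>>>>     with rados.ReadOpCtx() as read_op:
>>>>         # ask for at most CHUNK omap keys, starting after the last one seen
>>>>         keys, ret = ioctx.get_omap_keys(read_op, start_after, CHUNK)
>>>>         ioctx.operate_read_op(read_op, ".dir.5926.3")
>>>>         count = 0
>>>>         for name, _ in keys:
>>>>             count += 1
>>>>             start_after = name
>>>>             # process each key here instead of buffering all of them
>>>>     if count < CHUNK:
>>>>         break
>>>>
>>>> ioctx.close()
>>>> cluster.shutdown()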
>>>> -Greg
>>>
>>>
>>> --
>>> erdem agaoglu