Re: Some long running ops may lock osd

A blind bucket would be perfect for us, as we don't need to list the objects.

We only need to list a bucket when deleting it. If we could clean out/delete
all objects in a bucket without iterating/listing them, that would be ideal.
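
For reference, the closest thing I know of today (which, as far as I
understand, still iterates over the index under the hood) is something like:

radosgw-admin bucket rm --bucket=<bucket-name> --purge-objects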

On Mon, Mar 2, 2015 at 7:34 PM, GuangYang <yguang11@xxxxxxxxxxx> wrote:
> We have had good experience so far keeping each bucket under 0.5 million objects by client-side sharding. But I think it would be nice if you could test at your scale, with your hardware configuration, and against your expectations for tail latency.
>
> Generally, bucket sharding should help, both with write throughput and with *stalls during recovery/scrubbing*, but it comes with a price: with X shards per bucket, listing/trimming becomes roughly X times as heavy from the OSD load's point of view. There has been discussion about implementing: 1) blind buckets (for use cases where bucket listing is not needed), and 2) unordered listing, which could mitigate the problem I mentioned above. Both are on the roadmap...
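>
> For illustration, a minimal sketch of the client-side sharding idea (the
> shard count and names below are just examples, not a recommendation):
>
> import hashlib
>
> NUM_SHARDS = 64
> BASE_BUCKET = 'mydata'
>
> def bucket_for(key):
>     # A stable hash of the object name picks one of NUM_SHARDS buckets,
>     # so no single bucket index grows without bound.
>     h = int(hashlib.md5(key.encode()).hexdigest(), 16)
>     return '%s-%d' % (BASE_BUCKET, h % NUM_SHARDS)
>
> # e.g. bucket_for('images/2015/03/02/foo.jpg') maps every object name
> # deterministically to one of mydata-0 .. mydata-63.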
>
> Thanks,
> Guang
>
>
> ----------------------------------------
>> From: bhines@xxxxxxxxx
>> Date: Mon, 2 Mar 2015 18:13:25 -0800
>> To: erdem.agaoglu@xxxxxxxxx
>> CC: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re:  Some long running ops may lock osd
>>
>> We're seeing a lot of this as well (as I mentioned to Sage at SCALE).
>> Is there a rule of thumb at all for how big it is safe to let an RGW
>> bucket get?
>>
>> Also, is this theoretically resolved by the new bucket-sharding
>> feature in the latest dev release?
>>
>> -Ben
>>
>> On Mon, Mar 2, 2015 at 11:08 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> wrote:
>>> Hi Gregory,
>>>
>>> We are not using listomapkeys that way, or in any way to be precise; I
>>> used it here just to reproduce the behavior/issue.
>>>
>>> What I am really interested in is whether deep-scrubbing actually
>>> mitigates the problem, and/or whether there is something that can be
>>> further improved.
>>>
>>> Or I guess we should go upgrade now and hope for the best :)
>>>
>>> On Mon, Mar 2, 2015 at 8:10 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>
>>>> On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx>
>>>> wrote:
>>>>> Hi all, especially devs,
>>>>>
>>>>> We have recently pinpointed one of the causes of slow requests in our
>>>>> cluster. It seems deep-scrubs on PGs that contain the index object for
>>>>> a large radosgw bucket lock up the OSDs. Increasing op threads and/or
>>>>> disk threads helps a little bit, but we would need to increase them
>>>>> beyond reason to completely get rid of the problem. A somewhat similar
>>>>> (and more severe) version of the issue occurs when we call listomapkeys
>>>>> for the index object, and since the logs for deep-scrubbing were much
>>>>> harder to read, this inspection was based on listomapkeys.
>>>>>
>>>>> In this example osd.121 is the primary of pg 10.c91, which contains the
>>>>> object .dir.5926.3 in the .rgw.buckets pool. The OSD has 2 op threads.
>>>>> The bucket contains ~500k objects, and a standard listomapkeys call
>>>>> takes about 3 seconds:
>>>>>
>>>>> time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
>>>>> real 0m2.983s
>>>>> user 0m0.760s
>>>>> sys 0m0.148s
>>>>>
>>>>> In order to lock up the OSD we issue 2 of these simultaneously with
>>>>> something like:
>>>>>
>>>>> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>>>>> sleep 1
>>>>> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>>>>>
>>>>> 'debug_osd=30' logs show the flow like this:
>>>>>
>>>>> - At t0 some thread enqueue_op's my omap-get-keys request.
>>>>> - Op-Thread A locks pg 10.c91, dequeue_op's it and starts reading ~500k
>>>>>   keys.
>>>>> - Op-Thread B responds to several other requests during that 1 second
>>>>>   sleep. They're generally extremely fast subops on other PGs.
>>>>> - At t1 (about a second later) my second omap-get-keys request gets
>>>>>   enqueue_op'ed, but it does not start, presumably because of the lock
>>>>>   held by Op-Thread A.
>>>>> - After that point other threads enqueue_op further requests on other
>>>>>   PGs too, but none of them starts processing, which is the point at
>>>>>   which I consider the OSD locked up.
>>>>> - At t2 (about another second later) my first omap-get-keys request is
>>>>>   finished.
>>>>> - Op-Thread B locks pg 10.c91, dequeue_op's my second request and starts
>>>>>   reading ~500k keys again.
>>>>> - Op-Thread A continues to process the requests enqueued in t1-t2.
>>>>>
>>>>> So Op-Thread B waits on the lock held by Op-Thread A even though it
>>>>> could be processing requests for other PGs just fine.
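>>>>>
>>>>> (For anyone reproducing this: watching the OSD's admin socket should
>>>>> show the same picture, if I have the commands right, e.g.
>>>>> ceph daemon osd.121 dump_ops_in_flight
>>>>> ceph daemon osd.121 dump_historic_ops
>>>>> where the blocked requests show up as still in flight.)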
>>>>>
>>>>> My guess is that a somewhat larger version of this scenario happens
>>>>> during deep-scrubbing, e.g. on the PG containing the index for a bucket
>>>>> of >20M objects. A disk/op thread starts reading through the omap, which
>>>>> will take say 60 seconds. During the first seconds, requests for other
>>>>> PGs pass just fine. But over those 60 seconds there are bound to be
>>>>> other requests for the same PG, especially since it holds the index
>>>>> object. Each of these requests ties up another disk/op thread, to the
>>>>> point where there are no free threads left to process requests for any
>>>>> PG, causing slow requests.
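>>>>>
>>>>> (The thread counts in play can be checked on the admin socket, e.g.
>>>>> ceph daemon osd.121 config get osd_op_threads
>>>>> ceph daemon osd.121 config get osd_disk_threads
>>>>> if I have the option names right.)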
>>>>>
>>>>> So first of all, thanks if you have made it this far, and sorry for the
>>>>> involved mail; I'm exploring the problem as I go.
>>>>> Now, is the deep-scrubbing situation I tried to theorize above even
>>>>> possible? If not, can you point us to where to look further?
>>>>> We are currently running 0.72.2 and know about the newer ioprio settings
>>>>> in Firefly and such. We are planning to upgrade in a few weeks, but I
>>>>> don't think those options will help us in any way. Am I correct?
>>>>> Are there any other improvements that we are not aware of?
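>>>>>
>>>>> (The Firefly ioprio options I was referring to above are, if I remember
>>>>> the names right, something like:
>>>>>
>>>>> [osd]
>>>>> osd disk thread ioprio class = idle
>>>>> osd disk thread ioprio priority = 7
>>>>>
>>>>> which, as far as I know, only take effect when the disks use the CFQ
>>>>> scheduler.)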
>>>>
>>>> This is all basically correct; it's one of the reasons you don't want
>>>> to let individual buckets get too large.
>>>>
>>>> That said, I'm a little confused about why you're running listomapkeys
>>>> that way. RGW throttles itself by getting only a certain number of
>>>> entries at a time (1000?) and any system you're also building should
>>>> do the same. That would reduce the frequency of any issues, and I
>>>> *think* that scrubbing has some mitigating factors to help (although
>>>> maybe not; it's been a while since I looked at any of that stuff).
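>>>>
>>>> A rough sketch of what I mean by paging, with a reasonably recent
>>>> python-rados (pool/object names taken from your example; the page size
>>>> is my guess at what RGW uses):
>>>>
>>>> import rados
>>>>
>>>> PAGE = 1000
>>>>
>>>> cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
>>>> cluster.connect()
>>>> ioctx = cluster.open_ioctx('.rgw.buckets')
>>>>
>>>> start_after = ''
>>>> while True:
>>>>     with rados.ReadOpCtx() as op:
>>>>         keys, _ = ioctx.get_omap_keys(op, start_after, PAGE)
>>>>         ioctx.operate_read_op(op, '.dir.5926.3')
>>>>         page = [k for k, _ in keys]
>>>>     if not page:
>>>>         break
>>>>     # ... process this batch; each page is a separate op, so the PG lock
>>>>     # is released between pages and other requests can interleave.
>>>>     start_after = page[-1]
>>>>
>>>> ioctx.close()
>>>> cluster.shutdown()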
>>>>
>>>> Although I just realized that my vague memory of deep scrubbing working
>>>> better might be based on improvements that only got in for Firefly...
>>>> not sure.
>>>> -Greg
>>>
>>>
>>>
>>>
>>> --
>>> erdem agaoglu
>>>
>
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



