Re: Some long running ops may lock osd

Thank you folks for bringing that up. I have some questions about sharding. We'd like blind buckets too; good to hear it's at least on the roadmap. For the current sharded implementation, what are the final details? Is the number of shards defined per bucket or globally? Is there a way to split existing indexes into shards?
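
From what I can tell so far it looks like a single global knob that only applies at bucket creation time. A minimal sketch of what I mean, assuming the option name I found in the Hammer docs is right, and with a placeholder gateway section name (please correct me if the per-bucket case works differently):

# ceph.conf on the radosgw node; only affects buckets created afterwards
[client.radosgw.gateway]
    rgw override bucket index max shards = 8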

On the other hand, what I'd like to point out here is not necessarily specific to large bucket indexes. The problem is the mechanism around the thread pools. Any request may require a lock on a pg, and that should not block requests for other pgs. I'm no expert, but perhaps the threads could requeue requests destined for a locked pg and process requests for other pgs in the meantime, or maybe a thread-per-pg design would be possible. It is somewhat acceptable not to be able to do anything with a locked resource; you can then go and improve your processing or your locking. But it's a whole different problem when a locked pg blocks requests for a few hundred other pgs in other pools for no good reason.
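
Incidentally, if I read the newer releases right, the op work queue was later split into shards with their own threads, which sounds close to what I have in mind. A rough sketch of the knobs I mean, with names and values taken from my reading of the Giant-era docs, so treat them as assumptions rather than advice:

# [osd] section; shards the op queue so a busy pg only stalls its own shard
osd op num shards = 5
osd op num threads per shard = 2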

On Tue, Mar 3, 2015 at 5:43 AM, Ben Hines <bhines@xxxxxxxxx> wrote:
Blind-bucket would be perfect for us, as we don't need to list the objects.

We only need to list the bucket when doing a bucket deletion. If we
could clean out/delete all objects in a bucket (without
iterating/listing them), that would be ideal.
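
(For what it's worth, the closest thing I know of is the admin tool's purge
flag, which as far as I understand still walks the index internally, so it
only hides the listing from us rather than avoiding it:

radosgw-admin bucket rm --bucket=<bucket-name> --purge-objects )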

On Mon, Mar 2, 2015 at 7:34 PM, GuangYang <yguang11@xxxxxxxxxxx> wrote:
> We have had good experience so far keeping each bucket below 0.5 million objects by client-side sharding. But I think it would be good for you to test at your scale, with your hardware configuration and your expectations for tail latency.
>
> Generally, bucket sharding should help, both with write throughput and with *stalls during recovery/scrubbing*, but it comes with a price: with X shards per bucket, listing/trimming becomes X times as heavy from the OSD load's point of view. There was discussion about implementing: 1) blind buckets (for use cases where bucket listing is not needed), and 2) unordered listing, which could improve the problem I mentioned above. They are on the roadmap...
>
> Thanks,
> Guang
>
>
> ----------------------------------------
>> From: bhines@xxxxxxxxx
>> Date: Mon, 2 Mar 2015 18:13:25 -0800
>> To: erdem.agaoglu@xxxxxxxxx
>> CC: ceph-users@xxxxxxxxxxxxxx
>> Subject: Re: Some long running ops may lock osd
>>
>> We're seeing a lot of this as well (as I mentioned to Sage at SCALE).
>> Is there a rule of thumb at all for how big it is safe to let an RGW
>> bucket get?
>>
>> Also, is this theoretically resolved by the new bucket-sharding
>> feature in the latest dev release?
>>
>> -Ben
>>
>> On Mon, Mar 2, 2015 at 11:08 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> wrote:
>>> Hi Gregory,
>>>
>>> To be precise, we are not using listomapkeys that way, or in any way at
>>> all. I used it here just to reproduce the behavior/issue.
>>>
>>> What I am really interested in is whether deep-scrubbing actually mitigates
>>> the problem, and/or whether there is something that can be improved further.
>>>
>>> Or I guess we should go upgrade now and hope for the best :)
>>>
>>> On Mon, Mar 2, 2015 at 8:10 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>>>
>>>> On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx>
>>>> wrote:
>>>>> Hi all, especially devs,
>>>>>
>>>>> We have recently pinpointed one of the causes of slow requests in our
>>>>> cluster. It seems deep-scrubs on pgs that contain the index file for a
>>>>> large radosgw bucket lock up the osds. Increasing op threads and/or disk
>>>>> threads helps a little bit, but we would need to increase them beyond
>>>>> reason to get rid of the problem completely. A somewhat similar (and more
>>>>> severe) version of the issue occurs when we call listomapkeys on the
>>>>> index file, and since the deep-scrubbing logs were much harder to read,
>>>>> this inspection was based on listomapkeys.
>>>>>
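>>>>> (The tuning I mean is just bumping the thread pools at runtime, roughly as
>>>>> below; the values are simply ones we tried, not a recommendation:
>>>>>
>>>>> ceph tell osd.* injectargs '--osd_op_threads 8 --osd_disk_threads 2'
>>>>>
>>>>> with the matching settings in ceph.conf so they survive restarts.)
>>>>>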
>>>>> In this example osd.121 is the primary of pg 10.c91, which contains the
>>>>> file .dir.5926.3 in the .rgw.buckets pool. The OSD has 2 op threads and
>>>>> the bucket contains ~500k objects. A standard listomapkeys call takes
>>>>> about 3 seconds.
>>>>>
>>>>> time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
>>>>> real 0m2.983s
>>>>> user 0m0.760s
>>>>> sys 0m0.148s
>>>>>
>>>>> In order to lock up the osd, we issue 2 of these requests simultaneously
>>>>> with something like:
>>>>>
>>>>> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>>>>> sleep 1
>>>>> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>>>>>
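>>>>> (Another way to observe this, besides the debug logs, is the osd admin
>>>>> socket, which should show the second request still sitting in the queue:
>>>>>
>>>>> ceph daemon osd.121 dump_ops_in_flight
>>>>>
>>>>> though everything below is based on the debug logs.)
>>>>>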
>>>>> With 'debug_osd=30', the logs show the flow as follows:
>>>>>
>>>>> At t0, some thread enqueue_op's my omap-get-keys request.
>>>>> Op-Thread A locks pg 10.c91, dequeue_op's it and starts reading ~500k keys.
>>>>> Op-Thread B responds to several other requests during that 1-second sleep;
>>>>> they're generally extremely fast subops on other pgs.
>>>>> At t1 (about a second later) my second omap-get-keys request gets
>>>>> enqueue_op'ed, but it does not start, probably because of the lock held by
>>>>> Thread A.
>>>>> After that point other threads enqueue_op further requests on other pgs
>>>>> too, but none of them starts processing, which is the point at which I
>>>>> consider the osd locked.
>>>>> At t2 (about another second later) my first omap-get-keys request finishes.
>>>>> Op-Thread B locks pg 10.c91, dequeue_op's my second request and starts
>>>>> reading ~500k keys again.
>>>>> Op-Thread A continues to process the requests enqueued between t1 and t2.
>>>>>
>>>>> It seems Op-Thread B is waiting on the lock held by Op-Thread A even
>>>>> though it could be processing requests for other pgs just fine.
>>>>>
>>>>> My guess is that a somewhat larger version of this scenario happens during
>>>>> deep-scrubbing, e.g. on the pg containing the index for a bucket of >20M
>>>>> objects. A disk/op thread starts reading through the omap, which will
>>>>> take, say, 60 seconds. During the first seconds other requests for other
>>>>> pgs pass just fine, but within those 60 seconds there are bound to be more
>>>>> requests for the same pg, especially since it holds the index file. Each
>>>>> of those requests ties up another disk/op thread, to the point where there
>>>>> are no free threads left to process requests for any pg, causing slow
>>>>> requests.
>>>>>
>>>>> So first of all, thanks if you made it this far, and sorry for the
>>>>> involved mail; I'm exploring the problem as I go.
>>>>> Now, is the deep-scrubbing situation I tried to theorize even possible? If
>>>>> not, can you point us to where to look further?
>>>>> We are currently running 0.72.2 and know about the newer ioprio settings
>>>>> in Firefly and such (sketched below). We are planning to upgrade in a few
>>>>> weeks, but I don't think those options will help us in any way. Am I
>>>>> correct?
>>>>> Are there any other improvements that we are not aware of?
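>>>>>
>>>>> (To be concrete, the settings I mean are the Firefly disk-thread ioprio
>>>>> options, which as far as I know only take effect with the CFQ scheduler;
>>>>> the names below are from my reading of the release notes:
>>>>>
>>>>> osd disk thread ioprio class = idle
>>>>> osd disk thread ioprio priority = 7
>>>>>
>>>>> so deep-scrub reads would at least yield to client I/O at the disk level.)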
>>>>
>>>> This is all basically correct; it's one of the reasons you don't want
>>>> to let individual buckets get too large.
>>>>
>>>> That said, I'm a little confused about why you're running listomapkeys
>>>> that way. RGW throttles itself by getting only a certain number of
>>>> entries at a time (1000?) and any system you're also building should
>>>> do the same. That would reduce the frequency of any issues, and I
>>>> *think* that scrubbing has some mitigating factors to help (although
>>>> maybe not; it's been a while since I looked at any of that stuff).
>>>>
>>>> Although I just realized that my vague memory of deep scrubbing
>>>> working better might be based on improvements that only got in for
>>>> firefly...not sure.
>>>> -Greg
>>>
>>>
>>>
>>>
>>> --
>>> erdem agaoglu
>>>
>



--
erdem agaoglu
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
