Re: Some long running ops may lock osd

Ben Hines <bhines@xxxxxxxxx> · Mon, 2 Mar 2015 18:13:25 -0800

We're seeing a lot of this as well (as I mentioned to Sage at SCALE).
Is there a rule of thumb for how large it is safe to let an RGW bucket
get?

Also, is this theoretically resolved by the new bucket-sharding
feature in the latest dev release?

-Ben

On Mon, Mar 2, 2015 at 11:08 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> wrote:
> Hi Gregory,
>
> We are not using listomapkeys that way, or in any way, to be precise. I
> used it here only to reproduce the behavior/issue.
>
> What I am really interested in is whether deep-scrubbing actually
> mitigates the problem, and/or whether there is something that can be
> improved further.
>
> Or I guess we should just go upgrade now and hope for the best :)
>
> On Mon, Mar 2, 2015 at 8:10 PM, Gregory Farnum <greg@xxxxxxxxxxx> wrote:
>>
>> On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx>
>> wrote:
>> > Hi all, especially devs,
>> >
>> > We have recently pinpointed one of the causes of slow requests in our
>> > cluster. It seems deep-scrubs on PGs that contain the index file for a
>> > large radosgw bucket lock the OSDs. Increasing op threads and/or disk
>> > threads helps a little, but we would need to increase them beyond
>> > reason to get rid of the problem completely. A somewhat similar (and
>> > more severe) version of the issue occurs when we call listomapkeys on
>> > the index file, and since the logs for deep-scrubbing were much harder
>> > to read, this inspection is based on listomapkeys.
>> >
>> > In this example osd.121 is the primary of pg 10.c91, which contains the
>> > file .dir.5926.3 in the .rgw.buckets pool. The OSD has 2 op threads and
>> > the bucket contains ~500k objects. A standard listomapkeys call takes
>> > about 3 seconds.
>> >
>> > time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
>> > real 0m2.983s
>> > user 0m0.760s
>> > sys 0m0.148s
>> >
>> > In order to lock the OSD we run two of them concurrently, with
>> > something like:
>> >
>> > rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>> > sleep 1
>> > rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>> >
>> > 'debug_osd=30' logs show the flow like:
>> >
>> > At t0 some thread enqueue_op's my omap-get-keys request.
>> > Op-Thread A locks pg 10.c91, dequeue_op's the request and starts
>> > reading ~500k keys.
>> > Op-Thread B responds to several other requests during the 1 second
>> > sleep. They're generally extremely fast subops on other PGs.
>> > At t1 (about a second later) my second omap-get-keys request gets
>> > enqueue_op'ed, but it does not start, probably because of the lock
>> > held by Thread A.
>> > After that point other threads enqueue_op other requests on other PGs
>> > too, but none of them starts processing; at this point I consider the
>> > OSD locked.
>> > At t2 (about another second later) my first omap-get-keys request
>> > finishes.
>> > Op-Thread B locks pg 10.c91, dequeue_op's my second request and starts
>> > reading ~500k keys again.
>> > Op-Thread A continues with the requests enqueued between t1 and t2.
>> >
>> > It seems Op-Thread B is waiting on the lock held by Op-Thread A when it
>> > could be processing requests for other PGs just fine.
>> >
>> > My guess is that a larger version of this scenario happens during
>> > deep-scrubbing, e.g. on the PG containing the index for a bucket of
>> > more than 20M objects. A disk/op thread starts reading through the
>> > omap, which will take, say, 60 seconds. During the first seconds,
>> > requests for other PGs pass just fine. But within 60 seconds there are
>> > bound to be other requests for the same PG, especially since it holds
>> > the index file. Each of these requests ties up another disk/op thread,
>> > to the point where there are no free threads left to process requests
>> > for any PG, causing slow requests.
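To make the thread-starvation hypothesis above concrete, here is a toy Python model (not Ceph code). It assumes a free op thread always takes the oldest queued op and then blocks on that op's PG lock, which is the behavior hypothesized above rather than something confirmed from the OSD source. The durations, and the PG names other than 10.c91, are made up for illustration.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    arrival: float   # time the op is enqueued (seconds)
    pg: str          # placement group the op targets
    duration: float  # processing time once the PG lock is held


def simulate(ops, num_threads):
    """Return the queueing delay (start - arrival) of each op, by arrival.

    Assumed model: a free thread dequeues the oldest op, then blocks until
    it can take that op's PG lock. A blocked thread serves nothing else.
    """
    thread_free = [0.0] * num_threads  # min-heap: when each thread frees up
    heapq.heapify(thread_free)
    pg_free = {}                       # PG -> time its lock is released
    delays = []
    for op in sorted(ops, key=lambda o: o.arrival):
        # The earliest-free thread picks up the op once it has arrived.
        dequeued = max(op.arrival, heapq.heappop(thread_free))
        # Processing starts only when the PG lock is available.
        start = max(dequeued, pg_free.get(op.pg, 0.0))
        finish = start + op.duration
        pg_free[op.pg] = finish
        heapq.heappush(thread_free, finish)
        delays.append(start - op.arrival)
    return delays


# Illustrative workload: two long omap reads on one PG, one second apart,
# plus short ops on other PGs (hypothetical names and durations).
WORKLOAD = [
    Op(0.0, "10.c91", 2.0),
    Op(0.5, "10.a", 0.01),
    Op(1.0, "10.c91", 2.0),
    Op(1.2, "10.b", 0.01),
    Op(1.5, "10.c", 0.01),
]

if __name__ == "__main__":
    for threads in (2, 3):
        delays = simulate(WORKLOAD, threads)
        print(threads, "threads:", [round(d, 2) for d in delays])
```

Under these assumptions, with 2 threads the short ops arriving between t1 and t2 have to wait until the first long read finishes, while with 3 threads they are served immediately. That would fit the observation that adding threads helps only until enough requests pile up on the same PG.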
>> >
>> > So first of all, thanks if you made it this far, and sorry for the
>> > involved mail; I'm exploring the problem as I go.
>> > Now, is the deep-scrubbing situation I tried to theorize even possible?
>> > If not, can you point us to where to look further?
>> > We are currently running 0.72.2 and know about the newer ioprio
>> > settings in Firefly and such. We are planning to upgrade in a few
>> > weeks, but I don't think those options will help us in any way. Am I
>> > correct?
>> > Are there any other improvements that we are not aware of?
>>
>> This is all basically correct; it's one of the reasons you don't want
>> to let individual buckets get too large.
>>
>> That said, I'm a little confused about why you're running listomapkeys
>> that way. RGW throttles itself by getting only a certain number of
>> entries at a time (1000?) and any system you're also building should
>> do the same. That would reduce the frequency of any issues, and I
>> *think* that scrubbing has some mitigating factors to help (although
>> maybe not; it's been a while since I looked at any of that stuff).
>>
>> Although I just realized that my vague memory of deep scrubbing
>> working better might be based on improvements that only got in for
>> firefly...not sure.
>> -Greg
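Greg's advice to fetch only a bounded number of entries per request (he guesses about 1000) can be sketched as a simple pager. This is a minimal illustration, not RGW's actual code: fetch_page is a hypothetical stand-in for whatever omap read call your client uses (librados omap reads accept a start-after key and a maximum count, as far as I know), and here it is backed by an in-memory list so the sketch runs without a cluster.

```python
import bisect


def iter_omap_keys(fetch_page, page_size=1000):
    """Yield every key, asking for at most page_size keys per request.

    fetch_page(start_after, max_return) must return, in sorted order, up to
    max_return keys that sort strictly after start_after.
    """
    start_after = ""
    while True:
        page = fetch_page(start_after, page_size)
        yield from page
        if len(page) < page_size:
            return  # a short page means the index is exhausted
        start_after = page[-1]


def make_fake_index(keys):
    """Hypothetical in-memory stand-in for a bucket index object."""
    keys = sorted(keys)
    calls = []  # records the start_after of every request, for inspection

    def fetch_page(start_after, max_return):
        calls.append(start_after)
        lo = bisect.bisect_right(keys, start_after)
        return keys[lo:lo + max_return]

    return fetch_page, calls


if __name__ == "__main__":
    fetch, calls = make_fake_index(f"obj{i:05d}" for i in range(2500))
    total = sum(1 for _ in iter_omap_keys(fetch))
    print(total, "keys in", len(calls), "requests")
```

The point of the design is that each request should only need the PG for the time it takes to read one page, so other ops can interleave between pages instead of waiting for a single read of the whole index.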
>
>
>
>
> --
> erdem agaoglu
>
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>