Re: Some long running ops may lock osd

Gregory Farnum <greg@xxxxxxxxxxx> · Mon, 2 Mar 2015 10:10:34 -0800

On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> wrote:
> Hi all, especially devs,
>
> We have recently pinpointed one of the causes of slow requests in our
> cluster. It seems deep-scrubs on pg's that contain the index file for a
> large radosgw bucket lock the osds. Increasing op threads and/or disk threads
> helps a little bit, but we need to increase them beyond reason in order to
> completely get rid of the problem. A somewhat similar (and more severe)
> version of the issue occurs when we call listomapkeys for the index file,
> and since the logs for deep-scrubbing were much harder to read, this inspection
> was based on listomapkeys.
>
> In this example osd.121 is the primary of pg 10.c91 which contains file
> .dir.5926.3 in .rgw.buckets pool. OSD has 2 op threads. Bucket contains
> ~500k objects. A standard listomapkeys call takes about 3 seconds.
>
> time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
> real 0m2.983s
> user 0m0.760s
> sys 0m0.148s
>
> In order to lock the osd we request 2 of them simultaneously with something
> like:
>
> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
> sleep 1
> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>
> 'debug_osd=30' logs show the flow like:
>
> At t0 some thread enqueue_op's my omap-get-keys request.
> Op-Thread A locks pg 10.c91 and dequeue_op's it and starts reading ~500k
> keys.
> Op-Thread B responds to several other requests during that 1 second sleep.
> They're generally extremely fast subops on other pgs.
> At t1 (about a second later) my second omap-get-keys request gets
> enqueue_op'ed. But it does not start probably because of the lock held by
> Thread A.
> After that point other threads enqueue_op other requests on other pgs too
> but none of them starts processing, which is why i consider the osd locked.
> At t2 (about another second later) my first omap-get-keys request is
> finished.
> Op-Thread B locks pg 10.c91 and dequeue_op's my second request and starts
> reading ~500k keys again.
> Op-Thread A continues to process the requests enqueued in t1-t2.
>
> It seems Op-Thread B is waiting on the lock held by Op-Thread A even though
> it could process requests for other pg's just fine.
>
> My guess is a somewhat larger scenario happens in deep-scrubbing, like on
> the pg containing index for the bucket of >20M objects. A disk/op thread
> starts reading through the omap which will take say 60 seconds. During the
> first seconds, other requests for other pgs pass just fine. But in 60
> seconds there are bound to be other requests for the same pg, especially
> since it holds the index file. Each of these requests locks another disk/op
> thread to the point where there are no free threads left to process any
> requests for any pg, causing slow requests.
>
> So first of all thanks if you made it this far, and sorry for the involved
> mail, i'm exploring the problem as i go.
> Now, is that deep-scrubbing situation i tried to theorize even possible? If
> not can you point us where to look further.
> We are currently running 0.72.2 and know about newer ioprio settings in
> Firefly and such. We are planning to upgrade in a few weeks, but i
> don't think those options will help us in any way. Am i correct?
> Are there any other improvements that we are not aware of?

This is all basically correct; it's one of the reasons you don't want
to let individual buckets get too large.
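
The starvation in the quoted trace can be reproduced with a toy queueing
model. This is only a sketch, not OSD code: it assumes a single FIFO op
queue, that a free op thread always takes the head of the queue, and that the
thread then waits for that op's pg lock. The function name, the second pg id
"10.aaa" and the timings are made up for illustration.

```python
from typing import Dict, List, Tuple

Op = Tuple[int, str, int]  # (arrival_ms, pg_id, service_ms)


def queue_waits(ops: List[Op], n_threads: int) -> List[int]:
    """Return, per op in arrival order, the ms between arrival and start of work."""
    thread_free = [0] * n_threads   # time at which each op thread becomes idle
    pg_free: Dict[str, int] = {}    # time at which each pg lock is released
    waits = []
    for arrival, pg, service in sorted(ops, key=lambda op: op[0]):
        # The earliest-idle thread dequeues the head of the FIFO queue.
        t = min(range(n_threads), key=thread_free.__getitem__)
        dequeued = max(arrival, thread_free[t])
        # The thread now blocks until the pg lock is free; it cannot serve
        # any other pg while it waits.
        start = max(dequeued, pg_free.get(pg, 0))
        end = start + service
        thread_free[t] = end
        pg_free[pg] = end
        waits.append(start - arrival)
    return waits


# Two long omap reads on pg 10.c91, plus short ops on another (hypothetical) pg.
ops = [
    (0, "10.c91", 2000),     # first listomapkeys
    (500, "10.aaa", 10),     # short op, served immediately
    (1000, "10.c91", 2000),  # second listomapkeys, blocks a thread on the pg lock
    (1200, "10.aaa", 10),    # short op on an idle pg, but no thread is free
]
print(queue_waits(ops, n_threads=2))
```

With two threads the last op waits even though its pg is idle. In this small
example a third thread removes that wait, but each additional long request on
the busy pg ties up one more thread, which fits the observation that raising
the thread count only helps a little.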

That said, I'm a little confused about why you're running listomapkeys
that way. RGW throttles itself by getting only a certain number of
entries at a time (1000?) and any system you're also building should
do the same. That would reduce the frequency of any issues, and I
*think* that scrubbing has some mitigating factors to help (although
maybe not; it's been a while since I looked at any of that stuff).
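
A minimal sketch of that kind of paging, so each request holds the pg lock
only for one page. The page size of 1000 is the half-remembered figure from
above, not a confirmed RGW constant. `fetch_page` is a hypothetical callable
standing in for whatever omap-get-keys call your client makes with a
"start after" key and a maximum count. The in-memory index exists only so the
example runs without a cluster.

```python
import bisect
from typing import Callable, Iterable, Iterator, List

# Hypothetical signature: fetch_page(start_after, max_return) -> sorted keys
FetchPage = Callable[[str, int], List[str]]


def iter_omap_keys(fetch_page: FetchPage, page_size: int = 1000) -> Iterator[str]:
    """Yield all keys, asking for at most page_size keys per request."""
    start_after = ""  # empty string means "start from the first key"
    while True:
        page = fetch_page(start_after, page_size)
        yield from page
        if len(page) < page_size:
            return  # a short page means the end of the key space was reached
        start_after = page[-1]  # resume strictly after the last key seen


def make_fake_index(keys: Iterable[str]) -> FetchPage:
    """In-memory stand-in for an omap; not a real rados call."""
    ordered = sorted(keys)

    def fetch_page(start_after: str, max_return: int) -> List[str]:
        lo = bisect.bisect_right(ordered, start_after)
        return ordered[lo:lo + max_return]

    return fetch_page


fake = make_fake_index(f"obj{i:05d}" for i in range(2500))
keys = list(iter_omap_keys(fake, page_size=1000))
print(len(keys))
```

Between pages the op thread is free to serve other requests, so a slow
listing costs many short lock holds instead of one long one.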

Although I just realized that my vague memory of deep scrubbing
working better might be based on improvements that only got in for
firefly...not sure.
-Greg