Some long-running ops may lock OSD

Hi all, especially devs,

We have recently pinpointed one of the causes of slow requests in our cluster. It seems deep-scrubs on PGs that contain the index file for a large radosgw bucket lock up the OSDs. Increasing op threads and/or disk threads helps a little, but we would need to increase them beyond reason to get rid of the problem completely. A somewhat similar (and more severe) version of the issue occurs when we call listomapkeys on the index file, and since the logs for deep-scrubbing were much harder to read, the inspection below is based on listomapkeys.
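For reference, we bump the thread counts at runtime with something like the following; the values are only examples, not a recommendation:

# example only: raise op/disk threads on all OSDs at runtime
ceph tell osd.* injectargs '--osd-op-threads 8'
ceph tell osd.* injectargs '--osd-disk-threads 2'
# (and the matching 'osd op threads' / 'osd disk threads' lines in the [osd] section of ceph.conf to persist)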

In this example osd.121 is the primary of pg 10.c91, which contains the file .dir.5926.3 in the .rgw.buckets pool. The OSD has 2 op threads. The bucket contains ~500k objects. A standard listomapkeys call takes about 3 seconds:

time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
real 0m2.983s
user 0m0.760s
sys 0m0.148s

In order to lock the OSD we issue 2 of them simultaneously with something like:

rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
sleep 1
rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &

With 'debug_osd=30', the logs show the flow roughly like this:

At t0 some thread enqueue_op's my omap-get-keys request.
Op-Thread A locks pg 10.c91, dequeue_op's the request and starts reading ~500k keys.
Op-Thread B responds to several other requests during that 1-second sleep. They are generally extremely fast subops on other PGs.
At t1 (about a second later) my second omap-get-keys request gets enqueue_op'ed, but it does not start, presumably because of the PG lock held by Op-Thread A.
After that point other threads enqueue_op requests on other PGs too, but none of them starts processing either, which is the point where I consider the OSD locked.
At t2 (about another second later) my first omap-get-keys request finishes.
Op-Thread B locks pg 10.c91, dequeue_op's my second request and starts reading ~500k keys again.
Op-Thread A continues to process the requests enqueued between t1 and t2.

It seems Op-Thread B is waiting on the PG lock held by Op-Thread A even though it could be serving requests for other PGs just fine.
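In case someone wants to reproduce this, the blocked state should also be visible over the admin socket while the two calls are running (assuming dump_ops_in_flight behaves the same on 0.72.2):

# check what osd.121 is doing while the second listomapkeys call is blocked
ceph --admin-daemon /var/run/ceph/ceph-osd.121.asok dump_ops_in_flight
# the second omap-get-keys op should show up as still waiting while the first is in progress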

My guess is that a somewhat larger version of this scenario happens during deep-scrubbing, e.g. on the PG containing the index for a bucket of >20M objects. A disk/op thread starts reading through the omap, which may take, say, 60 seconds. During the first seconds, requests for other PGs pass just fine. But over those 60 seconds there are bound to be other requests for the same PG, especially since it holds the index file. Each of these requests ties up another disk/op thread, to the point where there are no free threads left to process requests for any PG, causing slow requests.
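If that theory is right, the only workaround I can see for now is throttling the scrubs themselves, e.g. (assuming the nodeep-scrub flag and osd_max_scrubs work as I expect on 0.72.2):

# temporarily stop new deep-scrubs cluster-wide
ceph osd set nodeep-scrub
# and re-enable when convenient
ceph osd unset nodeep-scrub
# keep at most one concurrent scrub per OSD (the default)
ceph tell osd.* injectargs '--osd-max-scrubs 1'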

So first of all, thanks if you have made it this far, and sorry for the involved mail; I am exploring the problem as I go.
Now, is the deep-scrubbing situation I theorized above even possible? If not, can you point us to where to look further?
We are currently running 0.72.2 and know about the newer ioprio settings in Firefly and such (sketched below). We are planning to upgrade in a few weeks, but I don't think those options will help us here. Am I correct?
Are there any other improvements that we are not aware of?
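For reference, the Firefly options I mean are, as I understand them, the ones below; my doubt is that they only lower the IO priority of the disk thread (and only under the CFQ scheduler), and would not change the PG-lock behaviour described above:

[osd]
# lower the priority of the disk thread that runs deep-scrubs (requires CFQ on the data disks)
osd disk thread ioprio class = idle
osd disk thread ioprio priority = 7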

Regards,


--
erdem agaoglu