On Mon, Mar 2, 2015 at 7:56 AM, Erdem Agaoglu <erdem.agaoglu@xxxxxxxxx> wrote:
> Hi all, especially devs,
>
> We have recently pinpointed one of the causes of slow requests in our
> cluster. It seems deep-scrubs on pgs that contain the index file for a
> large radosgw bucket lock up the osds. Increasing op threads and/or
> disk threads helps a little bit, but we would need to increase them
> beyond reason to completely get rid of the problem. A somewhat similar
> (and more severe) version of the issue occurs when we call
> listomapkeys for the index file, and since the logs for deep-scrubbing
> were much harder to read, this inspection was based on listomapkeys.
>
> In this example osd.121 is the primary of pg 10.c91, which contains
> the file .dir.5926.3 in the .rgw.buckets pool. The OSD has 2 op
> threads and the bucket contains ~500k objects. A standard listomapkeys
> call takes about 3 seconds:
>
> time rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null
> real 0m2.983s
> user 0m0.760s
> sys 0m0.148s
>
> In order to lock the osd we issue 2 of them simultaneously with
> something like:
>
> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
> sleep 1
> rados -p .rgw.buckets listomapkeys .dir.5926.3 > /dev/null &
>
> 'debug_osd=30' logs show the flow like this:
>
> At t0 some thread enqueue_op's my omap-get-keys request.
> Op-Thread A locks pg 10.c91, dequeue_op's it and starts reading ~500k
> keys.
> Op-Thread B responds to several other requests during that 1-second
> sleep. They're generally extremely fast subops on other pgs.
> At t1 (about a second later) my second omap-get-keys request gets
> enqueue_op'ed, but it does not start, probably because of the lock
> held by Thread A.
> After that point other threads enqueue_op other requests on other pgs
> too, but none of them starts processing, at which point I consider the
> osd locked.
> At t2 (about another second later) my first omap-get-keys request
> finishes.
> Op-Thread B locks pg 10.c91, dequeue_op's my second request and starts
> reading ~500k keys again.
> Op-Thread A continues to process the requests enqueued between t1 and
> t2.
>
> It seems Op-Thread B was waiting on the lock held by Op-Thread A even
> though it could have processed other requests for other pgs just fine.
>
> My guess is that a larger version of this scenario happens during
> deep-scrubbing, e.g. on the pg containing the index for a bucket of
> >20M objects. A disk/op thread starts reading through the omap, which
> will take, say, 60 seconds. During the first seconds, other requests
> for other pgs pass just fine. But within those 60 seconds there are
> bound to be other requests for the same pg, especially since it holds
> the index file. Each of those requests ties up another disk/op thread,
> to the point where there are no free threads left to process requests
> for any pg, causing slow requests.
>
> So first of all, thanks if you made it this far, and sorry for the
> involved mail; I'm exploring the problem as I go.
> Now, is the deep-scrubbing situation I theorized above even possible?
> If not, can you point us at where to look further?
> We are currently running 0.72.2 and know about the newer ioprio
> settings in Firefly and later. We are planning to upgrade in a few
> weeks, but I don't think those options will help us in any way. Am I
> correct?
> Are there any other improvements that we are not aware of?

This is all basically correct; it's one of the reasons you don't want
to let individual buckets get too large. That said, I'm a little
confused about why you're running listomapkeys that way.
RGW throttles itself by fetching only a certain number of entries at a
time (1000?), and any system you're building on top should do the same.
That would reduce the frequency of these issues, and I *think* scrubbing
has some mitigating factors to help (although maybe not; it's been a
while since I looked at any of that code). Although I just realized that
my vague memory of deep scrubbing behaving better might be based on
improvements that only went in for Firefly... not sure.
-Greg
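
P.S. For reference, paging through the omap from the client side looks
roughly like the sketch below. It's only a minimal illustration using
the python-rados bindings (get_omap_keys/operate_read_op, assuming a
reasonably recent version of the bindings), not what RGW does
internally. The pool and object names are just the ones from your
example, and the batch size mirrors the paging mentioned above:

    #!/usr/bin/env python
    # Sketch: read a large bucket index's omap keys in small batches
    # instead of one giant listomapkeys call, so the PG isn't held for
    # the whole scan. Pool/object names are from the example in this
    # thread; adjust for your cluster.
    import rados

    POOL = '.rgw.buckets'        # pool holding the bucket index objects
    INDEX_OBJ = '.dir.5926.3'    # bucket index object from the example
    BATCH = 1000                 # keys per request

    cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
    cluster.connect()
    try:
        ioctx = cluster.open_ioctx(POOL)
        try:
            start_after = ''
            while True:
                with rados.ReadOpCtx() as read_op:
                    # Fetch at most BATCH keys, starting after the last
                    # key of the previous batch; each request is short,
                    # so other ops on the pg can interleave between them.
                    it, _ = ioctx.get_omap_keys(read_op, start_after, BATCH)
                    ioctx.operate_read_op(read_op, INDEX_OBJ)
                    keys = [k for k, _ in it]
                if not keys:
                    break
                for k in keys:
                    print(k)  # note: keys may be bytes on Python 3
                start_after = keys[-1]
        finally:
            ioctx.close()
    finally:
        cluster.shutdown()

The point is that no single request reads ~500k keys under the pg lock;
each one is a small read, which is the same reason RGW's own 1000-ish
entry paging keeps the osd responsive.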