.rgw.buckets.index is the pool with rgw's index objects, right? The actual on-disk directory for one of those pgs would contain only empty files -- the actual index data is stored in the osd's leveldb instance. I suspect your index objects are very large (because the buckets contain many objects) and are taking a long time to deep-scrub.

iirc, there is a way to make rgw split up those index objects into smaller ones (bucket index sharding); a rough sketch of what I'd look at is below.
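Something along these lines -- untested, the bucket name is just a placeholder, and iirc the index object is named .dir.<bucket_id>, so adjust to what you actually see in the pool:

    # how many objects the suspect bucket holds (num_objects in the output)
    radosgw-admin bucket stats --bucket=somebucket

    # get the bucket_id/marker, then count the omap keys on its index
    # object -- that's the data a deep-scrub has to walk, and it doesn't
    # show up in the 12KB of files you see on disk
    radosgw-admin metadata get bucket:somebucket
    rados -p .rgw.buckets.index ls | grep <bucket_id>
    rados -p .rgw.buckets.index listomapkeys .dir.<bucket_id> | wc -l

    # index sharding for newly created buckets, in ceph.conf on the rgw
    # nodes (e.g. under the [client.radosgw.*] section):
    rgw_override_bucket_index_max_shards = 16

Note that (again iirc) the shard count only applies to buckets created after the change -- on hammer/jewel an existing bucket keeps its single index object, so the really big ones may need to be recreated to benefit.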
-Sam

On Tue, Jun 21, 2016 at 11:58 AM, Trygve Vea <trygve.vea@xxxxxxxxxxxxxxxxxx> wrote:
> Hi,
>
> I believe I've stumbled on a bug in Ceph, and I'm currently trying to figure out if this is a new bug, some behaviour caused by our cluster being in the midst of a hammer (0.94.6) -> jewel (10.2.2) upgrade, or other factors.
>
> The state of the cluster at the time of the incident:
>
> - All monitor nodes are running 10.2.2.
> - One OSD-server (4 osds) is up with 10.2.2 and with all pgs active+clean.
> - One OSD-server (4 osds) is up with 10.2.2 and undergoing backfills (however, nobackfill was set, as we only let backfills run during the night).
>
> We have 4 OSD-servers with 4 osds each running 0.94.6.
> We have 3 OSD-servers with 2 osds each running 0.94.6.
>
> We experienced something that heavily affected our RGW users. Some requests interfacing with 0.94.6 nodes were slow.
>
> During a 10-minute window, our RGW nodes ran out of available workers and ceased to respond.
>
> Some nodes logged lines like these (only 0.94.6 nodes):
>
> 2016-06-21 09:51:08.053886 7f54610d8700 0 log_channel(cluster) log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 74.368036 secs
> 2016-06-21 09:51:08.053951 7f54610d8700 0 log_channel(cluster) log [WRN] : slow request 30.056333 seconds old, received at 2016-06-21 09:50:37.997327: osd_op(client.9433496.0:1089298249 somergwuser.buckets [call user.set_buckets_info] 12.da8df901 ondisk+write+known_if_redirected e9906) currently waiting for rw locks
>
> Some nodes logged lines like these (there was some, but not 100%, overlap between the osds that logged these and the aforementioned lines -- again, only 0.94.6 nodes):
>
> 2016-06-21 09:51:48.677474 7f8cb6628700 0 log_channel(cluster) log [WRN] : 2 slow requests, 1 included below; oldest blocked for > 42.033650 secs
> 2016-06-21 09:51:48.677565 7f8cb6628700 0 log_channel(cluster) log [WRN] : slow request 30.371173 seconds old, received at 2016-06-21 09:51:18.305770: osd_op(client.9525441.0:764274789 gc.1164 [call lock.lock] 7.7b4f1779 ondisk+write+known_if_redirected e9906) currently waiting for subops from 40,50
>
> All of the osds that logged these lines were waiting for subops from osd.50.
>
> Investigating what was going on on this osd during that window:
>
> 2016-06-21 09:48:22.064630 7f1cbb41d700 0 log_channel(cluster) log [INF] : 5.b5 deep-scrub starts
> 2016-06-21 09:59:56.640012 7f1c90163700 0 -- 10.21.9.22:6800/2003521 >> 10.20.9.21:6805/7755 pipe(0x1e47a000 sd=298 :39448 s=2 pgs=23 cs=1 l=0 c=0x1033ba20).fault with nothing to send, going to standby
> 2016-06-21 09:59:56.997763 7f1c700f8700 0 -- 10.21.9.22:6808/3521 >> 10.21.9.12:0/1028533 pipe(0x1f30f000 sd=87 :6808 s=0 pgs=0 cs=0 l=1 c=0x743c840).accept replacing existing (lossy) channel (new one lossy=1)
> 2016-06-21 10:00:39.938700 7f1cd9828700 0 log_channel(cluster) log [WRN] : 33 slow requests, 33 included below; oldest blocked for > 727.862759 secs
> 2016-06-21 10:00:39.938708 7f1cd9828700 0 log_channel(cluster) log [WRN] : slow request 670.918857 seconds old, received at 2016-06-21 09:49:29.019653: osd_op(client.9403437.0:1209613500 TZ1A91MYDE1LO63AQCM3 [getxattrs,stat] 9.442585e6 ack+read+known_if_redirected e9906) currently no flag points reached
> 2016-06-21 10:00:39.938800 7f1cd9828700 0 log_channel(cluster) log [WRN] : slow request 689.815851 seconds old, received at 2016-06-21 09:49:10.122660: osd_op(client.9403437.0:1209611533 TZ1A91MYDE1LO63AQCM3 [getxattrs,stat] 9.442585e6 ack+read+known_if_redirected e9906) currently no flag points reached
> 2016-06-21 10:00:39.938807 7f1cd9828700 0 log_channel(cluster) log [WRN] : slow request 670.895353 seconds old, received at 2016-06-21 09:49:29.043158: osd_op(client.9403437.0:1209613505 prod.arkham [call version.read,getxattrs,stat] 2.4da23de6 ack+read+known_if_redirected e9906) currently no flag points reached
> 2016-06-21 10:00:39.938810 7f1cd9828700 0 log_channel(cluster) log [WRN] : slow request 688.612303 seconds old, received at 2016-06-21 09:49:11.326207: osd_op(client.20712623.0:137251515 TZ1A91MYDE1LO63AQCM3 [getxattrs,stat] 9.442585e6 ack+read+known_if_redirected e9906) currently no flag points reached
> 2016-06-21 10:00:39.938813 7f1cd9828700 0 log_channel(cluster) log [WRN] : slow request 658.605163 seconds old, received at 2016-06-21 09:49:41.333348: osd_op(client.20712623.0:137254412 TZ1A91MYDE1LO63AQCM3 [getxattrs,stat] 9.442585e6 ack+read+known_if_redirected e9906) currently no flag points reached
> 2016-06-21 10:00:39.960300 7f1cbb41d700 0 log_channel(cluster) log [INF] : 5.b5 deep-scrub ok
>
> Looking at the contents of 5.b5 (which is in our .rgw.buckets.index pool, if relevant), it is almost empty (12KB of files on the disk), so I find it unlikely that a scrub would take that long. Which is why I suspect we've run into a bug.
>
> With the information I've provided here, can anyone shed some light on what this may be? And if it's a bug that is not fixed in HEAD, what information would be useful to include in a bug report?
>
>
> Regards
> --
> Trygve Vea
> _______________________________________________
> ceph-users mailing list
> ceph-users@xxxxxxxxxxxxxx
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
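As for what would be useful to capture for a bug report: if this happens again, I'd grab the op dumps from the blocking osd's admin socket while the requests are piling up, and you can stop scrubs cluster-wide while you look around. Nothing exotic here, just the stock commands (run the daemon ones on the host that carries osd.50), so treat this as a general sketch rather than anything version-specific:

    # what osd.50 is actually stuck on, captured while it's happening
    ceph daemon osd.50 dump_ops_in_flight
    ceph daemon osd.50 dump_historic_ops

    # stop new (deep-)scrubs cluster-wide while you investigate
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # and remember to re-enable them afterwards
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub

That output, together with the osd log covering the deep-scrub window, should make it much easier to tell whether the scrub of the big index object is really what's blocking the other requests.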