On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>
> On 06/14/2017 05:59 AM, Dan van der Ster wrote:
>>
>> Dear ceph users,
>>
>> Today we had O(100) slow requests which were caused by deep-scrubbing
>> of the metadata log:
>>
>> 2017-06-14 11:07:55.373184 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
>> deep-scrub starts
>> ...
>> 2017-06-14 11:22:04.143903 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
>> request 480.140904 seconds old, received at 2017-06-14
>> 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
>> meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
>> 0=[] ondisk+write+known_if_redirected e7752) currently waiting for
>> scrub
>> ...
>> 2017-06-14 11:22:06.729306 osd.155
>> [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
>> deep-scrub ok
>>
>> We have log_meta: true, log_data: false on this (our only) region [1],
>> which IIRC we set up to enable indexless buckets.
>>
>> I'm obviously unfamiliar with rgw meta and data logging, and have a
>> few questions:
>>
>> 1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
>> it off when not using multisite?
>
> It's a good idea to turn that off, yes.
>
> First, make sure that you have configured a default realm/zonegroup/zone:
>
> $ radosgw-admin realm default --rgw-realm <realm name> (you can
> determine the realm name from 'radosgw-admin realm list')
> $ radosgw-admin zonegroup default --rgw-zonegroup default
> $ radosgw-admin zone default --rgw-zone default

Thanks. This had already been done, as confirmed with radosgw-admin
realm get-default.

> Then you can modify the zonegroup (aka region):
>
> $ radosgw-admin zonegroup get > zonegroup.json
> $ sed -i 's/log_meta": "true/log_meta": "false/' zonegroup.json
> $ radosgw-admin zonegroup set < zonegroup.json
>
> Then commit the updated period configuration:
>
> $ radosgw-admin period update --commit
>
> Verify that the resulting period contains "log_meta": "false". Take
> care with future radosgw-admin commands on the zone/zonegroup, as they
> may revert log_meta back to true [1].

Great, this worked. FYI (and for others trying this in future): the
period update --commit blocks all rgws for ~30s while they reload the
realm.

>> 2. I started dumping the output of radosgw-admin mdlog list, and
>> cancelled it after a few minutes. It had already dumped 3GB of json
>> and I don't know how much more it would have written. Is something
>> supposed to be trimming the mdlog automatically?
>
> There is automated mdlog trimming logic in master, but not
> jewel/kraken. And this logic won't be triggered if there is only one
> zone [2].
>
>> 3. ceph df doesn't show the space occupied by omap objects -- is
>> there an indirect way to see how much space these are using?
>
> You can inspect the osd's omap directory:
>
> $ du -sh /var/lib/ceph/osd/osd0/current/omap

Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
~100-300MB on other OSDs.

>> 4. mdlog status has markers going back to 2016-10, see [2]. I suppose
>> we're not using this feature correctly? :-/
>>
>> 5. Suppose I were to set log_meta: false -- how would I delete these
>> log entries now that they are not needed?
>
> There is a 'radosgw-admin mdlog trim' command that can be used to trim
> them one --shard-id (from 0 to 63) at a time. An entire log shard can
> be trimmed with:
>
> $ radosgw-admin mdlog trim --shard-id 0 --period
> 8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1
>
> *However*, there is a risk that bulk operations on large omaps will
> affect cluster health by taking down OSDs. Not only can this bulk
> deletion take long enough to trigger the osd/filestore suicide
> timeouts, the resulting leveldb compaction after deletion is likely to
> block other omap operations and hit the timeouts as well. This seems
> likely in your case, based on the fact that you're already having
> issues with scrub.

We did this directly on shard 54, and indeed the command is taking a
looong time (but with no slow requests or osds being marked down).
After 45 minutes, du is still 3.3GB, so I can't tell if it's
progressing. I see ~1000 _omap_rmkeys messages every ~2 seconds:

2017-06-19 16:57:34.347222 7fc602640700 15
filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys
24.1d_head/#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
2017-06-19 16:57:34.347319 7fc602640700 10 filestore oid:
#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
not skipping op, *spos 67765185.0.0
2017-06-19 16:57:34.347326 7fc602640700 10 filestore > header.spos 0.0.0
2017-06-19 16:57:34.347351 7fc602640700 15
filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys
24.1d_head/#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
2017-06-19 16:57:34.347373 7fc602640700 10 filestore oid:
#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
not skipping op, *spos 67765185.0.1
2017-06-19 16:57:34.347379 7fc602640700 10 filestore > header.spos 0.0.0
...

Does that look correct?
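An aside for anyone hitting this later: you can locate the PG and OSD
holding a given mdlog shard object with ceph osd map. This is just a
sketch -- substitute the name of your zone's log pool (i.e. whatever
pool 24 is in your cluster):

$ ceph osd map <log pool> meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54

In our case that shard lives in pg 24.1d on osd.155, per the logs above.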
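Also, since the omap directory won't shrink on disk until leveldb
actually compacts, du is probably the wrong progress indicator.
Counting the keys left on the shard object seems more direct, e.g.
(again substituting the log pool name):

$ rados -p <log pool> listomapkeys \
    meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 | wc -l

Re-running that every few minutes should show whether the trim is
advancing.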
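Once shard 54 finishes, the plan is to trim the remaining shards one at
a time with a loop along these lines (untested here beyond the single
shard; same period id and end time as in Casey's example), watching OSD
health between shards given the suicide-timeout warning above:

$ for shard in $(seq 0 63); do radosgw-admin mdlog trim \
    --shard-id $shard --period 8d4fcb63-c314-4f9a-b3b3-0e61719ec258 \
    --end-time 2020-1-1; done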
Thanks for all the help!

-- Dan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com