I'm now running the three relevant OSDs with that patch. (Recompiled,
replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
then restarted the osds). It's working quite well, trimming 10 entries
at a time instead of 1000, and no more timeouts. Do you think it would
be worth decreasing this hardcoded value in ceph proper?

-- Dan

On Wed, Jun 21, 2017 at 3:51 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> That patch looks reasonable. You could also try raising the values of
> osd_op_thread_suicide_timeout and filestore_op_thread_suicide_timeout
> on that osd in order to trim more at a time.
>
> On 06/21/2017 09:27 AM, Dan van der Ster wrote:
>>
>> Hi Casey,
>>
>> I managed to trim up all shards except for that big #54. The others
>> all trimmed within a few seconds.
>>
>> But 54 is proving difficult. It's still going after several days, and
>> now I see that the 1000-key trim is indeed causing osd timeouts. I've
>> manually compacted the relevant osd leveldbs, but haven't found any
>> way to speed up the trimming. It's now going at ~1-2Hz, so 1000 trims
>> per op locks things up for quite awhile.
>>
>> I'm thinking of running those ceph-osd's with this patch:
>>
>> # git diff
>> diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc
>> index 89745bb..7dcd933 100644
>> --- a/src/cls/log/cls_log.cc
>> +++ b/src/cls/log/cls_log.cc
>> @@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx, bufferlist *in, bufferlist *o
>>      to_index = op.to_marker;
>>    }
>>
>> -#define MAX_TRIM_ENTRIES 1000
>> +#define MAX_TRIM_ENTRIES 10
>>    size_t max_entries = MAX_TRIM_ENTRIES;
>>
>>    int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix, max_entries, &keys);
>>
>> What do you think?
>>
>> -- Dan
>>
>> On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>>>
>>> Hi Dan,
>>>
>>> That's good news that it can remove 1000 keys at a time without hitting
>>> timeouts. The output of 'du' will depend on when the leveldb compaction
>>> runs. If you do find that compaction leads to suicide timeouts on this
>>> osd (you would see a lot of 'leveldb:' output in the log), consider
>>> running offline compaction by adding 'leveldb compact on mount = true'
>>> to the osd config and restarting.
>>>
>>> Casey
>>>
>>> On 06/19/2017 11:01 AM, Dan van der Ster wrote:
>>>>
>>>> On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>>>>>
>>>>> On 06/14/2017 05:59 AM, Dan van der Ster wrote:
>>>>>>
>>>>>> Dear ceph users,
>>>>>>
>>>>>> Today we had O(100) slow requests which were caused by deep-scrubbing
>>>>>> of the metadata log:
>>>>>>
>>>>>> 2017-06-14 11:07:55.373184 osd.155
>>>>>> [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
>>>>>> deep-scrub starts
>>>>>> ...
>>>>>> 2017-06-14 11:22:04.143903 osd.155
>>>>>> [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
>>>>>> request 480.140904 seconds old, received at 2017-06-14
>>>>>> 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
>>>>>> meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
>>>>>> 0=[] ondisk+write+known_if_redirected e7752) currently waiting for
>>>>>> scrub
>>>>>> ...
>>>>>> 2017-06-14 11:22:06.729306 osd.155
>>>>>> [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
>>>>>> deep-scrub ok
>>>>>>
>>>>>> We have log_meta: true, log_data: false on this (our only) region [1],
>>>>>> which IIRC we setup to enable indexless buckets.
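Those log_meta / log_data flags live as per-zone entries in the zonegroup
("region") JSON, so a quick way to confirm what a cluster is actually
logging -- just a sketch, assuming a jewel-era radosgw-admin and that the
zonegroup in question is the default one -- is:

$ radosgw-admin zonegroup get | grep -E '"log_(meta|data)"'

The same keys appear in the zonegroup.json that gets edited further down
in this thread.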
>>>>>>
>>>>>> I'm obviously unfamiliar with rgw meta and data logging, and have a
>>>>>> few questions:
>>>>>>
>>>>>> 1. AFAIU, it is used by the rgw multisite feature. Is it safe to
>>>>>> turn it off when not using multisite?
>>>>>
>>>>> It's a good idea to turn that off, yes.
>>>>>
>>>>> First, make sure that you have configured a default realm/zonegroup/zone:
>>>>>
>>>>> $ radosgw-admin realm default --rgw-realm <realm name>
>>>>>   (you can determine realm name from 'radosgw-admin realm list')
>>>>> $ radosgw-admin zonegroup default --rgw-zonegroup default
>>>>> $ radosgw-admin zone default --rgw-zone default
>>>>>
>>>> Thanks. This had already been done, as confirmed with radosgw-admin
>>>> realm get-default.
>>>>
>>>>> Then you can modify the zonegroup (aka region):
>>>>>
>>>>> $ radosgw-admin zonegroup get > zonegroup.json
>>>>> $ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
>>>>> $ radosgw-admin zonegroup set < zonegroup.json
>>>>>
>>>>> Then commit the updated period configuration:
>>>>>
>>>>> $ radosgw-admin period update --commit
>>>>>
>>>>> Verify that the resulting period contains "log_meta": "false". Take
>>>>> care with future radosgw-admin commands on the zone/zonegroup, as they
>>>>> may revert log_meta back to true [1].
>>>>>
>>>> Great, this worked. FYI (and for others trying this in future), the
>>>> period update --commit blocks all rgws for ~30s while they reload the
>>>> realm.
>>>>
>>>>>> 2. I started dumping the output of radosgw-admin mdlog list, and
>>>>>> cancelled it after a few minutes. It had already dumped 3GB of json
>>>>>> and I don't know how much more it would have written. Is something
>>>>>> supposed to be trimming the mdlog automatically?
>>>>>
>>>>> There is automated mdlog trimming logic in master, but not
>>>>> jewel/kraken. And this logic won't be triggered if there is only one
>>>>> zone [2].
>>>>>
>>>>>> 3. ceph df doesn't show the space occupied by omap objects -- is
>>>>>> there an indirect way to see how much space these are using?
>>>>>
>>>>> You can inspect the osd's omap directory:
>>>>> du -sh /var/lib/ceph/osd/osd0/current/omap
>>>>>
>>>> Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
>>>> ~100-300MB on other OSDs.
>>>>
>>>>>> 4. mdlog status has markers going back to 2016-10, see [2]. I
>>>>>> suppose we're not using this feature correctly? :-/
>>>>>>
>>>>>> 5. Suppose I were to set log_meta: false -- how would I delete
>>>>>> these log entries now that they are not needed?
>>>>>
>>>>> There is a 'radosgw-admin mdlog trim' command that can be used to trim
>>>>> them one --shard-id (from 0 to 63) at a time. An entire log shard can
>>>>> be trimmed with:
>>>>>
>>>>> $ radosgw-admin mdlog trim --shard-id 0 --period
>>>>>   8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1
>>>>>
>>>>> *However*, there is a risk that bulk operations on large omaps will
>>>>> affect cluster health by taking down OSDs. Not only can this bulk
>>>>> deletion take long enough to trigger the osd/filestore suicide
>>>>> timeouts, the resulting leveldb compaction after deletion is likely to
>>>>> block other omap operations and hit the timeouts as well. This seems
>>>>> likely in your case, based on the fact that you're already having
>>>>> issues with scrub.
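If all 64 shards needed the same treatment, the per-shard command could be
wrapped in a loop. This is only a sketch, reusing the period id from this
thread, and the caveat above about bulk omap deletes applies to every
iteration:

$ for shard in $(seq 0 63); do
      radosgw-admin mdlog trim --shard-id $shard \
          --period 8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1
  done

Running the shards serially rather than in parallel keeps each omap
deletion (and the leveldb compaction that follows it) as small as possible.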
>>>> We did this directly on shard 54, and indeed the command is taking a
>>>> looong time (but with no slow requests or osds being marked down).
>>>> After 45 minutes, du is still 3.3GB, so I can't tell if it's
>>>> progressing. I see ~1000 _omap_rmkeys messages every ~2 seconds:
>>>>
>>>> 2017-06-19 16:57:34.347222 7fc602640700 15
>>>> filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys
>>>> 24.1d_head/#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
>>>> 2017-06-19 16:57:34.347319 7fc602640700 10 filestore oid:
>>>> #24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
>>>> not skipping op, *spos 67765185.0.0
>>>> 2017-06-19 16:57:34.347326 7fc602640700 10 filestore > header.spos 0.0.0
>>>> 2017-06-19 16:57:34.347351 7fc602640700 15
>>>> filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys
>>>> 24.1d_head/#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
>>>> 2017-06-19 16:57:34.347373 7fc602640700 10 filestore oid:
>>>> #24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
>>>> not skipping op, *spos 67765185.0.1
>>>> 2017-06-19 16:57:34.347379 7fc602640700 10 filestore > header.spos 0.0.0
>>>> ...
>>>>
>>>> Does that look correct?
>>>>
>>>> Thanks for all the help!
>>>>
>>>> -- Dan

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
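For anyone wondering whether a long trim like this is making progress:
rather than watching du (which largely moves only when leveldb compacts),
the omap keys remaining on the shard's log object can be counted directly.
A sketch -- the pool name below is a placeholder; take it from the zone's
log_pool setting:

$ radosgw-admin zone get | grep log_pool
$ rados -p <log_pool> listomapkeys \
      meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 | wc -l

Re-running the second command every minute or so should show the key count
dropping even while du on the osd's omap directory stays flat.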