I'm now running the three relevant OSDs with that patch. (Recompiled,
replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
then restarted the osds). It's working quite well, trimming 10 entries
at a time instead of 1000, and no more timeouts. Do you think it would
be worth decreasing this hardcoded value in ceph proper?

-- Dan

On Wed, Jun 21, 2017 at 3:51 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
> That patch looks reasonable. You could also try raising the values of
> osd_op_thread_suicide_timeout and filestore_op_thread_suicide_timeout
> on that osd in order to trim more at a time.
>
> On 06/21/2017 09:27 AM, Dan van der Ster wrote:
>>
>> Hi Casey,
>>
>> I managed to trim up all shards except for that big #54. The others
>> all trimmed within a few seconds.
>>
>> But 54 is proving difficult. It's still going after several days, and
>> now I see that the 1000-key trim is indeed causing osd timeouts. I've
>> manually compacted the relevant osd leveldbs, but haven't found any
>> way to speed up the trimming. It's now going at ~1-2Hz, so 1000 trims
>> per op locks things up for quite awhile.
>>
>> I'm thinking of running those ceph-osd's with this patch:
>>
>> # git diff
>> diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc
>> index 89745bb..7dcd933 100644
>> --- a/src/cls/log/cls_log.cc
>> +++ b/src/cls/log/cls_log.cc
>> @@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx, bufferlist *in, bufferlist *o
>>      to_index = op.to_marker;
>>    }
>>
>> -#define MAX_TRIM_ENTRIES 1000
>> +#define MAX_TRIM_ENTRIES 10
>>    size_t max_entries = MAX_TRIM_ENTRIES;
>>
>>    int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix, max_entries, &keys);
>>
>> What do you think?
>>
>> -- Dan
>>
>> On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>>>
>>> Hi Dan,
>>>
>>> That's good news that it can remove 1000 keys at a time without hitting
>>> timeouts. The output of 'du' will depend on when the leveldb compaction
>>> runs. If you do find that compaction leads to suicide timeouts on this
>>> osd (you would see a lot of 'leveldb:' output in the log), consider
>>> running offline compaction by adding 'leveldb compact on mount = true'
>>> to the osd config and restarting.
>>>
>>> Casey
>>>
>>> On 06/19/2017 11:01 AM, Dan van der Ster wrote:
>>>>
>>>> On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
>>>>>
>>>>> On 06/14/2017 05:59 AM, Dan van der Ster wrote:
>>>>>>
>>>>>> Dear ceph users,
>>>>>>
>>>>>> Today we had O(100) slow requests which were caused by deep-scrubbing
>>>>>> of the metadata log:
>>>>>>
>>>>>> 2017-06-14 11:07:55.373184 osd.155
>>>>>> [2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
>>>>>> deep-scrub starts
>>>>>> ...
>>>>>> 2017-06-14 11:22:04.143903 osd.155
>>>>>> [2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
>>>>>> request 480.140904 seconds old, received at 2017-06-14
>>>>>> 11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
>>>>>> meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
>>>>>> 0=[] ondisk+write+known_if_redirected e7752) currently waiting for
>>>>>> scrub
>>>>>> ...
>>>>>> 2017-06-14 11:22:06.729306 osd.155
>>>>>> [2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
>>>>>> deep-scrub ok
>>>>>>
>>>>>> We have log_meta: true, log_data: false on this (our only) region [1],
>>>>>> which IIRC we setup to enable indexless buckets.
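Those log_meta / log_data flags live as per-zone entries in the zonegroup
("region") JSON, so a quick way to confirm what a cluster is actually
logging -- just a sketch, assuming a jewel-era radosgw-admin and that the
zonegroup in question is the default one -- is:

$ radosgw-admin zonegroup get | grep -E '"log_(meta|data)"'

The same keys appear in the zonegroup.json that gets edited further down
in this thread.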
>>>>>>
>>>>>> I'm obviously unfamiliar with rgw meta and data logging, and have a
>>>>>> few questions:
>>>>>>
>>>>>> 1. AFAIU, it is used by the rgw multisite feature. Is it safe to
>>>>>> turn it off when not using multisite?
>>>>>
>>>>> It's a good idea to turn that off, yes.
>>>>>
>>>>> First, make sure that you have configured a default realm/zonegroup/zone:
>>>>>
>>>>> $ radosgw-admin realm default --rgw-realm <realm name>
>>>>>   (you can determine realm name from 'radosgw-admin realm list')
>>>>> $ radosgw-admin zonegroup default --rgw-zonegroup default
>>>>> $ radosgw-admin zone default --rgw-zone default
>>>>>
>>>> Thanks. This had already been done, as confirmed with radosgw-admin
>>>> realm get-default.
>>>>
>>>>> Then you can modify the zonegroup (aka region):
>>>>>
>>>>> $ radosgw-admin zonegroup get > zonegroup.json
>>>>> $ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
>>>>> $ radosgw-admin zonegroup set < zonegroup.json
>>>>>
>>>>> Then commit the updated period configuration:
>>>>>
>>>>> $ radosgw-admin period update --commit
>>>>>
>>>>> Verify that the resulting period contains "log_meta": "false". Take
>>>>> care with future radosgw-admin commands on the zone/zonegroup, as they
>>>>> may revert log_meta back to true [1].
>>>>>
>>>> Great, this worked. FYI (and for others trying this in future), the
>>>> period update --commit blocks all rgws for ~30s while they reload the
>>>> realm.
>>>>
>>>>>> 2. I started dumping the output of radosgw-admin mdlog list, and
>>>>>> cancelled it after a few minutes. It had already dumped 3GB of json
>>>>>> and I don't know how much more it would have written. Is something
>>>>>> supposed to be trimming the mdlog automatically?
>>>>>
>>>>> There is automated mdlog trimming logic in master, but not
>>>>> jewel/kraken. And this logic won't be triggered if there is only one
>>>>> zone [2].
>>>>>
>>>>>> 3. ceph df doesn't show the space occupied by omap objects -- is
>>>>>> there an indirect way to see how much space these are using?
>>>>>
>>>>> You can inspect the osd's omap directory:
>>>>> du -sh /var/lib/ceph/osd/osd0/current/omap
>>>>>
>>>> Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
>>>> ~100-300MB on other OSDs.
>>>>
>>>>>> 4. mdlog status has markers going back to 2016-10, see [2]. I
>>>>>> suppose we're not using this feature correctly? :-/
>>>>>>
>>>>>> 5. Suppose I were to set log_meta: false -- how would I delete
>>>>>> these log entries now that they are not needed?
>>>>>
>>>>> There is a 'radosgw-admin mdlog trim' command that can be used to trim
>>>>> them one --shard-id (from 0 to 63) at a time. An entire log shard can
>>>>> be trimmed with:
>>>>>
>>>>> $ radosgw-admin mdlog trim --shard-id 0 --period
>>>>>   8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1
>>>>>
>>>>> *However*, there is a risk that bulk operations on large omaps will
>>>>> affect cluster health by taking down OSDs. Not only can this bulk
>>>>> deletion take long enough to trigger the osd/filestore suicide
>>>>> timeouts, the resulting leveldb compaction after deletion is likely to
>>>>> block other omap operations and hit the timeouts as well. This seems
>>>>> likely in your case, based on the fact that you're already having
>>>>> issues with scrub.
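If all 64 shards needed the same treatment, the per-shard command could be
wrapped in a loop. This is only a sketch, reusing the period id from this
thread, and the caveat above about bulk omap deletes applies to every
iteration:

$ for shard in $(seq 0 63); do
      radosgw-admin mdlog trim --shard-id $shard \
          --period 8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1
  done

Running the shards serially rather than in parallel keeps each omap
deletion (and the leveldb compaction that follows it) as small as possible.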
>>>> We did this directly on shard 54, and indeed the command is taking a
>>>> looong time (but with no slow requests or osds being marked down).
>>>> After 45 minutes, du is still 3.3GB, so I can't tell if it's
>>>> progressing. I see ~1000 _omap_rmkeys messages every ~2 seconds:
>>>>
>>>> 2017-06-19 16:57:34.347222 7fc602640700 15
>>>> filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys
>>>> 24.1d_head/#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
>>>> 2017-06-19 16:57:34.347319 7fc602640700 10 filestore oid:
>>>> #24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
>>>> not skipping op, *spos 67765185.0.0
>>>> 2017-06-19 16:57:34.347326 7fc602640700 10 filestore > header.spos 0.0.0
>>>> 2017-06-19 16:57:34.347351 7fc602640700 15
>>>> filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys
>>>> 24.1d_head/#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
>>>> 2017-06-19 16:57:34.347373 7fc602640700 10 filestore oid:
>>>> #24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
>>>> not skipping op, *spos 67765185.0.1
>>>> 2017-06-19 16:57:34.347379 7fc602640700 10 filestore > header.spos 0.0.0
>>>> ...
>>>>
>>>> Does that look correct?
>>>>
>>>> Thanks for all the help!
>>>>
>>>> -- Dan

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
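For anyone wondering whether a long trim like this is making progress:
rather than watching du (which largely moves only when leveldb compacts),
the omap keys remaining on the shard's log object can be counted directly.
A sketch -- the pool name below is a placeholder; take it from the zone's
log_pool setting:

$ radosgw-admin zone get | grep log_pool
$ rados -p <log_pool> listomapkeys \
      meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 | wc -l

Re-running the second command every minute or so should show the key count
dropping even while du on the osd's omap directory stays flat.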