Hi Dan,
That's good news that it can remove 1000 keys at a time without hitting
timeouts. The output of 'du' will depend on when the leveldb compaction
runs. If you do find that compaction leads to suicide timeouts on this
osd (you would see a lot of 'leveldb:' output in the log), consider
running offline compaction by adding 'leveldb compact on mount = true'
to the osd config and restarting.
Casey
On 06/19/2017 11:01 AM, Dan van der Ster wrote:
On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
On 06/14/2017 05:59 AM, Dan van der Ster wrote:
Dear ceph users,
Today we had O(100) slow requests which were caused by deep-scrubbing
of the metadata log:
2017-06-14 11:07:55.373184 osd.155
[2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
deep-scrub starts
...
2017-06-14 11:22:04.143903 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
request 480.140904 seconds old, received at 2017-06-14
11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
0=[] ondisk+write+known_if_redirected e7752) currently waiting for
scrub
...
2017-06-14 11:22:06.729306 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
deep-scrub ok
We have log_meta: true, log_data: false on this (our only) region [1],
which IIRC we setup to enable indexless buckets.
I'm obviously unfamiliar with rgw meta and data logging, and have a
few questions:
1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
it off when not using multisite?
It's a good idea to turn that off, yes.
First, make sure that you have configured a default realm/zonegroup/zone:
$ radosgw-admin realm default --rgw-realm <realm name> (you can determine
realm name from 'radosgw-admin realm list')
$ radosgw-admin zonegroup default --rgw-zonegroup default
$ radosgw-admin zone default --rgw-zone default
Thanks. This had already been done, as confirmed with radosgw-admin
realm get-default.
Then you can modify the zonegroup (aka region):
$ radosgw-admin zonegroup get > zonegroup.json
$ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
$ radosgw-admin zonegroup set < zonegroup.json
Then commit the updated period configuration:
$ radosgw-admin period update --commit
Verify that the resulting period contains "log_meta": "false". Take care
with future radosgw-admin commands on the zone/zonegroup, as they may revert
log_meta back to true [1].
Great, this worked. FYI (and for others trying this in future), the
period update --commit blocks all rgws for ~30s while they reload the
realm.
2. I started dumping the output of radosgw-admin mdlog list, and
cancelled it after a few minutes. It had already dumped 3GB of json
and I don't know how much more it would have written. Is something
supposed to be trimming the mdlog automatically?
There is automated mdlog trimming logic in master, but not jewel/kraken. And
this logic won't be triggered if there is only one zone [2].
3. ceph df doesn't show the space occupied by omap objects -- is
there an indirect way to see how much space these are using?
You can inspect the osd's omap directory: du -sh
/var/lib/ceph/osd/osd0/current/omap
Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
~100-300MB on other OSDs.
4. mdlog status has markers going back to 2016-10, see [2]. I suppose
we're not using this feature correctly? :-/
5. Suppose I were to set log_meta: false -- how would I delete these
log entries now that they are not needed?
There is a 'radosgw-admin mdlog trim' command that can be used to trim them
one --shard-id (from 0 to 63) at a time. An entire log shard can be trimmed
with:
$ radosgw-admin mdlog trim --shard-id 0 --period
8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1
*However*, there is a risk that bulk operations on large omaps will affect
cluster health by taking down OSDs. Not only can this bulk deletion take
long enough to trigger the osd/filestore suicide timeouts, the resulting
leveldb compaction after deletion is likely to block other omap operations
and hit the timeouts as well. This seems likely in your case, based on the
fact that you're already having issues with scrub.
We did this directly on shard 54, and indeed the command is taking a
looong time (but with no slow requests or osds being marked down).
After 45 minutes, du is still 3.3GB, so I can't tell if it's
progressing. I see ~1000 _omap_rmkeys messages every ~2 seconds:
2017-06-19 16:57:34.347222 7fc602640700 15
filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys
24.1d_head/#24:ba0cd17d:::met
a.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
2017-06-19 16:57:34.347319 7fc602640700 10 filestore oid:
#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:h
ead# not skipping op, *spos 67765185.0.0
2017-06-19 16:57:34.347326 7fc602640700 10 filestore > header.spos 0.0.0
2017-06-19 16:57:34.347351 7fc602640700 15
filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys
24.1d_head/#24:ba0cd17d:::met
a.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
2017-06-19 16:57:34.347373 7fc602640700 10 filestore oid:
#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:h
ead# not skipping op, *spos 67765185.0.1
2017-06-19 16:57:34.347379 7fc602640700 10 filestore > header.spos 0.0.0
...
Does that look correct?
Thanks for all the help!
-- Dan
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com