Re: radosgw: scrub causing slow requests in the md log

On 06/22/2017 04:00 AM, Dan van der Ster wrote:
I'm now running the three relevant OSDs with that patch. (Recompiled,
replaced /usr/lib64/rados-classes/libcls_log.so with the new version,
then restarted the osds).
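
For the record, the swap on each of the three osds was roughly the following
(a sketch only; the built .so path depends on how the tree was compiled, and
osd.155 stands in for whichever osd is being patched):

$ cp src/.libs/libcls_log.so /usr/lib64/rados-classes/libcls_log.so   # source path assumes an autotools build tree
$ systemctl restart ceph-osd@155                                      # repeat for each of the three osds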

It's working quite well, trimming 10 entries at a time instead of
1000, and no more timeouts.

Do you think it would be worth decreasing this hardcoded value in ceph proper?

-- Dan

I do, yeah. At least, the trim operation should be able to pass in its own value for that. I opened a ticket for that at http://tracker.ceph.com/issues/20382.

I'd also like to investigate using the ObjectStore's OP_OMAP_RMKEYRANGE operation to trim a range of keys in a single osd op, instead of generating a different op for each key. I have a PR that does this at https://github.com/ceph/ceph/pull/15183. But it's still hard to guarantee that leveldb can process the entire range inside of the suicide timeout.

Casey



On Wed, Jun 21, 2017 at 3:51 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
That patch looks reasonable. You could also try raising the values of
osd_op_thread_suicide_timeout and filestore_op_thread_suicide_timeout on
that osd in order to trim more at a time.
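
For example, something along these lines (a sketch; osd.155 and the 600 second
value are placeholders, and if injectargs reports the change as unobserved
you'll need to set it in ceph.conf and restart the osd instead):

$ ceph tell osd.155 injectargs '--osd_op_thread_suicide_timeout 600 --filestore_op_thread_suicide_timeout 600'

or, persistently, in ceph.conf on that host:

[osd.155]
osd op thread suicide timeout = 600
filestore op thread suicide timeout = 600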


On 06/21/2017 09:27 AM, Dan van der Ster wrote:
Hi Casey,

I managed to trim up all shards except for that big #54. The others
all trimmed within a few seconds.

But 54 is proving difficult. It's still going after several days, and
now I see that the 1000-key trim is indeed causing osd timeouts. I've
manually compacted the relevant osd leveldbs, but haven't found any
way to speed up the trimming. It's now going at ~1-2Hz, so 1000 trims
per op locks things up for quite a while.

I'm thinking of running those ceph-osd's with this patch:

# git diff
diff --git a/src/cls/log/cls_log.cc b/src/cls/log/cls_log.cc
index 89745bb..7dcd933 100644
--- a/src/cls/log/cls_log.cc
+++ b/src/cls/log/cls_log.cc
@@ -254,7 +254,7 @@ static int cls_log_trim(cls_method_context_t hctx, bufferlist *in, bufferlist *out)
       to_index = op.to_marker;
     }

-#define MAX_TRIM_ENTRIES 1000
+#define MAX_TRIM_ENTRIES 10
     size_t max_entries = MAX_TRIM_ENTRIES;

     int rc = cls_cxx_map_get_vals(hctx, from_index, log_index_prefix, max_entries, &keys);


What do you think?

-- Dan




On Mon, Jun 19, 2017 at 5:32 PM, Casey Bodley <cbodley@xxxxxxxxxx> wrote:
Hi Dan,

That's good news that it can remove 1000 keys at a time without hitting
timeouts. The output of 'du' will depend on when the leveldb compaction runs.
If you do find that compaction leads to suicide timeouts on this osd (you
would see a lot of 'leveldb:' output in the log), consider running offline
compaction by adding 'leveldb compact on mount = true' to the osd config and
restarting.
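
Concretely, something like this (a sketch; the osd id is an assumption, and
remove the option again after the restart so it doesn't compact on every
start):

# ceph.conf on the host carrying osd.155 (osd id is an assumption)
[osd.155]
leveldb compact on mount = true

$ systemctl restart ceph-osd@155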

Casey


On 06/19/2017 11:01 AM, Dan van der Ster wrote:
On Thu, Jun 15, 2017 at 7:56 PM, Casey Bodley <cbodley@xxxxxxxxxx>
wrote:
On 06/14/2017 05:59 AM, Dan van der Ster wrote:
Dear ceph users,

Today we had O(100) slow requests which were caused by deep-scrubbing
of the metadata log:

2017-06-14 11:07:55.373184 osd.155
[2001:1458:301:24::100:d]:6837/3817268 7387 : cluster [INF] 24.1d
deep-scrub starts
...
2017-06-14 11:22:04.143903 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8276 : cluster [WRN] slow
request 480.140904 seconds old, received at 2017-06-14
11:14:04.002913: osd_op(client.3192010.0:11872455 24.be8b305d
meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 [call log.add] snapc
0=[] ondisk+write+known_if_redirected e7752) currently waiting for
scrub
...
2017-06-14 11:22:06.729306 osd.155
[2001:1458:301:24::100:d]:6837/3817268 8277 : cluster [INF] 24.1d
deep-scrub ok

We have log_meta: true, log_data: false on this (our only) region [1],
which IIRC we set up to enable indexless buckets.

I'm obviously unfamiliar with rgw meta and data logging, and have a
few questions:

     1. AFAIU, it is used by the rgw multisite feature. Is it safe to turn
it off when not using multisite?

It's a good idea to turn that off, yes.

First, make sure that you have configured a default
realm/zonegroup/zone:

$ radosgw-admin realm default --rgw-realm <realm name>  (you can determine
realm name from 'radosgw-admin realm list')
$ radosgw-admin zonegroup default --rgw-zonegroup default
$ radosgw-admin zone default --rgw-zone default

Thanks. This had already been done, as confirmed with radosgw-admin
realm get-default.

Then you can modify the zonegroup (aka region):

$ radosgw-admin zonegroup get > zonegroup.json
$ sed -i 's/log_meta": "true/log_meta":"false/' zonegroup.json
$ radosgw-admin zonegroup set < zonegroup.json

Then commit the updated period configuration:

$ radosgw-admin period update --commit

Verify that the resulting period contains "log_meta": "false". Take care
with future radosgw-admin commands on the zone/zonegroup, as they may revert
log_meta back to true [1].
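
A quick way to check that the commit took effect (period get dumps the
current period as json):

$ radosgw-admin period get | grep log_meta     # should now show "log_meta": "false"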

Great, this worked. FYI (and for others trying this in future), the
period update --commit blocks all rgws for ~30s while they reload the
realm.

     2. I started dumping the output of radosgw-admin mdlog list, and
cancelled it after a few minutes. It had already dumped 3GB of json
and I don't know how much more it would have written. Is something
supposed to be trimming the mdlog automatically?

There is automated mdlog trimming logic in master, but not jewel/kraken. And
this logic won't be triggered if there is only one zone [2].

     3. ceph df doesn't show the space occupied by omap objects -- is
there an indirect way to see how much space these are using?

You can inspect the osd's omap directory: du -sh
/var/lib/ceph/osd/osd0/current/omap
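
To compare all of the osds on a host at once (a sketch; adjust the glob if
your osd data dirs live elsewhere):

$ for d in /var/lib/ceph/osd/*/current/omap; do du -sh "$d"; done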

Cool. osd.155 (holding shard 54) has 3.3GB of omap, compared with
~100-300MB on other OSDs.

     4. mdlog status has markers going back to 2016-10, see [2]. I suppose
we're not using this feature correctly? :-/

     5. Suppose I were to set log_meta: false -- how would I delete these
log entries now that they are not needed?

There is a 'radosgw-admin mdlog trim' command that can be used to trim them
one --shard-id (from 0 to 63) at a time. An entire log shard can be trimmed
with:

$ radosgw-admin mdlog trim --shard-id 0 --period 8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1
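
So clearing all 64 shards would look something like this (a sketch using the
same period id; see the caveat below about the load this can put on the osds):

$ for shard in $(seq 0 63); do
    # same period id as the single-shard command above
    radosgw-admin mdlog trim --shard-id $shard \
      --period 8d4fcb63-c314-4f9a-b3b3-0e61719ec258 --end-time 2020-1-1
  done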

*However*, there is a risk that bulk operations on large omaps will affect
cluster health by taking down OSDs. Not only can this bulk deletion take long
enough to trigger the osd/filestore suicide timeouts, but the leveldb
compaction that follows the deletion is also likely to block other omap
operations and hit the timeouts as well. This seems likely in your case, based
on the fact that you're already having issues with scrub.

We did this directly on shard 54, and indeed the command is taking a
looong time (but with no slow requests or osds being marked down).
After 45 minutes, du is still 3.3GB, so I can't tell if it's
progressing. I see ~1000 _omap_rmkeys messages every ~2 seconds:

2017-06-19 16:57:34.347222 7fc602640700 15 filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys 24.1d_head/#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
2017-06-19 16:57:34.347319 7fc602640700 10 filestore oid: #24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head# not skipping op, *spos 67765185.0.0
2017-06-19 16:57:34.347326 7fc602640700 10 filestore  > header.spos 0.0.0
2017-06-19 16:57:34.347351 7fc602640700 15 filestore(/var/lib/ceph/osd/ceph-155) _omap_rmkeys 24.1d_head/#24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head#
2017-06-19 16:57:34.347373 7fc602640700 10 filestore oid: #24:ba0cd17d:::meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54:head# not skipping op, *spos 67765185.0.1
2017-06-19 16:57:34.347379 7fc602640700 10 filestore  > header.spos 0.0.0
...

Does that look correct?
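
As an aside, one way to tell whether it's progressing would be to count the
omap keys left on that shard object (a sketch; the '.log' pool name is an
assumption for the zone's default log pool, and 'radosgw-admin zone get'
shows the actual one):

# '.log' is an assumption; check the zone's log_pool first
$ rados -p .log listomapkeys meta.log.8d4fcb63-c314-4f9a-b3b3-0e61719ec258.54 | wc -l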

Thanks for all the help!

-- Dan

_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

