Re: Performance issues with deep-scrub since upgrading from v12.2.2 to v12.2.5


 



Deep scrub needs to read every object in the PG. If some PGs are only taking 5 seconds, they must be nearly empty (or maybe they only contain objects with small amounts of omap or something). Ten minutes is perfectly reasonable, but it is an added load on the cluster as it does all those object reads. Perhaps your configured scrub rates are using enough IOPS that you don't have enough left for your client workloads.
-Greg
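If scrub IOPS are indeed competing with client traffic, Luminous exposes several throttles that can be set in ceph.conf (or injected at runtime). The options below are real 12.2.x settings; the values are purely illustrative, not recommendations for this cluster:

```ini
[osd]
# Sleep between scrub chunk reads; a non-zero value reduces scrub IOPS impact.
osd scrub sleep = 0.1
# Do not start new scrubs while the host load average is above this threshold.
osd scrub load threshold = 0.5
# Restrict (deep-)scrubs to an off-peak window, e.g. 23:00-06:00.
osd scrub begin hour = 23
osd scrub end hour = 6
```

Current values can be checked on a running OSD with `ceph daemon osd.<id> config show | grep scrub`.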
On Thu, Jun 14, 2018 at 11:37 AM Sander van Schie / True <Sander.vanSchie@xxxxxxx> wrote:
Hello,

We recently upgraded Ceph from version 12.2.2 to version 12.2.5. Since the upgrade we've been having performance issues which seem to relate to when deep-scrub actions are performed.

Most of the time a deep-scrub action takes a couple of seconds at most; occasionally, however, it takes 10 minutes. It's either a few seconds or 10 minutes, never a couple of minutes or anything in between. This has been happening since the upgrade.

For example see the following:

2018-06-14 10:11:46.337086 7fbd3528b700  0 log_channel(cluster) log [DBG] : 15.2dc deep-scrub starts
2018-06-14 10:11:50.947843 7fbd3528b700  0 log_channel(cluster) log [DBG] : 15.2dc deep-scrub ok

2018-06-14 10:45:49.575042 7fbd32a86700  0 log_channel(cluster) log [DBG] : 14.1 deep-scrub starts
2018-06-14 10:55:53.326309 7fbd32a86700  0 log_channel(cluster) log [DBG] : 14.1 deep-scrub ok

2018-06-14 10:58:28.652360 7fbd33a88700  0 log_channel(cluster) log [DBG] : 15.5f deep-scrub starts
2018-06-14 10:58:33.411769 7fbd2fa80700  0 log_channel(cluster) log [DBG] : 15.5f deep-scrub ok

The scrub on PG 14.1 took pretty much exactly 10 minutes, the others only about 5 seconds. It matches the value of "osd scrub finalize thread timeout", which is currently set to 10 minutes, though I'm not sure whether that's related or just a coincidence. It's not just this PG; there are a bunch of them, also on different nodes and OSDs.
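For reference, this is how the durations above were measured: a minimal Python sketch that pairs "deep-scrub starts" / "deep-scrub ok" lines from the OSD log (using the Luminous log format shown above) and prints the elapsed time per PG, longest first:

```python
import re
from datetime import datetime

# Sample OSD log lines in the format quoted above.
LOG = """\
2018-06-14 10:11:46.337086 7fbd3528b700  0 log_channel(cluster) log [DBG] : 15.2dc deep-scrub starts
2018-06-14 10:11:50.947843 7fbd3528b700  0 log_channel(cluster) log [DBG] : 15.2dc deep-scrub ok
2018-06-14 10:45:49.575042 7fbd32a86700  0 log_channel(cluster) log [DBG] : 14.1 deep-scrub starts
2018-06-14 10:55:53.326309 7fbd32a86700  0 log_channel(cluster) log [DBG] : 14.1 deep-scrub ok
2018-06-14 10:58:28.652360 7fbd33a88700  0 log_channel(cluster) log [DBG] : 15.5f deep-scrub starts
2018-06-14 10:58:33.411769 7fbd2fa80700  0 log_channel(cluster) log [DBG] : 15.5f deep-scrub ok
"""

# Capture the timestamp, the PG id, and whether the event is a start or an ok.
PATTERN = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+).*?(\S+) deep-scrub (starts|ok)"
)

starts = {}
durations = {}
for line in LOG.splitlines():
    m = PATTERN.search(line)
    if not m:
        continue
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S.%f")
    pg, event = m.group(2), m.group(3)
    if event == "starts":
        starts[pg] = ts
    elif pg in starts:
        durations[pg] = (ts - starts.pop(pg)).total_seconds()

for pg, secs in sorted(durations.items(), key=lambda kv: -kv[1]):
    print(f"{pg}: {secs:.1f}s")
```

Run against a full OSD log (`LOG = open("/var/log/ceph/ceph-osd.3.log").read()`), this makes the bimodal pattern obvious: durations cluster around a few seconds or around 600 seconds, with nothing in between.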

PG dump for this problematic PG is as follows:

PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND BYTES      LOG  DISK_LOG STATE        STATE_STAMP                VERSION       REPORTED       UP         UP_PRIMARY ACTING     ACTING_PRIMARY LAST_SCRUB    SCRUB_STAMP                LAST_DEEP_SCRUB DEEP_SCRUB_STAMP           SNAPTRIMQ_LEN
14.1      10573                  0        0         0       0          0 1579     1579 active+clean 2018-06-14 15:47:32.832222  1215'1291261   1215:7062174   [3,8,20]          3   [3,8,20]              3  1179'1288697 2018-06-14 10:55:53.326320    1179'1288697 2018-06-14 10:55:53.326320             0

During the longer running deep-scrub actions we're also running into performance problems.

Any idea what's going wrong?

Thanks

Sander
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
