I've increased the deep scrub interval on the OSDs in our Nautilus cluster by adding the following to the [osd] section:
osd_deep_scrub_interval = 2600000
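For reference, 2600000 seconds is roughly 30 days; a quick bit of arithmetic (the 604800 figure is the stock one-week osd_deep_scrub_interval):

interval = 2_600_000           # osd_deep_scrub_interval I set, in seconds
print(interval / 86400)        # ~30.1 days
print(interval / 604800)       # ~4.3 weeks, i.e. ~4.3x the stock one-week interval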
And I started seeing
1518 pgs not deep-scrubbed in time
in ceph -s. So I added
mon_warn_pg_not_deep_scrubbed_ratio = 1
since the default ratio would start warning with a whole week left to scrub. But the messages persist. The cluster has been running for a month with these settings. Here is an example of the output (with a quick sanity check after the listing). As you can see, some of these are not even two weeks old, nowhere close to 75% of 4 weeks.
pg 6.1f49 not deep-scrubbed since 2019-11-09 23:04:55.370373
pg 6.1f47 not deep-scrubbed since 2019-11-18 16:10:52.561204
pg 6.1f44 not deep-scrubbed since 2019-11-18 15:48:16.825569
pg 6.1f36 not deep-scrubbed since 2019-11-20 05:39:00.309340
pg 6.1f31 not deep-scrubbed since 2019-11-27 02:48:45.347680
pg 6.1f30 not deep-scrubbed since 2019-11-11 21:34:15.795622
pg 6.1f2d not deep-scrubbed since 2019-11-24 11:37:39.502829
pg 6.1f27 not deep-scrubbed since 2019-11-25 07:38:58.689315
pg 6.1f25 not deep-scrubbed since 2019-11-20 00:13:43.048569
pg 6.1f1a not deep-scrubbed since 2019-11-09 15:08:43.516666
pg 6.1f19 not deep-scrubbed since 2019-11-25 10:24:47.884332
1468 more pgs...
Mon Dec 9 08:12:01 PST 2019
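To double-check my reading of those timestamps, here is a rough Python sketch I ran against a saved copy of the health output (health.txt is just a hypothetical capture of the listing above, and the cutoff test encodes my understanding that the warning should only fire after interval * ratio has elapsed, which may well be where I'm wrong):

# Parse "pg X not deep-scrubbed since <stamp>" lines and count how many PGs
# are actually older than the interval I configured.
import re
from datetime import datetime, timedelta

INTERVAL = timedelta(seconds=2_600_000)   # osd_deep_scrub_interval I set (~30 days)
RATIO = 1.0                               # mon_warn_pg_not_deep_scrubbed_ratio I set
now = datetime(2019, 12, 9, 8, 12, 1)     # timestamp of the listing above

pat = re.compile(r'pg (\S+) not deep-scrubbed since '
                 r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)')
overdue = 0
total = 0
with open('health.txt') as f:             # hypothetical capture of `ceph health detail`
    for line in f:
        m = pat.search(line)
        if not m:
            continue
        total += 1
        last = datetime.strptime(m.group(2), '%Y-%m-%d %H:%M:%S.%f')
        if now - last > INTERVAL * RATIO:  # my assumption of when the warning should fire
            overdue += 1
print(f'{overdue} of {total} listed PGs are older than the configured interval')

By that reckoning, none of the mid-November PGs above should be warned about yet.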
There is very little data on the cluster, so it's not a problem of deep-scrubs taking too long:
$ ceph df
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 6.3 PiB 6.1 PiB 153 TiB 154 TiB 2.39
nvme 5.8 TiB 5.6 TiB 138 GiB 197 GiB 3.33
TOTAL 6.3 PiB 6.2 PiB 154 TiB 154 TiB 2.39
POOLS:
POOL ID STORED OBJECTS USED %USED MAX AVAIL
.rgw.root 1 3.0 KiB 7 3.0 KiB 0 1.8 PiB
default.rgw.control 2 0 B 8 0 B 0 1.8 PiB
default.rgw.meta 3 7.4 KiB 24 7.4 KiB 0 1.8 PiB
default.rgw.log 4 11 GiB 341 11 GiB 0 1.8 PiB
default.rgw.buckets.data 6 100 TiB 41.84M 100 TiB 1.82 4.2 PiB
default.rgw.buckets.index 7 33 GiB 574 33 GiB 0 1.8 PiB
default.rgw.buckets.non-ec 8 8.1 MiB 22 8.1 MiB 0 1.8 PiB
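To put rough numbers behind "deep-scrubs not taking too long", a back-of-the-envelope calculation (the PG count and scrub read rate are guesses on my part, not measured, and it ignores EC/replication overhead):

stored_bytes = 100 * 2**40          # ~100 TiB in default.rgw.buckets.data (from ceph df)
pg_count = 8192                     # assumption, guessed from pg ids like 6.1f49
scrub_rate = 50 * 10**6             # bytes/s, conservative guess for one deep scrub
per_pg = stored_bytes / pg_count
print(per_pg / 2**30, 'GiB per PG')               # ~12.5 GiB
print(per_pg / scrub_rate / 60, 'minutes per PG')  # ~4.5 minutes

So even with a conservative read rate, each PG should deep-scrub in minutes, and a month is far more than enough to get through all of them.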
Please help me figure out what I'm doing wrong with these settings.
Thanks,
Robert LeBlanc
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1