Hi all, a little gem for Christmas. After going through the OSD code, scratching my head and doing a bit of maths, I seem to have found a way to tune the built-in scrub machine to work perfectly. It's only a few knobs to turn, but it's difficult to figure out which ones, because the documentation is misleading, incorrect or even missing entirely. I plan to write a bit more in the documentation of this script (https://github.com/frans42/ceph-goodies/blob/5e2016f0b00f8dbc3e51c7e9904a7386b037fd82/scripts/pool-scrub-report), so here only the executive summary (example commands for applying these settings follow right after it):

- global: set osd_max_scrubs=1. Higher values have no effect other than making your users angry.

- global: set osd_deep_scrub_randomize_ratio=0. This parameter is unnecessary for distributing deep-scrubs, and its only effect in the current implementation is to trigger a large number of premature deep-scrubs, increasing the overall deep-scrub load significantly without any benefit.

- on the pools: set deep_scrub_interval according to needs and performance. A scrub turns into a deep-scrub for every PG whose deep-scrub stamp is older than deep_scrub_interval. Here you will need to do some calculations about what your hardware can do and how much average load you can tolerate.

- on the pools: also set scrub_min_interval and scrub_max_interval, so that there is only one place to look for these settings.

- per OSD device class: set osd_scrub_interval_randomize_ratio such that scrubs start within a reasonable window after scrub_min_interval. This parameter is very important for distributing scrubs as evenly as possible over time. The default of 0.5 is good for most cases. For the HDD pool used below I reduced it a bit, because scrub_min_interval is set to 66h and 0.5 would lead to a slightly too large start window.

- per OSD device class: set osd_scrub_backoff_ratio to a value close to but not higher than 1-1/(largest replication factor [=size] of pools on this device class). This parameter is labelled dev, but it is really important for effective scrub scheduling. OSDs need to allocate scrub reservations, and this process is extremely racy, specifically for EC pools with a high replication factor. The default of 0.66 probably has 3-times replicated pools in mind, but it triggers far too many attempts to allocate scrub reservations for pools with a larger replication factor, causing deadlocks and blocking scrubs from being executed even if plenty of OSDs are idle. I found that 1-0.75/max_size_on_device_class works very well.

After having found out what these parameters really do and adjusting them for my pools, I passed through a valley of tears and have now arrived at the beautiful distribution of (deep-)scrub stamps for a pool on 16TB HDDs shown at the end. Everything gets scrubbed every 3-4 days and deep-scrubs start no earlier than 14 days after the last deep-scrub. The overall (deep-)scrub load is now half of what it was before the changes, and I no longer get the dreaded "PGs not (deep-)scrubbed in time" warnings. I calculated the (deep-)scrub time window configs such that about 30% of OSDs will be continuously busy when the disks reach 70% utilization (currently ca. 45%). No user will complain about that, and there is enough spare performance left to catch up after high-load or recovery episodes without having to do anything.
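To make the list above concrete, here is a minimal sketch of how I would apply these settings, assuming the centralized config store (mimic or newer; with a plain ceph.conf you would put the osd_* options there instead). Pool name, device class and values are the ones from the report further down; the pool intervals are given in seconds. Adjust all of this to your own pools and hardware:

  # global settings
  ceph config set global osd_max_scrubs 1
  ceph config set global osd_deep_scrub_randomize_ratio 0

  # per-pool intervals in seconds: 66h, 168h (7d) and 336h (14d)
  ceph osd pool set con-fs2-data2 scrub_min_interval 237600
  ceph osd pool set con-fs2-data2 scrub_max_interval 604800
  ceph osd pool set con-fs2-data2 deep_scrub_interval 1209600

  # per device class (here hdd): 0.363636 = 24h/66h, so scrubs start within 66h..90h;
  # backoff ratio 0.9319 = 1-0.75/11 for the largest pool size (11) on this class
  ceph config set osd/class:hdd osd_scrub_interval_randomize_ratio 0.363636
  ceph config set osd/class:hdd osd_scrub_backoff_ratio 0.9319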
Here is the scrub report generated for the pool I was looking at for weeks now. It's exactly as I wanted it and I don't have to run cron jobs, it just works:

# pool-scrub-report con-fs2-data2
Scrub info for pool con-fs2-data2 (id=19):
dumped pgs

Scrub report:
6% 566 PGs not scrubbed since 1 intervals ( 6h)
13% 528 PGs not scrubbed since 2 intervals ( 12h)
21% 640 PGs not scrubbed since 3 intervals ( 18h)
29% 668 PGs not scrubbed since 4 intervals ( 24h)
37% 677 PGs not scrubbed since 5 intervals ( 30h)
46% 729 PGs not scrubbed since 6 intervals ( 36h)
54% 631 PGs not scrubbed since 7 intervals ( 42h)
62% 662 PGs not scrubbed since 8 intervals ( 48h)
70% 663 PGs not scrubbed since 9 intervals ( 54h)
78% 660 PGs not scrubbed since 10 intervals ( 60h)
85% 571 PGs not scrubbed since 11 intervals ( 66h)
92% 585 PGs not scrubbed since 12 intervals ( 72h) [74 idle] 1 scrubbing
96% 358 PGs not scrubbed since 13 intervals ( 78h) [34 idle] [3 scrubbing+deep] 2 scrubbing
99% 181 PGs not scrubbed since 14 intervals ( 84h) [23 idle] [3 scrubbing+deep]
99% 70 PGs not scrubbed since 15 intervals ( 90h) [9 idle] 1 scrubbing
100% 3 PGs not scrubbed since 16 intervals ( 96h) [1 scrubbing+deep]
8192 PGs out of 8192 reported, 0 missing, 4 scrubbing, 140 idle, 0 unclean.

Deep-scrub report:
3% 295 PGs not deep-scrubbed since 1 intervals ( 24h)
9% 461 PGs not deep-scrubbed since 2 intervals ( 48h)
16% 558 PGs not deep-scrubbed since 3 intervals ( 72h)
23% 613 PGs not deep-scrubbed since 4 intervals ( 96h) [1 scrubbing]
31% 619 PGs not deep-scrubbed since 5 intervals (120h)
39% 660 PGs not deep-scrubbed since 6 intervals (144h)
47% 726 PGs not deep-scrubbed since 7 intervals (168h) [1 scrubbing]
57% 743 PGs not deep-scrubbed since 8 intervals (192h)
65% 727 PGs not deep-scrubbed since 9 intervals (216h) [1 scrubbing]
73% 656 PGs not deep-scrubbed since 10 intervals (240h)
75% 107 PGs not deep-scrubbed since 11 intervals (264h)
82% 626 PGs not deep-scrubbed since 12 intervals (288h)
90% 588 PGs not deep-scrubbed since 13 intervals (312h)
94% 388 PGs not deep-scrubbed since 14 intervals (336h) [1 scrubbing]
96% 129 PGs not deep-scrubbed since 15 intervals (360h) 2 scrubbing+deep
98% 207 PGs not deep-scrubbed since 16 intervals (384h) 2 scrubbing+deep
99% 79 PGs not deep-scrubbed since 17 intervals (408h) 1 scrubbing+deep
100% 10 PGs not deep-scrubbed since 18 intervals (432h) 2 scrubbing+deep
8192 PGs out of 8192 reported, 0 missing, 7 scrubbing+deep, 0 unclean.

con-fs2-data2 scrub_min_interval=66h (11i/84%/625PGs÷i)
con-fs2-data2 scrub_max_interval=168h (7d)
con-fs2-data2 deep_scrub_interval=336h (14d/~89%/~520PGs÷d)
osd.338 osd_scrub_interval_randomize_ratio=0.363636 scrubs start after: 66h..90h
osd.338 osd_deep_scrub_randomize_ratio=0.000000
osd.338 osd_max_scrubs=1
osd.338 osd_scrub_backoff_ratio=0.931900 rec. this pool: .9319 (class hdd, size 11)
mon.ceph-01 mon_warn_pg_not_scrubbed_ratio=0.500000 warn: 10.5d (42.0i)
mon.ceph-01 mon_warn_pg_not_deep_scrubbed_ratio=0.750000 warn: 24.5d

Best regards, merry Christmas and a happy new year to everyone!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx