Partially answering my own question: I think it is possible to tweak the existing parameters to achieve what I'm looking for, at least on average.

The main reason I want to use the internal scheduler is the high number of PGs on some pools, which I actually intend to increase even further. For such pools a simple calculation shows that manual scrubbing with cron is impractical; I simply cannot execute cron jobs often enough to achieve a reasonable scrub distribution (looking at a script like https://gist.github.com/ethaniel/5db696d9c78516308b235b0cb904e4ad).

Looking at the scrub stamp distributions for specific pools, PGs with old deep-scrub stamps tend to correlate with PGs with old scrub stamps. The idea now is to tweak scrub_min_interval such that the scrub scheduler is forced to select PGs out of the 20-30% with the oldest scrub stamps. After a reasonable time interval this should reduce the age of the oldest deep-scrub stamps, because PGs that have not been deep-scrubbed for a long time become much more likely to be scheduled for a deep-scrub. This adjustment goes together with making osd_deep_scrub_randomize_ratio and osd_scrub_backoff_ratio more aggressive, so that the scheduler cycles more frequently through the small list of PGs eligible for scrubbing. It is a bit like the reverse calculation for achieving the effect of a not (yet) implemented deep_scrub_min_interval.

I made the following global changes (and hope the parameters do something like what their documentation says):

global  advanced  osd_deep_scrub_randomize_ratio  0.330000
global  dev       osd_scrub_backoff_ratio         0.500000

With this setting, about 33% of all scrubs should be deep-scrubs, meaning that on average a PG is also deep-scrubbed after 3 scrub events. This leads to the following estimate for the expected deep-scrub interval. Given that the scrub interval is

  scrub_min_interval * [1, 1+osd_scrub_interval_randomize_ratio] = scrub_min_interval * [1, 1.5],

the expected deep-scrub interval is (assuming worst-case realisation of the randomize ratio for the upper value)

  scrub_min_interval * [1, 1.5] * 3 = scrub_min_interval * [3, 4.5].

I can't calculate how the tail of the distribution will look, but I hope it's not a fat tail. I will report back what I observe; see below.

The idea now is to tune scrub_min_interval per pool such that only about 20-30% of PGs have a scrub stamp older than scrub_min_interval. The scheduler will then cycle only through these, and a bit faster than with the defaults. As the stamp histograms included below indicate, the distribution is probably very sensitive to changes of this interval.

I have now changed these values on some pools and already see that PGs with much older deep-scrub stamps are being selected for deep-scrubbing. I will observe what these settings converge to and report back. It looks like this will lead to an improved stamp distribution, and one only needs to issue manual deep-scrubs for the very few PGs that are outliers of the random number generator (that's the tail I talked about above). My goal is to have a script schedule a deep-scrub on the outliers no more often than daily.

The reports below were pulled about 2-3h after the changed settings were applied. There is already improvement in the right direction, but the original distribution issue is still very pronounced.
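For reference, the commands I used look roughly like this (just a sketch, not a recommendation; the per-pool value is given in seconds, and 151200s = 42h is simply the value I ended up with on the HDD pool further down, so adjust it to your own stamp histogram):

# global scheduler tuning as discussed above
ceph config set global osd_deep_scrub_randomize_ratio 0.33
ceph config set global osd_scrub_backoff_ratio 0.5

# per-pool scrub_min_interval, in seconds (151200s = 42h, example value)
ceph osd pool set con-fs2-data2 scrub_min_interval 151200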
Here are two per-pool scrub stamp distributions, one for an SSD pool and one for an HDD pool, both with a large number of PGs per OSD:

=== SSD pool:
Scrub info for pool sr-rbd-data-one (id=2):
dumped pgs

Scrub report:
 22%   941 PGs not scrubbed since  1 intervals (  6h)
 42%   795 PGs not scrubbed since  2 intervals ( 12h)
 62%   829 PGs not scrubbed since  3 intervals ( 18h)
 82%   823 PGs not scrubbed since  4 intervals ( 24h)
 96%   576 PGs not scrubbed since  5 intervals ( 30h)
100%   132 PGs not scrubbed since  6 intervals ( 36h)
4096 PGs out of 4096 reported, 0 missing.

Deep-scrub report:
 13%   545 PGs not deep-scrubbed since  1 intervals ( 24h)
 25%   508 PGs not deep-scrubbed since  2 intervals ( 48h) 1 scrubbing+deep
 45%   797 PGs not deep-scrubbed since  3 intervals ( 72h) 1 scrubbing+deep
 59%   587 PGs not deep-scrubbed since  4 intervals ( 96h)
 70%   463 PGs not deep-scrubbed since  5 intervals (120h)
 78%   312 PGs not deep-scrubbed since  6 intervals (144h)
 84%   263 PGs not deep-scrubbed since  7 intervals (168h)
 89%   173 PGs not deep-scrubbed since  8 intervals (192h)
 92%   151 PGs not deep-scrubbed since  9 intervals (216h)
 95%   106 PGs not deep-scrubbed since 10 intervals (240h)
 96%    55 PGs not deep-scrubbed since 11 intervals (264h)
 97%    50 PGs not deep-scrubbed since 12 intervals (288h)
 98%    44 PGs not deep-scrubbed since 13 intervals (312h)
 99%    24 PGs not deep-scrubbed since 14 intervals (336h)
100%    18 PGs not deep-scrubbed since 15 intervals (360h)
4096 PGs out of 4096 reported, 0 missing.

PGs marked with a * are on busy OSDs and not eligible for scrubbing.

sr-rbd-data-one scrub_min_interval=0h
sr-rbd-data-one scrub_max_interval=0h
sr-rbd-data-one deep_scrub_interval=0h
===

Here we see that after 24h 82% of PGs are scrubbed, but we have quite a tail of not-deep-scrubbed PGs. It is long enough to trigger a warning with default parameters. In this case, reducing scrub_min_interval to a value around 18-20h could shorten the tail enough. The alternative is simply to schedule a deep-scrub on the oldest PGs manually via cron; such a deep-scrub would start immediately, since no OSDs are allocated to scrubbing/recovery (a sketch of such a job follows below).
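As a rough illustration of that cron alternative (not the script I actually run; it assumes jq is installed, and the pool id and the count of 5 PGs per run are arbitrary example values), something like this would kick a deep-scrub on the PGs of one pool with the oldest deep-scrub stamps:

#!/bin/bash
# Illustration only: deep-scrub the few PGs of one pool with the oldest
# deep-scrub stamps. Run e.g. once per day from cron.
pool_id=2    # numeric pool id, e.g. sr-rbd-data-one in the report above
count=5      # how many outlier PGs to kick per run

ceph -f json pg dump pgs 2>/dev/null |
  jq -r --arg p "$pool_id" '
    .pg_stats[]
    | select(.pgid | startswith($p + "."))
    | [ .last_deep_scrub_stamp, .pgid ] | @tsv' |
  sort | head -n "$count" | cut -f2 |
  while read -r pgid; do
    ceph pg deep-scrub "$pgid"
  done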
=== HDD pool:
Scrub info for pool con-fs2-data2 (id=19):
dumped pgs

Scrub report:
 11%   939 PGs not scrubbed since  1 intervals (  6h)
 22%   936 PGs not scrubbed since  2 intervals ( 12h)
 33%   874 PGs not scrubbed since  3 intervals ( 18h)
 43%   821 PGs not scrubbed since  4 intervals ( 24h)
 54%   931 PGs not scrubbed since  5 intervals ( 30h)
 64%   766 PGs not scrubbed since  6 intervals ( 36h)
 72%   646 PGs not scrubbed since  7 intervals ( 42h)
 79%   559 PGs not scrubbed since  8 intervals ( 48h)
 84%   411 PGs not scrubbed since  9 intervals ( 54h)
 88%   346 PGs not scrubbed since 10 intervals ( 60h)
 90%   222 PGs not scrubbed since 11 intervals ( 66h)
 93%   213 PGs not scrubbed since 12 intervals ( 72h)
 95%   160 PGs not scrubbed since 13 intervals ( 78h)
 96%    87 PGs not scrubbed since 14 intervals ( 84h)
 97%    77 PGs not scrubbed since 15 intervals ( 90h)
 98%    57 PGs not scrubbed since 16 intervals ( 96h) 1 scrubbing
 98%    42 PGs not scrubbed since 17 intervals (102h)
 99%    32 PGs not scrubbed since 18 intervals (108h)
 99%    19 PGs not scrubbed since 19 intervals (114h)
 99%    19 PGs not scrubbed since 20 intervals (120h)
 99%    10 PGs not scrubbed since 21 intervals (126h)
 99%     1 PGs not scrubbed since 22 intervals (132h) 19.165f*
 99%     5 PGs not scrubbed since 24 intervals (138h) 19.412* 19.75c* 19.140f* 19.134c* 19.fb7*
 99%     5 PGs not scrubbed since 25 intervals (144h) 19.1714* 19.148d* 19.1fa9* 19.1f05* 19.1cda*
 99%     1 PGs not scrubbed since 26 intervals (150h) 19.a3f*
 99%     1 PGs not scrubbed since 27 intervals (156h) 19.a01*
 99%     3 PGs not scrubbed since 28 intervals (162h) 19.12f2* 19.1284* 19.c90*
 99%     1 PGs not scrubbed since 29 intervals (168h)
 99%     1 PGs not scrubbed since 30 intervals (174h) 19.f13*
 99%     2 PGs not scrubbed since 32 intervals (180h) 19.1f87* 19.67b*
 99%     2 PGs not scrubbed since 36 intervals (186h) 19.133f* 19.1318*
 99%     2 PGs not scrubbed since 40 intervals (192h) 19.12f4* 19.248*
100%     1 PGs not scrubbed since 43 intervals (198h) 19.1984*
8192 PGs out of 8192 reported, 0 missing.

Deep-scrub report:
 14%  1210 PGs not deep-scrubbed since  1 intervals ( 24h)
 28%  1136 PGs not deep-scrubbed since  2 intervals ( 48h) 1 scrubbing+deep
 40%   985 PGs not deep-scrubbed since  3 intervals ( 72h) 4 scrubbing+deep
 51%   851 PGs not deep-scrubbed since  4 intervals ( 96h) 5 scrubbing+deep
 59%   713 PGs not deep-scrubbed since  5 intervals (120h) 4 scrubbing+deep
 63%   276 PGs not deep-scrubbed since  6 intervals (144h) 2 scrubbing+deep
 70%   566 PGs not deep-scrubbed since  7 intervals (168h) 1 scrubbing+deep
 76%   534 PGs not deep-scrubbed since  8 intervals (192h) 2 scrubbing+deep
 82%   480 PGs not deep-scrubbed since  9 intervals (216h) 2 scrubbing+deep
 87%   381 PGs not deep-scrubbed since 10 intervals (240h) 2 scrubbing+deep
 90%   253 PGs not deep-scrubbed since 11 intervals (264h) 1 scrubbing+deep
 92%   222 PGs not deep-scrubbed since 12 intervals (288h) 1 scrubbing+deep
 94%   136 PGs not deep-scrubbed since 13 intervals (312h) 1 scrubbing+deep
 96%   179 PGs not deep-scrubbed since 14 intervals (336h) 3 scrubbing+deep
 98%   156 PGs not deep-scrubbed since 15 intervals (360h) 6 scrubbing+deep
 99%    65 PGs not deep-scrubbed since 16 intervals (384h) 3 scrubbing+deep
 99%    31 PGs not deep-scrubbed since 17 intervals (408h) 4 scrubbing+deep
 99%    14 PGs not deep-scrubbed since 18 intervals (432h) 3 scrubbing+deep
 99%     3 PGs not deep-scrubbed since 19 intervals (456h) 19.1d89* 19.fb7* 19.807*
100%     1 PGs not deep-scrubbed since 21 intervals (480h) 19.a01*
8192 PGs out of 8192 reported, 0 missing.

PGs marked with a * are on busy OSDs and not eligible for scrubbing.
con-fs2-data2 scrub_min_interval=42h
con-fs2-data2 scrub_max_interval=0h
con-fs2-data2 deep_scrub_interval=0h
===

This is the real deal, the pool I'm fighting with at the moment. I made a small change to scrub_min_interval (the pool setting), from 24h to 42h, which resulted in a very good allocation of deep-scrub states across the PGs in the pool. With scrub_min_interval=24h, basically all scrubbing happened on PGs that had last been deep-scrubbed only 1-6 days before. After increasing this value to the time interval within which about 70% of PGs had been scrubbed (leaving 30% eligible), the allocation of deep-scrub states is much, much better. I expect both tails to get shorter and the overall deep-scrub load to go down as well. I hope to reach a state where I only need to issue a few deep-scrubs manually per day to get everything scrubbed within 1 week and deep-scrubbed within 3-4 weeks.

For now I will wait and see what effect the global settings have on the SSD pools and what the HDD pool converges to. This will need 1-2 months of observation, and I will report back when significant changes show up.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Frank Schilder <frans@xxxxxx>
Sent: Wednesday, November 15, 2023 11:14 AM
To: ceph-users@xxxxxxx
Subject: How to configure something like osd_deep_scrub_min_interval?

Hi folks,

I am fighting a bit with odd deep-scrub behavior on HDDs and discovered a likely cause of why the distribution of last_deep_scrub_stamps is so weird. I wrote a small script to extract a histogram of scrubs by "days not scrubbed" (more precisely, intervals not scrubbed; see code) to find out how (deep-)scrub times are distributed. Output below.

What I expected is along the lines that HDD-OSDs try to scrub every 1-3 days, while they try to deep-scrub every 7-14 days. In other words, PGs that have been deep-scrubbed within the last 7 days would *never* be in scrubbing+deep state. However, what I see is completely different. There seems to be no distinction between scrub and deep-scrub start times. This is really unexpected, as nobody would try to deep-scrub HDDs every day. Weekly to bi-weekly is normal, specifically for large drives.

Is there a way to configure something like osd_deep_scrub_min_interval (no, I don't want to run cron jobs for scrubbing yet)? In the output below, I would like to be able to configure a minimum period of 1-2 weeks before the next deep-scrub happens. How can I do that?

The observed behavior is very unusual for RAID systems (if it's not a bug in the report script). With this behavior it's not surprising that people complain about "not deep-scrubbed in time" messages and too high deep-scrub IO load, when such a large percentage of PGs is needlessly deep-scrubbed again after only 1-6 days.
Sample output:

# scrub-report
dumped pgs

Scrub report:
   4121 PGs not scrubbed since  1 intervals (6h)
   3831 PGs not scrubbed since  2 intervals (6h)
   4012 PGs not scrubbed since  3 intervals (6h)
   3986 PGs not scrubbed since  4 intervals (6h)
   2998 PGs not scrubbed since  5 intervals (6h)
   1488 PGs not scrubbed since  6 intervals (6h)
    909 PGs not scrubbed since  7 intervals (6h)
    771 PGs not scrubbed since  8 intervals (6h)
    582 PGs not scrubbed since  9 intervals (6h) 2 scrubbing
    431 PGs not scrubbed since 10 intervals (6h)
    333 PGs not scrubbed since 11 intervals (6h) 1 scrubbing
    265 PGs not scrubbed since 12 intervals (6h)
    195 PGs not scrubbed since 13 intervals (6h)
    116 PGs not scrubbed since 14 intervals (6h)
     78 PGs not scrubbed since 15 intervals (6h) 1 scrubbing
     72 PGs not scrubbed since 16 intervals (6h)
     37 PGs not scrubbed since 17 intervals (6h)
      5 PGs not scrubbed since 18 intervals (6h) 14.237* 19.5cd* 19.12cc* 19.1233* 14.40e*
     33 PGs not scrubbed since 20 intervals (6h)
     23 PGs not scrubbed since 21 intervals (6h)
     16 PGs not scrubbed since 22 intervals (6h)
     12 PGs not scrubbed since 23 intervals (6h)
      8 PGs not scrubbed since 24 intervals (6h)
      2 PGs not scrubbed since 25 intervals (6h) 19.eef* 19.bb3*
      4 PGs not scrubbed since 26 intervals (6h) 19.b4c* 19.10b8* 19.f13* 14.1ed*
      5 PGs not scrubbed since 27 intervals (6h) 19.43f* 19.231* 19.1dbe* 19.1788* 19.16c0*
      6 PGs not scrubbed since 28 intervals (6h)
      2 PGs not scrubbed since 30 intervals (6h) 19.10f6* 14.9d*
      3 PGs not scrubbed since 31 intervals (6h) 19.1322* 19.1318* 8.a*
      1 PGs not scrubbed since 32 intervals (6h) 19.133f*
      1 PGs not scrubbed since 33 intervals (6h) 19.1103*
      3 PGs not scrubbed since 36 intervals (6h) 19.19cc* 19.12f4* 19.248*
      1 PGs not scrubbed since 39 intervals (6h) 19.1984*
      1 PGs not scrubbed since 41 intervals (6h) 14.449*
      1 PGs not scrubbed since 44 intervals (6h) 19.179f*

Deep-scrub report:
   3723 PGs not deep-scrubbed since  1 intervals (24h)
   4621 PGs not deep-scrubbed since  2 intervals (24h) 8 scrubbing+deep
   3588 PGs not deep-scrubbed since  3 intervals (24h) 8 scrubbing+deep
   2929 PGs not deep-scrubbed since  4 intervals (24h) 3 scrubbing+deep
   1705 PGs not deep-scrubbed since  5 intervals (24h) 4 scrubbing+deep
   1904 PGs not deep-scrubbed since  6 intervals (24h) 5 scrubbing+deep
   1540 PGs not deep-scrubbed since  7 intervals (24h) 7 scrubbing+deep
   1304 PGs not deep-scrubbed since  8 intervals (24h) 7 scrubbing+deep
    923 PGs not deep-scrubbed since  9 intervals (24h) 5 scrubbing+deep
    557 PGs not deep-scrubbed since 10 intervals (24h) 7 scrubbing+deep
    501 PGs not deep-scrubbed since 11 intervals (24h) 2 scrubbing+deep
    363 PGs not deep-scrubbed since 12 intervals (24h) 2 scrubbing+deep
    377 PGs not deep-scrubbed since 13 intervals (24h) 1 scrubbing+deep
    383 PGs not deep-scrubbed since 14 intervals (24h) 2 scrubbing+deep
    252 PGs not deep-scrubbed since 15 intervals (24h) 2 scrubbing+deep
    116 PGs not deep-scrubbed since 16 intervals (24h) 5 scrubbing+deep
     47 PGs not deep-scrubbed since 17 intervals (24h) 2 scrubbing+deep
     10 PGs not deep-scrubbed since 18 intervals (24h)
      2 PGs not deep-scrubbed since 19 intervals (24h) 19.1c6c* 19.a01*
      1 PGs not deep-scrubbed since 20 intervals (24h) 14.1ed*
      2 PGs not deep-scrubbed since 21 intervals (24h) 19.1322* 19.10f6*
      1 PGs not deep-scrubbed since 23 intervals (24h) 19.19cc*
      1 PGs not deep-scrubbed since 24 intervals (24h) 19.179f*

PGs marked with a * are on busy OSDs and not eligible for scrubbing.
The script (pasted here because attaching doesn't work):

# cat bin/scrub-report
#!/bin/bash

# Compute last scrub interval count. Scrub interval 6h, deep-scrub interval 24h.
# Print how many PGs have not been (deep-)scrubbed since #intervals.

ceph -f json pg dump pgs 2>&1 > /root/.cache/ceph/pgs_dump.json
echo ""

T0="$(date +%s)"

scrub_info="$(jq --arg T0 "$T0" -rc '.pg_stats[] | [
    .pgid,
    (.last_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*6)|ceil),
    (.last_deep_scrub_stamp[:19]+"Z" | (($T0|tonumber) - fromdateiso8601)/(60*60*24)|ceil),
    .state,
    (.acting | join(" "))
  ] | @tsv
' /root/.cache/ceph/pgs_dump.json)"

# less <<<"$scrub_info"

# 1      2           3                4       5..NF
# pg_id  scrub-ints  deep-scrub-ints  status  acting[]
awk <<<"$scrub_info" '{
    for(i=5; i<=NF; ++i) pg_osds[$1]=pg_osds[$1] " " $i

    if($4 == "active+clean") {
        si_mx=si_mx<$2 ? $2 : si_mx
        dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
        pg_sn[$2]++
        pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
        pg_dsn[$3]++
        pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
    } else if($4 ~ /scrubbing\+deep/) {
        deep_scrubbing[$3]++
        for(i=5; i<=NF; ++i) osd[$i]="busy"
    } else if($4 ~ /scrubbing/) {
        scrubbing[$2]++
        for(i=5; i<=NF; ++i) osd[$i]="busy"
    } else {
        unclean[$2]++
        unclean_d[$3]++
        si_mx=si_mx<$2 ? $2 : si_mx
        dsi_mx=dsi_mx<$3 ? $3 : dsi_mx
        pg_sn[$2]++
        pg_sn_ids[$2]=pg_sn_ids[$2] " " $1
        pg_dsn[$3]++
        pg_dsn_ids[$3]=pg_dsn_ids[$3] " " $1
        for(i=5; i<=NF; ++i) osd[$i]="busy"
    }
}
END {
    print "Scrub report:"
    for(si=1; si<=si_mx; ++si) {
        if(pg_sn[si]==0 && scrubbing[si]==0 && unclean[si]==0) continue;
        printf("%7d PGs not scrubbed since %2d intervals (6h)", pg_sn[si], si)
        if(scrubbing[si]) printf(" %d scrubbing", scrubbing[si])
        if(unclean[si]) printf(" %d unclean", unclean[si])
        if(pg_sn[si]<=5) {
            split(pg_sn_ids[si], pgs)
            osds_busy=0
            for(pg in pgs) {
                split(pg_osds[pgs[pg]], osds)
                for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
                if(osds_busy) printf(" %s*", pgs[pg])
                if(!osds_busy) printf(" %s", pgs[pg])
            }
        }
        printf("\n")
    }
    print ""
    print "Deep-scrub report:"
    for(dsi=1; dsi<=dsi_mx; ++dsi) {
        if(pg_dsn[dsi]==0 && deep_scrubbing[dsi]==0 && unclean_d[dsi]==0) continue;
        printf("%7d PGs not deep-scrubbed since %2d intervals (24h)", pg_dsn[dsi], dsi)
        if(deep_scrubbing[dsi]) printf(" %d scrubbing+deep", deep_scrubbing[dsi])
        if(unclean_d[dsi]) printf(" %d unclean", unclean_d[dsi])
        if(pg_dsn[dsi]<=5) {
            split(pg_dsn_ids[dsi], pgs)
            osds_busy=0
            for(pg in pgs) {
                split(pg_osds[pgs[pg]], osds)
                for(o in osds) if(osd[osds[o]]=="busy") osds_busy=1
                if(osds_busy) printf(" %s*", pgs[pg])
                if(!osds_busy) printf(" %s", pgs[pg])
            }
        }
        printf("\n")
    }
    print ""
    print "PGs marked with a * are on busy OSDs and not eligible for scrubbing."
}
'

Don't forget the last "'" when copy-pasting.

Thanks for any pointers.

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx