Hello Ceph community,

Wanted to highlight one observation and hear whether any Squid users are having similar experiences.

Since upgrading to 19.2.0 (from 18.2.4) we have observed that PG deep-scrub times have increased drastically. Some PGs take 2-5 days to complete a deep scrub, while others run for 20+ days. This fills up the deep-scrub queue, and the cluster almost constantly has 'pgs not deep-scrubbed in time' warnings.

We average 67 PGs per OSD; on 15 TB HDDs this works out to 200 GB-ish PGs. While fairly large, these PGs did not cause such an increase in deep-scrub times on Reef.

"ceph pg dump | grep 'deep scrubbing for'" always shows a few entries of quite morbid scrubs like the following (1169147 s is roughly 13.5 days, 1871733 s roughly 21.7 days):

    7.3e   121289 0 0 0 0 225333247207 0 0 127 0  127 active+clean+scrubbing+deep 2024-11-13T09:37:42.549418+0000 490179'5220664  490179:23902923 [268,27,122] 268 [268,27,122] 268 483850'5203141  2024-11-02T11:33:57.835277+0000 472713'5197481  2024-10-11T04:30:00.639763+0000 0 21873  deep scrubbing for 1169147s
    34.247 62618  0 0 0 0 179797964677 0 0 101 50 101 active+clean+scrubbing+deep 2024-11-05T06:27:52.288785+0000 490179'22729571 490179:80672442 [34,97,25]   34  [34,97,25]   34  481331'22436869 2024-10-23T16:06:50.092439+0000 471395'22289914 2024-10-07T19:29:26.115047+0000 0 204864 deep scrubbing for 1871733s

Not pointing any fingers, but the Squid release announced "better scrub scheduling". Long scrub runtimes are not a scheduling matter as such, but perhaps that change had some impact causing this behaviour?

Scrubbing configuration (both 2678400 s intervals equal 31 days; 172800 s is 2 days):

    $ ceph config get osd | grep scrub
    global  advanced  osd_deep_scrub_interval                          2678400.000000
    global  advanced  osd_deep_scrub_large_omap_object_key_threshold   500000
    global  advanced  osd_max_scrubs                                   5
    global  advanced  osd_scrub_auto_repair                            true
    global  advanced  osd_scrub_max_interval                           2678400.000000
    global  advanced  osd_scrub_min_interval                           172800.000000

Cluster details (the backfilling is expected, caused by some manual reweights):

    cluster:
      id:     96df99f6-fc1a-11ea-90a4-6cb3113cb732
      health: HEALTH_WARN
              24 pgs not deep-scrubbed in time

    services:
      mon:        5 daemons, quorum ceph-node004,ceph-node003,ceph-node001,ceph-node005,ceph-node002 (age 4d)
      mgr:        ceph-node001.hgythj(active, since 11d), standbys: ceph-node002.jphtvg
      mds:        20/20 daemons up, 12 standby
      osd:        384 osds: 384 up (since 25h), 384 in (since 5d); 5 remapped pgs
      rbd-mirror: 2 daemons active (2 hosts)
      rgw:        64 daemons active (32 hosts, 1 zones)

    data:
      volumes: 1/1 healthy
      pools:   14 pools, 8681 pgs
      objects: 758.42M objects, 1.5 PiB
      usage:   4.6 PiB used, 1.1 PiB / 5.7 PiB avail
      pgs:     275177/2275254543 objects misplaced (0.012%)
               6807 active+clean
               989  active+clean+scrubbing+deep
               880  active+clean+scrubbing
               5    active+remapped+backfilling

    io:
      client:   37 MiB/s rd, 59 MiB/s wr, 1.72k op/s rd, 439 op/s wr
      recovery: 70 MiB/s, 38 objects/s

A thread of other users experiencing the same prolonged deep scrubs on 19.2.0:
https://www.reddit.com/r/ceph/comments/1guynak/strange_issue_where_scrubdeep_scrub_never_finishes/

Any hints or help would be greatly appreciated!

Thanks in advance,
Laimis J.
laimis.juzeliunas@xxxxxxxxxx
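
P.S. In case it helps anyone compare numbers: a minimal sketch for pulling the longest-running deep scrubs out of "ceph pg dump" (assuming the "deep scrubbing for <N>s" suffix shown above, and GNU awk/sort):

    # Print PG id and current deep-scrub duration, longest-running first.
    # "2>/dev/null" drops the "dumped pgs" note that ceph prints to stderr.
    ceph pg dump pgs 2>/dev/null \
        | awk '/deep scrubbing for/ {print $1, $NF}' \
        | sort -k2 -rn \
        | head -20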
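
For anyone poking at the same issue, something along these lines should surface the relevant per-OSD state (osd.268 being the primary of pg 7.3e in the dump above; substitute your own stuck PG and its primary):

    # Effective scrub settings on the primary OSD of the stuck PG
    ceph config show osd.268 | grep scrub

    # Manually kick a deep scrub of the stuck PG and inspect its scrubber state
    ceph pg deep-scrub 7.3e
    ceph pg 7.3e query | grep -i scrub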