Re: Squid: deep scrub issues

Do you have osd_scrub_begin_hour / osd_scrub_end_hour set? Constraining the times when scrubs are allowed to run can result in them piling up.
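
If they are, a quick way to confirm (just a suggestion, assuming the values were set via the config database rather than ceph.conf):

    ceph config get osd osd_scrub_begin_hour
    ceph config get osd osd_scrub_end_hour

If I recall correctly the defaults are 0 and 24 (no restriction), so scrubs can run around the clock; anything narrower effectively shrinks the scrubbing window.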

Are you saying that an individual PG may take 20+ elapsed days to perform a deep scrub?

> Might be the result of osd_scrub_chunk_max now being 15 instead of 25 previously. See [1] and [2].
> 
> 
> [1] https://tracker.ceph.com/issues/68057
> [2] https://github.com/ceph/ceph/pull/59791/commits/0841603023ba53923a986f2fb96ab7105630c9d3
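> 
> If that turns out to be the cause, one reversible check could be to raise the chunk size back to the previous value and watch whether the 'deep scrubbing for ...' durations shrink, e.g.:
> 
>     ceph config get osd osd_scrub_chunk_max
>     ceph config set osd osd_scrub_chunk_max 25
> 
> (Untested suggestion on my side, based only on [1] and [2].)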
> 
> ----- On 26 Nov 24, at 23:36, Laimis Juzeliūnas laimis.juzeliunas@xxxxxxxxxx wrote:
> 
>> Hello Ceph community,
>> 
>> Wanted to highlight one observation and hear from any Squid users having
>> similar experiences.
>> Since upgrading to 19.2.0 (from 18.4.0) we have observed that pg deep scrubbing
>> times have drastically increased. Some pgs take 2-5 days to complete a deep
>> scrub, while others take 20+ days. This causes the deep scrubbing queue to fill
>> up, and the cluster almost constantly has 'pgs not deep-scrubbed in time'
>> alerts.
>> We have on average 67 pgs/osd; running on 15 TB HDDs this results in pgs of
>> roughly 200 GB. While fairly large, these pgs did not cause such an increase in
>> deep scrub times when on Reef.
>> 
>> "ceph pg dump | grep 'deep scrubbing for'" will always have a few entries of
>> quite morbid scrubs like the following:
>> 7.3e      121289                   0         0          0        0  225333247207  0           0   127         0       127  active+clean+scrubbing+deep  2024-11-13T09:37:42.549418+0000     490179'5220664    490179:23902923  [268,27,122]         268   [268,27,122]             268     483850'5203141  2024-11-02T11:33:57.835277+0000     472713'5197481  2024-10-11T04:30:00.639763+0000              0                21873  deep scrubbing for 1169147s
>> 34.247     62618                   0         0          0        0  179797964677  0           0   101        50       101  active+clean+scrubbing+deep  2024-11-05T06:27:52.288785+0000    490179'22729571    490179:80672442  [34,97,25]          34     [34,97,25]              34    481331'22436869  2024-10-23T16:06:50.092439+0000    471395'22289914  2024-10-07T19:29:26.115047+0000              0               204864  deep scrubbing for 1871733s
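>> (For scale: 1169147 s / 86400 ≈ 13.5 days and 1871733 s / 86400 ≈ 21.7 days of
>> a single pg being deep scrubbed.)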
>> 
>> Not pointing any fingers, but the Squid release announced "better scrub
>> scheduling". This is not a scheduling issue per se, but could that change have
>> had an impact causing this behaviour?
>> 
>> Scrubbing configurations:
>> ceph config get osd | grep scrub
>> global        advanced  osd_deep_scrub_interval                          2678400.000000
>> global        advanced  osd_deep_scrub_large_omap_object_key_threshold  500000
>> global        advanced  osd_max_scrubs                                   5
>> global        advanced  osd_scrub_auto_repair                            true
>> global        advanced  osd_scrub_max_interval                           2678400.000000
>> global        advanced  osd_scrub_min_interval                           172800.000000
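>> (These are only the values we have overridden; if useful, something like
>> 
>>     ceph config show-with-defaults osd.0 | grep scrub
>> 
>> on any one OSD, osd.0 here just as an example, should also list the scrub
>> options still at their defaults.)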
>> 
>> 
>> Cluster details (backfilling expected and caused by some manual reweights):
>> cluster:
>>   id:     96df99f6-fc1a-11ea-90a4-6cb3113cb732
>>   health: HEALTH_WARN
>>           24 pgs not deep-scrubbed in time
>> 
>> services:
>>   mon:        5 daemons, quorum
>>   ceph-node004,ceph-node003,ceph-node001,ceph-node005,ceph-node002 (age 4d)
>>   mgr:        ceph-node001.hgythj(active, since 11d), standbys:
>>   ceph-node002.jphtvg
>>   mds:        20/20 daemons up, 12 standby
>>   osd:        384 osds: 384 up (since 25h), 384 in (since 5d); 5 remapped pgs
>>   rbd-mirror: 2 daemons active (2 hosts)
>>   rgw:        64 daemons active (32 hosts, 1 zones)
>> 
>> data:
>>   volumes: 1/1 healthy
>>   pools:   14 pools, 8681 pgs
>>   objects: 758.42M objects, 1.5 PiB
>>   usage:   4.6 PiB used, 1.1 PiB / 5.7 PiB avail
>>   pgs:     275177/2275254543 objects misplaced (0.012%)
>>            6807 active+clean
>>            989  active+clean+scrubbing+deep
>>            880  active+clean+scrubbing
>>            5    active+remapped+backfilling
>> 
>> io:
>>   client:   37 MiB/s rd, 59 MiB/s wr, 1.72k op/s rd, 439 op/s wr
>>   recovery: 70 MiB/s, 38 objects/s
>> 
>> 
>> One thread of other users experiencing the same prolonged deep scrub issues on 19.2.0:
>> https://www.reddit.com/r/ceph/comments/1guynak/strange_issue_where_scrubdeep_scrub_never_finishes/
>> Any hints or help would be greatly appreciated!
>> 
>> 
>> Thanks in advance,
>> Laimis J.
>> laimis.juzeliunas@xxxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



