Michel,
Glad to know that was it.
I was wondering when a per-OSD osd_mclock_max_capacity_iops_hdd value
would be set in the cluster's config database, since I don't have any
set in my lab.
It turns out the per-OSD osd_mclock_max_capacity_iops_hdd is only
stored when the measured value is below
osd_mclock_iops_capacity_threshold_hdd; otherwise the OSD falls back
to the default value of 315.
Probably to rule out any insanely high measured values. It would have
been nice to also rule out any insanely low ones. :-)
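To see which OSDs ended up with a stored override, and the threshold
currently in effect, something like this should do:
ceph config dump | grep osd_mclock_max_capacity_iops
ceph config get osd osd_mclock_iops_capacity_threshold_hdd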
Now either:
A/ these incredibly low values were calculated a while back with an
immature version of the code or under some specific hardware
conditions, and you can hope this won't happen again
OR
B/ you don't want to rely on hope too much and you'd prefer to disable
the automatic calculation (osd_mclock_skip_benchmark = true) and set
osd_mclock_max_capacity_iops_[hdd,ssd] yourself (globally or using a
rack/host mask) after a careful evaluation of your OSDs' performance.
B/ would be more deterministic :-)
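For B/, that would look something like this (the figures are just
placeholders to be replaced by your own bench results; a host or rack
mask such as osd/host:<hostname> works as well):
ceph config set osd osd_mclock_skip_benchmark true
ceph config set osd/class:hdd osd_mclock_max_capacity_iops_hdd 450
ceph config set osd/class:ssd osd_mclock_max_capacity_iops_ssd 15000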
Cheers,
Frédéric.
------------------------------------------------------------------------
*From: *Michel <michel.jouvin@xxxxxxxxxxxxxxx>
*To: *Frédéric <frederic.nass@xxxxxxxxxxxxxxxx>
*Cc: *Pierre <pierre@xxxxxxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
*Sent: *Friday, 22 March 2024 14:44 CET
*Subject: *Re: Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month
Hi Frédéric,
I think you raise the right point; sorry if I misunderstood Pierre's
suggestion to look at OSD performance. Just before reading your email,
I was implementing Pierre's suggestion for osd_max_scrubs and I saw
osd_mclock_max_capacity_iops_hdd set for a few OSDs (I guess those
with a value different from the default). For the suspect OSD, the
value is very low, 0.145327, and I suspect it is the cause of the
problem. A few others have a value of ~5, which I also find very low
(all OSDs are using the same recent HW/HDD).
Thanks for this information. I'll follow your suggestions to rerun the
benchmark and report whether it improves the situation.
Best regards,
Michel
On 22/03/2024 at 12:18, Frédéric Nass wrote:
> Hello Michel,
>
> Pierre also suggested checking the performance of this OSD's
> device(s), which can be done by running 'ceph tell osd.x bench'.
>
> One thing I can think of is that the scrubbing speed of this
> particular OSD could be throttled by mclock scheduling, if the max
> IOPS capacity calculated by this OSD during its initialization is
> significantly lower than the other OSDs'.
>
> What I would do is check (in this OSD's log) the calculated max IOPS
> capacity and compare it to the other OSDs'. You can also force a
> recalculation by running 'ceph config set osd.x
> osd_mclock_force_run_benchmark_on_init true' and restarting this OSD.
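> For example, something along these lines (a sketch assuming a cephadm
> deployment, adapt the restart command otherwise; osd.x being the OSD id):
>
> ceph config set osd.x osd_mclock_force_run_benchmark_on_init true
> ceph orch daemon restart osd.x
> ceph config show osd.x osd_mclock_max_capacity_iops_hdd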
>
> Also I would:
>
> - compare the running OSD's mclock values (cephadm shell ceph daemon
> osd.x config show | grep mclock) to other OSDs'.
> - compare 'ceph tell osd.x bench' results to other OSDs' benchmarks.
> - compare the rotational status of this OSD's db and data devices to
> other OSDs', to make sure things are in order.
>
> Bests,
> Frédéric.
>
> PS: If mclock is the culprit here, then setting osd_op_queue back to
> wpq for this OSD only would probably reveal it. Not sure about the
> implications of having a single OSD running a different scheduler in
> the cluster, though.
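>
> Something along these lines should do it (a sketch; the queue change
> only takes effect after restarting that OSD, cephadm restart shown):
>
> ceph config set osd.x osd_op_queue wpq
> ceph orch daemon restart osd.x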
>
>
> ----- On 22 Mar 24, at 10:11, Michel Jouvin
michel.jouvin@xxxxxxxxxxxxxxx wrote:
>
>> Pierre,
>>
>> Yes, as mentioned in my initial email, I checked the OSD state
and found
>> nothing wrong either in the OSD logs or in the system logs
(SMART errors).
>>
>> Thanks for the advice about increasing osd_max_scrubs; I may try it,
>> but I doubt it is a contention problem because it only affects a
>> fixed set of PGs (no new PGs have a stuck scrub) and there is
>> significant scrubbing activity going on continuously (~10K PGs in
>> the cluster).
>>
>> Again, it is not a problem for me to kick out the suspect OSD and
>> see if it fixes the issue, but this cluster is pretty simple and low
>> in terms of activity, and I see nothing that would explain why we
>> have this situation on a fairly new cluster (9 months old, created
>> in Quincy) and not on our 2 other production clusters, which are
>> much more heavily used, one of them being the backend storage of a
>> significant OpenStack cloud, a cluster created 10 years ago with
>> Infernalis and upgraded since then, a better candidate for this kind
>> of problem! So, I'm happy to contribute to troubleshooting a
>> potential issue in Reef if somebody finds it useful and can help.
>> Otherwise I'll try the approach that worked for Gunnar.
>>
>> Best regards,
>>
>> Michel
>>
>> On 22/03/2024 at 09:59, Pierre Riteau wrote:
>>> Hello Michel,
>>>
>>> It might be worth mentioning that the next releases of Reef and
>>> Quincy should increase the default value of osd_max_scrubs from 1
>>> to 3. See the Reef pull request: https://github.com/ceph/ceph/pull/55173
>>> You could try increasing this configuration setting if you haven't
>>> already, but note that it can impact client I/O performance.
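>>> For example (just a sketch; 2 being a middle ground between the
>>> current default of 1 and the upcoming default of 3):
>>>
>>> ceph config set osd osd_max_scrubs 2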
>>>
>>> Also, if the delays appear to be related to a single OSD, have
you
>>> checked the health and performance of this device?
>>>
>>> On Fri, 22 Mar 2024 at 09:29, Michel Jouvin
>>> <michel.jouvin@xxxxxxxxxxxxxxx> wrote:
>>>
>>> Hi,
>>>
>>> As I said in my initial message, I had in mind to do exactly the
>>> same, as I identified in my initial analysis that all the PGs with
>>> this problem were sharing one OSD (but only 20 PGs out of the ~200
>>> hosted by that OSD had the problem). But as I don't feel I'm in an
>>> urgent situation, I was wondering whether collecting more
>>> information on the problem may have some value, and which
>>> information... If it helps, I add below the `pg dump` for the 17
>>> PGs still with a stuck scrub.
>>>
>>> I observed that the number of stuck scrubs is decreasing very
>>> slowly: in the last 12 hours, 1 more PG was successfully
>>> scrubbed/deep scrubbed. In case it was not clear in my initial
>>> message, the lists of PGs with a too old scrub and a too old deep
>>> scrub are the same.
>>>
>>> Without an answer, next week I may consider doing what you did:
>>> remove the suspect OSD (instead of just restarting it) and see if
>>> it unblocks the stuck scrubs.
>>>
>>> Best regards,
>>>
>>> Michel
>>>
>>> --------------------------------- "ceph pg dump pgs" for the 17
>>> PGs with
>>> a too old scrub and deep scrub (same list)
>>> ------------------------------------------------------------
>>>
>>> PG_STAT OBJECTS MISSING_ON_PRIMARY DEGRADED MISPLACED UNFOUND
>>> BYTES OMAP_BYTES* OMAP_KEYS* LOG LOG_DUPS DISK_LOG
STATE
>>> STATE_STAMP VERSION REPORTED
>>> UP UP_PRIMARY ACTING ACTING_PRIMARY
>>> LAST_SCRUB SCRUB_STAMP LAST_DEEP_SCRUB
>>> DEEP_SCRUB_STAMP SNAPTRIMQ_LEN LAST_SCRUB_DURATION
>>> SCRUB_SCHEDULING OBJECTS_SCRUBBED OBJECTS_TRIMMED
>>> 29.7e3 260 0 0 0 0
>>> 1090519040 0 0 1978 500
>>> 1978 active+clean
2024-03-21T18:28:53.369789+0000
>>> 39202'2478 83812:97136 [29,141,64,194] 29
>>> [29,141,64,194] 29 39202'2478
>>> 2024-02-17T19:56:34.413412+0000 39202'2478
>>> 2024-02-17T19:56:34.413412+0000 0 3 queued for
deep
>>> scrub
>>> 0 0
>>> 25.7cc 0 0 0 0 0
>>> 0 0 0 0 1076 0
>>> active+clean 2024-03-21T18:09:48.104279+0000 46253'548
>>> 83812:89843 [29,50,173] 29 [29,50,173]
>>> 29 39159'536 2024-02-17T18:14:54.950401+0000 39159'536
>>> 2024-02-17T18:14:54.950401+0000 0 1 queued for
deep
>>> scrub
>>> 0 0
>>> 25.70c 0 0 0 0 0
>>> 0 0 0 0 918 0
>>> active+clean 2024-03-21T18:00:57.942902+0000 46253'514
>>> 83812:95212 [29,195,185] 29
[29,195,185] 29
>>> 39159'530 2024-02-18T03:56:17.559531+0000 39159'530
>>> 2024-02-16T17:39:03.281785+0000 0 1 queued for
deep
>>> scrub
>>> 0 0
>>> 29.70c 249 0 0 0 0
>>> 1044381696 0 0 1987 600
>>> 1987 active+clean
2024-03-21T18:35:36.848189+0000
>>> 39202'2587 83812:99628 [29,138,63,12] 29
>>> [29,138,63,12] 29 39202'2587
>>> 2024-02-17T21:34:22.042560+0000 39202'2587
>>> 2024-02-17T21:34:22.042560+0000 0 1 queued for
deep
>>> scrub
>>> 0 0
>>> 29.705 231 0 0 0 0
>>> 968884224 0 0 1959 500 1959
>>> active+clean 2024-03-21T18:18:22.028551+0000 39202'2459
>>> 83812:91258 [29,147,173,61] 29 [29,147,173,61]
>>> 29 39202'2459 2024-02-17T16:41:40.421763+0000 39202'2459
>>> 2024-02-17T16:41:40.421763+0000 0 1 queued for
deep
>>> scrub
>>> 0 0
>>> 29.6b9 236 0 0 0 0
>>> 989855744 0 0 1956 500 1956
>>> active+clean 2024-03-21T18:11:29.912132+0000 39202'2456
>>> 83812:95607 [29,199,74,16] 29 [29,199,74,16]
>>> 29 39202'2456 2024-02-17T11:46:06.706625+0000 39202'2456
>>> 2024-02-17T11:46:06.706625+0000 0 1 queued for
deep
>>> scrub
>>> 0 0
>>> 25.56e 0 0 0 0 0
>>> 0 0 0 0 1158 0
>>> active+clean+scrubbing+deep 2024-03-22T08:09:38.840145+0000
>>> 46253'514 83812:637482 [111,29,128] 111
>>> [111,29,128] 111 39159'579
>>> 2024-03-06T17:57:53.158936+0000 39159'579
>>> 2024-03-06T17:57:53.158936+0000 0 1 queued for
deep
>>> scrub
>>> 0 0
>>> 25.56a 0 0 0 0 0
>>> 0 0 0 0 1055 0
>>> active+clean 2024-03-21T18:00:57.940851+0000 46253'545
>>> 83812:93475 [29,19,211] 29 [29,19,211]
>>> 29 46253'545 2024-03-07T11:12:45.881545+0000 46253'545
>>> 2024-03-07T11:12:45.881545+0000 0 28 queued for
deep
>>> scrub
>>> 0 0
>>> 25.55a 0 0 0 0 0
>>> 0 0 0 0 1022 0
>>> active+clean 2024-03-21T18:10:24.124914+0000 46253'565
>>> 83812:89876 [29,58,195] 29 [29,58,195]
>>> 29 46253'561 2024-02-17T06:54:35.320454+0000 46253'561
>>> 2024-02-17T06:54:35.320454+0000 0 28 queued for
deep
>>> scrub
>>> 0 0
>>> 29.c0 256 0 0 0 0
>>> 1073741824 0 0 1986 600 1986
>>> active+clean+scrubbing+deep 2024-03-22T08:09:12.849868+0000
>>> 39202'2586 83812:603625 [22,150,29,56] 22
>>> [22,150,29,56] 22 39202'2586
>>> 2024-03-07T18:53:22.952868+0000 39202'2586
>>> 2024-03-07T18:53:22.952868+0000 0 1 queued for
deep
>>> scrub
>>> 0 0
>>> 18.6 15501 0 0 0 0
>>> 63959444676 0 0 2068 3000 2068
>>> active+clean+scrubbing+deep 2024-03-22T02:29:24.508889+0000
>>> 81688'663900 83812:1272160 [187,29,211] 187
>>> [187,29,211] 187 52735'663878
>>> 2024-03-06T16:36:32.080259+0000 52735'663878
>>> 2024-03-06T16:36:32.080259+0000 0 684445 deep
scrubbing
>>> for 20373s 449 0
>>> 16.15 0 0 0 0 0
>>> 0 0 0 0 0 0
>>> active+clean 2024-03-21T18:20:29.632554+0000 0'0
>>> 83812:104893 [29,165,85] 29 [29,165,85]
>>> 29 0'0 2024-02-17T06:54:06.370647+0000
0'0
>>> 2024-02-17T06:54:06.370647+0000 0 28 queued for
deep
>>> scrub
>>> 0 0
>>> 25.45 0 0 0 0 0
>>> 0 0 0 0 1036 0
>>> active+clean 2024-03-21T18:10:24.125134+0000 39159'561
>>> 83812:93649 [29,13,58] 29 [29,13,58]
>>> 29 39159'512 2024-02-27T12:27:35.728176+0000 39159'512
>>> 2024-02-27T12:27:35.728176+0000 0 1 queued for
deep
>>> scrub
>>> 0 0
>>> 29.249 260 0 0 0 0
>>> 1090519040 0 0 1970 500
>>> 1970 active+clean
2024-03-21T18:29:22.588805+0000
>>> 39202'2470 83812:96016 [29,191,18,143] 29
>>> [29,191,18,143] 29 39202'2470
>>> 2024-02-17T13:32:42.910335+0000 39202'2470
>>> 2024-02-17T13:32:42.910335+0000 0 1 queued for
deep
>>> scrub
>>> 0 0
>>> 29.25a 248 0 0 0 0
>>> 1040187392 0 0 1952 600
>>> 1952 active+clean
2024-03-21T18:20:29.623422+0000
>>> 39202'2552 83812:99157 [29,200,85,164] 29
>>> [29,200,85,164] 29 39202'2552
>>> 2024-02-17T08:33:14.326087+0000 39202'2552
>>> 2024-02-17T08:33:14.326087+0000 0 1 queued for
deep
>>> scrub
>>> 0 0
>>> 25.3cf 0 0 0 0 0
>>> 0 0 0 0 1343 0
>>> active+clean 2024-03-21T18:16:00.933375+0000 46253'598
>>> 83812:91659 [29,75,175] 29 [29,75,175]
>>> 29 46253'598 2024-02-17T11:48:51.840600+0000 46253'598
>>> 2024-02-17T11:48:51.840600+0000 0 28 queued for
deep
>>> scrub
>>> 0 0
>>> 29.4ec 243 0 0 0 0
>>> 1019215872 0 0 1933 500
>>> 1933 active+clean
2024-03-21T18:15:35.389598+0000
>>> 39202'2433 83812:101501 [29,206,63,17] 29
>>> [29,206,63,17] 29 39202'2433
>>> 2024-02-17T15:10:41.027755+0000 39202'2433
>>> 2024-02-17T15:10:41.027755+0000 0 3 queued for
deep
>>> scrub
>>> 0 0
>>>
>>>
>>> On 22/03/2024 at 08:16, Bandelow, Gunnar wrote:
>>> > Hi Michel,
>>> >
>>> > I think yesterday I found the culprit in my case.
>>> >
>>> > After inspecting "ceph pg dump", and especially the column
>>> > "last_scrub_duration", I found that every PG without proper
>>> > scrubbing was located on one of three OSDs (and all these OSDs
>>> > share the same SSD for their DB). I put them on "out" and now,
>>> > after backfill and remapping, everything seems to be fine.
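>>> > For reference, that boiled down to something like this (OSD ids
>>> > are just examples):
>>> >
>>> > ceph pg dump pgs    # look at the LAST_SCRUB_DURATION column
>>> > ceph osd out 10 11 12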
>>> >
>>> > Only the log is still flooded with "scrub starts" messages, and I
>>> > have no clue why these OSDs are causing the problems.
>>> > Will investigate further.
>>> >
>>> > Best regards,
>>> > Gunnar
>>> >
>>> > ===================================
>>> >
>>> > Gunnar Bandelow
>>> > Universitätsrechenzentrum (URZ)
>>> > Universität Greifswald
>>> > Felix-Hausdorff-Straße 18
>>> > 17489 Greifswald
>>> > Germany
>>> >
>>> > Tel.: +49 3834 420 1450
>>> >
>>> >
>>> > --- Original Message ---
>>> > *Subject: * Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month
>>> > *From: *"Michel Jouvin" <michel.jouvin@xxxxxxxxxxxxxxx>
>>> > *To: *ceph-users@xxxxxxx
>>> > *Date: *21-03-2024 23:40
>>> >
>>> >
>>> >
>>> > Hi,
>>> >
>>> > Today we decided to upgrade from 18.2.0 to 18.2.2. No real hope
>>> > of a direct impact (nothing in the change log related to
>>> > something similar), but at least all daemons were restarted, so
>>> > we thought that maybe this would clear the problem, at least
>>> > temporarily. Unfortunately that has not been the case. The same
>>> > PGs are still stuck, despite continuous scrubbing/deep scrubbing
>>> > activity in the cluster...
>>> >
>>> > I'm happy to provide more information if somebody tells me what
>>> > to look at...
>>> >
>>> > Cheers,
>>> >
>>> > Michel
>>> >
>>> > On 21/03/2024 at 14:40, Bernhard Krieger wrote:
>>> > > Hi,
>>> > >
>>> > > I have the same issues.
>>> > > Deep scrubs haven't finished on some PGs.
>>> > >
>>> > > Using Ceph 18.2.2.
>>> > > The initially installed version was 18.0.0.
>>> > >
>>> > >
>>> > > In the logs I see a lot of scrub/deep-scrub starts:
>>> > >
>>> > > Mar 21 14:21:09 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 13.b deep-scrubstarts
>>> > > Mar 21 14:21:10 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 13.1a deep-scrubstarts
>>> > > Mar 21 14:21:17 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 13.1c deep-scrubstarts
>>> > > Mar 21 14:21:19 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 11.1 scrubstarts
>>> > > Mar 21 14:21:27 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 14.6 scrubstarts
>>> > > Mar 21 14:21:30 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 10.c deep-scrubstarts
>>> > > Mar 21 14:21:35 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 12.3 deep-scrubstarts
>>> > > Mar 21 14:21:41 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 6.0 scrubstarts
>>> > > Mar 21 14:21:44 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 8.5 deep-scrubstarts
>>> > > Mar 21 14:21:45 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 5.66 deep-scrubstarts
>>> > > Mar 21 14:21:49 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 5.30 deep-scrubstarts
>>> > > Mar 21 14:21:50 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 13.b deep-scrubstarts
>>> > > Mar 21 14:21:52 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 13.1a deep-scrubstarts
>>> > > Mar 21 14:21:54 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 13.1c deep-scrubstarts
>>> > > Mar 21 14:21:55 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 11.1 scrubstarts
>>> > > Mar 21 14:21:58 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 14.6 scrubstarts
>>> > > Mar 21 14:22:01 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 10.c deep-scrubstarts
>>> > > Mar 21 14:22:04 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 12.3 scrubstarts
>>> > > Mar 21 14:22:13 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 6.0 scrubstarts
>>> > > Mar 21 14:22:15 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 8.5 deep-scrubstarts
>>> > > Mar 21 14:22:20 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 5.66 deep-scrubstarts
>>> > > Mar 21 14:22:27 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 5.30 scrubstarts
>>> > > Mar 21 14:22:30 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 13.b deep-scrubstarts
>>> > > Mar 21 14:22:32 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 13.1a deep-scrubstarts
>>> > > Mar 21 14:22:33 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 13.1c deep-scrubstarts
>>> > > Mar 21 14:22:35 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 11.1 deep-scrubstarts
>>> > > Mar 21 14:22:37 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 14.6 scrubstarts
>>> > > Mar 21 14:22:38 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 10.c scrubstarts
>>> > > Mar 21 14:22:39 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 12.3 scrubstarts
>>> > > Mar 21 14:22:41 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 6.0 deep-scrubstarts
>>> > > Mar 21 14:22:43 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 8.5 deep-scrubstarts
>>> > > Mar 21 14:22:46 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 5.66 deep-scrubstarts
>>> > > Mar 21 14:22:49 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 5.30 scrubstarts
>>> > > Mar 21 14:22:55 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 13.b deep-scrubstarts
>>> > > Mar 21 14:22:57 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 13.1a deep-scrubstarts
>>> > > Mar 21 14:22:58 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 13.1c deep-scrubstarts
>>> > > Mar 21 14:23:03 ceph-node10 ceph-osd[3804193]:
>>> log_channel(cluster)
>>> > > log [DBG] : 11.1 deep-scrubstarts
>>> > >
>>> > >
>>> > >
>>> > > The amount of scrubbed/deep-scrubbed PGs changes every few
>>> > > seconds:
>>> > >
>>> > > [root@ceph-node10 ~]# ceph -s | grep active+clean
>>> > > pgs: 214 active+clean
>>> > > 50 active+clean+scrubbing+deep
>>> > > 25 active+clean+scrubbing
>>> > > [root@ceph-node10 ~]# ceph -s | grep active+clean
>>> > > pgs: 208 active+clean
>>> > > 53 active+clean+scrubbing+deep
>>> > > 28 active+clean+scrubbing
>>> > > [root@ceph-node10 ~]# ceph -s | grep active+clean
>>> > > pgs: 208 active+clean
>>> > > 53 active+clean+scrubbing+deep
>>> > > 28 active+clean+scrubbing
>>> > > [root@ceph-node10 ~]# ceph -s | grep active+clean
>>> > > pgs: 207 active+clean
>>> > > 54 active+clean+scrubbing+deep
>>> > > 28 active+clean+scrubbing
>>> > > [root@ceph-node10 ~]# ceph -s | grep active+clean
>>> > > pgs: 202 active+clean
>>> > > 56 active+clean+scrubbing+deep
>>> > > 31 active+clean+scrubbing
>>> > > [root@ceph-node10 ~]# ceph -s | grep active+clean
>>> > > pgs: 213 active+clean
>>> > > 45 active+clean+scrubbing+deep
>>> > > 31 active+clean+scrubbing
>>> > >
>>> > > ceph pg dump shows PGs which have not been deep scrubbed since
>>> > > January. Some PGs have been deep scrubbing for over 700000 seconds.
>>> > >
>>> > > [ceph: root@ceph-node10 /]# ceph pg dump pgs | grep -e 'scrubbing f'
>>> > > 5.6e 221223 0 0
0
>>> 0
>>> > > 927795290112 0 0 4073 3000
>>> 4073
>>> > > active+clean+scrubbing+deep
2024-03-20T01:07:21.196293+
>>> > > 0000 128383'15766927 128383:20517419
[2,4,18,16,14,21]
>>> > 2
>>> > > [2,4,18,16,14,21] 2 125519'12328877
>>> > > 2024-01-23T11:25:35.503811+0000 124844'11873951
>>> 2024-01-21T22:
>>> > > 24:12.620693+0000 0 5
deep
>>> > scrubbing
>>> > > for 270790s
53772
>>> > > 0
>>> > > 5.6c 221317 0 0
0
>>> 0
>>> > > 928173256704 0 0 6332 0
>>> 6332
>>> > > active+clean+scrubbing+deep
2024-03-18T09:29:29.233084+
>>> > > 0000 128382'15788196 128383:20727318
[6,9,12,14,1,4]
>>> > 6
>>> > > [6,9,12,14,1,4] 6 127180'14709746
>>> > > 2024-03-06T12:47:57.741921+0000 124817'11821502
>>> 2024-01-20T20:
>>> > > 59:40.566384+0000 0 13452
deep
>>> > scrubbing
>>> > > for 273519s
122803
>>> > > 0
>>> > > 5.6a 221325 0 0
0
>>> 0
>>> > > 928184565760 0 0 4649 3000
>>> 4649
>>> > > active+clean+scrubbing+deep
2024-03-13T03:48:54.065125+
>>> > > 0000 128382'16031499 128383:21221685
[13,11,1,2,9,8]
>>> > 13
>>> > > [13,11,1,2,9,8] 13 127181'14915404
>>> > > 2024-03-06T13:16:58.635982+0000 125967'12517899
>>> 2024-01-28T09:
>>> > > 13:08.276930+0000 0 10078
deep
>>> > scrubbing
>>> > > for 726001s
184819
>>> > > 0
>>> > > 5.54 221050 0 0
0
>>> 0
>>> > > 927036203008 0 0 4864 3000
>>> 4864
>>> > > active+clean+scrubbing+deep
2024-03-18T00:17:48.086231+
>>> > > 0000 128383'15584012 128383:20293678
[0,20,18,19,11,12]
>>> > 0
>>> > > [0,20,18,19,11,12] 0 127195'14651908
>>> > > 2024-03-07T09:22:31.078448+0000 124816'11813857
>>> 2024-01-20T16:
>>> > > 43:15.755200+0000 0 9808
deep
>>> > scrubbing
>>> > > for 306667s
142126
>>> > > 0
>>> > > 5.47 220849 0 0
0
>>> 0
>>> > > 926233448448 0 0 5592 0
>>> 5592
>>> > > active+clean+scrubbing+deep
2024-03-12T08:10:39.413186+
>>> > > 0000 128382'15653864 128383:20403071
[16,15,20,0,13,21]
>>> > 16
>>> > > [16,15,20,0,13,21] 16 127183'14600433
>>> > > 2024-03-06T18:21:03.057165+0000 124809'11792397
>>> 2024-01-20T05:
>>> > > 27:07.617799+0000 0 13066
deep
>>> > scrubbing
>>> > > for 796697s
209193
>>> > > 0
>>> > > dumped pgs
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > regards
>>> > > Bernhard
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > >
>>> > > On 20/03/2024 21:12, Bandelow, Gunnar wrote:
>>> > >> Hi,
>>> > >>
>>> > >> I just wanted to mention that I am running a cluster with
>>> > >> Reef 18.2.1 with the same issue.
>>> > >>
>>> > >> 4 PGs have been starting to deep scrub but not finishing since
>>> > >> mid-February. In the pg dump they are shown as scheduled for
>>> > >> deep scrub. They sometimes change their status from
>>> > >> active+clean to active+clean+scrubbing+deep and back.
>>> > >>
>>> > >> Best regards,
>>> > >> Gunnar
>>> > >>
>>> > >> =======================================================
>>> > >>
>>> > >> Gunnar Bandelow
>>> > >> Universitätsrechenzentrum (URZ)
>>> > >> Universität Greifswald
>>> > >> Felix-Hausdorff-Straße 18
>>> > >> 17489 Greifswald
>>> > >> Germany
>>> > >>
>>> > >> Tel.: +49 3834 420 1450
>>> > >>
>>> > >>
>>> > >>
>>> > >>
>>> > >> --- Original Message ---
>>> > >> *Subject: * Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month
>>> > >> *From: *"Michel Jouvin" <michel.jouvin@xxxxxxxxxxxxxxx>
>>> > >> *To: *ceph-users@xxxxxxx
>>> > >> *Date: *20-03-2024 20:00
>>> > >>
>>> > >>
>>> > >>
>>> > >> Hi Rafael,
>>> > >>
>>> > >> Good to know I am not alone!
>>> > >>
>>> > >> Additional information ~6h after the OSD restart: of the 20
>>> > >> PGs impacted, 2 have been processed successfully... I don't
>>> > >> have a clear picture of how Ceph prioritizes the scrub of one
>>> > >> PG over another; I had thought that the oldest/expired scrubs
>>> > >> were taken first, but it may not be the case. Anyway, I have
>>> > >> seen a very significant decrease in scrub activity this
>>> > >> afternoon, and the cluster is not loaded at all (almost no
>>> > >> users yet)...
>>> > >>
>>> > >> Michel
>>> > >>
>>> > >> On 20/03/2024 at 17:55, quaglio@xxxxxxxxxx wrote:
>>> > >> > Hi,
>>> > >> > I upgraded a cluster 2 weeks ago here. The situation is the
>>> > >> > same as Michel's.
>>> > >> > A lot of PGs are not scrubbed/deep-scrubbed.
>>> > >> >
>>> > >> > Rafael.
>>> > >> >