Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

Perhaps emitting an extremely low value could still be useful for identifying a compromised drive?

> On Mar 22, 2024, at 12:49, Michel Jouvin <michel.jouvin@xxxxxxxxxxxxxxx> wrote:
> 
> Frédéric,
> 
> We arrived at the same conclusions! I agree that a sanity check on insanely low values would be a good addition: the idea would be that the benchmark emits a warning about the measured value but does not set a value lower than a defined minimum. I don't have a precise idea of the possible bad side effects of such an approach...
> 
> Thanks for your help.
> 
> Michel
> 
> On 22/03/2024 at 16:29, Frédéric Nass wrote:
>> Michel,
>> Glad to know that was it.
>> I was wondering when the per-OSD osd_mclock_max_capacity_iops_hdd value would get set in the cluster's config database, since I don't have any set in my lab.
>> It turns out the per-OSD osd_mclock_max_capacity_iops_hdd is only set when the calculated value is below osd_mclock_iops_capacity_threshold_hdd; otherwise the OSD uses the default value of 315.
>> That is probably meant to rule out any insanely high calculated values. It would have been nice to also rule out any insanely low measured values. :-)
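>> As a quick check (a rough sketch; osd.29 below just stands in for the suspect OSD, adjust to your own ids):
>>
>>     ceph config dump | grep osd_mclock_max_capacity_iops        # per-OSD values persisted in the config database
>>     ceph config get osd osd_mclock_iops_capacity_threshold_hdd  # threshold below which a measured value is kept
>>     ceph config show osd.29 osd_mclock_max_capacity_iops_hdd    # value the running OSD actually uses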
>> Now either:
>> A/ these incredibly low values were calculated a while back with an immature version of the code, or under some specific hardware conditions, and you can hope this won't happen again
>> OR
>> B/ you don't want to rely too much on hope and you'd prefer to disable the automatic calculation (osd_mclock_skip_benchmark = true) and set osd_mclock_max_capacity_iops_[hdd,ssd] yourself (globally or using a rack/host mask) after a careful evaluation of the performance of your OSDs (see the sketch below).
>> B/ would be more deterministic :-)
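>> For B/, a minimal sketch; the 150/5000 figures are placeholders, not recommendations, to be replaced with the results of your own 'ceph tell osd.x bench' runs and testing:
>>
>>     ceph config set osd osd_mclock_skip_benchmark true
>>     ceph config set osd/class:hdd osd_mclock_max_capacity_iops_hdd 150     # placeholder value
>>     ceph config set osd/class:ssd osd_mclock_max_capacity_iops_ssd 5000    # placeholder value
>>     # or with a host mask, e.g.: ceph config set osd/host:<hostname> osd_mclock_max_capacity_iops_hdd 150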
>> Cheers,
>> Frédéric.
>> 
>>    ------------------------------------------------------------------------
>>    *From:* Michel <michel.jouvin@xxxxxxxxxxxxxxx>
>>    *To:* Frédéric <frederic.nass@xxxxxxxxxxxxxxxx>
>>    *Cc:* Pierre <pierre@xxxxxxxxxxxx>; ceph-users <ceph-users@xxxxxxx>
>>    *Sent:* Friday, 22 March 2024 14:44 CET
>>    *Subject:* Re:  Re: Reef (18.2): Some PG not
>>    scrubbed/deep scrubbed for 1 month
>> 
>>    Hi Frédéric,
>> 
>>    I think you raise the right point, sorry if I misunderstood Pierre's
>>    suggestion to look at OSD performance. Just before reading your email,
>>    I was implementing Pierre's suggestion for osd_max_scrubs and I saw
>>    osd_mclock_max_capacity_iops_hdd set for a few OSDs (I guess those with
>>    a value different from the default). For the suspect OSD, the value is
>>    very low, 0.145327, and I suspect it is the cause of the problem. A few
>>    others have a value of ~5, which I also find very low (all OSDs are
>>    using the same recent HW/HDD).
>> 
>>    Thanks for this information. I'll follow your suggestions, rerun the
>>    benchmark, and report whether it improves the situation.
>> 
>>    Best regards,
>> 
>>    Michel
>> 
>>    On 22/03/2024 at 12:18, Frédéric Nass wrote:
>>    > Hello Michel,
>>    >
>>    > Pierre also suggested checking the performance of this OSD's
>>    device(s), which can be done by running 'ceph tell osd.x bench'.
>>    >
>>    > One thing I can think of is how the scrubbing speed of this very
>>    OSD could be influenced by mclock scheduling, should the max IOPS
>>    capacity calculated by this OSD during its initialization be
>>    significantly lower than that of the other OSDs.
>>    >
>>    > What I would do is check (from this OSD's log) the calculated
>>    value for max IOPS capacity and compare it to other OSDs. If needed,
>>    force a recalculation by setting 'ceph config set osd.x
>>    osd_mclock_force_run_benchmark_on_init true' and restarting this OSD.
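>>    > Roughly (a sketch; osd.29 stands in for the suspect OSD, and the
>>    log path assumes default on-disk logging rather than journald):
>>    >
>>    >     ceph config set osd.29 osd_mclock_force_run_benchmark_on_init true
>>    >     ceph orch daemon restart osd.29
>>    >     grep -i 'iops' /var/log/ceph/*/ceph-osd.29.log    # look for the benchmark result logged at startup
>>    >     ceph config show osd.29 osd_mclock_max_capacity_iops_hdd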
>>    >
>>    > Also I would (commands sketched below):
>>    >
>>    > - compare the running OSD's mclock values (cephadm shell ceph daemon
>>    osd.x config show | grep mclock) to the other OSDs'.
>>    > - compare 'ceph tell osd.x bench' to the other OSDs' benchmarks.
>>    > - compare the rotational status of this OSD's DB and data
>>    devices to the other OSDs', to make sure things are in order.
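>>    > A rough sketch of those checks (osd.29 as the suspect, osd.50 as an
>>    arbitrary healthy OSD for comparison):
>>    >
>>    >     cephadm shell -- ceph daemon osd.29 config show | grep mclock
>>    >     ceph tell osd.29 bench
>>    >     ceph tell osd.50 bench
>>    >     ceph osd metadata osd.29 | grep -E 'rotational|devices'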
>>    >
>>    > Bests,
>>    > Frédéric.
>>    >
>>    > PS: If mclock is the culprit here, then setting osd_op_queue
>>    back to wpq for this OSD only would probably reveal it. I'm not sure
>>    about the implications of having a single OSD running a different
>>    scheduler in the cluster, though.
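>>    > Something like this, as a sketch (osd.29 again standing in for the
>>    suspect OSD; the change needs an OSD restart to take effect):
>>    >
>>    >     ceph config set osd.29 osd_op_queue wpq
>>    >     ceph orch daemon restart osd.29
>>    >     # revert later with: ceph config rm osd.29 osd_op_queue   (and another restart)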
>>    >
>>    >
>>    > ----- On 22 Mar 24, at 10:11, Michel Jouvin
>>    michel.jouvin@xxxxxxxxxxxxxxx wrote:
>>    >
>>    >> Pierre,
>>    >>
>>    >> Yes, as mentioned in my initial email, I checked the OSD state
>>    and found
>>    >> nothing wrong either in the OSD logs or in the system logs
>>    (SMART errors).
>>    >>
>>    >> Thanks for the advice of increasing osd_max_scrubs; I may try it,
>>    >> but I doubt it is a contention problem because it really only affects
>>    >> a fixed set of PGs (no new PGs have a "stuck scrub") and there is
>>    >> significant scrubbing activity going on continuously (~10K PGs in the
>>    >> cluster).
>>    >>
>>    >> Again, it is not a problem for me to try to kick out the suspect OSD
>>    >> and see if it fixes the issue. But this cluster is pretty simple/low
>>    >> in terms of activity and I see nothing that explains why we have this
>>    >> situation on a pretty new cluster (9 months old, created in Quincy)
>>    >> and not on our 2 other production clusters, which are much more used:
>>    >> one of them is the backend storage of a significant OpenStack cloud,
>>    >> a cluster created 10 years ago with Infernalis and upgraded since
>>    >> then, a better candidate for this kind of problem! So, I'm happy to
>>    >> contribute to troubleshooting a potential issue in Reef if somebody
>>    >> finds it useful and can help. Otherwise I'll try the approach that
>>    >> worked for Gunnar.
>>    >>
>>    >> Best regards,
>>    >>
>>    >> Michel
>>    >>
>>    >> On 22/03/2024 at 09:59, Pierre Riteau wrote:
>>    >>> Hello Michel,
>>    >>>
>>    >>> It might be worth mentioning that the next releases of Reef
>>    and Quincy
>>    >>> should increase the default value of osd_max_scrubs from 1 to
>>    3. See
>>    >>> the Reef pull request: https://github.com/ceph/ceph/pull/55173
>>    >>> You could try increasing this configuration setting if you
>>    >>> haven't already, but note that it can impact client I/O
>>    performance.
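>>    >>> For example (a sketch; 3 is the value the pull request above makes
>>    >>> the default):
>>    >>>
>>    >>>     ceph config set osd osd_max_scrubs 3
>>    >>>     ceph config get osd osd_max_scrubs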
>>    >>>
>>    >>> Also, if the delays appear to be related to a single OSD, have
>>    you
>>    >>> checked the health and performance of this device?
>>    >>>
>>    >>> On Fri, 22 Mar 2024 at 09:29, Michel Jouvin
>>    >>> <michel.jouvin@xxxxxxxxxxxxxxx> wrote:
>>    >>>
>>    >>> Hi,
>>    >>>
>>    >>> As I said in my initial message, I had in mind to do exactly the
>>    >>> same, as I identified in my initial analysis that all the PGs with
>>    >>> this problem were sharing one OSD (but only 20 PGs had the problem
>>    >>> out of the ~200 hosted by that OSD). But as I don't feel I'm in an
>>    >>> urgent situation, I was wondering whether collecting more
>>    >>> information on the problem may have some value, and which
>>    >>> information... If it helps, I add below the `pg dump` for the 17 PGs
>>    >>> still with a "stuck scrub".
>>    >>>
>>    >>> I observed that the number of "stuck scrubs" is decreasing very
>>    >>> slowly: in the last 12 hours, 1 more PG was successfully
>>    >>> scrubbed/deep scrubbed. In case it was not clear in my initial
>>    >>> message, the lists of PGs with a too old scrub and a too old deep
>>    >>> scrub are the same.
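>>    >>> (For reference, a rough way to list PGs by last deep scrub time;
>>    >>> the JSON field names are assumed from the output of 'ceph pg dump':)
>>    >>>
>>    >>>     ceph pg dump pgs --format json 2>/dev/null | jq -r '.pg_stats[] | [.pgid, .last_deep_scrub_stamp] | @tsv' | sort -k 2 | head -20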
>>    >>>
>>    >>> Without an answer, next week I may consider doing what you did:
>>    >>> remove the suspect OSD (instead of just restarting it) and see if it
>>    >>> unblocks the stuck scrubs.
>>    >>>
>>    >>> Best regards,
>>    >>>
>>    >>> Michel
>>    >>>
>>    >>> --------------------------------- "ceph pg dump pgs" for the 17
>>    >>> PGs with
>>    >>> a too old scrub and deep scrub (same list)
>>    >>> ------------------------------------------------------------
>>    >>>
>>    >>> PG_STAT  OBJECTS  MISSING_ON_PRIMARY DEGRADED  MISPLACED UNFOUND
>>    >>> BYTES        OMAP_BYTES*  OMAP_KEYS* LOG    LOG_DUPS DISK_LOG 
>>    STATE
>>    >>> STATE_STAMP VERSION       REPORTED
>>    >>> UP                 UP_PRIMARY  ACTING ACTING_PRIMARY
>>    >>> LAST_SCRUB    SCRUB_STAMP LAST_DEEP_SCRUB
>>    >>> DEEP_SCRUB_STAMP SNAPTRIMQ_LEN LAST_SCRUB_DURATION
>>    >>> SCRUB_SCHEDULING OBJECTS_SCRUBBED OBJECTS_TRIMMED
>>    >>> 29.7e3       260 0         0          0 0
>>    >>> 1090519040            0           0 1978       500
>>    >>> 1978                 active+clean 2024-03-21T18:28:53.369789+0000
>>    >>> 39202'2478    83812:97136 [29,141,64,194]          29
>>    >>> [29,141,64,194]              29 39202'2478
>>    >>> 2024-02-17T19:56:34.413412+0000 39202'2478
>>    >>> 2024-02-17T19:56:34.413412+0000              0 3  queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 25.7cc         0 0         0          0 0
>>    >>> 0            0           0      0      1076 0
>>    >>> active+clean 2024-03-21T18:09:48.104279+0000     46253'548
>>    >>> 83812:89843        [29,50,173]          29 [29,50,173]
>>    >>> 29     39159'536 2024-02-17T18:14:54.950401+0000 39159'536
>>    >>> 2024-02-17T18:14:54.950401+0000              0 1  queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 25.70c         0 0         0          0 0
>>    >>> 0            0           0      0       918 0
>>    >>> active+clean 2024-03-21T18:00:57.942902+0000 46253'514
>>    >>> 83812:95212 [29,195,185]          29 [29,195,185]              29
>>    >>> 39159'530 2024-02-18T03:56:17.559531+0000        39159'530
>>    >>> 2024-02-16T17:39:03.281785+0000              0 1  queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 29.70c       249 0         0          0 0
>>    >>> 1044381696            0           0 1987       600
>>    >>> 1987                 active+clean 2024-03-21T18:35:36.848189+0000
>>    >>> 39202'2587    83812:99628 [29,138,63,12]          29
>>    >>> [29,138,63,12]              29 39202'2587
>>    >>> 2024-02-17T21:34:22.042560+0000 39202'2587
>>    >>> 2024-02-17T21:34:22.042560+0000              0 1  queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 29.705       231 0         0          0 0
>>    >>> 968884224            0           0 1959       500 1959
>>    >>> active+clean 2024-03-21T18:18:22.028551+0000 39202'2459
>>    >>> 83812:91258 [29,147,173,61]          29 [29,147,173,61]
>>    >>> 29 39202'2459 2024-02-17T16:41:40.421763+0000 39202'2459
>>    >>> 2024-02-17T16:41:40.421763+0000              0 1  queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 29.6b9       236 0         0          0 0
>>    >>> 989855744            0           0 1956       500 1956
>>    >>> active+clean 2024-03-21T18:11:29.912132+0000 39202'2456
>>    >>> 83812:95607 [29,199,74,16]          29 [29,199,74,16]
>>    >>> 29 39202'2456 2024-02-17T11:46:06.706625+0000 39202'2456
>>    >>> 2024-02-17T11:46:06.706625+0000              0 1  queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 25.56e         0 0         0          0 0
>>    >>> 0            0           0      0      1158 0
>>    >>> active+clean+scrubbing+deep 2024-03-22T08:09:38.840145+0000
>>    >>> 46253'514   83812:637482 [111,29,128]         111
>>    >>> [111,29,128]             111 39159'579
>>    >>> 2024-03-06T17:57:53.158936+0000 39159'579
>>    >>> 2024-03-06T17:57:53.158936+0000              0 1  queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 25.56a         0 0         0          0 0
>>    >>> 0            0           0      0      1055 0
>>    >>> active+clean 2024-03-21T18:00:57.940851+0000     46253'545
>>    >>> 83812:93475        [29,19,211]          29 [29,19,211]
>>    >>> 29     46253'545 2024-03-07T11:12:45.881545+0000 46253'545
>>    >>> 2024-03-07T11:12:45.881545+0000              0 28 queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 25.55a         0 0         0          0 0
>>    >>> 0            0           0      0      1022 0
>>    >>> active+clean 2024-03-21T18:10:24.124914+0000     46253'565
>>    >>> 83812:89876        [29,58,195]          29 [29,58,195]
>>    >>> 29     46253'561 2024-02-17T06:54:35.320454+0000 46253'561
>>    >>> 2024-02-17T06:54:35.320454+0000              0 28 queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 29.c0        256 0         0          0 0
>>    >>> 1073741824            0           0 1986       600 1986
>>    >>> active+clean+scrubbing+deep 2024-03-22T08:09:12.849868+0000
>>    >>> 39202'2586   83812:603625 [22,150,29,56]          22
>>    >>> [22,150,29,56]              22 39202'2586
>>    >>> 2024-03-07T18:53:22.952868+0000 39202'2586
>>    >>> 2024-03-07T18:53:22.952868+0000              0 1  queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 18.6       15501 0         0          0 0
>>    >>> 63959444676            0           0 2068      3000 2068
>>    >>> active+clean+scrubbing+deep 2024-03-22T02:29:24.508889+0000
>>    >>> 81688'663900  83812:1272160 [187,29,211]         187
>>    >>> [187,29,211]             187 52735'663878
>>    >>> 2024-03-06T16:36:32.080259+0000 52735'663878
>>    >>> 2024-03-06T16:36:32.080259+0000              0 684445 deep
>>    scrubbing
>>    >>> for 20373s 449                0
>>    >>> 16.15          0 0         0          0 0
>>    >>> 0            0           0      0         0 0
>>    >>> active+clean 2024-03-21T18:20:29.632554+0000           0'0
>>    >>> 83812:104893        [29,165,85]          29 [29,165,85]
>>    >>> 29           0'0 2024-02-17T06:54:06.370647+0000              0'0
>>    >>> 2024-02-17T06:54:06.370647+0000              0 28 queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 25.45          0 0         0          0 0
>>    >>> 0            0           0      0      1036 0
>>    >>> active+clean 2024-03-21T18:10:24.125134+0000     39159'561
>>    >>> 83812:93649         [29,13,58]          29 [29,13,58]
>>    >>> 29     39159'512 2024-02-27T12:27:35.728176+0000 39159'512
>>    >>> 2024-02-27T12:27:35.728176+0000              0 1  queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 29.249       260 0         0          0 0
>>    >>> 1090519040            0           0 1970       500
>>    >>> 1970                 active+clean 2024-03-21T18:29:22.588805+0000
>>    >>> 39202'2470    83812:96016 [29,191,18,143]          29
>>    >>> [29,191,18,143]              29 39202'2470
>>    >>> 2024-02-17T13:32:42.910335+0000 39202'2470
>>    >>> 2024-02-17T13:32:42.910335+0000              0 1  queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 29.25a       248 0         0          0 0
>>    >>> 1040187392            0           0 1952       600
>>    >>> 1952                 active+clean 2024-03-21T18:20:29.623422+0000
>>    >>> 39202'2552    83812:99157 [29,200,85,164]          29
>>    >>> [29,200,85,164]              29 39202'2552
>>    >>> 2024-02-17T08:33:14.326087+0000 39202'2552
>>    >>> 2024-02-17T08:33:14.326087+0000              0 1  queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 25.3cf         0 0         0          0 0
>>    >>> 0            0           0      0      1343 0
>>    >>> active+clean 2024-03-21T18:16:00.933375+0000     46253'598
>>    >>> 83812:91659        [29,75,175]          29 [29,75,175]
>>    >>> 29     46253'598 2024-02-17T11:48:51.840600+0000 46253'598
>>    >>> 2024-02-17T11:48:51.840600+0000              0 28 queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>> 29.4ec       243 0         0          0 0
>>    >>> 1019215872            0           0 1933       500
>>    >>> 1933                 active+clean 2024-03-21T18:15:35.389598+0000
>>    >>> 39202'2433   83812:101501 [29,206,63,17]          29
>>    >>> [29,206,63,17]              29 39202'2433
>>    >>> 2024-02-17T15:10:41.027755+0000 39202'2433
>>    >>> 2024-02-17T15:10:41.027755+0000              0 3  queued for deep
>>    >>> scrub
>>    >>> 0                0
>>    >>>
>>    >>>
>>    >>> > On 22/03/2024 at 08:16, Bandelow, Gunnar wrote:
>>    >>> > Hi Michel,
>>    >>> >
>>    >>> > I think yesterday I found the culprit in my case.
>>    >>> >
>>    >>> > After inspecting "ceph pg dump", and especially the column
>>    >>> > "last_scrub_duration", I found that every PG without proper
>>    >>> > scrubbing was located on one of three OSDs (and all these OSDs
>>    >>> > share the same SSD for their DB). I marked them "out" and now,
>>    >>> > after backfill and remapping, everything seems to be fine.
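>>    >>> > (Roughly, with 10, 11 and 12 as placeholder ids for those three
>>    >>> > OSDs:)
>>    >>> >
>>    >>> >     ceph osd out 10 11 12
>>    >>> >     ceph -s    # watch backfill/remapping until the PGs are active+clean again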
>>    >>> >
>>    >>> > Only the log is still flooded with "scrub starts" messages, and I
>>    >>> > have no clue why these OSDs are causing the problems.
>>    >>> > I will investigate further.
>>    >>> >
>>    >>> > Best regards,
>>    >>> > Gunnar
>>    >>> >
>>    >>> > ===================================
>>    >>> >
>>    >>> >  Gunnar Bandelow
>>    >>> >  Universitätsrechenzentrum (URZ)
>>    >>> >  Universität Greifswald
>>    >>> >  Felix-Hausdorff-Straße 18
>>    >>> >  17489 Greifswald
>>    >>> >  Germany
>>    >>> >
>>    >>> >  Tel.: +49 3834 420 1450
>>    >>> >
>>    >>> >
>>    >>> > --- Original Message ---
>>    >>> > *Subject:* Re: Reef (18.2): Some PG not scrubbed/deep
>>    >>> > scrubbed for 1 month
>>    >>> > *From:* "Michel Jouvin" <michel.jouvin@xxxxxxxxxxxxxxx>
>>    >>> > *To:* ceph-users@xxxxxxx
>>    >>> > *Date:* 21-03-2024 23:40
>>    >>> >
>>    >>> >
>>    >>> >
>>    >>> >     Hi,
>>    >>> >
>>    >>> >     Today we decided to upgrade from 18.2.0 to 18.2.2. No real
>>    >>> >     hope of a direct impact (nothing in the change log related to
>>    >>> >     something similar), but at least all daemons were restarted,
>>    >>> >     so we thought that maybe this would clear the problem at least
>>    >>> >     temporarily. Unfortunately that has not been the case. The
>>    >>> >     same PGs are still stuck, despite continuous scrubbing/deep
>>    >>> >     scrubbing activity in the cluster...
>>    >>> >
>>    >>> >     I'm happy to provide more information if somebody tells me
>>    >>> what to
>>    >>> >     look
>>    >>> >     at...
>>    >>> >
>>    >>> >     Cheers,
>>    >>> >
>>    >>> >     Michel
>>    >>> >
>>    >>> >     On 21/03/2024 at 14:40, Bernhard Krieger wrote:
>>    >>> >     > Hi,
>>    >>> >     >
>>    >>> >     > I have the same issues.
>>    >>> >     > Deep scrubs haven't finished on some PGs.
>>    >>> >     >
>>    >>> >     > Using Ceph 18.2.2.
>>    >>> >     > The initially installed version was 18.0.0.
>>    >>> >     >
>>    >>> >     >
>>    >>> >     > In the logs I see a lot of scrub/deep-scrub starts:
>>    >>> >     >
>>    >>> >     > Mar 21 14:21:09 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.b deep-scrub starts
>>    >>> >     > Mar 21 14:21:10 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1a deep-scrub starts
>>    >>> >     > Mar 21 14:21:17 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1c deep-scrub starts
>>    >>> >     > Mar 21 14:21:19 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 11.1 scrub starts
>>    >>> >     > Mar 21 14:21:27 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 14.6 scrub starts
>>    >>> >     > Mar 21 14:21:30 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 10.c deep-scrub starts
>>    >>> >     > Mar 21 14:21:35 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 12.3 deep-scrub starts
>>    >>> >     > Mar 21 14:21:41 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 6.0 scrub starts
>>    >>> >     > Mar 21 14:21:44 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 8.5 deep-scrub starts
>>    >>> >     > Mar 21 14:21:45 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 5.66 deep-scrub starts
>>    >>> >     > Mar 21 14:21:49 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 5.30 deep-scrub starts
>>    >>> >     > Mar 21 14:21:50 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.b deep-scrub starts
>>    >>> >     > Mar 21 14:21:52 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1a deep-scrub starts
>>    >>> >     > Mar 21 14:21:54 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1c deep-scrub starts
>>    >>> >     > Mar 21 14:21:55 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 11.1 scrub starts
>>    >>> >     > Mar 21 14:21:58 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 14.6 scrub starts
>>    >>> >     > Mar 21 14:22:01 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 10.c deep-scrub starts
>>    >>> >     > Mar 21 14:22:04 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 12.3 scrub starts
>>    >>> >     > Mar 21 14:22:13 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 6.0 scrub starts
>>    >>> >     > Mar 21 14:22:15 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 8.5 deep-scrub starts
>>    >>> >     > Mar 21 14:22:20 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 5.66 deep-scrub starts
>>    >>> >     > Mar 21 14:22:27 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 5.30 scrub starts
>>    >>> >     > Mar 21 14:22:30 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.b deep-scrub starts
>>    >>> >     > Mar 21 14:22:32 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1a deep-scrub starts
>>    >>> >     > Mar 21 14:22:33 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1c deep-scrub starts
>>    >>> >     > Mar 21 14:22:35 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 11.1 deep-scrub starts
>>    >>> >     > Mar 21 14:22:37 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 14.6 scrub starts
>>    >>> >     > Mar 21 14:22:38 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 10.c scrub starts
>>    >>> >     > Mar 21 14:22:39 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 12.3 scrub starts
>>    >>> >     > Mar 21 14:22:41 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 6.0 deep-scrub starts
>>    >>> >     > Mar 21 14:22:43 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 8.5 deep-scrub starts
>>    >>> >     > Mar 21 14:22:46 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 5.66 deep-scrub starts
>>    >>> >     > Mar 21 14:22:49 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 5.30 scrub starts
>>    >>> >     > Mar 21 14:22:55 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.b deep-scrub starts
>>    >>> >     > Mar 21 14:22:57 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1a deep-scrub starts
>>    >>> >     > Mar 21 14:22:58 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 13.1c deep-scrub starts
>>    >>> >     > Mar 21 14:23:03 ceph-node10 ceph-osd[3804193]: log_channel(cluster) log [DBG] : 11.1 deep-scrub starts
>>    >>> >     >
>>    >>> >     >
>>    >>> >     >
>>    >>> >     > The number of PGs scrubbing/deep-scrubbing changes every
>>    >>> >     > few seconds:
>>    >>> >     >
>>    >>> >     > [root@ceph-node10 ~]# ceph -s | grep active+clean
>>    >>> >     >    pgs:     214 active+clean
>>    >>> >     >             50 active+clean+scrubbing+deep
>>    >>> >     >             25 active+clean+scrubbing
>>    >>> >     > [root@ceph-node10 ~]# ceph -s | grep active+clean
>>    >>> >     >    pgs:     208 active+clean
>>    >>> >     >             53 active+clean+scrubbing+deep
>>    >>> >     >             28 active+clean+scrubbing
>>    >>> >     > [root@ceph-node10 ~]# ceph -s | grep active+clean
>>    >>> >     >    pgs:     208 active+clean
>>    >>> >     >             53 active+clean+scrubbing+deep
>>    >>> >     >             28 active+clean+scrubbing
>>    >>> >     > [root@ceph-node10 ~]# ceph -s | grep active+clean
>>    >>> >     >    pgs:     207 active+clean
>>    >>> >     >             54 active+clean+scrubbing+deep
>>    >>> >     >             28 active+clean+scrubbing
>>    >>> >     > [root@ceph-node10 ~]# ceph -s | grep active+clean
>>    >>> >     >    pgs:     202 active+clean
>>    >>> >     >             56 active+clean+scrubbing+deep
>>    >>> >     >             31 active+clean+scrubbing
>>    >>> >     > [root@ceph-node10 ~]# ceph -s | grep active+clean
>>    >>> >     >    pgs:     213 active+clean
>>    >>> >     >             45 active+clean+scrubbing+deep
>>    >>> >     >             31 active+clean+scrubbing
>>    >>> >     >
>>    >>> >     > ceph pg dump shows PGs which have not been deep scrubbed
>>    >>> >     > since January. Some PGs have been deep scrubbing for over
>>    >>> >     > 700000 seconds.
>>    >>> >     >
>>    >>> >     > [ceph: root@ceph-node10 /]# ceph pg dump pgs | grep -e 'scrubbing f'
>>    >>> >     > 5.6e      221223                   0         0          0
>>    >>>        0
>>    >>> >     >  927795290112            0           0  4073      3000
>>    >>>      4073
>>    >>> >     >  active+clean+scrubbing+deep  2024-03-20T01:07:21.196293+
>>    >>> >     > 0000  128383'15766927  128383:20517419
>>      [2,4,18,16,14,21]
>>    >>> >               2
>>    >>> >     >   [2,4,18,16,14,21]               2  125519'12328877
>>    >>> >     >  2024-01-23T11:25:35.503811+0000  124844'11873951
>>    >>>  2024-01-21T22:
>>    >>> >     > 24:12.620693+0000              0                    5
>>     deep
>>    >>> >     scrubbing
>>    >>> >     > for 270790s
>>                                                53772
>>    >>> >     >                0
>>    >>> >     > 5.6c      221317                   0         0          0
>>    >>>        0
>>    >>> >     >  928173256704            0           0  6332         0
>>    >>>      6332
>>    >>> >     >  active+clean+scrubbing+deep  2024-03-18T09:29:29.233084+
>>    >>> >     > 0000  128382'15788196  128383:20727318
>>        [6,9,12,14,1,4]
>>    >>> >               6
>>    >>> >     >     [6,9,12,14,1,4]               6  127180'14709746
>>    >>> >     >  2024-03-06T12:47:57.741921+0000  124817'11821502
>>    >>>  2024-01-20T20:
>>    >>> >     > 59:40.566384+0000              0                13452
>>     deep
>>    >>> >     scrubbing
>>    >>> >     > for 273519s
>>                                               122803
>>    >>> >     >                0
>>    >>> >     > 5.6a      221325                   0         0          0
>>    >>>        0
>>    >>> >     >  928184565760            0           0  4649      3000
>>    >>>      4649
>>    >>> >     >  active+clean+scrubbing+deep  2024-03-13T03:48:54.065125+
>>    >>> >     > 0000  128382'16031499  128383:21221685
>>        [13,11,1,2,9,8]
>>    >>> >              13
>>    >>> >     >     [13,11,1,2,9,8]              13  127181'14915404
>>    >>> >     >  2024-03-06T13:16:58.635982+0000  125967'12517899
>>    >>>  2024-01-28T09:
>>    >>> >     > 13:08.276930+0000              0                10078
>>     deep
>>    >>> >     scrubbing
>>    >>> >     > for 726001s
>>                                               184819
>>    >>> >     >                0
>>    >>> >     > 5.54      221050                   0         0          0
>>    >>>        0
>>    >>> >     >  927036203008            0           0  4864      3000
>>    >>>      4864
>>    >>> >     >  active+clean+scrubbing+deep  2024-03-18T00:17:48.086231+
>>    >>> >     > 0000  128383'15584012  128383:20293678
>>     [0,20,18,19,11,12]
>>    >>> >               0
>>    >>> >     >  [0,20,18,19,11,12]               0  127195'14651908
>>    >>> >     >  2024-03-07T09:22:31.078448+0000  124816'11813857
>>    >>>  2024-01-20T16:
>>    >>> >     > 43:15.755200+0000              0                 9808
>>     deep
>>    >>> >     scrubbing
>>    >>> >     > for 306667s
>>                                               142126
>>    >>> >     >                0
>>    >>> >     > 5.47      220849                   0         0          0
>>    >>>        0
>>    >>> >     >  926233448448            0           0  5592         0
>>    >>>      5592
>>    >>> >     >  active+clean+scrubbing+deep  2024-03-12T08:10:39.413186+
>>    >>> >     > 0000  128382'15653864  128383:20403071
>>     [16,15,20,0,13,21]
>>    >>> >              16
>>    >>> >     >  [16,15,20,0,13,21]              16  127183'14600433
>>    >>> >     >  2024-03-06T18:21:03.057165+0000  124809'11792397
>>    >>>  2024-01-20T05:
>>    >>> >     > 27:07.617799+0000              0                13066
>>     deep
>>    >>> >     scrubbing
>>    >>> >     > for 796697s
>>                                               209193
>>    >>> >     >                0
>>    >>> >     > dumped pgs
>>    >>> >     >
>>    >>> >     >
>>    >>> >     >
>>    >>> >     >
>>    >>> >     > regards
>>    >>> >     > Bernhard
>>    >>> >     >
>>    >>> >     >
>>    >>> >     >
>>    >>> >     >
>>    >>> >     >
>>    >>> >     >
>>    >>> >     > On 20/03/2024 21:12, Bandelow, Gunnar wrote:
>>    >>> >     >> Hi,
>>    >>> >     >>
>>    >>> >     >> I just wanted to mention that I am running a cluster with
>>    >>> >     >> Reef 18.2.1 with the same issue.
>>    >>> >     >>
>>    >>> >     >> 4 PGs have been starting to deep scrub but not finishing
>>    >>> >     >> since mid-February. In the pg dump they are shown as
>>    >>> >     >> scheduled for deep scrub. They sometimes change their status
>>    >>> >     >> from active+clean to active+clean+scrubbing+deep and back.
>>    >>> >     >>
>>    >>> >     >> Best regards,
>>    >>> >     >> Gunnar
>>    >>> >     >>
>>    >>> >     >> =======================================================
>>    >>> >     >>
>>    >>> >     >> Gunnar Bandelow
>>    >>> >     >> Universitätsrechenzentrum (URZ)
>>    >>> >     >> Universität Greifswald
>>    >>> >     >> Felix-Hausdorff-Straße 18
>>    >>> >     >> 17489 Greifswald
>>    >>> >     >> Germany
>>    >>> >     >>
>>    >>> >     >> Tel.: +49 3834 420 1450
>>    >>> >     >>
>>    >>> >     >>
>>    >>> >     >>
>>    >>> >     >>
>>    >>> >     >> --- Original Message ---
>>    >>> >     >> *Subject:* Re: Reef (18.2): Some PG not scrubbed/deep
>>    >>> >     >> scrubbed for 1 month
>>    >>> >     >> *From:* "Michel Jouvin" <michel.jouvin@xxxxxxxxxxxxxxx>
>>    >>> >     >> *To:* ceph-users@xxxxxxx
>>    >>> >     >> *Date:* 20-03-2024 20:00
>>    >>> >     >>
>>    >>> >     >>
>>    >>> >     >>
>>    >>> >     >>     Hi Rafael,
>>    >>> >     >>
>>    >>> >     >>     Good to know I am not alone!
>>    >>> >     >>
>>    >>> >     >>     Additional information ~6h after the OSD restart: of
>>    >>> >     >>     the 20 PGs impacted, 2 have been processed
>>    >>> >     >>     successfully... I don't have a clear picture of how
>>    >>> >     >>     Ceph prioritizes the scrub of one PG over another; I
>>    >>> >     >>     had thought that the oldest/expired scrubs are taken
>>    >>> >     >>     first, but it may not be the case. Anyway, I have seen
>>    >>> >     >>     a very significant decrease of the scrub activity this
>>    >>> >     >>     afternoon and the cluster is not loaded at all (almost
>>    >>> >     >>     no users yet)...
>>    >>> >     >>
>>    >>> >     >>     Michel
>>    >>> >     >>
>>    >>> >     >>     On 20/03/2024 at 17:55, quaglio@xxxxxxxxxx wrote:
>>    >>> >     >>     > Hi,
>>    >>> >     >>     >      I upgraded a cluster 2 weeks ago here. The
>>    >>> >     >>     > situation is the same as Michel's.
>>    >>> >     >>     >      A lot of PGs are not scrubbed/deep-scrubbed.
>>    >>> >     >>     >
>>    >>> >     >>     > Rafael.
>>    >>> >     >>     >
>>    >>> >     >>     > _______________________________________________
>>    >>> >     >>     > ceph-users mailing list -- ceph-users@xxxxxxx
>>    >>> > <mailto:ceph-users@xxxxxxx>
>>    >>> >     >> <ceph-users@xxxxxxx <mailto:ceph-users@xxxxxxx>>
>>    >>> >     >>     > To unsubscribe send an email to
>>    >>> ceph-users-leave@xxxxxxx
>>    >>> > <mailto:ceph-users-leave@xxxxxxx>
>>    >>> >     >> <ceph-users-leave@xxxxxxx
>>    >>> <mailto:ceph-users-leave@xxxxxxx>>
>>    >>> >     >> _______________________________________________
>>    >>> >     >>     ceph-users mailing list -- ceph-users@xxxxxxx
>>    >>> > <mailto:ceph-users@xxxxxxx>
>>    >>> >     >> <ceph-users@xxxxxxx <mailto:ceph-users@xxxxxxx>>
>>    >>> >     >>     To unsubscribe send an email to
>>    ceph-users-leave@xxxxxxx
>>    >>> > <mailto:ceph-users-leave@xxxxxxx>
>>    >>> >     >> <ceph-users-leave@xxxxxxx
>>    >>> <mailto:ceph-users-leave@xxxxxxx>>
>>    >>> >     >>
>>    >>> >     >>
>>    >>> >     >> _______________________________________________
>>    >>> >     >> ceph-users mailing list -- ceph-users@xxxxxxx
>>    >>> >     >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>    >>> >     >
>>    >>> >     > _______________________________________________
>>    >>> >     > ceph-users mailing list -- ceph-users@xxxxxxx
>>    >>> > <mailto:ceph-users@xxxxxxx>
>>    >>> >     > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>    >>> > <mailto:ceph-users-leave@xxxxxxx>
>>    >>> >  _______________________________________________
>>    >>> >     ceph-users mailing list -- ceph-users@xxxxxxx
>>    >>> > <mailto:ceph-users@xxxxxxx>
>>    >>> >     To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>    >>> > <mailto:ceph-users-leave@xxxxxxx>
>>    >>> >
>>    >>> >
>>    >>> > _______________________________________________
>>    >>> > ceph-users mailing list -- ceph-users@xxxxxxx
>>    >>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>    >>> _______________________________________________
>>    >>> ceph-users mailing list -- ceph-users@xxxxxxx
>>    >>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>    >>>
>>    >> _______________________________________________
>>    >> ceph-users mailing list -- ceph-users@xxxxxxx
>>    >> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>> 
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



