Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

Michel, 
  
The log says that osd.29 delivered 2792 4k IOPS at 10.910 MiB/s during the bench. These figures suggest that a controller write-back cache is in use along the IO path. Is that right?
  
Since 2792 is above the 500 IOPS threshold, osd_mclock_max_capacity_iops_hdd falls back to the default of 315, and the OSD suggests running a benchmark and setting osd_mclock_max_capacity_iops_[hdd|ssd] accordingly.
Removing any per-OSD osd_mclock_max_capacity_iops_hdd setting, restarting all concerned OSDs, and then checking that no osd_mclock_max_capacity_iops_hdd is set anymore should be enough for the time being.
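
For example, something along these lines (a sketch only; osd.29 stands in for each concerned OSD, and the restart command depends on whether the OSDs are managed by cephadm or by plain systemd units):

    # drop the per-OSD override, then restart the daemon
    ceph config rm osd.29 osd_mclock_max_capacity_iops_hdd
    ceph orch daemon restart osd.29    # cephadm; or: systemctl restart ceph-<fsid>@osd.29.service

    # verify that no override is left behind
    ceph config dump | grep osd_mclock_max_capacity_iops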
  
Not sure why these OSDs showed such bad performance in the past. Maybe a controller firmware issue at the time.
  
Regarding the write-back cache, be careful not to set osd_mclock_max_capacity_iops_hdd too high, as OSDs may not always benefit from the controller's write-back cache, especially during large IO workloads that fill up the cache, or if the cache gets disabled because the controller's battery becomes defective.
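
If you want to base the value on a sustained measurement rather than the short OSD bench, a fio run along these lines could help (a sketch only; /dev/sdX is a placeholder for the OSD's data device, and writing to it directly destroys data, so only do this on a drive that is out of the cluster):

    fio --name=osd-iops-probe --filename=/dev/sdX --direct=1 --ioengine=libaio \
        --rw=randwrite --bs=4k --iodepth=16 --runtime=120 --time_based --group_reporting
    # then pin the measured sustained IOPS:
    ceph config set osd.29 osd_mclock_max_capacity_iops_hdd <measured IOPS>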
  
I'll be interested in what you decide for osd_mclock_max_capacity_iops_hdd in such a configuration.
  
Cheers, 
Frédéric.    

 
 
 
 

-----Original Message-----

From: Michel <michel.jouvin@xxxxxxxxxxxxxxx>
To: ceph-users <ceph-users@xxxxxxx>
Sent: Friday, 22 March 2024 17:20 CET
Subject: Re: Reef (18.2): Some PG not scrubbed/deep scrubbed for 1 month

Hi, 

The attempt to rerun the bench was not really a success. I got the 
following messages: 

----- 

Mar 22 14:48:36 idr-osd2 ceph-osd[326854]: osd.29 83873 
maybe_override_max_osd_capacity_for_qos osd bench result - bandwidth 
(MiB/sec): 10.910 iops: 2792.876 elapsed_sec: 1.074 
Mar 22 14:48:36 idr-osd2 ceph-osd[326854]: log_channel(cluster) log 
[WRN] : OSD bench result of 2792.876456 IOPS exceeded the threshold 
limit of 500.000000 IOPS for osd.29. IOPS capacity is unchanged at 
0.000000 IOPS. The recommendation is to establish the osd's IOPS 
capacity using other benchmark tools (e.g. Fio) and then override 
osd_mclock_max_capacity_iops_[hdd|ssd]. 
----- 

I decided as a first step to raise osd_mclock_max_capacity_iops_hdd 
for the suspect OSD to 50. It was magic! I already managed to get 16 
out of 17 scrubs/deep scrubs to run and the last one is in progress.
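
For reference, an override like this can be applied with something along 
these lines (osd.29 is assumed here to be the suspect OSD):

    ceph config set osd.29 osd_mclock_max_capacity_iops_hdd 50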

I now have to understand why this OSD had such bad performance that 
osd_mclock_max_capacity_iops_hdd was set to such a low value... I have 
12 OSDs with an entry for their osd_mclock_max_capacity_iops_hdd and 
they are mostly on one server (with 2 OSDs on another one). I suspect 
there was a problem on these servers at some point. It is unclear why 
it is not enough to just rerun the benchmark and why such a crazy value 
for an HDD is found...

Best regards, 

Michel 

On 22/03/2024 at 14:44, Michel Jouvin wrote:
> Hi Frédéric, 
> 
> I think you raise the right point, sorry if I misunderstood Pierre's 
> suggestion to look at OSD performance. Just before reading your 
> email, I was implementing Pierre's suggestion for osd_max_scrubs and I 
> saw the osd_mclock_max_capacity_iops_hdd values for a few OSDs (I guess 
> those with a value different from the default). For the suspect OSD, the 
> value is very low, 0.145327, and I suspect it is the cause of the 
> problem. A few others have a value of ~5, which I also find very low (all 
> OSDs are using the same recent HW/HDD).
> 
> Thanks for this information. I'll follow your suggestions to rerun 
> the benchmark and report whether it improved the situation.
> 
> Best regards, 
> 
> Michel 
> 
> On 22/03/2024 at 12:18, Frédéric Nass wrote:
>> Hello Michel, 
>> 
>> Pierre also suggested checking the performance of this OSD's 
>> device(s), which can be done by running 'ceph tell osd.x bench'.
>> 
>> One thing I can think of is that the scrubbing speed of this very OSD 
>> could be influenced by mclock scheduling, should the max IOPS capacity 
>> calculated by this OSD during its initialization be significantly 
>> lower than the other OSDs'.
>> 
>> What I would do is check (from this OSD's log) the calculated value 
>> for max IOPS capacity and compare it to other OSDs'. If needed, force 
>> a recalculation by setting 'ceph config set osd.x 
>> osd_mclock_force_run_benchmark_on_init true' and restarting this OSD.
>> 
>> Also I would (see the example commands after this list): 
>> 
>> - compare the running OSD's mclock values (cephadm shell ceph daemon 
>> osd.x config show | grep mclock) to the other OSDs'. 
>> - compare 'ceph tell osd.x bench' to the other OSDs' benchmarks. 
>> - compare the rotational status of this OSD's db and data devices to 
>> the other OSDs', to make sure things are in order.
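>> 
>> A minimal sketch of these checks (osd.29 is assumed to be the suspect 
>> OSD and osd.30 a healthy one to compare against, both placeholders): 
>> 
>>     # force a fresh capacity measurement on the next start 
>>     ceph config set osd.29 osd_mclock_force_run_benchmark_on_init true 
>> 
>>     # compare effective mclock settings and raw bench results 
>>     ceph daemon osd.29 config show | grep mclock    # on the host running osd.29, or via cephadm shell 
>>     ceph tell osd.29 bench 
>>     ceph tell osd.30 bench 
>> 
>>     # check how the data/db devices are reported (rotational or not) 
>>     ceph osd metadata 29 | grep -i rotational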
>> 
>> Bests, 
>> Frédéric. 
>> 
>> PS: If mclock is the culprit here, then setting osd_op_queue back to 
>> wpq for this OSD only would probably reveal it. Not sure about the 
>> implication of having a single OSD running a different scheduler in 
>> the cluster though.
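>> 
>> Something like this would do it, assuming osd.29 (osd_op_queue only 
>> takes effect after an OSD restart): 
>> 
>>     ceph config set osd.29 osd_op_queue wpq 
>>     ceph orch daemon restart osd.29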
>> 
>> 
>> ----- On 22 Mar 24, at 10:11, Michel Jouvin 
>> michel.jouvin@xxxxxxxxxxxxxxx wrote:
>> 
>>> Pierre, 
>>> 
>>> Yes, as mentioned in my initial email, I checked the OSD state and 
>>> found nothing wrong either in the OSD logs or in the system logs 
>>> (SMART errors).
>>> 
>>> Thanks for the advice of increasing osd_max_scrubs, I may try it, but I 
>>> doubt it is a contention problem because it really only affects a fixed 
>>> set of PGs (no new PGs have a "stuck scrub") and there is 
>>> significant scrubbing activity going on continuously (~10K PGs in the 
>>> cluster).
>>> 
>>> Again, it is not a problem for me to try to kick out the suspect OSD 
>>> and see if it fixes the issue. But this cluster is pretty simple/low 
>>> in terms of activity, and I see nothing that may explain why we have 
>>> this situation on a pretty new cluster (9 months old, created in 
>>> Quincy) and not on our 2 other production clusters, which are much 
>>> more used, one of them being the backend storage of a significant 
>>> OpenStack cloud, a cluster created 10 years ago with Infernalis and 
>>> upgraded since then, a better candidate for this kind of problem! 
>>> So, I'm happy to contribute to troubleshooting a potential issue in 
>>> Reef if somebody finds it useful and can help. Else I'll try the 
>>> approach that worked for Gunnar.
>>> 
>>> Best regards, 
>>> 
>>> Michel 
>>> 
>>> On 22/03/2024 at 09:59, Pierre Riteau wrote:
>>>> Hello Michel, 
>>>> 
>>>> It might be worth mentioning that the next releases of Reef and Quincy 
>>>> should increase the default value of osd_max_scrubs from 1 to 3. See 
>>>> the Reef pull request: https://github.com/ceph/ceph/pull/55173 
>>>> You could try increasing this configuration setting if you 
>>>> haven't already, but note that it can impact client I/O performance. 
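>>>> 
>>>> For instance, a cluster-wide override could look like this (2 is just 
>>>> an illustrative value; pick one appropriate for your cluster): 
>>>> 
>>>>     ceph config set osd osd_max_scrubs 2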
>>>> 
>>>> Also, if the delays appear to be related to a single OSD, have you 
>>>> checked the health and performance of this device? 
>>>> 
>>>> On Fri, 22 Mar 2024 at 09:29, Michel Jouvin 
>>>> <michel.jouvin@xxxxxxxxxxxxxxx> wrote: 
>>>> 
>>>>      Hi, 
>>>> 
>>>>      As I said in my initial message, I had in mind to do exactly the 
>>>>      same, as I identified in my initial analysis that all the PGs with 
>>>>      this problem were sharing one OSD (but only 20 PGs had the problem 
>>>>      out of the ~200 hosted by that OSD). But as I don't feel I'm in an 
>>>>      urgent situation, I was wondering if collecting more information on 
>>>>      the problem may have some value, and which one... If it helps, I add 
>>>>      below the `pg dump` for the 17 PGs still with a "stuck scrub".
>>>> 
>>>>      I observed that the number of "stuck scrubs" is decreasing very 
>>>>      slowly. In the last 12 hours, 1 more PG was successfully 
>>>>      scrubbed/deep scrubbed. In case it was not clear in my initial 
>>>>      message, the lists of PGs with a too old scrub and a too old deep 
>>>>      scrub are the same.
>>>> 
>>>>      Without an answer, next week I may consider doing what you did: 
>>>>      remove the suspect OSD (instead of just restarting it) and see if 
>>>>      it unblocks the stuck scrubs.
>>>> 
>>>>      Best regards, 
>>>> 
>>>>      Michel 
>>>> 
>>>>      --------------------------------- "ceph pg dump pgs" for the 17 
>>>>      PGs with 
>>>>      a too old scrub and deep scrub (same list) 
>>>> ------------------------------------------------------------ 
>>>> 
>>>>      PG_STAT  OBJECTS  MISSING_ON_PRIMARY  DEGRADED MISPLACED UNFOUND 
>>>>      BYTES        OMAP_BYTES*  OMAP_KEYS*  LOG    LOG_DUPS 
>>>> DISK_LOG  STATE 
>>>>      STATE_STAMP                      VERSION       REPORTED 
>>>>      UP                 UP_PRIMARY  ACTING ACTING_PRIMARY 
>>>>      LAST_SCRUB    SCRUB_STAMP LAST_DEEP_SCRUB 
>>>>      DEEP_SCRUB_STAMP                 SNAPTRIMQ_LEN 
>>>> LAST_SCRUB_DURATION 
>>>>      SCRUB_SCHEDULING OBJECTS_SCRUBBED  OBJECTS_TRIMMED 
>>>>      29.7e3       260                   0         0 0 0 
>>>>      1090519040            0           0   1978       500 
>>>>      1978                 active+clean 2024-03-21T18:28:53.369789+0000 
>>>>      39202'2478    83812:97136 [29,141,64,194]          29 
>>>>      [29,141,64,194]              29 39202'2478 
>>>>      2024-02-17T19:56:34.413412+0000       39202'2478 
>>>>      2024-02-17T19:56:34.413412+0000              0 3 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      25.7cc         0                   0         0 0 0 
>>>>      0            0           0      0      1076 0 
>>>>      active+clean 2024-03-21T18:09:48.104279+0000 46253'548 
>>>>      83812:89843        [29,50,173]          29 [29,50,173] 
>>>>      29     39159'536 2024-02-17T18:14:54.950401+0000 39159'536 
>>>>      2024-02-17T18:14:54.950401+0000              0 1 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      25.70c         0                   0         0 0 0 
>>>>      0            0           0      0       918 0 
>>>>      active+clean 2024-03-21T18:00:57.942902+0000 46253'514 
>>>>      83812:95212 [29,195,185]          29 [29,195,185]              29 
>>>>      39159'530  2024-02-18T03:56:17.559531+0000 39159'530 
>>>>      2024-02-16T17:39:03.281785+0000              0 1 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      29.70c       249                   0         0 0 0 
>>>>      1044381696            0           0   1987       600 
>>>>      1987                 active+clean 2024-03-21T18:35:36.848189+0000 
>>>>      39202'2587    83812:99628 [29,138,63,12]          29 
>>>>      [29,138,63,12]              29 39202'2587 
>>>>      2024-02-17T21:34:22.042560+0000       39202'2587 
>>>>      2024-02-17T21:34:22.042560+0000              0 1 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      29.705       231                   0         0 0 0 
>>>>      968884224            0           0   1959       500 1959 
>>>>      active+clean 2024-03-21T18:18:22.028551+0000 39202'2459 
>>>>      83812:91258 [29,147,173,61]          29 [29,147,173,61] 
>>>>      29 39202'2459  2024-02-17T16:41:40.421763+0000 39202'2459 
>>>>      2024-02-17T16:41:40.421763+0000              0 1 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      29.6b9       236                   0         0 0 0 
>>>>      989855744            0           0   1956       500 1956 
>>>>      active+clean 2024-03-21T18:11:29.912132+0000 39202'2456 
>>>>      83812:95607 [29,199,74,16]          29 [29,199,74,16] 
>>>>      29 39202'2456  2024-02-17T11:46:06.706625+0000 39202'2456 
>>>>      2024-02-17T11:46:06.706625+0000              0 1 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      25.56e         0                   0         0 0 0 
>>>>      0            0           0      0      1158 0 
>>>>      active+clean+scrubbing+deep 2024-03-22T08:09:38.840145+0000 
>>>>      46253'514   83812:637482 [111,29,128]         111 
>>>>      [111,29,128]             111 39159'579 
>>>>      2024-03-06T17:57:53.158936+0000        39159'579 
>>>>      2024-03-06T17:57:53.158936+0000              0 1 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      25.56a         0                   0         0 0 0 
>>>>      0            0           0      0      1055 0 
>>>>      active+clean 2024-03-21T18:00:57.940851+0000 46253'545 
>>>>      83812:93475        [29,19,211]          29 [29,19,211] 
>>>>      29     46253'545 2024-03-07T11:12:45.881545+0000 46253'545 
>>>>      2024-03-07T11:12:45.881545+0000              0 28 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      25.55a         0                   0         0 0 0 
>>>>      0            0           0      0      1022 0 
>>>>      active+clean 2024-03-21T18:10:24.124914+0000 46253'565 
>>>>      83812:89876        [29,58,195]          29 [29,58,195] 
>>>>      29     46253'561 2024-02-17T06:54:35.320454+0000 46253'561 
>>>>      2024-02-17T06:54:35.320454+0000              0 28 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      29.c0        256                   0         0 0 0 
>>>>      1073741824            0           0   1986       600 1986 
>>>>      active+clean+scrubbing+deep 2024-03-22T08:09:12.849868+0000 
>>>>      39202'2586   83812:603625 [22,150,29,56]          22 
>>>>      [22,150,29,56]              22 39202'2586 
>>>>      2024-03-07T18:53:22.952868+0000       39202'2586 
>>>>      2024-03-07T18:53:22.952868+0000              0 1 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      18.6       15501                   0         0 0 0 
>>>>      63959444676            0           0   2068      3000 2068 
>>>>      active+clean+scrubbing+deep 2024-03-22T02:29:24.508889+0000 
>>>>      81688'663900  83812:1272160 [187,29,211]         187 
>>>>      [187,29,211]             187 52735'663878 
>>>>      2024-03-06T16:36:32.080259+0000     52735'663878 
>>>>      2024-03-06T16:36:32.080259+0000              0 684445 deep 
>>>> scrubbing 
>>>>      for 20373s 449                0 
>>>>      16.15          0                   0         0 0 0 
>>>>      0            0           0      0         0 0 
>>>>      active+clean 2024-03-21T18:20:29.632554+0000 0'0 
>>>>      83812:104893        [29,165,85]          29 [29,165,85] 
>>>>      29           0'0 2024-02-17T06:54:06.370647+0000              0'0 
>>>>      2024-02-17T06:54:06.370647+0000              0 28 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      25.45          0                   0         0 0 0 
>>>>      0            0           0      0      1036 0 
>>>>      active+clean 2024-03-21T18:10:24.125134+0000 39159'561 
>>>>      83812:93649         [29,13,58]          29 [29,13,58] 
>>>>      29     39159'512 2024-02-27T12:27:35.728176+0000 39159'512 
>>>>      2024-02-27T12:27:35.728176+0000              0 1 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      29.249       260                   0         0 0 0 
>>>>      1090519040            0           0   1970       500 
>>>>      1970                 active+clean 2024-03-21T18:29:22.588805+0000 
>>>>      39202'2470    83812:96016 [29,191,18,143]          29 
>>>>      [29,191,18,143]              29 39202'2470 
>>>>      2024-02-17T13:32:42.910335+0000       39202'2470 
>>>>      2024-02-17T13:32:42.910335+0000              0 1 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      29.25a       248                   0         0 0 0 
>>>>      1040187392            0           0   1952       600 
>>>>      1952                 active+clean 2024-03-21T18:20:29.623422+0000 
>>>>      39202'2552    83812:99157 [29,200,85,164]          29 
>>>>      [29,200,85,164]              29 39202'2552 
>>>>      2024-02-17T08:33:14.326087+0000       39202'2552 
>>>>      2024-02-17T08:33:14.326087+0000              0 1 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      25.3cf         0                   0         0 0 0 
>>>>      0            0           0      0      1343 0 
>>>>      active+clean 2024-03-21T18:16:00.933375+0000 46253'598 
>>>>      83812:91659        [29,75,175]          29 [29,75,175] 
>>>>      29     46253'598 2024-02-17T11:48:51.840600+0000 46253'598 
>>>>      2024-02-17T11:48:51.840600+0000              0 28 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>>      29.4ec       243                   0         0 0 0 
>>>>      1019215872            0           0   1933       500 
>>>>      1933                 active+clean 2024-03-21T18:15:35.389598+0000 
>>>>      39202'2433   83812:101501 [29,206,63,17]          29 
>>>>      [29,206,63,17]              29 39202'2433 
>>>>      2024-02-17T15:10:41.027755+0000       39202'2433 
>>>>      2024-02-17T15:10:41.027755+0000              0 3 queued for deep 
>>>>      scrub 
>>>>      0                0 
>>>> 
>>>> 
>>>>      On 22/03/2024 at 08:16, Bandelow, Gunnar wrote: 
>>>>      > Hi Michel, 
>>>>      > 
>>>>      > I think yesterday I found the culprit in my case. 
>>>>      > 
>>>>      > After inspecting "ceph pg dump", and especially the column 
>>>>      > "last_scrub_duration", I found that any PG without proper 
>>>>      > scrubbing was located on one of three OSDs (and all these OSDs 
>>>>      > share the same SSD for their DB). I put them on "out" and now, 
>>>>      > after backfill and remapping, everything seems to be fine. 
>>>>      > 
>>>>      > Only the log is still flooded with "scrub starts" and I have no 
>>>>      > clue why these OSDs are causing the problems. 
>>>>      > Will investigate further. 
>>>>      > 
>>>>      > Best regards, 
>>>>      > Gunnar 
>>>>      > 
>>>>      > =================================== 
>>>>      > 
>>>>      >  Gunnar Bandelow 
>>>>      >  Universitätsrechenzentrum (URZ) 
>>>>      >  Universität Greifswald 
>>>>      >  Felix-Hausdorff-Straße 18 
>>>>      >  17489 Greifswald 
>>>>      >  Germany 
>>>>      > 
>>>>      >  Tel.: +49 3834 420 1450 
>>>>      > 
>>>>      > 
>>>>      > --- Original Message --- 
>>>>      > *Subject: * Re: Reef (18.2): Some PG not scrubbed/deep 
>>>>      > scrubbed for 1 month 
>>>>      > *From: *"Michel Jouvin" <michel.jouvin@xxxxxxxxxxxxxxx> 
>>>>      > *To: *ceph-users@xxxxxxx 
>>>>      > *Date: *21-03-2024 23:40 
>>>>      > 
>>>>      > 
>>>>      > 
>>>>      >     Hi, 
>>>>      > 
>>>>      >     Today we decided to upgrade from 18.2.0 to 18.2.2. No real 
>>>>      >     hope of a direct impact (nothing in the changelog related to 
>>>>      >     something similar), but at least all daemons were restarted, 
>>>>      >     so we thought that maybe this would clear the problem at 
>>>>      >     least temporarily. Unfortunately it has not been the case. 
>>>>      >     The same PGs are still stuck, despite continuous 
>>>>      >     scrubbing/deep scrubbing activity in the cluster... 
>>>>      > 
>>>>      >     I'm happy to provide more information if somebody tells me 
>>>>      what to 
>>>>      >     look 
>>>>      >     at... 
>>>>      > 
>>>>      >     Cheers, 
>>>>      > 
>>>>      >     Michel 
>>>>      > 
>>>>      >     On 21/03/2024 at 14:40, Bernhard Krieger wrote: 
>>>>      >     > Hi, 
>>>>      >     > 
>>>>      >     > I have the same issues. 
>>>>      >     > Deep scrubs haven't finished on some PGs. 
>>>>      >     > 
>>>>      >     > Using Ceph 18.2.2. 
>>>>      >     > The initially installed version was 18.0.0 
>>>>      >     > 
>>>>      >     > 
>>>>      >     > In the logs I see a lot of scrub/deep-scrub starts 
>>>>      >     > 
>>>>      >     > Mar 21 14:21:09 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 13.b deep-scrubstarts 
>>>>      >     > Mar 21 14:21:10 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 13.1a deep-scrubstarts 
>>>>      >     > Mar 21 14:21:17 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 13.1c deep-scrubstarts 
>>>>      >     > Mar 21 14:21:19 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 11.1 scrubstarts 
>>>>      >     > Mar 21 14:21:27 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 14.6 scrubstarts 
>>>>      >     > Mar 21 14:21:30 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 10.c deep-scrubstarts 
>>>>      >     > Mar 21 14:21:35 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 12.3 deep-scrubstarts 
>>>>      >     > Mar 21 14:21:41 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 6.0 scrubstarts 
>>>>      >     > Mar 21 14:21:44 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 8.5 deep-scrubstarts 
>>>>      >     > Mar 21 14:21:45 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 5.66 deep-scrubstarts 
>>>>      >     > Mar 21 14:21:49 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 5.30 deep-scrubstarts 
>>>>      >     > Mar 21 14:21:50 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 13.b deep-scrubstarts 
>>>>      >     > Mar 21 14:21:52 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 13.1a deep-scrubstarts 
>>>>      >     > Mar 21 14:21:54 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 13.1c deep-scrubstarts 
>>>>      >     > Mar 21 14:21:55 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 11.1 scrubstarts 
>>>>      >     > Mar 21 14:21:58 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 14.6 scrubstarts 
>>>>      >     > Mar 21 14:22:01 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 10.c deep-scrubstarts 
>>>>      >     > Mar 21 14:22:04 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 12.3 scrubstarts 
>>>>      >     > Mar 21 14:22:13 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 6.0 scrubstarts 
>>>>      >     > Mar 21 14:22:15 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 8.5 deep-scrubstarts 
>>>>      >     > Mar 21 14:22:20 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 5.66 deep-scrubstarts 
>>>>      >     > Mar 21 14:22:27 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 5.30 scrubstarts 
>>>>      >     > Mar 21 14:22:30 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 13.b deep-scrubstarts 
>>>>      >     > Mar 21 14:22:32 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 13.1a deep-scrubstarts 
>>>>      >     > Mar 21 14:22:33 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 13.1c deep-scrubstarts 
>>>>      >     > Mar 21 14:22:35 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 11.1 deep-scrubstarts 
>>>>      >     > Mar 21 14:22:37 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 14.6 scrubstarts 
>>>>      >     > Mar 21 14:22:38 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 10.c scrubstarts 
>>>>      >     > Mar 21 14:22:39 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 12.3 scrubstarts 
>>>>      >     > Mar 21 14:22:41 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 6.0 deep-scrubstarts 
>>>>      >     > Mar 21 14:22:43 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 8.5 deep-scrubstarts 
>>>>      >     > Mar 21 14:22:46 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 5.66 deep-scrubstarts 
>>>>      >     > Mar 21 14:22:49 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 5.30 scrubstarts 
>>>>      >     > Mar 21 14:22:55 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 13.b deep-scrubstarts 
>>>>      >     > Mar 21 14:22:57 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 13.1a deep-scrubstarts 
>>>>      >     > Mar 21 14:22:58 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 13.1c deep-scrubstarts 
>>>>      >     > Mar 21 14:23:03 ceph-node10 ceph-osd[3804193]: 
>>>>      log_channel(cluster) 
>>>>      >     > log [DBG] : 11.1 deep-scrubstarts 
>>>>      >     > 
>>>>      >     > 
>>>>      >     > 
>>>>      >     > The amount of scrubbed/deep-scrubbed PGs changes every 
>>>>      >     > few seconds. 
>>>>      >     > 
>>>>      >     > [root@ceph-node10 ~]# ceph -s | grep active+clean 
>>>>      >     >    pgs:     214 active+clean 
>>>>      >     >             50 active+clean+scrubbing+deep 
>>>>      >     >             25 active+clean+scrubbing 
>>>>      >     > [root@ceph-node10 ~]# ceph -s | grep active+clean 
>>>>      >     >    pgs:     208 active+clean 
>>>>      >     >             53 active+clean+scrubbing+deep 
>>>>      >     >             28 active+clean+scrubbing 
>>>>      >     > [root@ceph-node10 ~]# ceph -s | grep active+clean 
>>>>      >     >    pgs:     208 active+clean 
>>>>      >     >             53 active+clean+scrubbing+deep 
>>>>      >     >             28 active+clean+scrubbing 
>>>>      >     > [root@ceph-node10 ~]# ceph -s | grep active+clean 
>>>>      >     >    pgs:     207 active+clean 
>>>>      >     >             54 active+clean+scrubbing+deep 
>>>>      >     >             28 active+clean+scrubbing 
>>>>      >     > [root@ceph-node10 ~]# ceph -s | grep active+clean 
>>>>      >     >    pgs:     202 active+clean 
>>>>      >     >             56 active+clean+scrubbing+deep 
>>>>      >     >             31 active+clean+scrubbing 
>>>>      >     > [root@ceph-node10 ~]# ceph -s | grep active+clean 
>>>>      >     >    pgs:     213 active+clean 
>>>>      >     >             45 active+clean+scrubbing+deep 
>>>>      >     >             31 active+clean+scrubbing 
>>>>      >     > 
>>>>      >     > ceph pg dump showing PGs which have not been deep scrubbed 
>>>>      >     > since January. 
>>>>      >     > Some PGs have been deep scrubbing for over 700000 seconds. 
>>>>      >     > 
>>>>      >     > [ceph: root@ceph-node10 /]#  ceph pg dump pgs | grep -e 
>>>>      >     > 'scrubbing f' 
>>>>      >     > 5.6e      221223                   0         0          0 
>>>>             0 
>>>>      >     >  927795290112            0           0  4073      3000 
>>>>           4073 
>>>>      >     >  active+clean+scrubbing+deep  2024-03-20T01:07:21.196293+ 
>>>>      >     > 0000  128383'15766927  128383:20517419 
>>>>   [2,4,18,16,14,21] 
>>>>      >               2 
>>>>      >     >   [2,4,18,16,14,21]               2  125519'12328877 
>>>>      >     >  2024-01-23T11:25:35.503811+0000  124844'11873951 
>>>>       2024-01-21T22: 
>>>>      >     > 24:12.620693+0000              0                    5 
>>>>  deep 
>>>>      >     scrubbing 
>>>>      >     > for 270790s 
>>>>                                             53772 
>>>>      >     >                0 
>>>>      >     > 5.6c      221317                   0         0          0 
>>>>             0 
>>>>      >     >  928173256704            0           0  6332         0 
>>>>           6332 
>>>>      >     >  active+clean+scrubbing+deep  2024-03-18T09:29:29.233084+ 
>>>>      >     > 0000  128382'15788196  128383:20727318 
>>>>     [6,9,12,14,1,4] 
>>>>      >               6 
>>>>      >     >     [6,9,12,14,1,4]               6  127180'14709746 
>>>>      >     >  2024-03-06T12:47:57.741921+0000  124817'11821502 
>>>>       2024-01-20T20: 
>>>>      >     > 59:40.566384+0000              0                13452 
>>>>  deep 
>>>>      >     scrubbing 
>>>>      >     > for 273519s 
>>>>                                            122803 
>>>>      >     >                0 
>>>>      >     > 5.6a      221325                   0         0          0 
>>>>             0 
>>>>      >     >  928184565760            0           0  4649      3000 
>>>>           4649 
>>>>      >     >  active+clean+scrubbing+deep  2024-03-13T03:48:54.065125+ 
>>>>      >     > 0000  128382'16031499  128383:21221685 
>>>>     [13,11,1,2,9,8] 
>>>>      >              13 
>>>>      >     >     [13,11,1,2,9,8]              13  127181'14915404 
>>>>      >     >  2024-03-06T13:16:58.635982+0000  125967'12517899 
>>>>       2024-01-28T09: 
>>>>      >     > 13:08.276930+0000              0                10078 
>>>>  deep 
>>>>      >     scrubbing 
>>>>      >     > for 726001s 
>>>>                                            184819 
>>>>      >     >                0 
>>>>      >     > 5.54      221050                   0         0          0 
>>>>             0 
>>>>      >     >  927036203008            0           0  4864      3000 
>>>>           4864 
>>>>      >     >  active+clean+scrubbing+deep  2024-03-18T00:17:48.086231+ 
>>>>      >     > 0000  128383'15584012  128383:20293678 
>>>>  [0,20,18,19,11,12] 
>>>>      >               0 
>>>>      >     >  [0,20,18,19,11,12]               0  127195'14651908 
>>>>      >     >  2024-03-07T09:22:31.078448+0000  124816'11813857 
>>>>       2024-01-20T16: 
>>>>      >     > 43:15.755200+0000              0                 9808 
>>>>  deep 
>>>>      >     scrubbing 
>>>>      >     > for 306667s 
>>>>                                            142126 
>>>>      >     >                0 
>>>>      >     > 5.47      220849                   0         0          0 
>>>>             0 
>>>>      >     >  926233448448            0           0  5592         0 
>>>>           5592 
>>>>      >     >  active+clean+scrubbing+deep  2024-03-12T08:10:39.413186+ 
>>>>      >     > 0000  128382'15653864  128383:20403071 
>>>>  [16,15,20,0,13,21] 
>>>>      >              16 
>>>>      >     >  [16,15,20,0,13,21]              16  127183'14600433 
>>>>      >     >  2024-03-06T18:21:03.057165+0000  124809'11792397 
>>>>       2024-01-20T05: 
>>>>      >     > 27:07.617799+0000              0                13066 
>>>>  deep 
>>>>      >     scrubbing 
>>>>      >     > for 796697s 
>>>>                                            209193 
>>>>      >     >                0 
>>>>      >     > dumped pgs 
>>>>      >     > 
>>>>      >     > 
>>>>      >     > * 
>>>>      >     > 
>>>>      >     > 
>>>>      >     > regards 
>>>>      >     > Bernhard 
>>>>      >     > 
>>>>      >     > 
>>>>      >     > 
>>>>      >     > 
>>>>      >     > 
>>>>      >     > 
>>>>      >     > On 20/03/2024 21:12, Bandelow, Gunnar wrote: 
>>>>      >     >> Hi, 
>>>>      >     >> 
>>>>      >     >> I just wanted to mention that I am running a cluster 
>>>>      >     >> with Reef 18.2.1 with the same issue. 
>>>>      >     >> 
>>>>      >     >> 4 PGs start to deep scrub but don't finish since 
>>>>      >     >> mid-February. In the pg dump they are shown as scheduled 
>>>>      >     >> for deep scrub. They sometimes change their status from 
>>>>      >     >> active+clean to active+clean+scrubbing+deep and back. 
>>>>      >     >> 
>>>>      >     >> Best regards, 
>>>>      >     >> Gunnar 
>>>>      >     >> 
>>>>      >     >> ======================================================= 
>>>>      >     >> 
>>>>      >     >> Gunnar Bandelow 
>>>>      >     >> Universitätsrechenzentrum (URZ) 
>>>>      >     >> Universität Greifswald 
>>>>      >     >> Felix-Hausdorff-Straße 18 
>>>>      >     >> 17489 Greifswald 
>>>>      >     >> Germany 
>>>>      >     >> 
>>>>      >     >> Tel.: +49 3834 420 1450 
>>>>      >     >> 
>>>>      >     >> 
>>>>      >     >> 
>>>>      >     >> 
>>>>      >     >> --- Original Message --- 
>>>>      >     >> *Subject: * Re: Reef (18.2): Some PG not scrubbed/deep 
>>>>      >     >> scrubbed for 1 month 
>>>>      >     >> *From: *"Michel Jouvin" <michel.jouvin@xxxxxxxxxxxxxxx> 
>>>>      >     >> *To: *ceph-users@xxxxxxx 
>>>>      >     >> *Date: *20-03-2024 20:00 
>>>>      >     >> 
>>>>      >     >> 
>>>>      >     >> 
>>>>      >     >>     Hi Rafael, 
>>>>      >     >> 
>>>>      >     >>     Good to know I am not alone! 
>>>>      >     >> 
>>>>      >     >>     Additional information ~6h after the OSD restart: of 
>>>>      >     >>     the 20 PGs impacted, 2 have been processed 
>>>>      >     >>     successfully... I don't have a clear picture of how 
>>>>      >     >>     Ceph prioritizes the scrub of one PG over another; I 
>>>>      >     >>     had thought that the oldest/expired scrubs are taken 
>>>>      >     >>     first but it may not be the case. Anyway, I have seen 
>>>>      >     >>     a very significant decrease of the scrub activity this 
>>>>      >     >>     afternoon and the cluster is not loaded at all (almost 
>>>>      >     >>     no users yet)... 
>>>>      >     >> 
>>>>      >     >>     Michel 
>>>>      >     >> 
>>>>      >     >>     On 20/03/2024 at 17:55, quaglio@xxxxxxxxxx wrote: 
>>>>      >     >>     > Hi, 
>>>>      >     >>     >      I upgraded a cluster 2 weeks ago here. The 
>>>>      >     >>     >      I upgraded a cluster 2 weeks ago here. The 
>>>>      >     >>     >      situation is the same as Michel's. 
>>>>      >     >>     >      A lot of PGs not scrubbed/deep-scrubbed. 
>>>>      >     >>     > Rafael. 
>>>>      >     >>     > 
>>>>      >     >>     > _______________________________________________ 
>>>>      >     >>     > ceph-users mailing list -- ceph-users@xxxxxxx 
>>>>      >     <mailto:ceph-users@xxxxxxx> 
>>>>      >     >>     <ceph-users@xxxxxxx <mailto:ceph-users@xxxxxxx>> 
>>>>      >     >>     > To unsubscribe send an email to 
>>>>      ceph-users-leave@xxxxxxx 
>>>>      >     <mailto:ceph-users-leave@xxxxxxx> 
>>>>      >     >>     <ceph-users-leave@xxxxxxx 
>>>>      <mailto:ceph-users-leave@xxxxxxx>> 
>>>>      >     >> _______________________________________________ 
>>>>      >     >>     ceph-users mailing list -- ceph-users@xxxxxxx 
>>>>      >     <mailto:ceph-users@xxxxxxx> 
>>>>      >     >>     <ceph-users@xxxxxxx <mailto:ceph-users@xxxxxxx>> 
>>>>      >     >>     To unsubscribe send an email to 
>>>> ceph-users-leave@xxxxxxx 
>>>>      >     <mailto:ceph-users-leave@xxxxxxx> 
>>>>      >     >>     <ceph-users-leave@xxxxxxx 
>>>>      <mailto:ceph-users-leave@xxxxxxx>> 
>>>>      >     >> 
>>>>      >     >> 
>>>>      >     >> _______________________________________________ 
>>>>      >     >> ceph-users mailing list --ceph-users@xxxxxxx 
>>>>      >     <mailto:ceph-users@xxxxxxx> 
>>>>      >     >> To unsubscribe send an email toceph-users-leave@xxxxxxx 
>>>>      >     <mailto:toceph-users-leave@xxxxxxx> 
>>>>      >     > 
>>>>      >     > _______________________________________________ 
>>>>      >     > ceph-users mailing list -- ceph-users@xxxxxxx 
>>>>      >     <mailto:ceph-users@xxxxxxx> 
>>>>      >     > To unsubscribe send an email to ceph-users-leave@xxxxxxx 
>>>>      >     <mailto:ceph-users-leave@xxxxxxx> 
>>>>      >  _______________________________________________ 
>>>>      >     ceph-users mailing list -- ceph-users@xxxxxxx 
>>>>      >     <mailto:ceph-users@xxxxxxx> 
>>>>      >     To unsubscribe send an email to ceph-users-leave@xxxxxxx 
>>>>      >     <mailto:ceph-users-leave@xxxxxxx> 
>>>>      > 
>>>>      > 
>>>>      > _______________________________________________ 
>>>>      > ceph-users mailing list --ceph-users@xxxxxxx 
>>>>      > To unsubscribe send an email toceph-users-leave@xxxxxxx 
>>>>      _______________________________________________ 
>>>>      ceph-users mailing list -- ceph-users@xxxxxxx 
>>>>      To unsubscribe send an email to ceph-users-leave@xxxxxxx 
>>>> 
>>> _______________________________________________ 
>>> ceph-users mailing list -- ceph-users@xxxxxxx 
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx 
> _______________________________________________ 
> ceph-users mailing list -- ceph-users@xxxxxxx 
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



