Re: 6 pgs not deep-scrubbed in time

My best guess is that it will take a couple of weeks to a couple of months
to complete on 10 TB spinners at ~40% full. The cluster should remain
usable throughout the process.

Keep in mind that you should disable the PG autoscaler on any pool whose
pg_num you are adjusting manually. Increasing the pg_num is called "PG
splitting"; you can search around for this term to see how it works.

There are a few knobs to increase or decrease the aggressiveness of the PG
split; primarily these are osd_max_backfills and
target_max_misplaced_ratio.
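
For example (values are illustrative starting points, not recommendations;
osd_max_backfills throttles concurrent backfills per OSD, and
target_max_misplaced_ratio caps the fraction of data allowed to be
misplaced at once):

    ceph config set osd osd_max_backfills 2
    ceph config set mgr target_max_misplaced_ratio 0.05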

You can monitor the progress of the split by looking at "ceph osd pool ls
detail" for the pool you are splitting; for that pool, pgp_num will slowly
increase until it reaches pg_num / pg_num_target.
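
For example, assuming the pool is named "volumes":

    ceph osd pool ls detail | grep volumes
    ceph osd pool get volumes pgp_num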

IMO this blog post best covers the operation you are looking to
undertake:
https://ceph.io/en/news/blog/2019/new-in-nautilus-pg-merging-and-autotuning/

Respectfully,

*Wes Dillingham*
wes@xxxxxxxxxxxxxxxxx
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Tue, Jan 30, 2024 at 9:38 AM Michel Niyoyita <micou12@xxxxxxxxx> wrote:

> Thanks for your advice, Wes. Below is what "ceph osd df tree" shows. Will
> increasing the pg_num of the production cluster affect performance or
> cause a crash? How long can it take to finish?
>
> ceph osd df tree
> ID  CLASS  WEIGHT     REWEIGHT  SIZE     RAW USE  DATA      OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
> -1         433.11841         -  433 TiB  151 TiB    67 TiB  364 MiB  210 GiB  282 TiB  34.86  1.00    -          root default
> -3         144.37280         -  144 TiB   50 TiB    22 TiB  121 MiB   70 GiB   94 TiB  34.86  1.00    -              host ceph-osd1
>  2    hdd    9.02330   1.00000  9.0 TiB  2.7 TiB  1021 GiB  5.4 MiB  3.7 GiB  6.3 TiB  30.40  0.87   19      up          osd.2
>  3    hdd    9.02330   1.00000  9.0 TiB  2.7 TiB   931 GiB  4.1 MiB  3.5 GiB  6.4 TiB  29.43  0.84   29      up          osd.3
>  6    hdd    9.02330   1.00000  9.0 TiB  3.3 TiB   1.5 TiB  8.1 MiB  4.5 GiB  5.8 TiB  36.09  1.04   20      up          osd.6
>  9    hdd    9.02330   1.00000  9.0 TiB  2.8 TiB   1.0 TiB  6.6 MiB  3.8 GiB  6.2 TiB  30.97  0.89   23      up          osd.9
> 12    hdd    9.02330   1.00000  9.0 TiB  4.0 TiB   2.3 TiB   13 MiB  6.1 GiB  5.0 TiB  44.68  1.28   30      up          osd.12
> 15    hdd    9.02330   1.00000  9.0 TiB  3.5 TiB   1.8 TiB  9.2 MiB  5.2 GiB  5.5 TiB  38.99  1.12   30      up          osd.15
> 18    hdd    9.02330   1.00000  9.0 TiB  3.0 TiB   1.2 TiB  6.5 MiB  4.0 GiB  6.1 TiB  32.80  0.94   21      up          osd.18
> 22    hdd    9.02330   1.00000  9.0 TiB  3.6 TiB   1.9 TiB   10 MiB  5.4 GiB  5.4 TiB  40.25  1.15   22      up          osd.22
> 25    hdd    9.02330   1.00000  9.0 TiB  3.9 TiB   2.1 TiB   12 MiB  5.7 GiB  5.1 TiB  42.94  1.23   22      up          osd.25
> 28    hdd    9.02330   1.00000  9.0 TiB  3.1 TiB   1.4 TiB  7.5 MiB  4.1 GiB  5.9 TiB  34.87  1.00   21      up          osd.28
> 32    hdd    9.02330   1.00000  9.0 TiB  2.7 TiB  1017 GiB  4.8 MiB  3.7 GiB  6.3 TiB  30.36  0.87   27      up          osd.32
> 35    hdd    9.02330   1.00000  9.0 TiB  3.0 TiB   1.3 TiB  7.2 MiB  4.2 GiB  6.0 TiB  33.73  0.97   21      up          osd.35
> 38    hdd    9.02330   1.00000  9.0 TiB  3.1 TiB   1.4 TiB  7.3 MiB  4.1 GiB  5.9 TiB  34.57  0.99   24      up          osd.38
> 41    hdd    9.02330   1.00000  9.0 TiB  2.9 TiB   1.2 TiB  6.2 MiB  4.0 GiB  6.1 TiB  32.49  0.93   24      up          osd.41
> 44    hdd    9.02330   1.00000  9.0 TiB  3.1 TiB   1.4 TiB  7.3 MiB  4.4 GiB  5.9 TiB  34.87  1.00   29      up          osd.44
> 47    hdd    9.02330   1.00000  9.0 TiB  2.7 TiB  1016 GiB  5.4 MiB  3.6 GiB  6.3 TiB  30.35  0.87   23      up          osd.47
> -7         144.37280         -  144 TiB   50 TiB    22 TiB  122 MiB   70 GiB   94 TiB  34.86  1.00    -              host ceph-osd2
>  1    hdd    9.02330   1.00000  9.0 TiB  2.8 TiB   1.1 TiB  5.7 MiB  3.8 GiB  6.2 TiB  31.00  0.89   27      up          osd.1
>  5    hdd    9.02330   1.00000  9.0 TiB  3.2 TiB   1.5 TiB  7.3 MiB  4.5 GiB  5.8 TiB  35.45  1.02   27      up          osd.5
>  8    hdd    9.02330   1.00000  9.0 TiB  3.3 TiB   1.6 TiB  8.3 MiB  4.7 GiB  5.7 TiB  36.85  1.06   30      up          osd.8
> 10    hdd    9.02330   1.00000  9.0 TiB  3.1 TiB   1.4 TiB  7.5 MiB  4.5 GiB  5.9 TiB  34.87  1.00   20      up          osd.10
> 13    hdd    9.02330   1.00000  9.0 TiB  3.6 TiB   1.8 TiB   10 MiB  5.3 GiB  5.4 TiB  39.63  1.14   27      up          osd.13
> 16    hdd    9.02330   1.00000  9.0 TiB  2.8 TiB   1.1 TiB  6.0 MiB  3.8 GiB  6.2 TiB  31.01  0.89   19      up          osd.16
> 19    hdd    9.02330   1.00000  9.0 TiB  3.0 TiB   1.2 TiB  6.4 MiB  4.0 GiB  6.1 TiB  32.77  0.94   21      up          osd.19
> 21    hdd    9.02330   1.00000  9.0 TiB  2.8 TiB   1.1 TiB  5.5 MiB  3.7 GiB  6.2 TiB  31.58  0.91   26      up          osd.21
> 24    hdd    9.02330   1.00000  9.0 TiB  2.6 TiB   855 GiB  4.7 MiB  3.3 GiB  6.4 TiB  28.61  0.82   19      up          osd.24
> 27    hdd    9.02330   1.00000  9.0 TiB  3.7 TiB   1.9 TiB   10 MiB  5.2 GiB  5.3 TiB  40.84  1.17   24      up          osd.27
> 30    hdd    9.02330   1.00000  9.0 TiB  3.2 TiB   1.4 TiB  7.5 MiB  4.5 GiB  5.9 TiB  35.16  1.01   22      up          osd.30
> 33    hdd    9.02330   1.00000  9.0 TiB  3.1 TiB   1.4 TiB  8.6 MiB  4.3 GiB  5.9 TiB  34.59  0.99   23      up          osd.33
> 36    hdd    9.02330   1.00000  9.0 TiB  3.4 TiB   1.7 TiB   10 MiB  5.0 GiB  5.6 TiB  38.17  1.09   25      up          osd.36
> 39    hdd    9.02330   1.00000  9.0 TiB  3.4 TiB   1.7 TiB  8.5 MiB  5.1 GiB  5.6 TiB  37.79  1.08   31      up          osd.39
> 42    hdd    9.02330   1.00000  9.0 TiB  3.6 TiB   1.8 TiB   10 MiB  5.2 GiB  5.4 TiB  39.68  1.14   23      up          osd.42
> 45    hdd    9.02330   1.00000  9.0 TiB  2.7 TiB   964 GiB  5.1 MiB  3.5 GiB  6.3 TiB  29.78  0.85   21      up          osd.45
> -5         144.37280         -  144 TiB   50 TiB    22 TiB  121 MiB   70 GiB   94 TiB  34.86  1.00    -              host ceph-osd3
>  0    hdd    9.02330   1.00000  9.0 TiB  2.7 TiB   934 GiB  4.9 MiB  3.4 GiB  6.4 TiB  29.47  0.85   21      up          osd.0
>  4    hdd    9.02330   1.00000  9.0 TiB  3.0 TiB   1.2 TiB  6.5 MiB  4.1 GiB  6.1 TiB  32.73  0.94   22      up          osd.4
>  7    hdd    9.02330   1.00000  9.0 TiB  3.5 TiB   1.8 TiB  9.2 MiB  5.1 GiB  5.5 TiB  39.02  1.12   30      up          osd.7
> 11    hdd    9.02330   1.00000  9.0 TiB  3.6 TiB   1.9 TiB   10 MiB  5.1 GiB  5.4 TiB  39.97  1.15   27      up          osd.11
> 14    hdd    9.02330   1.00000  9.0 TiB  3.5 TiB   1.7 TiB   10 MiB  5.1 GiB  5.6 TiB  38.24  1.10   27      up          osd.14
> 17    hdd    9.02330   1.00000  9.0 TiB  3.0 TiB   1.2 TiB  6.4 MiB  4.1 GiB  6.0 TiB  33.09  0.95   23      up          osd.17
> 20    hdd    9.02330   1.00000  9.0 TiB  2.8 TiB   1.1 TiB  5.6 MiB  3.8 GiB  6.2 TiB  31.55  0.90   20      up          osd.20
> 23    hdd    9.02330   1.00000  9.0 TiB  2.6 TiB   828 GiB  4.0 MiB  3.3 GiB  6.5 TiB  28.32  0.81   23      up          osd.23
> 26    hdd    9.02330   1.00000  9.0 TiB  2.9 TiB   1.2 TiB  5.8 MiB  3.8 GiB  6.1 TiB  32.12  0.92   26      up          osd.26
> 29    hdd    9.02330   1.00000  9.0 TiB  3.6 TiB   1.8 TiB   10 MiB  5.1 GiB  5.4 TiB  39.73  1.14   24      up          osd.29
> 31    hdd    9.02330   1.00000  9.0 TiB  2.8 TiB   1.1 TiB  5.8 MiB  3.7 GiB  6.2 TiB  31.56  0.91   22      up          osd.31
> 34    hdd    9.02330   1.00000  9.0 TiB  3.3 TiB   1.5 TiB  8.2 MiB  4.6 GiB  5.7 TiB  36.29  1.04   23      up          osd.34
> 37    hdd    9.02330   1.00000  9.0 TiB  3.2 TiB   1.5 TiB  8.2 MiB  4.5 GiB  5.8 TiB  35.51  1.02   20      up          osd.37
> 40    hdd    9.02330   1.00000  9.0 TiB  3.4 TiB   1.7 TiB  9.3 MiB  4.9 GiB  5.6 TiB  38.16  1.09   25      up          osd.40
> 43    hdd    9.02330   1.00000  9.0 TiB  3.4 TiB   1.6 TiB  8.5 MiB  4.8 GiB  5.7 TiB  37.19  1.07   29      up          osd.43
> 46    hdd    9.02330   1.00000  9.0 TiB  3.1 TiB   1.4 TiB  8.4 MiB  4.4 GiB  5.9 TiB  34.85  1.00   23      up          osd.46
>                          TOTAL  433 TiB  151 TiB    67 TiB  364 MiB  210 GiB  282 TiB  34.86
> MIN/MAX VAR: 0.81/1.28  STDDEV: 3.95
>
>
> Michel
>
>
> On Tue, Jan 30, 2024 at 4:18 PM Wesley Dillingham <wes@xxxxxxxxxxxxxxxxx>
> wrote:
>
>> I now concur you should increase the pg_num as a first step for this
>> cluster. Disable the pg autoscaler for the volumes pool and increase it
>> to pg_num 256. Then likely re-assess and make the next power-of-two jump
>> to 512, and probably beyond.
>>
>> Keep in mind this is not going to fix your short-term deep-scrub issue;
>> in fact, it will increase the number of not-scrubbed-in-time PGs until
>> the pg_num change is complete. This is because OSDs don't scrub while
>> they are backfilling.
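>>
>> Something like the following should list the affected PGs in the
>> meantime (assuming the standard "pgs not deep-scrubbed in time" health
>> warning is present):
>>
>>     ceph health detail | grep -i scrub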
>>
>> I would sit at 256 for a couple of weeks, let scrubs happen, and then
>> continue past 256, with the ultimate target of around 100-200 PGs per
>> OSD, which "ceph osd df tree" will show you in the PGS column.
>>
>> Respectfully,
>>
>> *Wes Dillingham*
>> wes@xxxxxxxxxxxxxxxxx
>> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>>
>>
>> On Tue, Jan 30, 2024 at 3:16 AM Michel Niyoyita <micou12@xxxxxxxxx>
>> wrote:
>>
>>> Dear team,
>>>
>>> below is the output of the "ceph df" command and the ceph version I am running:
>>>
>>>  ceph df
>>> --- RAW STORAGE ---
>>> CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
>>> hdd    433 TiB  282 TiB  151 TiB   151 TiB      34.82
>>> TOTAL  433 TiB  282 TiB  151 TiB   151 TiB      34.82
>>>
>>> --- POOLS ---
>>> POOL                   ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
>>> device_health_metrics   1    1  1.1 MiB        3  3.2 MiB      0     73 TiB
>>> .rgw.root               2   32  3.7 KiB        8   96 KiB      0     73 TiB
>>> default.rgw.log         3   32  3.6 KiB      209  408 KiB      0     73 TiB
>>> default.rgw.control     4   32      0 B        8      0 B      0     73 TiB
>>> default.rgw.meta        5   32    382 B        2   24 KiB      0     73 TiB
>>> volumes                 6  128   21 TiB    5.68M   62 TiB  22.09     73 TiB
>>> images                  7   32  878 GiB  112.50k  2.6 TiB   1.17     73 TiB
>>> backups                 8   32      0 B        0      0 B      0     73 TiB
>>> vms                     9   32  881 GiB  174.30k  2.5 TiB   1.13     73 TiB
>>> testbench              10   32      0 B        0      0 B      0     73 TiB
>>> root@ceph-mon1:~# ceph --version
>>> ceph version 16.2.11 (3cf40e2dca667f68c6ce3ff5cd94f01e711af894) pacific
>>> (stable)
>>> root@ceph-mon1:~#
>>>
>>> please advise accordingly
>>>
>>> Michel
>>>
>>> On Mon, Jan 29, 2024 at 9:48 PM Frank Schilder <frans@xxxxxx> wrote:
>>>
>>> > You will have to look at the output of "ceph df" and make a decision
>>> > to balance "objects per PG" and "GB per PG". Increase the PG count
>>> > most for the pools with the worst of these two numbers, such that it
>>> > balances out as much as possible. If you have pools that see
>>> > significantly more user-IO than others, prioritise these.
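>>> >
>>> > As a purely illustrative calculation: a pool with 21 TiB stored and
>>> > 5.68M objects in 128 PGs averages roughly 168 GiB and ~44,000 objects
>>> > per PG; doubling pg_num to 256 would halve both figures.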
>>> >
>>> > You will have to find out for your specific cluster, we can only give
>>> > general guidelines. Make changes, run benchmarks, re-evaluate. Take the
>>> > time for it. The better you know your cluster and your users, the
>>> better
>>> > the end result will be.
>>> >
>>> > Best regards,
>>> > =================
>>> > Frank Schilder
>>> > AIT Risø Campus
>>> > Bygning 109, rum S14
>>> >
>>> > ________________________________________
>>> > From: Michel Niyoyita <micou12@xxxxxxxxx>
>>> > Sent: Monday, January 29, 2024 2:04 PM
>>> > To: Janne Johansson
>>> > Cc: Frank Schilder; E Taka; ceph-users
>>> > Subject: Re:  Re: 6 pgs not deep-scrubbed in time
>>> >
>>> > This is how it is set. If you suggest making some changes, please
>>> > advise.
>>> >
>>> > Thank you.
>>> >
>>> >
>>> > ceph osd pool ls detail
>>> > pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 1407 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr_devicehealth
>>> > pool 2 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1393 flags hashpspool stripe_width 0 application rgw
>>> > pool 3 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1394 flags hashpspool stripe_width 0 application rgw
>>> > pool 4 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1395 flags hashpspool stripe_width 0 application rgw
>>> > pool 5 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1396 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw
>>> > pool 6 'volumes' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 128 pgp_num 128 autoscale_mode on last_change 108802 lfor 0/0/14812 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
>>> >         removed_snaps_queue [22d7~3,11561~2,11571~1,11573~1c,11594~6,1159b~f,115b0~1,115b3~1,115c3~1,115f3~1,115f5~e,11613~6,1161f~c,11637~1b,11660~1,11663~2,11673~1,116d1~c,116f5~10,11721~c]
>>> > pool 7 'images' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 94609 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
>>> > pool 8 'backups' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1399 flags hashpspool stripe_width 0 application rbd
>>> > pool 9 'vms' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 108783 lfor 0/561/559 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
>>> >         removed_snaps_queue [3fa~1,3fc~3,400~1,402~1]
>>> > pool 10 'testbench' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 20931 lfor 0/20931/20929 flags hashpspool stripe_width 0
>>> >
>>> >
>>> > On Mon, Jan 29, 2024 at 2:09 PM Michel Niyoyita <micou12@xxxxxxxxx>
>>> > wrote:
>>> > Thank you Janne ,
>>> >
>>> > Is there no need to set some flags, like "ceph osd set nodeep-scrub"?
>>> >
>>> > Thank you
>>> >
>>> > On Mon, Jan 29, 2024 at 2:04 PM Janne Johansson <icepic.dz@xxxxxxxxx>
>>> > wrote:
>>> > On Mon, Jan 29, 2024 at 12:58 Michel Niyoyita <micou12@xxxxxxxxx>
>>> > wrote:
>>> > >
>>> > > Thank you, Frank,
>>> > >
>>> > > All disks are HDDs. I would like to know if I can increase the
>>> > > number of PGs live in production without a negative impact on the
>>> > > cluster. If yes, which commands should I use?
>>> >
>>> > Yes. "ceph osd pool set <poolname> pg_num <number larger than before>"
>>> > where the number usually should be a power of two that leads to a
>>> > number of PGs per OSD between 100-200.
>>> >
>>> > --
>>> > May the most significant bit of your life be positive.
>>> >
>>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



