Re: Does CEPH limit the pgp_num which it will increase in one go?

Maarten van Ingen <maarten.vaningen@xxxxxxx> · Tue, 15 Feb 2022 08:08:59 +0000

Hi Dan,

Thanks for your (very) prompt response.

pg_num 4096 pgp_num 2108 pgp_num_target 2248

Also I see this:
# ceph balancer eval
current cluster score 0.068634 (lower is better)

# ceph balancer status
{
    "last_optimize_duration": "0:00:00.025029", 
    "plans": [], 
    "mode": "upmap", 
    "active": true, 
    "optimize_result": "Too many objects (0.010762 > 0.010000) are misplaced; try again later", 
    "last_optimize_started": "Tue Feb 15 09:05:32 2022"
}
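
For reference, a minimal Python sketch of how the two numbers in "optimize_result" could be pulled out of the status output above. The function name misplaced_vs_limit and the regular expression are illustrative assumptions based only on the message format shown here, not a documented Ceph interface.

```python
import json
import re

# Output of `ceph balancer status`, copied from the message above.
status_text = """
{
    "last_optimize_duration": "0:00:00.025029",
    "plans": [],
    "mode": "upmap",
    "active": true,
    "optimize_result": "Too many objects (0.010762 > 0.010000) are misplaced; try again later",
    "last_optimize_started": "Tue Feb 15 09:05:32 2022"
}
"""


def misplaced_vs_limit(status_json):
    """Return (misplaced_ratio, limit) parsed from optimize_result, or None.

    Hypothetical helper: it assumes the "(x > y)" wording seen in this thread.
    """
    status = json.loads(status_json)
    match = re.search(r"\(([\d.]+) > ([\d.]+)\)", status.get("optimize_result", ""))
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


result = misplaced_vs_limit(status_text)
if result is not None:
    ratio, limit = result
    print(f"misplaced {ratio:.4%} vs limit {limit:.2%}")
```

Here the balancer declines to optimize because the misplaced ratio (about 1.08%) is above the configured 1% limit.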

It seems it is indeed limiting the data movement to the configured 1%.
So I assume it is safe to set pgp_num to 4096 directly, and the share of misplaced objects will stay around 1%.
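
As a rough check of the numbers in this thread, the following Python sketch parses the pool values reported above and recomputes the misplaced percentage from the `ceph -s` output quoted below. The helper parse_pg_fields is a hypothetical name, and the input string is simply the line from this message, not a complete `ceph osd pool ls detail` record.

```python
import re

# Values reported earlier in this message (from `ceph osd pool ls detail`).
pool_line = "pg_num 4096 pgp_num 2108 pgp_num_target 2248"


def parse_pg_fields(line):
    """Extract integer fields such as pg_num, pgp_num and pgp_num_target."""
    pairs = re.findall(r"(pgp?_num(?:_target)?) (\d+)", line)
    return {key: int(value) for key, value in pairs}


fields = parse_pg_fields(pool_line)

# How far pgp_num still has to move to reach the requested target,
# and how far it is from pg_num (the eventual goal).
remaining_to_target = fields["pgp_num_target"] - fields["pgp_num"]
remaining_to_pg_num = fields["pg_num"] - fields["pgp_num"]
print(remaining_to_target, remaining_to_pg_num)

# Misplaced objects and total object copies from the quoted `ceph -s` output.
misplaced, total = 15652608, 1324221126
print(f"{misplaced / total:.3%}")
```

With these inputs pgp_num is 140 short of the 2248 target and 1988 short of pg_num, and the recomputed misplaced share matches the 1.182% that `ceph -s` reports.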

Kind Regards,
Maarten van Ingen

Specialist | SURF | maarten.vaningen@xxxxxxx | T +31 30 88 787 3000 | M +31 6 19 03 90 19
SURF <http://www.surf.nl/> is the collaborative organisation for ICT in Dutch education and research

On 15-02-2022 09:01, Dan van der Ster <daniel.vanderster@xxxxxxx> wrote:

    Hi Maarten,

    With `ceph osd pool ls detail` does it have pgp_num_target set to 2248?
    If so, yes it's moving gradually to that number.

    Cheers, Dan

    > On 02/15/2022 8:55 AM Maarten van Ingen <maarten.vaningen@xxxxxxx> wrote:
    > 
    >  
    > Hi,
    > 
    > After enabling the balancer (in upmap mode) on our environment, it is time to bring the pgp_num of one of the pools in line with its pg_num.
    > This pool has pg_num set to 4096 and pgp_num set to 2048 (our mistake).
    > I just set pgp_num to 2248 to keep data movement in check.
    > 
    > Oddly enough, I see it has only increased to 2108. It is also odd that we now get this health warning, which we have not seen before: 1 pools have pg_num > pgp_num
    > 
    > 
    > # ceph -s
    >   cluster:
    >     id:     <id>
    >     health: HEALTH_WARN
    >             1 pools have pg_num > pgp_num
    > 
    >   services:
    >     mon: 5 daemons, quorum mon01,mon02,mon03,mon05,mon04 (age 3d)
    >     mgr: mon01(active, since 3w), standbys: mon05, mon04, mon03, mon02
    >     mds: cephfs:1 {0=mon04=up:active} 4 up:standby
    >     osd: 1278 osds: 1278 up (since 68m), 1278 in (since 22h); 74 remapped pgs
    > 
    >   data:
    >     pools:   28 pools, 13824 pgs
    >     objects: 441.41M objects, 1.5 PiB
    >     usage:   4.5 PiB used, 6.9 PiB / 11 PiB avail
    >     pgs:     15652608/1324221126 objects misplaced (1.182%)
    >              13693 active+clean
    >              74    active+remapped+backfilling
    >              56    active+clean+scrubbing+deep
    >              1     active+clean+scrubbing
    > 
    >   io:
    >     client:   187 MiB/s rd, 2.2 GiB/s wr, 11.11k op/s rd, 5.63k op/s wr
    >     recovery: 1.8 GiB/s, 533 objects/s
    > 
    > 
    > # ceph osd pool get <pool> pgp_num
    > pgp_num: 2108
    > 
    > Is this default behaviour in Ceph?
    > I suspect the balancer is involved here as well, since we have configured it to allow at most 1% misplaced objects. If so, could I just set pgp_num to 4096 directly and let Ceph limit the data movement by itself?
    > 
    > We are running a fully updated Nautilus cluster.
    > 
    > Kind Regards,
    > Maarten van Ingen
    > 
    > _______________________________________________
    > ceph-users mailing list -- ceph-users@xxxxxxx
    > To unsubscribe send an email to ceph-users-leave@xxxxxxx
