Re: Does CEPH limit the pgp_num which it will increase in one go?

Hi,

I did a small test to see what would happen if I changed the amount of "allowed" misplaced objects, and this does indeed change the number of PGs that are backfilled simultaneously.

While this is probably not the balancer itself, it at least shares this setting:

ceph config set mgr target_max_misplaced_ratio .015

results in about 1.5% misplaced objects, where previously it was about 1%:
    pgs:     20299219/1324542018 objects misplaced (1.533%)
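
For reference, a quick way to double-check the setting and to keep an eye on the misplaced ratio while this runs (nothing fancy, just the standard config/status commands):

    ceph config get mgr target_max_misplaced_ratio   # should now report 0.015
    ceph pg stat                                     # one-line summary, includes the misplaced object count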

This is good to know, as it lets you easily limit the impact on the cluster.
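
Given that, going straight to the final value should be as simple as this (with <pool> as a placeholder for our pool name):

    ceph osd pool set <pool> pgp_num 4096

and Ceph should then walk the actual pgp_num up towards that target in steps, keeping the misplaced ratio under the configured limit (at least, that's how I understand it from this thread).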

I think we are good to go for now. Thanks for the help in understanding this part of Ceph a little better.


Met vriendelijke groet,
Kind Regards,
Maarten van Ingen
 
Specialist |SURF |maarten.vaningen@xxxxxxx | T +31 30 88 787 3000 |M +31 6 19 03 90 19|
SURF <http://www.surf.nl/> is the collaborative organisation for ICT in Dutch education and research

On 15-02-2022 09:56, Maarten van Ingen <maarten.vaningen@xxxxxxx> wrote:

    Hi,

    We have had pg_num set to 4096 for quite some time (months), but only now have we increased the pgp_num. So if I understand correctly, the splitting should already have been done months ago; increasing the pgp_num should only make sure the newly created PGs are actually moved into place.
    I read this here (for example) http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-May/001610.html

    We will keep your tip about the time it takes in mind. I think 1% is normally about a day's work, and the 200 PGs in this pool are about 4-ish %, so that would mean a bit under a week until it's done. We are keeping osd_max_backfills at 1 for now; setting it higher would of course mean it gets done much faster, but it would also mean a bigger performance impact on the cluster.
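
    If we do want to speed things up later, I assume bumping the backfill limit would be something along these lines (just a sketch; osd_max_backfills is the per-OSD backfill limit and 2 is only an example value):

        ceph config set osd osd_max_backfills 2

    but for now we will leave it at 1.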

    Met vriendelijke groet,
    Kind Regards,
    Maarten van Ingen

    Specialist |SURF |maarten.vaningen@xxxxxxx | T +31 30 88 787 3000 |M +31 6 19 03 90 19|
    SURF <http://www.surf.nl/> is the collaborative organisation for ICT in Dutch education and research

    On 15-02-2022 09:30, Dan van der Ster <daniel.vanderster@xxxxxxx> wrote:

        Hi,

        You're confused: the `ceph balancer` is not related to pg splitting. The balancer is used to move PGs around to achieve a uniform distribution.

        What you're doing now by increasing pg_num and pgp_num is splitting --> large PGs are split into smaller ones. This is achieved through backfilling.

        BTW, while a cluster is continuously backfilling, it will never trim osdmaps. If these accumulate for many days or weeks it can have a service impact on the mons (e.g. disk filling up).
        For this reason I suggest letting it get to 2248, making sure the osdmaps have trimmed [1], and then increasing pgp_num again.

        (This kind of stepwise process is really only important for large clusters where splitting can take many days to finish).

        Cheers, Dan

        [1] To see the number of osdmaps, go to any host with osds, e.g. osd.123, and do `ceph daemon osd.123 status`. Then find the difference between newest_map and oldest_map, e.g.:

            "oldest_map": 3970333,
            "newest_map": 3971041,

        It should be under 1000 or so. If it is much larger, your osdmaps are not trimming.
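
        If jq is installed on that host (that's an assumption, any similar JSON tool will do), you can get the difference in one go, e.g.:

            ceph daemon osd.123 status | jq '.newest_map - .oldest_map'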



        > On 02/15/2022 9:08 AM Maarten van Ingen <maarten.vaningen@xxxxxxx> wrote:
        > 
        >  
        > Hi Dan,
        > 
        > Thanks for your (very) prompt response.
        > 
        > pg_num 4096 pgp_num 2108 pgp_num_target 2248
        > 
        > Also I see this:
        > #ceph balancer eval
        > current cluster score 0.068634 (lower is better)
        > 
        > #ceph balancer status
        > {
        >     "last_optimize_duration": "0:00:00.025029", 
        >     "plans": [], 
        >     "mode": "upmap", 
        >     "active": true, 
        >     "optimize_result": "Too many objects (0.010762 > 0.010000) are misplaced; try again later", 
        >     "last_optimize_started": "Tue Feb 15 09:05:32 2022"
        > }
        > 
        > It seems it is indeed limiting the data movement to the configured 1%.
        > So it is safe to assume I can set the number to 4096 and the total amount of misplaced objects will stay around 1%.
        > 
        > Met vriendelijke groet,
        > Kind Regards,
        > Maarten van Ingen
        >  
        > Specialist |SURF |maarten.vaningen@xxxxxxx | T +31 30 88 787 3000 |M +31 6 19 03 90 19|
        > SURF <http://www.surf.nl/> is the collaborative organisation for ICT in Dutch education and research
        > 
        > On 15-02-2022 09:01, Dan van der Ster <daniel.vanderster@xxxxxxx> wrote:
        > 
        >     Hi Maarten,
        > 
        >     With `ceph osd pool ls detail` does it have pgp_num_target set to 2248?
        >     If so, yes it's moving gradually to that number.
        > 
        >     Cheers, Dan
        > 
        >     > On 02/15/2022 8:55 AM Maarten van Ingen <maarten.vaningen@xxxxxxx> wrote:
        >     > 
        >     >  
        >     > Hi,
        >     > 
        >     > After enabling the balancer (set to upmap mode) on our environment, it's time to get the pgp_num on one of the pools on par with the pg_num.
        >     > This pool has pg_num set to 4096 and pgp_num to 2048 (by our mistake).
        >     > I just set the pgp_num to 2248 to keep data movement in check.
        >     > 
        >     > Oddly enough I see it has only increased to 2108. It's also odd that we now get this health warning: 1 pools have pg_num > pgp_num, which we haven't seen before…
        >     > 
        >     > 
        >     > # ceph -s
        >     >   cluster:
        >     >     id:     <id>
        >     >     health: HEALTH_WARN
        >     >             1 pools have pg_num > pgp_num
        >     > 
        >     >   services:
        >     >     mon: 5 daemons, quorum mon01,mon02,mon03,mon05,mon04 (age 3d)
        >     >     mgr: mon01(active, since 3w), standbys: mon05, mon04, mon03, mon02
        >     >     mds: cephfs:1 {0=mon04=up:active} 4 up:standby
        >     >     osd: 1278 osds: 1278 up (since 68m), 1278 in (since 22h); 74 remapped pgs
        >     > 
        >     >   data:
        >     >     pools:   28 pools, 13824 pgs
        >     >     objects: 441.41M objects, 1.5 PiB
        >     >     usage:   4.5 PiB used, 6.9 PiB / 11 PiB avail
        >     >     pgs:     15652608/1324221126 objects misplaced (1.182%)
        >     >              13693 active+clean
        >     >              74    active+remapped+backfilling
        >     >              56    active+clean+scrubbing+deep
        >     >              1     active+clean+scrubbing
        >     > 
        >     >   io:
        >     >     client:   187 MiB/s rd, 2.2 GiB/s wr, 11.11k op/s rd, 5.63k op/s wr
        >     >     recovery: 1.8 GiB/s, 533 objects/s
        >     > 
        >     > 
        >     > ceph osd pool get <pool> pgp_num
        >     > pgp_num: 2108
        >     > 
        >     > Is this default behaviour of Ceph?
        >     > I get the feeling the balancer might have something to do with this as well, since we have set the balancer to only allow 1% misplaced objects. If that's true, could I just set pgp_num to 4096 directly and have Ceph limit the data movement by itself?
        >     > 
        >     > We are running a fully updated Nautilus cluster.
        >     > 
        >     > Met vriendelijke groet,
        >     > Kind Regards,
        >     > Maarten van Ingen
        >     > 
        >     > Specialist |SURF |maarten.vaningen@xxxxxxx| T +31 30 88 787 3000 |M +31 6 19 03 90 19|
        >     > SURF<http://www.surf.nl/> is the collaborative organisation for ICT in Dutch education and research

