Re: Ceph recovery network speed

Frank Schilder <frans@xxxxxx> · Wed, 29 Jun 2022 12:58:08 +0000

Dear Stefan,

this looks like a terrible "improvement", as it implies a large number of redundant object movements together with an unnecessarily and hugely prolonged state of rebalancing. So far I have always disabled rebalancing/recovery, added new OSDs, increased PG[P]_num, waited for peering, and then let ceph loose. Everything was distributed and went to the right place in one go, and it was finished after 2-3 weeks (ca. 1000 OSDs now).

Is there really no way to do this in an atomic operation any more? Would target_max_misplaced_ratio=100% do the trick?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: 29 June 2022 14:42:47
To: Curt; Frank Schilder
Cc: Robert Gallop; ceph-users@xxxxxxx
Subject: Re: Re: Ceph recovery network speed

On 6/29/22 11:21, Curt wrote:
> On Wed, Jun 29, 2022 at 1:06 PM Frank Schilder <frans@xxxxxx> wrote:
>
>> Hi,
>>
>> did you wait for PG creation and peering to finish after setting pg_num
>> and pgp_num? They should be right on the value you set and not lower.
>>
> Yes, only thing going on was backfill. It's still just slowly expanding pg
> and pgp nums.   I even ran the set command again.  Here's the current info
> ceph osd pool get EC-22-Pool all
> size: 4
> min_size: 3
> pg_num: 226
> pgp_num: 98

This behaviour is coded in the mons and has worked this way from nautilus onwards:

src/mon/OSDMonitor.cc

...
     if (osdmap.require_osd_release < ceph_release_t::nautilus) {
       // pre-nautilus osdmap format; increase pg_num directly
       assert(n > (int)p.get_pg_num());
       // force pre-nautilus clients to resend their ops, since they
       // don't understand pg_num_target changes form a new interval
       p.last_force_op_resend_prenautilus = pending_inc.epoch;
       // force pre-luminous clients to resend their ops, since they
       // don't understand that split PGs now form a new interval.
       p.last_force_op_resend_preluminous = pending_inc.epoch;
       p.set_pg_num(n);
     } else {
       // set targets; mgr will adjust pg_num_actual and pgp_num later.
       // make pgp_num track pg_num if it already matches.  if it is set
       // differently, leave it different and let the user control it
       // manually.
       if (p.get_pg_num_target() == p.get_pgp_num_target()) {
         p.set_pgp_num_target(n);
       }
       p.set_pg_num_target(n);
     }
...

So if pg_num and pgp_num are equal at the moment pg_num is increased,
pgp_num will follow it, but gradually rather than in one step. If
pgp_num is different (smaller, as it cannot be bigger than pg_num),
pgp_num is left untouched.
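
The tracking rule in the quoted else-branch can be tried out in isolation. What follows is only a minimal sketch: PoolTargets and set_pg_num_target are hypothetical stand-ins for the pool object and the surrounding mon code, modelling nothing but the two target fields, and the assert is a sanity check for this sketch, not something taken from the Ceph source.

```cpp
#include <cassert>

// Hypothetical stand-in for the pool object; only the two target
// fields touched by the quoted branch are modelled here.
struct PoolTargets {
    int pg_num_target;
    int pgp_num_target;
};

// Mirrors the post-nautilus branch: pgp_num_target follows the new
// pg_num_target only if both targets were equal before the change.
void set_pg_num_target(PoolTargets& p, int n) {
    assert(n > 0);  // sketch-only sanity check, not from the Ceph source
    if (p.pg_num_target == p.pgp_num_target) {
        p.pgp_num_target = n;
    }
    p.pg_num_target = n;
}
```

Starting from targets {128, 128} and calling set_pg_num_target(p, 256) leaves both targets at 256. Starting from {128, 64}, only pg_num_target moves to 256 and pgp_num_target stays at 64, which is the "let the user control it manually" case from the comment.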

You might speed up this process by increasing "target_max_misplaced_ratio".
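
To get a feel for why a larger ratio shortens the process, here is a toy model. It is explicitly not the mgr's actual algorithm: it assumes each adjustment round may raise pgp_num by at most max(1, floor(pg_num * ratio)) and that all data movement from the previous round has completed before the next one starts. The function name and the step rule are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Toy model, NOT the mgr's real logic: count how many adjustment rounds
// are needed if every round may raise pgp_num by at most
// max(1, floor(pg_num * ratio)) placement groups.
int rounds_to_reach_target(int pgp_num, int pgp_num_target,
                           int pg_num, double ratio) {
    assert(ratio > 0.0 && pg_num > 0 && pgp_num <= pgp_num_target);
    const int step = std::max(1, static_cast<int>(pg_num * ratio));
    int rounds = 0;
    while (pgp_num < pgp_num_target) {
        // Never overshoot the target in the final round.
        pgp_num = std::min(pgp_num_target, pgp_num + step);
        ++rounds;
    }
    return rounds;
}
```

With the numbers from the pool output above used purely as an illustration (pgp_num 98, pg_num 226, and pgp_num assumed to be heading for 226), this model needs 12 rounds at a ratio of 0.05, 3 rounds at 0.25 and a single round at 1.0. On a live cluster the option would typically be changed with "ceph config set mgr target_max_misplaced_ratio <value>"; check the documentation for your release before relying on that.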

Gr. Stefan