Dear Stefan,

this looks like a terrible "improvement", as it implies a large number of redundant object movements together with an unnecessarily and hugely prolonged state of rebalancing.

So far I have always disabled rebalancing/recovery, added new OSDs, increased pg[p]_num, waited for peering and then let Ceph loose. Everything was distributed and went to the right place in one go, and was finished after 2-3 weeks (ca. 1000 OSDs now).

Is there really no way to do this in an atomic operation any more? Would target_max_misplaced_ratio=100% do the trick?

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: 29 June 2022 14:42:47
To: Curt; Frank Schilder
Cc: Robert Gallop; ceph-users@xxxxxxx
Subject: Re: Re: Ceph recovery network speed

On 6/29/22 11:21, Curt wrote:
> On Wed, Jun 29, 2022 at 1:06 PM Frank Schilder <frans@xxxxxx> wrote:
>
>> Hi,
>>
>> did you wait for PG creation and peering to finish after setting pg_num
>> and pgp_num? They should be right on the value you set and not lower.
>>
> Yes, only thing going on was backfill. It's still just slowly expanding pg
> and pgp nums. I even ran the set command again. Here's the current info
> ceph osd pool get EC-22-Pool all
> size: 4
> min_size: 3
> pg_num: 226
> pgp_num: 98

This is coded in the mons and works like that from nautilus onwards:

src/mon/OSDMonitor.cc

...
    if (osdmap.require_osd_release < ceph_release_t::nautilus) {
      // pre-nautilus osdmap format; increase pg_num directly
      assert(n > (int)p.get_pg_num());
      // force pre-nautilus clients to resend their ops, since they
      // don't understand pg_num_target changes form a new interval
      p.last_force_op_resend_prenautilus = pending_inc.epoch;
      // force pre-luminous clients to resend their ops, since they
      // don't understand that split PGs now form a new interval.
      p.last_force_op_resend_preluminous = pending_inc.epoch;
      p.set_pg_num(n);
    } else {
      // set targets; mgr will adjust pg_num_actual and pgp_num later.
      // make pgp_num track pg_num if it already matches. if it is set
      // differently, leave it different and let the user control it
      // manually.
      if (p.get_pg_num_target() == p.get_pgp_num_target()) {
        p.set_pgp_num_target(n);
      }
      p.set_pg_num_target(n);
    }
...

So, if pg_num and pgp_num are equal at the moment pg_num is increased, pgp_num will follow it and be changed gradually. If pgp_num is different (smaller, as it cannot be bigger than pg_num), pgp_num will not be touched.

You might speed up this process by increasing "target_max_misplaced_ratio".

Gr. Stefan
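
For illustration, the "else" branch quoted above can be modelled in a few lines of Python. This is only a sketch of the decision, not Ceph code: the `Pool` class, the `set_pg_num` function and the pool sizes are hypothetical, and the gradual adjustment of the actual values by the mgr is left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    """Hypothetical stand-in for a pool's targets (not a Ceph type)."""
    pg_num_target: int
    pgp_num_target: int


def set_pg_num(pool: Pool, n: int) -> None:
    """Mimic the nautilus-and-later branch: only the targets are set."""
    # pgp_num follows pg_num only if both targets matched beforehand.
    if pool.pg_num_target == pool.pgp_num_target:
        pool.pgp_num_target = n
    pool.pg_num_target = n


# Targets equal before the change: both move together.
tracked = Pool(pg_num_target=128, pgp_num_target=128)
set_pg_num(tracked, 256)

# Targets differ before the change: pgp_num is left to the user.
manual = Pool(pg_num_target=128, pgp_num_target=64)
set_pg_num(manual, 256)

print(tracked)
print(manual)
```

In the second case the pgp_num target stays where it was until it is set explicitly, which matches the "let the user control it manually" comment in the source excerpt.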