Re: Somehow throotle recovery even further than basic options?

Joachim Kraftmayer <joachim.kraftmayer@xxxxxxxxx> · Sun, 8 Sep 2024 19:39:05 +0200

Hi,
the presentation from Dan van der Ster should answer all your questions:

https://youtu.be/9lsByOMdEwc?si=GfvICgZCnT2L93Tn

If you have no time, start at 17:00 min.

Joachim

joachim.kraftmayer@xxxxxxxxx

www.clyso.com

Janne Johansson <icepic.dz@xxxxxxxxx> schrieb am Sa., 7. Sept. 2024, 09:00:

> The pgremapper (and the python one) will allow you to mark all the PGs
> that a new disk gets as an empty-misplaced PG to be correct where they
> currently are. This means that after you run one of the remappers, the
> upmap will tell the cluster to stay as it is even though new empty
> OSDs have arrived with correct crush weights and all.
>
> So you set norebalance, add N+1 new OSDs, the cluster shudders for a
> short while when the new-empty PGs are created on the new drives, then
> you have lots and lots of misplaced PGs which norebalance prevents
> from starting backfills on.
>
> Then you run the remapper and "fix" the upmap so all the current
> placements are considered "correct", and hence the PGs that were
> supposed to move stop being misplaced. After this, you remove
> "norebalance" to allow moves to start happening.
>
> By now, no movement (or at least very few PGs) should occur. What
> happens next is that the balancer notices there actually is more
> space, and figures out the optimal result is more or less the same as
> above where lots of PGs should go to the new OSDs, but it does this
> with the max-misplaced-ratio in mind, and it will "move" the PGs by
> just unsetting their upmap entry that forced them to want to stay in
> place, so as time passes, it moves a few PGs at a time, and moves them
> by removing the upmaps from them so that most of the time your other
> OSDs will look perfectly healthy and will continue to do all the
> scrubs and things a healthy OSD should do and which it will not do if
> it has a long queue of backfills waiting to eat up all the slots for
> non-client IO.
>
> While you can do soft additions with increasing crush weights, this
> potentially causes lots of more movements since an OSD host with OSDs
> going from 0.1 to 0.2 weight might not place all PGs in the same spot
> in those two cases, so you could have movement within the host and so
> on from the recalculated pseudorandom placements on every increase.
>
> Den lör 7 sep. 2024 kl 00:15 skrev Eugen Block <eblock@xxxxxx>:
> >
> > I can’t say anything about the pgremapper, but have you tried
> > increasing the crush weight gradually? Add new OSDs with crush initial
> > weight 0 and then increase it in small steps. I haven’t used that
> > approach for years, but maybe that can help here. Or are all OSDs
> > already up and in? Or you could reduce the max misplaced ratio to 1%
> > or even lower (default is 5%)?
> >
> > Zitat von "Szabo, Istvan (Agoda)" <Istvan.Szabo@xxxxxxxxx>:
> >
> > > Forgot to paste, somehow I want to reduce this recovery operation:
> > > recovery: 0 B/s, 941.90k keys/s, 188 objects/s
> > > To 2-300Keys/sec
> > >
> > >
> > >
> > > ________________________________
> > > From: Szabo, Istvan (Agoda) <Istvan.Szabo@xxxxxxxxx>
> > > Sent: Friday, September 6, 2024 11:18 PM
> > > To: Ceph Users <ceph-users@xxxxxxx>
> > > Subject:  Somehow throotle recovery even further than
> > > basic options?
> > >
> > > Hi,
> > >
> > > 4 years ago we've created our cluster with all disks 4osds (ssds and
> > > nvme disks) on octopus.
> > > The 15TB SSDs still working properly with 4 osds but the small 1.8T
> > > nvmes with the index pool not.
> > > Each new nvme osd adding to the existing nodes generates slow ops
> > > with scrub off, recovery_op_priority 1, backfill and recovery 1-1.
> > > I even turned off all index pool heavy sync mechanism but the read
> > > latency still high which means recovery op pushes it even higher.
> > >
> > > I'm trying to somehow add resource to the cluster to spread the 2048
> > > index pool pg (in replica 3 means 6144pg index pool) but can't make
> > > it more gentle.
> > >
> > > The balancer is working in upmap with max deviation 1.
> > >
> > > Have this script from digitalocean
> > > https://github.com/digitalocean/pgremapper, is there anybody tried
> > > it before how is it or could this help actually?
> > >
> > > Thank you the ideas.
> > >
> > > ________________________________
> > > This message is confidential and is for the sole use of the intended
> > > recipient(s). It may also be privileged or otherwise protected by
> > > copyright or other legal rules. If you have received it by mistake
> > > please let us know by reply email and delete it from your system. It
> > > is prohibited to copy this message or disclose its content to
> > > anyone. Any confidentiality or privilege is not waived or lost by
> > > any mistaken delivery or unauthorized disclosure of the message. All
> > > messages sent to and from Agoda may be monitored to ensure
> > > compliance with company policies, to protect the company's interests
> > > and to remove potential malware. Electronic messages may be
> > > intercepted, amended, lost or deleted, or contain viruses.
> > > _______________________________________________
> > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > > _______________________________________________
> > > ceph-users mailing list -- ceph-users@xxxxxxx
> > > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
> >
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
>
> --
> May the most significant bit of your life be positive.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx