Re: Slow initial boot of OSDs in large cluster with unclean state

Hi Tom,

On Tue, Jan 7, 2025 at 10:15 AM Thomas Byrne - STFC UKRI
<tom.byrne@xxxxxxxxxx> wrote:
> I realise the obvious answer here is don't leave big cluster in an unclean state for this long. Currently we've got PGs that have been remapped for 5 days, which matches the 30,000 OSDMap epoch range perfectly. This is something we're always looking at from a procedure point of view e.g. keeping max_backfills as high as possible by default, ensuring balancer max_misplaced is appropriate, re-evaluating disk and node addition/removal processes. But the reality on this cluster is that sometimes these 'logjams' happen, and it would be good to understand if we can improve the OSD addition experience so we can continue to be flexible with our operation scheduling.

I find it's always best to aim to have all PGs clean at least once a
week -- that way the osdmaps can be trimmed at least weekly,
preventing all sorts of nastiness, one of which you mentioned here.
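
If you want something to monitor between maintenance windows, the same
check I use further down works as a simple, read-only probe (the exact
alert threshold is up to you, but a value that keeps growing well past
~750 means the maps aren't trimming):

  ceph report | jq '(.osdmap_last_committed - .osdmap_first_committed)'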

Here's my recommended mgr balancer tuning:

# Balance PGs Sunday to Friday, letting the backfilling finish on
# Saturdays. (Adjust the exact days if needed -- the goal here is that
# at some point in the week there are 0 misplaced and 0 degraded
# objects.)
ceph config set mgr mgr/balancer/begin_weekday 0
ceph config set mgr mgr/balancer/end_weekday 5

# [Alternatively] Balance PGs during working hours, letting the
# backfilling finish overnight:
ceph config set mgr mgr/balancer/begin_time 0830
ceph config set mgr mgr/balancer/end_time 1800

# Decrease the max misplaced ratio from the default 5% to 0.5%, to
# minimize the impact of backfilling and ensure the tail of backfilling
# PGs can finish over the weekend or overnight -- increase this
# percentage if your cluster can tolerate it. (IMHO 5% is way too many
# misplaced objects on a large cluster, but this is very
# use-case-specific.)
ceph config set mgr target_max_misplaced_ratio 0.005

# Configure the balancer to aim for +/- 1 PG per pool per OSD -- this
# is the best uniformity we can hope for with the mgr balancer.
ceph config set mgr mgr/balancer/upmap_max_deviation 1
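
A quick, read-only way to confirm the settings took effect and to see
what the balancer is up to:

  ceph balancer status
  ceph config get mgr target_max_misplaced_ratio
  ceph config get mgr mgr/balancer/upmap_max_deviation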

Then whenever you add/remove hardware, here's my recommended procedure:

1. Set some flags to prevent data from moving immediately when we add new OSDs:
   ceph osd set norebalance
   ceph balancer off

2. Add the new OSDs. (Or start draining -- but note that if you are
draining OSDs, set the crush weights to 0.1 instead of 0.0 -- upmap
magic tools don't work with OSDs having crush weight = 0).

3. Run ./upmap-remapped.py [1] until the number of misplaced objects
is as close as possible to zero.

4. Then unset the flags so data starts rebalancing again. I.e. the mgr
balancer will move data in a controlled manner to those new empty
OSDs:

  ceph osd unset norebalance
  ceph balancer on
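
Putting steps 1-4 together, the whole thing is roughly this (a sketch
rather than a drop-in script -- how you create the OSDs in the middle
depends on your deployment tooling, and it assumes upmap-remapped.py is
in the current directory):

  ceph osd set norebalance
  ceph balancer off

  # ... create/start the new OSDs here, however you normally deploy them ...

  ./upmap-remapped.py | sh -x   # repeat until misplaced is as close to 0 as it gets
  ./upmap-remapped.py | sh -x

  ceph osd unset norebalance
  ceph balancer on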

I have a couple of talks that go into more detail on this topic:
 - https://www.youtube.com/watch?v=6PQYHlerJ8k
 - https://www.youtube.com/watch?v=A4xG975UWts

We also have a plan to get this logic directly into Ceph:
https://tracker.ceph.com/issues/67418

As to what you can do right now -- it's actually a great time to test
out the above approach. Here's exactly what I'd do:

1. Stop those new OSDs (the ones that are not "in" yet) -- there's no
point having them pull in 30,000 osdmaps. Nothing should be degraded
at this point -- if something is, you either stopped too many OSDs or
there was an OSD flap that you need to recover from.
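
For a quick sanity check, a couple of read-only commands are enough
(nothing here is specific to your cluster):

  ceph -s | grep degraded   # expect no output, i.e. nothing degraded
  ceph osd tree down        # only the new OSDs you just stopped should show up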

2. Since you have several remapped PGs right now, that's a perfect
time to use upmap-remapped.py [1] -- it'll make the remapped PGs clean
again.  So try running it:

  ceph balancer off             # disable the mgr balancer, otherwise it will "undo" what we do next
  ./upmap-remapped.py           # this just prints the upmap commands to stdout
  ./upmap-remapped.py | sh -x   # this actually runs those commands
  ./upmap-remapped.py | sh -x   # run it again -- normally twice is enough to reach a minimal number of misplaced PGs

3. When you run it, you should see the percentage of misplaced objects
decrease. Ideally it will go to 0, meaning all PGs are active+clean. At
that point the osdmaps should trim.
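
To watch the convergence, something as simple as this does the job:

  watch -n 30 "ceph -s | grep -E 'misplaced|clean'"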

4. Confirm that osdmaps have trimmed by looking at the `ceph report`:

  ceph report | jq '(.osdmap_last_committed - .osdmap_first_committed)'

^^ the number above should be less than 750. If it isn't, the osdmaps
are not trimming, and you need to investigate further.
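
You can also look at it from an OSD's point of view: each OSD reports
the range of maps it currently stores via its admin socket (run this on
the OSD's host; osd.123 is just a placeholder):

  ceph daemon osd.123 status | jq '{oldest_map, newest_map}'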

5. Now start those new OSDs; they should pull in the ~750 osdmaps
quickly. Then configure the balancer as I described above and run
through the upmap-remapped procedure.
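
How you start them depends on your deployment -- for example, plain
systemd units versus cephadm (osd.123 again being a placeholder):

  systemctl start ceph-osd@123      # package/systemd deployments, on the OSD host
  ceph orch daemon start osd.123    # cephadm-managed clusters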

Hope this all helps, Happy New Year Tom.

Cheers, Dan

[1] https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py

--
Dan van der Ster
CTO @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | dan.vanderster@xxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



