Hi Dan,

Happy new year!

> I find it's always best to aim to have all PGs clean at least once a week -- that way the osdmaps can be trimmed at least weekly, preventing all sorts of nastiness, one of which you mentioned here.

Just to check, are you recommending that at some point each week all PGs are clean *at the same time*, or that no PG should be unclean for more than a week? The latter absolutely makes sense, but the former can be quite hard to manage on this cluster -- with about one drive failure a week we're somewhat at the mercy of probability. We do always try to aim for 'clean-ish' every so often though :)

Also, just to double-check my understanding here: the cluster needs to keep hold of osdmaps going back to the point at which the currently unclean PGs were last clean? So if a cluster has a bunch of backfill being queued continuously for a month, but individual PGs get remapped and then backfilled quickly (e.g. in ~1 day), the cluster only needs to hold onto maps for that day, rather than for the entire month? Or am I missing something?

The above is how I would imagine an even larger cluster would operate, with the expectation that there will always be at least one non-clean PG at any time. As long as PGs that are not clean 'quickly' become clean, the range of maps needing to be kept around will be fairly small and the cluster could carry on in this state indefinitely.

Thanks for your various recommendations, there are definitely a few things we don't do that we should (e.g. a balancer schedule). We don't make use of upmap-remapped for normal operations currently, but I think what you're proposing here makes a lot of sense, especially combined with a balancer schedule. One of the issues I noted with this approach on this cluster is the inevitability of degraded PGs due to an unrelated failed drive/host stopping[1] the movement of data onto new disks/hosts/generations. This causes us issues in planning big data moves, although it is something we could easily tweak.

Finally, thanks for the hint about how to identify how many maps are being kept. Being able to track this is really handy, and takes a lot of the guesswork out of understanding the need to take breaks in cluster operations. I think we also need to pay more attention to the 'unclean durations' of individual PGs, which is something we can do.

Cheers,
Tom

[1] https://github.com/ceph/ceph/blob/main/src/pybind/mgr/balancer/module.py#L1040

________________________________________
From: Dan van der Ster <dan.vanderster@xxxxxxxxx>
Sent: Tuesday, January 7, 2025 21:15
To: Byrne, Thomas (STFC,RAL,SC) <tom.byrne@xxxxxxxxxx>
Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re: Slow initial boot of OSDs in large cluster with unclean state

Hi Tom,

On Tue, Jan 7, 2025 at 10:15 AM Thomas Byrne - STFC UKRI <tom.byrne@xxxxxxxxxx> wrote:
> I realise the obvious answer here is don't leave a big cluster in an unclean state for this long. Currently we've got PGs that have been remapped for 5 days, which matches the 30,000 OSDMap epoch range perfectly. This is something we're always looking at from a procedure point of view, e.g. keeping max_backfills as high as possible by default, ensuring balancer max_misplaced is appropriate, and re-evaluating disk and node addition/removal processes. But the reality on this cluster is that sometimes these 'logjams' happen, and it would be good to understand if we can improve the OSD addition experience so we can continue to be flexible with our operation scheduling.
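One quick way to keep an eye on those logjams is to list the PGs that aren't clean together with the time they were last clean, oldest first. A rough sketch -- the jq path assumes a recent release's `ceph pg dump --format json` layout (.pg_map.pg_stats), so adjust it for your version:

# List non-clean PGs with their last_clean timestamp, oldest first.
ceph pg dump --format json 2>/dev/null \
  | jq -r '.pg_map.pg_stats[]
           | select(.state | contains("clean") | not)
           | [.last_clean, .pgid, .state] | @tsv' \
  | sort

The oldest timestamp there is roughly how far back the cluster has to keep osdmaps.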
I find it's always best to aim to have all PGs clean at least once a week -- that way the osdmaps can be trimmed at least weekly, preventing all sorts of nastiness, one of which you mentioned here.

Here's my recommended mgr balancer tuning:

# Balance PGs Sunday to Friday, letting the backfilling finish on Saturdays.
# (Adjust the exact days if needed -- the goal here is that at some point in
# the week there are 0 misplaced and 0 degraded objects.)
ceph config set mgr mgr/balancer/begin_weekday 0
ceph config set mgr mgr/balancer/end_weekday 5

# [Alternatively] Balance PGs during working hours, letting the backfilling
# finish overnight:
ceph config set mgr mgr/balancer/begin_time 0830
ceph config set mgr mgr/balancer/end_time 1800

# Decrease the max misplaced ratio from the default 5% to 0.5%, to minimize
# the impact of backfilling and ensure the tail of backfilling PGs can finish
# over the weekend or overnight -- increase this percentage if your cluster
# can tolerate it. (IMHO 5% is way too many misplaced objects on large
# clusters, but this is very use-case-specific.)
ceph config set mgr target_max_misplaced_ratio 0.005

# Configure the balancer to aim for +/- 1 PG per pool per OSD -- this is the
# best uniformity we can hope for with the mgr balancer.
ceph config set mgr mgr/balancer/upmap_max_deviation 1

Then whenever you add/remove hardware, here's my recommended procedure:

1. Set some flags to prevent data from moving immediately when we add new OSDs:

ceph osd set norebalance
ceph balancer off

2. Add the new OSDs. (Or start draining -- but note that if you are draining OSDs, set the crush weights to 0.1 instead of 0.0; the upmap magic tools don't work with OSDs that have crush weight = 0.)

3. Run ./upmap-remapped.py [1] until the number of misplaced objects is as close as possible to zero.

4. Then unset the flags so data starts rebalancing again, i.e. the mgr balancer will move data in a controlled manner to those new empty OSDs:

ceph osd unset norebalance
ceph balancer on

I have a couple of talks with more on this topic:

- https://www.youtube.com/watch?v=6PQYHlerJ8k
- https://www.youtube.com/watch?v=A4xG975UWts

We also have a plan to get this logic directly into Ceph: https://tracker.ceph.com/issues/67418

As for what you can do right now -- it's actually a great time to test out the above approach. Here's exactly what I'd do:

1. Stop those new OSDs (the ones that are not "in" yet) -- no point having them pull in 30,000 osdmaps. Nothing should be degraded at this point -- if anything is, you either stopped too many OSDs or there was some OSD flap that you need to recover from.

2. Since you have several remapped PGs right now, that's a perfect time to use upmap-remapped.py [1] -- it'll make the remapped PGs clean again. So try running it:

ceph balancer off           # disable the mgr balancer, otherwise it would "undo" what we do next
./upmap-remapped.py         # this just outputs commands to stdout
./upmap-remapped.py | sh -x # this runs those commands
./upmap-remapped.py | sh -x # run it again -- normally running it twice is enough to get to a minimal number of misplaced PGs

3. When you run it, you should see the % misplaced objects decreasing. Ideally it will go to 0, meaning all PGs are active+clean. At that point the osdmaps should trim.

4. Confirm that the osdmaps have trimmed by looking at the `ceph report`:

ceph report | jq '(.osdmap_last_committed - .osdmap_first_committed)'

^^ the number above should be less than 750.
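If it helps, here's a slightly more verbose version of that check -- it prints the epoch bounds as well as the span, and (on releases that expose it) the epoch that trimming is currently stuck behind. Treat the osdmap_clean_epochs path as an assumption for your version:

# Print the committed epoch bounds and their span (the span should stay below ~750).
ceph report 2>/dev/null \
  | jq '{first: .osdmap_first_committed,
         last: .osdmap_last_committed,
         span: (.osdmap_last_committed - .osdmap_first_committed)}'

# If your release has it, min_last_epoch_clean shows which epoch is holding
# trimming back (field path is an assumption, check your version's report).
ceph report 2>/dev/null | jq '.osdmap_clean_epochs.min_last_epoch_clean'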
If it isn't, then the osdmaps have not trimmed and you need to investigate further.

5. Now start those new OSDs -- they should pull in the ~750 osdmaps quickly -- and then follow the upmap-remapped procedure after configuring the balancer as I described.

Hope this all helps, and Happy New Year, Tom.

Cheers, Dan

[1] https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py

--
Dan van der Ster
CTO @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | dan.vanderster@xxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx