Re: Slow initial boot of OSDs in large cluster with unclean state

Hi Tom,

> Just to check, are you recommending that at some point each week all PGs are clean *at the same time*, or that no PGs should be unclean for more than a week?
> The latter absolutely makes sense, but the former can be quite hard to manage sometimes on this cluster; with about one drive failure a week we're somewhat at the mercy of probability. We do always try and aim for 'clean-ish' every so often though :)
> Also, just to double check my understanding here, the cluster needs to keep hold of osdmaps going back to the point at which the currently unclean PGs were last clean? So if a cluster has a bunch of backfill being queued continuously for a month, but individual PGs get remapped and then backfilled quickly (e.g. ~1day), the cluster only needs to hold onto maps for the day, rather than the entire month period? Or am I missing something?

The mon keeps track of the "last epoch clean" which is indeed the
lowest osd epoch number out of *all* PGs (and also all *pools*, but
that's usually not relevant). Clean means active, not degraded, and
not remapped. So if even a single PG is not active, or is degraded,
undersized, or remapped, then the cluster as a whole is in an unclean
state.
The mons trim the osdmaps up to that "last epoch clean". (And all
OSDs similarly keep the same range of osdmaps that the mons keep.)
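
If you want to see which PGs are currently holding the cluster
unclean (and therefore blocking trimming), something like this is a
quick way to check (just a sketch -- filter on whichever states you
care about):

  ceph pg dump_stuck unclean   # PGs stuck not active+clean
  ceph pg ls remapped          # just the remapped ones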

So my "rule of thumb" is that you should aim to have a fully clean
cluster about once per week, because one week's worth of OSDmaps is a
relatively manageable number of maps for all the daemons in the
cluster to store and track.
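
A quick way to keep an eye on how many epochs the cluster is holding
right now is the same "ceph report" check from my earlier mail
(quoted below):

  ceph report | jq '(.osdmap_last_committed - .osdmap_first_committed)'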

> It's pulling down OSD maps, 40 at a time, taking about 4 seconds each time. With the ~30,000(!) OSD maps it pulls down, it takes approximately an hour. At ~4MB a map, this then matches up with the ~115GB storage consumed by the resulting OSD with no PGs.

One small correction here: the OSDs *normally* get a delta-encoded
version of each map -- a so-called "incremental map". Those
incremental maps are much smaller than the full-sized ~4MB maps.
And by default, the OSD stores each map in memory in a "deduplicated"
manner, to minimize the memory consumption.
You can confirm that (at least the memory part) by looking at the
osdmap section in dump_mempools -- it's storing lots of maps, but not
115GB of memory ;-)
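
For example, something like this (just a sketch -- the exact jq path
can vary a bit between releases, and osd.123 is a placeholder for one
of your own OSD ids):

  ceph daemon osd.123 dump_mempools | jq '.mempool.by_pool.osdmap'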

But indeed, maps that large, and that many of them, can create some
pretty big issues:
1. Bootstrapping new OSDs takes a long time, as you're seeing.
2. In some disaster cases, the network bandwidth needed to serve up
those osdmaps is too much for the mons to deliver. This is why we
normally suggest more mons for larger clusters -- just to have the
bandwidth to serve up osdmaps in the rare cases it's needed.
3. The mon db needs to store all the maps, and that consumes a lot of
space. You must be getting mon db size warnings, right? Anything more
than ~10GB is getting too hefty, imho.
4...
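
For point 3, a quick way to check the mon db size on one of your mon
hosts (just a sketch -- the store.db path depends on how your mons
were deployed, so adjust as needed):

  du -sh /var/lib/ceph/mon/*/store.db

The MON_DISK_BIG warning is driven by mon_data_size_warn (15GiB by
default, if I remember correctly).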

> One of the issues I noted with this approach on this cluster is the inevitability of degraded PGs due to an unrelated failed drive

You can still use upmap-remapped.py in that case -- if an OSD fails,
the "remapped" PGs will get mapped back to their current OSDs to avoid
useless data movement. But the degraded replicas/shards will still be
recovered as usual, which is good.
But you're right -- the mgr balancer pauses if the cluster is
degraded. IMHO that's usually fine -- we don't want a bunch of extra
data movement created while we're trying to recover from a degraded
state.
But there is at least one case where the balancer would help *a lot*
with degraded PGs, namely: https://tracker.ceph.com/issues/66755
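
Under the hood, upmap-remapped.py just prints plain ceph CLI
commands: pg-upmap-items entries that map each remapped PG back onto
the OSDs it is currently sitting on. The output looks roughly like
this (the PG and OSD ids here are made up for illustration):

  ceph osd pg-upmap-items 11.2f 241 57 308 64
  # i.e. for PG 11.2f: keep the shard on osd.57 instead of moving it
  # to osd.241, and keep the one on osd.64 instead of osd.308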

I'd be happy to keep chatting to help you all get onto this mode of
operating. Frankly, I don't know how anyone can operate a large Ceph
cluster without upmap-remapped or pgremapper, which is why we're
trying to get this approach upstreamed.

Cheers, Dan

On Wed, Jan 8, 2025 at 6:20 AM Thomas Byrne - STFC UKRI
<tom.byrne@xxxxxxxxxx> wrote:
>
> Hi Dan,
>
> Happy new year!
>
> > I find it's always best to aim to have all PGs clean at least once a
> week -- that way the osdmaps can be trimmed at least weekly,
> preventing all sorts of nastiness, one of which you mentioned here.
>
> Just to check, are you recommending that at some point each week all PGs are clean *at the same time*, or that no PGs should be unclean for more than a week?
>
> The latter absolutely makes sense, but the former can be quite hard to manage sometimes on this cluster; with about one drive failure a week we're somewhat at the mercy of probability. We do always try and aim for 'clean-ish' every so often though :)
>
> Also, just to double check my understanding here, the cluster needs to keep hold of osdmaps going back to the point at which the currently unclean PGs were last clean? So if a cluster has a bunch of backfill being queued continuously for a month, but individual PGs get remapped and then backfilled quickly (e.g. ~1day), the cluster only needs to hold onto maps for the day, rather than the entire month period? Or am I missing something?
>
> The above is how I would imagine an even larger cluster would operate, with the expectation that there will always be at least one non-clean PG at any time. As long as PGs that are not clean 'quickly' become clean again, the range of maps needing to be kept around will be fairly small and the cluster could carry on in this state indefinitely.
>
> Thanks for your various recommendations, there are definitely a few things we don't do that we should (e.g. a balancer schedule).
>
> We don't make use of upmap-remapped for normal operations currently, but I think what you're proposing here makes a lot of sense, especially combined with a balancer schedule. One of the issues I noted with this approach on this cluster is the inevitability of degraded PGs due to an unrelated failed drive/host stopping[1] the movement of data onto new disks/hosts/generations. This causes us issues in planning big data moves, although it is something we could easily tweak.
>
> Finally, thanks for the hint about how to identify how many maps are being kept. Being able to track this is really handy, and takes a lot of the guesswork out of understanding the need to take breaks in cluster operations. I think we also need to pay more attention to 'unclean durations' of individual PGs, which is something we can do.
>
> Cheers,
> Tom
>
> [1] https://github.com/ceph/ceph/blob/main/src/pybind/mgr/balancer/module.py#L1040
> ________________________________________
> From: Dan van der Ster <dan.vanderster@xxxxxxxxx>
> Sent: Tuesday, January 7, 2025 21:15
> To: Byrne, Thomas (STFC,RAL,SC) <tom.byrne@xxxxxxxxxx>
> Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
> Subject: Re:  Slow initial boot of OSDs in large cluster with unclean state
>
> Hi Tom,
>
> On Tue, Jan 7, 2025 at 10:15 AM Thomas Byrne - STFC UKRI
> <tom.byrne@xxxxxxxxxx> wrote:
> > I realise the obvious answer here is don't leave a big cluster in an unclean state for this long. Currently we've got PGs that have been remapped for 5 days, which matches the 30,000 OSDMap epoch range perfectly. This is something we're always looking at from a procedure point of view, e.g. keeping max_backfills as high as possible by default, ensuring balancer max_misplaced is appropriate, and re-evaluating disk and node addition/removal processes. But the reality on this cluster is that sometimes these 'logjams' happen, and it would be good to understand if we can improve the OSD addition experience so we can continue to be flexible with our operation scheduling.
>
> I find it's always best to aim to have all PGs clean at least once a
> week -- that way the osdmaps can be trimmed at least weekly,
> preventing all sorts of nastiness, one of which you mentioned here.
>
> Here's my recommended mgr balancer tuning:
>
> # Balance PGs Sunday to Friday, letting the backfilling finish on
> Saturdays. (adjust the exact days if needed -- the goal here is that
> at some point in the week, there needs to be 0 misplaced and 0
> degraded objects.)
> ceph config set mgr mgr/balancer/begin_weekday 0
> ceph config set mgr mgr/balancer/end_weekday 5
>
> # [Alternatively] Balance PGs during working hours, letting the
> backfilling finish over night:
> ceph config set mgr mgr/balancer/begin_time 0830
> ceph config set mgr mgr/balancer/end_time 1800
>
> # Decrease the max misplaced from the default 5% to 0.5%, to minimize
> the impact of backfilling and ensure the tail of backfilling PGs can
> finish over the weekend or over night -- increase this percentage if
> your cluster can tolerate it. (IMHO 5% is way too many misplaced
> objects on large clusters, but this is very use-case-specific).
> ceph config set mgr target_max_misplaced_ratio 0.005
>
> # Configure the balancer to aim for +/- 1 PG per pool per OSD -- this
> is the best uniformity we can hope for with the mgr balancer
> ceph config set mgr mgr/balancer/upmap_max_deviation 1
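>
> # A quick sanity check that these settings took effect (adjust the
> # grep to taste):
> ceph config dump | grep -e balancer -e target_max_misplaced_ratio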
>
> Then whenever you add/remove hardware, here's my recommended procedure:
>
> 1. Set some flags to prevent data from moving immediately when we add new OSDs:
>    ceph osd set norebalance
>    ceph balancer off
>
> 2. Add the new OSDs. (Or start draining -- but note that if you are
> draining OSDs, set the crush weights to 0.1 instead of 0.0 -- upmap
> magic tools don't work with OSDs having crush weight = 0).
>
> 3. Run ./upmap-remapped.py [1] until the number of misplaced objects
> is as close as possible to zero.
>
> 4. Then unset the flags so data starts rebalancing again. I.e. the mgr
> balancer will move data in a controlled manner to those new empty
> OSDs:
>
>   ceph osd unset norebalance
>   ceph balancer on
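>
> For step 2, setting a drain weight on an OSD looks like this (osd.42
> is just an example id):
>
>    ceph osd crush reweight osd.42 0.1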
>
> I have a couple talks about this for more on this topic:
>  - https://www.youtube.com/watch?v=6PQYHlerJ8k
>  - https://www.youtube.com/watch?v=A4xG975UWts
>
> We also have a plan to get this logic directly into ceph:
> https://tracker.ceph.com/issues/67418
>
> As to what you can do right now -- it's actually a great time to test
> out the above approach. Here's exactly what I'd do:
>
> 1. Stop those new OSDs (the ones that are not "in" yet) -- no point
> having them pull in 30000 osdmaps. Nothing should be degraded at this
> point -- if so, you either stopped too many OSDs, or there was some
> OSD flap that you need to recover from.
>
> 2. Since you have several remapped PGs right now, that's a perfect
> time to use upmap-remapped.py [1] -- it'll make the remapped PGs clean
> again.  So try running it:
>
>   ceph balancer off # disable the mgr balancer, otherwise it would
> "undo" what we do next
>   ./upmap-remapped.py # this just outputs commands directly to stdout.
>   ./upmap-remapped.py | sh -x # this will run those commands.
>   ./upmap-remapped.py | sh -x # run it again -- normally we need to
> just run it twice to get to a minimal number of misplaced PGs.
>
> 3. When you run it, you should see the % misplaced objects decreasing.
> Ideally it will go to 0, meaning all PGs are active+clean. At that
> point the OSDmaps should trim.
>
> 4. Confirm that osdmaps have trimmed by looking at the `ceph report`:
>
>   ceph report | jq '(.osdmap_last_committed - .osdmap_first_committed)'
>
> ^^ the number above should be less than 750. If not -- then the
> osdmaps are not trimmed, and you need to investigate further.
>
> 5. Now start those new OSDs, they should pull in the ~750 osdmaps
> quickly, and then do the upmap-remapped procedure after configuring
> the balancer as I described.
>
> Hope this all helps, Happy New Year Tom.
>
> Cheers, Dan
>
> [1] https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py
>
> --
> Dan van der Ster
> CTO @ CLYSO
> Try our Ceph Analyzer -- https://analyzer.clyso.com/
> https://clyso.com | dan.vanderster@xxxxxxxxx



-- 
Dan van der Ster
CTO @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | dan.vanderster@xxxxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



