Hi Dan,

Happy new year!

> I find it's always best to aim to have all PGs clean at least once a week -- that way the osdmaps can be trimmed at least weekly, preventing all sorts of nastiness, one of which you mentioned here.

Just to check, are you recommending that at some point each week all PGs are clean *at the same time*, or that no PG should be unclean for more than a week? The latter absolutely makes sense, but the former can be quite hard to manage on this cluster -- with about one drive failure a week we're somewhat at the mercy of probability. We do always try to aim for 'clean-ish' every so often though :)

Also, just to double-check my understanding here: the cluster needs to keep hold of osdmaps going back to the point at which the currently unclean PGs were last clean? So if a cluster has a bunch of backfill being queued continuously for a month, but individual PGs get remapped and then backfilled quickly (e.g. in ~1 day), the cluster only needs to hold onto maps for that day, rather than for the entire month? Or am I missing something?

The above is how I would imagine an even larger cluster would operate, with the expectation that there will always be at least one non-clean PG at any time. As long as PGs that are not clean 'quickly' become clean, the range of maps needing to be kept around will be fairly small and the cluster could carry on in this state indefinitely.

Thanks for your various recommendations, there are definitely a few things we don't do that we should (e.g. a balancer schedule). We don't make use of upmap-remapped for normal operations currently, but I think what you're proposing here makes a lot of sense, especially combined with a balancer schedule. One of the issues I noted with this approach on this cluster is the inevitability of degraded PGs due to an unrelated failed drive/host stopping[1] the movement of data onto new disks/hosts/generations. This causes us issues in planning big data moves, although it is something we could easily tweak.

Finally, thanks for the hint about how to identify how many maps are being kept. Being able to track this is really handy, and takes a lot of the guesswork out of understanding the need to take breaks in cluster operations. I think we also need to pay more attention to the 'unclean durations' of individual PGs, which is something we can do.

Cheers,
Tom

[1] https://github.com/ceph/ceph/blob/main/src/pybind/mgr/balancer/module.py#L1040

________________________________________
From: Dan van der Ster <dan.vanderster@xxxxxxxxx>
Sent: Tuesday, January 7, 2025 21:15
To: Byrne, Thomas (STFC,RAL,SC) <tom.byrne@xxxxxxxxxx>
Cc: ceph-users@xxxxxxx <ceph-users@xxxxxxx>
Subject: Re: Slow initial boot of OSDs in large cluster with unclean state

Hi Tom,

On Tue, Jan 7, 2025 at 10:15 AM Thomas Byrne - STFC UKRI <tom.byrne@xxxxxxxxxx> wrote:
> I realise the obvious answer here is don't leave a big cluster in an unclean state for this long. Currently we've got PGs that have been remapped for 5 days, which matches the 30,000 OSDMap epoch range perfectly. This is something we're always looking at from a procedure point of view, e.g. keeping max_backfills as high as possible by default, ensuring balancer max_misplaced is appropriate, and re-evaluating disk and node addition/removal processes. But the reality on this cluster is that sometimes these 'logjams' happen, and it would be good to understand if we can improve the OSD addition experience so we can continue to be flexible with our operation scheduling.
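One quick way to keep an eye on those logjams is to list the PGs that aren't clean together with the time they were last clean, oldest first. A rough sketch -- the jq path assumes a recent release's `ceph pg dump --format json` layout (.pg_map.pg_stats), so adjust it for your version:

# List non-clean PGs with their last_clean timestamp, oldest first.
ceph pg dump --format json 2>/dev/null \
  | jq -r '.pg_map.pg_stats[]
           | select(.state | contains("clean") | not)
           | [.last_clean, .pgid, .state] | @tsv' \
  | sort

The oldest timestamp there is roughly how far back the cluster has to keep osdmaps.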
I find it's always best to aim to have all PGs clean at least once a week -- that way the osdmaps can be trimmed at least weekly, preventing all sorts of nastiness, one of which you mentioned here.

Here's my recommended mgr balancer tuning:

# Balance PGs Sunday to Friday, letting the backfilling finish on Saturdays.
# (Adjust the exact days if needed -- the goal here is that at some point in
# the week there are 0 misplaced and 0 degraded objects.)
ceph config set mgr mgr/balancer/begin_weekday 0
ceph config set mgr mgr/balancer/end_weekday 5

# [Alternatively] Balance PGs during working hours, letting the backfilling
# finish overnight:
ceph config set mgr mgr/balancer/begin_time 0830
ceph config set mgr mgr/balancer/end_time 1800

# Decrease the max misplaced ratio from the default 5% to 0.5%, to minimize
# the impact of backfilling and ensure the tail of backfilling PGs can finish
# over the weekend or overnight -- increase this percentage if your cluster
# can tolerate it. (IMHO 5% is way too many misplaced objects on large
# clusters, but this is very use-case-specific.)
ceph config set mgr target_max_misplaced_ratio 0.005

# Configure the balancer to aim for +/- 1 PG per pool per OSD -- this is the
# best uniformity we can hope for with the mgr balancer.
ceph config set mgr mgr/balancer/upmap_max_deviation 1

Then whenever you add/remove hardware, here's my recommended procedure:

1. Set some flags to prevent data from moving immediately when we add new OSDs:

ceph osd set norebalance
ceph balancer off

2. Add the new OSDs. (Or start draining -- but note that if you are draining OSDs, set the crush weights to 0.1 instead of 0.0; the upmap magic tools don't work with OSDs that have crush weight = 0.)

3. Run ./upmap-remapped.py [1] until the number of misplaced objects is as close as possible to zero.

4. Then unset the flags so data starts rebalancing again, i.e. the mgr balancer will move data in a controlled manner to those new empty OSDs:

ceph osd unset norebalance
ceph balancer on

I have a couple of talks with more on this topic:

- https://www.youtube.com/watch?v=6PQYHlerJ8k
- https://www.youtube.com/watch?v=A4xG975UWts

We also have a plan to get this logic directly into Ceph: https://tracker.ceph.com/issues/67418

As for what you can do right now -- it's actually a great time to test out the above approach. Here's exactly what I'd do:

1. Stop those new OSDs (the ones that are not "in" yet) -- no point having them pull in 30,000 osdmaps. Nothing should be degraded at this point -- if anything is, you either stopped too many OSDs or there was some OSD flap that you need to recover from.

2. Since you have several remapped PGs right now, that's a perfect time to use upmap-remapped.py [1] -- it'll make the remapped PGs clean again. So try running it:

ceph balancer off           # disable the mgr balancer, otherwise it would "undo" what we do next
./upmap-remapped.py         # this just outputs commands to stdout
./upmap-remapped.py | sh -x # this runs those commands
./upmap-remapped.py | sh -x # run it again -- normally running it twice is enough to get to a minimal number of misplaced PGs

3. When you run it, you should see the % misplaced objects decreasing. Ideally it will go to 0, meaning all PGs are active+clean. At that point the osdmaps should trim.

4. Confirm that the osdmaps have trimmed by looking at the `ceph report`:

ceph report | jq '(.osdmap_last_committed - .osdmap_first_committed)'

^^ the number above should be less than 750.
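If it helps, here's a slightly more verbose version of that check -- it prints the epoch bounds as well as the span, and (on releases that expose it) the epoch that trimming is currently stuck behind. Treat the osdmap_clean_epochs path as an assumption for your version:

# Print the committed epoch bounds and their span (the span should stay below ~750).
ceph report 2>/dev/null \
  | jq '{first: .osdmap_first_committed,
         last: .osdmap_last_committed,
         span: (.osdmap_last_committed - .osdmap_first_committed)}'

# If your release has it, min_last_epoch_clean shows which epoch is holding
# trimming back (field path is an assumption, check your version's report).
ceph report 2>/dev/null | jq '.osdmap_clean_epochs.min_last_epoch_clean'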
If it isn't, then the osdmaps have not trimmed and you need to investigate further.

5. Now start those new OSDs -- they should pull in the ~750 osdmaps quickly -- and then follow the upmap-remapped procedure after configuring the balancer as I described.

Hope this all helps, and Happy New Year, Tom.

Cheers, Dan

[1] https://github.com/cernceph/ceph-scripts/blob/master/tools/upmap/upmap-remapped.py

--
Dan van der Ster
CTO @ CLYSO
Try our Ceph Analyzer -- https://analyzer.clyso.com/
https://clyso.com | dan.vanderster@xxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx