Dave,

Worth just looking at utilisation across your OSDs. I've had PGs get stuck
in backfill_wait / backfill_toofull when I've added new OSDs: Ceph was
unable to move PGs onto a smaller-capacity OSD that was already quite full.
I had to increase the number of PGs (pg_num) on the pool, and do some
reweighting, for it to get sorted.

Reed's plan is a good one. Because my setup has been in quite a state of
flux recently, I've kept the autoscaler set to warn and the number of PGs
higher than normal for the short term.
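Roughly what that looks like on the CLI, from memory; the pool name and OSD
id below are placeholders, so substitute your own and check the options
against your release:

  # check per-OSD fill levels before and during the backfill
  ceph osd df tree

  # keep the autoscaler from piling extra backfills on while things are in flux
  ceph osd pool set <pool> pg_autoscale_mode warn

  # raise pg_num if backfill gets stuck toofull against a small, nearly full OSD
  ceph osd pool set <pool> pg_num <higher-value>

  # and/or nudge data off the full OSD
  ceph osd reweight <osd-id> 0.9

The flag shuffle and throttling Reed describes below are the usual ones:
ceph osd set/unset with noin, noout, nobackfill, norecover and norebalance,
ceph balancer off, and
ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1'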
Cheers,
A

Sent from my iPhone

On 12 Mar 2021, at 04:38, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:

Reed,

Thank you.  This seems like a very well-thought-out approach.  Your note
about the balancer and the auto_scaler seems quite relevant as well.  I'll
give it a try when I add my next two nodes.

-Dave

--
Dave Hall
Binghamton University

On Thu, Mar 11, 2021 at 5:53 PM Reed Dier <reed.dier@xxxxxxxxxxx> wrote:

> I'm sure there is a "correct" way, but I think it mostly relates to how
> busy your cluster is, and how tolerant it is of the added load from the
> backfills.
>
> My current modus operandi is to set the noin, noout, nobackfill,
> norecover, and norebalance flags first.
> This makes sure that new OSDs don't come in, current OSDs don't go out,
> and it doesn't start backfilling or try to rebalance (yet).
>
> Add all of my OSDs.
>
> Then unset noin and norebalance.
> "In" all of the new OSDs.
> Let it work out the new crush map so that data isn't constantly in motion,
> moving back and forth as new OSD hosts are added.
> Inject osd_max_backfills and osd_recovery_max_active to 1.
> Then unset norecover and nobackfill and noout.
>
> Then it should slowly but surely chip away at recovery.
> During times of lighter load I can ratchet up the max backfills and
> recovery max actives to a higher level to chug through more of it while
> iops aren't being burned.
>
> I'm sure everyone has their own way, but I've been very comfortable with
> this approach over the last few years.
>
> NOTE: you probably want to make sure that the balancer and the
> pg_autoscaler are set to off during this, otherwise they might throw
> backfills on the pile and you will feel like you'll never reach the bottom.
>
> Reed
>
>> On Mar 10, 2021, at 9:55 AM, Dave Hall <kdhall@xxxxxxxxxxxxxx> wrote:
>>
>> Hello,
>>
>> I am currently in the process of expanding my Nautilus cluster from 3
>> nodes (combined OSD/MGR/MON/MDS) to 6 OSD nodes and 3 management nodes.
>> The old and new OSD nodes all have 8 x 12TB HDDs plus NVMe.  The front
>> and back networks are 10Gb.
>>
>> Last Friday evening I injected a whole new OSD node, increasing the OSD
>> HDDs from 24 to 32.  As of this morning the cluster is still rebalancing,
>> with periodic warnings about degraded PGs and missed deep-scrub deadlines.
>> So after 4.5 days my misplaced PGs are down from 33% to 2%.
>>
>> My question:  For a cluster of this size, what is the best-practice
>> procedure for adding OSDs?  Should I use 'ceph-volume prepare' to lay out
>> the new OSDs but only add them a couple at a time, or should I continue
>> adding whole nodes?
>>
>> Maybe this has to do with a maximum percentage of misplaced PGs.  The
>> first new node increased the OSD capacity by 33% and resulted in 33% PG
>> misplacement.  The next node will only result in 25% misplacement.  If
>> too high a percentage of misplaced PGs negatively impacts rebalancing or
>> data availability, what is a reasonable ceiling for this percentage?
>>
>> Thanks.
>>
>> -Dave
>>
>> --
>> Dave Hall
>> Binghamton University
>> kdhall@xxxxxxxxxxxxxx
>> 607-760-2328 (Cell)
>> 607-777-4641 (Office)

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx