Re: Schrödinger's Server

Standard disclaimer: Ceph is WAY overpowered for my current needs, but
when GlusterFS looked like it would become unsupported and all the
other options I could find for triple-redundancy systems looked to be
either proprietary or also unsupported, I switched to Ceph. I was also
influenced by its connections to OpenStack.

On the whole, though, I don't regret the switch. For all the occasional
grief and OSD woes, I've always been able to depend on it keeping the
data intact.

So, having said that, let me describe my farm. Don't laugh. It's not
quite a homelab, but it was set up as an R&D installation to support
clients and to develop enterprise systems on captive equipment before
porting projects to client sites.

Because clients around here are cheapskates, I depend a lot on generic
equipment and open-source solutions. My primary servers are all
"frankenboxes", built by recycling old equipment. My standard for
motherboards up until now has been Asus units dating to 2011. More
recently, I've been snapping up refurbished Dell micro form factor
units, as they make excellent OSD servers for my needs, take up little
space, and don't draw much power. These units came with M.2 slots for
the second drive, so that's where the OSD data is going.

I don't make heavy demands on the data. There's only about 78 GB in
the Ceph filesystem, and the 2 biggest strains are the weekly full
filesystem backups and whenever OSDs need re-balancing.

Since so much of my kit is antiquated, my SSD experience up to now has
been limited to migrating my OS drives to 2.5-inch SATA units, but
going forward, I'll definitely be putting in M.2s for Ceph OSDs. I
wouldn't feel as guilty about running multiple OSDs per box with the
reduced power requirements, and I'm not going to feel bad if they run
faster, too!
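
If it helps anyone picture it, I expect adding one of those M.2 drives
as an OSD to look roughly like this under cephadm (a sketch only; the
host and device names below are made up, and the device has to show up
as available first):

    # see which devices cephadm considers usable for OSDs
    ceph orch device ls

    # create an OSD on a given host's M.2/NVMe device
    ceph orch daemon add osd somehost:/dev/nvme0n1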

Over the years, I've learned to expect 1 or 2 hard drives to fail
annually (hence the need for Ceph!), an occasional power supply failure
(just replaced one, which helped launch my latest sad story), and
downtime from overheating CPUs. Since this isn't a "clean room", after
a year or 2 the cooling fans on the CPUs get full of dust and have to
be cleaned out. Anything else is likely to be thunderstorm damage, and
I've been lucky for the most part on that.

Specific stats on my current Ceph FS: as mentioned, about 78 GB in use
out of a potential 1.6 TB, and the Ceph version is Pacific. There are
560 PGs per OSD, and the system is clean but complains that that's not
a power of 2. I've tried setting the PG count to something more like
the recommendations I've seen, but it looks like the auto-scaler is
taking over and resetting it back to 560 when I do. There are presently
5 OSDs of 300 GB capacity each.
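
For reference, my understanding is that the per-pool knobs involved are
roughly the ones below (the pool name is just an example, and as long
as the autoscaler is on it will override a manual pg_num):

    # see what the autoscaler thinks each pool should have
    ceph osd pool autoscale-status

    # turn the autoscaler off for a pool, then set pg_num by hand
    ceph osd pool set cephfs_data pg_autoscale_mode off
    ceph osd pool set cephfs_data pg_num 64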

Almost everything is at default settings, including only having 1
CRUSH rule for everything, since up until now it was all
triple-redundancy HDD. Obviously I need to review that. Since I
maintain not only Ceph but every other service on the farm (appservers,
LDAP, NFS, DNS, and much more), I haven't had the luxury of digging
into Ceph as deeply as I'd like, so the fact that it works so well
under such shoddy administration is also a point in its favor.
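
When I do get around to that CRUSH review, my understanding is that it
starts along these lines (the rule and pool names below are made up,
and the ssd device class only matters once I actually have SSD OSDs in
place):

    # see what rules exist and what they contain
    ceph osd crush rule ls
    ceph osd crush rule dump replicated_rule

    # example: a replicated rule restricted to the ssd device class
    ceph osd crush rule create-replicated ssd-rule default host ssd
    ceph osd pool set somepool crush_rule ssd-rule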

   Tim

On Thu, 2025-02-27 at 15:47 -0500, Anthony D'Atri wrote:
> 
> 
> > On Feb 27, 2025, at 8:14 AM, Tim Holloway <timh@xxxxxxxxxxxxx>
> > wrote:
> > 
> > System is now stable. The rebalancing was doing what it should,
> > finished after a couple of hours.
> > 
> > I decided to re-visit the primary problem, which was the infamous
> > "too
> > many pgs per OSD" and did some tweaks to the pool settings.
> > 
> > It appears that it was actually the auto-scaler that was creating so
> > many PGs for the biggest pool. I was apparently just misguided in
> > my
> > expectations based on guidelines that appear to be out of date now.
> > 
> > So I've concluded that there's nothing wrong with how things are
> > laid
> > out, but I need to increase the monitor alert level for pgs per
> > OSD.
> > 
> > That SHOULD be straightforward, but every time I go looking for
> > info on
> > how to do it, I either end up with how to set initial PG allocation
> > when creating pools or how to set the alert level the old way via
> > ceph.config file options. Very annoying.
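
(Following up on my own gripe above: as far as I can tell, the
threshold behind that warning is mon_max_pg_per_osd, and it should be
settable at runtime with something like the line below. I haven't
verified that this is the whole story, so treat it as a guess.)

    # raise the PGs-per-OSD warning threshold (400 is just an example)
    ceph config set global mon_max_pg_per_osd 400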
> > 
> > On the plus side, once again Ceph has weathered a major disruption
> > without losing any data.
> 
> Ceph is first and foremost about strong consistency.  So long as you
> don’t do something like R2 pools, then you’re on your own ;)
> 
> > On the minus side, I really wish it wouldn't
> > simply silently stall out with no reason given while re-balancing. 
> 
> Do you have HDD EC pools?  They seem to be the most affected by
> mclock implementation issues; reverting to wpq for now might help
> until staged improvements are released.
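
(Noting this for when I try it: I believe the switch Anthony means is
osd_op_queue, which is only read at startup, so the OSDs need a restart
afterwards. A sketch only, not something I've run here; osd.0 is just
an example daemon name.)

    # revert OSDs from mclock to the older wpq scheduler
    ceph config set osd osd_op_queue wpq

    # restart OSD daemons one at a time so the change takes effect
    ceph orch daemon restart osd.0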
> 
> > When I saw it happening (usually after about 15 minutes), I could
> > kick it
> > back into operation by rebooting a server,
> 
> Yikes!  Way larger hammer than you need.  Rebooting the whole node is
> rarely needed.  A first thing to try is
> 
> 	ceph health detail
> 
> to see the acting sets of affected PGs.  The first in each set is the
> lead OSD for that PG.  Pick one of those, say 1701, and issue
> 
>          ceph osd down 1701
> 
> That will only mark the OSD down, it will not kill or restart
> daemons.  The OSD will mark itself right back up, but sometimes this
> is enough to goose stuck processes.
> 
> You want to wait for the OSD to be marked back up and the cluster to
> plateau before issuing against another OSD.
> 
> > but since there was no
> > single OSD or set of OSDs that it seemed to hang on, I just picked
> > the server with the most OSDs reported and rebooted that one. I
> > suspect, however, that any server would have done.
> 
> 
> 
> > 
> >   Thanks,
> >      Tim
> > 
> > On Thu, 2025-02-27 at 08:28 +0100, Frédéric Nass wrote:
> > > 
> > > 
> > > ----- On 26 Feb 25, at 16:40, Tim Holloway timh@xxxxxxxxxxxxx
> > > wrote:
> > > 
> > > > Thanks. I did resolve that problem, though I haven't had a
> > > > chance
> > > > to
> > > > update until now.
> > > > 
> > > > I had already attempted to use ceph orch to remove the daemons,
> > > > but
> > > > they didn't succeed.
> > > > 
> > > > Fortunately, I was able to bring the host online, which allowed
> > > > the
> > > > scheduled removals to complete. I confirmed everything was
> > > > drained,
> > > > again removed the host from inventory and powered down.
> > > > 
> > > > Still got complaints from cephadm about the decommissioned
> > > > host.
> > > > 
> > > > I took a break - impatience and ceph don't mix - and came back
> > > > to
> > > > address the next problem, which was lots of stuck PGs. Either
> > > > because cephadm timed out or because something kicked in when I
> > > > started randomly rebooting OSDs, the host complaint finally
> > > > disappeared. End of story.
> > > > 
> > > > Now for what sent me down that path.
> > > > 
> > > > I had 2 OSDs on one server and felt that that was probably not
> > > > a
> > > > good
> > > > idea, so I marked one for deletion. 4 days later it was still
> > > > in
> > > > "destroying" state. More concerning, all signs indicated that
> > > > despite
> > > > having been reweighted to 0, the "destroying" OSD was still an
> > > > essential participant, with no indication that its PGs were being
> > > > relocated to active servers. Shutting down the "destroying" OSD
> > > > would
> > > > immediately trigger a re-allocation panic, but that didn't
> > > > clean
> > > > anything. The re-allocation would proceed at a furious pace,
> > > > then
> > > > slowly stall out and hang, and the system was degraded.
> > > > Restarting
> > > > the
> > > > OSD brought the PG inventory back up, but stuff still wasn't
> > > > moving
> > > > off
> > > > the OSD.
> > > > 
> > > > Right about that time I decommissioned the questionable host.
> > > > 
> > > > Finally, I did a "ceph orch rm osd.x" and terminated the
> > > > "destroying" OSD permanently, making it finally disappear from
> > > > the OSD tree list.
> > > > 
> > > > I also deleted a number of OSD pools that are (hopefully) not
> > > > going
> > > > to
> > > > be missed.
> > > > 
> > > > Kicking and randomly repeatedly rebooting the other OSDs
> > > > finally
> > > > cleared all the stuck OSDs, some of which hadn't resolved in
> > > > over 2
> > > > days.
> > > > 
> > > > So at the moment, it's either rebalancing the cleaned-up OSDs
> > > > or in
> > > > a
> > > > loop thinking that it is.
> > > 
> > > Since you deleted some pools, it's probably the upmap balancer
> > > rebalancing PGs across the OSDs.
> > > 
> > > > And the PG-per-OSD count seems way too high,
> > > 
> > > How much is it right now? With what hardware?
> > > 
> > > > but the auto-scaler doesn't seem to want to do anything about
> > > > that.
> > > 
> > > If the PG autoscaler is enabled you could try adjusting per pool
> > > settings [1] and see if the # of PGs decreases.
> > > If disabled you could manually reduce the number of PGs on the
> > > remaining pools to lower the PG/OSD ratio.
> > > 
> > > Regards,
> > > Frédéric.
> > > 
> > > > 
> > > > Of course, the whole shebang has been unavailable to clients
> > > > this
> > > > whole
> > > > week because of that.
> > > > 
> > > > I've been considering upgrading to reef, but recent posts
> > > > regarding
> > > > issues resembling what I've been going through are making me
> > > > pause.
> > > > 
> > > >  Again, thanks!
> > > >    Tim
> > > > 
> > > > On Wed, 2025-02-26 at 13:57 +0100, Frédéric Nass wrote:
> > > > > Hi Tim,
> > > > > 
> > > > > If you can't bring the host back online so that cephadm can
> > > > > remove
> > > > > these services itself, I guess you'll have to clean up the
> > > > > mess
> > > > > by:
> > > > > 
> > > > > - removing these services from the cluster (for example with
> > > > > a
> > > > > 'ceph
> > > > > mon remove {mon-id}' for the monitor)
> > > > > - forcing their removal from the orchestrator with the --
> > > > > force
> > > > > option
> > > > > on the commands 'ceph orch daemon rm <names>' and 'ceph orch
> > > > > host
> > > > > rm
> > > > > <hostname>'. If the --force option doesn't help, then looking
> > > > > into/editing/removing ceph-config keys like
> > > > > 'mgr/cephadm/inventory'
> > > > > and 'mgr/cephadm/host.ceph07.internal.mousetech.com' that
> > > > > 'ceph
> > > > > config-key dump' output shows might help.
> > > > > 
> > > > > Regards,
> > > > > Frédéric.
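
(For anyone who hits the same wall later: my reading of Frédéric's
config-key suggestion is that the commands involved are along these
lines, with the key name specific to my dead host; dump first and
double-check before removing anything.)

    # inspect what cephadm has stored about hosts
    ceph config-key dump | grep mgr/cephadm/host

    # remove the stale entry for the dead host (destructive; be sure)
    ceph config-key rm mgr/cephadm/host.ceph07.internal.mousetech.com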
> > > > > 
> > > > > ----- On 25 Feb 25, at 16:42, Tim Holloway
> > > > > timh@xxxxxxxxxxxxx wrote:
> > > > > 
> > > > > > Ack. Another fine mess.
> > > > > > 
> > > > > > I was trying to clean things up and the process of tossing
> > > > > > around
> > > > > > OSD's
> > > > > > kept getting me reports of slow responses and hanging PG
> > > > > > operations.
> > > > > > 
> > > > > > This is Ceph Pacific, by the way.
> > > > > > 
> > > > > > I found a deprecated server that claimed to have an OSD
> > > > > > even
> > > > > > though
> > > > > > it
> > > > > > didn't show in either "ceph osd tree" or the dashboard OSD
> > > > > > list. I
> > > > > > suspect that a lot of the grief came from it attempting to
> > > > > > use
> > > > > > resources that weren't always seen as resources.
> > > > > > 
> > > > > > I shut down the server's OSD (removed the daemon using ceph
> > > > > > orch),
> > > > > > then
> > > > > > foolishly deleted the server from the inventory without
> > > > > > doing a
> > > > > > drain
> > > > > > first.
> > > > > > 
> > > > > > Now cephadm hates me (key not found), and there are still
> > > > > > an
> > > > > > MDS
> > > > > > and
> > > > > > MON listed as ceph orch ls daemons even after I powered the
> > > > > > host
> > > > > > off.
> > > > > > 
> > > > > > I cannot do a ceph orch daemon delete because there's no
> > > > > > longer
> > > > > > an
> > > > > > IP
> > > > > > address available to the daemon delete, and I cannot clear
> > > > > > the
> > > > > > cephadm queue:
> > > > > > 
> > > > > > [ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed:
> > > > > > 'ceph07.internal.mousetech.com'
> > > > > > 
> > > > > > Any suggestions?
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



