> Thanks for the advice.
>
> Previously I was all HDDs, but I'm beginning to migrate to M.2 SSDs.
> But so far, only a few.

Manage your CRUSH device classes and rules carefully. Also, are you selecting *enterprise* NVMe M.2 SSDs? Many of the M.2 sticks out there are client-class and/or SATA. Are you mounting them via PCIe adapter cards that accept anywhere from one to several M.2 sticks? Those are not all created equal, so if so, select carefully. Some simply split PCIe lanes across the sticks, so each stick may have lower throughput; others are fancier. Also ensure that such an adapter supports at least the PCIe generation of your motherboard. It would be a crying shame to hobble Gen 4/5 SSDs and motherboard with a Gen 3 adapter.

> I'll have to look regarding WPQ. I'm running whatever came out of the
> box, possibly inherited from the pre-upgrade installation (Octopus,
> RIP!)

A Quincy or later install will in many cases set mclock unilaterally; you can find other discussions on this list about reverting to wpq.
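
For both of the above, something like the following (untested here and from memory; "ssd-only" and the pool name are just example names, so double-check against the docs for your release):

  # Device classes (hdd/ssd) are usually auto-detected; 'ceph osd tree' shows them.
  # Pin a pool to SSD OSDs via a device-class CRUSH rule:
  ceph osd crush rule create-replicated ssd-only default host ssd
  ceph osd pool set <poolname> crush_rule ssd-only

  # Check which op queue the OSDs are using, and revert to wpq:
  ceph config get osd osd_op_queue
  ceph config set osd osd_op_queue wpq
  # osd_op_queue is only read at startup, so restart the OSDs afterward.
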
> As far as kick-starting stalled recovery, I had been doing what you
> said, although the hanging PGs were all over the map, hence the more or
> less random restarts. Sometimes I would restart just the OSD, but the
> machines reboot fast and there are no other apps on the Ceph nodes, so
> it's generally simpler to reboot and in the process flush out any other
> issues that might have arisen.

If you do that too aggressively, in the wrong order, and in the wrong failure domains, you can lose ops in flight and have data become unavailable or even lost, so I like to try a lighter hammer first.

> On Thu, 2025-02-27 at 15:47 -0500, Anthony D'Atri wrote:
>>
>>> On Feb 27, 2025, at 8:14 AM, Tim Holloway <timh@xxxxxxxxxxxxx> wrote:
>>>
>>> System is now stable. The rebalancing was doing what it should and finished after a couple of hours.
>>>
>>> I decided to re-visit the primary problem, which was the infamous "too many PGs per OSD" warning, and did some tweaks to the pool settings.
>>>
>>> It appears that it was actually the auto-sizer that was creating so many PGs for the biggest pool. I was apparently just misguided in my expectations, based on guidelines that appear to be out of date now.
>>>
>>> So I've concluded that there's nothing wrong with how things are laid out, but I need to increase the monitor alert level for PGs per OSD.
>>>
>>> That SHOULD be straightforward, but every time I go looking for info on how to do it, I either end up with how to set the initial PG allocation when creating pools, or how to set the alert level the old way via ceph.conf options. Very annoying.
>>>
>>> On the plus side, once again Ceph has weathered a major disruption without losing any data.
>>
>> Ceph is first and foremost about strong consistency. So long as you don't do something like R2 pools. If you do, you're on your own ;)
>>
>>> On the minus side, I really wish it wouldn't simply silently stall out with no reason given while re-balancing.
>>
>> Do you have HDD EC pools? They seem to be the most affected by mclock implementation issues; reverting to wpq for now might help until staged improvements are released.
>>
>>> When I saw it happening (usually after about 15 minutes), I could kick it back into operation by rebooting a server,
>>
>> Yikes! Way larger hammer than you need. Rebooting the whole node is rarely needed. A first thing to try is
>>
>> ceph health detail
>>
>> to see the acting sets of affected PGs. The first OSD in each set is the lead OSD for that PG. Pick one of those, say 1701, and issue
>>
>> ceph osd down 1701
>>
>> That will only mark the OSD down; it will not kill or restart any daemons. The OSD will mark itself right back up, but sometimes this is enough to goose stuck processes.
>>
>> You want to wait for the OSD to be marked back up and the cluster to plateau before issuing it against another OSD.
>>
>>> but since there was no single OSD or set of OSDs that it seemed to hang on, I just picked the server with the most OSDs reported and rebooted that one. I suspect, however, that any server would have done.
>>>
>>> Thanks,
>>> Tim
>>>
>>> On Thu, 2025-02-27 at 08:28 +0100, Frédéric Nass wrote:
>>>>
>>>> ----- On Feb 26, 2025, at 16:40, Tim Holloway timh@xxxxxxxxxxxxx wrote:
>>>>
>>>>> Thanks. I did resolve that problem, though I haven't had a chance to update until now.
>>>>>
>>>>> I had already attempted to use ceph orch to remove the daemons, but the removals didn't succeed.
>>>>>
>>>>> Fortunately, I was able to bring the host online, which allowed the scheduled removals to complete. I confirmed everything was drained, again removed the host from inventory, and powered it down.
>>>>>
>>>>> Still got complaints from cephadm about the decommissioned host.
>>>>>
>>>>> I took a break - impatience and Ceph don't mix - and came back to address the next problem, which was lots of stuck PGs. Either because cephadm timed out or because something kicked in when I started randomly rebooting OSDs, the host complaint finally disappeared. End of story.
>>>>>
>>>>> Now for what sent me down that path.
>>>>>
>>>>> I had 2 OSDs on one server and felt that was probably not a good idea, so I marked one for deletion. Four days later it was still in "destroying" state. More concerning, all signs indicated that despite having been reweighted to 0, the "destroying" OSD was still an essential participant, with no indication that its PGs were being relocated to active servers. Shutting down the "destroying" OSD would immediately trigger a re-allocation panic, but that didn't clean anything up. The re-allocation would proceed at a furious pace, then slowly stall out and hang, and the system was degraded. Restarting the OSD brought the PG inventory back up, but stuff still wasn't moving off the OSD.
>>>>>
>>>>> Right about that time I decommissioned the questionable host.
>>>>>
>>>>> Finally, I did a "ceph orch rm osd.x", which terminated the "destroying" OSD permanently, making it finally disappear from the OSD tree list.
>>>>>
>>>>> I also deleted a number of OSD pools that are (hopefully) not going to be missed.
>>>>>
>>>>> Kicking and randomly, repeatedly rebooting the other OSDs finally cleared all the stuck OSDs, some of which hadn't resolved in over 2 days.
>>>>>
>>>>> So at the moment, it's either rebalancing the cleaned-up OSDs or in a loop thinking that it is.
>>>>
>>>> Since you deleted some pools, it's probably the upmap balancer rebalancing PGs across the OSDs.
>>>>
>>>>> And the PG-per-OSD count seems way too high,
>>>>
>>>> How much is it right now? With what hardware?
>>>>
>>>>> but the auto-sizer doesn't seem to want to do anything about that.
>>>>
>>>> If the PG autoscaler is enabled, you could try adjusting the per-pool settings [1] and see if the number of PGs decreases.
>>>> If it is disabled, you could manually reduce the number of PGs on the remaining pools to lower the PG/OSD ratio.
>>>>
>>>> Regards,
>>>> Frédéric.
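
On the PG-count piece, something along these lines (the pool name is a placeholder, the ratio is just an example, and option names and defaults vary by release, so verify against your docs):

  ceph osd pool autoscale-status
  # If the autoscaler manages the pool, steer it rather than fight it:
  ceph osd pool set <poolname> pg_autoscale_mode on
  ceph osd pool set <poolname> target_size_ratio 0.2
  # If it is off, shrink pg_num manually (Nautilus and later can decrease pg_num):
  ceph osd pool set <poolname> pg_num 64
  # The "too many PGs per OSD" warning threshold is mon_max_pg_per_osd:
  ceph config get mon mon_max_pg_per_osd
  ceph config set global mon_max_pg_per_osd 300

Raising mon_max_pg_per_osd only quiets the warning; the PGs are still there, so only do that if the OSDs really have the RAM and CPU to carry them.
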
>>>>>
>>>>> Of course, the whole shebang has been unavailable to clients this whole week because of that.
>>>>>
>>>>> I've been considering upgrading to Reef, but recent posts regarding issues resembling what I've been going through are making me pause.
>>>>>
>>>>> Again, thanks!
>>>>> Tim
>>>>>
>>>>> On Wed, 2025-02-26 at 13:57 +0100, Frédéric Nass wrote:
>>>>>> Hi Tim,
>>>>>>
>>>>>> If you can't bring the host back online so that cephadm can remove these services itself, I guess you'll have to clean up the mess by:
>>>>>>
>>>>>> - removing these services from the cluster (for example with a 'ceph mon remove {mon-id}' for the monitor)
>>>>>> - forcing their removal from the orchestrator with the --force option on the commands 'ceph orch daemon rm <names>' and 'ceph orch host rm <hostname>'. If the --force option doesn't help, then looking into/editing/removing ceph config-keys like 'mgr/cephadm/inventory' and 'mgr/cephadm/host.ceph07.internal.mousetech.com', which 'ceph config-key dump' shows, might help.
>>>>>>
>>>>>> Regards,
>>>>>> Frédéric.
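
Concretely, the forced cleanup Frédéric describes above looks roughly like this (the daemon and host names are just the examples from this thread, and --offline may not exist on older releases, so check 'ceph orch host rm -h' first):

  ceph orch ps ceph07.internal.mousetech.com     # what the orchestrator still thinks runs there
  ceph orch daemon rm <daemon-name> --force      # e.g. the stale mds/mon entries listed above
  ceph mon remove ceph07                         # if the dead host held a monitor
  ceph orch host rm ceph07.internal.mousetech.com --offline --force
  # Last resort: inspect cephadm's stored state and remove the stale host key.
  # Take a copy of 'ceph config-key dump' before touching anything.
  ceph config-key dump | grep ceph07
  ceph config-key rm mgr/cephadm/host.ceph07.internal.mousetech.com
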
>>>>>>
>>>>>> ----- On Feb 25, 2025, at 16:42, Tim Holloway timh@xxxxxxxxxxxxx wrote:
>>>>>>
>>>>>>> Ack. Another fine mess.
>>>>>>>
>>>>>>> I was trying to clean things up, and the process of tossing around OSDs kept getting me reports of slow responses and hanging PG operations.
>>>>>>>
>>>>>>> This is Ceph Pacific, by the way.
>>>>>>>
>>>>>>> I found a deprecated server that claimed to have an OSD even though it didn't show in either "ceph osd tree" or the dashboard OSD list. I suspect that a lot of the grief came from it attempting to use resources that weren't always seen as resources.
>>>>>>>
>>>>>>> I shut down the server's OSD (removed the daemon using ceph orch), then foolishly deleted the server from the inventory without doing a drain first.
>>>>>>>
>>>>>>> Now cephadm hates me (key not found), and there are still an MDS and a MON listed as "ceph orch ls" daemons even after I powered the host off.
>>>>>>>
>>>>>>> I cannot do a ceph orch daemon delete because there's no longer an IP address available for it, and I cannot clear the cephadm queue:
>>>>>>>
>>>>>>> [ERR] MGR_MODULE_ERROR: Module 'cephadm' has failed: 'ceph07.internal.mousetech.com'
>>>>>>>
>>>>>>> Any suggestions?
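
One postscript on the MGR_MODULE_ERROR quoted above: once the stale host entry is actually gone from cephadm's inventory, failing over the active mgr is often enough to clear the failed-module state. Roughly:

  ceph mgr fail        # or 'ceph mgr fail <active-mgr-name>' on releases that require a name
  # after a standby mgr takes over, confirm:
  ceph -s
  ceph orch ps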