I think I figured it out. The problem was that my ceph.conf file only
listed the first machine in mon_initial_members and in mon_host. I'm not
sure how it ended up that way. I added the other monitors, restarted the
monitors and the managers, and everything is now working as expected. I
have now upgraded
all the monitors and all the managers to Pacific and Rocky 9. Now on to the
OSDs. Well, maybe next week...
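For anyone who hits the same symptom: `mon_host` is what clients such as `ceph -s` use to find a monitor. If it names only one host, the CLI can appear to hang while that host is down or being upgraded. One way to catch this before restarting anything is to compare the monitor lists in ceph.conf with the number of monitors the cluster reports (for example, from `ceph mon dump`). Below is a minimal Python sketch of that check. It is a hypothetical helper, not a Ceph tool. The option names are real, but the configs, hostnames, and addresses are made up, and the expected monitor count is passed in by hand.

```python
# Sketch (not an official Ceph tool): warn when ceph.conf lists fewer
# monitors in mon_host / mon_initial_members than the cluster actually has.
# The configs, hostnames and addresses below are hypothetical examples.
import configparser
import re


def normalize_key(key: str) -> str:
    # Ceph treats spaces, dashes and underscores in option names the same,
    # so "mon host" and "mon_host" are normalized to one spelling.
    return re.sub(r"[ \-]+", "_", key.strip().lower())


def split_mon_list(value: str) -> list[str]:
    # Entries are separated by commas or whitespace, except for commas inside
    # [...], which group one monitor's msgr2 address vector.
    entries, current, depth = [], "", 0
    for ch in value:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if depth == 0 and (ch == "," or ch.isspace()):
            if current:
                entries.append(current)
            current = ""
        else:
            current += ch
    if current:
        entries.append(current)
    return entries


def check_mon_config(conf_text: str, expected: int) -> list[str]:
    # Only "=" is a delimiter: addresses such as v2:IP:3300 contain colons.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.read_string(conf_text)
    options = {normalize_key(k): v for k, v in parser.items("global")}
    warnings = []
    for name in ("mon_host", "mon_initial_members"):
        entries = split_mon_list(options.get(name, ""))
        if len(entries) < expected:
            warnings.append(f"{name} lists {len(entries)} of {expected} monitors: {entries}")
    return warnings


# Roughly the situation described above: only the first monitor is listed.
ONE_MON_CONF = """
[global]
mon_initial_members = mon1
mon_host = [v2:10.0.0.1:3300,v1:10.0.0.1:6789]
"""

# After the fix: all three (hypothetical) monitors are listed.
ALL_MONS_CONF = """
[global]
mon initial members = mon1, mon2, mon3
mon host = 10.0.0.1, 10.0.0.2, 10.0.0.3
"""

if __name__ == "__main__":
    for conf in (ONE_MON_CONF, ALL_MONS_CONF):
        print(check_mon_config(conf, expected=3) or "ok")
```

On Nautilus and later, `ceph config generate-minimal-conf` prints a minimal ceph.conf whose `mon_host` is built from the current monmap. That may be an easier way to regenerate the file than editing it by hand.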

On Thu, Oct 26, 2023 at 5:37 PM Tyler Stachecki <stachecki.tyler@xxxxxxxxx> wrote:

> On Thu, Oct 26, 2023, 8:11 PM Jorge Garcia <jgarcia@xxxxxxxxxxxx> wrote:
>> Oh, I meant that "ceph -s" just hangs. I didn't even try to look at the
>> I/O. Maybe I can do that, but the "ceph -s" hang just freaked me out.
>> Also, I know that the recommended order is mon->mgr->osd->mds->rgw, but
>> when you run mgr on the same hardware as the monitors, it's hard to not
>> upgrade both at the same time. Particularly if you're upgrading the whole
>> machine at once. Here's where upgrading to the new container method will
>> help a lot! FWIW, the managers seem to be running fine.
> I recently did something like this, so I understand that it's difficult.
> Most of my testing and prep-work was centered around exactly this problem,
> which was avoided by first upgrading mons/mgrs to an interim OS while
> remaining on Octopus -- solely for the purposes of opening an avenue from
> Octopus to Quincy separate from the OS upgrade.
> In my pre-prod testing, trying to upgrade the mons/mgrs without that
> middle step, which allowed the mgrs to be upgraded separately, did result in
> `ceph -s` locking up. Client I/O was not impacted in this state, though.
> Maybe look at which mgr is active and/or try stopping all but the Octopus
> mgr when stopping the mon as well?
> Cheers,
> Tyler
>> On Thu, Oct 26, 2023 at 4:57 PM Tyler Stachecki <
>> stachecki.tyler@xxxxxxxxx> wrote:
>>> On Thu, Oct 26, 2023 at 6:52 PM Jorge Garcia <jgarcia@xxxxxxxxxxxx>
>>> wrote:
>>> >
>>> > Hi Tyler,
>>> >
>>> > Maybe you didn't read the full message, but in the message you will
>>> > notice that I'm doing exactly that, and the problem just occurred when I
>>> > was doing the upgrade from Octopus to Pacific. I'm nowhere near Quincy yet.
>>> > The original goal was to move from Nautilus to Quincy, but I have gone to
>>> > Octopus (no problems) and now to Pacific (problems).
>>> I did not, apologies -- though do see my second message about
>>> mon/mgr ordering...
>>> When you say "the cluster becomes unresponsive" -- does the client I/O
>>> lock up, or do you mean that `ceph -s` and such hangs?
>>> It may help to query the Pacific mons via their admin socket (asok) and
>>> see whether they respond in that state (and what their status is), if
>>> client I/O is not locked up and you can afford to leave it in that state
>>> for a couple of minutes:
>>> $ ceph daemon mon.<id> mon_status
>>> Cheers,
>>> Tyler
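If the admin socket does answer, the `mon_status` JSON is long and hard to read at a glance. Below is a minimal Python sketch that pulls out the few fields relevant here. It assumes the `name`, `rank`, `state`, `quorum` and `monmap.mons` fields as printed by recent releases, so check them against your own output. The sample JSON is trimmed and made up. A monitor in quorum reports `leader` or `peon`. States such as `probing`, `electing` or `synchronizing` mean it is not (yet) part of a quorum.

```python
# Sketch: summarize the JSON printed by `ceph daemon mon.<id> mon_status`
# (run on the monitor's host, via its admin socket). Only a few fields are
# read; the sample below is trimmed and hypothetical.
import json
import sys
from pathlib import Path

# Monitors that are part of a working quorum report one of these states.
QUORUM_STATES = {"leader", "peon"}


def summarize_mon_status(raw: str) -> dict:
    status = json.loads(raw)
    quorum_ranks = set(status.get("quorum", []))
    mons = status.get("monmap", {}).get("mons", [])
    return {
        "name": status.get("name"),
        "state": status.get("state"),
        "in_quorum": status.get("state") in QUORUM_STATES,
        # Monitors in this mon's monmap that are not in the quorum it sees.
        "not_in_quorum": sorted(m["name"] for m in mons if m["rank"] not in quorum_ranks),
    }


SAMPLE = """{
  "name": "mon2", "rank": 1, "state": "probing", "quorum": [],
  "monmap": {"mons": [{"rank": 0, "name": "mon1"},
                      {"rank": 1, "name": "mon2"},
                      {"rank": 2, "name": "mon3"}]}
}"""

if __name__ == "__main__":
    # Optionally pass a file containing saved mon_status output.
    raw = Path(sys.argv[1]).read_text() if len(sys.argv) > 1 else SAMPLE
    print(summarize_mon_status(raw))
```

Saving the output on each monitor host (for example, `ceph daemon mon.<id> mon_status > status.json`) and running the sketch on each file shows quickly whether the monitors can see each other at all.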
