Issue with very long connection times for newly upgraded OSDs.

Trey Palmer <nerdmagicatl@xxxxxxxxx> · Mon, 14 Feb 2022 16:31:23 -0500

Hi all,

I'm trying to upgrade some clusters from Luminous to Nautilus 14.2.22 (I
know, I know!).

It's taking about 16-18 minutes for each HDD OSD to connect to the
cluster after the upgrade, but it only takes a minute or two for the SSD
OSDs to connect.

The cluster is dockerized using the standard ceph/daemon stable containers,
and I'm using a simple ansible playbook to start the OSD containers.

The cluster has 42 OSD nodes, and each node has 12 x 14TB disks and 2 x
3.8TB SSDs. Each SSD is partitioned into 6 block.db devices and one OSD,
and the SSD pool is used for RGW metadata and indexes.
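
For a sense of scale, here is a rough sketch of the counts implied by that
layout. It is plain arithmetic on the numbers above, and it assumes one OSD
per HDD with its block.db on one of the SSD partitions, plus one OSD per SSD;
the variable names are just for illustration.

```python
# Back-of-the-envelope counts from the node layout described above.
# Assumption: one OSD per HDD, one OSD per SSD, one block.db partition per HDD OSD.
NODES = 42
HDDS_PER_NODE = 12
HDD_SIZE_TB = 14
SSDS_PER_NODE = 2
DB_PARTS_PER_SSD = 6

hdd_osds = NODES * HDDS_PER_NODE                       # HDD-backed OSDs in the cluster
ssd_osds = NODES * SSDS_PER_NODE                       # SSD-backed OSDs in the cluster
db_parts_per_node = SSDS_PER_NODE * DB_PARTS_PER_SSD   # block.db partitions per node
raw_hdd_tb = hdd_osds * HDD_SIZE_TB                    # raw HDD capacity in TB

print(f"HDD OSDs: {hdd_osds}, SSD OSDs: {ssd_osds}")
print(f"block.db partitions per node: {db_parts_per_node}")
print(f"raw HDD capacity: {raw_hdd_tb} TB")
```

So each node restart takes 12 HDD OSDs and 2 SSD OSDs down at once, and the
12 block.db partitions line up with the 12 HDDs per node.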

I have of course upgraded the 5 mon/mgr nodes beforehand.

The nodes are Debian Stretch, which might be suboptimal but that's what my
shop uses.

The cluster is still receiving writes, and with these disks down for 18
minutes, we end up with so many degraded objects that I have to wait an
hour or two to do the next node.  The primary RGW data pool is 3+2 EC so I
expect that recovery is a little slower than it would be in a replicated
pool.
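
To put numbers on that, a rough estimate of what this pace means for the
whole upgrade. This is only a sketch: it assumes nodes are done strictly one
at a time, that every node behaves like the ones so far (18 minutes down,
then a 1-2 hour wait), and it counts all 42 nodes, including any already
done.

```python
# Rough estimate of total rolling-upgrade time at the observed pace.
# Assumptions: one node at a time, 18 min of OSD downtime per node,
# then 60-120 min waiting for degraded objects to recover.
K, M = 3, 2                 # EC profile of the primary RGW data pool (3+2)
NODES = 42
DOWN_MIN = 18
WAIT_MIN_LOW, WAIT_MIN_HIGH = 60, 120

raw_per_usable = (K + M) / K                            # raw bytes stored per usable byte
low_hours = NODES * (DOWN_MIN + WAIT_MIN_LOW) / 60
high_hours = NODES * (DOWN_MIN + WAIT_MIN_HIGH) / 60

print(f"EC {K}+{M}: {K + M} chunks per object, {raw_per_usable:.2f}x raw overhead")
print(f"estimated total: {low_hours:.1f} to {high_hours:.1f} hours")
```

That works out to somewhere between two and four days of continuous work for
one cluster, which is why the long connection time matters so much here.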

Under Luminous, the HDD OSDs took only a few minutes to connect.

Any ideas what could be happening here?

Thanks,

Trey Palmer