This should only happen while upgrading. I can't remember the reason why, but there's an fsck (for stats repair, maybe?) that runs on the first boot after the upgrade. There should be a message in the OSD log about it.

Alex

On Mon, Feb 14, 2022 at 1:31 PM Trey Palmer <nerdmagicatl@xxxxxxxxx> wrote:
>
> Hi all,
>
> I'm trying to upgrade some clusters from Luminous to Nautilus 14.2.22 (I
> know, I know!).
>
> It's taking about 16-18 minutes for each HDD OSD to connect to the
> cluster after the upgrade, but only a minute or two for the SSD OSDs.
>
> The cluster is dockerized using the standard ceph/daemon stable containers,
> and I'm using a simple Ansible playbook to start the OSD dockers.
>
> The cluster has 42 OSD nodes, and each node has 12 x 14TB disks and 2 x
> 3.8TB SSDs. Each SSD is partitioned into 6 block.db devices and one OSD,
> and the SSD pool is used for RGW metadata and indexes.
>
> I have of course upgraded the 5 mon/mgr nodes beforehand.
>
> The nodes are Debian Stretch, which might be suboptimal, but that's what my
> shop uses.
>
> The cluster is still receiving writes, and with these disks down for 18
> minutes we end up with so many degraded objects that I have to wait an
> hour or two before doing the next node. The primary RGW data pool is 3+2 EC,
> so I expect recovery to be a little slower than it would be in a replicated
> pool.
>
> Under Luminous the OSDs took only a few minutes to connect.
>
> Any ideas what could be happening here?
>
> Thanks,
>
> Trey Palmer
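
For what it's worth, a rough way to check whether that first-boot fsck is what is eating the HDD OSDs' startup time, sketched against a dockerized setup like the one described above (the container name ceph-osd-0 and the osd.0 ID are illustrative placeholders, not details from the thread):

  # Look for fsck/repair chatter in the OSD's first-boot output
  # (dockerized OSDs usually log to stdout/stderr).
  docker logs ceph-osd-0 2>&1 | grep -iE 'fsck|repair'

  # From inside the container, where the admin socket lives, show which
  # fsck-on-mount settings the OSD is running with.
  docker exec ceph-osd-0 ceph daemon osd.0 config show | grep bluestore_fsck

  # Watch overall recovery of the degraded objects between nodes.
  ceph -s

If the fsck is the culprit, the startup time should roughly track how much data and metadata each OSD holds, which would fit the HDD-versus-SSD gap Trey is seeing.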