Hi all,

Unfortunately, we experienced some issues with the upgrade to 16.2.8 on one
of our larger clusters. Within a few hours of the upgrade, all 5 of our
managers had become unavailable. We found that they were all deadlocked due
to what appears to be a regression in GIL and mutex handling. See
https://tracker.ceph.com/issues/39264 and
https://github.com/ceph/ceph/pull/38677 for context on previous
manifestations of the issue.

I discovered some mistakes in a recent Pacific backport that seem to be
responsible. Here is the tracker for the regression:
https://tracker.ceph.com/issues/55687. Here is an open PR that should
resolve the problem: https://github.com/ceph/ceph/pull/38677.

Note that this is a sort of race condition, so the issue tends to manifest
more frequently on larger clusters. Enabling certain modules may also make
it more likely to occur. On our cluster, MGRs consistently deadlock within
about an hour.

Hopefully this is useful to others who are considering an upgrade!

Thanks,

Cory Snyder

On Mon, May 16, 2022 at 3:46 PM David Galloway <dgallowa@xxxxxxxxxx> wrote:
>
> We're happy to announce the 8th backport release in the Pacific series.
> We recommend that users update to this release. For detailed release
> notes with links and a changelog, please refer to the official blog entry
> at https://ceph.io/en/news/blog/2022/v16-2-8-pacific-released
>
> Notable Changes
> ---------------
>
> * MON/MGR: Pools can now be created with the `--bulk` flag. Any pools
>   created with `--bulk` will use a profile of the `pg_autoscaler` that
>   provides more performance from the start. Pools created without the
>   `--bulk` flag will retain the old behavior by default. For more
>   details, see:
>   https://docs.ceph.com/en/latest/rados/operations/placement-groups/
>
> * MGR: The pg_autoscaler can now be turned `on` and `off` globally with
>   the `noautoscale` flag. By default this flag is unset and the default
>   pg_autoscale mode remains the same. For more details, see:
>   https://docs.ceph.com/en/latest/rados/operations/placement-groups/
>
> * A health warning will now be reported if the ``require-osd-release``
>   flag is not set to the appropriate release after a cluster upgrade.
>
> * CephFS: Upgrading Ceph Metadata Servers when using multiple active
>   MDSs requires ensuring that no pending stray entries which are
>   directories are present for active ranks except rank 0. See
>   https://docs.ceph.com/en/latest/releases/pacific/#upgrading-from-octopus-or-nautilus.
>
> Getting Ceph
> ------------
> * Git at git://github.com/ceph/ceph.git
> * Tarball at https://download.ceph.com/tarballs/ceph-16.2.8.tar.gz
> * Containers at https://quay.io/repository/ceph/ceph
> * For packages, see https://docs.ceph.com/docs/master/install/get-packages/
> * Release git sha1: 209e51b856505df4f2f16e54c0d7a9e070973185
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx