ceph mgr fail after upgrade to pacific

Eugen Block <eblock@xxxxxx> · Mon, 12 Dec 2022 10:52:16 +0000

Hi,

last week we successfully upgraded from Nautilus to Pacific, and since
today I'm experiencing failing MGR daemons. The pods are still running
but have stopped logging. The standby MGRs take over until all MGRs
become unresponsive (we currently have three MGRs). I'm not sure if [1]
is exactly what I'm facing here, but it looks like a deadlock to me. I
commented on the tracker issue, but since it has been marked as
resolved I'm not sure anybody will read my comment. I noticed the same
behavior (also today) in a customer cluster that was upgraded from
Octopus to Pacific (16.2.9) about two months ago. The only thing I did
in those clusters today was to browse the dashboard to compare log
settings.
I read somewhere that the prometheus module could play a role in this,
but it's not enabled in our cluster (while it is running in the
customer cluster).
Please let me know if you need more information on this.

Thanks,
Eugen

Our current versions are:

ceph01:~ # ceph versions
{
    "mon": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 35
    },
    "mds": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 3
    },
    "rgw": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 44
    }
}
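
For anyone who wants to compare this output across clusters, the
following is a minimal sketch (not part of any Ceph tooling) that
parses the JSON printed by "ceph versions" and reports which versions
are running and how many daemons of a given type were counted. The
function names are made up for illustration, and the sample string is
an abbreviated copy of the output above.

```python
import json


def versions_by_daemon(raw: str) -> dict:
    """Map each daemon type to {version string: daemon count}.

    The "overall" entry is skipped because it only aggregates the others.
    """
    data = json.loads(raw)
    return {kind: dict(counts) for kind, counts in data.items() if kind != "overall"}


def distinct_versions(raw: str) -> set:
    """Return the set of version strings reported by any daemon type."""
    return {
        version
        for counts in versions_by_daemon(raw).values()
        for version in counts
    }


def daemon_count(raw: str, kind: str) -> int:
    """Return how many daemons of the given type were counted (0 if absent)."""
    return sum(versions_by_daemon(raw).get(kind, {}).values())


# Abbreviated copy of the output posted above (hypothetical variable name).
SAMPLE = """
{
    "mon": {"ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 3},
    "mgr": {"ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 2},
    "overall": {"ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable)": 5}
}
"""

print("distinct versions:", len(distinct_versions(SAMPLE)))
print("mgr daemons counted:", daemon_count(SAMPLE, "mgr"))
```

On a live cluster the raw string could be taken from the output of
"ceph versions" instead of the sample. A set with more than one entry
would point to a partially completed upgrade, and the mgr count can be
compared against the number of MGRs that are expected to be running.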

[1] https://tracker.ceph.com/issues/55687

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx