Yeah, I think that's along the lines of what I've faced here. Hopefully I've managed to disable the auto updates.

From: Clyso GmbH - Ceph Foundation Member <joachim.kraftmayer@xxxxxxxxx>
Sent: 07 August 2021 10:46
To: Andrew Walker-Brown <andrew_jbrown@xxxxxxxxxxx>; David Caro <dcaro@xxxxxxxxxxxxx>
Cc: Marc <Marc@xxxxxxxxxxxxxxxxx>; ceph-users@xxxxxxx
Subject: Re: Re: All OSDs on one host down

Hi Andrew,

We have had bad experiences with Ubuntu's auto update, especially when it updates packages from systemd, dbus and docker. For example: one effect was internal communication errors, and only a restart of the node helped.

Cheers,
Joachim

___________________________________
Clyso GmbH - Ceph Foundation Member
support@xxxxxxxxx
https://www.clyso.com/

On 07.08.2021 11:04, Andrew Walker-Brown wrote:
> Thanks David,
>
> Spent some more time digging through the logs/Google. Also had a further 2 nodes fail this morning (different nodes).
>
> Looks like it's related to apt auto-updates on Ubuntu 20.04, although we don't run unattended upgrades. Docker appears to get a terminate signal which shuts down/restarts all the containers, but some don't come back cleanly. There were also some legacy unused interfaces/bonds in the netplan config.
>
> Anyway, cleaned all that up...so hopefully it's resolved.
>
> Cheers,
>
> A.
>
> From: David Caro <dcaro@xxxxxxxxxxxxx>
> Sent: 06 August 2021 09:20
> To: Andrew Walker-Brown <andrew_jbrown@xxxxxxxxxxx>
> Cc: Marc <Marc@xxxxxxxxxxxxxxxxx>; ceph-users@xxxxxxx
> Subject: Re: Re: All OSDs on one host down
>
> On 08/06 07:59, Andrew Walker-Brown wrote:
>> Hi Marc,
>>
>> Yes, I'm probably doing just that.
>>
>> The Ceph admin guides aren't exactly helpful on this. The cluster was deployed using cephadm and it's been running perfectly until now.
>>
>> Wouldn't running "journalctl -u ceph-osd@5" on host ceph-004 show me the logs for osd.5 on that host?
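(For a cephadm/containerized deployment there is no plain ceph-osd@5 unit on the host; the units are templated on the cluster fsid, as the listing below shows. A minimal sketch of the lookup, with <fsid> as a placeholder for the cluster's fsid and osd.5 as the example daemon:

    # find the cluster fsid and the fsid-templated OSD units on this host
    sudo ceph fsid
    sudo systemctl list-units 'ceph-*osd*'

    # logs for osd.5 via the templated unit name...
    sudo journalctl -u ceph-<fsid>@osd.5.service

    # ...or, where available, let cephadm resolve the unit name itself
    sudo cephadm logs --name osd.5
)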
> On my containerized setup, the services that cephadm created are:
>
> dcaro@node1:~ $ sudo systemctl list-units | grep ceph
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@crash.node1.service         loaded active running  Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service    loaded active running  Ceph mgr.node1.mhqltg for d49b287a-b680-11eb-95d4-e45f010c03a8
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mon.node1.service           loaded active running  Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service               loaded active running  Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.7.service               loaded active running  Ceph osd.7 for d49b287a-b680-11eb-95d4-e45f010c03a8
> system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice loaded active active   system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8.target                      loaded active active   Ceph cluster d49b287a-b680-11eb-95d4-e45f010c03a8
> ceph.target                                                           loaded active active   All Ceph clusters and services
>
> where the string after 'ceph-' is the fsid of the cluster.
> Hope that helps (you can also use systemctl list-units to search for the specific ones on your setup).
>
>> Cheers,
>> A
>>
>> From: Marc <Marc@xxxxxxxxxxxxxxxxx>
>> Sent: 06 August 2021 08:54
>> To: Andrew Walker-Brown <andrew_jbrown@xxxxxxxxxxx>; ceph-users@xxxxxxx
>> Subject: RE: All OSDs on one host down
>>
>>> I've tried restarting one of the OSDs but that fails; journalctl shows "osd not found"... not convinced I've got the systemctl command right.
>>>
>> You are not mixing 'non-container' commands with 'container' commands? As in, if you execute this journalctl outside of the container it will of course not find anything.
>>
> --
> David Caro
> SRE - Cloud Services
> Wikimedia Foundation <https://wikimediafoundation.org/>
> PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
>
> "Imagine a world in which every single human being can freely share in the
> sum of all knowledge. That's our commitment."
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
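(On the auto-update point from earlier in the thread: on Ubuntu 20.04 the periodic apt runs are normally driven by systemd timers together with the APT::Periodic settings, so a minimal sketch for checking and, if desired, disabling them might look like this:

    # see whether the periodic apt timers are scheduled
    systemctl list-timers apt-daily.timer apt-daily-upgrade.timer

    # show the current APT::Periodic settings (typically set in /etc/apt/apt.conf.d/20auto-upgrades)
    apt-config dump | grep APT::Periodic

    # stop unattended package updates by disabling the timers
    sudo systemctl disable --now apt-daily.timer apt-daily-upgrade.timer

Whether disabling automatic updates is the right trade-off for a given cluster is of course a judgement call.)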