Re: All OSDs on one host down

Clyso GmbH - Ceph Foundation Member <joachim.kraftmayer@xxxxxxxxx> · Sat, 7 Aug 2021 11:45:57 +0200

Hi Andrew,

We have had bad experiences with Ubuntu's automatic updates, especially
when they update the systemd, dbus and docker packages.
One effect, for example, was internal communication errors that only a
restart of the node resolved.

Cheers, Joachim

___________________________________
Clyso GmbH - Ceph Foundation Member
support@xxxxxxxxx
https://www.clyso.com

On 07.08.2021 at 11:04, Andrew Walker-Brown wrote:
Thanks David,

I spent some more time digging through the logs and Google.  Two more nodes (different ones) also failed this morning.

It looks like it’s related to apt auto-updates on Ubuntu 20.04, although we don’t run unattended upgrades.  Docker appears to get a terminate signal, which shuts down and restarts all the containers, but some don’t come back cleanly.  There were also some unused legacy interfaces/bonds in the netplan config.

Anyway, cleaned all that up...so hopefully it’s resolved.
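
One way to confirm whether an apt run touched Docker or systemd around the time of a failure is to scan the apt history log. The following is a minimal sketch, not a definitive procedure: it assumes the usual /var/log/apt/history.log layout (Start-Date:, Commandline:, Upgrade:, End-Date: lines), and the function name apt_upgrades_touching and the package pattern are hypothetical choices made for illustration.

```shell
#!/bin/sh
# Print the start time (and command line, when recorded) of every apt
# transaction whose Upgrade: line mentions one of the given packages.
# Assumption: the log follows the usual apt history.log layout.
apt_upgrades_touching() {
    log="$1"       # path to an apt history log
    pattern="$2"   # extended regex of package names
    awk -v pat="$pattern" '
        /^Start-Date:/  { start = $0; cmd = "" }
        /^Commandline:/ { cmd = $0 }
        /^Upgrade:/ && $0 ~ pat { print start; if (cmd != "") print cmd }
    ' "$log"
}

# Only query the real log when it exists and is readable.
if [ -r /var/log/apt/history.log ]; then
    apt_upgrades_touching /var/log/apt/history.log 'docker|containerd|systemd|dbus'
fi
```

Comparing the printed Start-Date values with the time the containers were terminated should show whether the two events line up.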

Cheers,

A.

Sent from Mail<https://go.microsoft.com/fwlink/?LinkId=550986> for Windows 10

From: David Caro<mailto:dcaro@xxxxxxxxxxxxx>
Sent: 06 August 2021 09:20
To: Andrew Walker-Brown<mailto:andrew_jbrown@xxxxxxxxxxx>
Cc: Marc<mailto:Marc@xxxxxxxxxxxxxxxxx>; ceph-users@xxxxxxx<mailto:ceph-users@xxxxxxx>
Subject: Re:  Re: All OSDs on one host down

On 08/06 07:59, Andrew Walker-Brown wrote:
Hi Marc,

Yes, I’m probably doing just that.

The ceph admin guides aren’t exactly helpful on this.  The cluster was deployed using cephadm and it’s been running perfectly until now.

Wouldn’t running “journalctl -u ceph-osd@5” on host ceph-004 show me the logs for osd.5 on that host?

On my containerized setup, the services that cephadm created are:

dcaro@node1:~ $ sudo systemctl list-units | grep ceph
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@crash.node1.service                                                                                 loaded active running   Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service                                                                            loaded active running   Ceph mgr.node1.mhqltg for d49b287a-b680-11eb-95d4-e45f010c03a8
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mon.node1.service                                                                                   loaded active running   Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service                                                                                       loaded active running   Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.7.service                                                                                       loaded active running   Ceph osd.7 for d49b287a-b680-11eb-95d4-e45f010c03a8
   system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice                                                                         loaded active active    system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
   ceph-d49b287a-b680-11eb-95d4-e45f010c03a8.target                                                                                              loaded active active    Ceph cluster d49b287a-b680-11eb-95d4-e45f010c03a8
   ceph.target                                                                                                                                   loaded active active    All Ceph clusters and services

where the string after 'ceph-' is the fsid of the cluster.
Hope that helps (you can also use systemctl list-units to find the specific units on your hosts).
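
Following the naming pattern in the listing above (ceph-<fsid>@<daemon>.service), the journalctl command for a given daemon can be sketched as below. This is only an illustration: the helper name ceph_unit_name is hypothetical, and the fsid used is the one from the listing above, so it has to be replaced with the fsid of your own cluster (as shown by systemctl list-units on your host).

```shell
#!/bin/sh
# Build the systemd unit name for a cephadm-managed daemon, following the
# pattern seen in the listing: ceph-<fsid>@<daemon>.service
ceph_unit_name() {
    fsid="$1"      # cluster fsid (the string after 'ceph-' in the unit names)
    daemon="$2"    # daemon name such as osd.5 or mon.node1
    printf 'ceph-%s@%s.service\n' "$fsid" "$daemon"
}

# Example only: fsid taken from the listing above, osd.5 from the question.
unit="$(ceph_unit_name d49b287a-b680-11eb-95d4-e45f010c03a8 osd.5)"
echo "journalctl -u $unit"
```

Running the printed command on the host that carries the OSD should then show that daemon's log, provided a unit with that name exists there.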

Cheers,
A


From: Marc<mailto:Marc@xxxxxxxxxxxxxxxxx>
Sent: 06 August 2021 08:54
To: Andrew Walker-Brown<mailto:andrew_jbrown@xxxxxxxxxxx>; ceph-users@xxxxxxx<mailto:ceph-users@xxxxxxx>
Subject: RE: All OSDs on one host down

I’ve tried restarting one of the OSDs but that fails; journalctl shows
osd not found... I’m not convinced I’ve got the systemctl command right.

Are you not mixing non-container commands with container commands? That is, if you run this journalctl command outside of the container, it will of course not find anything.

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
--
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."
