We have been working with and using cephadm for more than two years.
For this and other reasons we have changed our update strategy to immutable
infrastructure and are currently in the middle of migrating to different
flavours of https://github.com/gardenlinux/gardenlinux.
___________________________________
Clyso GmbH - Ceph Foundation Member
support@xxxxxxxxx
https://www.clyso.com
On 07.08.2021 at 11:51, Andrew Walker-Brown wrote:
Yeah I think that’s along the lines of what I’ve faced here.
Hopefully I’ve managed to disable the auto updates.
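For anyone hitting the same thing, roughly what that involves on Ubuntu
20.04 (a sketch, not verified on every host; check your own
/etc/apt/apt.conf.d first):

  # stop the periodic apt timers that trigger the background updates
  sudo systemctl disable --now apt-daily.timer apt-daily-upgrade.timer
  # and make sure the periodic update/upgrade hooks stay off
  printf 'APT::Periodic::Update-Package-Lists "0";\nAPT::Periodic::Unattended-Upgrade "0";\n' \
    | sudo tee /etc/apt/apt.conf.d/20auto-upgrades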
From: Clyso GmbH - Ceph Foundation Member <joachim.kraftmayer@xxxxxxxxx>
Sent: 07 August 2021 10:46
To: Andrew Walker-Brown <andrew_jbrown@xxxxxxxxxxx>; David Caro <dcaro@xxxxxxxxxxxxx>
Cc: Marc <Marc@xxxxxxxxxxxxxxxxx>; ceph-users@xxxxxxx
Subject: Re: Re: All OSDs on one host down
Hi Andrew,
We have had bad experiences with Ubuntu's auto updates, especially when
they update systemd, dbus, and docker packages.
For example, one effect was internal communication errors; only a
restart of the node helped.
Cheers, Joachim
___________________________________
Clyso GmbH - Ceph Foundation Member
support@xxxxxxxxx
https://www.clyso.com
On 07.08.2021 at 11:04, Andrew Walker-Brown wrote:
> Thanks David,
>
> Spent some more time digging in the logs/Google. Also had a further
> 2 nodes fail this morning (different nodes).
>
> Looks like it’s related to apt auto-updates on Ubuntu 20.04, although
> we don’t run unattended upgrades. Docker appears to get a terminate
> signal, which shuts down/restarts all the containers, but some don’t
> come back cleanly. There were also some legacy unused interfaces/bonds
> in the netplan config.
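> A possible mitigation (not something we’ve verified here) is Docker’s
> live-restore option, so containers keep running if dockerd itself is
> restarted by an update:
>
>   # merge into any existing /etc/docker/daemon.json rather than overwriting it
>   echo '{ "live-restore": true }' | sudo tee /etc/docker/daemon.json
>   sudo systemctl reload docker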
>
> Anyway, cleaned all that up...so hopefully it’s resolved.
>
> Cheers,
>
> A.
>
>
>
>
> From: David Caro <dcaro@xxxxxxxxxxxxx>
> Sent: 06 August 2021 09:20
> To: Andrew Walker-Brown <andrew_jbrown@xxxxxxxxxxx>
> Cc: Marc <Marc@xxxxxxxxxxxxxxxxx>; ceph-users@xxxxxxx
> Subject: Re: Re: All OSDs on one host down
>
> On 08/06 07:59, Andrew Walker-Brown wrote:
>> Hi Marc,
>>
>> Yes, I’m probably doing just that.
>>
>> The Ceph admin guides aren’t exactly helpful on this. The cluster
>> was deployed using cephadm and it’s been running perfectly until now.
>>
>> Wouldn’t running “journalctl -u ceph-osd@5” on host ceph-004 show
>> me the logs for osd.5 on that host?
> On my containerized setup, the services that cephadm created are:
>
> dcaro@node1:~ $ sudo systemctl list-units | grep ceph
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@crash.node1.service        loaded active running   Ceph crash.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mgr.node1.mhqltg.service   loaded active running   Ceph mgr.node1.mhqltg for d49b287a-b680-11eb-95d4-e45f010c03a8
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@mon.node1.service          loaded active running   Ceph mon.node1 for d49b287a-b680-11eb-95d4-e45f010c03a8
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service              loaded active running   Ceph osd.3 for d49b287a-b680-11eb-95d4-e45f010c03a8
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.7.service              loaded active running   Ceph osd.7 for d49b287a-b680-11eb-95d4-e45f010c03a8
> system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice  loaded active active  system-ceph\x2dd49b287a\x2db680\x2d11eb\x2d95d4\x2de45f010c03a8.slice
> ceph-d49b287a-b680-11eb-95d4-e45f010c03a8.target                     loaded active active    Ceph cluster d49b287a-b680-11eb-95d4-e45f010c03a8
> ceph.target                                                          loaded active active    All Ceph clusters and services
>
> where the string after 'ceph-' is the fsid of the cluster.
> Hope that helps (you can also use systemctl list-units to find the
> specific ones on your cluster).
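> A quick sketch using the fsid above (substitute your own, e.g. from
> 'sudo ceph fsid'):
>
>   # logs for the containerized osd.3 on this node
>   sudo journalctl -u ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service
>   # restarting it goes through the same unit name
>   sudo systemctl restart ceph-d49b287a-b680-11eb-95d4-e45f010c03a8@osd.3.service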
>
>
>> Cheers,
>> A
>>
>>
>>
>>
>>
>>
>> From: Marc <Marc@xxxxxxxxxxxxxxxxx>
>> Sent: 06 August 2021 08:54
>> To: Andrew Walker-Brown <andrew_jbrown@xxxxxxxxxxx>; ceph-users@xxxxxxx
>> Subject: RE: All OSDs on one host down
>>
>>> I’ve tried restarting one of the OSDs but that fails; journalctl shows
>>> osd not found... not convinced I’ve got the systemctl command right.
>>>
>> You are mixing 'non-container commands' with 'container commands'. As
>> in, if you execute this journalctl outside of the container, it will
>> of course not find anything.
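>> For example (assuming the host has cephadm installed), something like
>> this stays inside the container setup:
>>
>>   sudo cephadm ls | grep osd.5        # confirms the daemon name cephadm knows
>>   sudo cephadm logs --name osd.5      # wraps journalctl for the containerized daemon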
>>
>>
>> _______________________________________________
>> ceph-users mailing list -- ceph-users@xxxxxxx
>> To unsubscribe send an email to ceph-users-leave@xxxxxxx
> --
> David Caro
> SRE - Cloud Services
> Wikimedia Foundation
> <https://wikimediafoundation.org/>
> PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
>
> "Imagine a world in which every single human being can freely share
in the
> sum of all knowledge. That's our commitment."
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx