Re: How to find out why osd crashed with cephadm/podman containers?

On 05/06 14:03, mabi wrote:
> Hello,
> 
> I have a small 6-node Octopus 15.2.11 cluster installed on bare metal with cephadm, and I added a second OSD to one of my 3 OSD nodes. I then started copying data to my CephFS (kernel mount), but then both OSDs on that specific node crashed.
> 
> To this topic I have the following questions:
> 
> 1) How can I find out why the two OSDs crashed? Because everything runs in podman containers, I don't know where the logs are to find out why this happened. From the OS itself everything looks OK; there was no out-of-memory error.

There should be some logs under /var/log/ceph/<cluster_fsid>/osd.<osd_id>/ on the host(s) that were running the OSDs.
I have also sometimes found myself disabling the '--rm' flag for the pod in the 'unit.run' script under
/var/lib/ceph/<cluster_fsid>/osd.<osd_id>/unit.run, so that podman persists the container and a 'podman logs' on it is possible.
That is probably only sensible while debugging, though.
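
For illustration, something along these lines (the cluster fsid and 'osd.3' below are placeholders, and the journald unit name is just the usual cephadm naming convention, so adjust both to your setup):

    # on the OSD host: file logs written by the container, if any
    ls -l /var/log/ceph/<cluster_fsid>/
    # the containers also log to journald via their systemd unit
    journalctl -u ceph-<cluster_fsid>@osd.3 --since "1 hour ago"
    # after removing '--rm' from unit.run and restarting the unit, the stopped
    # container is kept around, so you can inspect it directly
    podman ps -a | grep osd.3
    podman logs <container_id>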

> 
> 2) I would have assumed the two OSD containers would restart on their own, but it looks like that is not the case. How can I manually restart these 2 OSD containers on that node? I believe this should be a "cephadm orch" command?

I think 'ceph orch daemon redeploy' might do it? What is the output of 'ceph orch ls' and 'ceph orch ps'?
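
Roughly (again using 'osd.3' as a placeholder for one of the failed daemons):

    # see which daemons cephadm reports as being in error state
    ceph orch ps | grep osd
    # try a plain restart of the daemon first
    ceph orch daemon restart osd.3
    # or redeploy it if the restart is not enough
    ceph orch daemon redeploy osd.3
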
> 
> The health of the cluster right now is:
> 
>     CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
>     PG_DEGRADED: Degraded data redundancy: 132518/397554 objects degraded (33.333%), 65 pgs degraded, 65 pgs undersized
> 
> Thank you for your hints.
> 
> Best regards,
> Mabi
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

-- 
David Caro
SRE - Cloud Services
Wikimedia Foundation <https://wikimediafoundation.org/>
PGP Signature: 7180 83A2 AC8B 314F B4CE  1171 4071 C7E1 D262 69C3

"Imagine a world in which every single human being can freely share in the
sum of all knowledge. That's our commitment."


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
