Re: How to find out why osd crashed with cephadm/podman containers?

Thank you very much for the hint regarding the log files, I wasn't aware that the logs are still kept on the host even though everything runs in containers nowadays.
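
In case it helps anyone else, these are roughly the commands I tried on the OSD host to get at the daemon logs (osd.1 and <cluster_fsid> are just placeholders for my values, and the exact paths can differ depending on whether file logging is enabled):

# cephadm can pull a daemon's journald logs for you
cephadm logs --name osd.1

# or query journald directly for the container's systemd unit
journalctl -u ceph-<cluster_fsid>@osd.1.service --since "1 hour ago"

# if file logging is enabled there may also be something under
ls -l /var/log/ceph/<cluster_fsid>/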

So there was nothing in the log files, but I could eventually find out that the host (a RasPi4) could not cope with the 2 external USB SSDs connected to it, probably because it could not supply enough power, so the disks disappeared and the OSDs went away with them. After a restart of the host the disks were back, as well as the OSD containers. So I have now removed that second OSD and will keep only one single OSD per server.
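
For completeness, removing that second OSD with cephadm was roughly the following (the osd id 1 is again just a placeholder for my value):

ceph orch osd rm 1
ceph orch osd rm status    # watch the draining / removal progress
ceph orch ps               # confirm the daemon no longer shows up on the host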

For reference here is the relevant part of the kernel log I saw:

[Thu May  6 15:24:34 2021] blk_update_request: I/O error, dev sda, sector 40063143 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[Thu May  6 15:24:34 2021] usb 1-1-port4: over-current change #1

and of course it did that for both sda and sdb.
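
In case someone wants to check for the same symptom on their own host, this is more or less what I grepped for (nothing ceph-specific, just the kernel ring buffer):

dmesg -T | grep -Ei 'over-current|I/O error'
journalctl -k --since "1 day ago" | grep -Ei 'over-current|blk_update_request'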


‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Thursday, May 6, 2021 4:17 PM, David Caro <dcaro@xxxxxxxxxxxxx> wrote:

> On 05/06 14:03, mabi wrote:
>
> > Hello,
> > I have a small 6-node Octopus 15.2.11 cluster installed on bare metal with cephadm, and I added a second OSD to one of my 3 OSD nodes. I then started copying data to my ceph fs mounted with the kernel client, but then both OSDs on that specific node crashed.
> > To this topic I have the following questions:
> >
> > 1.  How can I find out why the two OSDs crashed? Because everything is in podman containers, I don't know where the logs are to find out why this happened. From the OS itself everything looks OK; there was no out-of-memory error.
>
> There should be some logs under /var/log/ceph/<cluster_fsid>/osd.<osd_id>/ on the host/hosts that were running the osds.
> Sometimes, though, I found myself disabling the '--rm' flag for the pod in the 'unit.run' script under
> /var/lib/ceph/<ceph_fsid>/osd.<id>/unit.run to make podman persist the container and be able to do a 'podman logs' on it.
> Though that's probably sensible only when debugging.
>
> > 2.  I would assume the two OSD containers would restart on their own, but it looks like this is not the case. How can I manually restart these 2 OSD containers on that node? I believe this should be a "cephadm orch" command?
>
> I think 'ceph orch daemon redeploy' might do it? What is the output of 'ceph orch ls' and 'ceph orch ps'?
>
> > The health of the cluster right now is:
> >
> >     CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
> >     PG_DEGRADED: Degraded data redundancy: 132518/397554 objects degraded (33.333%), 65 pgs degraded, 65 pgs undersized
> >
> >
> > Thank you for your hints.
> > Best regards,
> > Mabi
> >
> > ceph-users mailing list -- ceph-users@xxxxxxx
> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
> --
>
> David Caro
> SRE - Cloud Services
> Wikimedia Foundation https://wikimediafoundation.org/
> PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
>
> "Imagine a world in which every single human being can freely share in the
> sum of all knowledge. That's our commitment."
>
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



