Thank you very much for the hint regarding the log files, I wasn't aware that Ceph still saves logs on the host even though everything runs in containers nowadays. There was nothing in the log files, but I was able to work out that the host (a RasPi4) could not cope with the 2 external USB SSDs connected to it. Probably due to insufficient power, the disks disappeared and the OSDs went away with them. After a restart of the host the disks were back, as were the OSD containers. I have now removed that second OSD and will keep only a single OSD per server.

For reference, here is the relevant part of the kernel log I saw:

[Thu May 6 15:24:34 2021] blk_update_request: I/O error, dev sda, sector 40063143 op 0x1:(WRITE) flags 0x8800 phys_seg 1 prio class 0
[Thu May 6 15:24:34 2021] usb 1-1-port4: over-current change #1

and of course it did that for both sda and sdb.

------- Original Message -------
On Thursday, May 6, 2021 4:17 PM, David Caro <dcaro@xxxxxxxxxxxxx> wrote:

> On 05/06 14:03, mabi wrote:
>
> > Hello,
> >
> > I have a small 6-node Octopus 15.2.11 cluster installed on bare metal with cephadm, and I added a second OSD to one of my 3 OSD nodes. I then started copying data to my CephFS (mounted with the kernel client), but both OSDs on that specific node crashed.
> >
> > To this topic I have the following questions:
> >
> > 1. How can I find out why the two OSDs crashed? Because everything is in podman containers, I don't know where the logs are to find out why this happened. From the OS itself everything looks OK; there was no out-of-memory error.
>
> There should be some logs under /var/log/ceph/<cluster_fsid>/osd.<osd_id>/ on the host/hosts that were running the OSDs.
> I have sometimes found myself disabling the '--rm' flag for the pod in the 'unit.run' script under
> /var/lib/ceph/<ceph_fsid>/osd.<id>/unit.run, though, to make podman persist the container and be able to do a 'podman logs' on it.
> That's probably sensible only when debugging.
>
> > 2. I would have assumed the two OSD containers would restart on their own, but it looks like this is not the case. How can I restart these 2 OSD containers manually on that node? I believe this should be a "ceph orch" command?
>
> I think 'ceph orch daemon redeploy' might do it? What is the output of 'ceph orch ls' and 'ceph orch ps'?
>
> > The health of the cluster right now is:
> >
> > CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
> > PG_DEGRADED: Degraded data redundancy: 132518/397554 objects degraded (33.333%), 65 pgs degraded, 65 pgs undersized
> >
> > Thank you for your hints.
> >
> > Best regards,
> > Mabi
>
> --
> David Caro
> SRE - Cloud Services
> Wikimedia Foundation https://wikimediafoundation.org/
> PGP Signature: 7180 83A2 AC8B 314F B4CE 1171 4071 C7E1 D262 69C3
>
> "Imagine a world in which every single human being can freely share in the
> sum of all knowledge. That's our commitment."

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
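
As an illustration of the log-hunting approach David describes above, here is a minimal shell sketch. The <fsid> and <id> values, the exact log file names, and the container name are placeholders/assumptions and may differ between releases:

    # Find the cluster fsid and see which daemons cephadm reports as failed
    ceph fsid
    ceph orch ps

    # cephadm keeps per-daemon logs on the host that ran the OSD
    # (exact file names may vary; <fsid> and <id> are placeholders)
    ls /var/log/ceph/<fsid>/
    less /var/log/ceph/<fsid>/ceph-osd.<id>.log

    # While debugging, the '--rm' flag can be removed from unit.run so that
    # podman keeps the stopped container around for 'podman logs'
    sudoedit /var/lib/ceph/<fsid>/osd.<id>/unit.run   # delete the '--rm' flag
    sudo systemctl restart ceph-<fsid>@osd.<id>.service
    podman ps -a | grep osd.<id>
    podman logs <container_name_or_id>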
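
For the question about bringing crashed OSD containers back, a rough sketch using the orchestrator commands mentioned in the thread; 'osd.3' is a hypothetical daemon name, substitute whatever 'ceph orch ps' shows as failed:

    # Check what the orchestrator and cluster health report
    ceph orch ls
    ceph orch ps
    ceph health detail

    # Restart a specific failed OSD daemon (name is a placeholder)
    ceph orch daemon restart osd.3

    # Or redeploy it, as suggested above
    ceph orch daemon redeploy osd.3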
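
Since the root cause turned out to be on the host (USB power) rather than inside the containers, checking the host kernel log directly is worthwhile; a small sketch, where the grep patterns simply match the messages quoted above:

    # Kernel messages live on the host, outside the podman containers
    dmesg -T | grep -iE 'over-current|I/O error'
    journalctl -k -b | grep -iE 'over-current|blk_update_request'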
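
And for removing the second OSD again via cephadm, a hedged sketch; the OSD id, host name and device path are placeholders, and the exact flags may differ by release:

    # Drain and remove the OSD through the orchestrator (id is a placeholder)
    ceph orch osd rm 4
    ceph orch osd rm status

    # Once removal has finished, optionally wipe the device for reuse
    ceph orch device zap <hostname> /dev/sdb --force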