ceph octopus mysterious OSD crash


 



I've been banging on my Ceph Octopus test cluster for a few days now:
8 nodes, each with 2 SSDs and 8 HDDs.
The OSDs were all autoprovisioned so that each HDD gets an LVM slice of an SSD as its DB device, using this spec:

service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0
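
In case it helps reproduce: the spec was applied in the usual cephadm way, roughly like below (the filename is arbitrary, and --dry-run previews which disks the orchestrator would claim before anything is created):

```shell
# Save the spec above to a file (filename is illustrative).
cat > osd_spec.yml <<'EOF'
service_type: osd
service_id: osd_spec_default
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  rotational: 0
EOF
# Then, on the admin node (commented out here since it needs a live cluster):
# ceph orch apply -i osd_spec.yml --dry-run   # preview device selection
# ceph orch apply -i osd_spec.yml             # provision the OSDs
```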


Things were going pretty well until yesterday, when I noticed TWO of the OSDs were "down".

I went to check the logs with:
journalctl -u ceph-xxxx@xxxxxxx

All it showed was a bunch of generic debug info, the fact that the daemon stopped,
and various automatic restart attempts,
but no indication of what was wrong, or why the restarts KEEP failing.


Sample output:


systemd[1]: Stopped Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00.
systemd[1]: Starting Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00...
bash[9340]: ceph-e51eb2fa-7f82-11eb-94d5-78e3b5148f00-osd.33-activate
bash[9340]: WARNING: The same type, major and minor should not be used for multiple devices.
bash[9340]: WARNING: The same type, major and minor should not be used for multiple devices.
podman[9369]: 2021-03-07 16:00:15.543010794 -0800 PST m=+0.318475882 container create
podman[9369]: 2021-03-07 16:00:15.73461926 -0800 PST m=+0.510084288 container init
.....
bash[1611473]: --> ceph-volume lvm activate successful for osd ID: 33
podman[1611501]: 2021-03-18 10:23:02.564242824 -0700 PDT m=+1.379793448 container died 
bash[1611473]: ceph-xx-xx-xx-xx-osd.33
bash[1611473]: WARNING: The same type, major and minor should not be used for multiple devices.
(repeat, repeat...)
podman[1611615]: 2021-03-18 10:23:03.530992487 -0700 PDT m=+0.333130660 container create

....
systemd[1]: Started Ceph osd.33 for xx-xx-xx-xx
systemd[1]: ceph-xx-xx-xx-xx@osd.33.service: main process exited, code=exited, status=1/FAILURE
bash[1611797]: ceph-xx-xx-xx-xx-osd.33-deactivate

Eventually it just gives up.
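One thing the journal does give is the exit status, which at least distinguishes a clean error exit (status=1) from a signal crash (e.g. status=139 for SIGSEGV). A quick sketch of pulling it out, run against the failure line from the excerpt above:

```shell
# Sample systemd failure line, as seen in the journal excerpt above.
line='systemd[1]: ceph-xx-xx-xx-xx@osd.33.service: main process exited, code=exited, status=1/FAILURE'
# Extract the status=... token so a crash signal would stand out at a glance.
status=$(printf '%s\n' "$line" | grep -oE 'status=[^ ]+')
echo "$status"   # status=1/FAILURE
```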

smartctl -a doesn't show any errors on the HDD.


dmesg doesn't show anything either.
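
For completeness, here's where else I've been grepping: assuming a cephadm-style layout (and that file logging is enabled, which it may not be by default), the per-fsid logs under /var/log/ceph are where an assert or BlueStore error would normally land. The fsid below is from the journal excerpt, and the grep pattern is just a heuristic:

```shell
# Hedged sketch: look for the real failure in the daemon's own logs,
# assuming cephadm keeps them under /var/log/ceph/<fsid>/.
FSID=e51eb2fa-7f82-11eb-94d5-78e3b5148f00
for f in /var/log/ceph/$FSID/ceph-osd.33.log \
         /var/log/ceph/$FSID/ceph-volume.log; do
  if [ -f "$f" ]; then
    echo "== $f =="
    # Heuristic pattern; shows the last 20 suspicious lines, if any.
    grep -inE 'abort|assert|bluestore.*(error|fail)' "$f" | tail -20
  fi
done
```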

So... what do I do?





--
Philip Brown| Sr. Linux System Administrator | Medata, Inc. 
5 Peters Canyon Rd Suite 250 
Irvine CA 92606 
Office 714.918.1310| Fax 714.918.1325 
pbrown@xxxxxxxxxx| www.medata.com
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


