Use journalctl -xe (maybe with -S/-U if you want to filter) to find the time window in which a restart attempt happened, and see what was logged during that period.

If that's not helpful, then what you may want to do is disable that service (systemctl disable blah), pull the ExecStart out of it, and try running it by hand to see what happens (the symlinked systemd unit will refer to a unit.run file under /var/lib/ceph that contains the actual podman command). If the pod dies, you should still see it in podman ps -a, and you can run podman logs on it to get the details. Once you've corrected the issue, re-enable the service and restart it properly so the usual housekeeping gets done. A rough command sketch of that sequence is at the end of this message.

Follow these directions at your own risk; make sure you understand the ramifications of whatever you might be doing!

David

On Thu, Mar 18, 2021 at 3:29 PM Philip Brown <pbrown@xxxxxxxxxx> wrote:
>
> I've been banging on my Ceph Octopus test cluster for a few days now.
> 8 nodes; each node has 2 SSDs and 8 HDDs.
> They were all autoprovisioned so that each HDD gets an LVM slice of an SSD as a db partition.
>
> service_type: osd
> service_id: osd_spec_default
> placement:
>   host_pattern: '*'
> data_devices:
>   rotational: 1
> db_devices:
>   rotational: 0
>
> Things were going pretty well, until... yesterday I noticed TWO of the OSDs were "down".
>
> I went to check the logs with
>
> journalctl -u ceph-xxxx@xxxxxxx
>
> All it showed was a bunch of generic debug info, the fact that the daemon stopped,
> and various automatic attempts to restart it,
> but no indication of what was wrong or why the restarts KEEP failing.
>
> Sample output:
>
> systemd[1]: Stopped Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00.
> systemd[1]: Starting Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00...
> bash[9340]: ceph-e51eb2fa-7f82-11eb-94d5-78e3b5148f00-osd.33-activate
> bash[9340]: WARNING: The same type, major and minor should not be used for multiple devices.
> bash[9340]: WARNING: The same type, major and minor should not be used for multiple devices.
> podman[9369]: 2021-03-07 16:00:15.543010794 -0800 PST m=+0.318475882 container create
> podman[9369]: 2021-03-07 16:00:15.73461926 -0800 PST m=+0.510084288 container init
> .....
> bash[1611473]: --> ceph-volume lvm activate successful for osd ID: 33
> podman[1611501]: 2021-03-18 10:23:02.564242824 -0700 PDT m=+1.379793448 container died
> bash[1611473]: ceph-xx-xx-xx-xx-osd.33
> bash[1611473]: WARNING: The same type, major and minor should not be used for multiple devices.
> (repeat, repeat...)
> podman[1611615]: 2021-03-18 10:23:03.530992487 -0700 PDT m=+0.333130660 container create
>
> ....
> systemd[1]: Started Ceph osd.33 for xx-xx-xx-xx
> systemd[1]: ceph-xx-xx-xx-xx@osd.33.service: main process exited, code=exited, status=1/FAILURE
> bash[1611797]: ceph-xx-xx-xx-xx-osd.33-deactivate
>
> and eventually it just gives up.
>
> smartctl -a doesn't show any errors on the HDD, and dmesg doesn't show anything.
>
> So... what do I do?
>
> --
> Philip Brown | Sr. Linux System Administrator | Medata, Inc.
> 5 Peters Canyon Rd Suite 250
> Irvine CA 92606
> Office 714.918.1310 | Fax 714.918.1325
> pbrown@xxxxxxxxxx | www.medata.com
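
A minimal sketch of the sequence described above, assuming a cephadm deployment where the unit is named ceph-<fsid>@osd.<id> and the unit.run file lives under /var/lib/ceph/<fsid>/osd.<id>/ -- substitute your own fsid, OSD id, timestamps, and container ID:

    # look at the journal around one of the failed restart attempts
    journalctl -xe -u ceph-<fsid>@osd.33 -S "2021-03-18 10:20" -U "2021-03-18 10:30"

    # stop the automatic restart loop while debugging
    systemctl disable --now ceph-<fsid>@osd.33

    # the unit's ExecStart points at a unit.run script containing the actual podman command
    cat /var/lib/ceph/<fsid>/osd.33/unit.run

    # run that podman command by hand, then inspect the dead container
    podman ps -a | grep osd.33
    podman logs <container-id>

    # once the underlying problem is fixed, re-enable and restart the service properly
    systemctl enable --now ceph-<fsid>@osd.33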