Use journalctl -xe (maybe with -S/-U if you want to filter) to find the time window in which a restart attempt happened, and see what was logged during that period.

If that's not helpful, then what you may want to do is disable that service (systemctl disable blah), pull the ExecStart out of it, and try running it by hand to see what happens (the symlinked systemd unit will refer to a unit.run file under /var/lib/ceph that contains the actual podman command). If the pod dies, you should still see it in podman ps -a, and you can run podman logs on it to get the details. Once you've corrected the issue, re-enable the service and restart it properly so the usual housekeeping gets done. A rough command sketch of that sequence is at the end of this message.

Follow these directions at your own risk; make sure you understand the ramifications of whatever you might be doing!

David

On Thu, Mar 18, 2021 at 3:29 PM Philip Brown <pbrown@xxxxxxxxxx> wrote:
>
> I've been banging on my Ceph Octopus test cluster for a few days now.
> 8 nodes; each node has 2 SSDs and 8 HDDs.
> They were all autoprovisioned so that each HDD gets an LVM slice of an SSD as a db partition.
>
> service_type: osd
> service_id: osd_spec_default
> placement:
>   host_pattern: '*'
> data_devices:
>   rotational: 1
> db_devices:
>   rotational: 0
>
> Things were going pretty well, until... yesterday I noticed TWO of the OSDs were "down".
>
> I went to check the logs with
>
> journalctl -u ceph-xxxx@xxxxxxx
>
> All it showed was a bunch of generic debug info, the fact that the daemon stopped,
> and various automatic attempts to restart it,
> but no indication of what was wrong or why the restarts KEEP failing.
>
> Sample output:
>
> systemd[1]: Stopped Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00.
> systemd[1]: Starting Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00...
> bash[9340]: ceph-e51eb2fa-7f82-11eb-94d5-78e3b5148f00-osd.33-activate
> bash[9340]: WARNING: The same type, major and minor should not be used for multiple devices.
> bash[9340]: WARNING: The same type, major and minor should not be used for multiple devices.
> podman[9369]: 2021-03-07 16:00:15.543010794 -0800 PST m=+0.318475882 container create
> podman[9369]: 2021-03-07 16:00:15.73461926 -0800 PST m=+0.510084288 container init
> .....
> bash[1611473]: --> ceph-volume lvm activate successful for osd ID: 33
> podman[1611501]: 2021-03-18 10:23:02.564242824 -0700 PDT m=+1.379793448 container died
> bash[1611473]: ceph-xx-xx-xx-xx-osd.33
> bash[1611473]: WARNING: The same type, major and minor should not be used for multiple devices.
> (repeat, repeat...)
> podman[1611615]: 2021-03-18 10:23:03.530992487 -0700 PDT m=+0.333130660 container create
>
> ....
> systemd[1]: Started Ceph osd.33 for xx-xx-xx-xx
> systemd[1]: ceph-xx-xx-xx-xx@osd.33.service: main process exited, code=exited, status=1/FAILURE
> bash[1611797]: ceph-xx-xx-xx-xx-osd.33-deactivate
>
> and eventually it just gives up.
>
> smartctl -a doesn't show any errors on the HDD, and dmesg doesn't show anything.
>
> So... what do I do?
>
> --
> Philip Brown | Sr. Linux System Administrator | Medata, Inc.
> 5 Peters Canyon Rd Suite 250
> Irvine CA 92606
> Office 714.918.1310 | Fax 714.918.1325
> pbrown@xxxxxxxxxx | www.medata.com
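
A minimal sketch of the sequence described above, assuming a cephadm deployment where the unit is named ceph-<fsid>@osd.<id> and the unit.run file lives under /var/lib/ceph/<fsid>/osd.<id>/ -- substitute your own fsid, OSD id, timestamps, and container ID:

    # look at the journal around one of the failed restart attempts
    journalctl -xe -u ceph-<fsid>@osd.33 -S "2021-03-18 10:20" -U "2021-03-18 10:30"

    # stop the automatic restart loop while debugging
    systemctl disable --now ceph-<fsid>@osd.33

    # the unit's ExecStart points at a unit.run script containing the actual podman command
    cat /var/lib/ceph/<fsid>/osd.33/unit.run

    # run that podman command by hand, then inspect the dead container
    podman ps -a | grep osd.33
    podman logs <container-id>

    # once the underlying problem is fixed, re-enable and restart the service properly
    systemctl enable --now ceph-<fsid>@osd.33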