Re: ceph octopus mysterious OSD crash


 



"podman logs ceph-xxxxxxx-osd-xxx" may contain additional logs.
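For example, with the fsid from your journal output and osd.33 (assuming podman still has the exited container around; if cephadm started it with --rm it may already be gone):

    # list the ceph containers on this host, including exited ones
    podman ps -a --filter name=osd.33

    # dump the container's own stdout/stderr (exact name from the listing above)
    podman logs ceph-e51eb2fa-7f82-11eb-94d5-78e3b5148f00-osd.33

cephadm can also pull the daemon's journal for you, which is the same data journalctl shows but with the unit name resolved:

    cephadm logs --fsid e51eb2fa-7f82-11eb-94d5-78e3b5148f00 --name osd.33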

> On Mar 19, 2021, at 04:29, Philip Brown <pbrown@xxxxxxxxxx> wrote:
> 
> I've been banging on my Ceph Octopus test cluster for a few days now.
> 8 nodes; each node has 2 SSDs and 8 HDDs.
> They were all autoprovisioned so that each HDD gets an LVM slice of an SSD as a db partition.
> 
> service_type: osd
> service_id: osd_spec_default
> placement:
>  host_pattern: '*'
> data_devices:
>  rotational: 1
> db_devices:
>  rotational: 0
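(Side note: to double-check what that spec actually carved out on the failing host, e.g. which SSD LV is serving as osd.33's db, something along these lines should do; run the first one on the OSD host itself:)

    # show the data/db LVs ceph-volume created for each OSD on this host
    cephadm ceph-volume -- lvm list

    # from a node with the admin keyring: the orchestrator's view of the devices
    ceph orch device ls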
> 
> 
> Things were going pretty well, until... yesterday... I noticed TWO of the OSDs were "down".
> 
> I went to check the logs, with 
> journalctl -u ceph-xxxx@xxxxxxx
> 
> All it showed was a bunch of generic debug info, the fact that the daemon stopped,
> and various automatic attempts to restart it,
> but no indication of what was wrong, or why the restarts KEEP failing.
> 
> 
> sample output:
> 
> 
> systemd[1]: Stopped Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00.
> systemd[1]: Starting Ceph osd.33 for e51eb2fa-7f82-11eb-94d5-78e3b5148f00...
> bash[9340]: ceph-e51eb2fa-7f82-11eb-94d5-78e3b5148f00-osd.33-activate
> bash[9340]: WARNING: The same type, major and minor should not be used for multiple devices.
> bash[9340]: WARNING: The same type, major and minor should not be used for multiple devices.
> podman[9369]: 2021-03-07 16:00:15.543010794 -0800 PST m=+0.318475882 container create
> podman[9369]: 2021-03-07 16:00:15.73461926 -0800 PST m=+0.510084288 container init
> .....
> bash[1611473]: --> ceph-volume lvm activate successful for osd ID: 33
> podman[1611501]: 2021-03-18 10:23:02.564242824 -0700 PDT m=+1.379793448 container died 
> bash[1611473]: ceph-xx-xx-xx-xx-osd.33
> bash[1611473]: WARNING: The same type, major and minor should not be used for multiple devices.
> (repeat, repeat...)
> podman[1611615]: 2021-03-18 10:23:03.530992487 -0700 PDT m=+0.333130660 container create
> 
> ....
> systemd[1]: Started Ceph osd.33 for xx-xx-xx-xx
> systemd[1]: ceph-xx-xx-xx-xx@osd.33.service: main process exited, code=exited, status=1/FAILURE
> bash[1611797]: ceph-xx-xx-xx-xx-osd.33-deactivate
> 
> and eventually it just gives up.
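With containerized Octopus the OSD logs to stderr by default, which is why journalctl and podman only show the generic wrapper output. If I remember the defaults correctly, turning file logging on should give you a proper per-daemon log under /var/log/ceph/<fsid>/ on the host the next time the OSD tries to start:

    # send daemon logs to files as well
    ceph config set global log_to_file true

    # then, on the OSD host, after the next restart attempt:
    less /var/log/ceph/e51eb2fa-7f82-11eb-94d5-78e3b5148f00/ceph-osd.33.log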
> 
> smartctl -a doesn't show any errors on the HDD.
> 
> 
> dmesg doesn't show anything.
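Also worth checking whether the daemon left a crash dump; if it aborted rather than being killed from outside, the crash module usually records a backtrace:

    # list recorded daemon crashes, then show the backtrace of one of them
    ceph crash ls
    ceph crash info <crash-id from the list>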
> 
> So... what do I do?
> 
> 
> 
> 
> 
> --
> Philip Brown| Sr. Linux System Administrator | Medata, Inc. 
> 5 Peters Canyon Rd Suite 250 
> Irvine CA 92606 
> Office 714.918.1310| Fax 714.918.1325 
> pbrown@xxxxxxxxxx | http://www.medata.com/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



