Re: OSD repeatedly marked down

Hi Jan,

On 01.12.21 17:31, Jan Kasprzak wrote:
> In "ceph -s", the "2 osds down" message disappears, and the number of
> degraded objects steadily decreases. However, after some time the number
> of degraded objects starts going up and down again, and OSDs appear to be
> down (and then up again). After 5 minutes the OSDs are kicked out of the
> cluster, and the ceph-osd daemons stop:
>
> Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 received  signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
> Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 osd.32 1119559 *** Got signal Interrupt ***
> Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 osd.32 1119559 *** Immediate shutdown (osd_fast_shutdown=true) ***


Do you have enough memory on your host? You might want to look for OOM-killer messages in dmesg / the journal and monitor your memory usage throughout the recovery.
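
For example, something like this on the OSD host (the exact wording of the kernel messages varies a bit between kernel versions, so treat this as a sketch):

    # check the kernel log for OOM-killer activity
    dmesg -T | grep -iE 'out of memory|oom'
    journalctl -k --since yesterday | grep -iE 'out of memory|oom'

    # keep an eye on memory usage while recovery is running
    free -h
    ps aux --sort=-rss | head -n 15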

If the osd processes are indeed being killed by the OOM killer, you have a few options. Adding more memory would probably be best to future-proof the system. You could also tune some Ceph config settings, e.g. lowering osd_max_backfills so that fewer PGs are backfilled in parallel per OSD (although I'm definitely not an expert on which parameters would give you the best result). Adding swap will most likely only cause other problems, but it might be a method of last resort.
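
For example, assuming a release with the centralized config database (Octopus or newer), something along these lines; the values are only a starting point, not a recommendation:

    # throttle concurrent backfill and recovery operations per OSD
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1

    # check how much memory each OSD tries to use for its caches (4 GiB by default)
    ceph config get osd osd_memory_target

Lowering osd_memory_target can also help on memory-constrained hosts, at the cost of smaller caches and slower recovery.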

Cheers
Sebastian
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
