Re: OSD repeatedly marked down

Hi Jan,

On 01.12.21 17:31, Jan Kasprzak wrote:
> In "ceph -s", the "2 osds down" message disappears, and the number of
> degraded objects steadily decreases. However, after some time the number
> of degraded objects starts going up and down again, and OSDs appear to be
> down (and then up again). After 5 minutes the OSDs are kicked out of the
> cluster, and the ceph-osd daemons stop:
>
> Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 received  signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
> Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 osd.32 1119559 *** Got signal Interrupt ***
> Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 osd.32 1119559 *** Immediate shutdown (osd_fast_shutdown=true) ***


Do you have enough memory on your host? You might want to look for OOM-killer messages in dmesg / the journal and monitor your memory usage throughout the recovery.
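
For example, something like this on the OSD host (the exact wording of the kernel messages varies a bit between kernel versions, so treat this as a sketch):

    # check the kernel log for OOM-killer activity
    dmesg -T | grep -iE 'out of memory|oom'
    journalctl -k --since yesterday | grep -iE 'out of memory|oom'

    # keep an eye on memory usage while recovery is running
    free -h
    ps aux --sort=-rss | head -n 15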

If the osd processes are indeed being killed by the OOM killer, you have a few options. Adding more memory would probably be best to future-proof the system. You could also tune some Ceph config settings, e.g. lowering osd_max_backfills so that fewer PGs are backfilled in parallel per OSD (although I'm definitely not an expert on which parameters would give you the best result). Adding swap will most likely only cause other problems, but it might be a method of last resort.
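
For example, assuming a release with the centralized config database (Octopus or newer), something along these lines; the values are only a starting point, not a recommendation:

    # throttle concurrent backfill and recovery operations per OSD
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1

    # check how much memory each OSD tries to use for its caches (4 GiB by default)
    ceph config get osd osd_memory_target

Lowering osd_memory_target can also help on memory-constrained hosts, at the cost of smaller caches and slower recovery.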

Cheers
Sebastian
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
