Sebastian,

Sebastian Knust wrote:
: On 01.12.21 17:31, Jan Kasprzak wrote:
: >In "ceph -s", the "2 osds down"
: >message disappears, and the number of degraded objects steadily decreases.
: >However, after some time the number of degraded objects starts going up
: >and down again, and osds appear to be down (and then up again). After 5 minutes
: >the OSDs are kicked out from the cluster, and the ceph-osd daemons stop
: >Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 received signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
: >Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 osd.32 1119559 *** Got signal Interrupt ***
: >Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 osd.32 1119559 *** Immediate shutdown (osd_fast_shutdown=true) ***
: >
:
: Do you have enough memory on your host? You might want to look for
: oom messages in dmesg / journal and monitor your memory usage
: throughout the recovery.

	Yes, I have lots of memory. This particular node has 512 GB,
and according to top(1), the ceph-osd daemon has VSZ around 1.1 GB.
OOM would be visible in dmesg(8) (it is not). AFAIK, CentOS 8 Stream
does not have systemd-oomd(8) yet.

: If the osd processes are indeed killed by OOM killer, you have a few
: options. Adding more memory would probably be best to future-proof
: the system. Maybe you could also work with some Ceph config setting,
: e.g. lowering osd_max_backfills (although I'm definitely not an
: expert on which parameters would give you the best result). Adding
: swap will most likely only produce other issues, but might be a
: method of last resort.

	I tend to add a small swap partition to my systems (this one
has 8 GB of swap) just to get rid of initialization code in various processes.
But after starting the ceph-osd daemons (which get killed after exactly
600.0 seconds), exactly zero bytes of swap space are in use.
So I don't think my problem is OOM. It might be a communication problem,
but I ran tcpdump and looked, for example, for ICMP port unreachable
messages, and found nothing interesting there.

-Yenya

-- 
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| http://www.fi.muni.cz/~kas/                         GPG: 4096R/A45477D5 |
    We all agree on the necessity of compromise. We just can't agree on
    when it's necessary to compromise.                     --Larry Wall
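
The two checks used in this message to rule out OOM (looking for OOM-killer messages in the kernel log, and confirming that no swap is in use) could be scripted roughly as follows. This is only a minimal sketch: the function names find_oom_lines and swap_used_kib are made up for illustration, the matched strings are the usual Linux OOM-killer messages rather than anything Ceph-specific, and the inputs are assumed to be the text of `dmesg` output and of /proc/meminfo.

```python
import re

# Phrases the Linux kernel logs when the OOM killer runs.
# "Out of memory: Kill" covers both the "Kill process" and
# "Killed process" wordings used by different kernel versions.
OOM_PATTERN = re.compile(r"invoked oom-killer|Out of memory: Kill")


def find_oom_lines(log_text):
    """Return the kernel log lines that look like OOM-killer activity."""
    return [line for line in log_text.splitlines() if OOM_PATTERN.search(line)]


def swap_used_kib(meminfo_text):
    """Return used swap in KiB, given the contents of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token after the colon is the numeric value.
            fields[key.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]
```

To use it, feed find_oom_lines() the captured output of dmesg (or of the journal's kernel messages) and swap_used_kib() the contents of /proc/meminfo. An empty list together with a result of 0 would match what is reported above: no OOM messages and zero bytes of swap used.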