Hi,
these are the defaults set by cephadm in Octopus and Pacific:
---snip---
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/environment
ExecStart=/bin/bash {data_dir}/{fsid}/%i/unit.run
ExecStop=-{container_path} stop ceph-{fsid}-%i
ExecStopPost=-/bin/bash {data_dir}/{fsid}/%i/unit.poststop
KillMode=none
Restart=on-failure
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=120
StartLimitInterval=30min
StartLimitBurst=5
---snip---
So there are StartLimit options.
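If you want different values for a single OSD you could probably use a
systemd drop-in instead of touching the generated unit file; something
like this should work (untested, the fsid and OSD id are placeholders
for your environment):
---snip---
# systemctl edit ceph-<fsid>@osd.<id>.service
# this creates /etc/systemd/system/ceph-<fsid>@osd.<id>.service.d/override.conf
[Unit]
# on a recent systemd these options belong in the [Unit] section;
# give up after 3 failed starts within 30 minutes
StartLimitIntervalSec=30min
StartLimitBurst=3
---snip---
You could also set Restart=no in the [Service] section if you don't
want systemd to restart the daemon at all after a crash.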
> What are other options to prevent OSD containers from trying to
> restart after a valid crash?
The question is how you determine a "valid" crash. I wouldn't want the
first crash to result in an OSD being marked out. First I would try to
get to the root cause of the crash. Of course, if there are signs of a
disk failure it's only a matter of time until the OSD won't recover.
But since there are many other things that can kill a process, I would
want Ceph to try to bring the OSDs back online. I think the defaults
are a reasonable compromise, although one might argue about the
specific values, of course.
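To narrow down the root cause I would start with the crash reports and
the daemon logs, roughly like this (the crash id, fsid, OSD id and
device are placeholders, adjust them to your environment):
---snip---
# list recent daemon crashes and show the backtrace of one of them
ceph crash ls
ceph crash info <crash-id>
# journal of the crashing OSD container
journalctl -u ceph-<fsid>@osd.<id> --since "1 hour ago"
# check the drive's health
smartctl -a /dev/sdX
---snip---
If the disk really is dying you don't have to wait for the OSD to be
marked out automatically, you can mark it out yourself with "ceph osd
out <id>" and let the cluster rebalance.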
Regards,
Eugen
Quoting "Frank de Bot (lists)" <lists@xxxxxxxxxxx>:
Hi,
I have a small containerized Ceph cluster rolled out with
ceph-ansible. The WAL and DB of each drive are on a separate NVMe
drive, the data is on spinning SAS disks. The cluster is running
16.2.7.
Today a disk failed, but not quite catastrophically. The block device
is present and the LVM metadata is good, but reading certain blocks
gives 'Sense: Unrecovered read error' in the syslog (SMART indicates
the drive is failing). The OSD crashes on reading/writing.
But the container kept restarting and crashing until manual
intervention was done. Because of that the faulty OSD was flapping up
and down, so it was never marked out and the cluster never rebalanced.
I could set StartLimitIntervalSec and StartLimitBurst in the OSD
service file, but they are not there by default and I'd like to keep
everything as standard as possible.
What are other options to prevent OSD containers from trying to
restart after a valid crash?
Regards,
Frank de Bot
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx