Hi,
these are the defaults set by cephadm in Octopus and Pacific:
---snip---
[Service]
LimitNOFILE=1048576
LimitNPROC=1048576
EnvironmentFile=-/etc/environment
ExecStart=/bin/bash {data_dir}/{fsid}/%i/unit.run
ExecStop=-{container_path} stop ceph-{fsid}-%i
ExecStopPost=-/bin/bash {data_dir}/{fsid}/%i/unit.poststop
KillMode=none
Restart=on-failure
RestartSec=10s
TimeoutStartSec=120
TimeoutStopSec=120
StartLimitInterval=30min
StartLimitBurst=5
---snip---
So there are StartLimit options.
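If you want different values for a single OSD you could probably use a
systemd drop-in instead of touching the generated unit file; something
like this should work (untested, the fsid and OSD id are placeholders
for your environment):
---snip---
# systemctl edit ceph-<fsid>@osd.<id>.service
# this creates /etc/systemd/system/ceph-<fsid>@osd.<id>.service.d/override.conf
[Unit]
# on a recent systemd these options belong in the [Unit] section;
# give up after 3 failed starts within 30 minutes
StartLimitIntervalSec=30min
StartLimitBurst=3
---snip---
You could also set Restart=no in the [Service] section if you don't
want systemd to restart the daemon at all after a crash.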
> What are other options to prevent OSD containers from trying to
> restart after a valid crash?
The question is how you determine a "valid" crash. I wouldn't want the
first crash to result in an OSD being marked out. First I would try to
get to the root cause of the crash. Of course, if there are signs of a
disk failure it's only a matter of time until the OSD won't recover.
But since there are many other things that can kill a process, I would
want Ceph to try to bring the OSDs back online. I think the defaults
are a reasonable compromise, although one might argue about the
specific values, of course.
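To narrow down the root cause I would start with the crash reports and
the daemon logs, roughly like this (the crash id, fsid, OSD id and
device are placeholders, adjust them to your environment):
---snip---
# list recent daemon crashes and show the backtrace of one of them
ceph crash ls
ceph crash info <crash-id>
# journal of the crashing OSD container
journalctl -u ceph-<fsid>@osd.<id> --since "1 hour ago"
# check the drive's health
smartctl -a /dev/sdX
---snip---
If the disk really is dying you don't have to wait for the OSD to be
marked out automatically, you can mark it out yourself with "ceph osd
out <id>" and let the cluster rebalance.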
Regards,
Eugen
Quoting "Frank de Bot (lists)" <lists@xxxxxxxxxxx>:
Hi,
I have a small containerized Ceph cluster rolled out with
ceph-ansible. The WAL and DB of each drive are on a separate NVMe
drive, the data is on spinning SAS disks. The cluster is running
16.2.7.
Today a disk failed, but not quite catastrophically. The block device
is present and the LVM metadata is good, but reading certain blocks
gives 'Sense: Unrecovered read error' in the syslog (SMART indicates
the drive is failing). The OSD crashes on reading/writing.
But the container kept restarting and crashing until manual
intervention was done. Because of that the faulty OSD was flapping up
and down, so it was never marked out and the cluster never rebalanced.
I could set StartLimitIntervalSec and StartLimitBurst in the OSD
service file, but they are not there by default and I'd like to keep
everything as standard as possible.
What are other options to prevent OSD containers from trying to
restart after a valid crash?
Regards,
Frank de Bot
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx