Re: _committed_osd_maps shutdown OSD via async signal, bug or feature?

Gregory Farnum <gfarnum@xxxxxxxxxx> · Thu, 05 Oct 2017 17:41:33 +0000

On Thu, Oct 5, 2017 at 6:48 AM Stefan Kooman <stefan@xxxxxx> wrote:
Hi,

During testing (mimicking BGP / port flaps) on our cluster we are able
to trigger a "_committed_osd_maps shutdown OSD via async signal" on the
affected OSD servers in that datacenter (OSDs in that DC become
intermittently isolated from their peers). The result is that all OSD
processes stop. Is this a bug or a feature? I.e. is there a "flap"
detection mechanism in Ceph OSD?

If it's a bug it might be related to
http://tracker.ceph.com/issues/20174. We get a similar error message on
"12.2.0". Version "12.2.1" does not log

"-1 Fail to open '/proc/0/cmdline' error = (2) No such file or directory
-1 received  signal: Interrupt from  PID: 0 task name: <unknown> UID: 0
-1 osd.21 1846 *** Got signal Interrupt ***
0 osd.21 1846 prepare_to_stop starting shutdown
-1 osd.21 1846 shutdown"

That's a feature, but invoking it may indicate the presence of another issue. The OSD shuts down if
1) it has been deleted from the cluster, or
2) it has been incorrectly marked down a bunch of times by the cluster, and gives up, or
3) it has been incorrectly marked down by the cluster, and encounters an error when it rebinds to new network ports

In your case, with the port flapping, OSDs are presumably getting marked down by their peers (since they can't communicate), and eventually give up on trying to stay alive. You can prevent/reduce that by setting the osd_max_markdown_count config to a very large number, if you really want to.
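
To make the "gives up" behaviour in case 2 concrete, here is a rough Python sketch of a sliding-window markdown counter. It is an illustration only, not Ceph's actual code: the class and method names are made up, and the window length (modelled on the osd_max_markdown_period option) and the default values of 5 markdowns per 600 seconds are assumptions that should be checked against the documentation for your release.

```python
from collections import deque


class MarkdownTracker:
    """Hypothetical sliding-window counter of 'wrongly marked down' events."""

    def __init__(self, max_count=5, period=600.0):
        # max_count stands in for osd_max_markdown_count; period is an
        # assumed window length in seconds. Both defaults are assumptions.
        self.max_count = max_count
        self.period = period
        self.log = deque()  # timestamps of recent markdowns, oldest first

    def record_markdown(self, now):
        """Record a markdown at time `now` (seconds).

        Returns True if the daemon should give up and shut down.
        """
        self.log.append(now)
        # Forget markdowns that have fallen out of the window.
        while self.log and self.log[0] + self.period < now:
            self.log.popleft()
        return len(self.log) > self.max_count


# A link flapping once a minute: the sixth markdown inside the window
# exceeds the limit of five, so the last decision is "shut down".
tracker = MarkdownTracker(max_count=5, period=600.0)
decisions = [tracker.record_markdown(t) for t in (0, 60, 120, 180, 240, 300)]
print(decisions)
```

In this model, raising max_count to a very large number means record_markdown never returns True, which is the effect of the workaround described above. The OSD then keeps trying to rejoin however often it is marked down.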
-Greg
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com