On Thu, Oct 5, 2017 at 6:48 AM Stefan Kooman <stefan@xxxxxx> wrote:
Hi,
During testing (mimicking BGP / port flaps) on our cluster we are able
to trigger a "_committed_osd_maps shutdown OSD via async signal" on the
affected OSD servers in that datacenter (OSDs in that DC become
intermittently isolated from their peers). The result is that all OSD
processes stop. Is this a bug or a feature? I.e. is there a "flap"
detection mechanism in Ceph OSD?
If it's a bug it might be related to
http://tracker.ceph.com/issues/20174. We get a similar error message on
"12.2.0". Version "12.2.1" does not log
"-1 Fail to open
'/proc/0/cmdline' error = (2) No such file or directory
-1 received signal: Interrupt from PID: 0 task name: <unknown> UID: 0
-1 osd.21 1846 *** Got signal Interrupt ***
0 osd.21 1846 prepare_to_stop starting shutdown
-1 osd.21 1846 shutdown"
That's a feature, but invoking it may indicate the presence of another issue. The OSD shuts down if:
1) it has been deleted from the cluster, or
2) it has been incorrectly marked down a bunch of times by the cluster, and gives up, or
3) it has been incorrectly marked down by the cluster, and encounters an error when it rebinds to new network ports
In your case, with the port flapping, OSDs are presumably getting marked down by their peers (since they can't communicate), and eventually give up on trying to stay alive (case 2 above). You can prevent or reduce that by setting the osd_max_markdown_count config option to a very large number, if you really want to.
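
To make the "marked down a bunch of times, and gives up" behaviour concrete, here is a minimal Python sketch of that kind of flap detection. It is an illustration under stated assumptions, not the actual OSD code: the class and method names are hypothetical, and it assumes markdowns are counted inside a sliding time window (in Ceph the window is, as far as I know, governed by osd_max_markdown_period alongside osd_max_markdown_count). The values 5 and 600 are illustrative; check the defaults for your release.

```python
from collections import deque


class MarkdownTracker:
    """Hypothetical model of an OSD counting how often it was wrongly marked down."""

    def __init__(self, max_count=5, period=600.0):
        # max_count stands in for osd_max_markdown_count (illustrative value).
        # period stands in for the time window in seconds (assumed behaviour).
        self.max_count = max_count
        self.period = period
        self.log = deque()  # timestamps of recent markdowns, oldest first

    def record_markdown(self, now):
        """Record a markdown at time `now`; return True if the OSD should give up."""
        self.log.append(now)
        # Drop markdowns that have fallen out of the sliding window.
        while self.log and self.log[0] + self.period < now:
            self.log.popleft()
        # Give up only when the count inside the window exceeds the limit.
        return len(self.log) > self.max_count


# Rapid flapping: the third markdown within 10 seconds exceeds max_count=2.
flapping = MarkdownTracker(max_count=2, period=10.0)
flapping_results = [flapping.record_markdown(t) for t in (0.0, 1.0, 2.0)]

# Widely spaced markdowns: old entries expire, so the OSD keeps restarting.
spaced = MarkdownTracker(max_count=2, period=10.0)
spaced_results = [spaced.record_markdown(t) for t in (0.0, 20.0, 40.0)]
```

In this model, raising max_count to a very large number means record_markdown never returns True, which corresponds to the workaround above: the OSD keeps trying to come back instead of shutting down.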
-Greg