hi there,

we are seeing osds occasionally getting kicked out of our cluster after having been marked down by other osds. most of the time the affected osd rejoins the cluster after ~5 minutes, but sometimes it takes much longer. during that time the osd seems to run just fine. this happens more often than we'd like it to …

is "OSD::osd_op_tp thread … had timed out" a real error condition, or just a warning that certain operations on the osd are taking a long time? i already set osd_op_thread_timeout to 120 (it was 60 before; the default should be 15 according to the docs), but apparently that doesn't make any difference (rough commands in the p.s. below). are there any other settings that prevent this kind of behaviour? mon_osd_report_timeout maybe, as in frank schilder's case?

the cluster runs nautilus 14.2.7; the osds are backed by spinning platters with their rocksdb and wals on nvmes.

in general, there seems to be the following pattern:

- it happens under moderate to heavy load, e.g. while creating pools with a lot of pgs

- the affected osd logs a lot of:

"heartbeat_map is_healthy 'OSD::osd_op_tp thread ${thread-id}' had timed out after 60"

… and finally something along the lines of:

May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.211 7fb25cc80700 0 bluestore(/var/lib/ceph/osd/ceph-293) log_latency_fn slow operation observed for _collection_list, latency = 96.337s, lat = 96s cid =2.0s2_head start GHMAX end GHMAX max 30
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.219 7fb25cc80700 1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7fb25cc80700' had timed out after 60
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: osd.293 osd.293 2 : Monitor daemon marked osd.293 down, but it is still running
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315 7fb267c96700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.293 down, but it is still running
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315 7fb267c96700 0 log_channel(cluster) do_log log to syslog
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315 7fb267c96700 0 log_channel(cluster) log [DBG] : map e646639 wrongly marked me down at e646638
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315 7fb267c96700 0 log_channel(cluster) do_log log to syslog
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.371 7fb272cac700 -1 osd.293 646639 set_numa_affinity unable to identify public interface 'br-bond0' numa node: (2) No such file or directory
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.371 7fb272cac700 -1 osd.293 646639 set_numa_affinity unable to identify public interface 'br-bond0' numa node: (2) No such file or directory

- meanwhile, on the mon:

2020-05-18 21:12:16.440 7f08f7933700 0 mon.ceph-mon-01@0(leader) e4 handle_command mon_command({"prefix": "status"} v 0) v1 entity='client.admin' cmd=[{"prefix": "status"}]: dispatch
2020-05-18 21:12:18.436 7f08f7933700 0 log_channel(cluster) log [DBG] : osd.293 reported failed by osd.101
2020-05-18 21:12:18.848 7f08f7933700 0 log_channel(cluster) log [DBG] : osd.293 reported failed by osd.533
[… lots of these from various osds]
2020-05-18 21:12:24.992 7f08f7933700 0 log_channel(cluster) log [DBG] : osd.293 reported failed by osd.421
2020-05-18 21:12:26.124 7f08f7933700 0 log_channel(cluster) log [DBG] : osd.293 reported failed by osd.504
2020-05-18 21:12:26.132 7f08f7933700 0 log_channel(cluster) log [INF] : osd.293 failed (root=tuberlin,datacenter=barz,host=ceph-osd-05) (16 reporters from different host after 27.137527 >= grace 26.361774)
2020-05-18 21:12:26.236 7f08fa138700 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN)
2020-05-18 21:12:26.280 7f08f6130700 0 log_channel(cluster) log [DBG] : osdmap e646638: 604 total, 603 up, 604 in
2020-05-18 21:12:27.336 7f08f6130700 0 log_channel(cluster) log [DBG] : osdmap e646639: 604 total, 603 up, 604 in
2020-05-18 21:12:28.248 7f08fa138700 0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 17 pgs peering (PG_AVAILABILITY)
2020-05-18 21:12:29.392 7f08fa138700 0 log_channel(cluster) log [WRN] : Health check failed: Degraded data redundancy: 80091/181232010 objects degraded (0.044%), 18 pgs degraded (PG_DEGRADED)
2020-05-18 21:12:33.927 7f08fa138700 0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 22 pgs peering)
2020-05-18 21:12:35.095 7f08fa138700 0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down)
2020-05-18 21:12:35.119 7f08f6130700 0 log_channel(cluster) log [INF] : osd.293 [v2:172.28.9.26:6936/2356578,v1:172.28.9.26:6937/2356578] boot
2020-05-18 21:12:35.119 7f08f6130700 0 log_channel(cluster) log [DBG] : osdmap e646640: 604 total, 604 up, 604 in
2020-05-18 21:12:36.175 7f08f6130700 0 log_channel(cluster) log [DBG] : osdmap e646641: 604 total, 604 up, 604 in

i can happily provide more detailed logs if that helps.

thank you very much & with kind regards,
thoralf.
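p.s.: for reference, this is roughly how i changed / checked the settings mentioned above (nautilus centralized config; paraphrased from my shell history, so treat it as a sketch rather than an exact transcript):

  # bump the op thread timeout for all osds (this is the change i already made, 60 -> 120)
  ceph config set osd osd_op_thread_timeout 120

  # confirm what a running osd actually picked up (via the admin socket, run on ceph-osd-05)
  ceph daemon osd.293 config get osd_op_thread_timeout

  # the option from frank schilder's thread, which i have not touched yet
  ceph config get mon mon_osd_report_timeout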
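p.p.s.: if op-level detail from the affected osd would help, i can capture it via the admin socket the next time this happens, along these lines (again just a sketch, run on the osd host while the heartbeat_map timeouts are being logged):

  # ops currently in flight / stuck in the osd
  ceph daemon osd.293 dump_ops_in_flight

  # recently completed ops with their per-stage latencies
  ceph daemon osd.293 dump_historic_ops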