hi there,

we are seeing osds occasionally getting kicked out of our cluster after having been marked down by other osds. most of the time the affected osd rejoins the cluster after ~5 minutes, but sometimes it takes much longer. during that time the osd seems to run just fine. this happens more often than we'd like it to …

is "OSD::osd_op_tp thread … had timed out" a real error condition, or just a warning that certain operations on the osd are taking a long time? i already set osd_op_thread_timeout to 120 (it was 60 before; the default should be 15 according to the docs), but apparently that doesn't make any difference (rough commands in the p.s. below). are there any other settings that prevent this kind of behaviour? mon_osd_report_timeout maybe, as in frank schilder's case?

the cluster runs nautilus 14.2.7; the osds are backed by spinning platters with their rocksdb and wals on nvmes.

in general, there seems to be the following pattern:

- it happens under moderate to heavy load, e.g. while creating pools with a lot of pgs

- the affected osd logs a lot of:

"heartbeat_map is_healthy 'OSD::osd_op_tp thread ${thread-id}' had timed out after 60"

… and finally something along the lines of:

May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.211 7fb25cc80700 0 bluestore(/var/lib/ceph/osd/ceph-293) log_latency_fn slow operation observed for _collection_list, latency = 96.337s, lat = 96s cid =2.0s2_head start GHMAX end GHMAX max 30
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.219 7fb25cc80700 1 heartbeat_map clear_timeout 'OSD::osd_op_tp thread 0x7fb25cc80700' had timed out after 60
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: osd.293 osd.293 2 : Monitor daemon marked osd.293 down, but it is still running
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315 7fb267c96700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.293 down, but it is still running
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315 7fb267c96700 0 log_channel(cluster) do_log log to syslog
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315 7fb267c96700 0 log_channel(cluster) log [DBG] : map e646639 wrongly marked me down at e646638
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.315 7fb267c96700 0 log_channel(cluster) do_log log to syslog
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.371 7fb272cac700 -1 osd.293 646639 set_numa_affinity unable to identify public interface 'br-bond0' numa node: (2) No such file or directory
May 18 21:12:34 ceph-osd-05 ceph-osd[2356578]: 2020-05-18 21:12:34.371 7fb272cac700 -1 osd.293 646639 set_numa_affinity unable to identify public interface 'br-bond0' numa node: (2) No such file or directory

- meanwhile, on the mon:

2020-05-18 21:12:16.440 7f08f7933700 0 mon.ceph-mon-01@0(leader) e4 handle_command mon_command({"prefix": "status"} v 0) v1 entity='client.admin' cmd=[{"prefix": "status"}]: dispatch
2020-05-18 21:12:18.436 7f08f7933700 0 log_channel(cluster) log [DBG] : osd.293 reported failed by osd.101
2020-05-18 21:12:18.848 7f08f7933700 0 log_channel(cluster) log [DBG] : osd.293 reported failed by osd.533
[… lots of these from various osds]
2020-05-18 21:12:24.992 7f08f7933700 0 log_channel(cluster) log [DBG] : osd.293 reported failed by osd.421
2020-05-18 21:12:26.124 7f08f7933700 0 log_channel(cluster) log [DBG] : osd.293 reported failed by osd.504
2020-05-18 21:12:26.132 7f08f7933700 0 log_channel(cluster) log [INF] : osd.293 failed (root=tuberlin,datacenter=barz,host=ceph-osd-05) (16 reporters from different host after 27.137527 >= grace 26.361774)
2020-05-18 21:12:26.236 7f08fa138700 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN)
2020-05-18 21:12:26.280 7f08f6130700 0 log_channel(cluster) log [DBG] : osdmap e646638: 604 total, 603 up, 604 in
2020-05-18 21:12:27.336 7f08f6130700 0 log_channel(cluster) log [DBG] : osdmap e646639: 604 total, 603 up, 604 in
2020-05-18 21:12:28.248 7f08fa138700 0 log_channel(cluster) log [WRN] : Health check failed: Reduced data availability: 17 pgs peering (PG_AVAILABILITY)
2020-05-18 21:12:29.392 7f08fa138700 0 log_channel(cluster) log [WRN] : Health check failed: Degraded data redundancy: 80091/181232010 objects degraded (0.044%), 18 pgs degraded (PG_DEGRADED)
2020-05-18 21:12:33.927 7f08fa138700 0 log_channel(cluster) log [INF] : Health check cleared: PG_AVAILABILITY (was: Reduced data availability: 1 pg inactive, 22 pgs peering)
2020-05-18 21:12:35.095 7f08fa138700 0 log_channel(cluster) log [INF] : Health check cleared: OSD_DOWN (was: 1 osds down)
2020-05-18 21:12:35.119 7f08f6130700 0 log_channel(cluster) log [INF] : osd.293 [v2:172.28.9.26:6936/2356578,v1:172.28.9.26:6937/2356578] boot
2020-05-18 21:12:35.119 7f08f6130700 0 log_channel(cluster) log [DBG] : osdmap e646640: 604 total, 604 up, 604 in
2020-05-18 21:12:36.175 7f08f6130700 0 log_channel(cluster) log [DBG] : osdmap e646641: 604 total, 604 up, 604 in

i can happily provide more detailed logs if that helps.

thank you very much & with kind regards,
thoralf.
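p.s.: for reference, this is roughly how i changed / checked the settings mentioned above (nautilus centralized config; paraphrased from my shell history, so treat it as a sketch rather than an exact transcript):

  # bump the op thread timeout for all osds (this is the change i already made, 60 -> 120)
  ceph config set osd osd_op_thread_timeout 120

  # confirm what a running osd actually picked up (via the admin socket, run on ceph-osd-05)
  ceph daemon osd.293 config get osd_op_thread_timeout

  # the option from frank schilder's thread, which i have not touched yet
  ceph config get mon mon_osd_report_timeout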
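p.p.s.: if op-level detail from the affected osd would help, i can capture it via the admin socket the next time this happens, along these lines (again just a sketch, run on the osd host while the heartbeat_map timeouts are being logged):

  # ops currently in flight / stuck in the osd
  ceph daemon osd.293 dump_ops_in_flight

  # recently completed ops with their per-stage latencies
  ceph daemon osd.293 dump_historic_ops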