On an HDD-based Quincy 17.2.5 cluster (with DB/WAL on datacenter-class NVMe with enhanced power loss protection), I sometimes (once or twice per week) see log entries similar to what I reproduced below (a bit trimmed):

Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:41:54.429+0000 7f83d813d700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after 15.000000954s
Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:41:54.429+0000 7f83d793c700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after 15.000000954s
Wed 2023-09-06 22:41:54 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:41:54.469+0000 7f83d713b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after 15.000000954s
<many more repeats>
Wed 2023-09-06 22:42:04 UTC mds02 ceph-mgr@mds02.service[4954]: 2023-09-06T22:42:04.735+0000 7ffab1a20700 0 log_channel(cluster) log [DBG] : pgmap v227303: 5449 pgs: 1 active+clean+laggy, 7 active+clean+snaptrim, 34 active+remapped+backfilling, 13 active+clean+scrubbing, 4738 active+clean, 19 active+clean+scrubbing+deep, 637 active+remapped+backfill_wait; 2.0 PiB data, 2.9 PiB used, 2.2 PiB / 5.2 PiB avail; 103 MiB/s rd, 25 MiB/s wr, 2.76k op/s; 923967567/22817724437 objects misplaced (4.049%); 128 MiB/s, 142 objects/s recovering
<some more osd_op_tp threads timed out>
Wed 2023-09-06 22:42:06 UTC mds02 ceph-mgr@mds02.service[4954]: 2023-09-06T22:42:06.767+0000 7ffab1a20700 0 log_channel(cluster) log [DBG] : pgmap v227304: 5449 pgs: 2 active+clean+laggy, 7 active+clean+snaptrim, 34 active+remapped+backfilling, 13 active+clean+scrubbing, 4737 active+clean, 19 active+clean+scrubbing+deep, 637 active+remapped+backfill_wait; 2.0 PiB data, 2.9 PiB used, 2.2 PiB / 5.2 PiB avail; 88 MiB/s rd, 23 MiB/s wr, 3.85k op/s; 923967437/22817727283 objects misplaced (4.049%); 115 MiB/s, 126 objects/s recovering
<some more osd_op_tp threads timed out>
Wed 2023-09-06 22:42:15 UTC mds02 ceph-mgr@mds02.service[4954]: 2023-09-06T22:42:15.311+0000 7ffab1a20700 0 log_channel(cluster) log [DBG] : pgmap v227308: 5449 pgs: 1 active+remapped+backfill_wait+laggy, 3 active+clean+laggy, 7 active+clean+snaptrim, 34 active+remapped+backfilling, 13 active+clean+scrubbing, 4736 active+clean, 19 active+clean+scrubbing+deep, 636 active+remapped+backfill_wait; 2.0 PiB data, 2.9 PiB used, 2.2 PiB / 5.2 PiB avail; 67 MiB/s rd, 13 MiB/s wr, 3.21k op/s; 923966538/22817734196 objects misplaced (4.049%); 134 MiB/s, 137 objects/s recovering
<some more osd_op_tp threads timed out>
Wed 2023-09-06 22:42:22 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:42:22.708+0000 7f83b9ec1700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f83b9ec1700' had timed out after 15.000000954s
<no more osd_op_tp threads time out>
Wed 2023-09-06 22:42:24 UTC ceph-osd06 ceph-osd@12.service[5527]: 2023-09-06T22:42:24.637+0000 7f4fbfcc3700 -1 osd.12 1031871 heartbeat_check: no reply from 10.3.0.9:6864 osd.39 since back 2023-09-06T22:41:54.079627+0000 front 2023-09-06T22:41:54.079742+0000 (oldest deadline 2023-09-06T22:42:24.579224+0000)
Wed 2023-09-06 22:42:24 UTC ceph-osd08 ceph-osd@126.service[5505]: 2023-09-06T22:42:24.671+0000 7feb033d8700 -1 osd.126 1031871 heartbeat_check: no reply from 10.3.0.9:6864 osd.39 since back 2023-09-06T22:41:53.996908+0000 front 2023-09-06T22:41:53.997184+0000 (oldest deadline 2023-09-06T22:42:24.496552+0000)
Wed 2023-09-06 22:42:24 UTC ceph-osd06 ceph-osd@75.service[5532]: 2023-09-06T22:42:24.797+0000 7fe15bc77700 -1 osd.75 1031871 heartbeat_check: no reply from 10.3.0.9:6864 osd.39 since back 2023-09-06T22:41:49.384037+0000 front 2023-09-06T22:41:49.384031+0000 (oldest deadline 2023-09-06T22:42:24.683399+0000)
<some more "no reply" complaints - curiously, most of them come from the old part of the cluster, to which the timed-out OSD also belongs>
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:25.669+0000 7f8f72216700 1 mon.mds01@0(leader).osd e1031871 prepare_failure osd.39 v1:10.3.0.9:6860/5574 from osd.1 is reporting failure:1
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:25.669+0000 7f8f72216700 0 log_channel(cluster) log [DBG] : osd.39 reported failed by osd.1
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:25.905+0000 7f8f72216700 1 mon.mds01@0(leader).osd e1031871 prepare_failure osd.39 v1:10.3.0.9:6860/5574 from osd.408 is reporting failure:1
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:25.905+0000 7f8f72216700 0 log_channel(cluster) log [DBG] : osd.39 reported failed by osd.408
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:25.905+0000 7f8f72216700 1 mon.mds01@0(leader).osd e1031871 we have enough reporters to mark osd.39 down
Wed 2023-09-06 22:42:25 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:25.905+0000 7f8f72216700 0 log_channel(cluster) log [INF] : osd.39 failed (root=default,host=ceph-osd09) (2 reporters from different host after 32.232905 >= grace 30.000000)
Wed 2023-09-06 22:42:26 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:26.445+0000 7f8f74a1b700 0 log_channel(cluster) log [WRN] : Health check failed: 1 osds down (OSD_DOWN)
Wed 2023-09-06 22:42:26 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:26.453+0000 7f8f70212700 1 mon.mds01@0(leader).osd e1031872 e1031872: 459 total, 437 up, 438 in
Wed 2023-09-06 22:42:26 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:26.453+0000 7f8f70212700 0 log_channel(cluster) log [DBG] : osdmap e1031872: 459 total, 437 up, 438 in
<other MONs report the same>
Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:42:26.532+0000 7f83d813d700 0 --1- v1:10.3.0.9:6862/5574 >> v1:10.3.0.7:6857/5579 conn(0x55cd63342000 0x55cd63344000 :-1 s=OPENED pgs=12691 cs=1079 l=0).fault initiating reconnect
Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:42:26.532+0000 7f83d813d700 0 --1- v1:10.3.0.9:6862/5574 >> v1:10.3.0.7:6857/5579 conn(0x55cd63342000 0x55cd63344000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=12691 cs=1080 l=0).handle_connect_reply_2 connect got RESETSESSION
Wed 2023-09-06 22:42:26 UTC ceph-osd07 ceph-osd@91.service[5579]: 2023-09-06T22:42:26.529+0000 7fbf8f3ed700 0 --1- v1:10.3.0.7:6857/5579 >> v1:10.3.0.9:6862/5574 conn(0x5588d67cac00 0x5588f210b000 :6857 s=ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_message_2 accept we reset (peer sent cseq 1080), sending RESETSESSION
<some more reconnects and RESETSESSIONs>
Wed 2023-09-06 22:42:26 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:26.769+0000 7f8f72216700 0 log_channel(cluster) log [INF] : osd.39 marked itself dead as of e1031872
Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:42:26.768+0000 7f83c4ed7700 0 log_channel(cluster) log [WRN] : Monitor daemon marked osd.39 down, but it is still running
Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:42:26.768+0000 7f83c4ed7700 0 log_channel(cluster) log [DBG] : map e1031872 wrongly marked me down at e1031872
Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:42:26.768+0000 7f83c4ed7700 1 osd.39 1031872 start_waiting_for_healthy
Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:42:26.772+0000 7f83c4ed7700 1 osd.39 1031872 is_healthy false -- only 0/12 up peers (less than 33%)
Wed 2023-09-06 22:42:26 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:42:26.772+0000 7f83c4ed7700 1 osd.39 1031872 not healthy; waiting to boot
Wed 2023-09-06 22:42:27 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:27.481+0000 7f8f70212700 1 mon.mds01@0(leader).osd e1031873 e1031873: 459 total, 437 up, 438 in

Then osd.39 comes up by itself:

Wed 2023-09-06 22:42:28 UTC ceph-osd09 ceph-osd@39.service[5574]: 2023-09-06T22:42:28.516+0000 7f83d20fe700 1 osd.39 1031873 tick checking mon for new map
Wed 2023-09-06 22:42:28 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:28.541+0000 7f8f70212700 0 log_channel(cluster) log [INF] : osd.39 v1:10.3.0.9:6860/5574 boot
Wed 2023-09-06 22:42:28 UTC mds01 ceph-mon@mds01.service[3258]: 2023-09-06T22:42:28.541+0000 7f8f70212700 0 log_channel(cluster) log [DBG] : osdmap e1031874: 459 total, 438 up, 438 in

1) Is this the same issue?

2) It is strange that the majority (but not all) of the OSDs reporting "no reply" come from the old part of the cluster. Is there a debug option that would let us discriminate between automatic compaction and network issues? At which debug level are automatic compactions announced?

I am asking question (2) because, prior to the first osd_op_tp timeout, there is nothing in the logs that announces a compaction.
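For what it is worth, here is what I plan to try the next time this happens, to tell a compaction stall apart from a network problem. This is only a rough sketch, not a recommendation: the debug_rocksdb level of 10/10 is a guess (which is exactly what question 2 is about), and the "ceph daemon" commands have to be run on the node hosting osd.39 (ceph-osd09 in the logs above).

  # note the current level so it can be restored later
  ceph daemon osd.39 config get debug_rocksdb

  # make RocksDB logging more verbose on the affected OSD only
  # (10/10 is a guess; the right level for seeing automatic compactions is my open question)
  ceph tell osd.39 config set debug_rocksdb 10/10

  # after the next osd_op_tp timeout, look at the compaction-related counters
  # (I expect the rocksdb section of perf dump to carry them)
  ceph daemon osd.39 perf dump | grep -i compact

  # and at the heartbeat ping times this OSD sees, to rule the network in or out
  # (0 = no threshold, i.e. list everything, if I read the help output correctly)
  ceph daemon osd.39 dump_osd_network 0

  # afterwards, put debug_rocksdb back to whatever "config get" reported
  # (4/5 here, which I believe is the default)
  ceph tell osd.39 config set debug_rocksdb 4/5

Corrections to the above are welcome, of course.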
On Thu, Sep 7, 2023 at 5:08 PM J-P Methot <jp.methot@xxxxxxxxxxxxxxxxx> wrote:
>
> We're talking about automatic online compaction here, not running the
> command.
>
> On 9/7/23 04:04, Konstantin Shalygin wrote:
> > Hi,
> >
> >> On 7 Sep 2023, at 10:05, J-P Methot <jp.methot@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> We're running latest Pacific on our production cluster and we've been
> >> seeing the dreaded 'OSD::osd_op_tp thread 0x7f346aa64700' had timed
> >> out after 15.000000954s' error. We have reasons to believe this
> >> happens each time the RocksDB compaction process is launched on an
> >> OSD. My question is, does the cluster detecting that an OSD has timed
> >> out interrupt the compaction process? This seems to be what's
> >> happening, but it's not immediately obvious. We are currently facing
> >> an infinite loop of random OSDs timing out and if the compaction
> >> process is interrupted without finishing, it may explain that.
> >
> > You run the online compacting for this OSD's (`ceph osd compact
> > ${osd_id}` command), right?
> >
> >
> >
> > k
>
> --
> Jean-Philippe Méthot
> Senior Openstack system administrator
> Administrateur système Openstack sénior
> PlanetHoster inc.
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx