Hi Frank,
you might want to compact RocksDB with ceph-kvstore-tool for those OSDs
which are showing
"heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 15"
I have seen such errors pretty often after bulk data removal and the
severe DB performance drop that follows.
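Something along these lines (the OSD id and data path are placeholders, adjust them to your deployment; the OSD has to be stopped while the tool runs):

systemctl stop ceph-osd@<id>
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-<id> compact
systemctl start ceph-osd@<id>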
Thanks,
Igor
On 10/6/2022 2:06 PM, Frank Schilder wrote:
Hi all,
we are stuck in a really unpleasant situation and would appreciate help. Yesterday we completed the ceph daemon upgrade from mimic to octopus all the way through with bluestore_fsck_quick_fix_on_mount = false, and this morning we started the OSD OMAP conversion. Everything went well at the beginning. The conversion went much faster than expected and the OSDs slowly came back up. Unfortunately, trouble was just around the corner.
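For context, the conversion is triggered roughly like this per OSD (the id below is a placeholder; the conversion then runs on the next start of the OSD):

ceph config set osd bluestore_fsck_quick_fix_on_mount true
systemctl restart ceph-osd@<id>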
We have 12 hosts, each with 2 SSDs (4 OSDs per disk) and >65 HDDs. On the host where we started the conversion, the OSDs on the SSDs either crashed or didn't come up. These OSDs are part of our FS metadata pool, which is replicated 4(2). So far so unusual.
The problems now are:
- I cannot restart the crashed OSDs, because a D-state LVM process is blocking access to the drives, and
- OSDs on other hosts in that pool also started crashing. And they crash badly (they cannot be restarted either). The OSD processes' last log lines look something like this:
2022-10-06T12:21:09.473+0200 7f18a1ddd700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1885d35700' had timed out after 15
2022-10-06T12:21:09.473+0200 7f18a1ddd700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 15
2022-10-06T12:21:12.526+0200 7f18a25de700 0 --1- [v2:192.168.16.78:6886/3645427,v1:192.168.16.78:6888/3645427] >> v1:192.168.16.73:6903/861744 conn(0x55f32322e800 0x55f32320a000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=330 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
2022-10-06T12:21:12.526+0200 7f18a1ddd700 0 --1- [v2:192.168.16.78:6886/3645427,v1:192.168.16.78:6888/3645427] >> v1:192.168.16.81:6891/3651159 conn(0x55f341dae000 0x55f30c65b000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=117 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
2022-10-06T12:21:13.437+0200 7f18a1ddd700 0 --1- [v2:192.168.16.78:6886/3645427,v1:192.168.16.78:6888/3645427] >> v1:192.168.16.77:6917/3645255 conn(0x55f323b44000 0x55f325576000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=306 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
2022-10-06T12:21:45.224+0200 7f189054a700 4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS -------
2022-10-06T12:21:45.224+0200 7f189054a700 4 rocksdb: [db/db_impl.cc:778]
** DB Stats **
The stats follow, and at this point the OSD gets stuck. Trying to stop the OSD results in these log lines:
2022-10-06T12:26:08.990+0200 7f189fdd9700 -1 received signal: Terminated from (PID: 3728898) UID: 0
2022-10-06T12:26:08.990+0200 7f189fdd9700 -1 osd.959 900605 *** Got signal Terminated ***
2022-10-06T12:26:08.990+0200 7f189fdd9700 0 osd.959 900605 prepare_to_stop telling mon we are shutting down and dead
2022-10-06T12:26:13.990+0200 7f189fdd9700 0 osd.959 900605 prepare_to_stop starting shutdown
and the OSD process gets stuck in Dl-state. It is not possible to terminate the process. I'm slowly losing redundancy and have already lost service:
# ceph status
  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_ERR
            953 OSD(s) reporting legacy (not per-pool) BlueStore stats
            953 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats
            nosnaptrim flag(s) set
            1 scrub errors
            Reduced data availability: 22 pgs inactive
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 130313597/11974669139 objects degraded (1.088%), 1111 pgs degraded, 1120 pgs undersized

  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 18h)
    mgr: ceph-25(active, since 26h), standbys: ceph-26, ceph-03, ceph-02, ceph-01
    mds: con-fs2:1 {0=ceph-11=up:active} 11 up:standby
    osd: 1086 osds: 1038 up (since 34m), 1038 in (since 24m); 1120 remapped pgs
         flags nosnaptrim

  task status:

  data:
    pools:   14 pools, 17375 pgs
    objects: 1.39G objects, 2.5 PiB
    usage:   3.1 PiB used, 8.3 PiB / 11 PiB avail
    pgs:     0.127% pgs not active
             130313597/11974669139 objects degraded (1.088%)
             339825/11974669139 objects misplaced (0.003%)
             16238 active+clean
             747   active+undersized+degraded+remapped+backfilling
             342   active+undersized+degraded+remapped+backfill_wait
             22    forced_recovery+undersized+degraded+remapped+backfilling+peered
             16    active+clean+scrubbing+deep
             5     active+undersized+remapped+backfilling
             4     active+undersized+remapped+backfill_wait
             1     active+clean+inconsistent

  io:
    client:   77 MiB/s rd, 42 MiB/s wr, 2.58k op/s rd, 1.81k op/s wr
    recovery: 8.5 MiB/s, 2.52k keys/s, 3.39k objects/s
If the SSDs on yet another host go down, we are stuck. Right now I'm hoping that recovery gets the inactive PGs up again, but it is taking a really long time.
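In case it helps with suggestions, I'm keeping an eye on the stuck PGs with something like this (the pg id is a placeholder):

ceph pg dump_stuck inactive
ceph pg <pgid> query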
Any ideas as to why the OSDs crash so badly into D-state and what could prevent that are very much appreciated. I already have un-startable OSDs on 2 hosts and any further failure will spell doom. I'm also afraid that peering load is one of the factors and am very reluctant to reboot hosts to clear the D-state processes. I really don't want to play this whack-a-mole game.
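If kernel-level details of the hung processes would help, I can pull them with something like this (<pid> is a placeholder for a stuck OSD or LVM process; /proc/<pid>/stack needs root and a kernel that exposes it):

ps -eo pid,stat,wchan:40,comm | awk '$2 ~ /^D/'
cat /proc/<pid>/stack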
Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
--
Igor Fedotov
Ceph Lead Developer
Looking for help with your Ceph cluster? Contact us at https://croit.io
croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263
Web: https://croit.io | YouTube: https://goo.gl/PGE1Bx
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx