OSD crashes during upgrade mimic->octopus

Hi all,

we are stuck in a really unpleasant situation and would appreciate help. Yesterday we completed the ceph daemon upgrade from mimic to octopus all the way through with bluestore_fsck_quick_fix_on_mount = false and started the OSD OMAP conversion this morning. Everything went well at the beginning. The conversion went much faster than expected and the OSDs slowly came back up. Unfortunately, trouble was just around the corner.
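
(For context, the conversion setup looks roughly like this; the config-set form is equivalent to the ceph.conf entry, and osd.959 / its mount path below are only examples. With the flag set to false, a stopped OSD can be converted explicitly via an offline repair, as described for the BLUESTORE_NO_PER_POOL_OMAP warning:)

# ceph config set osd bluestore_fsck_quick_fix_on_mount false
# ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-959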

We have 12 hosts with 2 SSDs each (4 OSDs per SSD) and >65 HDDs. On the host where we started the conversion, the OSDs on the SSDs either crashed or didn't come up. These OSDs are part of our FS metadata pool, which is replicated with size 4, min_size 2. So far, so unusual.
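
(For completeness, the pool settings can be confirmed like this; "con-fs2-meta" is only a placeholder for the metadata pool name:)

# ceph osd pool get con-fs2-meta size
# ceph osd pool get con-fs2-meta min_size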

The problems now are:
  - I cannot restart the crashed OSDs, because a D-state LVM process is blocking access to the drives, and
  - OSDs on other hosts in that pool have also started crashing. And they crash badly (they cannot be restarted either). The last log lines of these OSD processes look something like this:

2022-10-06T12:21:09.473+0200 7f18a1ddd700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1885d35700' had timed out after 15
2022-10-06T12:21:09.473+0200 7f18a1ddd700  1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1886536700' had timed out after 15
2022-10-06T12:21:12.526+0200 7f18a25de700  0 --1- [v2:192.168.16.78:6886/3645427,v1:192.168.16.78:6888/3645427] >> v1:192.168.16.73:6903/861744 conn(0x55f32322e800 0x55f32320a000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=330 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
2022-10-06T12:21:12.526+0200 7f18a1ddd700  0 --1- [v2:192.168.16.78:6886/3645427,v1:192.168.16.78:6888/3645427] >> v1:192.168.16.81:6891/3651159 conn(0x55f341dae000 0x55f30c65b000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=117 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
2022-10-06T12:21:13.437+0200 7f18a1ddd700  0 --1- [v2:192.168.16.78:6886/3645427,v1:192.168.16.78:6888/3645427] >> v1:192.168.16.77:6917/3645255 conn(0x55f323b44000 0x55f325576000 :-1 s=CONNECTING_SEND_CONNECT_MSG pgs=306 cs=1 l=0).handle_connect_reply_2 connect got RESETSESSION
2022-10-06T12:21:45.224+0200 7f189054a700  4 rocksdb: [db/db_impl.cc:777] ------- DUMPING STATS -------
2022-10-06T12:21:45.224+0200 7f189054a700  4 rocksdb: [db/db_impl.cc:778]
** DB Stats **

With the stats following. At this point the OSD gets stuck. Trying to stop it results in these log lines:

2022-10-06T12:26:08.990+0200 7f189fdd9700 -1 received  signal: Terminated from  (PID: 3728898) UID: 0
2022-10-06T12:26:08.990+0200 7f189fdd9700 -1 osd.959 900605 *** Got signal Terminated ***
2022-10-06T12:26:08.990+0200 7f189fdd9700  0 osd.959 900605 prepare_to_stop telling mon we are shutting down and dead 
2022-10-06T12:26:13.990+0200 7f189fdd9700  0 osd.959 900605 prepare_to_stop starting shutdown

and the OSD process gets stuck in Dl-state. It is not possible to terminate the process. I'm slowly losing redundancy and have already lost service:
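
(For what it's worth, this is roughly how I'm spotting the stuck processes on an affected host; <pid> is a placeholder:)

# ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
# cat /proc/<pid>/stack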

# ceph status
  cluster:
    id:     e4ece518-f2cb-4708-b00f-b6bf511e91d9
    health: HEALTH_ERR
            953 OSD(s) reporting legacy (not per-pool) BlueStore stats
            953 OSD(s) reporting legacy (not per-pool) BlueStore omap usage stats
            nosnaptrim flag(s) set
            1 scrub errors
            Reduced data availability: 22 pgs inactive
            Possible data damage: 1 pg inconsistent
            Degraded data redundancy: 130313597/11974669139 objects degraded (1.088%), 1111 pgs degraded, 1120 pgs undersized
 
  services:
    mon: 5 daemons, quorum ceph-01,ceph-02,ceph-03,ceph-25,ceph-26 (age 18h)
    mgr: ceph-25(active, since 26h), standbys: ceph-26, ceph-03, ceph-02, ceph-01
    mds: con-fs2:1 {0=ceph-11=up:active} 11 up:standby
    osd: 1086 osds: 1038 up (since 34m), 1038 in (since 24m); 1120 remapped pgs
         flags nosnaptrim
 
  task status:
 
  data:
    pools:   14 pools, 17375 pgs
    objects: 1.39G objects, 2.5 PiB
    usage:   3.1 PiB used, 8.3 PiB / 11 PiB avail
    pgs:     0.127% pgs not active
             130313597/11974669139 objects degraded (1.088%)
             339825/11974669139 objects misplaced (0.003%)
             16238 active+clean
             747   active+undersized+degraded+remapped+backfilling
             342   active+undersized+degraded+remapped+backfill_wait
             22    forced_recovery+undersized+degraded+remapped+backfilling+peered
             16    active+clean+scrubbing+deep
             5     active+undersized+remapped+backfilling
             4     active+undersized+remapped+backfill_wait
             1     active+clean+inconsistent
 
  io:
    client:   77 MiB/s rd, 42 MiB/s wr, 2.58k op/s rd, 1.81k op/s wr
    recovery: 8.5 MiB/s, 2.52k keys/s, 3.39k objects/s

If the SSDs on yet another host go down, we are stuck. Right now I'm hoping recovery gets the inactive PGs up again, but it is taking a really long time.
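
(For reference, this is roughly how the inactive PGs can be listed and bumped to the front of the recovery queue; the pg IDs are placeholders:)

# ceph pg dump_stuck inactive
# ceph pg force-recovery <pgid> [<pgid> ...]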

Any ideas as to why the OSDs crash so badly into D-state, and what could prevent that, are very much appreciated. I already have unstartable OSDs on 2 hosts and any further failure will spell doom. I'm also afraid that peering load is one of the contributing factors and am very reluctant to reboot hosts to clear the D-state processes. I really don't want to play this whack-a-mole game.
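
(In case it matters for suggestions: if a reboot becomes unavoidable, the usual dampening flags would be set beforehand and removed once the host's OSDs are back, roughly like this; not applied yet:)

# ceph osd set noout
# ceph osd set norebalance
(reboot, wait for the OSDs to rejoin, then:)
# ceph osd unset noout
# ceph osd unset norebalance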

Thanks for your help and best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14