Reminds me of https://tracker.ceph.com/issues/57007, which wasn't fixed in Pacific until 16.2.11, so this is probably just the result of a cephadm bug, unfortunately.

On Fri, Jun 23, 2023 at 5:16 PM Malte Stroem <malte.stroem@xxxxxxxxx> wrote:
> Hello Eugen,
>
> thanks.
>
> We found the cause.
>
> Somehow all
>
> /var/lib/ceph/fsid/osd.XX/config
>
> files on every host were still filled with expired information about the mons.
>
> So refreshing the files helped to bring the OSDs up again. Damn.
>
> All other configs for the mons, MDSs, RGWs and so on were up to date.
>
> I do not know why the OSD config files did not get refreshed; however, I guess
> something went wrong while draining the nodes we removed from the cluster.
>
> Best regards,
> Malte
>
> Am 21.06.23 um 22:11 schrieb Eugen Block:
> > I still can't really grasp what might have happened here. But could you
> > please clarify which of the down OSDs (or hosts) are supposed to be down
> > and which you're trying to bring back online? Obviously osd.40 is one of
> > your attempts. But what about the hosts cephx01 and cephx08? Are those
> > the ones refusing to start their OSDs? And the remaining up OSDs you
> > haven't touched yet, correct?
> > And regarding debug logs, you should set them with ceph config set, because
> > the local ceph.conf won't have an effect. It could help to have the
> > startup debug logs from one of the OSDs.
> >
> > Zitat von Malte Stroem <malte.stroem@xxxxxxxxx>:
> >
> >> Hello Eugen,
> >>
> >> recovery and rebalancing were finished; however, now all PGs show missing OSDs.
> >>
> >> Everything looks as if the PGs are missing OSDs, although the process finished correctly.
> >>
> >> As if we had shut down the servers immediately.
> >>
> >> But we removed the nodes the way it is described in the documentation.
> >>
> >> We just added new disks and they joined the cluster immediately.
> >>
> >> So the old OSDs removed from the cluster are still available; I restored osd.40,
> >> but it does not want to join the cluster.
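For readers hitting the same symptom, here is a minimal sketch of how checking and refreshing the cephadm-managed per-OSD config (and setting debug options centrally, as Eugen suggests) could look. The fsid, OSD ID and paths below are placeholders, and whether ceph orch reconfig is available depends on your release; treat this as an illustration, not the exact procedure used in this thread:

    # Compare the mon addresses the OSD daemon was handed with the current monmap
    cat /var/lib/ceph/<fsid>/osd.XX/config      # per-daemon config written by cephadm
    ceph mon dump                               # current mon addresses, run from an admin node

    # One possible way to refresh stale per-daemon config files
    ceph config generate-minimal-conf           # prints a minimal ceph.conf with the current mon_host
    ceph orch reconfig osd                      # ask cephadm to rewrite config files for the osd service
    systemctl restart ceph-<fsid>@osd.XX.service  # restart the daemon so it picks up the new mon list

    # Debug logging is set centrally; a local ceph.conf entry is not consulted by cephadm daemons
    ceph config set osd.XX debug_osd 20
    ceph config set osd.XX debug_monc 20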
> >>
> >> Following are the outputs of the mentioned commands:
> >>
> >> ceph -s
> >>
> >>   cluster:
> >>     id:     X
> >>     health: HEALTH_WARN
> >>             1 failed cephadm daemon(s)
> >>             1 filesystem is degraded
> >>             1 MDSs report slow metadata IOs
> >>             19 osds down
> >>             4 hosts (50 osds) down
> >>             Reduced data availability: 1220 pgs inactive
> >>             Degraded data redundancy: 132 pgs undersized
> >>
> >>   services:
> >>     mon: 3 daemons, quorum cephx02,cephx04,cephx06 (age 4m)
> >>     mgr: cephx02.xxxxxx(active, since 92s), standbys: cephx04.yyyyyy, cephx06.zzzzzz
> >>     mds: 2/2 daemons up, 2 standby
> >>     osd: 130 osds: 78 up (since 13m), 97 in (since 35m); 171 remapped pgs
> >>     rgw: 1 daemon active (1 hosts, 1 zones)
> >>
> >>   data:
> >>     volumes: 1/2 healthy, 1 recovering
> >>     pools:   12 pools, 1345 pgs
> >>     objects: 11.02k objects, 1.9 GiB
> >>     usage:   145 TiB used, 669 TiB / 814 TiB avail
> >>     pgs:     86.617% pgs unknown
> >>              4.089% pgs not active
> >>              39053/33069 objects misplaced (118.095%)
> >>              1165 unknown
> >>              77   active+undersized+remapped
> >>              55   undersized+remapped+peered
> >>              38   active+clean+remapped
> >>              10   active+clean
> >>
> >> ceph osd tree
> >>
> >> ID   CLASS  WEIGHT      TYPE NAME            STATUS  REWEIGHT  PRI-AFF
> >> -21         4.36646     root ssds
> >> -61         0.87329       host cephx01-ssd
> >> 186    ssd  0.87329         osd.186            down   1.00000  1.00000
> >> -76         0.87329       host cephx02-ssd
> >> 263    ssd  0.87329         osd.263              up   1.00000  1.00000
> >> -85         0.87329       host cephx04-ssd
> >> 237    ssd  0.87329         osd.237              up   1.00000  1.00000
> >> -88         0.87329       host cephx06-ssd
> >> 236    ssd  0.87329         osd.236              up   1.00000  1.00000
> >> -94         0.87329       host cephx08-ssd
> >> 262    ssd  0.87329         osd.262            down   1.00000  1.00000
> >>  -1       1347.07397     root default
> >> -62        261.93823       host cephx01
> >> 139    hdd  10.91409        osd.139            down         0  1.00000
> >> 140    hdd  10.91409        osd.140            down         0  1.00000
> >> 142    hdd  10.91409        osd.142            down         0  1.00000
> >> 144    hdd  10.91409        osd.144            down         0  1.00000
> >> 146    hdd  10.91409        osd.146            down         0  1.00000
> >> 148    hdd  10.91409        osd.148            down         0  1.00000
> >> 150    hdd  10.91409        osd.150            down         0  1.00000
> >> 152    hdd  10.91409        osd.152            down         0  1.00000
> >> 154    hdd  10.91409        osd.154            down   1.00000  1.00000
> >> 156    hdd  10.91409        osd.156            down   1.00000  1.00000
> >> 158    hdd  10.91409        osd.158            down   1.00000  1.00000
> >> 160    hdd  10.91409        osd.160            down   1.00000  1.00000
> >> 162    hdd  10.91409        osd.162            down   1.00000  1.00000
> >> 164    hdd  10.91409        osd.164            down   1.00000  1.00000
> >> 166    hdd  10.91409        osd.166            down   1.00000  1.00000
> >> 168    hdd  10.91409        osd.168            down   1.00000  1.00000
> >> 170    hdd  10.91409        osd.170            down   1.00000  1.00000
> >> 172    hdd  10.91409        osd.172            down   1.00000  1.00000
> >> 174    hdd  10.91409        osd.174            down   1.00000  1.00000
> >> 176    hdd  10.91409        osd.176            down   1.00000  1.00000
> >> 178    hdd  10.91409        osd.178            down   1.00000  1.00000
> >> 180    hdd  10.91409        osd.180            down   1.00000  1.00000
> >> 182    hdd  10.91409        osd.182            down   1.00000  1.00000
> >> 184    hdd  10.91409        osd.184            down   1.00000  1.00000
> >> -67        261.93823       host cephx02
> >> 138    hdd  10.91409        osd.138              up   1.00000  1.00000
> >> 141    hdd  10.91409        osd.141              up   1.00000  1.00000
> >> 143    hdd  10.91409        osd.143              up   1.00000  1.00000
> >> 145    hdd  10.91409        osd.145              up   1.00000  1.00000
> >> 147    hdd  10.91409        osd.147              up   1.00000  1.00000
> >> 149    hdd  10.91409        osd.149              up   1.00000  1.00000
> >> 151    hdd  10.91409        osd.151              up   1.00000  1.00000
> >> 153    hdd  10.91409        osd.153              up   1.00000  1.00000
> >> 155    hdd  10.91409        osd.155              up   1.00000  1.00000
> >> 157    hdd  10.91409        osd.157              up   1.00000  1.00000
> >> 159    hdd  10.91409        osd.159              up   1.00000  1.00000
> >> 161    hdd  10.91409        osd.161              up   1.00000  1.00000
> >> 163    hdd  10.91409        osd.163              up   1.00000  1.00000
> >> 165    hdd  10.91409        osd.165              up   1.00000  1.00000
> >> 167    hdd  10.91409        osd.167              up   1.00000  1.00000
> >> 169    hdd  10.91409        osd.169              up   1.00000  1.00000
> >> 171    hdd  10.91409        osd.171              up   1.00000  1.00000
> >> 173    hdd  10.91409        osd.173              up   1.00000  1.00000
> >> 175    hdd  10.91409        osd.175              up   1.00000  1.00000
> >> 177    hdd  10.91409        osd.177              up   1.00000  1.00000
> >> 179    hdd  10.91409        osd.179              up   1.00000  1.00000
> >> 181    hdd  10.91409        osd.181              up   1.00000  1.00000
> >> 183    hdd  10.91409        osd.183              up   1.00000  1.00000
> >> 185    hdd  10.91409        osd.185              up   1.00000  1.00000
> >> -82        261.93823       host cephx04
> >> 189    hdd  10.91409        osd.189              up   1.00000  1.00000
> >> 191    hdd  10.91409        osd.191              up   1.00000  1.00000
> >> 193    hdd  10.91409        osd.193              up   1.00000  1.00000
> >> 195    hdd  10.91409        osd.195              up   1.00000  1.00000
> >> 197    hdd  10.91409        osd.197              up   1.00000  1.00000
> >> 199    hdd  10.91409        osd.199              up   1.00000  1.00000
> >> 201    hdd  10.91409        osd.201              up   1.00000  1.00000
> >> 202    hdd  10.91409        osd.202              up   1.00000  1.00000
> >> 204    hdd  10.91409        osd.204              up   1.00000  1.00000
> >> 206    hdd  10.91409        osd.206              up   1.00000  1.00000
> >> 208    hdd  10.91409        osd.208              up   1.00000  1.00000
> >> 210    hdd  10.91409        osd.210              up   1.00000  1.00000
> >> 212    hdd  10.91409        osd.212              up   1.00000  1.00000
> >> 214    hdd  10.91409        osd.214              up   1.00000  1.00000
> >> 217    hdd  10.91409        osd.217              up   1.00000  1.00000
> >> 219    hdd  10.91409        osd.219              up   1.00000  1.00000
> >> 221    hdd  10.91409        osd.221              up   1.00000  1.00000
> >> 223    hdd  10.91409        osd.223              up   1.00000  1.00000
> >> 225    hdd  10.91409        osd.225              up   1.00000  1.00000
> >> 227    hdd  10.91409        osd.227              up   1.00000  1.00000
> >> 229    hdd  10.91409        osd.229              up   1.00000  1.00000
> >> 231    hdd  10.91409        osd.231              up   1.00000  1.00000
> >> 233    hdd  10.91409        osd.233              up   1.00000  1.00000
> >> 235    hdd  10.91409        osd.235              up   1.00000  1.00000
> >> -79        261.93823       host cephx06
> >> 188    hdd  10.91409        osd.188              up   1.00000  1.00000
> >> 190    hdd  10.91409        osd.190              up   1.00000  1.00000
> >> 192    hdd  10.91409        osd.192              up   1.00000  1.00000
> >> 194    hdd  10.91409        osd.194              up   1.00000  1.00000
> >> 196    hdd  10.91409        osd.196              up   1.00000  1.00000
> >> 198    hdd  10.91409        osd.198              up   1.00000  1.00000
> >> 200    hdd  10.91409        osd.200              up   1.00000  1.00000
> >> 203    hdd  10.91409        osd.203              up   1.00000  1.00000
> >> 205    hdd  10.91409        osd.205              up   1.00000  1.00000
> >> 207    hdd  10.91409        osd.207              up   1.00000  1.00000
> >> 209    hdd  10.91409        osd.209              up   1.00000  1.00000
> >> 211    hdd  10.91409        osd.211              up   1.00000  1.00000
> >> 213    hdd  10.91409        osd.213              up   1.00000  1.00000
> >> 215    hdd  10.91409        osd.215              up   1.00000  1.00000
> >> 216    hdd  10.91409        osd.216              up   1.00000  1.00000
> >> 218    hdd  10.91409        osd.218              up   1.00000  1.00000
> >> 220    hdd  10.91409        osd.220              up   1.00000  1.00000
> >> 222    hdd  10.91409        osd.222              up   1.00000  1.00000
> >> 224    hdd  10.91409        osd.224              up   1.00000  1.00000
> >> 226    hdd  10.91409        osd.226              up   1.00000  1.00000
> >> 228    hdd  10.91409        osd.228              up   1.00000  1.00000
> >> 230    hdd  10.91409        osd.230              up   1.00000  1.00000
> >> 232    hdd  10.91409        osd.232              up   1.00000  1.00000
> >> 234    hdd  10.91409        osd.234            down   1.00000  1.00000
> >> -91        261.93823       host cephx08
> >> 238    hdd  10.91409        osd.238            down         0  1.00000
> >> 239    hdd  10.91409        osd.239            down         0  1.00000
> >> 240    hdd  10.91409        osd.240            down         0  1.00000
> >> 241    hdd  10.91409        osd.241            down         0  1.00000
> >> 242    hdd  10.91409        osd.242            down         0  1.00000
> >> 243    hdd  10.91409        osd.243            down         0  1.00000
> >> 244    hdd  10.91409        osd.244            down         0  1.00000
> >> 245    hdd  10.91409        osd.245            down         0  1.00000
> >> 246    hdd  10.91409        osd.246            down         0  1.00000
> >> 247    hdd  10.91409        osd.247            down         0  1.00000
> >> 248    hdd  10.91409        osd.248            down         0  1.00000
> >> 249    hdd  10.91409        osd.249            down         0  1.00000
> >> 250    hdd  10.91409        osd.250            down         0  1.00000
> >> 251    hdd  10.91409        osd.251            down         0  1.00000
> >> 252    hdd  10.91409        osd.252            down         0  1.00000
> >> 253    hdd  10.91409        osd.253            down         0  1.00000
> >> 254    hdd  10.91409        osd.254            down         0  1.00000
> >> 255    hdd  10.91409        osd.255            down         0  1.00000
> >> 256    hdd  10.91409        osd.256            down         0  1.00000
> >> 257    hdd  10.91409        osd.257            down         0  1.00000
> >> 258    hdd  10.91409        osd.258            down         0  1.00000
> >> 259    hdd  10.91409        osd.259            down         0  1.00000
> >> 260    hdd  10.91409        osd.260            down         0  1.00000
> >> 261    hdd  10.91409        osd.261            down         0  1.00000
> >>  -3         37.38275       host ceph06
> >>  40          1.00000        osd.40             down         0  1.00000
> >>   0    hdd   9.09569        osd.0                up   1.00000  1.00000
> >>   1    hdd   9.09569        osd.1                up   1.00000  1.00000
> >>   2    hdd   9.09569        osd.2                up   1.00000  1.00000
> >>   3    hdd   9.09569        osd.3                up   1.00000  1.00000
> >>
> >> ceph health detail
> >>
> >> HEALTH_WARN 1 failed cephadm daemon(s); 1 filesystem is degraded; 1 MDSs report slow metadata IOs; 19 osds down; 4 hosts (50 osds) down; Reduced data availability: 1220 pgs inactive; Degraded data redundancy: 132 pgs undersized
> >> [WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
> >>     daemon rgw.cephx06.xxxxxx on cephx06 is in error state
> >> [WRN] FS_DEGRADED: 1 filesystem is degraded
> >>     fs cephfs is degraded
> >> [WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
> >>     mds.cephfs.cephx01.yyyyyy(mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest blocked for 1664 secs
> >> [WRN] OSD_DOWN: 19 osds down
> >>     osd.154 (root=default,host=cephx01) is down
> >>     osd.156 (root=default,host=cephx01) is down
> >>     osd.158 (root=default,host=cephx01) is down
> >>     osd.160 (root=default,host=cephx01) is down
> >>     osd.162 (root=default,host=cephx01) is down
> >>     osd.164 (root=default,host=cephx01) is down
> >>     osd.166 (root=default,host=cephx01) is down
> >>     osd.168 (root=default,host=cephx01) is down
> >>     osd.170 (root=default,host=cephx01) is down
> >>     osd.172 (root=default,host=cephx01) is down
> >>     osd.174 (root=default,host=cephx01) is down
> >>     osd.176 (root=default,host=cephx01) is down
> >>     osd.178 (root=default,host=cephx01) is down
> >>     osd.180 (root=default,host=cephx01) is down
> >>     osd.182 (root=default,host=cephx01) is down
> >>     osd.184 (root=default,host=cephx01) is down
> >>     osd.186 (root=ssds,host=cephx01-ssd) is down
> >>     osd.234 (root=default,host=cephx06) is down
> >>     osd.262 (root=ssds,host=cephx08-ssd) is down
> >> [WRN] OSD_HOST_DOWN: 4 hosts (50 osds) down
> >>     host cephx01-ssd (root=ssds) (1 osds) is down
> >>     host cephx01 (root=default) (24 osds) is down
> >>     host cephx08 (root=default) (24 osds) is down
> >>     host cephx08-ssd (root=ssds) (1 osds) is down
> >> [WRN] PG_AVAILABILITY: Reduced data availability: 1220 pgs inactive
> >>     pg 7.3cd is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ce is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3cf is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d0 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d1 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d2 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d3 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d4 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d5 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d6 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d7 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d8 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3d9 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3da is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3db is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3dc is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3dd is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3de is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3df is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e0 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e1 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e2 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e3 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e4 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e5 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e6 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e7 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e8 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3e9 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ea is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3eb is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ec is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ed is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ee is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ef is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f0 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f1 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f2 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f3 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f4 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f5 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f6 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f7 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f8 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3f9 is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3fa is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3fb is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3fc is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3fd is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3fe is stuck inactive for 13h, current state unknown, last acting []
> >>     pg 7.3ff is stuck inactive for 13h, current state unknown, last acting []
> >> [WRN] PG_DEGRADED: Degraded data redundancy: 132 pgs undersized
> >>     pg 1.0 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 1.1 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 1.7 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
> >>     pg 1.b is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,263]
> >>     pg 1.e is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 2.3 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 2.8 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 2.9 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 2.a is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,263]
> >>     pg 2.b is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
> >>     pg 2.d is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 2.e is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 3.2 is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,263]
> >>     pg 3.4 is stuck undersized for 13h, current state active+undersized+remapped, last acting [237,263]
> >>     pg 3.8 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 3.a is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,237]
> >>     pg 3.c is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 3.d is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
> >>     pg 3.f is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 4.3 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
> >>     pg 4.5 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 4.b is stuck undersized for 13h, current state active+undersized+remapped, last acting [237,236]
> >>     pg 4.d is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 4.e is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 5.2 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 5.3 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 5.5 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 5.8 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 5.a is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 5.b is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,263]
> >>     pg 5.c is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 5.e is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 5.f is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 9.0 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 9.1 is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,237]
> >>     pg 9.2 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 9.3 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 9.5 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 9.6 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 9.7 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 9.9 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
> >>     pg 10.0 is stuck undersized for 13h, current state active+undersized+remapped, last acting [237,236]
> >>     pg 10.1 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 10.2 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 10.3 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 10.4 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
> >>     pg 10.5 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 10.6 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
> >>     pg 10.7 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 10.a is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
> >>     pg 10.c is stuck undersized for 13h, current state active+undersized+remapped, last acting [237,236]
> >>
> >> Best,
> >> Malte
> >>
> >> Am 21.06.23 um 10:31 schrieb Eugen Block:
> >>> Hi,
> >>>
> >>>> Yes, we drained the nodes. It needed two weeks to finish the
> >>>> process, and yes, I think this is the root cause.
> >>>> So we still have the nodes but when I try to restart one of those
> >>>> OSDs it still cannot join:
> >>>
> >>> if the nodes were drained successfully (can you confirm that all PGs
> >>> were active+clean after draining before you removed the nodes?) then
> >>> the disks on the removed nodes wouldn't have any data to bring back.
> >>> The question would be, why do the remaining OSDs still reference
> >>> removed OSDs. Or am I misunderstanding something? I think it would
> >>> help to know the whole story, can you provide more details?
> >>> Also some more general cluster info would be helpful:
> >>> $ ceph -s
> >>> $ ceph osd tree
> >>> $ ceph health detail
> >>>
> >>> Zitat von Malte Stroem <malte.stroem@xxxxxxxxx>:
> >>>
> >>>> Hello Eugen,
> >>>>
> >>>> thank you. Yesterday I thought: Well, Eugen can help!
> >>>>
> >>>> Yes, we drained the nodes. It took two weeks for the process to finish, and yes,
> >>>> I think this is the root cause.
> >>>>
> >>>> So we still have the nodes, but when I try to restart one of those OSDs, it still cannot join:
> >>>>
> >>>> Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-66/block
> >>>> Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-19
> >>>> Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-66
> >>>> Jun 21 09:46:03 ceph-node bash[2323668]: --> ceph-volume lvm activate successful for osd ID: 66
> >>>> Jun 21 09:51:04 ceph-node bash[2323668]: debug 2023-06-21T07:51:04.176+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
> >>>> Jun 21 09:56:04 ceph-node bash[2323668]: debug 2023-06-21T07:56:04.179+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
> >>>> Jun 21 10:01:04 ceph-node bash[2323668]: debug 2023-06-21T08:01:04.177+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
> >>>> Jun 21 10:06:04 ceph-node bash[2323668]: debug 2023-06-21T08:06:04.179+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
> >>>> Jun 21 10:11:04 ceph-node bash[2323668]: debug 2023-06-21T08:11:04.174+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
> >>>>
> >>>> Same messages on all OSDs.
> >>>>
> >>>> We still have some nodes running and did not restart those OSDs.
> >>>>
> >>>> Best,
> >>>> Malte
> >>>>
> >>>> Am 21.06.23 um 09:50 schrieb Eugen Block:
> >>>>> Hi,
> >>>>> can you share more details about what exactly you did? How did you remove
> >>>>> the nodes? Hopefully, you waited for the draining to finish? But if
> >>>>> the remaining OSDs wait for removed OSDs, it sounds like the
> >>>>> draining was not finished.
> >>>>>
> >>>>> Zitat von Malte Stroem <malte.stroem@xxxxxxxxx>:
> >>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> we removed some nodes from our cluster. This worked without problems.
> >>>>>>
> >>>>>> Now, lots of OSDs do not want to join the cluster anymore if we
> >>>>>> reboot one of the still available nodes.
> >>>>>>
> >>>>>> It always runs into timeouts:
> >>>>>>
> >>>>>> --> ceph-volume lvm activate successful for osd ID: XX
> >>>>>> monclient(hunting): authenticate timed out after 300
> >>>>>>
> >>>>>> MONs and MGRs are running fine.
> >>>>>>
> >>>>>> The network is working; netcat shows the MONs' ports are open.
> >>>>>>
> >>>>>> Setting a higher debug level has no effect, even if we add it to
> >>>>>> the ceph.conf file.
> >>>>>>
> >>>>>> The PGs are pretty unhappy, e.g.:
> >>>>>>
> >>>>>> 7.143  87771  0  0  0  0  314744902235  0  0  10081  10081  down  2023-06-20T09:16:03.546158+0000  961275'1395646  961300:9605547  [209,NONE,NONE]  209  [209,NONE,NONE]  209  961231'1395512  2023-06-19T23:46:40.101791+0000  961231'1395512  2023-06-19T23:46:40.101791+0000
> >>>>>>
> >>>>>> PG query wants us to mark an OSD as lost; however, I do not want to do this.
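In hindsight, the repeated "monclient(hunting): authenticate timed out" lines fit the stale mon information found later in the thread. A quick sanity check (a sketch only; <fsid>, osd.XX and <mon-ip> are placeholders) is to compare the mon_host the daemon was started with against the live monmap and to test the mon ports from the affected host:

    # On the affected host: which mons does this OSD try to contact?
    grep mon_host /var/lib/ceph/<fsid>/osd.XX/config

    # On a working admin node: which mons actually exist right now?
    ceph mon dump

    # Basic reachability of the msgr v2/v1 mon ports from the affected host
    nc -zv <mon-ip> 3300
    nc -zv <mon-ip> 6789

If the two mon lists differ, the daemon will keep hunting addresses that no longer host a mon and time out exactly like this.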
> >>>>>>
> >>>>>> OSDs are blocked by OSDs from the removed nodes:
> >>>>>>
> >>>>>> ceph osd blocked-by
> >>>>>> osd  num_blocked
> >>>>>> 152           38
> >>>>>> 244           41
> >>>>>> 144           54
> >>>>>> ...
> >>>>>>
> >>>>>> We added the removed hosts again and tried to start the OSDs on
> >>>>>> this node, and they also failed with the timeout mentioned above.
> >>>>>>
> >>>>>> This is a containerized cluster running version 16.2.10.
> >>>>>>
> >>>>>> Replication is 3; some pools use an erasure-coded profile.
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Malte
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx