Re: OSDs cannot join cluster anymore

Hello Eugen,

thanks.

We found the cause.

Somehow all

/var/lib/ceph/fsid/osd.XX/config

files on every host still contained stale information about the MONs.

So refreshing those files brought the OSDs back up again. Damn.

All other configs for the MONs, MDSs, RGWs and so on were up to date.

I do not know why the OSD config files were not refreshed; I guess something went wrong while draining the nodes we removed from the cluster.
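
For anyone hitting the same issue: once the files turn out to be stale, a minimal config with the current mon_host line can be generated and compared against them (a sketch, assuming a cephadm deployment; <fsid> stands for the cluster fsid):

$ ceph config generate-minimal-conf
$ grep mon_host /var/lib/ceph/<fsid>/osd.*/config

On recent cephadm releases something like "ceph orch reconfig osd" should rewrite the daemon configs automatically; otherwise the files can be replaced by hand and the OSDs restarted via their systemd units (ceph-<fsid>@osd.XX.service).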

Best regards,
Malte

Am 21.06.23 um 22:11 schrieb Eugen Block:
I still can’t really grasp what might have happened here. But could you please clarify which of the down OSDs (or Hosts) are supposed to be down and which you’re trying to bring back online? Obviously osd.40 is one of your attempts. But what about the hosts cephx01 and cephx08? Are those the ones refusing to start their OSDs? And the remaining up OSDs you haven’t touched yet, correct? And regarding debug logs, you should set it with ceph config set because the local ceph.conf won’t have an effect. It could help to have the startup debug logs from one of the OSDs.
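
(A minimal example of that, using osd.40 as the target and debug levels that are only a suggestion:

$ ceph config set osd.40 debug_osd 10
$ ceph config set osd.40 debug_monc 10
$ ceph config set osd.40 debug_ms 1

and later "ceph config rm osd.40 debug_osd" etc. to revert.)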

Zitat von Malte Stroem <malte.stroem@xxxxxxxxx>:

Hello Eugen,

recovery and rebalancing had finished; however, now all PGs report missing OSDs.

Everything looks as if the PGs are missing OSDs, although the process completed correctly.

As if we had shut down the servers abruptly.

But we removed the nodes as described in the documentation.

We just added new disks and they joined the cluster immediately.

So the old OSDs removed from the cluster are still available; I restored osd.40, but it does not want to join the cluster.

Following are the outputs of the mentioned commands:

ceph -s

  cluster:
    id:     X
    health: HEALTH_WARN
            1 failed cephadm daemon(s)
            1 filesystem is degraded
            1 MDSs report slow metadata IOs
            19 osds down
            4 hosts (50 osds) down
            Reduced data availability: 1220 pgs inactive
            Degraded data redundancy: 132 pgs undersized

  services:
    mon: 3 daemons, quorum cephx02,cephx04,cephx06 (age 4m)
    mgr: cephx02.xxxxxx(active, since 92s), standbys: cephx04.yyyyyy, cephx06.zzzzzz
    mds: 2/2 daemons up, 2 standby
    osd: 130 osds: 78 up (since 13m), 97 in (since 35m); 171 remapped pgs
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/2 healthy, 1 recovering
    pools:   12 pools, 1345 pgs
    objects: 11.02k objects, 1.9 GiB
    usage:   145 TiB used, 669 TiB / 814 TiB avail
    pgs:     86.617% pgs unknown
             4.089% pgs not active
             39053/33069 objects misplaced (118.095%)
             1165 unknown
             77   active+undersized+remapped
             55   undersized+remapped+peered
             38   active+clean+remapped
             10   active+clean

ceph osd tree

ID   CLASS  WEIGHT      TYPE NAME                STATUS  REWEIGHT PRI-AFF
-21            4.36646  root ssds
-61            0.87329      host cephx01-ssd
186    ssd     0.87329          osd.186            down   1.00000 1.00000
-76            0.87329      host cephx02-ssd
263    ssd     0.87329          osd.263              up   1.00000 1.00000
-85            0.87329      host cephx04-ssd
237    ssd     0.87329          osd.237              up   1.00000 1.00000
-88            0.87329      host cephx06-ssd
236    ssd     0.87329          osd.236              up   1.00000 1.00000
-94            0.87329      host cephx08-ssd
262    ssd     0.87329          osd.262            down   1.00000 1.00000
 -1         1347.07397  root default
-62          261.93823      host cephx01
139    hdd    10.91409          osd.139            down         0 1.00000
140    hdd    10.91409          osd.140            down         0 1.00000
142    hdd    10.91409          osd.142            down         0 1.00000
144    hdd    10.91409          osd.144            down         0 1.00000
146    hdd    10.91409          osd.146            down         0 1.00000
148    hdd    10.91409          osd.148            down         0 1.00000
150    hdd    10.91409          osd.150            down         0 1.00000
152    hdd    10.91409          osd.152            down         0 1.00000
154    hdd    10.91409          osd.154            down   1.00000 1.00000
156    hdd    10.91409          osd.156            down   1.00000 1.00000
158    hdd    10.91409          osd.158            down   1.00000 1.00000
160    hdd    10.91409          osd.160            down   1.00000 1.00000
162    hdd    10.91409          osd.162            down   1.00000 1.00000
164    hdd    10.91409          osd.164            down   1.00000 1.00000
166    hdd    10.91409          osd.166            down   1.00000 1.00000
168    hdd    10.91409          osd.168            down   1.00000 1.00000
170    hdd    10.91409          osd.170            down   1.00000 1.00000
172    hdd    10.91409          osd.172            down   1.00000 1.00000
174    hdd    10.91409          osd.174            down   1.00000 1.00000
176    hdd    10.91409          osd.176            down   1.00000 1.00000
178    hdd    10.91409          osd.178            down   1.00000 1.00000
180    hdd    10.91409          osd.180            down   1.00000 1.00000
182    hdd    10.91409          osd.182            down   1.00000 1.00000
184    hdd    10.91409          osd.184            down   1.00000 1.00000
-67          261.93823      host cephx02
138    hdd    10.91409          osd.138              up   1.00000 1.00000
141    hdd    10.91409          osd.141              up   1.00000 1.00000
143    hdd    10.91409          osd.143              up   1.00000 1.00000
145    hdd    10.91409          osd.145              up   1.00000 1.00000
147    hdd    10.91409          osd.147              up   1.00000 1.00000
149    hdd    10.91409          osd.149              up   1.00000 1.00000
151    hdd    10.91409          osd.151              up   1.00000 1.00000
153    hdd    10.91409          osd.153              up   1.00000 1.00000
155    hdd    10.91409          osd.155              up   1.00000 1.00000
157    hdd    10.91409          osd.157              up   1.00000 1.00000
159    hdd    10.91409          osd.159              up   1.00000 1.00000
161    hdd    10.91409          osd.161              up   1.00000 1.00000
163    hdd    10.91409          osd.163              up   1.00000 1.00000
165    hdd    10.91409          osd.165              up   1.00000 1.00000
167    hdd    10.91409          osd.167              up   1.00000 1.00000
169    hdd    10.91409          osd.169              up   1.00000 1.00000
171    hdd    10.91409          osd.171              up   1.00000 1.00000
173    hdd    10.91409          osd.173              up   1.00000 1.00000
175    hdd    10.91409          osd.175              up   1.00000 1.00000
177    hdd    10.91409          osd.177              up   1.00000 1.00000
179    hdd    10.91409          osd.179              up   1.00000 1.00000
181    hdd    10.91409          osd.181              up   1.00000 1.00000
183    hdd    10.91409          osd.183              up   1.00000 1.00000
185    hdd    10.91409          osd.185              up   1.00000 1.00000
-82          261.93823      host cephx04
189    hdd    10.91409          osd.189              up   1.00000 1.00000
191    hdd    10.91409          osd.191              up   1.00000 1.00000
193    hdd    10.91409          osd.193              up   1.00000 1.00000
195    hdd    10.91409          osd.195              up   1.00000 1.00000
197    hdd    10.91409          osd.197              up   1.00000 1.00000
199    hdd    10.91409          osd.199              up   1.00000 1.00000
201    hdd    10.91409          osd.201              up   1.00000 1.00000
202    hdd    10.91409          osd.202              up   1.00000 1.00000
204    hdd    10.91409          osd.204              up   1.00000 1.00000
206    hdd    10.91409          osd.206              up   1.00000 1.00000
208    hdd    10.91409          osd.208              up   1.00000 1.00000
210    hdd    10.91409          osd.210              up   1.00000 1.00000
212    hdd    10.91409          osd.212              up   1.00000 1.00000
214    hdd    10.91409          osd.214              up   1.00000 1.00000
217    hdd    10.91409          osd.217              up   1.00000 1.00000
219    hdd    10.91409          osd.219              up   1.00000 1.00000
221    hdd    10.91409          osd.221              up   1.00000 1.00000
223    hdd    10.91409          osd.223              up   1.00000 1.00000
225    hdd    10.91409          osd.225              up   1.00000 1.00000
227    hdd    10.91409          osd.227              up   1.00000 1.00000
229    hdd    10.91409          osd.229              up   1.00000 1.00000
231    hdd    10.91409          osd.231              up   1.00000 1.00000
233    hdd    10.91409          osd.233              up   1.00000 1.00000
235    hdd    10.91409          osd.235              up   1.00000 1.00000
-79          261.93823      host cephx06
188    hdd    10.91409          osd.188              up   1.00000 1.00000
190    hdd    10.91409          osd.190              up   1.00000 1.00000
192    hdd    10.91409          osd.192              up   1.00000 1.00000
194    hdd    10.91409          osd.194              up   1.00000 1.00000
196    hdd    10.91409          osd.196              up   1.00000 1.00000
198    hdd    10.91409          osd.198              up   1.00000 1.00000
200    hdd    10.91409          osd.200              up   1.00000 1.00000
203    hdd    10.91409          osd.203              up   1.00000 1.00000
205    hdd    10.91409          osd.205              up   1.00000 1.00000
207    hdd    10.91409          osd.207              up   1.00000 1.00000
209    hdd    10.91409          osd.209              up   1.00000 1.00000
211    hdd    10.91409          osd.211              up   1.00000 1.00000
213    hdd    10.91409          osd.213              up   1.00000 1.00000
215    hdd    10.91409          osd.215              up   1.00000 1.00000
216    hdd    10.91409          osd.216              up   1.00000 1.00000
218    hdd    10.91409          osd.218              up   1.00000 1.00000
220    hdd    10.91409          osd.220              up   1.00000 1.00000
222    hdd    10.91409          osd.222              up   1.00000 1.00000
224    hdd    10.91409          osd.224              up   1.00000 1.00000
226    hdd    10.91409          osd.226              up   1.00000 1.00000
228    hdd    10.91409          osd.228              up   1.00000 1.00000
230    hdd    10.91409          osd.230              up   1.00000 1.00000
232    hdd    10.91409          osd.232              up   1.00000 1.00000
234    hdd    10.91409          osd.234            down   1.00000 1.00000
-91          261.93823      host cephx08
238    hdd    10.91409          osd.238            down         0 1.00000
239    hdd    10.91409          osd.239            down         0 1.00000
240    hdd    10.91409          osd.240            down         0 1.00000
241    hdd    10.91409          osd.241            down         0 1.00000
242    hdd    10.91409          osd.242            down         0 1.00000
243    hdd    10.91409          osd.243            down         0 1.00000
244    hdd    10.91409          osd.244            down         0 1.00000
245    hdd    10.91409          osd.245            down         0 1.00000
246    hdd    10.91409          osd.246            down         0 1.00000
247    hdd    10.91409          osd.247            down         0 1.00000
248    hdd    10.91409          osd.248            down         0 1.00000
249    hdd    10.91409          osd.249            down         0 1.00000
250    hdd    10.91409          osd.250            down         0 1.00000
251    hdd    10.91409          osd.251            down         0 1.00000
252    hdd    10.91409          osd.252            down         0 1.00000
253    hdd    10.91409          osd.253            down         0 1.00000
254    hdd    10.91409          osd.254            down         0 1.00000
255    hdd    10.91409          osd.255            down         0 1.00000
256    hdd    10.91409          osd.256            down         0 1.00000
257    hdd    10.91409          osd.257            down         0 1.00000
258    hdd    10.91409          osd.258            down         0 1.00000
259    hdd    10.91409          osd.259            down         0 1.00000
260    hdd    10.91409          osd.260            down         0 1.00000
261    hdd    10.91409          osd.261            down         0 1.00000
 -3           37.38275      host ceph06
 40            1.00000          osd.40             down         0 1.00000
  0    hdd     9.09569          osd.0                up   1.00000 1.00000
  1    hdd     9.09569          osd.1                up   1.00000 1.00000
  2    hdd     9.09569          osd.2                up   1.00000 1.00000
  3    hdd     9.09569          osd.3                up   1.00000 1.00000

ceph health detail

HEALTH_WARN 1 failed cephadm daemon(s); 1 filesystem is degraded; 1 MDSs report slow metadata IOs; 19 osds down; 4 hosts (50 osds) down; Reduced data availability: 1220 pgs inactive; Degraded data redundancy: 132 pgs undersized
[WRN] CEPHADM_FAILED_DAEMON: 1 failed cephadm daemon(s)
    daemon rgw.cephx06.xxxxxx on cephx06 is in error state
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.cephfs.cephx01.yyyyyy(mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest blocked for 1664 secs
[WRN] OSD_DOWN: 19 osds down
    osd.154 (root=default,host=cephx01) is down
    osd.156 (root=default,host=cephx01) is down
    osd.158 (root=default,host=cephx01) is down
    osd.160 (root=default,host=cephx01) is down
    osd.162 (root=default,host=cephx01) is down
    osd.164 (root=default,host=cephx01) is down
    osd.166 (root=default,host=cephx01) is down
    osd.168 (root=default,host=cephx01) is down
    osd.170 (root=default,host=cephx01) is down
    osd.172 (root=default,host=cephx01) is down
    osd.174 (root=default,host=cephx01) is down
    osd.176 (root=default,host=cephx01) is down
    osd.178 (root=default,host=cephx01) is down
    osd.180 (root=default,host=cephx01) is down
    osd.182 (root=default,host=cephx01) is down
    osd.184 (root=default,host=cephx01) is down
    osd.186 (root=ssds,host=cephx01-ssd) is down
    osd.234 (root=default,host=cephx06) is down
    osd.262 (root=ssds,host=cephx08-ssd) is down
[WRN] OSD_HOST_DOWN: 4 hosts (50 osds) down
    host cephx01-ssd (root=ssds) (1 osds) is down
    host cephx01 (root=default) (24 osds) is down
    host cephx08 (root=default) (24 osds) is down
    host cephx08-ssd (root=ssds) (1 osds) is down
[WRN] PG_AVAILABILITY: Reduced data availability: 1220 pgs inactive
    pg 7.3cd is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3ce is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3cf is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3d0 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3d1 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3d2 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3d3 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3d4 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3d5 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3d6 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3d7 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3d8 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3d9 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3da is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3db is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3dc is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3dd is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3de is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3df is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3e0 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3e1 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3e2 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3e3 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3e4 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3e5 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3e6 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3e7 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3e8 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3e9 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3ea is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3eb is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3ec is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3ed is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3ee is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3ef is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3f0 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3f1 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3f2 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3f3 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3f4 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3f5 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3f6 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3f7 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3f8 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3f9 is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3fa is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3fb is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3fc is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3fd is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3fe is stuck inactive for 13h, current state unknown, last acting []
    pg 7.3ff is stuck inactive for 13h, current state unknown, last acting []
[WRN] PG_DEGRADED: Degraded data redundancy: 132 pgs undersized
    pg 1.0 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
    pg 1.1 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
    pg 1.7 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
    pg 1.b is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,263]
    pg 1.e is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
    pg 2.3 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 2.8 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
    pg 2.9 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
    pg 2.a is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,263]
    pg 2.b is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
    pg 2.d is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 2.e is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
    pg 3.2 is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,263]
    pg 3.4 is stuck undersized for 13h, current state active+undersized+remapped, last acting [237,263]
    pg 3.8 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
    pg 3.a is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,237]
    pg 3.c is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
    pg 3.d is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
    pg 3.f is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
    pg 4.3 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
    pg 4.5 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 4.b is stuck undersized for 13h, current state active+undersized+remapped, last acting [237,236]
    pg 4.d is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
    pg 4.e is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
    pg 5.2 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 5.3 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
    pg 5.5 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 5.8 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 5.a is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 5.b is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,263]
    pg 5.c is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 5.e is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 5.f is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 9.0 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
    pg 9.1 is stuck undersized for 13h, current state active+undersized+remapped, last acting [236,237]
    pg 9.2 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 9.3 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 9.5 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
    pg 9.6 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
    pg 9.7 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
    pg 9.9 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,237]
    pg 10.0 is stuck undersized for 13h, current state active+undersized+remapped, last acting [237,236]
    pg 10.1 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 10.2 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 10.3 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
    pg 10.4 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [237]
    pg 10.5 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
    pg 10.6 is stuck undersized for 13h, current state active+undersized+remapped, last acting [263,236]
    pg 10.7 is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
    pg 10.a is stuck undersized for 13h, current state undersized+remapped+peered, last acting [236]
    pg 10.c is stuck undersized for 13h, current state active+undersized+remapped, last acting [237,236]

Best,
Malte

Am 21.06.23 um 10:31 schrieb Eugen Block:
Hi,

Yes, we drained the nodes. It took two weeks for the process to finish, and yes, I think this is the root cause. So we still have the nodes, but when I try to restart one of those OSDs it still cannot join:

if the nodes were drained successfully (can you confirm that all PGs were active+clean after draining, before you removed the nodes?) then the disks on the removed nodes wouldn't have any data to bring back. The question would be why the remaining OSDs still reference removed OSDs. Or am I misunderstanding something? I think it would help to know the whole story; can you provide more details? Also some more general cluster info would be helpful:
$ ceph -s
$ ceph osd tree
$ ceph health detail


Zitat von Malte Stroem <malte.stroem@xxxxxxxxx>:

Hello Eugen,

thank you. Yesterday I thought: Well, Eugen can help!

Yes, we drained the nodes. It took two weeks for the process to finish, and yes, I think this is the root cause.

So we still have the nodes, but when I try to restart one of those OSDs it still cannot join:

Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-66/block
Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /dev/dm-19
Jun 21 09:46:03 ceph-node bash[2323668]: Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-66
Jun 21 09:46:03 ceph-node bash[2323668]: --> ceph-volume lvm activate successful for osd ID: 66
Jun 21 09:51:04 ceph-node bash[2323668]: debug 2023-06-21T07:51:04.176+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
Jun 21 09:56:04 ceph-node bash[2323668]: debug 2023-06-21T07:56:04.179+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
Jun 21 10:01:04 ceph-node bash[2323668]: debug 2023-06-21T08:01:04.177+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
Jun 21 10:06:04 ceph-node bash[2323668]: debug 2023-06-21T08:06:04.179+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300
Jun 21 10:11:04 ceph-node bash[2323668]: debug 2023-06-21T08:11:04.174+0000 7fabef5a1200  0 monclient(hunting): authenticate timed out after 300

Same messages on all OSDs.
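
(A quick check that might narrow this down, assuming a cephadm deployment with <fsid> standing in for the cluster fsid: compare the mon addresses the OSD was started with against the current monmap, for example

$ ceph mon dump
$ grep mon_host /var/lib/ceph/<fsid>/osd.66/config

osd.66 is taken from the log above; "monclient(hunting): authenticate timed out" usually means the daemon cannot reach any of the mons it knows about.)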

We still have some nodes running and did not restart those OSDs.

Best,
Malte

Am 21.06.23 um 09:50 schrieb Eugen Block:
Hi,
can you share more details what exactly you did? How did you remove the nodes? Hopefully, you waited for the draining to finish? But if the remaining OSDs wait for removed OSDs it sounds like the draining was not finished.

Zitat von Malte Stroem <malte.stroem@xxxxxxxxx>:

Hello,

we removed some nodes from our cluster. This worked without problems.

Now, lots of OSDs do not want to join the cluster anymore when we reboot one of the remaining nodes.

It always runs into timeouts:

--> ceph-volume lvm activate successful for osd ID: XX
monclient(hunting): authenticate timed out after 300

MONs and MGRs are running fine.

The network is working; netcat shows that the MONs' ports are open.
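
For example (with <mon-host> standing in for one of the mon hosts):

$ nc -zv <mon-host> 3300
$ nc -zv <mon-host> 6789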

Setting a higher debug level has no effect even if we add it to the ceph.conf file.

The PGs are pretty unhappy, e. g.:

7.143      87771                   0         0          0        0 314744902235            0           0  10081     10081      down 2023-06-20T09:16:03.546158+0000    961275'1395646 961300:9605547 [209,NONE,NONE]         209  [209,NONE,NONE] 209    961231'1395512  2023-06-19T23:46:40.101791+0000    961231'1395512 2023-06-19T23:46:40.101791+0000

PG query suggests marking an OSD as lost; however, I do not want to do that.
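
(For reference, the relevant part of the query output is the recovery_state section, e.g.

$ ceph pg 7.143 query | less

with fields like "down_osds_we_would_probe" and "peering_blocked_by"; pg 7.143 taken from the listing above.)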

OSDs are blocked by OSDs from the removed nodes:

ceph osd blocked-by
osd  num_blocked
152           38
244           41
144           54
...
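
(To double-check whether those blocking IDs still exist in the OSD map and where CRUSH places them, something like this should work; the IDs are taken from the output above:

$ for id in 152 244 144; do ceph osd find $id; done

An OSD that was fully removed should just come back with an ENOENT error.)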

We added the removed hosts again and tried to start the OSDs on those nodes, but they also ran into the timeout mentioned above.

This is a containerized cluster running version 16.2.10.

Replication is 3, some pools use an erasure coded profile.

Best regards,
Malte


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



