Pretty desperate here.  Can someone suggest what I might be able to do to
get these OSDs back up?  It looks like my recovery has stalled.
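For anyone willing to take a look, this is roughly what I am checking next to
see where recovery is stuck.  It is all standard ceph CLI run from cephadm
shell; nothing here is specific to my setup beyond the prompt:

[ceph: root@cn01 /]# ceph health detail
[ceph: root@cn01 /]# ceph osd tree down
[ceph: root@cn01 /]# ceph pg dump_stuck inactive
[ceph: root@cn01 /]# ceph pg dump_stuck undersized

The inactive/undersized PG lists not shrinking over time is what makes me say
"stalled" rather than just "slow".  The earlier thread follows below.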
On Mon, Jul 25, 2022 at 7:26 AM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:

> Do your values for public and cluster network include the new addresses
> on all nodes?

This cluster only has one network; there is no separation between public
and cluster.  Three of the nodes momentarily came up using a different IP
address.

I've also noticed that on one of the nodes that did not move and had no IP
issue, the dashboard names the same device for two different OSDs:

2  cn01  out  destroyed  hdd  TOSHIBA_MG04SCA40EE_21M0A0CKFWZB          Unknown  sdb  osd.2
3  cn01  out  destroyed  ssd  SAMSUNG_MZILT3T8HBLS/007_S5G0NE0R200159  Unknown  sdb  osd.3

[ceph: root@cn01 /]# ceph-volume inventory

Device Path   Size        rotates   available   Model name
/dev/sda      3.64 TB     True      False       MG04SCA40EE
/dev/sdb      3.49 TB     False     False       MZILT3T8HBLS/007
/dev/sdc      3.64 TB     True      False       MG04SCA40EE
/dev/sdd      3.64 TB     True      False       MG04SCA40EE
/dev/sde      3.49 TB     False     False       MZILT3T8HBLS/007
/dev/sdf      3.64 TB     True      False       MG04SCA40EE
/dev/sdg      698.64 GB   True      False       SEAGATE ST375064
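Before trusting that dashboard listing, I want to cross-check what each OSD
itself recorded against ceph-volume.  A rough plan (standard commands; osd ids
2 and 3 are the ones from the listing above, and since both are marked
destroyed the metadata may well come back empty):

[ceph: root@cn01 /]# ceph osd metadata 2
[ceph: root@cn01 /]# ceph osd metadata 3
[ceph: root@cn01 /]# ceph-volume lvm list

If the device information in the metadata disagrees with ceph-volume, I will
assume the dashboard entries for osd.2/osd.3 are just stale.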
[ceph: root@cn01 /]# ceph osd info
osd.0 down out weight 0 up_from 30231 up_thru 30564 down_at 30688 last_clean_interval [25500,30228) [v2:192.168.30.15:6818/2512683421,v1:192.168.30.15:6819/2512683421] [v2:192.168.30.15:6824/2512683421,v1:192.168.30.15:6826/2512683421] autoout,exists d14cf503-a303-4fa4-a713-9530b67d613a
osd.1 down out weight 0 up_from 30393 up_thru 30688 down_at 30697 last_clean_interval [25518,30321) [v2:192.168.30.16:6834/1781855831,v1:192.168.30.16:6835/1781855831] [v2:192.168.30.16:6836/1781855831,v1:192.168.30.16:6837/1781855831] autoout,exists 0d521411-c835-4fa3-beca-3631b4ff6bf7
osd.2 down out weight 0 up_from 31316 up_thru 31293 down_at 31317 last_clean_interval [31218,31296) [v2:192.168.30.11:6810/894589880,v1:192.168.30.11:6811/894589880] [v2:192.168.30.11:6812/894589880,v1:192.168.30.11:6813/894589880] destroyed,exists
osd.3 down out weight 0 up_from 31265 up_thru 31266 down_at 31268 last_clean_interval [31254,31256) [v2:192.168.30.11:6818/1641948535,v1:192.168.30.11:6819/1641948535] [v2:192.168.30.11:6820/1641948535,v1:192.168.30.11:6821/1641948535] destroyed,exists
osd.4 up in weight 1 up_from 31356 up_thru 31581 down_at 31339 last_clean_interval [31320,31338) [v2:192.168.30.11:6802/2785067179,v1:192.168.30.11:6803/2785067179] [v2:192.168.30.11:6804/2785067179,v1:192.168.30.11:6805/2785067179] exists,up 3afd06db-b91d-44fe-9305-5eb95f7a59b9
osd.5 up in weight 1 up_from 31347 up_thru 31699 down_at 31339 last_clean_interval [31311,31338) [v2:192.168.30.11:6818/1936771540,v1:192.168.30.11:6819/1936771540] [v2:192.168.30.11:6820/1936771540,v1:192.168.30.11:6821/1936771540] exists,up 063c2ccf-02ce-4f5e-8252-dddfbb258a95
osd.6 up in weight 1 up_from 31218 up_thru 31711 down_at 31217 last_clean_interval [30978,31195) [v2:192.168.30.12:6816/1585973160,v1:192.168.30.12:6817/1585973160] [v2:192.168.30.12:6818/1585973160,v1:192.168.30.12:6819/1585973160] exists,up 94250ea2-f12e-4dc6-9135-b626086ccffd
osd.7 down out weight 0 up_from 30353 up_thru 30558 down_at 30688 last_clean_interval [25533,30349) [v2:192.168.30.14:6816/4083104061,v1:192.168.30.14:6817/4083104061] [v2:192.168.30.14:6840/4094104061,v1:192.168.30.14:6841/4094104061] autoout,exists de351aec-b91e-4c22-a0bf-85369bc14579
osd.8 up in weight 1 up_from 31226 up_thru 31668 down_at 31225 last_clean_interval [30983,31195) [v2:192.168.30.12:6824/1312484329,v1:192.168.30.12:6825/1312484329] [v2:192.168.30.12:6826/1312484329,v1:192.168.30.12:6827/1312484329] exists,up 51f665b4-fa5b-4b17-8390-ed130145ef04
osd.9 up in weight 1 up_from 31351 up_thru 31673 down_at 31340 last_clean_interval [31315,31338) [v2:192.168.30.11:6810/1446838877,v1:192.168.30.11:6811/1446838877] [v2:192.168.30.11:6812/1446838877,v1:192.168.30.11:6813/1446838877] exists,up 985f1127-d126-4629-b8cd-03cf2d914d99
osd.10 up in weight 1 up_from 31219 up_thru 31639 down_at 31218 last_clean_interval [30980,31195) [v2:192.168.30.12:6808/1587842953,v1:192.168.30.12:6809/1587842953] [v2:192.168.30.12:6810/1587842953,v1:192.168.30.12:6811/1587842953] exists,up c7fca03e-4bd5-4485-a090-658ca967d5f6
osd.11 up in weight 1 up_from 31234 up_thru 31659 down_at 31223 last_clean_interval [30978,31195) [v2:192.168.30.12:6840/3403200742,v1:192.168.30.12:6841/3403200742] [v2:192.168.30.12:6842/3403200742,v1:192.168.30.12:6843/3403200742] exists,up 81074bd7-ad9f-4e56-8885-cca4745f6c95
osd.12 up in weight 1 up_from 31230 up_thru 31717 down_at 31223 last_clean_interval [30975,31195) [v2:192.168.30.13:6816/4268732910,v1:192.168.30.13:6817/4268732910] [v2:192.168.30.13:6818/4268732910,v1:192.168.30.13:6819/4268732910] exists,up af1b55dd-c110-4861-aed9-c0737cef8be1
osd.13 down out weight 0 up_from 30389 up_thru 30688 down_at 30695 last_clean_interval [25513,30317) [v2:192.168.30.16:6804/1573803727,v1:192.168.30.16:6805/1573803727] [v2:192.168.30.16:6806/1573803727,v1:192.168.30.16:6807/1573803727] autoout,exists 737a3234-0f1f-4286-80e9-e89b581aae41
osd.14 up in weight 1 up_from 31218 up_thru 31709 down_at 31217 last_clean_interval [30979,31195) [v2:192.168.30.13:6834/2291187817,v1:192.168.30.13:6835/2291187817] [v2:192.168.30.13:6836/2291187817,v1:192.168.30.13:6837/2291187817] exists,up 97cd6ac7-aca0-42fd-a049-d27289f83183
osd.15 down out weight 0 up_from 29463 up_thru 30531 down_at 30688 last_clean_interval [25493,29462) [v2:192.168.30.15:6808/2655269991,v1:192.168.30.15:6809/2655269991] [v2:192.168.30.15:6802/2662269991,v1:192.168.30.15:6803/2662269991] autoout,exists 61aea8f4-5905-4be3-ae32-5eacf75a514e
osd.16 up in weight 1 up_from 31226 up_thru 31647 down_at 31223 last_clean_interval [30970,31195) [v2:192.168.30.13:6808/2624812081,v1:192.168.30.13:6809/2624812081] [v2:192.168.30.13:6810/2624812081,v1:192.168.30.13:6811/2624812081] exists,up 791a7542-87cd-403d-a37e-8f00506b2eb6
osd.17 up in weight 1 up_from 31219 up_thru 31703 down_at 31218 last_clean_interval [30975,31195) [v2:192.168.30.13:6800/2978036397,v1:192.168.30.13:6801/2978036397] [v2:192.168.30.13:6802/2978036397,v1:192.168.30.13:6803/2978036397] exists,up 4a915645-412f-49e6-8477-1577469905da
osd.18 down out weight 0 up_from 30334 up_thru 30566 down_at 30688 last_clean_interval [25543,30327) [v2:192.168.30.14:6832/985432137,v1:192.168.30.14:6833/985432137] [v2:192.168.30.14:6848/998432137,v1:192.168.30.14:6849/998432137] autoout,exists 85f59d83-710c-4896-9200-bda4894fc3e8
osd.19 down in weight 1 up_from 31303 up_thru 31321 down_at 31323 last_clean_interval [31292,31296) [v2:192.168.30.13:6826/375623427,v1:192.168.30.13:6827/375623427] [v2:192.168.30.13:6828/375623427,v1:192.168.30.13:6829/375623427] exists 7d09b51a-bd6d-40f8-a009-78ab9937795d
osd.20 down out weight 0 up_from 30445 up_thru 30531 down_at 30688 last_clean_interval [28947,30444) [v2:192.168.30.14:6810/4062649162,v1:192.168.30.14:6811/4062649162] [v2:192.168.30.14:6800/4073649162,v1:192.168.30.14:6801/4073649162] autoout,exists 7ef6cc1a-4755-4a14-b9df-f1f538d903ce
osd.21 up in weight 1 up_from 31345 up_thru 31567 down_at 31341 last_clean_interval [31307,31340) [v2:192.168.30.11:6826/1625231298,v1:192.168.30.11:6827/1625231298] [v2:192.168.30.11:6828/1625231298,v1:192.168.30.11:6829/1625231298] exists,up 5ef2e39a-a353-4cb8-a49e-093fe39b94ef
osd.22 down in weight 1 up_from 30383 up_thru 30528 down_at 30688 last_clean_interval [25549,30317) [v2:192.168.30.14:6806/1204256629,v1:192.168.30.14:6807/1204256629] [v2:192.168.30.14:6812/1204256629,v1:192.168.30.14:6813/1204256629] exists c9befe11-a035-449c-8d17-42aaf191923d
osd.23 down in weight 1 up_from 30334 up_thru 30576 down_at 30688 last_clean_interval [30263,30332) [v2:192.168.30.14:6802/3837786490,v1:192.168.30.14:6803/3837786490] [v2:192.168.30.14:6830/3838786490,v1:192.168.30.14:6831/3838786490] exists 2081147b-065d-4da7-89d9-747e1ae02b8d
osd.24 down in weight 1 up_from 29455 up_thru 30570 down_at 30688 last_clean_interval [25487,29453) [v2:192.168.30.15:6800/2008474583,v1:192.168.30.15:6801/2008474583] [v2:192.168.30.15:6810/2016474583,v1:192.168.30.15:6811/2016474583] exists 39d78380-261c-4689-b53d-90713e6ffcca
osd.26 up in weight 1 up_from 31208 up_thru 31643 down_at 31207 last_clean_interval [30967,31195) [v2:192.168.30.12:6800/2861018947,v1:192.168.30.12:6801/2861018947] [v2:192.168.30.12:6802/2861018947,v1:192.168.30.12:6803/2861018947] exists,up 046622c8-c09c-4254-8c15-3dc05a2f01ed
osd.28 down in weight 1 up_from 30389 up_thru 30574 down_at 30691 last_clean_interval [25513,30312) [v2:192.168.30.16:6820/3466284570,v1:192.168.30.16:6821/3466284570] [v2:192.168.30.16:6822/3466284570,v1:192.168.30.16:6823/3466284570] exists 10578b97-e3c4-4553-a8d0-6dcc46af5db1
osd.29 down in weight 1 up_from 30378 up_thru 30554 down_at 30688 last_clean_interval [28595,30376) [v2:192.168.30.14:6808/3739543672,v1:192.168.30.14:6809/3739543672] [v2:192.168.30.14:6846/3747543672,v1:192.168.30.14:6847/3747543672] exists 9698e936-8edf-4adf-92c9-a0b5202ed01a
osd.30 down in weight 1 up_from 30449 up_thru 30531 down_at 30688 last_clean_interval [25502,30446) [v2:192.168.30.15:6825/2375507296,v1:192.168.30.15:6827/2375507296] [v2:192.168.30.15:6829/2375507296,v1:192.168.30.15:6831/2375507296] exists e14d2a0f-a98a-44d4-8c06-4d893f673629
osd.31 down in weight 1 up_from 30364 up_thru 30688 down_at 30700 last_clean_interval [25514,30361) [v2:192.168.30.16:6826/2835000708,v1:192.168.30.16:6827/2835000708] [v2:192.168.30.16:6802/2843000708,v1:192.168.30.16:6803/2843000708] exists 541bca38-e704-483a-8cb8-39e5f69007d1
osd.32 up in weight 1 up_from 31209 up_thru 31627 down_at 31208 last_clean_interval [30974,31195) [v2:192.168.30.12:6832/3860067997,v1:192.168.30.12:6833/3860067997] [v2:192.168.30.12:6834/3860067997,v1:192.168.30.12:6835/3860067997] exists,up 9200a57e-2845-43ff-9787-8f1f3158fe90
osd.33 down in weight 1 up_from 30354 up_thru 30688 down_at 30693 last_clean_interval [25521,30350) [v2:192.168.30.16:6842/2342555666,v1:192.168.30.16:6843/2342555666] [v2:192.168.30.16:6844/2364555666,v1:192.168.30.16:6845/2364555666] exists 20c55d85-cf9a-4133-a189-7fdad2318f58
osd.34 down in weight 1 up_from 30390 up_thru 30688 down_at 30691 last_clean_interval [25516,30314) [v2:192.168.30.16:6808/2282629870,v1:192.168.30.16:6811/2282629870] [v2:192.168.30.16:6812/2282629870,v1:192.168.30.16:6814/2282629870] exists 77e0ef8f-c047-4f84-afb2-a8ad054e562f
osd.35 up in weight 1 up_from 31204 up_thru 31657 down_at 31203 last_clean_interval [30958,31195) [v2:192.168.30.13:6842/1919357520,v1:192.168.30.13:6843/1919357520] [v2:192.168.30.13:6844/1919357520,v1:192.168.30.13:6845/1919357520] exists,up 2d2de0cb-6d41-4957-a473-2bbe9ce227bf
osd.36 down in weight 1 up_from 29494 up_thru 30560 down_at 30688 last_clean_interval [25491,29492) [v2:192.168.30.15:6816/2153321591,v1:192.168.30.15:6817/2153321591] [v2:192.168.30.15:6842/2158321591,v1:192.168.30.15:6843/2158321591] exists 26114668-68b2-458b-89c2-cbad5507ab75
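For the OSDs that are "in" but refuse to come up (osd.34 is the one looping on
start_boot in the log excerpt quoted further down), my plan is to look at the
daemon on its host and then let cephadm restart it.  I am assuming the usual
cephadm unit naming of ceph-<fsid>@osd.<id>, with the fsid taken from the
ceph -s output below; please correct me if that is wrong:

[root@cn06 ~]# systemctl status ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d@osd.34.service
[root@cn06 ~]# journalctl -u ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d@osd.34.service --since "1 hour ago"
[ceph: root@cn01 /]# ceph orch daemon restart osd.34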
> On Jul 25, 2022, at 3:29 AM, Jeremy Hansen <farnsworth.mcfadden@xxxxxxxxx> wrote:
> >
> > I transitioned some servers to a new rack and now I'm having major issues
> > with Ceph upon bringing things back up.
> >
> > I believe the issue may be related to the ceph nodes coming back up with
> > different IPs before VLANs were set.  That's just a guess because I can't
> > think of any other reason this would happen.
> >
> > Current state:
> >
> > Every 2.0s: ceph -s        cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022
> >
> >   cluster:
> >     id:     bfa2ad58-c049-11eb-9098-3c8cf8ed728d
> >     health: HEALTH_WARN
> >             1 filesystem is degraded
> >             2 MDSs report slow metadata IOs
> >             2/5 mons down, quorum cn02,cn03,cn01
> >             9 osds down
> >             3 hosts (17 osds) down
> >             Reduced data availability: 97 pgs inactive, 9 pgs down
> >             Degraded data redundancy: 13860144/30824413 objects degraded
> >             (44.965%), 411 pgs degraded, 482 pgs undersized
> >
> >   services:
> >     mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: cn05, cn04
> >     mgr: cn02.arszct(active, since 5m)
> >     mds: 2/2 daemons up, 2 standby
> >     osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped pgs
> >
> >   data:
> >     volumes: 1/2 healthy, 1 recovering
> >     pools:   8 pools, 545 pgs
> >     objects: 7.71M objects, 6.7 TiB
> >     usage:   15 TiB used, 39 TiB / 54 TiB avail
> >     pgs:     0.367% pgs unknown
> >              17.431% pgs not active
> >              13860144/30824413 objects degraded (44.965%)
> >              1137693/30824413 objects misplaced (3.691%)
> >              280 active+undersized+degraded
> >              67  undersized+degraded+remapped+backfilling+peered
> >              57  active+undersized+remapped
> >              45  active+clean+remapped
> >              44  active+undersized+degraded+remapped+backfilling
> >              18  undersized+degraded+peered
> >              10  active+undersized
> >              9   down
> >              7   active+clean
> >              3   active+undersized+remapped+backfilling
> >              2   active+undersized+degraded+remapped+backfill_wait
> >              2   unknown
> >              1   undersized+peered
> >
> >   io:
> >     client:   170 B/s rd, 0 op/s rd, 0 op/s wr
> >     recovery: 168 MiB/s, 158 keys/s, 166 objects/s
> >
> > I have to disable and re-enable the dashboard just to use it.  It seems to
> > get bogged down after a few moments.
> >
> > The three servers that were moved to the new rack Ceph has marked as
> > "Down", but if I do a cephadm host-check, they all seem to pass:
> >
> > ************************ ceph ************************
> > --------- cn01.ceph.---------
> > podman (/usr/bin/podman) version 4.0.2 is present
> > systemctl is present
> > lvcreate is present
> > Unit chronyd.service is enabled and running
> > Host looks OK
> > --------- cn02.ceph.---------
> > podman (/usr/bin/podman) version 4.0.2 is present
> > systemctl is present
> > lvcreate is present
> > Unit chronyd.service is enabled and running
> > Host looks OK
> > --------- cn03.ceph.---------
> > podman (/usr/bin/podman) version 4.0.2 is present
> > systemctl is present
> > lvcreate is present
> > Unit chronyd.service is enabled and running
> > Host looks OK
> > --------- cn04.ceph.---------
> > podman (/usr/bin/podman) version 4.0.2 is present
> > systemctl is present
> > lvcreate is present
> > Unit chronyd.service is enabled and running
> > Host looks OK
> > --------- cn05.ceph.---------
> > podman|docker (/usr/bin/podman) is present
> > systemctl is present
> > lvcreate is present
> > Unit chronyd.service is enabled and running
> > Host looks OK
> > --------- cn06.ceph.---------
> > podman (/usr/bin/podman) version 4.0.2 is present
> > systemctl is present
> > lvcreate is present
> > Unit chronyd.service is enabled and running
> > Host looks OK
> >
> > It seems to be recovering with what it has left, but a large number of OSDs
> > are down.  When trying to restart one of the downed OSDs, I see a huge dump.
> >
> > Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+0000 7fce14a6c080  0 osd.34 30689 done with init, starting boot process
> > Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+0000 7fce14a6c080  1 osd.34 30689 start_boot
> > Jul 25 03:20:10 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:10.655+0000 7fcdfd12d700  1 osd.34 30689 start_boot
> > Jul 25 03:20:41 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:41.159+0000 7fcdfd12d700  1 osd.34 30689 start_boot
> > Jul 25 03:21:11 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:21:11.662+0000 7fcdfd12d700  1 osd.34 30689 start_boot
> >
> > At this point it just keeps printing start_boot, but the dashboard has it
> > marked as "in" but "down".
> >
> > On these three hosts that moved, there were a bunch marked as "out" and
> > "down", and some with "in" but "down".
> >
> > Not sure where to go next.  I'm going to let the recovery continue and hope
> > that my 4x replication on these pools saves me.
> >
> > Not sure where to go from here.  Any help is very much appreciated.  This
> > Ceph cluster holds all of our Cloudstack images... it would be terrible to
> > lose this data.
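Going back to the network question: since three hosts briefly came up on the
wrong IPs, I still need to confirm what the cluster itself thinks the public
network and mon addresses are.  Roughly this (standard config/mon/orch
commands, nothing cluster-specific beyond the host names already shown):

[ceph: root@cn01 /]# ceph config get mon public_network
[ceph: root@cn01 /]# ceph mon dump
[ceph: root@cn01 /]# ceph orch host ls

And while I sort that out I am tempted to set noout/norebalance so the cluster
stops marking more OSDs out and reshuffling data underneath me, though I would
appreciate a sanity check on whether that is wise mid-recovery:

[ceph: root@cn01 /]# ceph osd set noout
[ceph: root@cn01 /]# ceph osd set norebalance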
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx