Pretty desperate here.  Can someone suggest what I might be able to do to
get these OSDs back up?  It looks like my recovery has stalled.
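For anyone willing to take a look, this is roughly what I am checking next to
see where recovery is stuck.  It is all standard ceph CLI run from cephadm
shell; nothing here is specific to my setup beyond the prompt:

[ceph: root@cn01 /]# ceph health detail
[ceph: root@cn01 /]# ceph osd tree down
[ceph: root@cn01 /]# ceph pg dump_stuck inactive
[ceph: root@cn01 /]# ceph pg dump_stuck undersized

The inactive/undersized PG lists not shrinking over time is what makes me say
"stalled" rather than just "slow".  The earlier thread follows below.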
On Mon, Jul 25, 2022 at 7:26 AM Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:

> Do your values for public and cluster network include the new addresses
> on all nodes?

This cluster only has one network; there is no separation between public
and cluster.  Three of the nodes momentarily came up using a different IP
address.

I've also noticed that on one of the nodes that did not move and had no IP
issue, the dashboard names the same device for two different OSDs:

2  cn01  out  destroyed  hdd  TOSHIBA_MG04SCA40EE_21M0A0CKFWZB          Unknown  sdb  osd.2
3  cn01  out  destroyed  ssd  SAMSUNG_MZILT3T8HBLS/007_S5G0NE0R200159  Unknown  sdb  osd.3

[ceph: root@cn01 /]# ceph-volume inventory

Device Path   Size        rotates   available   Model name
/dev/sda      3.64 TB     True      False       MG04SCA40EE
/dev/sdb      3.49 TB     False     False       MZILT3T8HBLS/007
/dev/sdc      3.64 TB     True      False       MG04SCA40EE
/dev/sdd      3.64 TB     True      False       MG04SCA40EE
/dev/sde      3.49 TB     False     False       MZILT3T8HBLS/007
/dev/sdf      3.64 TB     True      False       MG04SCA40EE
/dev/sdg      698.64 GB   True      False       SEAGATE ST375064
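Before trusting that dashboard listing, I want to cross-check what each OSD
itself recorded against ceph-volume.  A rough plan (standard commands; osd ids
2 and 3 are the ones from the listing above, and since both are marked
destroyed the metadata may well come back empty):

[ceph: root@cn01 /]# ceph osd metadata 2
[ceph: root@cn01 /]# ceph osd metadata 3
[ceph: root@cn01 /]# ceph-volume lvm list

If the device information in the metadata disagrees with ceph-volume, I will
assume the dashboard entries for osd.2/osd.3 are just stale.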
[ceph: root@cn01 /]# ceph osd info
osd.0 down out weight 0 up_from 30231 up_thru 30564 down_at 30688 last_clean_interval [25500,30228) [v2:192.168.30.15:6818/2512683421,v1:192.168.30.15:6819/2512683421] [v2:192.168.30.15:6824/2512683421,v1:192.168.30.15:6826/2512683421] autoout,exists d14cf503-a303-4fa4-a713-9530b67d613a
osd.1 down out weight 0 up_from 30393 up_thru 30688 down_at 30697 last_clean_interval [25518,30321) [v2:192.168.30.16:6834/1781855831,v1:192.168.30.16:6835/1781855831] [v2:192.168.30.16:6836/1781855831,v1:192.168.30.16:6837/1781855831] autoout,exists 0d521411-c835-4fa3-beca-3631b4ff6bf7
osd.2 down out weight 0 up_from 31316 up_thru 31293 down_at 31317 last_clean_interval [31218,31296) [v2:192.168.30.11:6810/894589880,v1:192.168.30.11:6811/894589880] [v2:192.168.30.11:6812/894589880,v1:192.168.30.11:6813/894589880] destroyed,exists
osd.3 down out weight 0 up_from 31265 up_thru 31266 down_at 31268 last_clean_interval [31254,31256) [v2:192.168.30.11:6818/1641948535,v1:192.168.30.11:6819/1641948535] [v2:192.168.30.11:6820/1641948535,v1:192.168.30.11:6821/1641948535] destroyed,exists
osd.4 up in weight 1 up_from 31356 up_thru 31581 down_at 31339 last_clean_interval [31320,31338) [v2:192.168.30.11:6802/2785067179,v1:192.168.30.11:6803/2785067179] [v2:192.168.30.11:6804/2785067179,v1:192.168.30.11:6805/2785067179] exists,up 3afd06db-b91d-44fe-9305-5eb95f7a59b9
osd.5 up in weight 1 up_from 31347 up_thru 31699 down_at 31339 last_clean_interval [31311,31338) [v2:192.168.30.11:6818/1936771540,v1:192.168.30.11:6819/1936771540] [v2:192.168.30.11:6820/1936771540,v1:192.168.30.11:6821/1936771540] exists,up 063c2ccf-02ce-4f5e-8252-dddfbb258a95
osd.6 up in weight 1 up_from 31218 up_thru 31711 down_at 31217 last_clean_interval [30978,31195) [v2:192.168.30.12:6816/1585973160,v1:192.168.30.12:6817/1585973160] [v2:192.168.30.12:6818/1585973160,v1:192.168.30.12:6819/1585973160] exists,up 94250ea2-f12e-4dc6-9135-b626086ccffd
osd.7 down out weight 0 up_from 30353 up_thru 30558 down_at 30688 last_clean_interval [25533,30349) [v2:192.168.30.14:6816/4083104061,v1:192.168.30.14:6817/4083104061] [v2:192.168.30.14:6840/4094104061,v1:192.168.30.14:6841/4094104061] autoout,exists de351aec-b91e-4c22-a0bf-85369bc14579
osd.8 up in weight 1 up_from 31226 up_thru 31668 down_at 31225 last_clean_interval [30983,31195) [v2:192.168.30.12:6824/1312484329,v1:192.168.30.12:6825/1312484329] [v2:192.168.30.12:6826/1312484329,v1:192.168.30.12:6827/1312484329] exists,up 51f665b4-fa5b-4b17-8390-ed130145ef04
osd.9 up in weight 1 up_from 31351 up_thru 31673 down_at 31340 last_clean_interval [31315,31338) [v2:192.168.30.11:6810/1446838877,v1:192.168.30.11:6811/1446838877] [v2:192.168.30.11:6812/1446838877,v1:192.168.30.11:6813/1446838877] exists,up 985f1127-d126-4629-b8cd-03cf2d914d99
osd.10 up in weight 1 up_from 31219 up_thru 31639 down_at 31218 last_clean_interval [30980,31195) [v2:192.168.30.12:6808/1587842953,v1:192.168.30.12:6809/1587842953] [v2:192.168.30.12:6810/1587842953,v1:192.168.30.12:6811/1587842953] exists,up c7fca03e-4bd5-4485-a090-658ca967d5f6
osd.11 up in weight 1 up_from 31234 up_thru 31659 down_at 31223 last_clean_interval [30978,31195) [v2:192.168.30.12:6840/3403200742,v1:192.168.30.12:6841/3403200742] [v2:192.168.30.12:6842/3403200742,v1:192.168.30.12:6843/3403200742] exists,up 81074bd7-ad9f-4e56-8885-cca4745f6c95
osd.12 up in weight 1 up_from 31230 up_thru 31717 down_at 31223 last_clean_interval [30975,31195) [v2:192.168.30.13:6816/4268732910,v1:192.168.30.13:6817/4268732910] [v2:192.168.30.13:6818/4268732910,v1:192.168.30.13:6819/4268732910] exists,up af1b55dd-c110-4861-aed9-c0737cef8be1
osd.13 down out weight 0 up_from 30389 up_thru 30688 down_at 30695 last_clean_interval [25513,30317) [v2:192.168.30.16:6804/1573803727,v1:192.168.30.16:6805/1573803727] [v2:192.168.30.16:6806/1573803727,v1:192.168.30.16:6807/1573803727] autoout,exists 737a3234-0f1f-4286-80e9-e89b581aae41
osd.14 up in weight 1 up_from 31218 up_thru 31709 down_at 31217 last_clean_interval [30979,31195) [v2:192.168.30.13:6834/2291187817,v1:192.168.30.13:6835/2291187817] [v2:192.168.30.13:6836/2291187817,v1:192.168.30.13:6837/2291187817] exists,up 97cd6ac7-aca0-42fd-a049-d27289f83183
osd.15 down out weight 0 up_from 29463 up_thru 30531 down_at 30688 last_clean_interval [25493,29462) [v2:192.168.30.15:6808/2655269991,v1:192.168.30.15:6809/2655269991] [v2:192.168.30.15:6802/2662269991,v1:192.168.30.15:6803/2662269991] autoout,exists 61aea8f4-5905-4be3-ae32-5eacf75a514e
osd.16 up in weight 1 up_from 31226 up_thru 31647 down_at 31223 last_clean_interval [30970,31195) [v2:192.168.30.13:6808/2624812081,v1:192.168.30.13:6809/2624812081] [v2:192.168.30.13:6810/2624812081,v1:192.168.30.13:6811/2624812081] exists,up 791a7542-87cd-403d-a37e-8f00506b2eb6
osd.17 up in weight 1 up_from 31219 up_thru 31703 down_at 31218 last_clean_interval [30975,31195) [v2:192.168.30.13:6800/2978036397,v1:192.168.30.13:6801/2978036397] [v2:192.168.30.13:6802/2978036397,v1:192.168.30.13:6803/2978036397] exists,up 4a915645-412f-49e6-8477-1577469905da
osd.18 down out weight 0 up_from 30334 up_thru 30566 down_at 30688 last_clean_interval [25543,30327) [v2:192.168.30.14:6832/985432137,v1:192.168.30.14:6833/985432137] [v2:192.168.30.14:6848/998432137,v1:192.168.30.14:6849/998432137] autoout,exists 85f59d83-710c-4896-9200-bda4894fc3e8
osd.19 down in weight 1 up_from 31303 up_thru 31321 down_at 31323 last_clean_interval [31292,31296) [v2:192.168.30.13:6826/375623427,v1:192.168.30.13:6827/375623427] [v2:192.168.30.13:6828/375623427,v1:192.168.30.13:6829/375623427] exists 7d09b51a-bd6d-40f8-a009-78ab9937795d
osd.20 down out weight 0 up_from 30445 up_thru 30531 down_at 30688 last_clean_interval [28947,30444) [v2:192.168.30.14:6810/4062649162,v1:192.168.30.14:6811/4062649162] [v2:192.168.30.14:6800/4073649162,v1:192.168.30.14:6801/4073649162] autoout,exists 7ef6cc1a-4755-4a14-b9df-f1f538d903ce
osd.21 up in weight 1 up_from 31345 up_thru 31567 down_at 31341 last_clean_interval [31307,31340) [v2:192.168.30.11:6826/1625231298,v1:192.168.30.11:6827/1625231298] [v2:192.168.30.11:6828/1625231298,v1:192.168.30.11:6829/1625231298] exists,up 5ef2e39a-a353-4cb8-a49e-093fe39b94ef
osd.22 down in weight 1 up_from 30383 up_thru 30528 down_at 30688 last_clean_interval [25549,30317) [v2:192.168.30.14:6806/1204256629,v1:192.168.30.14:6807/1204256629] [v2:192.168.30.14:6812/1204256629,v1:192.168.30.14:6813/1204256629] exists c9befe11-a035-449c-8d17-42aaf191923d
osd.23 down in weight 1 up_from 30334 up_thru 30576 down_at 30688 last_clean_interval [30263,30332) [v2:192.168.30.14:6802/3837786490,v1:192.168.30.14:6803/3837786490] [v2:192.168.30.14:6830/3838786490,v1:192.168.30.14:6831/3838786490] exists 2081147b-065d-4da7-89d9-747e1ae02b8d
osd.24 down in weight 1 up_from 29455 up_thru 30570 down_at 30688 last_clean_interval [25487,29453) [v2:192.168.30.15:6800/2008474583,v1:192.168.30.15:6801/2008474583] [v2:192.168.30.15:6810/2016474583,v1:192.168.30.15:6811/2016474583] exists 39d78380-261c-4689-b53d-90713e6ffcca
osd.26 up in weight 1 up_from 31208 up_thru 31643 down_at 31207 last_clean_interval [30967,31195) [v2:192.168.30.12:6800/2861018947,v1:192.168.30.12:6801/2861018947] [v2:192.168.30.12:6802/2861018947,v1:192.168.30.12:6803/2861018947] exists,up 046622c8-c09c-4254-8c15-3dc05a2f01ed
osd.28 down in weight 1 up_from 30389 up_thru 30574 down_at 30691 last_clean_interval [25513,30312) [v2:192.168.30.16:6820/3466284570,v1:192.168.30.16:6821/3466284570] [v2:192.168.30.16:6822/3466284570,v1:192.168.30.16:6823/3466284570] exists 10578b97-e3c4-4553-a8d0-6dcc46af5db1
osd.29 down in weight 1 up_from 30378 up_thru 30554 down_at 30688 last_clean_interval [28595,30376) [v2:192.168.30.14:6808/3739543672,v1:192.168.30.14:6809/3739543672] [v2:192.168.30.14:6846/3747543672,v1:192.168.30.14:6847/3747543672] exists 9698e936-8edf-4adf-92c9-a0b5202ed01a
osd.30 down in weight 1 up_from 30449 up_thru 30531 down_at 30688 last_clean_interval [25502,30446) [v2:192.168.30.15:6825/2375507296,v1:192.168.30.15:6827/2375507296] [v2:192.168.30.15:6829/2375507296,v1:192.168.30.15:6831/2375507296] exists e14d2a0f-a98a-44d4-8c06-4d893f673629
osd.31 down in weight 1 up_from 30364 up_thru 30688 down_at 30700 last_clean_interval [25514,30361) [v2:192.168.30.16:6826/2835000708,v1:192.168.30.16:6827/2835000708] [v2:192.168.30.16:6802/2843000708,v1:192.168.30.16:6803/2843000708] exists 541bca38-e704-483a-8cb8-39e5f69007d1
osd.32 up in weight 1 up_from 31209 up_thru 31627 down_at 31208 last_clean_interval [30974,31195) [v2:192.168.30.12:6832/3860067997,v1:192.168.30.12:6833/3860067997] [v2:192.168.30.12:6834/3860067997,v1:192.168.30.12:6835/3860067997] exists,up 9200a57e-2845-43ff-9787-8f1f3158fe90
osd.33 down in weight 1 up_from 30354 up_thru 30688 down_at 30693 last_clean_interval [25521,30350) [v2:192.168.30.16:6842/2342555666,v1:192.168.30.16:6843/2342555666] [v2:192.168.30.16:6844/2364555666,v1:192.168.30.16:6845/2364555666] exists 20c55d85-cf9a-4133-a189-7fdad2318f58
osd.34 down in weight 1 up_from 30390 up_thru 30688 down_at 30691 last_clean_interval [25516,30314) [v2:192.168.30.16:6808/2282629870,v1:192.168.30.16:6811/2282629870] [v2:192.168.30.16:6812/2282629870,v1:192.168.30.16:6814/2282629870] exists 77e0ef8f-c047-4f84-afb2-a8ad054e562f
osd.35 up in weight 1 up_from 31204 up_thru 31657 down_at 31203 last_clean_interval [30958,31195) [v2:192.168.30.13:6842/1919357520,v1:192.168.30.13:6843/1919357520] [v2:192.168.30.13:6844/1919357520,v1:192.168.30.13:6845/1919357520] exists,up 2d2de0cb-6d41-4957-a473-2bbe9ce227bf
osd.36 down in weight 1 up_from 29494 up_thru 30560 down_at 30688 last_clean_interval [25491,29492) [v2:192.168.30.15:6816/2153321591,v1:192.168.30.15:6817/2153321591] [v2:192.168.30.15:6842/2158321591,v1:192.168.30.15:6843/2158321591] exists 26114668-68b2-458b-89c2-cbad5507ab75
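For the OSDs that are "in" but refuse to come up (osd.34 is the one looping on
start_boot in the log excerpt quoted further down), my plan is to look at the
daemon on its host and then let cephadm restart it.  I am assuming the usual
cephadm unit naming of ceph-<fsid>@osd.<id>, with the fsid taken from the
ceph -s output below; please correct me if that is wrong:

[root@cn06 ~]# systemctl status ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d@osd.34.service
[root@cn06 ~]# journalctl -u ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d@osd.34.service --since "1 hour ago"
[ceph: root@cn01 /]# ceph orch daemon restart osd.34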
> On Jul 25, 2022, at 3:29 AM, Jeremy Hansen <farnsworth.mcfadden@xxxxxxxxx> wrote:
> >
> > I transitioned some servers to a new rack and now I'm having major issues
> > with Ceph upon bringing things back up.
> >
> > I believe the issue may be related to the ceph nodes coming back up with
> > different IPs before VLANs were set.  That's just a guess because I can't
> > think of any other reason this would happen.
> >
> > Current state:
> >
> > Every 2.0s: ceph -s        cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022
> >
> >   cluster:
> >     id:     bfa2ad58-c049-11eb-9098-3c8cf8ed728d
> >     health: HEALTH_WARN
> >             1 filesystem is degraded
> >             2 MDSs report slow metadata IOs
> >             2/5 mons down, quorum cn02,cn03,cn01
> >             9 osds down
> >             3 hosts (17 osds) down
> >             Reduced data availability: 97 pgs inactive, 9 pgs down
> >             Degraded data redundancy: 13860144/30824413 objects degraded
> >             (44.965%), 411 pgs degraded, 482 pgs undersized
> >
> >   services:
> >     mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: cn05, cn04
> >     mgr: cn02.arszct(active, since 5m)
> >     mds: 2/2 daemons up, 2 standby
> >     osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped pgs
> >
> >   data:
> >     volumes: 1/2 healthy, 1 recovering
> >     pools:   8 pools, 545 pgs
> >     objects: 7.71M objects, 6.7 TiB
> >     usage:   15 TiB used, 39 TiB / 54 TiB avail
> >     pgs:     0.367% pgs unknown
> >              17.431% pgs not active
> >              13860144/30824413 objects degraded (44.965%)
> >              1137693/30824413 objects misplaced (3.691%)
> >              280 active+undersized+degraded
> >              67  undersized+degraded+remapped+backfilling+peered
> >              57  active+undersized+remapped
> >              45  active+clean+remapped
> >              44  active+undersized+degraded+remapped+backfilling
> >              18  undersized+degraded+peered
> >              10  active+undersized
> >              9   down
> >              7   active+clean
> >              3   active+undersized+remapped+backfilling
> >              2   active+undersized+degraded+remapped+backfill_wait
> >              2   unknown
> >              1   undersized+peered
> >
> >   io:
> >     client:   170 B/s rd, 0 op/s rd, 0 op/s wr
> >     recovery: 168 MiB/s, 158 keys/s, 166 objects/s
> >
> > I have to disable and re-enable the dashboard just to use it.  It seems to
> > get bogged down after a few moments.
> >
> > The three servers that were moved to the new rack Ceph has marked as
> > "Down", but if I do a cephadm host-check, they all seem to pass:
> >
> > ************************ ceph ************************
> > --------- cn01.ceph.---------
> > podman (/usr/bin/podman) version 4.0.2 is present
> > systemctl is present
> > lvcreate is present
> > Unit chronyd.service is enabled and running
> > Host looks OK
> > --------- cn02.ceph.---------
> > podman (/usr/bin/podman) version 4.0.2 is present
> > systemctl is present
> > lvcreate is present
> > Unit chronyd.service is enabled and running
> > Host looks OK
> > --------- cn03.ceph.---------
> > podman (/usr/bin/podman) version 4.0.2 is present
> > systemctl is present
> > lvcreate is present
> > Unit chronyd.service is enabled and running
> > Host looks OK
> > --------- cn04.ceph.---------
> > podman (/usr/bin/podman) version 4.0.2 is present
> > systemctl is present
> > lvcreate is present
> > Unit chronyd.service is enabled and running
> > Host looks OK
> > --------- cn05.ceph.---------
> > podman|docker (/usr/bin/podman) is present
> > systemctl is present
> > lvcreate is present
> > Unit chronyd.service is enabled and running
> > Host looks OK
> > --------- cn06.ceph.---------
> > podman (/usr/bin/podman) version 4.0.2 is present
> > systemctl is present
> > lvcreate is present
> > Unit chronyd.service is enabled and running
> > Host looks OK
> >
> > It seems to be recovering with what it has left, but a large number of OSDs
> > are down.  When trying to restart one of the downed OSDs, I see a huge dump.
> >
> > Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+0000 7fce14a6c080  0 osd.34 30689 done with init, starting boot process
> > Jul 25 03:19:38 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:19:38.532+0000 7fce14a6c080  1 osd.34 30689 start_boot
> > Jul 25 03:20:10 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:10.655+0000 7fcdfd12d700  1 osd.34 30689 start_boot
> > Jul 25 03:20:41 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:20:41.159+0000 7fcdfd12d700  1 osd.34 30689 start_boot
> > Jul 25 03:21:11 cn06.ceph ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug 2022-07-25T10:21:11.662+0000 7fcdfd12d700  1 osd.34 30689 start_boot
> >
> > At this point it just keeps printing start_boot, but the dashboard has it
> > marked as "in" but "down".
> >
> > On these three hosts that moved, there were a bunch marked as "out" and
> > "down", and some with "in" but "down".
> >
> > Not sure where to go next.  I'm going to let the recovery continue and hope
> > that my 4x replication on these pools saves me.
> >
> > Not sure where to go from here.  Any help is very much appreciated.  This
> > Ceph cluster holds all of our Cloudstack images... it would be terrible to
> > lose this data.
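Going back to the network question: since three hosts briefly came up on the
wrong IPs, I still need to confirm what the cluster itself thinks the public
network and mon addresses are.  Roughly this (standard config/mon/orch
commands, nothing cluster-specific beyond the host names already shown):

[ceph: root@cn01 /]# ceph config get mon public_network
[ceph: root@cn01 /]# ceph mon dump
[ceph: root@cn01 /]# ceph orch host ls

And while I sort that out I am tempted to set noout/norebalance so the cluster
stops marking more OSDs out and reshuffling data underneath me, though I would
appreciate a sanity check on whether that is wise mid-recovery:

[ceph: root@cn01 /]# ceph osd set noout
[ceph: root@cn01 /]# ceph osd set norebalance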
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx