I noticed this on the initial run of ceph health, but I no longer see it. When you say "I don't use ceph-adm", can you explain why it's bad? This is ceph health outside of the cephadm shell:

HEALTH_WARN 1 filesystem is degraded; 2 MDSs report slow metadata IOs; 2/5 mons down, quorum cn02,cn03,cn01; 10 osds down; 3 hosts (17 osds) down; Reduced data availability: 13 pgs inactive, 9 pgs down; Degraded data redundancy: 8515690/30862245 objects degraded (27.593%), 326 pgs degraded, 447 pgs undersized
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs coldlogix is degraded
[WRN] MDS_SLOW_METADATA_IO: 2 MDSs report slow metadata IOs
    mds.coldlogix.cn01.uriofo(mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest blocked for 3701 secs
    mds.btc.cn02.ouvaus(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 382 secs
[WRN] MON_DOWN: 2/5 mons down, quorum cn02,cn03,cn01
    mon.cn05 (rank 0) addr [v2:192.168.30.15:3300/0,v1:192.168.30.15:6789/0] is down (out of quorum)
    mon.cn04 (rank 3) addr [v2:192.168.30.14:3300/0,v1:192.168.30.14:6789/0] is down (out of quorum)
[WRN] OSD_DOWN: 10 osds down
    osd.0 (root=default,host=cn05) is down
    osd.1 (root=default,host=cn06) is down
    osd.7 (root=default,host=cn04) is down
    osd.13 (root=default,host=cn06) is down
    osd.15 (root=default,host=cn05) is down
    osd.18 (root=default,host=cn04) is down
    osd.20 (root=default,host=cn04) is down
    osd.33 (root=default,host=cn06) is down
    osd.34 (root=default,host=cn06) is down
    osd.36 (root=default,host=cn05) is down
[WRN] OSD_HOST_DOWN: 3 hosts (17 osds) down
    host cn04 (root=default) (6 osds) is down
    host cn05 (root=default) (5 osds) is down
    host cn06 (root=default) (6 osds) is down
[WRN] PG_AVAILABILITY: Reduced data availability: 13 pgs inactive, 9 pgs down
    pg 9.3a is down, acting [8]
    pg 9.7a is down, acting [8]
    pg 9.ba is down, acting [8]
    pg 9.fa is down, acting [8]
    pg 11.3 is stuck inactive for 39h, current state undersized+degraded+peered, last acting [11]
    pg 11.11 is down, acting [19,9]
    pg 11.1f is stuck inactive for 13h, current state undersized+degraded+peered, last acting [10]
    pg 12.36 is down, acting [21,16]
    pg 12.59 is down, acting [26,5]
    pg 12.66 is down, acting [5]
    pg 19.4 is stuck inactive for 39h, current state undersized+degraded+peered, last acting [6]
    pg 19.1c is down, acting [21,16,11]
    pg 21.1 is stuck inactive for 2m, current state unknown, last acting []
[WRN] PG_DEGRADED: Degraded data redundancy: 8515690/30862245 objects degraded (27.593%), 326 pgs degraded, 447 pgs undersized
    pg 9.75 is stuck undersized for 61m, current state active+undersized+remapped, last acting [4,8,35]
    pg 9.76 is stuck undersized for 62m, current state active+undersized+degraded, last acting [35,10,21]
    pg 9.77 is stuck undersized for 61m, current state active+undersized+remapped, last acting [32,35,4]
    pg 9.78 is stuck undersized for 62m, current state active+undersized+degraded, last acting [14,10]
    pg 9.79 is stuck undersized for 62m, current state active+undersized+degraded, last acting [21,32]
    pg 9.7b is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,12,5]
    pg 9.7c is stuck undersized for 61m, current state active+undersized+degraded, last acting [4,35,10]
    pg 9.7d is stuck undersized for 62m, current state active+undersized+degraded, last acting [5,19,10]
    pg 9.7e is stuck undersized for 62m, current state active+undersized+remapped, last acting [21,10,17]
    pg 9.80 is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,4,17]
    pg 9.81 is stuck undersized for 62m, current state active+undersized+degraded, last acting [14,26]
    pg 9.82 is stuck undersized for 62m, current state active+undersized+degraded, last acting [26,16]
    pg 9.83 is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,4]
    pg 9.84 is stuck undersized for 61m, current state active+undersized+degraded, last acting [4,35,6]
    pg 9.85 is stuck undersized for 61m, current state active+undersized+degraded, last acting [32,12,9]
    pg 9.86 is stuck undersized for 61m, current state active+undersized+degraded, last acting [35,5,8]
    pg 9.87 is stuck undersized for 61m, current state active+undersized+degraded, last acting [9,12]
    pg 9.88 is stuck undersized for 62m, current state active+undersized+remapped, last acting [19,32,35]
    pg 9.89 is stuck undersized for 61m, current state active+undersized+degraded, last acting [10,14,4]
    pg 9.8a is stuck undersized for 62m, current state active+undersized+degraded, last acting [21,19]
    pg 9.8b is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,35]
    pg 9.8c is stuck undersized for 58m, current state active+undersized+remapped, last acting [10,19,5]
    pg 9.8d is stuck undersized for 61m, current state active+undersized+remapped, last acting [9,6]
    pg 9.8f is stuck undersized for 62m, current state active+undersized+remapped, last acting [19,26,17]
    pg 9.90 is stuck undersized for 62m, current state active+undersized+remapped, last acting [35,26]
    pg 9.91 is stuck undersized for 62m, current state active+undersized+degraded, last acting [17,5]
    pg 9.92 is stuck undersized for 62m, current state active+undersized+degraded, last acting [21,26]
    pg 9.93 is stuck undersized for 62m, current state active+undersized+remapped, last acting [19,26,5]
    pg 9.94 is stuck undersized for 62m, current state active+undersized+degraded, last acting [21,11]
    pg 9.95 is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,19]
    pg 9.96 is stuck undersized for 62m, current state active+undersized+degraded, last acting [17,6]
    pg 9.97 is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,9,16]
    pg 9.98 is stuck undersized for 62m, current state active+undersized+degraded, last acting [6,21]
    pg 9.99 is stuck undersized for 61m, current state active+undersized+degraded, last acting [10,9]
    pg 9.9a is stuck undersized for 61m, current state active+undersized+remapped, last acting [4,16,10]
    pg 9.9b is stuck undersized for 61m, current state active+undersized+degraded, last acting [12,4,11]
    pg 9.9c is stuck undersized for 61m, current state active+undersized+degraded, last acting [9,16]
    pg 9.9d is stuck undersized for 62m, current state active+undersized+degraded, last acting [26,35]
    pg 9.9f is stuck undersized for 61m, current state active+undersized+degraded, last acting [9,17,26]
    pg 12.70 is stuck undersized for 62m, current state active+undersized+degraded, last acting [21,35]
    pg 12.71 is active+undersized+degraded, acting [6,12]
    pg 12.72 is stuck undersized for 61m, current state active+undersized+degraded, last acting [10,14,4]
    pg 12.73 is stuck undersized for 62m, current state active+undersized+remapped, last acting [5,17,11]
    pg 12.78 is stuck undersized for 61m, current state active+undersized+degraded, last acting [5,8,35]
    pg 12.79 is stuck undersized for 61m, current state active+undersized+degraded, last acting [4,17]
    pg 12.7a is stuck undersized for 62m, current state active+undersized+degraded, last acting [10,21]
    pg 12.7b is stuck undersized for 62m, current state active+undersized+remapped, last acting [17,21,11]
    pg 12.7c is stuck undersized for 62m, current state active+undersized+degraded, last acting [32,21,16]
    pg 12.7d is stuck undersized for 61m, current state active+undersized+degraded, last acting [35,6,9]
    pg 12.7e is stuck undersized for 61m, current state active+undersized+degraded, last acting [26,4]
    pg 12.7f is stuck undersized for 61m, current state active+undersized+degraded, last acting [9,14]

It's no longer reporting the SSH key issues, but that hasn't done anything to improve my situation. When the machines came up with a different IP, did that somehow throw off an SSH known_hosts file or public key exchange? It's all very strange that a momentarily wrong IP could wreak so much havoc. I've put the exact checks I plan to run from the admin node at the very bottom of this mail, below the quoted thread; please tell me if that's the wrong approach.

Thank you -jeremy

On Mon, Jul 25, 2022 at 1:44 PM Frank Schilder <frans@xxxxxx> wrote:
> I don't use ceph-adm and I also don't know how you got the "some more > info". However, I did notice that it contains instructions, starting at > "Please make sure that the host is reachable ...". How about starting to > follow those? > > Best regards, > ================= > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ________________________________________ > From: Jeremy Hansen <farnsworth.mcfadden@xxxxxxxxx> > Sent: 25 July 2022 22:32:32 > To: ceph-users@xxxxxxx > Subject: [Warning Possible spam] Re: Issues after a shutdown > > Here's some more info: > > HEALTH_WARN 2 failed cephadm daemon(s); 3 hosts fail cephadm check; 2 > filesystems are degraded; 1 MDSs report slow metadata IOs; 2/5 mons down, > quorum cn02,cn03,cn01; 10 osds down; 3 hosts (17 osds) down; Reduced data > availability: 13 pgs inactive, 9 pgs down; Degraded data redundancy: > 8515690/30862245 objects degraded (27.593%), 326 pgs degraded, 447 pgs > undersized > [WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s) > daemon osd.3 on cn01.ceph is in error state > daemon osd.2 on cn01.ceph is in error state > [WRN] CEPHADM_HOST_CHECK_FAILED: 3 hosts fail cephadm check > host cn04.ceph (192.168.30.14) failed check: Failed to connect to > cn04.ceph (192.168.30.14). > Please make sure that the host is reachable and accepts connections using > the cephadm SSH key > > To add the cephadm SSH key to the host: > > ceph cephadm get-pub-key > ~/ceph.pub > > ssh-copy-id -f -i ~/ceph.pub root@192.168.30.14 > > To check that the host is reachable open a new shell with the --no-hosts > flag: > > cephadm shell --no-hosts > > Then run the following: > > ceph cephadm get-ssh-config > ssh_config > > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key > > chmod 0600 ~/cephadm_private_key > > ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.14 > host cn06.ceph (192.168.30.16) failed check: Failed to connect to > cn06.ceph (192.168.30.16). > Please make sure that the host is reachable and accepts connections using > the cephadm SSH key > > To add the cephadm SSH key to the host: > > ceph cephadm get-pub-key > ~/ceph.pub > > ssh-copy-id -f -i ~/ceph.pub root@192.168.30.16 > > To check that the host is reachable open a new shell with the --no-hosts > flag: > > cephadm shell --no-hosts > > Then run the following: > > ceph cephadm get-ssh-config > ssh_config > > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key > > chmod 0600 ~/cephadm_private_key > > ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.16 > host cn05.ceph (192.168.30.15) failed check: Failed to connect to > cn05.ceph (192.168.30.15). 
> Please make sure that the host is reachable and accepts connections using > the cephadm SSH key > > To add the cephadm SSH key to the host: > > ceph cephadm get-pub-key > ~/ceph.pub > > ssh-copy-id -f -i ~/ceph.pub root@192.168.30.15 > > To check that the host is reachable open a new shell with the --no-hosts > flag: > > cephadm shell --no-hosts > > Then run the following: > > ceph cephadm get-ssh-config > ssh_config > > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key > > chmod 0600 ~/cephadm_private_key > > ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.15 > [WRN] FS_DEGRADED: 2 filesystems are degraded > fs coldlogix is degraded > fs btc is degraded > [WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs > mds.coldlogix.cn01.uriofo(mds.0): 2 slow metadata IOs are blocked > 30 > secs, oldest blocked for 2096 secs > [WRN] MON_DOWN: 2/5 mons down, quorum cn02,cn03,cn01 > mon.cn05 (rank 0) addr [v2: > 192.168.30.15:3300/0,v1:192.168.30.15:6789/0] > is down (out of quorum) > mon.cn04 (rank 3) addr [v2: > 192.168.30.14:3300/0,v1:192.168.30.14:6789/0] > is down (out of quorum) > [WRN] OSD_DOWN: 10 osds down > osd.0 (root=default,host=cn05) is down > osd.1 (root=default,host=cn06) is down > osd.7 (root=default,host=cn04) is down > osd.13 (root=default,host=cn06) is down > osd.15 (root=default,host=cn05) is down > osd.18 (root=default,host=cn04) is down > osd.20 (root=default,host=cn04) is down > osd.33 (root=default,host=cn06) is down > osd.34 (root=default,host=cn06) is down > osd.36 (root=default,host=cn05) is down > [WRN] OSD_HOST_DOWN: 3 hosts (17 osds) down > host cn04 (root=default) (6 osds) is down > host cn05 (root=default) (5 osds) is down > host cn06 (root=default) (6 osds) is down > [WRN] PG_AVAILABILITY: Reduced data availability: 13 pgs inactive, 9 pgs > down > pg 9.3a is down, acting [8] > pg 9.7a is down, acting [8] > pg 9.ba is down, acting [8] > pg 9.fa is down, acting [8] > pg 11.3 is stuck inactive for 39h, current state > undersized+degraded+peered, last acting [11] > pg 11.11 is down, acting [19,9] > pg 11.1f is stuck inactive for 13h, current state > undersized+degraded+peered, last acting [10] > pg 12.36 is down, acting [21,16] > pg 12.59 is down, acting [26,5] > pg 12.66 is down, acting [5] > pg 19.4 is stuck inactive for 39h, current state > undersized+degraded+peered, last acting [6] > pg 19.1c is down, acting [21,16,11] > pg 21.1 is stuck inactive for 36m, current state unknown, last acting > [] > [WRN] PG_DEGRADED: Degraded data redundancy: 8515690/30862245 objects > degraded (27.593%), 326 pgs degraded, 447 pgs undersized > pg 9.75 is stuck undersized for 34m, current state > active+undersized+remapped, last acting [4,8,35] > pg 9.76 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [35,10,21] > pg 9.77 is stuck undersized for 34m, current state > active+undersized+remapped, last acting [32,35,4] > pg 9.78 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [14,10] > pg 9.79 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [21,32] > pg 9.7b is stuck undersized for 34m, current state > active+undersized+degraded, last acting [8,12,5] > pg 9.7c is stuck undersized for 34m, current state > active+undersized+degraded, last acting [4,35,10] > pg 9.7d is stuck undersized for 35m, current state > active+undersized+degraded, last acting [5,19,10] > pg 9.7e is stuck undersized for 35m, current state > active+undersized+remapped, 
last acting [21,10,17] > pg 9.80 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [8,4,17] > pg 9.81 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [14,26] > pg 9.82 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [26,16] > pg 9.83 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [8,4] > pg 9.84 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [4,35,6] > pg 9.85 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [32,12,9] > pg 9.86 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [35,5,8] > pg 9.87 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [9,12] > pg 9.88 is stuck undersized for 35m, current state > active+undersized+remapped, last acting [19,32,35] > pg 9.89 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [10,14,4] > pg 9.8a is stuck undersized for 35m, current state > active+undersized+degraded, last acting [21,19] > pg 9.8b is stuck undersized for 34m, current state > active+undersized+degraded, last acting [8,35] > pg 9.8c is stuck undersized for 31m, current state > active+undersized+remapped, last acting [10,19,5] > pg 9.8d is stuck undersized for 35m, current state > active+undersized+remapped, last acting [9,6] > pg 9.8f is stuck undersized for 35m, current state > active+undersized+remapped, last acting [19,26,17] > pg 9.90 is stuck undersized for 35m, current state > active+undersized+remapped, last acting [35,26] > pg 9.91 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [17,5] > pg 9.92 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [21,26] > pg 9.93 is stuck undersized for 35m, current state > active+undersized+remapped, last acting [19,26,5] > pg 9.94 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [21,11] > pg 9.95 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [8,19] > pg 9.96 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [17,6] > pg 9.97 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [8,9,16] > pg 9.98 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [6,21] > pg 9.99 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [10,9] > pg 9.9a is stuck undersized for 34m, current state > active+undersized+remapped, last acting [4,16,10] > pg 9.9b is stuck undersized for 34m, current state > active+undersized+degraded, last acting [12,4,11] > pg 9.9c is stuck undersized for 35m, current state > active+undersized+degraded, last acting [9,16] > pg 9.9d is stuck undersized for 35m, current state > active+undersized+degraded, last acting [26,35] > pg 9.9f is stuck undersized for 35m, current state > active+undersized+degraded, last acting [9,17,26] > pg 12.70 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [21,35] > pg 12.71 is active+undersized+degraded, acting [6,12] > pg 12.72 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [10,14,4] > pg 12.73 is stuck undersized for 35m, current state > active+undersized+remapped, last acting [5,17,11] > pg 12.78 is stuck undersized for 34m, current state > 
active+undersized+degraded, last acting [5,8,35] > pg 12.79 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [4,17] > pg 12.7a is stuck undersized for 35m, current state > active+undersized+degraded, last acting [10,21] > pg 12.7b is stuck undersized for 35m, current state > active+undersized+remapped, last acting [17,21,11] > pg 12.7c is stuck undersized for 35m, current state > active+undersized+degraded, last acting [32,21,16] > pg 12.7d is stuck undersized for 35m, current state > active+undersized+degraded, last acting [35,6,9] > pg 12.7e is stuck undersized for 34m, current state > active+undersized+degraded, last acting [26,4] > pg 12.7f is stuck undersized for 35m, current state > active+undersized+degraded, last acting [9,14] > > On Mon, Jul 25, 2022 at 12:43 PM Jeremy Hansen < > farnsworth.mcfadden@xxxxxxxxx> wrote: > > > Pretty desperate here. Can someone suggest what I might be able to do to > > get these OSDs back up. It looks like my recovery had stalled. > > > > > > On Mon, Jul 25, 2022 at 7:26 AM Anthony D'Atri <anthony.datri@xxxxxxxxx> > > wrote: > > > >> Do your values for public and cluster network include the new addresses > >> on all nodes? > >> > > > > This cluster only has one network. There is no separation between > > public and cluster. Three of the nodes momentarily came up using a > > different IP address. > > > > I've also noticed on one of the nodes that did not move or have any IP > > issue, on a single node, from the dashboard, it names the same device for > > two different osd's: > > > > 2 cn01 out destroyed hdd TOSHIBA_MG04SCA40EE_21M0A0CKFWZB Unknown sdb > osd.2 > > > > 3 cn01 out destroyed ssd SAMSUNG_MZILT3T8HBLS/007_S5G0NE0R200159 Unknown > > sdb osd.3 > > > > > > [ceph: root@cn01 /]# ceph-volume inventory > > > > Device Path Size rotates available Model name > > /dev/sda 3.64 TB True False MG04SCA40EE > > /dev/sdb 3.49 TB False False MZILT3T8HBLS/007 > > /dev/sdc 3.64 TB True False MG04SCA40EE > > /dev/sdd 3.64 TB True False MG04SCA40EE > > /dev/sde 3.49 TB False False MZILT3T8HBLS/007 > > /dev/sdf 3.64 TB True False MG04SCA40EE > > /dev/sdg 698.64 GB True False SEAGATE ST375064 > > > > [ceph: root@cn01 /]# ceph osd info > > osd.0 down out weight 0 up_from 30231 up_thru 30564 down_at 30688 > > last_clean_interval [25500,30228) [v2: > > 192.168.30.15:6818/2512683421,v1:192.168.30.15:6819/2512683421] [v2: > > 192.168.30.15:6824/2512683421,v1:192.168.30.15:6826/2512683421] > > autoout,exists d14cf503-a303-4fa4-a713-9530b67d613a > > osd.1 down out weight 0 up_from 30393 up_thru 30688 down_at 30697 > > last_clean_interval [25518,30321) [v2: > > 192.168.30.16:6834/1781855831,v1:192.168.30.16:6835/1781855831] [v2: > > 192.168.30.16:6836/1781855831,v1:192.168.30.16:6837/1781855831] > > autoout,exists 0d521411-c835-4fa3-beca-3631b4ff6bf7 > > osd.2 down out weight 0 up_from 31316 up_thru 31293 down_at 31317 > > last_clean_interval [31218,31296) [v2: > > 192.168.30.11:6810/894589880,v1:192.168.30.11:6811/894589880] [v2: > > 192.168.30.11:6812/894589880,v1:192.168.30.11:6813/894589880] > > destroyed,exists > > osd.3 down out weight 0 up_from 31265 up_thru 31266 down_at 31268 > > last_clean_interval [31254,31256) [v2: > > 192.168.30.11:6818/1641948535,v1:192.168.30.11:6819/1641948535] [v2: > > 192.168.30.11:6820/1641948535,v1:192.168.30.11:6821/1641948535] > > destroyed,exists > > osd.4 up in weight 1 up_from 31356 up_thru 31581 down_at 31339 > > last_clean_interval [31320,31338) [v2: > > 
192.168.30.11:6802/2785067179,v1:192.168.30.11:6803/2785067179] [v2: > > 192.168.30.11:6804/2785067179,v1:192.168.30.11:6805/2785067179] > exists,up > > 3afd06db-b91d-44fe-9305-5eb95f7a59b9 > > osd.5 up in weight 1 up_from 31347 up_thru 31699 down_at 31339 > > last_clean_interval [31311,31338) [v2: > > 192.168.30.11:6818/1936771540,v1:192.168.30.11:6819/1936771540] [v2: > > 192.168.30.11:6820/1936771540,v1:192.168.30.11:6821/1936771540] > exists,up > > 063c2ccf-02ce-4f5e-8252-dddfbb258a95 > > osd.6 up in weight 1 up_from 31218 up_thru 31711 down_at 31217 > > last_clean_interval [30978,31195) [v2: > > 192.168.30.12:6816/1585973160,v1:192.168.30.12:6817/1585973160] [v2: > > 192.168.30.12:6818/1585973160,v1:192.168.30.12:6819/1585973160] > exists,up > > 94250ea2-f12e-4dc6-9135-b626086ccffd > > osd.7 down out weight 0 up_from 30353 up_thru 30558 down_at 30688 > > last_clean_interval [25533,30349) [v2: > > 192.168.30.14:6816/4083104061,v1:192.168.30.14:6817/4083104061] [v2: > > 192.168.30.14:6840/4094104061,v1:192.168.30.14:6841/4094104061] > > autoout,exists de351aec-b91e-4c22-a0bf-85369bc14579 > > osd.8 up in weight 1 up_from 31226 up_thru 31668 down_at 31225 > > last_clean_interval [30983,31195) [v2: > > 192.168.30.12:6824/1312484329,v1:192.168.30.12:6825/1312484329] [v2: > > 192.168.30.12:6826/1312484329,v1:192.168.30.12:6827/1312484329] > exists,up > > 51f665b4-fa5b-4b17-8390-ed130145ef04 > > osd.9 up in weight 1 up_from 31351 up_thru 31673 down_at 31340 > > last_clean_interval [31315,31338) [v2: > > 192.168.30.11:6810/1446838877,v1:192.168.30.11:6811/1446838877] [v2: > > 192.168.30.11:6812/1446838877,v1:192.168.30.11:6813/1446838877] > exists,up > > 985f1127-d126-4629-b8cd-03cf2d914d99 > > osd.10 up in weight 1 up_from 31219 up_thru 31639 down_at 31218 > > last_clean_interval [30980,31195) [v2: > > 192.168.30.12:6808/1587842953,v1:192.168.30.12:6809/1587842953] [v2: > > 192.168.30.12:6810/1587842953,v1:192.168.30.12:6811/1587842953] > exists,up > > c7fca03e-4bd5-4485-a090-658ca967d5f6 > > osd.11 up in weight 1 up_from 31234 up_thru 31659 down_at 31223 > > last_clean_interval [30978,31195) [v2: > > 192.168.30.12:6840/3403200742,v1:192.168.30.12:6841/3403200742] [v2: > > 192.168.30.12:6842/3403200742,v1:192.168.30.12:6843/3403200742] > exists,up > > 81074bd7-ad9f-4e56-8885-cca4745f6c95 > > osd.12 up in weight 1 up_from 31230 up_thru 31717 down_at 31223 > > last_clean_interval [30975,31195) [v2: > > 192.168.30.13:6816/4268732910,v1:192.168.30.13:6817/4268732910] [v2: > > 192.168.30.13:6818/4268732910,v1:192.168.30.13:6819/4268732910] > exists,up > > af1b55dd-c110-4861-aed9-c0737cef8be1 > > osd.13 down out weight 0 up_from 30389 up_thru 30688 down_at 30695 > > last_clean_interval [25513,30317) [v2: > > 192.168.30.16:6804/1573803727,v1:192.168.30.16:6805/1573803727] [v2: > > 192.168.30.16:6806/1573803727,v1:192.168.30.16:6807/1573803727] > > autoout,exists 737a3234-0f1f-4286-80e9-e89b581aae41 > > osd.14 up in weight 1 up_from 31218 up_thru 31709 down_at 31217 > > last_clean_interval [30979,31195) [v2: > > 192.168.30.13:6834/2291187817,v1:192.168.30.13:6835/2291187817] [v2: > > 192.168.30.13:6836/2291187817,v1:192.168.30.13:6837/2291187817] > exists,up > > 97cd6ac7-aca0-42fd-a049-d27289f83183 > > osd.15 down out weight 0 up_from 29463 up_thru 30531 down_at 30688 > > last_clean_interval [25493,29462) [v2: > > 192.168.30.15:6808/2655269991,v1:192.168.30.15:6809/2655269991] [v2: > > 192.168.30.15:6802/2662269991,v1:192.168.30.15:6803/2662269991] > > autoout,exists 
61aea8f4-5905-4be3-ae32-5eacf75a514e > > osd.16 up in weight 1 up_from 31226 up_thru 31647 down_at 31223 > > last_clean_interval [30970,31195) [v2: > > 192.168.30.13:6808/2624812081,v1:192.168.30.13:6809/2624812081] [v2: > > 192.168.30.13:6810/2624812081,v1:192.168.30.13:6811/2624812081] > exists,up > > 791a7542-87cd-403d-a37e-8f00506b2eb6 > > osd.17 up in weight 1 up_from 31219 up_thru 31703 down_at 31218 > > last_clean_interval [30975,31195) [v2: > > 192.168.30.13:6800/2978036397,v1:192.168.30.13:6801/2978036397] [v2: > > 192.168.30.13:6802/2978036397,v1:192.168.30.13:6803/2978036397] > exists,up > > 4a915645-412f-49e6-8477-1577469905da > > osd.18 down out weight 0 up_from 30334 up_thru 30566 down_at 30688 > > last_clean_interval [25543,30327) [v2: > > 192.168.30.14:6832/985432137,v1:192.168.30.14:6833/985432137] [v2: > > 192.168.30.14:6848/998432137,v1:192.168.30.14:6849/998432137] > > autoout,exists 85f59d83-710c-4896-9200-bda4894fc3e8 > > osd.19 down in weight 1 up_from 31303 up_thru 31321 down_at 31323 > > last_clean_interval [31292,31296) [v2: > > 192.168.30.13:6826/375623427,v1:192.168.30.13:6827/375623427] [v2: > > 192.168.30.13:6828/375623427,v1:192.168.30.13:6829/375623427] exists > > 7d09b51a-bd6d-40f8-a009-78ab9937795d > > osd.20 down out weight 0 up_from 30445 up_thru 30531 down_at 30688 > > last_clean_interval [28947,30444) [v2: > > 192.168.30.14:6810/4062649162,v1:192.168.30.14:6811/4062649162] [v2: > > 192.168.30.14:6800/4073649162,v1:192.168.30.14:6801/4073649162] > > autoout,exists 7ef6cc1a-4755-4a14-b9df-f1f538d903ce > > osd.21 up in weight 1 up_from 31345 up_thru 31567 down_at 31341 > > last_clean_interval [31307,31340) [v2: > > 192.168.30.11:6826/1625231298,v1:192.168.30.11:6827/1625231298] [v2: > > 192.168.30.11:6828/1625231298,v1:192.168.30.11:6829/1625231298] > exists,up > > 5ef2e39a-a353-4cb8-a49e-093fe39b94ef > > osd.22 down in weight 1 up_from 30383 up_thru 30528 down_at 30688 > > last_clean_interval [25549,30317) [v2: > > 192.168.30.14:6806/1204256629,v1:192.168.30.14:6807/1204256629] [v2: > > 192.168.30.14:6812/1204256629,v1:192.168.30.14:6813/1204256629] exists > > c9befe11-a035-449c-8d17-42aaf191923d > > osd.23 down in weight 1 up_from 30334 up_thru 30576 down_at 30688 > > last_clean_interval [30263,30332) [v2: > > 192.168.30.14:6802/3837786490,v1:192.168.30.14:6803/3837786490] [v2: > > 192.168.30.14:6830/3838786490,v1:192.168.30.14:6831/3838786490] exists > > 2081147b-065d-4da7-89d9-747e1ae02b8d > > osd.24 down in weight 1 up_from 29455 up_thru 30570 down_at 30688 > > last_clean_interval [25487,29453) [v2: > > 192.168.30.15:6800/2008474583,v1:192.168.30.15:6801/2008474583] [v2: > > 192.168.30.15:6810/2016474583,v1:192.168.30.15:6811/2016474583] exists > > 39d78380-261c-4689-b53d-90713e6ffcca > > osd.26 up in weight 1 up_from 31208 up_thru 31643 down_at 31207 > > last_clean_interval [30967,31195) [v2: > > 192.168.30.12:6800/2861018947,v1:192.168.30.12:6801/2861018947] [v2: > > 192.168.30.12:6802/2861018947,v1:192.168.30.12:6803/2861018947] > exists,up > > 046622c8-c09c-4254-8c15-3dc05a2f01ed > > osd.28 down in weight 1 up_from 30389 up_thru 30574 down_at 30691 > > last_clean_interval [25513,30312) [v2: > > 192.168.30.16:6820/3466284570,v1:192.168.30.16:6821/3466284570] [v2: > > 192.168.30.16:6822/3466284570,v1:192.168.30.16:6823/3466284570] exists > > 10578b97-e3c4-4553-a8d0-6dcc46af5db1 > > osd.29 down in weight 1 up_from 30378 up_thru 30554 down_at 30688 > > last_clean_interval [28595,30376) [v2: > > 
192.168.30.14:6808/3739543672,v1:192.168.30.14:6809/3739543672] [v2: > > 192.168.30.14:6846/3747543672,v1:192.168.30.14:6847/3747543672] exists > > 9698e936-8edf-4adf-92c9-a0b5202ed01a > > osd.30 down in weight 1 up_from 30449 up_thru 30531 down_at 30688 > > last_clean_interval [25502,30446) [v2: > > 192.168.30.15:6825/2375507296,v1:192.168.30.15:6827/2375507296] [v2: > > 192.168.30.15:6829/2375507296,v1:192.168.30.15:6831/2375507296] exists > > e14d2a0f-a98a-44d4-8c06-4d893f673629 > > osd.31 down in weight 1 up_from 30364 up_thru 30688 down_at 30700 > > last_clean_interval [25514,30361) [v2: > > 192.168.30.16:6826/2835000708,v1:192.168.30.16:6827/2835000708] [v2: > > 192.168.30.16:6802/2843000708,v1:192.168.30.16:6803/2843000708] exists > > 541bca38-e704-483a-8cb8-39e5f69007d1 > > osd.32 up in weight 1 up_from 31209 up_thru 31627 down_at 31208 > > last_clean_interval [30974,31195) [v2: > > 192.168.30.12:6832/3860067997,v1:192.168.30.12:6833/3860067997] [v2: > > 192.168.30.12:6834/3860067997,v1:192.168.30.12:6835/3860067997] > exists,up > > 9200a57e-2845-43ff-9787-8f1f3158fe90 > > osd.33 down in weight 1 up_from 30354 up_thru 30688 down_at 30693 > > last_clean_interval [25521,30350) [v2: > > 192.168.30.16:6842/2342555666,v1:192.168.30.16:6843/2342555666] [v2: > > 192.168.30.16:6844/2364555666,v1:192.168.30.16:6845/2364555666] exists > > 20c55d85-cf9a-4133-a189-7fdad2318f58 > > osd.34 down in weight 1 up_from 30390 up_thru 30688 down_at 30691 > > last_clean_interval [25516,30314) [v2: > > 192.168.30.16:6808/2282629870,v1:192.168.30.16:6811/2282629870] [v2: > > 192.168.30.16:6812/2282629870,v1:192.168.30.16:6814/2282629870] exists > > 77e0ef8f-c047-4f84-afb2-a8ad054e562f > > osd.35 up in weight 1 up_from 31204 up_thru 31657 down_at 31203 > > last_clean_interval [30958,31195) [v2: > > 192.168.30.13:6842/1919357520,v1:192.168.30.13:6843/1919357520] [v2: > > 192.168.30.13:6844/1919357520,v1:192.168.30.13:6845/1919357520] > exists,up > > 2d2de0cb-6d41-4957-a473-2bbe9ce227bf > > osd.36 down in weight 1 up_from 29494 up_thru 30560 down_at 30688 > > last_clean_interval [25491,29492) [v2: > > 192.168.30.15:6816/2153321591,v1:192.168.30.15:6817/2153321591] [v2: > > 192.168.30.15:6842/2158321591,v1:192.168.30.15:6843/2158321591] exists > > 26114668-68b2-458b-89c2-cbad5507ab75 > > > > > > > >> > >> > On Jul 25, 2022, at 3:29 AM, Jeremy Hansen < > >> farnsworth.mcfadden@xxxxxxxxx> wrote: > >> > > >> > I transitioned some servers to a new rack and now I'm having major > >> issues > >> > with Ceph upon bringing things back up. > >> > > >> > I believe the issue may be related to the ceph nodes coming back up > with > >> > different IPs before VLANs were set. That's just a guess because I > >> can't > >> > think of any other reason this would happen. 
> >> > > >> > Current state: > >> > > >> > Every 2.0s: ceph -s > >> > cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022 > >> > > >> > cluster: > >> > id: bfa2ad58-c049-11eb-9098-3c8cf8ed728d > >> > health: HEALTH_WARN > >> > 1 filesystem is degraded > >> > 2 MDSs report slow metadata IOs > >> > 2/5 mons down, quorum cn02,cn03,cn01 > >> > 9 osds down > >> > 3 hosts (17 osds) down > >> > Reduced data availability: 97 pgs inactive, 9 pgs down > >> > Degraded data redundancy: 13860144/30824413 objects > degraded > >> > (44.965%), 411 pgs degraded, 482 pgs undersized > >> > > >> > services: > >> > mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: > cn05, > >> > cn04 > >> > mgr: cn02.arszct(active, since 5m) > >> > mds: 2/2 daemons up, 2 standby > >> > osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped > pgs > >> > > >> > data: > >> > volumes: 1/2 healthy, 1 recovering > >> > pools: 8 pools, 545 pgs > >> > objects: 7.71M objects, 6.7 TiB > >> > usage: 15 TiB used, 39 TiB / 54 TiB avail > >> > pgs: 0.367% pgs unknown > >> > 17.431% pgs not active > >> > 13860144/30824413 objects degraded (44.965%) > >> > 1137693/30824413 objects misplaced (3.691%) > >> > 280 active+undersized+degraded > >> > 67 undersized+degraded+remapped+backfilling+peered > >> > 57 active+undersized+remapped > >> > 45 active+clean+remapped > >> > 44 active+undersized+degraded+remapped+backfilling > >> > 18 undersized+degraded+peered > >> > 10 active+undersized > >> > 9 down > >> > 7 active+clean > >> > 3 active+undersized+remapped+backfilling > >> > 2 active+undersized+degraded+remapped+backfill_wait > >> > 2 unknown > >> > 1 undersized+peered > >> > > >> > io: > >> > client: 170 B/s rd, 0 op/s rd, 0 op/s wr > >> > recovery: 168 MiB/s, 158 keys/s, 166 objects/s > >> > > >> > I have to disable and re-enable the dashboard just to use it. It > seems > >> to > >> > get bogged down after a few moments. 
> >> > > >> > The three servers that were moved to the new rack Ceph has marked as > >> > "Down", but if I do a cephadm host-check, they all seem to pass: > >> > > >> > ************************ ceph ************************ > >> > --------- cn01.ceph.--------- > >> > podman (/usr/bin/podman) version 4.0.2 is present > >> > systemctl is present > >> > lvcreate is present > >> > Unit chronyd.service is enabled and running > >> > Host looks OK > >> > --------- cn02.ceph.--------- > >> > podman (/usr/bin/podman) version 4.0.2 is present > >> > systemctl is present > >> > lvcreate is present > >> > Unit chronyd.service is enabled and running > >> > Host looks OK > >> > --------- cn03.ceph.--------- > >> > podman (/usr/bin/podman) version 4.0.2 is present > >> > systemctl is present > >> > lvcreate is present > >> > Unit chronyd.service is enabled and running > >> > Host looks OK > >> > --------- cn04.ceph.--------- > >> > podman (/usr/bin/podman) version 4.0.2 is present > >> > systemctl is present > >> > lvcreate is present > >> > Unit chronyd.service is enabled and running > >> > Host looks OK > >> > --------- cn05.ceph.--------- > >> > podman|docker (/usr/bin/podman) is present > >> > systemctl is present > >> > lvcreate is present > >> > Unit chronyd.service is enabled and running > >> > Host looks OK > >> > --------- cn06.ceph.--------- > >> > podman (/usr/bin/podman) version 4.0.2 is present > >> > systemctl is present > >> > lvcreate is present > >> > Unit chronyd.service is enabled and running > >> > Host looks OK > >> > > >> > It seems to be recovering with what it has left, but a large amount of > >> OSDs > >> > are down. When trying to restart one of the down'd OSDs, I see a huge > >> dump. > >> > > >> > Jul 25 03:19:38 cn06.ceph > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >> > 2022-07-25T10:19:38.532+0000 7fce14a6c080 0 osd.34 30689 done with > >> init, > >> > starting boot process > >> > Jul 25 03:19:38 cn06.ceph > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >> > 2022-07-25T10:19:38.532+0000 7fce14a6c080 1 osd.34 30689 start_boot > >> > Jul 25 03:20:10 cn06.ceph > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >> > 2022-07-25T10:20:10.655+0000 7fcdfd12d700 1 osd.34 30689 start_boot > >> > Jul 25 03:20:41 cn06.ceph > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >> > 2022-07-25T10:20:41.159+0000 7fcdfd12d700 1 osd.34 30689 start_boot > >> > Jul 25 03:21:11 cn06.ceph > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >> > 2022-07-25T10:21:11.662+0000 7fcdfd12d700 1 osd.34 30689 start_boot > >> > > >> > At this point it just keeps printing start_boot, but the dashboard has > >> it > >> > marked as "in" but "down". > >> > > >> > On these three hosts that moved, there were a bunch marked as "out" > and > >> > "down", and some with "in" but "down". > >> > > >> > Not sure where to go next. I'm going to let the recovery continue and > >> hope > >> > that my 4x replication on these pools saves me. > >> > > >> > Not sure where to go from here. Any help is very much appreciated. > >> This > >> > Ceph cluster holds all of our Cloudstack images... it would be > >> terrible to > >> > lose this data. 
> >> > _______________________________________________ > >> > ceph-users mailing list -- ceph-users@xxxxxxx > >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx > >> > > > > On Mon, Jul 25, 2022 at 10:15 AM Jeremy Hansen < > > farnsworth.mcfadden@xxxxxxxxx> wrote: > > > >> > >> > >> On Mon, Jul 25, 2022 at 7:26 AM Anthony D'Atri <anthony.datri@xxxxxxxxx > > > >> wrote: > >> > >>> Do your values for public and cluster network include the new addresses > >>> on all nodes? > >>> > >> > >> This cluster only has one network. There is no separation between > >> public and cluster. Three of the nodes momentarily came up using a > >> different IP address. > >> > >> I've also noticed on one of the nodes that did not move or have any IP > >> issue, on a single node, from the dashboard, it names the same device > for > >> two different osd's: > >> > >> 2 cn01 out destroyed hdd TOSHIBA_MG04SCA40EE_21M0A0CKFWZB Unknown sdb > >> osd.2 > >> > >> 3 cn01 out destroyed ssd SAMSUNG_MZILT3T8HBLS/007_S5G0NE0R200159 Unknown > >> sdb osd.3 > >> > >> > >> [ceph: root@cn01 /]# ceph-volume inventory > >> > >> Device Path Size rotates available Model name > >> /dev/sda 3.64 TB True False MG04SCA40EE > >> /dev/sdb 3.49 TB False False > MZILT3T8HBLS/007 > >> /dev/sdc 3.64 TB True False MG04SCA40EE > >> /dev/sdd 3.64 TB True False MG04SCA40EE > >> /dev/sde 3.49 TB False False > MZILT3T8HBLS/007 > >> /dev/sdf 3.64 TB True False MG04SCA40EE > >> /dev/sdg 698.64 GB True False SEAGATE > ST375064 > >> > >> [ceph: root@cn01 /]# ceph osd info > >> osd.0 down out weight 0 up_from 30231 up_thru 30564 down_at 30688 > >> last_clean_interval [25500,30228) [v2: > >> 192.168.30.15:6818/2512683421,v1:192.168.30.15:6819/2512683421] [v2: > >> 192.168.30.15:6824/2512683421,v1:192.168.30.15:6826/2512683421] > >> autoout,exists d14cf503-a303-4fa4-a713-9530b67d613a > >> osd.1 down out weight 0 up_from 30393 up_thru 30688 down_at 30697 > >> last_clean_interval [25518,30321) [v2: > >> 192.168.30.16:6834/1781855831,v1:192.168.30.16:6835/1781855831] [v2: > >> 192.168.30.16:6836/1781855831,v1:192.168.30.16:6837/1781855831] > >> autoout,exists 0d521411-c835-4fa3-beca-3631b4ff6bf7 > >> osd.2 down out weight 0 up_from 31316 up_thru 31293 down_at 31317 > >> last_clean_interval [31218,31296) [v2: > >> 192.168.30.11:6810/894589880,v1:192.168.30.11:6811/894589880] [v2: > >> 192.168.30.11:6812/894589880,v1:192.168.30.11:6813/894589880] > >> destroyed,exists > >> osd.3 down out weight 0 up_from 31265 up_thru 31266 down_at 31268 > >> last_clean_interval [31254,31256) [v2: > >> 192.168.30.11:6818/1641948535,v1:192.168.30.11:6819/1641948535] [v2: > >> 192.168.30.11:6820/1641948535,v1:192.168.30.11:6821/1641948535] > >> destroyed,exists > >> osd.4 up in weight 1 up_from 31356 up_thru 31581 down_at 31339 > >> last_clean_interval [31320,31338) [v2: > >> 192.168.30.11:6802/2785067179,v1:192.168.30.11:6803/2785067179] [v2: > >> 192.168.30.11:6804/2785067179,v1:192.168.30.11:6805/2785067179] > >> exists,up 3afd06db-b91d-44fe-9305-5eb95f7a59b9 > >> osd.5 up in weight 1 up_from 31347 up_thru 31699 down_at 31339 > >> last_clean_interval [31311,31338) [v2: > >> 192.168.30.11:6818/1936771540,v1:192.168.30.11:6819/1936771540] [v2: > >> 192.168.30.11:6820/1936771540,v1:192.168.30.11:6821/1936771540] > >> exists,up 063c2ccf-02ce-4f5e-8252-dddfbb258a95 > >> osd.6 up in weight 1 up_from 31218 up_thru 31711 down_at 31217 > >> last_clean_interval [30978,31195) [v2: > >> 192.168.30.12:6816/1585973160,v1:192.168.30.12:6817/1585973160] [v2: > >> 
192.168.30.12:6818/1585973160,v1:192.168.30.12:6819/1585973160] > >> exists,up 94250ea2-f12e-4dc6-9135-b626086ccffd > >> osd.7 down out weight 0 up_from 30353 up_thru 30558 down_at 30688 > >> last_clean_interval [25533,30349) [v2: > >> 192.168.30.14:6816/4083104061,v1:192.168.30.14:6817/4083104061] [v2: > >> 192.168.30.14:6840/4094104061,v1:192.168.30.14:6841/4094104061] > >> autoout,exists de351aec-b91e-4c22-a0bf-85369bc14579 > >> osd.8 up in weight 1 up_from 31226 up_thru 31668 down_at 31225 > >> last_clean_interval [30983,31195) [v2: > >> 192.168.30.12:6824/1312484329,v1:192.168.30.12:6825/1312484329] [v2: > >> 192.168.30.12:6826/1312484329,v1:192.168.30.12:6827/1312484329] > >> exists,up 51f665b4-fa5b-4b17-8390-ed130145ef04 > >> osd.9 up in weight 1 up_from 31351 up_thru 31673 down_at 31340 > >> last_clean_interval [31315,31338) [v2: > >> 192.168.30.11:6810/1446838877,v1:192.168.30.11:6811/1446838877] [v2: > >> 192.168.30.11:6812/1446838877,v1:192.168.30.11:6813/1446838877] > >> exists,up 985f1127-d126-4629-b8cd-03cf2d914d99 > >> osd.10 up in weight 1 up_from 31219 up_thru 31639 down_at 31218 > >> last_clean_interval [30980,31195) [v2: > >> 192.168.30.12:6808/1587842953,v1:192.168.30.12:6809/1587842953] [v2: > >> 192.168.30.12:6810/1587842953,v1:192.168.30.12:6811/1587842953] > >> exists,up c7fca03e-4bd5-4485-a090-658ca967d5f6 > >> osd.11 up in weight 1 up_from 31234 up_thru 31659 down_at 31223 > >> last_clean_interval [30978,31195) [v2: > >> 192.168.30.12:6840/3403200742,v1:192.168.30.12:6841/3403200742] [v2: > >> 192.168.30.12:6842/3403200742,v1:192.168.30.12:6843/3403200742] > >> exists,up 81074bd7-ad9f-4e56-8885-cca4745f6c95 > >> osd.12 up in weight 1 up_from 31230 up_thru 31717 down_at 31223 > >> last_clean_interval [30975,31195) [v2: > >> 192.168.30.13:6816/4268732910,v1:192.168.30.13:6817/4268732910] [v2: > >> 192.168.30.13:6818/4268732910,v1:192.168.30.13:6819/4268732910] > >> exists,up af1b55dd-c110-4861-aed9-c0737cef8be1 > >> osd.13 down out weight 0 up_from 30389 up_thru 30688 down_at 30695 > >> last_clean_interval [25513,30317) [v2: > >> 192.168.30.16:6804/1573803727,v1:192.168.30.16:6805/1573803727] [v2: > >> 192.168.30.16:6806/1573803727,v1:192.168.30.16:6807/1573803727] > >> autoout,exists 737a3234-0f1f-4286-80e9-e89b581aae41 > >> osd.14 up in weight 1 up_from 31218 up_thru 31709 down_at 31217 > >> last_clean_interval [30979,31195) [v2: > >> 192.168.30.13:6834/2291187817,v1:192.168.30.13:6835/2291187817] [v2: > >> 192.168.30.13:6836/2291187817,v1:192.168.30.13:6837/2291187817] > >> exists,up 97cd6ac7-aca0-42fd-a049-d27289f83183 > >> osd.15 down out weight 0 up_from 29463 up_thru 30531 down_at 30688 > >> last_clean_interval [25493,29462) [v2: > >> 192.168.30.15:6808/2655269991,v1:192.168.30.15:6809/2655269991] [v2: > >> 192.168.30.15:6802/2662269991,v1:192.168.30.15:6803/2662269991] > >> autoout,exists 61aea8f4-5905-4be3-ae32-5eacf75a514e > >> osd.16 up in weight 1 up_from 31226 up_thru 31647 down_at 31223 > >> last_clean_interval [30970,31195) [v2: > >> 192.168.30.13:6808/2624812081,v1:192.168.30.13:6809/2624812081] [v2: > >> 192.168.30.13:6810/2624812081,v1:192.168.30.13:6811/2624812081] > >> exists,up 791a7542-87cd-403d-a37e-8f00506b2eb6 > >> osd.17 up in weight 1 up_from 31219 up_thru 31703 down_at 31218 > >> last_clean_interval [30975,31195) [v2: > >> 192.168.30.13:6800/2978036397,v1:192.168.30.13:6801/2978036397] [v2: > >> 192.168.30.13:6802/2978036397,v1:192.168.30.13:6803/2978036397] > >> exists,up 4a915645-412f-49e6-8477-1577469905da > >> osd.18 down out weight 0 
up_from 30334 up_thru 30566 down_at 30688 > >> last_clean_interval [25543,30327) [v2: > >> 192.168.30.14:6832/985432137,v1:192.168.30.14:6833/985432137] [v2: > >> 192.168.30.14:6848/998432137,v1:192.168.30.14:6849/998432137] > >> autoout,exists 85f59d83-710c-4896-9200-bda4894fc3e8 > >> osd.19 down in weight 1 up_from 31303 up_thru 31321 down_at 31323 > >> last_clean_interval [31292,31296) [v2: > >> 192.168.30.13:6826/375623427,v1:192.168.30.13:6827/375623427] [v2: > >> 192.168.30.13:6828/375623427,v1:192.168.30.13:6829/375623427] exists > >> 7d09b51a-bd6d-40f8-a009-78ab9937795d > >> osd.20 down out weight 0 up_from 30445 up_thru 30531 down_at 30688 > >> last_clean_interval [28947,30444) [v2: > >> 192.168.30.14:6810/4062649162,v1:192.168.30.14:6811/4062649162] [v2: > >> 192.168.30.14:6800/4073649162,v1:192.168.30.14:6801/4073649162] > >> autoout,exists 7ef6cc1a-4755-4a14-b9df-f1f538d903ce > >> osd.21 up in weight 1 up_from 31345 up_thru 31567 down_at 31341 > >> last_clean_interval [31307,31340) [v2: > >> 192.168.30.11:6826/1625231298,v1:192.168.30.11:6827/1625231298] [v2: > >> 192.168.30.11:6828/1625231298,v1:192.168.30.11:6829/1625231298] > >> exists,up 5ef2e39a-a353-4cb8-a49e-093fe39b94ef > >> osd.22 down in weight 1 up_from 30383 up_thru 30528 down_at 30688 > >> last_clean_interval [25549,30317) [v2: > >> 192.168.30.14:6806/1204256629,v1:192.168.30.14:6807/1204256629] [v2: > >> 192.168.30.14:6812/1204256629,v1:192.168.30.14:6813/1204256629] exists > >> c9befe11-a035-449c-8d17-42aaf191923d > >> osd.23 down in weight 1 up_from 30334 up_thru 30576 down_at 30688 > >> last_clean_interval [30263,30332) [v2: > >> 192.168.30.14:6802/3837786490,v1:192.168.30.14:6803/3837786490] [v2: > >> 192.168.30.14:6830/3838786490,v1:192.168.30.14:6831/3838786490] exists > >> 2081147b-065d-4da7-89d9-747e1ae02b8d > >> osd.24 down in weight 1 up_from 29455 up_thru 30570 down_at 30688 > >> last_clean_interval [25487,29453) [v2: > >> 192.168.30.15:6800/2008474583,v1:192.168.30.15:6801/2008474583] [v2: > >> 192.168.30.15:6810/2016474583,v1:192.168.30.15:6811/2016474583] exists > >> 39d78380-261c-4689-b53d-90713e6ffcca > >> osd.26 up in weight 1 up_from 31208 up_thru 31643 down_at 31207 > >> last_clean_interval [30967,31195) [v2: > >> 192.168.30.12:6800/2861018947,v1:192.168.30.12:6801/2861018947] [v2: > >> 192.168.30.12:6802/2861018947,v1:192.168.30.12:6803/2861018947] > >> exists,up 046622c8-c09c-4254-8c15-3dc05a2f01ed > >> osd.28 down in weight 1 up_from 30389 up_thru 30574 down_at 30691 > >> last_clean_interval [25513,30312) [v2: > >> 192.168.30.16:6820/3466284570,v1:192.168.30.16:6821/3466284570] [v2: > >> 192.168.30.16:6822/3466284570,v1:192.168.30.16:6823/3466284570] exists > >> 10578b97-e3c4-4553-a8d0-6dcc46af5db1 > >> osd.29 down in weight 1 up_from 30378 up_thru 30554 down_at 30688 > >> last_clean_interval [28595,30376) [v2: > >> 192.168.30.14:6808/3739543672,v1:192.168.30.14:6809/3739543672] [v2: > >> 192.168.30.14:6846/3747543672,v1:192.168.30.14:6847/3747543672] exists > >> 9698e936-8edf-4adf-92c9-a0b5202ed01a > >> osd.30 down in weight 1 up_from 30449 up_thru 30531 down_at 30688 > >> last_clean_interval [25502,30446) [v2: > >> 192.168.30.15:6825/2375507296,v1:192.168.30.15:6827/2375507296] [v2: > >> 192.168.30.15:6829/2375507296,v1:192.168.30.15:6831/2375507296] exists > >> e14d2a0f-a98a-44d4-8c06-4d893f673629 > >> osd.31 down in weight 1 up_from 30364 up_thru 30688 down_at 30700 > >> last_clean_interval [25514,30361) [v2: > >> 192.168.30.16:6826/2835000708,v1:192.168.30.16:6827/2835000708] [v2: > >> 
192.168.30.16:6802/2843000708,v1:192.168.30.16:6803/2843000708] exists > >> 541bca38-e704-483a-8cb8-39e5f69007d1 > >> osd.32 up in weight 1 up_from 31209 up_thru 31627 down_at 31208 > >> last_clean_interval [30974,31195) [v2: > >> 192.168.30.12:6832/3860067997,v1:192.168.30.12:6833/3860067997] [v2: > >> 192.168.30.12:6834/3860067997,v1:192.168.30.12:6835/3860067997] > >> exists,up 9200a57e-2845-43ff-9787-8f1f3158fe90 > >> osd.33 down in weight 1 up_from 30354 up_thru 30688 down_at 30693 > >> last_clean_interval [25521,30350) [v2: > >> 192.168.30.16:6842/2342555666,v1:192.168.30.16:6843/2342555666] [v2: > >> 192.168.30.16:6844/2364555666,v1:192.168.30.16:6845/2364555666] exists > >> 20c55d85-cf9a-4133-a189-7fdad2318f58 > >> osd.34 down in weight 1 up_from 30390 up_thru 30688 down_at 30691 > >> last_clean_interval [25516,30314) [v2: > >> 192.168.30.16:6808/2282629870,v1:192.168.30.16:6811/2282629870] [v2: > >> 192.168.30.16:6812/2282629870,v1:192.168.30.16:6814/2282629870] exists > >> 77e0ef8f-c047-4f84-afb2-a8ad054e562f > >> osd.35 up in weight 1 up_from 31204 up_thru 31657 down_at 31203 > >> last_clean_interval [30958,31195) [v2: > >> 192.168.30.13:6842/1919357520,v1:192.168.30.13:6843/1919357520] [v2: > >> 192.168.30.13:6844/1919357520,v1:192.168.30.13:6845/1919357520] > >> exists,up 2d2de0cb-6d41-4957-a473-2bbe9ce227bf > >> osd.36 down in weight 1 up_from 29494 up_thru 30560 down_at 30688 > >> last_clean_interval [25491,29492) [v2: > >> 192.168.30.15:6816/2153321591,v1:192.168.30.15:6817/2153321591] [v2: > >> 192.168.30.15:6842/2158321591,v1:192.168.30.15:6843/2158321591] exists > >> 26114668-68b2-458b-89c2-cbad5507ab75 > >> > >> > >> > >>> > >>> > On Jul 25, 2022, at 3:29 AM, Jeremy Hansen < > >>> farnsworth.mcfadden@xxxxxxxxx> wrote: > >>> > > >>> > I transitioned some servers to a new rack and now I'm having major > >>> issues > >>> > with Ceph upon bringing things back up. > >>> > > >>> > I believe the issue may be related to the ceph nodes coming back up > >>> with > >>> > different IPs before VLANs were set. That's just a guess because I > >>> can't > >>> > think of any other reason this would happen. 
> >>> > > >>> > Current state: > >>> > > >>> > Every 2.0s: ceph -s > >>> > cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022 > >>> > > >>> > cluster: > >>> > id: bfa2ad58-c049-11eb-9098-3c8cf8ed728d > >>> > health: HEALTH_WARN > >>> > 1 filesystem is degraded > >>> > 2 MDSs report slow metadata IOs > >>> > 2/5 mons down, quorum cn02,cn03,cn01 > >>> > 9 osds down > >>> > 3 hosts (17 osds) down > >>> > Reduced data availability: 97 pgs inactive, 9 pgs down > >>> > Degraded data redundancy: 13860144/30824413 objects > degraded > >>> > (44.965%), 411 pgs degraded, 482 pgs undersized > >>> > > >>> > services: > >>> > mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: > >>> cn05, > >>> > cn04 > >>> > mgr: cn02.arszct(active, since 5m) > >>> > mds: 2/2 daemons up, 2 standby > >>> > osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped > pgs > >>> > > >>> > data: > >>> > volumes: 1/2 healthy, 1 recovering > >>> > pools: 8 pools, 545 pgs > >>> > objects: 7.71M objects, 6.7 TiB > >>> > usage: 15 TiB used, 39 TiB / 54 TiB avail > >>> > pgs: 0.367% pgs unknown > >>> > 17.431% pgs not active > >>> > 13860144/30824413 objects degraded (44.965%) > >>> > 1137693/30824413 objects misplaced (3.691%) > >>> > 280 active+undersized+degraded > >>> > 67 undersized+degraded+remapped+backfilling+peered > >>> > 57 active+undersized+remapped > >>> > 45 active+clean+remapped > >>> > 44 active+undersized+degraded+remapped+backfilling > >>> > 18 undersized+degraded+peered > >>> > 10 active+undersized > >>> > 9 down > >>> > 7 active+clean > >>> > 3 active+undersized+remapped+backfilling > >>> > 2 active+undersized+degraded+remapped+backfill_wait > >>> > 2 unknown > >>> > 1 undersized+peered > >>> > > >>> > io: > >>> > client: 170 B/s rd, 0 op/s rd, 0 op/s wr > >>> > recovery: 168 MiB/s, 158 keys/s, 166 objects/s > >>> > > >>> > I have to disable and re-enable the dashboard just to use it. It > >>> seems to > >>> > get bogged down after a few moments. 
> >>> > > >>> > The three servers that were moved to the new rack Ceph has marked as > >>> > "Down", but if I do a cephadm host-check, they all seem to pass: > >>> > > >>> > ************************ ceph ************************ > >>> > --------- cn01.ceph.--------- > >>> > podman (/usr/bin/podman) version 4.0.2 is present > >>> > systemctl is present > >>> > lvcreate is present > >>> > Unit chronyd.service is enabled and running > >>> > Host looks OK > >>> > --------- cn02.ceph.--------- > >>> > podman (/usr/bin/podman) version 4.0.2 is present > >>> > systemctl is present > >>> > lvcreate is present > >>> > Unit chronyd.service is enabled and running > >>> > Host looks OK > >>> > --------- cn03.ceph.--------- > >>> > podman (/usr/bin/podman) version 4.0.2 is present > >>> > systemctl is present > >>> > lvcreate is present > >>> > Unit chronyd.service is enabled and running > >>> > Host looks OK > >>> > --------- cn04.ceph.--------- > >>> > podman (/usr/bin/podman) version 4.0.2 is present > >>> > systemctl is present > >>> > lvcreate is present > >>> > Unit chronyd.service is enabled and running > >>> > Host looks OK > >>> > --------- cn05.ceph.--------- > >>> > podman|docker (/usr/bin/podman) is present > >>> > systemctl is present > >>> > lvcreate is present > >>> > Unit chronyd.service is enabled and running > >>> > Host looks OK > >>> > --------- cn06.ceph.--------- > >>> > podman (/usr/bin/podman) version 4.0.2 is present > >>> > systemctl is present > >>> > lvcreate is present > >>> > Unit chronyd.service is enabled and running > >>> > Host looks OK > >>> > > >>> > It seems to be recovering with what it has left, but a large amount > of > >>> OSDs > >>> > are down. When trying to restart one of the down'd OSDs, I see a > huge > >>> dump. > >>> > > >>> > Jul 25 03:19:38 cn06.ceph > >>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >>> > 2022-07-25T10:19:38.532+0000 7fce14a6c080 0 osd.34 30689 done with > >>> init, > >>> > starting boot process > >>> > Jul 25 03:19:38 cn06.ceph > >>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >>> > 2022-07-25T10:19:38.532+0000 7fce14a6c080 1 osd.34 30689 start_boot > >>> > Jul 25 03:20:10 cn06.ceph > >>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >>> > 2022-07-25T10:20:10.655+0000 7fcdfd12d700 1 osd.34 30689 start_boot > >>> > Jul 25 03:20:41 cn06.ceph > >>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >>> > 2022-07-25T10:20:41.159+0000 7fcdfd12d700 1 osd.34 30689 start_boot > >>> > Jul 25 03:21:11 cn06.ceph > >>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >>> > 2022-07-25T10:21:11.662+0000 7fcdfd12d700 1 osd.34 30689 start_boot > >>> > > >>> > At this point it just keeps printing start_boot, but the dashboard > has > >>> it > >>> > marked as "in" but "down". > >>> > > >>> > On these three hosts that moved, there were a bunch marked as "out" > and > >>> > "down", and some with "in" but "down". > >>> > > >>> > Not sure where to go next. I'm going to let the recovery continue > and > >>> hope > >>> > that my 4x replication on these pools saves me. > >>> > > >>> > Not sure where to go from here. Any help is very much appreciated. > >>> This > >>> > Ceph cluster holds all of our Cloudstack images... it would be > >>> terrible to > >>> > lose this data. 
> >>> > _______________________________________________ > >>> > ceph-users mailing list -- ceph-users@xxxxxxx > >>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx > >>> > >>> > _______________________________________________ > ceph-users mailing list -- ceph-users@xxxxxxx > To unsubscribe send an email to ceph-users-leave@xxxxxxx > _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx
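
P.S. For reference, this is the checklist I'm planning to work through from the admin node (cn01) for each of the three moved hosts (cn04/cn05/cn06, 192.168.30.14-16). Steps 3-5 are just the commands from the cephadm warning quoted above; steps 1, 2 and 6 are my own additions based on my reading of the cephadm docs, so please correct me if any of them are wrong or pointless:

# 1. Confirm the monitors' public network still covers the hosts' current IPs
ceph config get mon public_network
ceph mon dump

# 2. See which address cephadm has on record for each host
ceph orch host ls

# 3. Re-push the cluster SSH key to a moved host (commands from the warning text)
ceph cephadm get-pub-key > ~/ceph.pub
ssh-copy-id -f -i ~/ceph.pub root@192.168.30.14

# 4. Open a shell that skips the host checks
cephadm shell --no-hosts

# 5. Test the connection the same way the mgr does (commands from the warning text)
ceph cephadm get-ssh-config > ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
chmod 0600 ~/cephadm_private_key
ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.14

# 6. Ask the orchestrator to re-check the host once SSH works
ceph cephadm check-host cn04.ceph

If .14 starts behaving, I'll repeat steps 3-6 for 192.168.30.15 and 192.168.30.16 before touching the down OSDs.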