Re: [Warning Possible spam] Re: Issues after a shutdown

Do the journal logs for any of the OSDs that are marked down give any
useful info on why they're failing to start back up? If the host-level IP
issues have gone away, I think that would be the next place to check.
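
If it's a cephadm deployment, something along these lines should pull up the
recent log for one of them (osd.34 and the fsid are just examples taken from
the output further down the thread):

    cephadm ls | grep osd        # confirm the daemon exists on that host
    cephadm logs --name osd.34   # wraps journalctl for the containerized daemon
    journalctl -u ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d@osd.34.service -n 200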

On Mon, Jul 25, 2022 at 5:03 PM Jeremy Hansen <farnsworth.mcfadden@xxxxxxxxx>
wrote:

> I noticed this on the initial run of ceph health, but I no longer see it.
> When you say "don't use ceph adm", can you explain why this is bad?
>
> This is ceph health outside of cephadm shell:
>
> HEALTH_WARN 1 filesystem is degraded; 2 MDSs report slow metadata IOs; 2/5
> mons down, quorum cn02,cn03,cn01; 10 osds down; 3 hosts (17 osds) down;
> Reduced data availability: 13 pgs inactive, 9 pgs down; Degraded data redundancy:
> 8515690/30862245 objects degraded (27.593%), 326 pgs degraded, 447 pgs
> undersized
> [WRN] FS_DEGRADED: 1 filesystem is degraded
>     fs coldlogix is degraded
> [WRN] MDS_SLOW_METADATA_IO: 2 MDSs report slow metadata IOs
>     mds.coldlogix.cn01.uriofo(mds.0): 2 slow metadata IOs are blocked > 30
> secs, oldest blocked for 3701 secs
>     mds.btc.cn02.ouvaus(mds.0): 1 slow metadata IOs are blocked > 30 secs,
> oldest blocked for 382 secs
> [WRN] MON_DOWN: 2/5 mons down, quorum cn02,cn03,cn01
>     mon.cn05 (rank 0) addr [v2:
> 192.168.30.15:3300/0,v1:192.168.30.15:6789/0]
> is down (out of quorum)
>     mon.cn04 (rank 3) addr [v2:
> 192.168.30.14:3300/0,v1:192.168.30.14:6789/0]
> is down (out of quorum)
> [WRN] OSD_DOWN: 10 osds down
>     osd.0 (root=default,host=cn05) is down
>     osd.1 (root=default,host=cn06) is down
>     osd.7 (root=default,host=cn04) is down
>     osd.13 (root=default,host=cn06) is down
>     osd.15 (root=default,host=cn05) is down
>     osd.18 (root=default,host=cn04) is down
>     osd.20 (root=default,host=cn04) is down
>     osd.33 (root=default,host=cn06) is down
>     osd.34 (root=default,host=cn06) is down
>     osd.36 (root=default,host=cn05) is down
> [WRN] OSD_HOST_DOWN: 3 hosts (17 osds) down
>     host cn04 (root=default) (6 osds) is down
>     host cn05 (root=default) (5 osds) is down
>     host cn06 (root=default) (6 osds) is down
> [WRN] PG_AVAILABILITY: Reduced data availability: 13 pgs inactive, 9 pgs
> down
>     pg 9.3a is down, acting [8]
>     pg 9.7a is down, acting [8]
>     pg 9.ba is down, acting [8]
>     pg 9.fa is down, acting [8]
>     pg 11.3 is stuck inactive for 39h, current state
> undersized+degraded+peered, last acting [11]
>     pg 11.11 is down, acting [19,9]
>     pg 11.1f is stuck inactive for 13h, current state
> undersized+degraded+peered, last acting [10]
>     pg 12.36 is down, acting [21,16]
>     pg 12.59 is down, acting [26,5]
>     pg 12.66 is down, acting [5]
>     pg 19.4 is stuck inactive for 39h, current state
> undersized+degraded+peered, last acting [6]
>     pg 19.1c is down, acting [21,16,11]
>     pg 21.1 is stuck inactive for 2m, current state unknown, last acting []
> [WRN] PG_DEGRADED: Degraded data redundancy: 8515690/30862245 objects
> degraded (27.593%), 326 pgs degraded, 447 pgs undersized
>     pg 9.75 is stuck undersized for 61m, current state
> active+undersized+remapped, last acting [4,8,35]
>     pg 9.76 is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [35,10,21]
>     pg 9.77 is stuck undersized for 61m, current state
> active+undersized+remapped, last acting [32,35,4]
>     pg 9.78 is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [14,10]
>     pg 9.79 is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [21,32]
>     pg 9.7b is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [8,12,5]
>     pg 9.7c is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [4,35,10]
>     pg 9.7d is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [5,19,10]
>     pg 9.7e is stuck undersized for 62m, current state
> active+undersized+remapped, last acting [21,10,17]
>     pg 9.80 is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [8,4,17]
>     pg 9.81 is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [14,26]
>     pg 9.82 is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [26,16]
>     pg 9.83 is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [8,4]
>     pg 9.84 is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [4,35,6]
>     pg 9.85 is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [32,12,9]
>     pg 9.86 is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [35,5,8]
>     pg 9.87 is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [9,12]
>     pg 9.88 is stuck undersized for 62m, current state
> active+undersized+remapped, last acting [19,32,35]
>     pg 9.89 is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [10,14,4]
>     pg 9.8a is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [21,19]
>     pg 9.8b is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [8,35]
>     pg 9.8c is stuck undersized for 58m, current state
> active+undersized+remapped, last acting [10,19,5]
>     pg 9.8d is stuck undersized for 61m, current state
> active+undersized+remapped, last acting [9,6]
>     pg 9.8f is stuck undersized for 62m, current state
> active+undersized+remapped, last acting [19,26,17]
>     pg 9.90 is stuck undersized for 62m, current state
> active+undersized+remapped, last acting [35,26]
>     pg 9.91 is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [17,5]
>     pg 9.92 is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [21,26]
>     pg 9.93 is stuck undersized for 62m, current state
> active+undersized+remapped, last acting [19,26,5]
>     pg 9.94 is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [21,11]
>     pg 9.95 is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [8,19]
>     pg 9.96 is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [17,6]
>     pg 9.97 is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [8,9,16]
>     pg 9.98 is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [6,21]
>     pg 9.99 is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [10,9]
>     pg 9.9a is stuck undersized for 61m, current state
> active+undersized+remapped, last acting [4,16,10]
>     pg 9.9b is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [12,4,11]
>     pg 9.9c is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [9,16]
>     pg 9.9d is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [26,35]
>     pg 9.9f is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [9,17,26]
>     pg 12.70 is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [21,35]
>     pg 12.71 is active+undersized+degraded, acting [6,12]
>     pg 12.72 is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [10,14,4]
>     pg 12.73 is stuck undersized for 62m, current state
> active+undersized+remapped, last acting [5,17,11]
>     pg 12.78 is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [5,8,35]
>     pg 12.79 is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [4,17]
>     pg 12.7a is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [10,21]
>     pg 12.7b is stuck undersized for 62m, current state
> active+undersized+remapped, last acting [17,21,11]
>     pg 12.7c is stuck undersized for 62m, current state
> active+undersized+degraded, last acting [32,21,16]
>     pg 12.7d is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [35,6,9]
>     pg 12.7e is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [26,4]
>     pg 12.7f is stuck undersized for 61m, current state
> active+undersized+degraded, last acting [9,14]
>
> It's no longer giving me the ssh key issues, but that hasn't done anything to
> improve my situation.  When the machines came up with a different IP, did
> this somehow throw off some kind of ssh known_hosts file or pub key
> exchange?  It's all very strange that a momentary bad IP could wreak so much
> havoc.
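>
> (For reference, the key check cephadm itself suggests further down boils down
> to something like this, using cn04's IP as the example:
>
>     ceph cephadm get-pub-key > ~/ceph.pub
>     ssh-copy-id -f -i ~/ceph.pub root@192.168.30.14
>     ceph cephadm check-host cn04.ceph
>
> so if the known_hosts / key exchange did get thrown off, re-copying the key to
> each affected host should put it back.)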
>
> Thank you
> -jeremy
>
>
> On Mon, Jul 25, 2022 at 1:44 PM Frank Schilder <frans@xxxxxx> wrote:
>
> > I don't use ceph-adm and I also don't know how you got the "some more
> > info" output. However, I did notice that it contains instructions, starting at
> > "Please make sure that the host is reachable ...". How about starting by
> > following those?
> >
> > Best regards,
> > =================
> > Frank Schilder
> > AIT Risø Campus
> > Bygning 109, rum S14
> >
> > ________________________________________
> > From: Jeremy Hansen <farnsworth.mcfadden@xxxxxxxxx>
> > Sent: 25 July 2022 22:32:32
> > To: ceph-users@xxxxxxx
> > Subject: [Warning Possible spam]   Re: Issues after a
> shutdown
> >
> > Here's some more info:
> >
> > HEALTH_WARN 2 failed cephadm daemon(s); 3 hosts fail cephadm check; 2
> > filesystems are degraded; 1 MDSs report slow metadata IOs; 2/5 mons down,
> > quorum cn02,cn03,cn01; 10 osds down; 3 hosts (17 osds) down; Reduced data
> > availability: 13 pgs inactive, 9 pgs down; Degraded data redundancy:
> > 8515690/30862245 objects degraded (27.593%), 326 pgs degraded, 447 pgs
> > undersized
> > [WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
> >     daemon osd.3 on cn01.ceph is in error state
> >     daemon osd.2 on cn01.ceph is in error state
> > [WRN] CEPHADM_HOST_CHECK_FAILED: 3 hosts fail cephadm check
> >     host cn04.ceph (192.168.30.14) failed check: Failed to connect to
> > cn04.ceph (192.168.30.14).
> > Please make sure that the host is reachable and accepts connections using
> > the cephadm SSH key
> >
> > To add the cephadm SSH key to the host:
> > > ceph cephadm get-pub-key > ~/ceph.pub
> > > ssh-copy-id -f -i ~/ceph.pub root@192.168.30.14
> >
> > To check that the host is reachable open a new shell with the --no-hosts
> > flag:
> > > cephadm shell --no-hosts
> >
> > Then run the following:
> > > ceph cephadm get-ssh-config > ssh_config
> > > ceph config-key get mgr/cephadm/ssh_identity_key >
> ~/cephadm_private_key
> > > chmod 0600 ~/cephadm_private_key
> > > ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.14
> >     host cn06.ceph (192.168.30.16) failed check: Failed to connect to
> > cn06.ceph (192.168.30.16).
> > Please make sure that the host is reachable and accepts connections using
> > the cephadm SSH key
> >
> > To add the cephadm SSH key to the host:
> > > ceph cephadm get-pub-key > ~/ceph.pub
> > > ssh-copy-id -f -i ~/ceph.pub root@192.168.30.16
> >
> > To check that the host is reachable open a new shell with the --no-hosts
> > flag:
> > > cephadm shell --no-hosts
> >
> > Then run the following:
> > > ceph cephadm get-ssh-config > ssh_config
> > > ceph config-key get mgr/cephadm/ssh_identity_key >
> ~/cephadm_private_key
> > > chmod 0600 ~/cephadm_private_key
> > > ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.16
> >     host cn05.ceph (192.168.30.15) failed check: Failed to connect to
> > cn05.ceph (192.168.30.15).
> > Please make sure that the host is reachable and accepts connections using
> > the cephadm SSH key
> >
> > To add the cephadm SSH key to the host:
> > > ceph cephadm get-pub-key > ~/ceph.pub
> > > ssh-copy-id -f -i ~/ceph.pub root@192.168.30.15
> >
> > To check that the host is reachable open a new shell with the --no-hosts
> > flag:
> > > cephadm shell --no-hosts
> >
> > Then run the following:
> > > ceph cephadm get-ssh-config > ssh_config
> > > ceph config-key get mgr/cephadm/ssh_identity_key >
> ~/cephadm_private_key
> > > chmod 0600 ~/cephadm_private_key
> > > ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.15
> > [WRN] FS_DEGRADED: 2 filesystems are degraded
> >     fs coldlogix is degraded
> >     fs btc is degraded
> > [WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
> >     mds.coldlogix.cn01.uriofo(mds.0): 2 slow metadata IOs are blocked >
> 30
> > secs, oldest blocked for 2096 secs
> > [WRN] MON_DOWN: 2/5 mons down, quorum cn02,cn03,cn01
> >     mon.cn05 (rank 0) addr [v2:
> > 192.168.30.15:3300/0,v1:192.168.30.15:6789/0]
> > is down (out of quorum)
> >     mon.cn04 (rank 3) addr [v2:
> > 192.168.30.14:3300/0,v1:192.168.30.14:6789/0]
> > is down (out of quorum)
> > [WRN] OSD_DOWN: 10 osds down
> >     osd.0 (root=default,host=cn05) is down
> >     osd.1 (root=default,host=cn06) is down
> >     osd.7 (root=default,host=cn04) is down
> >     osd.13 (root=default,host=cn06) is down
> >     osd.15 (root=default,host=cn05) is down
> >     osd.18 (root=default,host=cn04) is down
> >     osd.20 (root=default,host=cn04) is down
> >     osd.33 (root=default,host=cn06) is down
> >     osd.34 (root=default,host=cn06) is down
> >     osd.36 (root=default,host=cn05) is down
> > [WRN] OSD_HOST_DOWN: 3 hosts (17 osds) down
> >     host cn04 (root=default) (6 osds) is down
> >     host cn05 (root=default) (5 osds) is down
> >     host cn06 (root=default) (6 osds) is down
> > [WRN] PG_AVAILABILITY: Reduced data availability: 13 pgs inactive, 9 pgs
> > down
> >     pg 9.3a is down, acting [8]
> >     pg 9.7a is down, acting [8]
> >     pg 9.ba is down, acting [8]
> >     pg 9.fa is down, acting [8]
> >     pg 11.3 is stuck inactive for 39h, current state
> > undersized+degraded+peered, last acting [11]
> >     pg 11.11 is down, acting [19,9]
> >     pg 11.1f is stuck inactive for 13h, current state
> > undersized+degraded+peered, last acting [10]
> >     pg 12.36 is down, acting [21,16]
> >     pg 12.59 is down, acting [26,5]
> >     pg 12.66 is down, acting [5]
> >     pg 19.4 is stuck inactive for 39h, current state
> > undersized+degraded+peered, last acting [6]
> >     pg 19.1c is down, acting [21,16,11]
> >     pg 21.1 is stuck inactive for 36m, current state unknown, last acting
> > []
> > [WRN] PG_DEGRADED: Degraded data redundancy: 8515690/30862245 objects
> > degraded (27.593%), 326 pgs degraded, 447 pgs undersized
> >     pg 9.75 is stuck undersized for 34m, current state
> > active+undersized+remapped, last acting [4,8,35]
> >     pg 9.76 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [35,10,21]
> >     pg 9.77 is stuck undersized for 34m, current state
> > active+undersized+remapped, last acting [32,35,4]
> >     pg 9.78 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [14,10]
> >     pg 9.79 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [21,32]
> >     pg 9.7b is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [8,12,5]
> >     pg 9.7c is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [4,35,10]
> >     pg 9.7d is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [5,19,10]
> >     pg 9.7e is stuck undersized for 35m, current state
> > active+undersized+remapped, last acting [21,10,17]
> >     pg 9.80 is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [8,4,17]
> >     pg 9.81 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [14,26]
> >     pg 9.82 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [26,16]
> >     pg 9.83 is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [8,4]
> >     pg 9.84 is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [4,35,6]
> >     pg 9.85 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [32,12,9]
> >     pg 9.86 is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [35,5,8]
> >     pg 9.87 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [9,12]
> >     pg 9.88 is stuck undersized for 35m, current state
> > active+undersized+remapped, last acting [19,32,35]
> >     pg 9.89 is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [10,14,4]
> >     pg 9.8a is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [21,19]
> >     pg 9.8b is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [8,35]
> >     pg 9.8c is stuck undersized for 31m, current state
> > active+undersized+remapped, last acting [10,19,5]
> >     pg 9.8d is stuck undersized for 35m, current state
> > active+undersized+remapped, last acting [9,6]
> >     pg 9.8f is stuck undersized for 35m, current state
> > active+undersized+remapped, last acting [19,26,17]
> >     pg 9.90 is stuck undersized for 35m, current state
> > active+undersized+remapped, last acting [35,26]
> >     pg 9.91 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [17,5]
> >     pg 9.92 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [21,26]
> >     pg 9.93 is stuck undersized for 35m, current state
> > active+undersized+remapped, last acting [19,26,5]
> >     pg 9.94 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [21,11]
> >     pg 9.95 is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [8,19]
> >     pg 9.96 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [17,6]
> >     pg 9.97 is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [8,9,16]
> >     pg 9.98 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [6,21]
> >     pg 9.99 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [10,9]
> >     pg 9.9a is stuck undersized for 34m, current state
> > active+undersized+remapped, last acting [4,16,10]
> >     pg 9.9b is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [12,4,11]
> >     pg 9.9c is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [9,16]
> >     pg 9.9d is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [26,35]
> >     pg 9.9f is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [9,17,26]
> >     pg 12.70 is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [21,35]
> >     pg 12.71 is active+undersized+degraded, acting [6,12]
> >     pg 12.72 is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [10,14,4]
> >     pg 12.73 is stuck undersized for 35m, current state
> > active+undersized+remapped, last acting [5,17,11]
> >     pg 12.78 is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [5,8,35]
> >     pg 12.79 is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [4,17]
> >     pg 12.7a is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [10,21]
> >     pg 12.7b is stuck undersized for 35m, current state
> > active+undersized+remapped, last acting [17,21,11]
> >     pg 12.7c is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [32,21,16]
> >     pg 12.7d is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [35,6,9]
> >     pg 12.7e is stuck undersized for 34m, current state
> > active+undersized+degraded, last acting [26,4]
> >     pg 12.7f is stuck undersized for 35m, current state
> > active+undersized+degraded, last acting [9,14]
> >
> > On Mon, Jul 25, 2022 at 12:43 PM Jeremy Hansen <
> > farnsworth.mcfadden@xxxxxxxxx> wrote:
> >
> > > Pretty desperate here.  Can someone suggest what I might be able to do
> > > to get these OSDs back up?  It looks like my recovery has stalled.
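> > >
> > > Would something like "ceph orch daemon restart osd.34" be the right thing to
> > > try once the hosts are reachable again, or should I be restarting the systemd
> > > units on the hosts directly?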
> > >
> > >
> > > On Mon, Jul 25, 2022 at 7:26 AM Anthony D'Atri <
> anthony.datri@xxxxxxxxx>
> > > wrote:
> > >
> > >> Do your values for public and cluster network include the new addresses
> > >> on all nodes?
> > >>
> > >
> > > This cluster only has one network.  There is no separation between
> > > public and cluster.  Three of the nodes momentarily came up using a
> > > different IP address.
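> > >
> > > (If it helps, the subnet the cluster is configured to expect can be
> > > double-checked with something like "ceph config get mon public_network" or
> > > "ceph config dump | grep network"; I'm assuming the temporary addresses fell
> > > outside of it.)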
> > >
> > > I've also noticed that, on one of the nodes that did not move or have any
> > > IP issue, the dashboard names the same device for two different OSDs:
> > >
> > > 2 cn01 out destroyed hdd TOSHIBA_MG04SCA40EE_21M0A0CKFWZB Unknown sdb osd.2
> > >
> > > 3 cn01 out destroyed ssd SAMSUNG_MZILT3T8HBLS/007_S5G0NE0R200159 Unknown sdb osd.3
> > >
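> > > (Something like "ceph osd metadata 2 | grep -i device", and the same for
> > > osd.3, ought to show which path each id actually claims, at least for OSDs
> > > that still report metadata; both of these are currently marked destroyed, so
> > > it may return nothing.)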
> > >
> > > [ceph: root@cn01 /]# ceph-volume inventory
> > >
> > > Device Path               Size         rotates available Model name
> > > /dev/sda                  3.64 TB      True    False     MG04SCA40EE
> > > /dev/sdb                  3.49 TB      False   False     MZILT3T8HBLS/007
> > > /dev/sdc                  3.64 TB      True    False     MG04SCA40EE
> > > /dev/sdd                  3.64 TB      True    False     MG04SCA40EE
> > > /dev/sde                  3.49 TB      False   False     MZILT3T8HBLS/007
> > > /dev/sdf                  3.64 TB      True    False     MG04SCA40EE
> > > /dev/sdg                  698.64 GB    True    False     SEAGATE ST375064
> > >
> > > [ceph: root@cn01 /]# ceph osd info
> > > osd.0 down out weight 0 up_from 30231 up_thru 30564 down_at 30688
> > > last_clean_interval [25500,30228) [v2:
> > > 192.168.30.15:6818/2512683421,v1:192.168.30.15:6819/2512683421] [v2:
> > > 192.168.30.15:6824/2512683421,v1:192.168.30.15:6826/2512683421]
> > > autoout,exists d14cf503-a303-4fa4-a713-9530b67d613a
> > > osd.1 down out weight 0 up_from 30393 up_thru 30688 down_at 30697
> > > last_clean_interval [25518,30321) [v2:
> > > 192.168.30.16:6834/1781855831,v1:192.168.30.16:6835/1781855831] [v2:
> > > 192.168.30.16:6836/1781855831,v1:192.168.30.16:6837/1781855831]
> > > autoout,exists 0d521411-c835-4fa3-beca-3631b4ff6bf7
> > > osd.2 down out weight 0 up_from 31316 up_thru 31293 down_at 31317
> > > last_clean_interval [31218,31296) [v2:
> > > 192.168.30.11:6810/894589880,v1:192.168.30.11:6811/894589880] [v2:
> > > 192.168.30.11:6812/894589880,v1:192.168.30.11:6813/894589880]
> > > destroyed,exists
> > > osd.3 down out weight 0 up_from 31265 up_thru 31266 down_at 31268
> > > last_clean_interval [31254,31256) [v2:
> > > 192.168.30.11:6818/1641948535,v1:192.168.30.11:6819/1641948535] [v2:
> > > 192.168.30.11:6820/1641948535,v1:192.168.30.11:6821/1641948535]
> > > destroyed,exists
> > > osd.4 up   in  weight 1 up_from 31356 up_thru 31581 down_at 31339
> > > last_clean_interval [31320,31338) [v2:
> > > 192.168.30.11:6802/2785067179,v1:192.168.30.11:6803/2785067179] [v2:
> > > 192.168.30.11:6804/2785067179,v1:192.168.30.11:6805/2785067179]
> > exists,up
> > > 3afd06db-b91d-44fe-9305-5eb95f7a59b9
> > > osd.5 up   in  weight 1 up_from 31347 up_thru 31699 down_at 31339
> > > last_clean_interval [31311,31338) [v2:
> > > 192.168.30.11:6818/1936771540,v1:192.168.30.11:6819/1936771540] [v2:
> > > 192.168.30.11:6820/1936771540,v1:192.168.30.11:6821/1936771540]
> > exists,up
> > > 063c2ccf-02ce-4f5e-8252-dddfbb258a95
> > > osd.6 up   in  weight 1 up_from 31218 up_thru 31711 down_at 31217
> > > last_clean_interval [30978,31195) [v2:
> > > 192.168.30.12:6816/1585973160,v1:192.168.30.12:6817/1585973160] [v2:
> > > 192.168.30.12:6818/1585973160,v1:192.168.30.12:6819/1585973160]
> > exists,up
> > > 94250ea2-f12e-4dc6-9135-b626086ccffd
> > > osd.7 down out weight 0 up_from 30353 up_thru 30558 down_at 30688
> > > last_clean_interval [25533,30349) [v2:
> > > 192.168.30.14:6816/4083104061,v1:192.168.30.14:6817/4083104061] [v2:
> > > 192.168.30.14:6840/4094104061,v1:192.168.30.14:6841/4094104061]
> > > autoout,exists de351aec-b91e-4c22-a0bf-85369bc14579
> > > osd.8 up   in  weight 1 up_from 31226 up_thru 31668 down_at 31225
> > > last_clean_interval [30983,31195) [v2:
> > > 192.168.30.12:6824/1312484329,v1:192.168.30.12:6825/1312484329] [v2:
> > > 192.168.30.12:6826/1312484329,v1:192.168.30.12:6827/1312484329]
> > exists,up
> > > 51f665b4-fa5b-4b17-8390-ed130145ef04
> > > osd.9 up   in  weight 1 up_from 31351 up_thru 31673 down_at 31340
> > > last_clean_interval [31315,31338) [v2:
> > > 192.168.30.11:6810/1446838877,v1:192.168.30.11:6811/1446838877] [v2:
> > > 192.168.30.11:6812/1446838877,v1:192.168.30.11:6813/1446838877]
> > exists,up
> > > 985f1127-d126-4629-b8cd-03cf2d914d99
> > > osd.10 up   in  weight 1 up_from 31219 up_thru 31639 down_at 31218
> > > last_clean_interval [30980,31195) [v2:
> > > 192.168.30.12:6808/1587842953,v1:192.168.30.12:6809/1587842953] [v2:
> > > 192.168.30.12:6810/1587842953,v1:192.168.30.12:6811/1587842953]
> > exists,up
> > > c7fca03e-4bd5-4485-a090-658ca967d5f6
> > > osd.11 up   in  weight 1 up_from 31234 up_thru 31659 down_at 31223
> > > last_clean_interval [30978,31195) [v2:
> > > 192.168.30.12:6840/3403200742,v1:192.168.30.12:6841/3403200742] [v2:
> > > 192.168.30.12:6842/3403200742,v1:192.168.30.12:6843/3403200742]
> > exists,up
> > > 81074bd7-ad9f-4e56-8885-cca4745f6c95
> > > osd.12 up   in  weight 1 up_from 31230 up_thru 31717 down_at 31223
> > > last_clean_interval [30975,31195) [v2:
> > > 192.168.30.13:6816/4268732910,v1:192.168.30.13:6817/4268732910] [v2:
> > > 192.168.30.13:6818/4268732910,v1:192.168.30.13:6819/4268732910]
> > exists,up
> > > af1b55dd-c110-4861-aed9-c0737cef8be1
> > > osd.13 down out weight 0 up_from 30389 up_thru 30688 down_at 30695
> > > last_clean_interval [25513,30317) [v2:
> > > 192.168.30.16:6804/1573803727,v1:192.168.30.16:6805/1573803727] [v2:
> > > 192.168.30.16:6806/1573803727,v1:192.168.30.16:6807/1573803727]
> > > autoout,exists 737a3234-0f1f-4286-80e9-e89b581aae41
> > > osd.14 up   in  weight 1 up_from 31218 up_thru 31709 down_at 31217
> > > last_clean_interval [30979,31195) [v2:
> > > 192.168.30.13:6834/2291187817,v1:192.168.30.13:6835/2291187817] [v2:
> > > 192.168.30.13:6836/2291187817,v1:192.168.30.13:6837/2291187817]
> > exists,up
> > > 97cd6ac7-aca0-42fd-a049-d27289f83183
> > > osd.15 down out weight 0 up_from 29463 up_thru 30531 down_at 30688
> > > last_clean_interval [25493,29462) [v2:
> > > 192.168.30.15:6808/2655269991,v1:192.168.30.15:6809/2655269991] [v2:
> > > 192.168.30.15:6802/2662269991,v1:192.168.30.15:6803/2662269991]
> > > autoout,exists 61aea8f4-5905-4be3-ae32-5eacf75a514e
> > > osd.16 up   in  weight 1 up_from 31226 up_thru 31647 down_at 31223
> > > last_clean_interval [30970,31195) [v2:
> > > 192.168.30.13:6808/2624812081,v1:192.168.30.13:6809/2624812081] [v2:
> > > 192.168.30.13:6810/2624812081,v1:192.168.30.13:6811/2624812081]
> > exists,up
> > > 791a7542-87cd-403d-a37e-8f00506b2eb6
> > > osd.17 up   in  weight 1 up_from 31219 up_thru 31703 down_at 31218
> > > last_clean_interval [30975,31195) [v2:
> > > 192.168.30.13:6800/2978036397,v1:192.168.30.13:6801/2978036397] [v2:
> > > 192.168.30.13:6802/2978036397,v1:192.168.30.13:6803/2978036397]
> > exists,up
> > > 4a915645-412f-49e6-8477-1577469905da
> > > osd.18 down out weight 0 up_from 30334 up_thru 30566 down_at 30688
> > > last_clean_interval [25543,30327) [v2:
> > > 192.168.30.14:6832/985432137,v1:192.168.30.14:6833/985432137] [v2:
> > > 192.168.30.14:6848/998432137,v1:192.168.30.14:6849/998432137]
> > > autoout,exists 85f59d83-710c-4896-9200-bda4894fc3e8
> > > osd.19 down in  weight 1 up_from 31303 up_thru 31321 down_at 31323
> > > last_clean_interval [31292,31296) [v2:
> > > 192.168.30.13:6826/375623427,v1:192.168.30.13:6827/375623427] [v2:
> > > 192.168.30.13:6828/375623427,v1:192.168.30.13:6829/375623427] exists
> > > 7d09b51a-bd6d-40f8-a009-78ab9937795d
> > > osd.20 down out weight 0 up_from 30445 up_thru 30531 down_at 30688
> > > last_clean_interval [28947,30444) [v2:
> > > 192.168.30.14:6810/4062649162,v1:192.168.30.14:6811/4062649162] [v2:
> > > 192.168.30.14:6800/4073649162,v1:192.168.30.14:6801/4073649162]
> > > autoout,exists 7ef6cc1a-4755-4a14-b9df-f1f538d903ce
> > > osd.21 up   in  weight 1 up_from 31345 up_thru 31567 down_at 31341
> > > last_clean_interval [31307,31340) [v2:
> > > 192.168.30.11:6826/1625231298,v1:192.168.30.11:6827/1625231298] [v2:
> > > 192.168.30.11:6828/1625231298,v1:192.168.30.11:6829/1625231298]
> > exists,up
> > > 5ef2e39a-a353-4cb8-a49e-093fe39b94ef
> > > osd.22 down in  weight 1 up_from 30383 up_thru 30528 down_at 30688
> > > last_clean_interval [25549,30317) [v2:
> > > 192.168.30.14:6806/1204256629,v1:192.168.30.14:6807/1204256629] [v2:
> > > 192.168.30.14:6812/1204256629,v1:192.168.30.14:6813/1204256629] exists
> > > c9befe11-a035-449c-8d17-42aaf191923d
> > > osd.23 down in  weight 1 up_from 30334 up_thru 30576 down_at 30688
> > > last_clean_interval [30263,30332) [v2:
> > > 192.168.30.14:6802/3837786490,v1:192.168.30.14:6803/3837786490] [v2:
> > > 192.168.30.14:6830/3838786490,v1:192.168.30.14:6831/3838786490] exists
> > > 2081147b-065d-4da7-89d9-747e1ae02b8d
> > > osd.24 down in  weight 1 up_from 29455 up_thru 30570 down_at 30688
> > > last_clean_interval [25487,29453) [v2:
> > > 192.168.30.15:6800/2008474583,v1:192.168.30.15:6801/2008474583] [v2:
> > > 192.168.30.15:6810/2016474583,v1:192.168.30.15:6811/2016474583] exists
> > > 39d78380-261c-4689-b53d-90713e6ffcca
> > > osd.26 up   in  weight 1 up_from 31208 up_thru 31643 down_at 31207
> > > last_clean_interval [30967,31195) [v2:
> > > 192.168.30.12:6800/2861018947,v1:192.168.30.12:6801/2861018947] [v2:
> > > 192.168.30.12:6802/2861018947,v1:192.168.30.12:6803/2861018947]
> > exists,up
> > > 046622c8-c09c-4254-8c15-3dc05a2f01ed
> > > osd.28 down in  weight 1 up_from 30389 up_thru 30574 down_at 30691
> > > last_clean_interval [25513,30312) [v2:
> > > 192.168.30.16:6820/3466284570,v1:192.168.30.16:6821/3466284570] [v2:
> > > 192.168.30.16:6822/3466284570,v1:192.168.30.16:6823/3466284570] exists
> > > 10578b97-e3c4-4553-a8d0-6dcc46af5db1
> > > osd.29 down in  weight 1 up_from 30378 up_thru 30554 down_at 30688
> > > last_clean_interval [28595,30376) [v2:
> > > 192.168.30.14:6808/3739543672,v1:192.168.30.14:6809/3739543672] [v2:
> > > 192.168.30.14:6846/3747543672,v1:192.168.30.14:6847/3747543672] exists
> > > 9698e936-8edf-4adf-92c9-a0b5202ed01a
> > > osd.30 down in  weight 1 up_from 30449 up_thru 30531 down_at 30688
> > > last_clean_interval [25502,30446) [v2:
> > > 192.168.30.15:6825/2375507296,v1:192.168.30.15:6827/2375507296] [v2:
> > > 192.168.30.15:6829/2375507296,v1:192.168.30.15:6831/2375507296] exists
> > > e14d2a0f-a98a-44d4-8c06-4d893f673629
> > > osd.31 down in  weight 1 up_from 30364 up_thru 30688 down_at 30700
> > > last_clean_interval [25514,30361) [v2:
> > > 192.168.30.16:6826/2835000708,v1:192.168.30.16:6827/2835000708] [v2:
> > > 192.168.30.16:6802/2843000708,v1:192.168.30.16:6803/2843000708] exists
> > > 541bca38-e704-483a-8cb8-39e5f69007d1
> > > osd.32 up   in  weight 1 up_from 31209 up_thru 31627 down_at 31208
> > > last_clean_interval [30974,31195) [v2:
> > > 192.168.30.12:6832/3860067997,v1:192.168.30.12:6833/3860067997] [v2:
> > > 192.168.30.12:6834/3860067997,v1:192.168.30.12:6835/3860067997]
> > exists,up
> > > 9200a57e-2845-43ff-9787-8f1f3158fe90
> > > osd.33 down in  weight 1 up_from 30354 up_thru 30688 down_at 30693
> > > last_clean_interval [25521,30350) [v2:
> > > 192.168.30.16:6842/2342555666,v1:192.168.30.16:6843/2342555666] [v2:
> > > 192.168.30.16:6844/2364555666,v1:192.168.30.16:6845/2364555666] exists
> > > 20c55d85-cf9a-4133-a189-7fdad2318f58
> > > osd.34 down in  weight 1 up_from 30390 up_thru 30688 down_at 30691
> > > last_clean_interval [25516,30314) [v2:
> > > 192.168.30.16:6808/2282629870,v1:192.168.30.16:6811/2282629870] [v2:
> > > 192.168.30.16:6812/2282629870,v1:192.168.30.16:6814/2282629870] exists
> > > 77e0ef8f-c047-4f84-afb2-a8ad054e562f
> > > osd.35 up   in  weight 1 up_from 31204 up_thru 31657 down_at 31203
> > > last_clean_interval [30958,31195) [v2:
> > > 192.168.30.13:6842/1919357520,v1:192.168.30.13:6843/1919357520] [v2:
> > > 192.168.30.13:6844/1919357520,v1:192.168.30.13:6845/1919357520]
> > exists,up
> > > 2d2de0cb-6d41-4957-a473-2bbe9ce227bf
> > > osd.36 down in  weight 1 up_from 29494 up_thru 30560 down_at 30688
> > > last_clean_interval [25491,29492) [v2:
> > > 192.168.30.15:6816/2153321591,v1:192.168.30.15:6817/2153321591] [v2:
> > > 192.168.30.15:6842/2158321591,v1:192.168.30.15:6843/2158321591] exists
> > > 26114668-68b2-458b-89c2-cbad5507ab75
> > >
> > >
> > >
> > >>
> > >> > On Jul 25, 2022, at 3:29 AM, Jeremy Hansen <
> > >> farnsworth.mcfadden@xxxxxxxxx> wrote:
> > >> >
> > >> > I transitioned some servers to a new rack and now I'm having major
> > >> > issues with Ceph upon bringing things back up.
> > >> >
> > >> > I believe the issue may be related to the ceph nodes coming back up with
> > >> > different IPs before VLANs were set.  That's just a guess because I can't
> > >> > think of any other reason this would happen.
> > >> >
> > >> > Current state:
> > >> >
> > >> > Every 2.0s: ceph -s
> > >> >   cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022
> > >> >
> > >> >  cluster:
> > >> >    id:     bfa2ad58-c049-11eb-9098-3c8cf8ed728d
> > >> >    health: HEALTH_WARN
> > >> >            1 filesystem is degraded
> > >> >            2 MDSs report slow metadata IOs
> > >> >            2/5 mons down, quorum cn02,cn03,cn01
> > >> >            9 osds down
> > >> >            3 hosts (17 osds) down
> > >> >            Reduced data availability: 97 pgs inactive, 9 pgs down
> > >> >            Degraded data redundancy: 13860144/30824413 objects
> > degraded
> > >> > (44.965%), 411 pgs degraded, 482 pgs undersized
> > >> >
> > >> >  services:
> > >> >    mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum:
> > cn05,
> > >> > cn04
> > >> >    mgr: cn02.arszct(active, since 5m)
> > >> >    mds: 2/2 daemons up, 2 standby
> > >> >    osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped
> > pgs
> > >> >
> > >> >  data:
> > >> >    volumes: 1/2 healthy, 1 recovering
> > >> >    pools:   8 pools, 545 pgs
> > >> >    objects: 7.71M objects, 6.7 TiB
> > >> >    usage:   15 TiB used, 39 TiB / 54 TiB avail
> > >> >    pgs:     0.367% pgs unknown
> > >> >             17.431% pgs not active
> > >> >             13860144/30824413 objects degraded (44.965%)
> > >> >             1137693/30824413 objects misplaced (3.691%)
> > >> >             280 active+undersized+degraded
> > >> >             67  undersized+degraded+remapped+backfilling+peered
> > >> >             57  active+undersized+remapped
> > >> >             45  active+clean+remapped
> > >> >             44  active+undersized+degraded+remapped+backfilling
> > >> >             18  undersized+degraded+peered
> > >> >             10  active+undersized
> > >> >             9   down
> > >> >             7   active+clean
> > >> >             3   active+undersized+remapped+backfilling
> > >> >             2   active+undersized+degraded+remapped+backfill_wait
> > >> >             2   unknown
> > >> >             1   undersized+peered
> > >> >
> > >> >  io:
> > >> >    client:   170 B/s rd, 0 op/s rd, 0 op/s wr
> > >> >    recovery: 168 MiB/s, 158 keys/s, 166 objects/s
> > >> >
> > >> > I have to disable and re-enable the dashboard just to use it.  It
> > seems
> > >> to
> > >> > get bogged down after a few moments.
> > >> >
> > >> > Ceph has marked the three servers that were moved to the new rack as
> > >> > "Down", but if I do a cephadm host-check, they all seem to pass:
> > >> >
> > >> > ************************ ceph  ************************
> > >> > --------- cn01.ceph.---------
> > >> > podman (/usr/bin/podman) version 4.0.2 is present
> > >> > systemctl is present
> > >> > lvcreate is present
> > >> > Unit chronyd.service is enabled and running
> > >> > Host looks OK
> > >> > --------- cn02.ceph.---------
> > >> > podman (/usr/bin/podman) version 4.0.2 is present
> > >> > systemctl is present
> > >> > lvcreate is present
> > >> > Unit chronyd.service is enabled and running
> > >> > Host looks OK
> > >> > --------- cn03.ceph.---------
> > >> > podman (/usr/bin/podman) version 4.0.2 is present
> > >> > systemctl is present
> > >> > lvcreate is present
> > >> > Unit chronyd.service is enabled and running
> > >> > Host looks OK
> > >> > --------- cn04.ceph.---------
> > >> > podman (/usr/bin/podman) version 4.0.2 is present
> > >> > systemctl is present
> > >> > lvcreate is present
> > >> > Unit chronyd.service is enabled and running
> > >> > Host looks OK
> > >> > --------- cn05.ceph.---------
> > >> > podman|docker (/usr/bin/podman) is present
> > >> > systemctl is present
> > >> > lvcreate is present
> > >> > Unit chronyd.service is enabled and running
> > >> > Host looks OK
> > >> > --------- cn06.ceph.---------
> > >> > podman (/usr/bin/podman) version 4.0.2 is present
> > >> > systemctl is present
> > >> > lvcreate is present
> > >> > Unit chronyd.service is enabled and running
> > >> > Host looks OK
> > >> >
> > >> > It seems to be recovering with what it has left, but a large number of
> > >> > OSDs are down.  When trying to restart one of the downed OSDs, I see a
> > >> > huge dump.
> > >> >
> > >> > Jul 25 03:19:38 cn06.ceph
> > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug
> > >> > 2022-07-25T10:19:38.532+0000 7fce14a6c080  0 osd.34 30689 done with
> > >> init,
> > >> > starting boot process
> > >> > Jul 25 03:19:38 cn06.ceph
> > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug
> > >> > 2022-07-25T10:19:38.532+0000 7fce14a6c080  1 osd.34 30689 start_boot
> > >> > Jul 25 03:20:10 cn06.ceph
> > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug
> > >> > 2022-07-25T10:20:10.655+0000 7fcdfd12d700  1 osd.34 30689 start_boot
> > >> > Jul 25 03:20:41 cn06.ceph
> > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug
> > >> > 2022-07-25T10:20:41.159+0000 7fcdfd12d700  1 osd.34 30689 start_boot
> > >> > Jul 25 03:21:11 cn06.ceph
> > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug
> > >> > 2022-07-25T10:21:11.662+0000 7fcdfd12d700  1 osd.34 30689 start_boot
> > >> >
> > >> > At this point it just keeps printing start_boot, but the dashboard has
> > >> > it marked as "in" but "down".
> > >> >
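> > >> > (If an OSD loops on start_boot it usually can't reach the monitors to get
> > >> > itself marked up, so something like "nc -zv 192.168.30.11 3300" from that
> > >> > host, and the same for port 6789, would at least confirm basic reachability
> > >> > to the mons that are still in quorum.)
> > >> >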
> > >> > On these three hosts that moved, there were a bunch marked as "out" and
> > >> > "down", and some with "in" but "down".
> > >> >
> > >> > Not sure where to go next.  I'm going to let the recovery continue and
> > >> > hope that my 4x replication on these pools saves me.
> > >> >
> > >> > Not sure where to go from here.  Any help is very much appreciated.  This
> > >> > Ceph cluster holds all of our Cloudstack images...  it would be terrible
> > >> > to lose this data.
> > >> > _______________________________________________
> > >> > ceph-users mailing list -- ceph-users@xxxxxxx
> > >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
> > >>
> > >
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



