I noticed this on the initial run of ceph health, but I no longer see it. When you say "I don't use ceph-adm", can you explain why it's bad? This is ceph health outside of the cephadm shell:

HEALTH_WARN 1 filesystem is degraded; 2 MDSs report slow metadata IOs; 2/5 mons down, quorum cn02,cn03,cn01; 10 osds down; 3 hosts (17 osds) down; Reduced data availability: 13 pgs inactive, 9 pgs down; Degraded data redundancy: 8515690/30862245 objects degraded (27.593%), 326 pgs degraded, 447 pgs undersized
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs coldlogix is degraded
[WRN] MDS_SLOW_METADATA_IO: 2 MDSs report slow metadata IOs
    mds.coldlogix.cn01.uriofo(mds.0): 2 slow metadata IOs are blocked > 30 secs, oldest blocked for 3701 secs
    mds.btc.cn02.ouvaus(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 382 secs
[WRN] MON_DOWN: 2/5 mons down, quorum cn02,cn03,cn01
    mon.cn05 (rank 0) addr [v2:192.168.30.15:3300/0,v1:192.168.30.15:6789/0] is down (out of quorum)
    mon.cn04 (rank 3) addr [v2:192.168.30.14:3300/0,v1:192.168.30.14:6789/0] is down (out of quorum)
[WRN] OSD_DOWN: 10 osds down
    osd.0 (root=default,host=cn05) is down
    osd.1 (root=default,host=cn06) is down
    osd.7 (root=default,host=cn04) is down
    osd.13 (root=default,host=cn06) is down
    osd.15 (root=default,host=cn05) is down
    osd.18 (root=default,host=cn04) is down
    osd.20 (root=default,host=cn04) is down
    osd.33 (root=default,host=cn06) is down
    osd.34 (root=default,host=cn06) is down
    osd.36 (root=default,host=cn05) is down
[WRN] OSD_HOST_DOWN: 3 hosts (17 osds) down
    host cn04 (root=default) (6 osds) is down
    host cn05 (root=default) (5 osds) is down
    host cn06 (root=default) (6 osds) is down
[WRN] PG_AVAILABILITY: Reduced data availability: 13 pgs inactive, 9 pgs down
    pg 9.3a is down, acting [8]
    pg 9.7a is down, acting [8]
    pg 9.ba is down, acting [8]
    pg 9.fa is down, acting [8]
    pg 11.3 is stuck inactive for 39h, current state undersized+degraded+peered, last acting [11]
    pg 11.11 is down, acting [19,9]
    pg 11.1f is stuck inactive for 13h, current state undersized+degraded+peered, last acting [10]
    pg 12.36 is down, acting [21,16]
    pg 12.59 is down, acting [26,5]
    pg 12.66 is down, acting [5]
    pg 19.4 is stuck inactive for 39h, current state undersized+degraded+peered, last acting [6]
    pg 19.1c is down, acting [21,16,11]
    pg 21.1 is stuck inactive for 2m, current state unknown, last acting []
[WRN] PG_DEGRADED: Degraded data redundancy: 8515690/30862245 objects degraded (27.593%), 326 pgs degraded, 447 pgs undersized
    pg 9.75 is stuck undersized for 61m, current state active+undersized+remapped, last acting [4,8,35]
    pg 9.76 is stuck undersized for 62m, current state active+undersized+degraded, last acting [35,10,21]
    pg 9.77 is stuck undersized for 61m, current state active+undersized+remapped, last acting [32,35,4]
    pg 9.78 is stuck undersized for 62m, current state active+undersized+degraded, last acting [14,10]
    pg 9.79 is stuck undersized for 62m, current state active+undersized+degraded, last acting [21,32]
    pg 9.7b is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,12,5]
    pg 9.7c is stuck undersized for 61m, current state active+undersized+degraded, last acting [4,35,10]
    pg 9.7d is stuck undersized for 62m, current state active+undersized+degraded, last acting [5,19,10]
    pg 9.7e is stuck undersized for 62m, current state active+undersized+remapped, last acting [21,10,17]
    pg 9.80 is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,4,17]
    pg 9.81 is stuck undersized for 62m, current state active+undersized+degraded, last acting [14,26]
    pg 9.82 is stuck undersized for 62m, current state active+undersized+degraded, last acting [26,16]
    pg 9.83 is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,4]
    pg 9.84 is stuck undersized for 61m, current state active+undersized+degraded, last acting [4,35,6]
    pg 9.85 is stuck undersized for 61m, current state active+undersized+degraded, last acting [32,12,9]
    pg 9.86 is stuck undersized for 61m, current state active+undersized+degraded, last acting [35,5,8]
    pg 9.87 is stuck undersized for 61m, current state active+undersized+degraded, last acting [9,12]
    pg 9.88 is stuck undersized for 62m, current state active+undersized+remapped, last acting [19,32,35]
    pg 9.89 is stuck undersized for 61m, current state active+undersized+degraded, last acting [10,14,4]
    pg 9.8a is stuck undersized for 62m, current state active+undersized+degraded, last acting [21,19]
    pg 9.8b is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,35]
    pg 9.8c is stuck undersized for 58m, current state active+undersized+remapped, last acting [10,19,5]
    pg 9.8d is stuck undersized for 61m, current state active+undersized+remapped, last acting [9,6]
    pg 9.8f is stuck undersized for 62m, current state active+undersized+remapped, last acting [19,26,17]
    pg 9.90 is stuck undersized for 62m, current state active+undersized+remapped, last acting [35,26]
    pg 9.91 is stuck undersized for 62m, current state active+undersized+degraded, last acting [17,5]
    pg 9.92 is stuck undersized for 62m, current state active+undersized+degraded, last acting [21,26]
    pg 9.93 is stuck undersized for 62m, current state active+undersized+remapped, last acting [19,26,5]
    pg 9.94 is stuck undersized for 62m, current state active+undersized+degraded, last acting [21,11]
    pg 9.95 is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,19]
    pg 9.96 is stuck undersized for 62m, current state active+undersized+degraded, last acting [17,6]
    pg 9.97 is stuck undersized for 61m, current state active+undersized+degraded, last acting [8,9,16]
    pg 9.98 is stuck undersized for 62m, current state active+undersized+degraded, last acting [6,21]
    pg 9.99 is stuck undersized for 61m, current state active+undersized+degraded, last acting [10,9]
    pg 9.9a is stuck undersized for 61m, current state active+undersized+remapped, last acting [4,16,10]
    pg 9.9b is stuck undersized for 61m, current state active+undersized+degraded, last acting [12,4,11]
    pg 9.9c is stuck undersized for 61m, current state active+undersized+degraded, last acting [9,16]
    pg 9.9d is stuck undersized for 62m, current state active+undersized+degraded, last acting [26,35]
    pg 9.9f is stuck undersized for 61m, current state active+undersized+degraded, last acting [9,17,26]
    pg 12.70 is stuck undersized for 62m, current state active+undersized+degraded, last acting [21,35]
    pg 12.71 is active+undersized+degraded, acting [6,12]
    pg 12.72 is stuck undersized for 61m, current state active+undersized+degraded, last acting [10,14,4]
    pg 12.73 is stuck undersized for 62m, current state active+undersized+remapped, last acting [5,17,11]
    pg 12.78 is stuck undersized for 61m, current state active+undersized+degraded, last acting [5,8,35]
    pg 12.79 is stuck undersized for 61m, current state active+undersized+degraded, last acting [4,17]
    pg 12.7a is stuck undersized for 62m, current state active+undersized+degraded, last acting [10,21]
    pg 12.7b is stuck undersized for 62m, current state active+undersized+remapped, last acting [17,21,11]
    pg 12.7c is stuck undersized for 62m, current state active+undersized+degraded, last acting [32,21,16]
    pg 12.7d is stuck undersized for 61m, current state active+undersized+degraded, last acting [35,6,9]
    pg 12.7e is stuck undersized for 61m, current state active+undersized+degraded, last acting [26,4]
    pg 12.7f is stuck undersized for 61m, current state active+undersized+degraded, last acting [9,14]

It's no longer reporting the SSH key issues, but that hasn't done anything to improve my situation. When the machines came up with a different IP, did that somehow throw off an SSH known_hosts file or public key exchange? It's all very strange that a momentarily wrong IP could wreak so much havoc. I've put the exact checks I plan to run from the admin node at the very bottom of this mail, below the quoted thread; please tell me if that's the wrong approach.

Thank you -jeremy

On Mon, Jul 25, 2022 at 1:44 PM Frank Schilder <frans@xxxxxx> wrote:
> I don't use ceph-adm and I also don't know how you got the "some more > info". However, I did notice that it contains instructions, starting at > "Please make sure that the host is reachable ...". How about starting to > follow those? > > Best regards, > ================= > Frank Schilder > AIT Risø Campus > Bygning 109, rum S14 > > ________________________________________ > From: Jeremy Hansen <farnsworth.mcfadden@xxxxxxxxx> > Sent: 25 July 2022 22:32:32 > To: ceph-users@xxxxxxx > Subject: [Warning Possible spam] Re: Issues after a shutdown > > Here's some more info: > > HEALTH_WARN 2 failed cephadm daemon(s); 3 hosts fail cephadm check; 2 > filesystems are degraded; 1 MDSs report slow metadata IOs; 2/5 mons down, > quorum cn02,cn03,cn01; 10 osds down; 3 hosts (17 osds) down; Reduced data > availability: 13 pgs inactive, 9 pgs down; Degraded data redundancy: > 8515690/30862245 objects degraded (27.593%), 326 pgs degraded, 447 pgs > undersized > [WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s) > daemon osd.3 on cn01.ceph is in error state > daemon osd.2 on cn01.ceph is in error state > [WRN] CEPHADM_HOST_CHECK_FAILED: 3 hosts fail cephadm check > host cn04.ceph (192.168.30.14) failed check: Failed to connect to > cn04.ceph (192.168.30.14). > Please make sure that the host is reachable and accepts connections using > the cephadm SSH key > > To add the cephadm SSH key to the host: > > ceph cephadm get-pub-key > ~/ceph.pub > > ssh-copy-id -f -i ~/ceph.pub root@192.168.30.14 > > To check that the host is reachable open a new shell with the --no-hosts > flag: > > cephadm shell --no-hosts > > Then run the following: > > ceph cephadm get-ssh-config > ssh_config > > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key > > chmod 0600 ~/cephadm_private_key > > ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.14 > host cn06.ceph (192.168.30.16) failed check: Failed to connect to > cn06.ceph (192.168.30.16). > Please make sure that the host is reachable and accepts connections using > the cephadm SSH key > > To add the cephadm SSH key to the host: > > ceph cephadm get-pub-key > ~/ceph.pub > > ssh-copy-id -f -i ~/ceph.pub root@192.168.30.16 > > To check that the host is reachable open a new shell with the --no-hosts > flag: > > cephadm shell --no-hosts > > Then run the following: > > ceph cephadm get-ssh-config > ssh_config > > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key > > chmod 0600 ~/cephadm_private_key > > ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.16 > host cn05.ceph (192.168.30.15) failed check: Failed to connect to > cn05.ceph (192.168.30.15). 
> Please make sure that the host is reachable and accepts connections using > the cephadm SSH key > > To add the cephadm SSH key to the host: > > ceph cephadm get-pub-key > ~/ceph.pub > > ssh-copy-id -f -i ~/ceph.pub root@192.168.30.15 > > To check that the host is reachable open a new shell with the --no-hosts > flag: > > cephadm shell --no-hosts > > Then run the following: > > ceph cephadm get-ssh-config > ssh_config > > ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key > > chmod 0600 ~/cephadm_private_key > > ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.15 > [WRN] FS_DEGRADED: 2 filesystems are degraded > fs coldlogix is degraded > fs btc is degraded > [WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs > mds.coldlogix.cn01.uriofo(mds.0): 2 slow metadata IOs are blocked > 30 > secs, oldest blocked for 2096 secs > [WRN] MON_DOWN: 2/5 mons down, quorum cn02,cn03,cn01 > mon.cn05 (rank 0) addr [v2: > 192.168.30.15:3300/0,v1:192.168.30.15:6789/0] > is down (out of quorum) > mon.cn04 (rank 3) addr [v2: > 192.168.30.14:3300/0,v1:192.168.30.14:6789/0] > is down (out of quorum) > [WRN] OSD_DOWN: 10 osds down > osd.0 (root=default,host=cn05) is down > osd.1 (root=default,host=cn06) is down > osd.7 (root=default,host=cn04) is down > osd.13 (root=default,host=cn06) is down > osd.15 (root=default,host=cn05) is down > osd.18 (root=default,host=cn04) is down > osd.20 (root=default,host=cn04) is down > osd.33 (root=default,host=cn06) is down > osd.34 (root=default,host=cn06) is down > osd.36 (root=default,host=cn05) is down > [WRN] OSD_HOST_DOWN: 3 hosts (17 osds) down > host cn04 (root=default) (6 osds) is down > host cn05 (root=default) (5 osds) is down > host cn06 (root=default) (6 osds) is down > [WRN] PG_AVAILABILITY: Reduced data availability: 13 pgs inactive, 9 pgs > down > pg 9.3a is down, acting [8] > pg 9.7a is down, acting [8] > pg 9.ba is down, acting [8] > pg 9.fa is down, acting [8] > pg 11.3 is stuck inactive for 39h, current state > undersized+degraded+peered, last acting [11] > pg 11.11 is down, acting [19,9] > pg 11.1f is stuck inactive for 13h, current state > undersized+degraded+peered, last acting [10] > pg 12.36 is down, acting [21,16] > pg 12.59 is down, acting [26,5] > pg 12.66 is down, acting [5] > pg 19.4 is stuck inactive for 39h, current state > undersized+degraded+peered, last acting [6] > pg 19.1c is down, acting [21,16,11] > pg 21.1 is stuck inactive for 36m, current state unknown, last acting > [] > [WRN] PG_DEGRADED: Degraded data redundancy: 8515690/30862245 objects > degraded (27.593%), 326 pgs degraded, 447 pgs undersized > pg 9.75 is stuck undersized for 34m, current state > active+undersized+remapped, last acting [4,8,35] > pg 9.76 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [35,10,21] > pg 9.77 is stuck undersized for 34m, current state > active+undersized+remapped, last acting [32,35,4] > pg 9.78 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [14,10] > pg 9.79 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [21,32] > pg 9.7b is stuck undersized for 34m, current state > active+undersized+degraded, last acting [8,12,5] > pg 9.7c is stuck undersized for 34m, current state > active+undersized+degraded, last acting [4,35,10] > pg 9.7d is stuck undersized for 35m, current state > active+undersized+degraded, last acting [5,19,10] > pg 9.7e is stuck undersized for 35m, current state > active+undersized+remapped, 
last acting [21,10,17] > pg 9.80 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [8,4,17] > pg 9.81 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [14,26] > pg 9.82 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [26,16] > pg 9.83 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [8,4] > pg 9.84 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [4,35,6] > pg 9.85 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [32,12,9] > pg 9.86 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [35,5,8] > pg 9.87 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [9,12] > pg 9.88 is stuck undersized for 35m, current state > active+undersized+remapped, last acting [19,32,35] > pg 9.89 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [10,14,4] > pg 9.8a is stuck undersized for 35m, current state > active+undersized+degraded, last acting [21,19] > pg 9.8b is stuck undersized for 34m, current state > active+undersized+degraded, last acting [8,35] > pg 9.8c is stuck undersized for 31m, current state > active+undersized+remapped, last acting [10,19,5] > pg 9.8d is stuck undersized for 35m, current state > active+undersized+remapped, last acting [9,6] > pg 9.8f is stuck undersized for 35m, current state > active+undersized+remapped, last acting [19,26,17] > pg 9.90 is stuck undersized for 35m, current state > active+undersized+remapped, last acting [35,26] > pg 9.91 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [17,5] > pg 9.92 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [21,26] > pg 9.93 is stuck undersized for 35m, current state > active+undersized+remapped, last acting [19,26,5] > pg 9.94 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [21,11] > pg 9.95 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [8,19] > pg 9.96 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [17,6] > pg 9.97 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [8,9,16] > pg 9.98 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [6,21] > pg 9.99 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [10,9] > pg 9.9a is stuck undersized for 34m, current state > active+undersized+remapped, last acting [4,16,10] > pg 9.9b is stuck undersized for 34m, current state > active+undersized+degraded, last acting [12,4,11] > pg 9.9c is stuck undersized for 35m, current state > active+undersized+degraded, last acting [9,16] > pg 9.9d is stuck undersized for 35m, current state > active+undersized+degraded, last acting [26,35] > pg 9.9f is stuck undersized for 35m, current state > active+undersized+degraded, last acting [9,17,26] > pg 12.70 is stuck undersized for 35m, current state > active+undersized+degraded, last acting [21,35] > pg 12.71 is active+undersized+degraded, acting [6,12] > pg 12.72 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [10,14,4] > pg 12.73 is stuck undersized for 35m, current state > active+undersized+remapped, last acting [5,17,11] > pg 12.78 is stuck undersized for 34m, current state > 
active+undersized+degraded, last acting [5,8,35] > pg 12.79 is stuck undersized for 34m, current state > active+undersized+degraded, last acting [4,17] > pg 12.7a is stuck undersized for 35m, current state > active+undersized+degraded, last acting [10,21] > pg 12.7b is stuck undersized for 35m, current state > active+undersized+remapped, last acting [17,21,11] > pg 12.7c is stuck undersized for 35m, current state > active+undersized+degraded, last acting [32,21,16] > pg 12.7d is stuck undersized for 35m, current state > active+undersized+degraded, last acting [35,6,9] > pg 12.7e is stuck undersized for 34m, current state > active+undersized+degraded, last acting [26,4] > pg 12.7f is stuck undersized for 35m, current state > active+undersized+degraded, last acting [9,14] > > On Mon, Jul 25, 2022 at 12:43 PM Jeremy Hansen < > farnsworth.mcfadden@xxxxxxxxx> wrote: > > > Pretty desperate here. Can someone suggest what I might be able to do to > > get these OSDs back up. It looks like my recovery had stalled. > > > > > > On Mon, Jul 25, 2022 at 7:26 AM Anthony D'Atri <anthony.datri@xxxxxxxxx> > > wrote: > > > >> Do your values for public and cluster network include the new addresses > >> on all nodes? > >> > > > > This cluster only has one network. There is no separation between > > public and cluster. Three of the nodes momentarily came up using a > > different IP address. > > > > I've also noticed on one of the nodes that did not move or have any IP > > issue, on a single node, from the dashboard, it names the same device for > > two different osd's: > > > > 2 cn01 out destroyed hdd TOSHIBA_MG04SCA40EE_21M0A0CKFWZB Unknown sdb > osd.2 > > > > 3 cn01 out destroyed ssd SAMSUNG_MZILT3T8HBLS/007_S5G0NE0R200159 Unknown > > sdb osd.3 > > > > > > [ceph: root@cn01 /]# ceph-volume inventory > > > > Device Path Size rotates available Model name > > /dev/sda 3.64 TB True False MG04SCA40EE > > /dev/sdb 3.49 TB False False MZILT3T8HBLS/007 > > /dev/sdc 3.64 TB True False MG04SCA40EE > > /dev/sdd 3.64 TB True False MG04SCA40EE > > /dev/sde 3.49 TB False False MZILT3T8HBLS/007 > > /dev/sdf 3.64 TB True False MG04SCA40EE > > /dev/sdg 698.64 GB True False SEAGATE ST375064 > > > > [ceph: root@cn01 /]# ceph osd info > > osd.0 down out weight 0 up_from 30231 up_thru 30564 down_at 30688 > > last_clean_interval [25500,30228) [v2: > > 192.168.30.15:6818/2512683421,v1:192.168.30.15:6819/2512683421] [v2: > > 192.168.30.15:6824/2512683421,v1:192.168.30.15:6826/2512683421] > > autoout,exists d14cf503-a303-4fa4-a713-9530b67d613a > > osd.1 down out weight 0 up_from 30393 up_thru 30688 down_at 30697 > > last_clean_interval [25518,30321) [v2: > > 192.168.30.16:6834/1781855831,v1:192.168.30.16:6835/1781855831] [v2: > > 192.168.30.16:6836/1781855831,v1:192.168.30.16:6837/1781855831] > > autoout,exists 0d521411-c835-4fa3-beca-3631b4ff6bf7 > > osd.2 down out weight 0 up_from 31316 up_thru 31293 down_at 31317 > > last_clean_interval [31218,31296) [v2: > > 192.168.30.11:6810/894589880,v1:192.168.30.11:6811/894589880] [v2: > > 192.168.30.11:6812/894589880,v1:192.168.30.11:6813/894589880] > > destroyed,exists > > osd.3 down out weight 0 up_from 31265 up_thru 31266 down_at 31268 > > last_clean_interval [31254,31256) [v2: > > 192.168.30.11:6818/1641948535,v1:192.168.30.11:6819/1641948535] [v2: > > 192.168.30.11:6820/1641948535,v1:192.168.30.11:6821/1641948535] > > destroyed,exists > > osd.4 up in weight 1 up_from 31356 up_thru 31581 down_at 31339 > > last_clean_interval [31320,31338) [v2: > > 
192.168.30.11:6802/2785067179,v1:192.168.30.11:6803/2785067179] [v2: > > 192.168.30.11:6804/2785067179,v1:192.168.30.11:6805/2785067179] > exists,up > > 3afd06db-b91d-44fe-9305-5eb95f7a59b9 > > osd.5 up in weight 1 up_from 31347 up_thru 31699 down_at 31339 > > last_clean_interval [31311,31338) [v2: > > 192.168.30.11:6818/1936771540,v1:192.168.30.11:6819/1936771540] [v2: > > 192.168.30.11:6820/1936771540,v1:192.168.30.11:6821/1936771540] > exists,up > > 063c2ccf-02ce-4f5e-8252-dddfbb258a95 > > osd.6 up in weight 1 up_from 31218 up_thru 31711 down_at 31217 > > last_clean_interval [30978,31195) [v2: > > 192.168.30.12:6816/1585973160,v1:192.168.30.12:6817/1585973160] [v2: > > 192.168.30.12:6818/1585973160,v1:192.168.30.12:6819/1585973160] > exists,up > > 94250ea2-f12e-4dc6-9135-b626086ccffd > > osd.7 down out weight 0 up_from 30353 up_thru 30558 down_at 30688 > > last_clean_interval [25533,30349) [v2: > > 192.168.30.14:6816/4083104061,v1:192.168.30.14:6817/4083104061] [v2: > > 192.168.30.14:6840/4094104061,v1:192.168.30.14:6841/4094104061] > > autoout,exists de351aec-b91e-4c22-a0bf-85369bc14579 > > osd.8 up in weight 1 up_from 31226 up_thru 31668 down_at 31225 > > last_clean_interval [30983,31195) [v2: > > 192.168.30.12:6824/1312484329,v1:192.168.30.12:6825/1312484329] [v2: > > 192.168.30.12:6826/1312484329,v1:192.168.30.12:6827/1312484329] > exists,up > > 51f665b4-fa5b-4b17-8390-ed130145ef04 > > osd.9 up in weight 1 up_from 31351 up_thru 31673 down_at 31340 > > last_clean_interval [31315,31338) [v2: > > 192.168.30.11:6810/1446838877,v1:192.168.30.11:6811/1446838877] [v2: > > 192.168.30.11:6812/1446838877,v1:192.168.30.11:6813/1446838877] > exists,up > > 985f1127-d126-4629-b8cd-03cf2d914d99 > > osd.10 up in weight 1 up_from 31219 up_thru 31639 down_at 31218 > > last_clean_interval [30980,31195) [v2: > > 192.168.30.12:6808/1587842953,v1:192.168.30.12:6809/1587842953] [v2: > > 192.168.30.12:6810/1587842953,v1:192.168.30.12:6811/1587842953] > exists,up > > c7fca03e-4bd5-4485-a090-658ca967d5f6 > > osd.11 up in weight 1 up_from 31234 up_thru 31659 down_at 31223 > > last_clean_interval [30978,31195) [v2: > > 192.168.30.12:6840/3403200742,v1:192.168.30.12:6841/3403200742] [v2: > > 192.168.30.12:6842/3403200742,v1:192.168.30.12:6843/3403200742] > exists,up > > 81074bd7-ad9f-4e56-8885-cca4745f6c95 > > osd.12 up in weight 1 up_from 31230 up_thru 31717 down_at 31223 > > last_clean_interval [30975,31195) [v2: > > 192.168.30.13:6816/4268732910,v1:192.168.30.13:6817/4268732910] [v2: > > 192.168.30.13:6818/4268732910,v1:192.168.30.13:6819/4268732910] > exists,up > > af1b55dd-c110-4861-aed9-c0737cef8be1 > > osd.13 down out weight 0 up_from 30389 up_thru 30688 down_at 30695 > > last_clean_interval [25513,30317) [v2: > > 192.168.30.16:6804/1573803727,v1:192.168.30.16:6805/1573803727] [v2: > > 192.168.30.16:6806/1573803727,v1:192.168.30.16:6807/1573803727] > > autoout,exists 737a3234-0f1f-4286-80e9-e89b581aae41 > > osd.14 up in weight 1 up_from 31218 up_thru 31709 down_at 31217 > > last_clean_interval [30979,31195) [v2: > > 192.168.30.13:6834/2291187817,v1:192.168.30.13:6835/2291187817] [v2: > > 192.168.30.13:6836/2291187817,v1:192.168.30.13:6837/2291187817] > exists,up > > 97cd6ac7-aca0-42fd-a049-d27289f83183 > > osd.15 down out weight 0 up_from 29463 up_thru 30531 down_at 30688 > > last_clean_interval [25493,29462) [v2: > > 192.168.30.15:6808/2655269991,v1:192.168.30.15:6809/2655269991] [v2: > > 192.168.30.15:6802/2662269991,v1:192.168.30.15:6803/2662269991] > > autoout,exists 
61aea8f4-5905-4be3-ae32-5eacf75a514e > > osd.16 up in weight 1 up_from 31226 up_thru 31647 down_at 31223 > > last_clean_interval [30970,31195) [v2: > > 192.168.30.13:6808/2624812081,v1:192.168.30.13:6809/2624812081] [v2: > > 192.168.30.13:6810/2624812081,v1:192.168.30.13:6811/2624812081] > exists,up > > 791a7542-87cd-403d-a37e-8f00506b2eb6 > > osd.17 up in weight 1 up_from 31219 up_thru 31703 down_at 31218 > > last_clean_interval [30975,31195) [v2: > > 192.168.30.13:6800/2978036397,v1:192.168.30.13:6801/2978036397] [v2: > > 192.168.30.13:6802/2978036397,v1:192.168.30.13:6803/2978036397] > exists,up > > 4a915645-412f-49e6-8477-1577469905da > > osd.18 down out weight 0 up_from 30334 up_thru 30566 down_at 30688 > > last_clean_interval [25543,30327) [v2: > > 192.168.30.14:6832/985432137,v1:192.168.30.14:6833/985432137] [v2: > > 192.168.30.14:6848/998432137,v1:192.168.30.14:6849/998432137] > > autoout,exists 85f59d83-710c-4896-9200-bda4894fc3e8 > > osd.19 down in weight 1 up_from 31303 up_thru 31321 down_at 31323 > > last_clean_interval [31292,31296) [v2: > > 192.168.30.13:6826/375623427,v1:192.168.30.13:6827/375623427] [v2: > > 192.168.30.13:6828/375623427,v1:192.168.30.13:6829/375623427] exists > > 7d09b51a-bd6d-40f8-a009-78ab9937795d > > osd.20 down out weight 0 up_from 30445 up_thru 30531 down_at 30688 > > last_clean_interval [28947,30444) [v2: > > 192.168.30.14:6810/4062649162,v1:192.168.30.14:6811/4062649162] [v2: > > 192.168.30.14:6800/4073649162,v1:192.168.30.14:6801/4073649162] > > autoout,exists 7ef6cc1a-4755-4a14-b9df-f1f538d903ce > > osd.21 up in weight 1 up_from 31345 up_thru 31567 down_at 31341 > > last_clean_interval [31307,31340) [v2: > > 192.168.30.11:6826/1625231298,v1:192.168.30.11:6827/1625231298] [v2: > > 192.168.30.11:6828/1625231298,v1:192.168.30.11:6829/1625231298] > exists,up > > 5ef2e39a-a353-4cb8-a49e-093fe39b94ef > > osd.22 down in weight 1 up_from 30383 up_thru 30528 down_at 30688 > > last_clean_interval [25549,30317) [v2: > > 192.168.30.14:6806/1204256629,v1:192.168.30.14:6807/1204256629] [v2: > > 192.168.30.14:6812/1204256629,v1:192.168.30.14:6813/1204256629] exists > > c9befe11-a035-449c-8d17-42aaf191923d > > osd.23 down in weight 1 up_from 30334 up_thru 30576 down_at 30688 > > last_clean_interval [30263,30332) [v2: > > 192.168.30.14:6802/3837786490,v1:192.168.30.14:6803/3837786490] [v2: > > 192.168.30.14:6830/3838786490,v1:192.168.30.14:6831/3838786490] exists > > 2081147b-065d-4da7-89d9-747e1ae02b8d > > osd.24 down in weight 1 up_from 29455 up_thru 30570 down_at 30688 > > last_clean_interval [25487,29453) [v2: > > 192.168.30.15:6800/2008474583,v1:192.168.30.15:6801/2008474583] [v2: > > 192.168.30.15:6810/2016474583,v1:192.168.30.15:6811/2016474583] exists > > 39d78380-261c-4689-b53d-90713e6ffcca > > osd.26 up in weight 1 up_from 31208 up_thru 31643 down_at 31207 > > last_clean_interval [30967,31195) [v2: > > 192.168.30.12:6800/2861018947,v1:192.168.30.12:6801/2861018947] [v2: > > 192.168.30.12:6802/2861018947,v1:192.168.30.12:6803/2861018947] > exists,up > > 046622c8-c09c-4254-8c15-3dc05a2f01ed > > osd.28 down in weight 1 up_from 30389 up_thru 30574 down_at 30691 > > last_clean_interval [25513,30312) [v2: > > 192.168.30.16:6820/3466284570,v1:192.168.30.16:6821/3466284570] [v2: > > 192.168.30.16:6822/3466284570,v1:192.168.30.16:6823/3466284570] exists > > 10578b97-e3c4-4553-a8d0-6dcc46af5db1 > > osd.29 down in weight 1 up_from 30378 up_thru 30554 down_at 30688 > > last_clean_interval [28595,30376) [v2: > > 
192.168.30.14:6808/3739543672,v1:192.168.30.14:6809/3739543672] [v2: > > 192.168.30.14:6846/3747543672,v1:192.168.30.14:6847/3747543672] exists > > 9698e936-8edf-4adf-92c9-a0b5202ed01a > > osd.30 down in weight 1 up_from 30449 up_thru 30531 down_at 30688 > > last_clean_interval [25502,30446) [v2: > > 192.168.30.15:6825/2375507296,v1:192.168.30.15:6827/2375507296] [v2: > > 192.168.30.15:6829/2375507296,v1:192.168.30.15:6831/2375507296] exists > > e14d2a0f-a98a-44d4-8c06-4d893f673629 > > osd.31 down in weight 1 up_from 30364 up_thru 30688 down_at 30700 > > last_clean_interval [25514,30361) [v2: > > 192.168.30.16:6826/2835000708,v1:192.168.30.16:6827/2835000708] [v2: > > 192.168.30.16:6802/2843000708,v1:192.168.30.16:6803/2843000708] exists > > 541bca38-e704-483a-8cb8-39e5f69007d1 > > osd.32 up in weight 1 up_from 31209 up_thru 31627 down_at 31208 > > last_clean_interval [30974,31195) [v2: > > 192.168.30.12:6832/3860067997,v1:192.168.30.12:6833/3860067997] [v2: > > 192.168.30.12:6834/3860067997,v1:192.168.30.12:6835/3860067997] > exists,up > > 9200a57e-2845-43ff-9787-8f1f3158fe90 > > osd.33 down in weight 1 up_from 30354 up_thru 30688 down_at 30693 > > last_clean_interval [25521,30350) [v2: > > 192.168.30.16:6842/2342555666,v1:192.168.30.16:6843/2342555666] [v2: > > 192.168.30.16:6844/2364555666,v1:192.168.30.16:6845/2364555666] exists > > 20c55d85-cf9a-4133-a189-7fdad2318f58 > > osd.34 down in weight 1 up_from 30390 up_thru 30688 down_at 30691 > > last_clean_interval [25516,30314) [v2: > > 192.168.30.16:6808/2282629870,v1:192.168.30.16:6811/2282629870] [v2: > > 192.168.30.16:6812/2282629870,v1:192.168.30.16:6814/2282629870] exists > > 77e0ef8f-c047-4f84-afb2-a8ad054e562f > > osd.35 up in weight 1 up_from 31204 up_thru 31657 down_at 31203 > > last_clean_interval [30958,31195) [v2: > > 192.168.30.13:6842/1919357520,v1:192.168.30.13:6843/1919357520] [v2: > > 192.168.30.13:6844/1919357520,v1:192.168.30.13:6845/1919357520] > exists,up > > 2d2de0cb-6d41-4957-a473-2bbe9ce227bf > > osd.36 down in weight 1 up_from 29494 up_thru 30560 down_at 30688 > > last_clean_interval [25491,29492) [v2: > > 192.168.30.15:6816/2153321591,v1:192.168.30.15:6817/2153321591] [v2: > > 192.168.30.15:6842/2158321591,v1:192.168.30.15:6843/2158321591] exists > > 26114668-68b2-458b-89c2-cbad5507ab75 > > > > > > > >> > >> > On Jul 25, 2022, at 3:29 AM, Jeremy Hansen < > >> farnsworth.mcfadden@xxxxxxxxx> wrote: > >> > > >> > I transitioned some servers to a new rack and now I'm having major > >> issues > >> > with Ceph upon bringing things back up. > >> > > >> > I believe the issue may be related to the ceph nodes coming back up > with > >> > different IPs before VLANs were set. That's just a guess because I > >> can't > >> > think of any other reason this would happen. 
> >> > > >> > Current state: > >> > > >> > Every 2.0s: ceph -s > >> > cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022 > >> > > >> > cluster: > >> > id: bfa2ad58-c049-11eb-9098-3c8cf8ed728d > >> > health: HEALTH_WARN > >> > 1 filesystem is degraded > >> > 2 MDSs report slow metadata IOs > >> > 2/5 mons down, quorum cn02,cn03,cn01 > >> > 9 osds down > >> > 3 hosts (17 osds) down > >> > Reduced data availability: 97 pgs inactive, 9 pgs down > >> > Degraded data redundancy: 13860144/30824413 objects > degraded > >> > (44.965%), 411 pgs degraded, 482 pgs undersized > >> > > >> > services: > >> > mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: > cn05, > >> > cn04 > >> > mgr: cn02.arszct(active, since 5m) > >> > mds: 2/2 daemons up, 2 standby > >> > osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped > pgs > >> > > >> > data: > >> > volumes: 1/2 healthy, 1 recovering > >> > pools: 8 pools, 545 pgs > >> > objects: 7.71M objects, 6.7 TiB > >> > usage: 15 TiB used, 39 TiB / 54 TiB avail > >> > pgs: 0.367% pgs unknown > >> > 17.431% pgs not active > >> > 13860144/30824413 objects degraded (44.965%) > >> > 1137693/30824413 objects misplaced (3.691%) > >> > 280 active+undersized+degraded > >> > 67 undersized+degraded+remapped+backfilling+peered > >> > 57 active+undersized+remapped > >> > 45 active+clean+remapped > >> > 44 active+undersized+degraded+remapped+backfilling > >> > 18 undersized+degraded+peered > >> > 10 active+undersized > >> > 9 down > >> > 7 active+clean > >> > 3 active+undersized+remapped+backfilling > >> > 2 active+undersized+degraded+remapped+backfill_wait > >> > 2 unknown > >> > 1 undersized+peered > >> > > >> > io: > >> > client: 170 B/s rd, 0 op/s rd, 0 op/s wr > >> > recovery: 168 MiB/s, 158 keys/s, 166 objects/s > >> > > >> > I have to disable and re-enable the dashboard just to use it. It > seems > >> to > >> > get bogged down after a few moments. 
> >> > > >> > The three servers that were moved to the new rack Ceph has marked as > >> > "Down", but if I do a cephadm host-check, they all seem to pass: > >> > > >> > ************************ ceph ************************ > >> > --------- cn01.ceph.--------- > >> > podman (/usr/bin/podman) version 4.0.2 is present > >> > systemctl is present > >> > lvcreate is present > >> > Unit chronyd.service is enabled and running > >> > Host looks OK > >> > --------- cn02.ceph.--------- > >> > podman (/usr/bin/podman) version 4.0.2 is present > >> > systemctl is present > >> > lvcreate is present > >> > Unit chronyd.service is enabled and running > >> > Host looks OK > >> > --------- cn03.ceph.--------- > >> > podman (/usr/bin/podman) version 4.0.2 is present > >> > systemctl is present > >> > lvcreate is present > >> > Unit chronyd.service is enabled and running > >> > Host looks OK > >> > --------- cn04.ceph.--------- > >> > podman (/usr/bin/podman) version 4.0.2 is present > >> > systemctl is present > >> > lvcreate is present > >> > Unit chronyd.service is enabled and running > >> > Host looks OK > >> > --------- cn05.ceph.--------- > >> > podman|docker (/usr/bin/podman) is present > >> > systemctl is present > >> > lvcreate is present > >> > Unit chronyd.service is enabled and running > >> > Host looks OK > >> > --------- cn06.ceph.--------- > >> > podman (/usr/bin/podman) version 4.0.2 is present > >> > systemctl is present > >> > lvcreate is present > >> > Unit chronyd.service is enabled and running > >> > Host looks OK > >> > > >> > It seems to be recovering with what it has left, but a large amount of > >> OSDs > >> > are down. When trying to restart one of the down'd OSDs, I see a huge > >> dump. > >> > > >> > Jul 25 03:19:38 cn06.ceph > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >> > 2022-07-25T10:19:38.532+0000 7fce14a6c080 0 osd.34 30689 done with > >> init, > >> > starting boot process > >> > Jul 25 03:19:38 cn06.ceph > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >> > 2022-07-25T10:19:38.532+0000 7fce14a6c080 1 osd.34 30689 start_boot > >> > Jul 25 03:20:10 cn06.ceph > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >> > 2022-07-25T10:20:10.655+0000 7fcdfd12d700 1 osd.34 30689 start_boot > >> > Jul 25 03:20:41 cn06.ceph > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >> > 2022-07-25T10:20:41.159+0000 7fcdfd12d700 1 osd.34 30689 start_boot > >> > Jul 25 03:21:11 cn06.ceph > >> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >> > 2022-07-25T10:21:11.662+0000 7fcdfd12d700 1 osd.34 30689 start_boot > >> > > >> > At this point it just keeps printing start_boot, but the dashboard has > >> it > >> > marked as "in" but "down". > >> > > >> > On these three hosts that moved, there were a bunch marked as "out" > and > >> > "down", and some with "in" but "down". > >> > > >> > Not sure where to go next. I'm going to let the recovery continue and > >> hope > >> > that my 4x replication on these pools saves me. > >> > > >> > Not sure where to go from here. Any help is very much appreciated. > >> This > >> > Ceph cluster holds all of our Cloudstack images... it would be > >> terrible to > >> > lose this data. 
> >> > _______________________________________________ > >> > ceph-users mailing list -- ceph-users@xxxxxxx > >> > To unsubscribe send an email to ceph-users-leave@xxxxxxx > >> > > > > On Mon, Jul 25, 2022 at 10:15 AM Jeremy Hansen < > > farnsworth.mcfadden@xxxxxxxxx> wrote: > > > >> > >> > >> On Mon, Jul 25, 2022 at 7:26 AM Anthony D'Atri <anthony.datri@xxxxxxxxx > > > >> wrote: > >> > >>> Do your values for public and cluster network include the new addresses > >>> on all nodes? > >>> > >> > >> This cluster only has one network. There is no separation between > >> public and cluster. Three of the nodes momentarily came up using a > >> different IP address. > >> > >> I've also noticed on one of the nodes that did not move or have any IP > >> issue, on a single node, from the dashboard, it names the same device > for > >> two different osd's: > >> > >> 2 cn01 out destroyed hdd TOSHIBA_MG04SCA40EE_21M0A0CKFWZB Unknown sdb > >> osd.2 > >> > >> 3 cn01 out destroyed ssd SAMSUNG_MZILT3T8HBLS/007_S5G0NE0R200159 Unknown > >> sdb osd.3 > >> > >> > >> [ceph: root@cn01 /]# ceph-volume inventory > >> > >> Device Path Size rotates available Model name > >> /dev/sda 3.64 TB True False MG04SCA40EE > >> /dev/sdb 3.49 TB False False > MZILT3T8HBLS/007 > >> /dev/sdc 3.64 TB True False MG04SCA40EE > >> /dev/sdd 3.64 TB True False MG04SCA40EE > >> /dev/sde 3.49 TB False False > MZILT3T8HBLS/007 > >> /dev/sdf 3.64 TB True False MG04SCA40EE > >> /dev/sdg 698.64 GB True False SEAGATE > ST375064 > >> > >> [ceph: root@cn01 /]# ceph osd info > >> osd.0 down out weight 0 up_from 30231 up_thru 30564 down_at 30688 > >> last_clean_interval [25500,30228) [v2: > >> 192.168.30.15:6818/2512683421,v1:192.168.30.15:6819/2512683421] [v2: > >> 192.168.30.15:6824/2512683421,v1:192.168.30.15:6826/2512683421] > >> autoout,exists d14cf503-a303-4fa4-a713-9530b67d613a > >> osd.1 down out weight 0 up_from 30393 up_thru 30688 down_at 30697 > >> last_clean_interval [25518,30321) [v2: > >> 192.168.30.16:6834/1781855831,v1:192.168.30.16:6835/1781855831] [v2: > >> 192.168.30.16:6836/1781855831,v1:192.168.30.16:6837/1781855831] > >> autoout,exists 0d521411-c835-4fa3-beca-3631b4ff6bf7 > >> osd.2 down out weight 0 up_from 31316 up_thru 31293 down_at 31317 > >> last_clean_interval [31218,31296) [v2: > >> 192.168.30.11:6810/894589880,v1:192.168.30.11:6811/894589880] [v2: > >> 192.168.30.11:6812/894589880,v1:192.168.30.11:6813/894589880] > >> destroyed,exists > >> osd.3 down out weight 0 up_from 31265 up_thru 31266 down_at 31268 > >> last_clean_interval [31254,31256) [v2: > >> 192.168.30.11:6818/1641948535,v1:192.168.30.11:6819/1641948535] [v2: > >> 192.168.30.11:6820/1641948535,v1:192.168.30.11:6821/1641948535] > >> destroyed,exists > >> osd.4 up in weight 1 up_from 31356 up_thru 31581 down_at 31339 > >> last_clean_interval [31320,31338) [v2: > >> 192.168.30.11:6802/2785067179,v1:192.168.30.11:6803/2785067179] [v2: > >> 192.168.30.11:6804/2785067179,v1:192.168.30.11:6805/2785067179] > >> exists,up 3afd06db-b91d-44fe-9305-5eb95f7a59b9 > >> osd.5 up in weight 1 up_from 31347 up_thru 31699 down_at 31339 > >> last_clean_interval [31311,31338) [v2: > >> 192.168.30.11:6818/1936771540,v1:192.168.30.11:6819/1936771540] [v2: > >> 192.168.30.11:6820/1936771540,v1:192.168.30.11:6821/1936771540] > >> exists,up 063c2ccf-02ce-4f5e-8252-dddfbb258a95 > >> osd.6 up in weight 1 up_from 31218 up_thru 31711 down_at 31217 > >> last_clean_interval [30978,31195) [v2: > >> 192.168.30.12:6816/1585973160,v1:192.168.30.12:6817/1585973160] [v2: > >> 
192.168.30.12:6818/1585973160,v1:192.168.30.12:6819/1585973160] > >> exists,up 94250ea2-f12e-4dc6-9135-b626086ccffd > >> osd.7 down out weight 0 up_from 30353 up_thru 30558 down_at 30688 > >> last_clean_interval [25533,30349) [v2: > >> 192.168.30.14:6816/4083104061,v1:192.168.30.14:6817/4083104061] [v2: > >> 192.168.30.14:6840/4094104061,v1:192.168.30.14:6841/4094104061] > >> autoout,exists de351aec-b91e-4c22-a0bf-85369bc14579 > >> osd.8 up in weight 1 up_from 31226 up_thru 31668 down_at 31225 > >> last_clean_interval [30983,31195) [v2: > >> 192.168.30.12:6824/1312484329,v1:192.168.30.12:6825/1312484329] [v2: > >> 192.168.30.12:6826/1312484329,v1:192.168.30.12:6827/1312484329] > >> exists,up 51f665b4-fa5b-4b17-8390-ed130145ef04 > >> osd.9 up in weight 1 up_from 31351 up_thru 31673 down_at 31340 > >> last_clean_interval [31315,31338) [v2: > >> 192.168.30.11:6810/1446838877,v1:192.168.30.11:6811/1446838877] [v2: > >> 192.168.30.11:6812/1446838877,v1:192.168.30.11:6813/1446838877] > >> exists,up 985f1127-d126-4629-b8cd-03cf2d914d99 > >> osd.10 up in weight 1 up_from 31219 up_thru 31639 down_at 31218 > >> last_clean_interval [30980,31195) [v2: > >> 192.168.30.12:6808/1587842953,v1:192.168.30.12:6809/1587842953] [v2: > >> 192.168.30.12:6810/1587842953,v1:192.168.30.12:6811/1587842953] > >> exists,up c7fca03e-4bd5-4485-a090-658ca967d5f6 > >> osd.11 up in weight 1 up_from 31234 up_thru 31659 down_at 31223 > >> last_clean_interval [30978,31195) [v2: > >> 192.168.30.12:6840/3403200742,v1:192.168.30.12:6841/3403200742] [v2: > >> 192.168.30.12:6842/3403200742,v1:192.168.30.12:6843/3403200742] > >> exists,up 81074bd7-ad9f-4e56-8885-cca4745f6c95 > >> osd.12 up in weight 1 up_from 31230 up_thru 31717 down_at 31223 > >> last_clean_interval [30975,31195) [v2: > >> 192.168.30.13:6816/4268732910,v1:192.168.30.13:6817/4268732910] [v2: > >> 192.168.30.13:6818/4268732910,v1:192.168.30.13:6819/4268732910] > >> exists,up af1b55dd-c110-4861-aed9-c0737cef8be1 > >> osd.13 down out weight 0 up_from 30389 up_thru 30688 down_at 30695 > >> last_clean_interval [25513,30317) [v2: > >> 192.168.30.16:6804/1573803727,v1:192.168.30.16:6805/1573803727] [v2: > >> 192.168.30.16:6806/1573803727,v1:192.168.30.16:6807/1573803727] > >> autoout,exists 737a3234-0f1f-4286-80e9-e89b581aae41 > >> osd.14 up in weight 1 up_from 31218 up_thru 31709 down_at 31217 > >> last_clean_interval [30979,31195) [v2: > >> 192.168.30.13:6834/2291187817,v1:192.168.30.13:6835/2291187817] [v2: > >> 192.168.30.13:6836/2291187817,v1:192.168.30.13:6837/2291187817] > >> exists,up 97cd6ac7-aca0-42fd-a049-d27289f83183 > >> osd.15 down out weight 0 up_from 29463 up_thru 30531 down_at 30688 > >> last_clean_interval [25493,29462) [v2: > >> 192.168.30.15:6808/2655269991,v1:192.168.30.15:6809/2655269991] [v2: > >> 192.168.30.15:6802/2662269991,v1:192.168.30.15:6803/2662269991] > >> autoout,exists 61aea8f4-5905-4be3-ae32-5eacf75a514e > >> osd.16 up in weight 1 up_from 31226 up_thru 31647 down_at 31223 > >> last_clean_interval [30970,31195) [v2: > >> 192.168.30.13:6808/2624812081,v1:192.168.30.13:6809/2624812081] [v2: > >> 192.168.30.13:6810/2624812081,v1:192.168.30.13:6811/2624812081] > >> exists,up 791a7542-87cd-403d-a37e-8f00506b2eb6 > >> osd.17 up in weight 1 up_from 31219 up_thru 31703 down_at 31218 > >> last_clean_interval [30975,31195) [v2: > >> 192.168.30.13:6800/2978036397,v1:192.168.30.13:6801/2978036397] [v2: > >> 192.168.30.13:6802/2978036397,v1:192.168.30.13:6803/2978036397] > >> exists,up 4a915645-412f-49e6-8477-1577469905da > >> osd.18 down out weight 0 
up_from 30334 up_thru 30566 down_at 30688 > >> last_clean_interval [25543,30327) [v2: > >> 192.168.30.14:6832/985432137,v1:192.168.30.14:6833/985432137] [v2: > >> 192.168.30.14:6848/998432137,v1:192.168.30.14:6849/998432137] > >> autoout,exists 85f59d83-710c-4896-9200-bda4894fc3e8 > >> osd.19 down in weight 1 up_from 31303 up_thru 31321 down_at 31323 > >> last_clean_interval [31292,31296) [v2: > >> 192.168.30.13:6826/375623427,v1:192.168.30.13:6827/375623427] [v2: > >> 192.168.30.13:6828/375623427,v1:192.168.30.13:6829/375623427] exists > >> 7d09b51a-bd6d-40f8-a009-78ab9937795d > >> osd.20 down out weight 0 up_from 30445 up_thru 30531 down_at 30688 > >> last_clean_interval [28947,30444) [v2: > >> 192.168.30.14:6810/4062649162,v1:192.168.30.14:6811/4062649162] [v2: > >> 192.168.30.14:6800/4073649162,v1:192.168.30.14:6801/4073649162] > >> autoout,exists 7ef6cc1a-4755-4a14-b9df-f1f538d903ce > >> osd.21 up in weight 1 up_from 31345 up_thru 31567 down_at 31341 > >> last_clean_interval [31307,31340) [v2: > >> 192.168.30.11:6826/1625231298,v1:192.168.30.11:6827/1625231298] [v2: > >> 192.168.30.11:6828/1625231298,v1:192.168.30.11:6829/1625231298] > >> exists,up 5ef2e39a-a353-4cb8-a49e-093fe39b94ef > >> osd.22 down in weight 1 up_from 30383 up_thru 30528 down_at 30688 > >> last_clean_interval [25549,30317) [v2: > >> 192.168.30.14:6806/1204256629,v1:192.168.30.14:6807/1204256629] [v2: > >> 192.168.30.14:6812/1204256629,v1:192.168.30.14:6813/1204256629] exists > >> c9befe11-a035-449c-8d17-42aaf191923d > >> osd.23 down in weight 1 up_from 30334 up_thru 30576 down_at 30688 > >> last_clean_interval [30263,30332) [v2: > >> 192.168.30.14:6802/3837786490,v1:192.168.30.14:6803/3837786490] [v2: > >> 192.168.30.14:6830/3838786490,v1:192.168.30.14:6831/3838786490] exists > >> 2081147b-065d-4da7-89d9-747e1ae02b8d > >> osd.24 down in weight 1 up_from 29455 up_thru 30570 down_at 30688 > >> last_clean_interval [25487,29453) [v2: > >> 192.168.30.15:6800/2008474583,v1:192.168.30.15:6801/2008474583] [v2: > >> 192.168.30.15:6810/2016474583,v1:192.168.30.15:6811/2016474583] exists > >> 39d78380-261c-4689-b53d-90713e6ffcca > >> osd.26 up in weight 1 up_from 31208 up_thru 31643 down_at 31207 > >> last_clean_interval [30967,31195) [v2: > >> 192.168.30.12:6800/2861018947,v1:192.168.30.12:6801/2861018947] [v2: > >> 192.168.30.12:6802/2861018947,v1:192.168.30.12:6803/2861018947] > >> exists,up 046622c8-c09c-4254-8c15-3dc05a2f01ed > >> osd.28 down in weight 1 up_from 30389 up_thru 30574 down_at 30691 > >> last_clean_interval [25513,30312) [v2: > >> 192.168.30.16:6820/3466284570,v1:192.168.30.16:6821/3466284570] [v2: > >> 192.168.30.16:6822/3466284570,v1:192.168.30.16:6823/3466284570] exists > >> 10578b97-e3c4-4553-a8d0-6dcc46af5db1 > >> osd.29 down in weight 1 up_from 30378 up_thru 30554 down_at 30688 > >> last_clean_interval [28595,30376) [v2: > >> 192.168.30.14:6808/3739543672,v1:192.168.30.14:6809/3739543672] [v2: > >> 192.168.30.14:6846/3747543672,v1:192.168.30.14:6847/3747543672] exists > >> 9698e936-8edf-4adf-92c9-a0b5202ed01a > >> osd.30 down in weight 1 up_from 30449 up_thru 30531 down_at 30688 > >> last_clean_interval [25502,30446) [v2: > >> 192.168.30.15:6825/2375507296,v1:192.168.30.15:6827/2375507296] [v2: > >> 192.168.30.15:6829/2375507296,v1:192.168.30.15:6831/2375507296] exists > >> e14d2a0f-a98a-44d4-8c06-4d893f673629 > >> osd.31 down in weight 1 up_from 30364 up_thru 30688 down_at 30700 > >> last_clean_interval [25514,30361) [v2: > >> 192.168.30.16:6826/2835000708,v1:192.168.30.16:6827/2835000708] [v2: > >> 
192.168.30.16:6802/2843000708,v1:192.168.30.16:6803/2843000708] exists > >> 541bca38-e704-483a-8cb8-39e5f69007d1 > >> osd.32 up in weight 1 up_from 31209 up_thru 31627 down_at 31208 > >> last_clean_interval [30974,31195) [v2: > >> 192.168.30.12:6832/3860067997,v1:192.168.30.12:6833/3860067997] [v2: > >> 192.168.30.12:6834/3860067997,v1:192.168.30.12:6835/3860067997] > >> exists,up 9200a57e-2845-43ff-9787-8f1f3158fe90 > >> osd.33 down in weight 1 up_from 30354 up_thru 30688 down_at 30693 > >> last_clean_interval [25521,30350) [v2: > >> 192.168.30.16:6842/2342555666,v1:192.168.30.16:6843/2342555666] [v2: > >> 192.168.30.16:6844/2364555666,v1:192.168.30.16:6845/2364555666] exists > >> 20c55d85-cf9a-4133-a189-7fdad2318f58 > >> osd.34 down in weight 1 up_from 30390 up_thru 30688 down_at 30691 > >> last_clean_interval [25516,30314) [v2: > >> 192.168.30.16:6808/2282629870,v1:192.168.30.16:6811/2282629870] [v2: > >> 192.168.30.16:6812/2282629870,v1:192.168.30.16:6814/2282629870] exists > >> 77e0ef8f-c047-4f84-afb2-a8ad054e562f > >> osd.35 up in weight 1 up_from 31204 up_thru 31657 down_at 31203 > >> last_clean_interval [30958,31195) [v2: > >> 192.168.30.13:6842/1919357520,v1:192.168.30.13:6843/1919357520] [v2: > >> 192.168.30.13:6844/1919357520,v1:192.168.30.13:6845/1919357520] > >> exists,up 2d2de0cb-6d41-4957-a473-2bbe9ce227bf > >> osd.36 down in weight 1 up_from 29494 up_thru 30560 down_at 30688 > >> last_clean_interval [25491,29492) [v2: > >> 192.168.30.15:6816/2153321591,v1:192.168.30.15:6817/2153321591] [v2: > >> 192.168.30.15:6842/2158321591,v1:192.168.30.15:6843/2158321591] exists > >> 26114668-68b2-458b-89c2-cbad5507ab75 > >> > >> > >> > >>> > >>> > On Jul 25, 2022, at 3:29 AM, Jeremy Hansen < > >>> farnsworth.mcfadden@xxxxxxxxx> wrote: > >>> > > >>> > I transitioned some servers to a new rack and now I'm having major > >>> issues > >>> > with Ceph upon bringing things back up. > >>> > > >>> > I believe the issue may be related to the ceph nodes coming back up > >>> with > >>> > different IPs before VLANs were set. That's just a guess because I > >>> can't > >>> > think of any other reason this would happen. 
> >>> > > >>> > Current state: > >>> > > >>> > Every 2.0s: ceph -s > >>> > cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022 > >>> > > >>> > cluster: > >>> > id: bfa2ad58-c049-11eb-9098-3c8cf8ed728d > >>> > health: HEALTH_WARN > >>> > 1 filesystem is degraded > >>> > 2 MDSs report slow metadata IOs > >>> > 2/5 mons down, quorum cn02,cn03,cn01 > >>> > 9 osds down > >>> > 3 hosts (17 osds) down > >>> > Reduced data availability: 97 pgs inactive, 9 pgs down > >>> > Degraded data redundancy: 13860144/30824413 objects > degraded > >>> > (44.965%), 411 pgs degraded, 482 pgs undersized > >>> > > >>> > services: > >>> > mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: > >>> cn05, > >>> > cn04 > >>> > mgr: cn02.arszct(active, since 5m) > >>> > mds: 2/2 daemons up, 2 standby > >>> > osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped > pgs > >>> > > >>> > data: > >>> > volumes: 1/2 healthy, 1 recovering > >>> > pools: 8 pools, 545 pgs > >>> > objects: 7.71M objects, 6.7 TiB > >>> > usage: 15 TiB used, 39 TiB / 54 TiB avail > >>> > pgs: 0.367% pgs unknown > >>> > 17.431% pgs not active > >>> > 13860144/30824413 objects degraded (44.965%) > >>> > 1137693/30824413 objects misplaced (3.691%) > >>> > 280 active+undersized+degraded > >>> > 67 undersized+degraded+remapped+backfilling+peered > >>> > 57 active+undersized+remapped > >>> > 45 active+clean+remapped > >>> > 44 active+undersized+degraded+remapped+backfilling > >>> > 18 undersized+degraded+peered > >>> > 10 active+undersized > >>> > 9 down > >>> > 7 active+clean > >>> > 3 active+undersized+remapped+backfilling > >>> > 2 active+undersized+degraded+remapped+backfill_wait > >>> > 2 unknown > >>> > 1 undersized+peered > >>> > > >>> > io: > >>> > client: 170 B/s rd, 0 op/s rd, 0 op/s wr > >>> > recovery: 168 MiB/s, 158 keys/s, 166 objects/s > >>> > > >>> > I have to disable and re-enable the dashboard just to use it. It > >>> seems to > >>> > get bogged down after a few moments. 
> >>> > > >>> > The three servers that were moved to the new rack Ceph has marked as > >>> > "Down", but if I do a cephadm host-check, they all seem to pass: > >>> > > >>> > ************************ ceph ************************ > >>> > --------- cn01.ceph.--------- > >>> > podman (/usr/bin/podman) version 4.0.2 is present > >>> > systemctl is present > >>> > lvcreate is present > >>> > Unit chronyd.service is enabled and running > >>> > Host looks OK > >>> > --------- cn02.ceph.--------- > >>> > podman (/usr/bin/podman) version 4.0.2 is present > >>> > systemctl is present > >>> > lvcreate is present > >>> > Unit chronyd.service is enabled and running > >>> > Host looks OK > >>> > --------- cn03.ceph.--------- > >>> > podman (/usr/bin/podman) version 4.0.2 is present > >>> > systemctl is present > >>> > lvcreate is present > >>> > Unit chronyd.service is enabled and running > >>> > Host looks OK > >>> > --------- cn04.ceph.--------- > >>> > podman (/usr/bin/podman) version 4.0.2 is present > >>> > systemctl is present > >>> > lvcreate is present > >>> > Unit chronyd.service is enabled and running > >>> > Host looks OK > >>> > --------- cn05.ceph.--------- > >>> > podman|docker (/usr/bin/podman) is present > >>> > systemctl is present > >>> > lvcreate is present > >>> > Unit chronyd.service is enabled and running > >>> > Host looks OK > >>> > --------- cn06.ceph.--------- > >>> > podman (/usr/bin/podman) version 4.0.2 is present > >>> > systemctl is present > >>> > lvcreate is present > >>> > Unit chronyd.service is enabled and running > >>> > Host looks OK > >>> > > >>> > It seems to be recovering with what it has left, but a large amount > of > >>> OSDs > >>> > are down. When trying to restart one of the down'd OSDs, I see a > huge > >>> dump. > >>> > > >>> > Jul 25 03:19:38 cn06.ceph > >>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >>> > 2022-07-25T10:19:38.532+0000 7fce14a6c080 0 osd.34 30689 done with > >>> init, > >>> > starting boot process > >>> > Jul 25 03:19:38 cn06.ceph > >>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >>> > 2022-07-25T10:19:38.532+0000 7fce14a6c080 1 osd.34 30689 start_boot > >>> > Jul 25 03:20:10 cn06.ceph > >>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >>> > 2022-07-25T10:20:10.655+0000 7fcdfd12d700 1 osd.34 30689 start_boot > >>> > Jul 25 03:20:41 cn06.ceph > >>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >>> > 2022-07-25T10:20:41.159+0000 7fcdfd12d700 1 osd.34 30689 start_boot > >>> > Jul 25 03:21:11 cn06.ceph > >>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug > >>> > 2022-07-25T10:21:11.662+0000 7fcdfd12d700 1 osd.34 30689 start_boot > >>> > > >>> > At this point it just keeps printing start_boot, but the dashboard > has > >>> it > >>> > marked as "in" but "down". > >>> > > >>> > On these three hosts that moved, there were a bunch marked as "out" > and > >>> > "down", and some with "in" but "down". > >>> > > >>> > Not sure where to go next. I'm going to let the recovery continue > and > >>> hope > >>> > that my 4x replication on these pools saves me. > >>> > > >>> > Not sure where to go from here. Any help is very much appreciated. > >>> This > >>> > Ceph cluster holds all of our Cloudstack images... it would be > >>> terrible to > >>> > lose this data. 
> >>> > _______________________________________________ > >>> > ceph-users mailing list -- ceph-users@xxxxxxx > >>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx > >>> > >>> > _______________________________________________ > ceph-users mailing list -- ceph-users@xxxxxxx > To unsubscribe send an email to ceph-users-leave@xxxxxxx > _______________________________________________ ceph-users mailing list -- ceph-users@xxxxxxx To unsubscribe send an email to ceph-users-leave@xxxxxxx
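
P.S. For reference, this is the checklist I'm planning to work through from the admin node (cn01) for each of the three moved hosts (cn04/cn05/cn06, 192.168.30.14-16). Steps 3-5 are just the commands from the cephadm warning quoted above; steps 1, 2 and 6 are my own additions based on my reading of the cephadm docs, so please correct me if any of them are wrong or pointless:

# 1. Confirm the monitors' public network still covers the hosts' current IPs
ceph config get mon public_network
ceph mon dump

# 2. See which address cephadm has on record for each host
ceph orch host ls

# 3. Re-push the cluster SSH key to a moved host (commands from the warning text)
ceph cephadm get-pub-key > ~/ceph.pub
ssh-copy-id -f -i ~/ceph.pub root@192.168.30.14

# 4. Open a shell that skips the host checks
cephadm shell --no-hosts

# 5. Test the connection the same way the mgr does (commands from the warning text)
ceph cephadm get-ssh-config > ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
chmod 0600 ~/cephadm_private_key
ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.14

# 6. Ask the orchestrator to re-check the host once SSH works
ceph cephadm check-host cn04.ceph

If .14 starts behaving, I'll repeat steps 3-6 for 192.168.30.15 and 192.168.30.16 before touching the down OSDs.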