Re: Issues after a shutdown

Here's some more info:

HEALTH_WARN 2 failed cephadm daemon(s); 3 hosts fail cephadm check; 2
filesystems are degraded; 1 MDSs report slow metadata IOs; 2/5 mons down,
quorum cn02,cn03,cn01; 10 osds down; 3 hosts (17 osds) down; Reduced data
availability: 13 pgs inactive, 9 pgs down; Degraded data redundancy:
8515690/30862245 objects degraded (27.593%), 326 pgs degraded, 447 pgs
undersized
[WRN] CEPHADM_FAILED_DAEMON: 2 failed cephadm daemon(s)
    daemon osd.3 on cn01.ceph is in error state
    daemon osd.2 on cn01.ceph is in error state
[WRN] CEPHADM_HOST_CHECK_FAILED: 3 hosts fail cephadm check
    host cn04.ceph (192.168.30.14) failed check: Failed to connect to
cn04.ceph (192.168.30.14).
Please make sure that the host is reachable and accepts connections using
the cephadm SSH key

To add the cephadm SSH key to the host:
> ceph cephadm get-pub-key > ~/ceph.pub
> ssh-copy-id -f -i ~/ceph.pub root@192.168.30.14

To check that the host is reachable open a new shell with the --no-hosts
flag:
> cephadm shell --no-hosts

Then run the following:
> ceph cephadm get-ssh-config > ssh_config
> ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
> chmod 0600 ~/cephadm_private_key
> ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.14
    host cn06.ceph (192.168.30.16) failed check: Failed to connect to
cn06.ceph (192.168.30.16).
Please make sure that the host is reachable and accepts connections using
the cephadm SSH key

To add the cephadm SSH key to the host:
> ceph cephadm get-pub-key > ~/ceph.pub
> ssh-copy-id -f -i ~/ceph.pub root@192.168.30.16

To check that the host is reachable open a new shell with the --no-hosts
flag:
> cephadm shell --no-hosts

Then run the following:
> ceph cephadm get-ssh-config > ssh_config
> ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
> chmod 0600 ~/cephadm_private_key
> ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.16
    host cn05.ceph (192.168.30.15) failed check: Failed to connect to
cn05.ceph (192.168.30.15).
Please make sure that the host is reachable and accepts connections using
the cephadm SSH key

To add the cephadm SSH key to the host:
> ceph cephadm get-pub-key > ~/ceph.pub
> ssh-copy-id -f -i ~/ceph.pub root@192.168.30.15

To check that the host is reachable open a new shell with the --no-hosts
flag:
> cephadm shell --no-hosts

Then run the following:
> ceph cephadm get-ssh-config > ssh_config
> ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
> chmod 0600 ~/cephadm_private_key
> ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.15
[WRN] FS_DEGRADED: 2 filesystems are degraded
    fs coldlogix is degraded
    fs btc is degraded
[WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
    mds.coldlogix.cn01.uriofo(mds.0): 2 slow metadata IOs are blocked > 30
secs, oldest blocked for 2096 secs
[WRN] MON_DOWN: 2/5 mons down, quorum cn02,cn03,cn01
    mon.cn05 (rank 0) addr [v2:192.168.30.15:3300/0,v1:192.168.30.15:6789/0]
is down (out of quorum)
    mon.cn04 (rank 3) addr [v2:192.168.30.14:3300/0,v1:192.168.30.14:6789/0]
is down (out of quorum)
[WRN] OSD_DOWN: 10 osds down
    osd.0 (root=default,host=cn05) is down
    osd.1 (root=default,host=cn06) is down
    osd.7 (root=default,host=cn04) is down
    osd.13 (root=default,host=cn06) is down
    osd.15 (root=default,host=cn05) is down
    osd.18 (root=default,host=cn04) is down
    osd.20 (root=default,host=cn04) is down
    osd.33 (root=default,host=cn06) is down
    osd.34 (root=default,host=cn06) is down
    osd.36 (root=default,host=cn05) is down
[WRN] OSD_HOST_DOWN: 3 hosts (17 osds) down
    host cn04 (root=default) (6 osds) is down
    host cn05 (root=default) (5 osds) is down
    host cn06 (root=default) (6 osds) is down
[WRN] PG_AVAILABILITY: Reduced data availability: 13 pgs inactive, 9 pgs
down
    pg 9.3a is down, acting [8]
    pg 9.7a is down, acting [8]
    pg 9.ba is down, acting [8]
    pg 9.fa is down, acting [8]
    pg 11.3 is stuck inactive for 39h, current state
undersized+degraded+peered, last acting [11]
    pg 11.11 is down, acting [19,9]
    pg 11.1f is stuck inactive for 13h, current state
undersized+degraded+peered, last acting [10]
    pg 12.36 is down, acting [21,16]
    pg 12.59 is down, acting [26,5]
    pg 12.66 is down, acting [5]
    pg 19.4 is stuck inactive for 39h, current state
undersized+degraded+peered, last acting [6]
    pg 19.1c is down, acting [21,16,11]
    pg 21.1 is stuck inactive for 36m, current state unknown, last acting []
[WRN] PG_DEGRADED: Degraded data redundancy: 8515690/30862245 objects
degraded (27.593%), 326 pgs degraded, 447 pgs undersized
    pg 9.75 is stuck undersized for 34m, current state
active+undersized+remapped, last acting [4,8,35]
    pg 9.76 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [35,10,21]
    pg 9.77 is stuck undersized for 34m, current state
active+undersized+remapped, last acting [32,35,4]
    pg 9.78 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [14,10]
    pg 9.79 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [21,32]
    pg 9.7b is stuck undersized for 34m, current state
active+undersized+degraded, last acting [8,12,5]
    pg 9.7c is stuck undersized for 34m, current state
active+undersized+degraded, last acting [4,35,10]
    pg 9.7d is stuck undersized for 35m, current state
active+undersized+degraded, last acting [5,19,10]
    pg 9.7e is stuck undersized for 35m, current state
active+undersized+remapped, last acting [21,10,17]
    pg 9.80 is stuck undersized for 34m, current state
active+undersized+degraded, last acting [8,4,17]
    pg 9.81 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [14,26]
    pg 9.82 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [26,16]
    pg 9.83 is stuck undersized for 34m, current state
active+undersized+degraded, last acting [8,4]
    pg 9.84 is stuck undersized for 34m, current state
active+undersized+degraded, last acting [4,35,6]
    pg 9.85 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [32,12,9]
    pg 9.86 is stuck undersized for 34m, current state
active+undersized+degraded, last acting [35,5,8]
    pg 9.87 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [9,12]
    pg 9.88 is stuck undersized for 35m, current state
active+undersized+remapped, last acting [19,32,35]
    pg 9.89 is stuck undersized for 34m, current state
active+undersized+degraded, last acting [10,14,4]
    pg 9.8a is stuck undersized for 35m, current state
active+undersized+degraded, last acting [21,19]
    pg 9.8b is stuck undersized for 34m, current state
active+undersized+degraded, last acting [8,35]
    pg 9.8c is stuck undersized for 31m, current state
active+undersized+remapped, last acting [10,19,5]
    pg 9.8d is stuck undersized for 35m, current state
active+undersized+remapped, last acting [9,6]
    pg 9.8f is stuck undersized for 35m, current state
active+undersized+remapped, last acting [19,26,17]
    pg 9.90 is stuck undersized for 35m, current state
active+undersized+remapped, last acting [35,26]
    pg 9.91 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [17,5]
    pg 9.92 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [21,26]
    pg 9.93 is stuck undersized for 35m, current state
active+undersized+remapped, last acting [19,26,5]
    pg 9.94 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [21,11]
    pg 9.95 is stuck undersized for 34m, current state
active+undersized+degraded, last acting [8,19]
    pg 9.96 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [17,6]
    pg 9.97 is stuck undersized for 34m, current state
active+undersized+degraded, last acting [8,9,16]
    pg 9.98 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [6,21]
    pg 9.99 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [10,9]
    pg 9.9a is stuck undersized for 34m, current state
active+undersized+remapped, last acting [4,16,10]
    pg 9.9b is stuck undersized for 34m, current state
active+undersized+degraded, last acting [12,4,11]
    pg 9.9c is stuck undersized for 35m, current state
active+undersized+degraded, last acting [9,16]
    pg 9.9d is stuck undersized for 35m, current state
active+undersized+degraded, last acting [26,35]
    pg 9.9f is stuck undersized for 35m, current state
active+undersized+degraded, last acting [9,17,26]
    pg 12.70 is stuck undersized for 35m, current state
active+undersized+degraded, last acting [21,35]
    pg 12.71 is active+undersized+degraded, acting [6,12]
    pg 12.72 is stuck undersized for 34m, current state
active+undersized+degraded, last acting [10,14,4]
    pg 12.73 is stuck undersized for 35m, current state
active+undersized+remapped, last acting [5,17,11]
    pg 12.78 is stuck undersized for 34m, current state
active+undersized+degraded, last acting [5,8,35]
    pg 12.79 is stuck undersized for 34m, current state
active+undersized+degraded, last acting [4,17]
    pg 12.7a is stuck undersized for 35m, current state
active+undersized+degraded, last acting [10,21]
    pg 12.7b is stuck undersized for 35m, current state
active+undersized+remapped, last acting [17,21,11]
    pg 12.7c is stuck undersized for 35m, current state
active+undersized+degraded, last acting [32,21,16]
    pg 12.7d is stuck undersized for 35m, current state
active+undersized+degraded, last acting [35,6,9]
    pg 12.7e is stuck undersized for 34m, current state
active+undersized+degraded, last acting [26,4]
    pg 12.7f is stuck undersized for 35m, current state
active+undersized+degraded, last acting [9,14]
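
For reference, the per-host checks cephadm suggests above boil down to the
following (a consolidated sketch using cn04 / 192.168.30.14 as the example;
the same applies to .15 and .16, and the two "orch daemon restart" lines at
the end are only my assumption about how to retry the osd.2/osd.3 daemons
reported in error state on cn01):

# from the admin node: re-push the cephadm public key in case it was lost
ceph cephadm get-pub-key > ~/ceph.pub
ssh-copy-id -f -i ~/ceph.pub root@192.168.30.14

# open a shell without the host list, then test the exact key/config the mgr uses
cephadm shell --no-hosts
ceph cephadm get-ssh-config > ssh_config
ceph config-key get mgr/cephadm/ssh_identity_key > ~/cephadm_private_key
chmod 0600 ~/cephadm_private_key
ssh -F ssh_config -i ~/cephadm_private_key root@192.168.30.14

# once the hosts answer, ask the orchestrator to retry the failed daemons on cn01
ceph orch daemon restart osd.2
ceph orch daemon restart osd.3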

On Mon, Jul 25, 2022 at 12:43 PM Jeremy Hansen <
farnsworth.mcfadden@xxxxxxxxx> wrote:

> Pretty desperate here.  Can someone suggest what I might be able to do to
> get these OSDs back up?  It looks like my recovery has stalled.
>
>
> On Mon, Jul 25, 2022 at 7:26 AM Anthony D'Atri <anthony.datri@xxxxxxxxx>
> wrote:
>
>> Do your values for public and cluster network include the new addresses
>> on all nodes?
>>
>
> This cluster only has one network.  There is no separation between
> public and cluster.  Three of the nodes momentarily came up using a
> different IP address.
>
> I've also noticed that on one of the nodes that did not move or have any IP
> issue, the dashboard names the same device for two different OSDs:
>
> 2 cn01 out destroyed hdd TOSHIBA_MG04SCA40EE_21M0A0CKFWZB Unknown sdb osd.2
>
> 3 cn01 out destroyed ssd SAMSUNG_MZILT3T8HBLS/007_S5G0NE0R200159 Unknown
> sdb osd.3
>
>
> [ceph: root@cn01 /]# ceph-volume inventory
>
> Device Path               Size         rotates available Model name
> /dev/sda                  3.64 TB      True    False     MG04SCA40EE
> /dev/sdb                  3.49 TB      False   False     MZILT3T8HBLS/007
> /dev/sdc                  3.64 TB      True    False     MG04SCA40EE
> /dev/sdd                  3.64 TB      True    False     MG04SCA40EE
> /dev/sde                  3.49 TB      False   False     MZILT3T8HBLS/007
> /dev/sdf                  3.64 TB      True    False     MG04SCA40EE
> /dev/sdg                  698.64 GB    True    False     SEAGATE ST375064
>
> [ceph: root@cn01 /]# ceph osd info
> osd.0 down out weight 0 up_from 30231 up_thru 30564 down_at 30688
> last_clean_interval [25500,30228) [v2:
> 192.168.30.15:6818/2512683421,v1:192.168.30.15:6819/2512683421] [v2:
> 192.168.30.15:6824/2512683421,v1:192.168.30.15:6826/2512683421]
> autoout,exists d14cf503-a303-4fa4-a713-9530b67d613a
> osd.1 down out weight 0 up_from 30393 up_thru 30688 down_at 30697
> last_clean_interval [25518,30321) [v2:
> 192.168.30.16:6834/1781855831,v1:192.168.30.16:6835/1781855831] [v2:
> 192.168.30.16:6836/1781855831,v1:192.168.30.16:6837/1781855831]
> autoout,exists 0d521411-c835-4fa3-beca-3631b4ff6bf7
> osd.2 down out weight 0 up_from 31316 up_thru 31293 down_at 31317
> last_clean_interval [31218,31296) [v2:
> 192.168.30.11:6810/894589880,v1:192.168.30.11:6811/894589880] [v2:
> 192.168.30.11:6812/894589880,v1:192.168.30.11:6813/894589880]
> destroyed,exists
> osd.3 down out weight 0 up_from 31265 up_thru 31266 down_at 31268
> last_clean_interval [31254,31256) [v2:
> 192.168.30.11:6818/1641948535,v1:192.168.30.11:6819/1641948535] [v2:
> 192.168.30.11:6820/1641948535,v1:192.168.30.11:6821/1641948535]
> destroyed,exists
> osd.4 up   in  weight 1 up_from 31356 up_thru 31581 down_at 31339
> last_clean_interval [31320,31338) [v2:
> 192.168.30.11:6802/2785067179,v1:192.168.30.11:6803/2785067179] [v2:
> 192.168.30.11:6804/2785067179,v1:192.168.30.11:6805/2785067179] exists,up
> 3afd06db-b91d-44fe-9305-5eb95f7a59b9
> osd.5 up   in  weight 1 up_from 31347 up_thru 31699 down_at 31339
> last_clean_interval [31311,31338) [v2:
> 192.168.30.11:6818/1936771540,v1:192.168.30.11:6819/1936771540] [v2:
> 192.168.30.11:6820/1936771540,v1:192.168.30.11:6821/1936771540] exists,up
> 063c2ccf-02ce-4f5e-8252-dddfbb258a95
> osd.6 up   in  weight 1 up_from 31218 up_thru 31711 down_at 31217
> last_clean_interval [30978,31195) [v2:
> 192.168.30.12:6816/1585973160,v1:192.168.30.12:6817/1585973160] [v2:
> 192.168.30.12:6818/1585973160,v1:192.168.30.12:6819/1585973160] exists,up
> 94250ea2-f12e-4dc6-9135-b626086ccffd
> osd.7 down out weight 0 up_from 30353 up_thru 30558 down_at 30688
> last_clean_interval [25533,30349) [v2:
> 192.168.30.14:6816/4083104061,v1:192.168.30.14:6817/4083104061] [v2:
> 192.168.30.14:6840/4094104061,v1:192.168.30.14:6841/4094104061]
> autoout,exists de351aec-b91e-4c22-a0bf-85369bc14579
> osd.8 up   in  weight 1 up_from 31226 up_thru 31668 down_at 31225
> last_clean_interval [30983,31195) [v2:
> 192.168.30.12:6824/1312484329,v1:192.168.30.12:6825/1312484329] [v2:
> 192.168.30.12:6826/1312484329,v1:192.168.30.12:6827/1312484329] exists,up
> 51f665b4-fa5b-4b17-8390-ed130145ef04
> osd.9 up   in  weight 1 up_from 31351 up_thru 31673 down_at 31340
> last_clean_interval [31315,31338) [v2:
> 192.168.30.11:6810/1446838877,v1:192.168.30.11:6811/1446838877] [v2:
> 192.168.30.11:6812/1446838877,v1:192.168.30.11:6813/1446838877] exists,up
> 985f1127-d126-4629-b8cd-03cf2d914d99
> osd.10 up   in  weight 1 up_from 31219 up_thru 31639 down_at 31218
> last_clean_interval [30980,31195) [v2:
> 192.168.30.12:6808/1587842953,v1:192.168.30.12:6809/1587842953] [v2:
> 192.168.30.12:6810/1587842953,v1:192.168.30.12:6811/1587842953] exists,up
> c7fca03e-4bd5-4485-a090-658ca967d5f6
> osd.11 up   in  weight 1 up_from 31234 up_thru 31659 down_at 31223
> last_clean_interval [30978,31195) [v2:
> 192.168.30.12:6840/3403200742,v1:192.168.30.12:6841/3403200742] [v2:
> 192.168.30.12:6842/3403200742,v1:192.168.30.12:6843/3403200742] exists,up
> 81074bd7-ad9f-4e56-8885-cca4745f6c95
> osd.12 up   in  weight 1 up_from 31230 up_thru 31717 down_at 31223
> last_clean_interval [30975,31195) [v2:
> 192.168.30.13:6816/4268732910,v1:192.168.30.13:6817/4268732910] [v2:
> 192.168.30.13:6818/4268732910,v1:192.168.30.13:6819/4268732910] exists,up
> af1b55dd-c110-4861-aed9-c0737cef8be1
> osd.13 down out weight 0 up_from 30389 up_thru 30688 down_at 30695
> last_clean_interval [25513,30317) [v2:
> 192.168.30.16:6804/1573803727,v1:192.168.30.16:6805/1573803727] [v2:
> 192.168.30.16:6806/1573803727,v1:192.168.30.16:6807/1573803727]
> autoout,exists 737a3234-0f1f-4286-80e9-e89b581aae41
> osd.14 up   in  weight 1 up_from 31218 up_thru 31709 down_at 31217
> last_clean_interval [30979,31195) [v2:
> 192.168.30.13:6834/2291187817,v1:192.168.30.13:6835/2291187817] [v2:
> 192.168.30.13:6836/2291187817,v1:192.168.30.13:6837/2291187817] exists,up
> 97cd6ac7-aca0-42fd-a049-d27289f83183
> osd.15 down out weight 0 up_from 29463 up_thru 30531 down_at 30688
> last_clean_interval [25493,29462) [v2:
> 192.168.30.15:6808/2655269991,v1:192.168.30.15:6809/2655269991] [v2:
> 192.168.30.15:6802/2662269991,v1:192.168.30.15:6803/2662269991]
> autoout,exists 61aea8f4-5905-4be3-ae32-5eacf75a514e
> osd.16 up   in  weight 1 up_from 31226 up_thru 31647 down_at 31223
> last_clean_interval [30970,31195) [v2:
> 192.168.30.13:6808/2624812081,v1:192.168.30.13:6809/2624812081] [v2:
> 192.168.30.13:6810/2624812081,v1:192.168.30.13:6811/2624812081] exists,up
> 791a7542-87cd-403d-a37e-8f00506b2eb6
> osd.17 up   in  weight 1 up_from 31219 up_thru 31703 down_at 31218
> last_clean_interval [30975,31195) [v2:
> 192.168.30.13:6800/2978036397,v1:192.168.30.13:6801/2978036397] [v2:
> 192.168.30.13:6802/2978036397,v1:192.168.30.13:6803/2978036397] exists,up
> 4a915645-412f-49e6-8477-1577469905da
> osd.18 down out weight 0 up_from 30334 up_thru 30566 down_at 30688
> last_clean_interval [25543,30327) [v2:
> 192.168.30.14:6832/985432137,v1:192.168.30.14:6833/985432137] [v2:
> 192.168.30.14:6848/998432137,v1:192.168.30.14:6849/998432137]
> autoout,exists 85f59d83-710c-4896-9200-bda4894fc3e8
> osd.19 down in  weight 1 up_from 31303 up_thru 31321 down_at 31323
> last_clean_interval [31292,31296) [v2:
> 192.168.30.13:6826/375623427,v1:192.168.30.13:6827/375623427] [v2:
> 192.168.30.13:6828/375623427,v1:192.168.30.13:6829/375623427] exists
> 7d09b51a-bd6d-40f8-a009-78ab9937795d
> osd.20 down out weight 0 up_from 30445 up_thru 30531 down_at 30688
> last_clean_interval [28947,30444) [v2:
> 192.168.30.14:6810/4062649162,v1:192.168.30.14:6811/4062649162] [v2:
> 192.168.30.14:6800/4073649162,v1:192.168.30.14:6801/4073649162]
> autoout,exists 7ef6cc1a-4755-4a14-b9df-f1f538d903ce
> osd.21 up   in  weight 1 up_from 31345 up_thru 31567 down_at 31341
> last_clean_interval [31307,31340) [v2:
> 192.168.30.11:6826/1625231298,v1:192.168.30.11:6827/1625231298] [v2:
> 192.168.30.11:6828/1625231298,v1:192.168.30.11:6829/1625231298] exists,up
> 5ef2e39a-a353-4cb8-a49e-093fe39b94ef
> osd.22 down in  weight 1 up_from 30383 up_thru 30528 down_at 30688
> last_clean_interval [25549,30317) [v2:
> 192.168.30.14:6806/1204256629,v1:192.168.30.14:6807/1204256629] [v2:
> 192.168.30.14:6812/1204256629,v1:192.168.30.14:6813/1204256629] exists
> c9befe11-a035-449c-8d17-42aaf191923d
> osd.23 down in  weight 1 up_from 30334 up_thru 30576 down_at 30688
> last_clean_interval [30263,30332) [v2:
> 192.168.30.14:6802/3837786490,v1:192.168.30.14:6803/3837786490] [v2:
> 192.168.30.14:6830/3838786490,v1:192.168.30.14:6831/3838786490] exists
> 2081147b-065d-4da7-89d9-747e1ae02b8d
> osd.24 down in  weight 1 up_from 29455 up_thru 30570 down_at 30688
> last_clean_interval [25487,29453) [v2:
> 192.168.30.15:6800/2008474583,v1:192.168.30.15:6801/2008474583] [v2:
> 192.168.30.15:6810/2016474583,v1:192.168.30.15:6811/2016474583] exists
> 39d78380-261c-4689-b53d-90713e6ffcca
> osd.26 up   in  weight 1 up_from 31208 up_thru 31643 down_at 31207
> last_clean_interval [30967,31195) [v2:
> 192.168.30.12:6800/2861018947,v1:192.168.30.12:6801/2861018947] [v2:
> 192.168.30.12:6802/2861018947,v1:192.168.30.12:6803/2861018947] exists,up
> 046622c8-c09c-4254-8c15-3dc05a2f01ed
> osd.28 down in  weight 1 up_from 30389 up_thru 30574 down_at 30691
> last_clean_interval [25513,30312) [v2:
> 192.168.30.16:6820/3466284570,v1:192.168.30.16:6821/3466284570] [v2:
> 192.168.30.16:6822/3466284570,v1:192.168.30.16:6823/3466284570] exists
> 10578b97-e3c4-4553-a8d0-6dcc46af5db1
> osd.29 down in  weight 1 up_from 30378 up_thru 30554 down_at 30688
> last_clean_interval [28595,30376) [v2:
> 192.168.30.14:6808/3739543672,v1:192.168.30.14:6809/3739543672] [v2:
> 192.168.30.14:6846/3747543672,v1:192.168.30.14:6847/3747543672] exists
> 9698e936-8edf-4adf-92c9-a0b5202ed01a
> osd.30 down in  weight 1 up_from 30449 up_thru 30531 down_at 30688
> last_clean_interval [25502,30446) [v2:
> 192.168.30.15:6825/2375507296,v1:192.168.30.15:6827/2375507296] [v2:
> 192.168.30.15:6829/2375507296,v1:192.168.30.15:6831/2375507296] exists
> e14d2a0f-a98a-44d4-8c06-4d893f673629
> osd.31 down in  weight 1 up_from 30364 up_thru 30688 down_at 30700
> last_clean_interval [25514,30361) [v2:
> 192.168.30.16:6826/2835000708,v1:192.168.30.16:6827/2835000708] [v2:
> 192.168.30.16:6802/2843000708,v1:192.168.30.16:6803/2843000708] exists
> 541bca38-e704-483a-8cb8-39e5f69007d1
> osd.32 up   in  weight 1 up_from 31209 up_thru 31627 down_at 31208
> last_clean_interval [30974,31195) [v2:
> 192.168.30.12:6832/3860067997,v1:192.168.30.12:6833/3860067997] [v2:
> 192.168.30.12:6834/3860067997,v1:192.168.30.12:6835/3860067997] exists,up
> 9200a57e-2845-43ff-9787-8f1f3158fe90
> osd.33 down in  weight 1 up_from 30354 up_thru 30688 down_at 30693
> last_clean_interval [25521,30350) [v2:
> 192.168.30.16:6842/2342555666,v1:192.168.30.16:6843/2342555666] [v2:
> 192.168.30.16:6844/2364555666,v1:192.168.30.16:6845/2364555666] exists
> 20c55d85-cf9a-4133-a189-7fdad2318f58
> osd.34 down in  weight 1 up_from 30390 up_thru 30688 down_at 30691
> last_clean_interval [25516,30314) [v2:
> 192.168.30.16:6808/2282629870,v1:192.168.30.16:6811/2282629870] [v2:
> 192.168.30.16:6812/2282629870,v1:192.168.30.16:6814/2282629870] exists
> 77e0ef8f-c047-4f84-afb2-a8ad054e562f
> osd.35 up   in  weight 1 up_from 31204 up_thru 31657 down_at 31203
> last_clean_interval [30958,31195) [v2:
> 192.168.30.13:6842/1919357520,v1:192.168.30.13:6843/1919357520] [v2:
> 192.168.30.13:6844/1919357520,v1:192.168.30.13:6845/1919357520] exists,up
> 2d2de0cb-6d41-4957-a473-2bbe9ce227bf
> osd.36 down in  weight 1 up_from 29494 up_thru 30560 down_at 30688
> last_clean_interval [25491,29492) [v2:
> 192.168.30.15:6816/2153321591,v1:192.168.30.15:6817/2153321591] [v2:
> 192.168.30.15:6842/2158321591,v1:192.168.30.15:6843/2158321591] exists
> 26114668-68b2-458b-89c2-cbad5507ab75
>
>
>
>>
>> > On Jul 25, 2022, at 3:29 AM, Jeremy Hansen <
>> farnsworth.mcfadden@xxxxxxxxx> wrote:
>> >
>> > I transitioned some servers to a new rack and now I'm having major issues
>> > with Ceph upon bringing things back up.
>> >
>> > I believe the issue may be related to the ceph nodes coming back up with
>> > different IPs before VLANs were set.  That's just a guess because I can't
>> > think of any other reason this would happen.
>> >
>> > Current state:
>> >
>> > Every 2.0s: ceph -s
>> >   cn01.ceph.la1.clx.corp: Mon Jul 25 10:13:05 2022
>> >
>> >  cluster:
>> >    id:     bfa2ad58-c049-11eb-9098-3c8cf8ed728d
>> >    health: HEALTH_WARN
>> >            1 filesystem is degraded
>> >            2 MDSs report slow metadata IOs
>> >            2/5 mons down, quorum cn02,cn03,cn01
>> >            9 osds down
>> >            3 hosts (17 osds) down
>> >            Reduced data availability: 97 pgs inactive, 9 pgs down
>> >            Degraded data redundancy: 13860144/30824413 objects degraded
>> > (44.965%), 411 pgs degraded, 482 pgs undersized
>> >
>> >  services:
>> >    mon: 5 daemons, quorum cn02,cn03,cn01 (age 62m), out of quorum: cn05,
>> > cn04
>> >    mgr: cn02.arszct(active, since 5m)
>> >    mds: 2/2 daemons up, 2 standby
>> >    osd: 35 osds: 15 up (since 62m), 24 in (since 58m); 222 remapped pgs
>> >
>> >  data:
>> >    volumes: 1/2 healthy, 1 recovering
>> >    pools:   8 pools, 545 pgs
>> >    objects: 7.71M objects, 6.7 TiB
>> >    usage:   15 TiB used, 39 TiB / 54 TiB avail
>> >    pgs:     0.367% pgs unknown
>> >             17.431% pgs not active
>> >             13860144/30824413 objects degraded (44.965%)
>> >             1137693/30824413 objects misplaced (3.691%)
>> >             280 active+undersized+degraded
>> >             67  undersized+degraded+remapped+backfilling+peered
>> >             57  active+undersized+remapped
>> >             45  active+clean+remapped
>> >             44  active+undersized+degraded+remapped+backfilling
>> >             18  undersized+degraded+peered
>> >             10  active+undersized
>> >             9   down
>> >             7   active+clean
>> >             3   active+undersized+remapped+backfilling
>> >             2   active+undersized+degraded+remapped+backfill_wait
>> >             2   unknown
>> >             1   undersized+peered
>> >
>> >  io:
>> >    client:   170 B/s rd, 0 op/s rd, 0 op/s wr
>> >    recovery: 168 MiB/s, 158 keys/s, 166 objects/s
>> >
>> > I have to disable and re-enable the dashboard just to use it.  It seems to
>> > get bogged down after a few moments.
>> >
>> > Ceph has marked the three servers that were moved to the new rack as
>> > "Down", but if I run a cephadm host-check, they all seem to pass:
>> >
>> > ************************ ceph  ************************
>> > --------- cn01.ceph.---------
>> > podman (/usr/bin/podman) version 4.0.2 is present
>> > systemctl is present
>> > lvcreate is present
>> > Unit chronyd.service is enabled and running
>> > Host looks OK
>> > --------- cn02.ceph.---------
>> > podman (/usr/bin/podman) version 4.0.2 is present
>> > systemctl is present
>> > lvcreate is present
>> > Unit chronyd.service is enabled and running
>> > Host looks OK
>> > --------- cn03.ceph.---------
>> > podman (/usr/bin/podman) version 4.0.2 is present
>> > systemctl is present
>> > lvcreate is present
>> > Unit chronyd.service is enabled and running
>> > Host looks OK
>> > --------- cn04.ceph.---------
>> > podman (/usr/bin/podman) version 4.0.2 is present
>> > systemctl is present
>> > lvcreate is present
>> > Unit chronyd.service is enabled and running
>> > Host looks OK
>> > --------- cn05.ceph.---------
>> > podman|docker (/usr/bin/podman) is present
>> > systemctl is present
>> > lvcreate is present
>> > Unit chronyd.service is enabled and running
>> > Host looks OK
>> > --------- cn06.ceph.---------
>> > podman (/usr/bin/podman) version 4.0.2 is present
>> > systemctl is present
>> > lvcreate is present
>> > Unit chronyd.service is enabled and running
>> > Host looks OK
>> >
>> > It seems to be recovering with what it has left, but a large number of OSDs
>> > are down.  When I try to restart one of the downed OSDs, I see a huge dump.
>> >
>> > Jul 25 03:19:38 cn06.ceph
>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug
>> > 2022-07-25T10:19:38.532+0000 7fce14a6c080  0 osd.34 30689 done with
>> init,
>> > starting boot process
>> > Jul 25 03:19:38 cn06.ceph
>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug
>> > 2022-07-25T10:19:38.532+0000 7fce14a6c080  1 osd.34 30689 start_boot
>> > Jul 25 03:20:10 cn06.ceph
>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug
>> > 2022-07-25T10:20:10.655+0000 7fcdfd12d700  1 osd.34 30689 start_boot
>> > Jul 25 03:20:41 cn06.ceph
>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug
>> > 2022-07-25T10:20:41.159+0000 7fcdfd12d700  1 osd.34 30689 start_boot
>> > Jul 25 03:21:11 cn06.ceph
>> > ceph-bfa2ad58-c049-11eb-9098-3c8cf8ed728d-osd-34[9516]: debug
>> > 2022-07-25T10:21:11.662+0000 7fcdfd12d700  1 osd.34 30689 start_boot
>> >
>> > At this point it just keeps printing start_boot, but the dashboard has it
>> > marked as "in" but "down".
>> >
>> > On the three hosts that moved, a bunch of OSDs were marked "out" and
>> > "down", and some "in" but "down".
>> >
>> > Not sure where to go next.  I'm going to let the recovery continue and
>> > hope that my 4x replication on these pools saves me.
>> >
>> > Any help is very much appreciated.  This Ceph cluster holds all of our
>> > Cloudstack images...  it would be terrible to lose this data.
>> > _______________________________________________
>> > ceph-users mailing list -- ceph-users@xxxxxxx
>> > To unsubscribe send an email to ceph-users-leave@xxxxxxx
>>
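
Regarding the IP theory quoted above (nodes briefly coming back with different
addresses), this is roughly what I plan to compare next; a sketch, assuming the
whole cluster should sit on 192.168.30.0/24 and that root SSH to the hosts
works:

ceph mon dump                          # mon.cn04 / mon.cn05 addresses should match the hosts' current IPs
ceph config get mon public_network     # should cover 192.168.30.0/24
ceph config get osd cluster_network    # expected to be empty, since there is only one network
ssh root@cn04.ceph 'ip -4 addr show'   # confirm the host really has its 192.168.30.14 address back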
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx


