I ran sudo systemctl status ceph\*.service ceph\*.target on all monitor nodes from the CLI. All of them showed output like this:

root@node7:~# sudo systemctl status ceph\*.service ceph\*.target
● ceph-mds.target - ceph target allowing to start/stop all ceph-mds@.service instances at once
     Loaded: loaded (/lib/systemd/system/ceph-mds.target; enabled; vendor preset: enabled)
     Active: active since Mon 2022-03-14 00:34:34 CDT; 1min 58s ago

Mar 14 00:34:34 node7 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mds@.service instances at once.

● ceph-mgr.target - ceph target allowing to start/stop all ceph-mgr@.service instances at once
     Loaded: loaded (/lib/systemd/system/ceph-mgr.target; enabled; vendor preset: enabled)
     Active: active since Mon 2022-03-14 00:34:34 CDT; 1min 59s ago

Mar 14 00:34:34 node7 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mgr@.service instances at once.

● ceph-mds@node7.service - Ceph metadata server daemon
     Loaded: loaded (/lib/systemd/system/ceph-mds@.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mds@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Mon 2022-03-14 00:34:34 CDT; 1min 59s ago
   Main PID: 864832 (ceph-mds)
      Tasks: 9
     Memory: 9.7M
        CPU: 96ms
     CGroup: /system.slice/system-ceph\x2dmds.slice/ceph-mds@node7.service
             └─864832 /usr/bin/ceph-mds -f --cluster ceph --id node7 --setuser ceph --setgroup ceph

Mar 14 00:34:34 node7 systemd[1]: Started Ceph metadata server daemon.

● ceph-mgr@node7.service - Ceph cluster manager daemon
     Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mgr@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Mon 2022-03-14 00:34:34 CDT; 1min 59s ago
   Main PID: 864833 (ceph-mgr)
      Tasks: 9 (limit: 19118)
     Memory: 10.2M
        CPU: 99ms
     CGroup: /system.slice/system-ceph\x2dmgr.slice/ceph-mgr@node7.service
             └─864833 /usr/bin/ceph-mgr -f --cluster ceph --id node7 --setuser ceph --setgroup ceph

Mar 14 00:34:34 node7 systemd[1]: Started Ceph cluster manager daemon.

● ceph-osd@6.service - Ceph object storage daemon osd.6
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Mon 2022-03-14 00:34:34 CDT; 1min 59s ago
   Main PID: 864839 (ceph-osd)
      Tasks: 9
     Memory: 10.1M
        CPU: 100ms
     CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@6.service
             └─864839 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph

Mar 14 00:34:34 node7 systemd[1]: Starting Ceph object storage daemon osd.6...
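(Side note: rather than eyeballing the full status wall on every host, I figure a quick summary loop like this should work from one node, assuming root SSH between the cluster nodes; the host names are just the mon hosts from my monmap, adjust as needed:

for n in node2 node7 node900 stack1; do
    echo "== $n =="
    ssh root@"$n" "systemctl list-units 'ceph*' --all --no-pager --no-legend"
done

That just lists every ceph unit and its loaded/active/failed state per node.)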
Then I found node900 had its mon down and most of its ceph-osd daemons down. The ceph crash dump collector was still working though, lol, and osd.11 was active, but I am not seeing the other OSDs, so I am wondering what is going on:

root@node900:/etc/ceph# sudo systemctl status ceph\*.service ceph\*.target
● ceph-mgr.target - ceph target allowing to start/stop all ceph-mgr@.service instances at once
     Loaded: loaded (/lib/systemd/system/ceph-mgr.target; enabled; vendor preset: enabled)
     Active: active since Mon 2022-03-14 00:34:25 CDT; 3min 37s ago

Mar 14 00:34:25 node900 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mgr@.service instances at once.

● ceph-osd@11.service - Ceph object storage daemon osd.11
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Mon 2022-03-14 00:34:25 CDT; 3min 37s ago
   Main PID: 343715 (ceph-osd)
      Tasks: 9
     Memory: 10.8M
        CPU: 258ms
     CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@11.service
             └─343715 /usr/bin/ceph-osd -f --cluster ceph --id 11 --setuser ceph --setgroup ceph

Mar 14 00:34:25 node900 systemd[1]: Starting Ceph object storage daemon osd.11...
Mar 14 00:34:25 node900 systemd[1]: Started Ceph object storage daemon osd.11.

● ceph-mon@stack1.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: exit-code) since Mon 2022-03-14 00:35:17 CDT; 2min 46s ago
   Main PID: 344012 (code=exited, status=1/FAILURE)
        CPU: 109ms

Mar 14 00:35:17 node900 systemd[1]: ceph-mon@stack1.service: Scheduled restart job, restart counter is at 5.
Mar 14 00:35:17 node900 systemd[1]: Stopped Ceph cluster monitor daemon.
Mar 14 00:35:17 node900 systemd[1]: ceph-mon@stack1.service: Start request repeated too quickly.
Mar 14 00:35:17 node900 systemd[1]: ceph-mon@stack1.service: Failed with result 'exit-code'.
Mar 14 00:35:17 node900 systemd[1]: Failed to start Ceph cluster monitor daemon.

● ceph-osd@1410.service - Ceph object storage daemon osd.1410
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; disabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: exit-code) since Fri 2022-03-04 23:17:05 CST; 1 weeks 2 days ago
        CPU: 38ms

Mar 04 23:17:05 node900 systemd[1]: ceph-osd@1410.service: Scheduled restart job, restart counter is at 6.
Mar 04 23:17:05 node900 systemd[1]: Stopped Ceph object storage daemon osd.1410.
Mar 04 23:17:05 node900 systemd[1]: ceph-osd@1410.service: Start request repeated too quickly.
Mar 04 23:17:05 node900 systemd[1]: ceph-osd@1410.service: Failed with result 'exit-code'.
Mar 04 23:17:05 node900 systemd[1]: Failed to start Ceph object storage daemon osd.1410.

● ceph-crash.service - Ceph crash dump collector
     Loaded: loaded (/lib/systemd/system/ceph-crash.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-03-14 00:34:25 CDT; 3min 37s ago
   Main PID: 343694 (ceph-crash)
      Tasks: 1 (limit: 77203)
     Memory: 5.6M
        CPU: 92ms
     CGroup: /system.slice/ceph-crash.service
             └─343694 /usr/bin/python3.9 /usr/bin/ceph-crash
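Both failed units on node900 died with "Start request repeated too quickly", so systemd has given up retrying and the status above only shows the start limit, not the real error. My plan (unit names taken straight from the output above) is to pull the earlier journal entries first, and only then clear the start limit and retry:

journalctl -u ceph-mon@stack1.service -n 100 --no-pager
journalctl -u ceph-osd@1410.service -n 100 --no-pager
# once the underlying error is understood/fixed, reset the start limit and retry
systemctl reset-failed ceph-mon@stack1.service ceph-osd@1410.service
systemctl start ceph-mon@stack1.service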
root@node900:/etc/ceph# systemctl status ceph-mon@node900
● ceph-mon@node900.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Mon 2022-03-14 00:34:25 CDT; 8min ago
   Main PID: 343697 (ceph-mon)
      Tasks: 26
     Memory: 22.1M
        CPU: 1.292s
     CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@node900.service
             └─343697 /usr/bin/ceph-mon -f --cluster ceph --id node900 --setuser ceph --setgroup ceph

Mar 14 00:42:11 node900 ceph-mon[343697]: 2022-03-14T00:42:11.104-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:16 node900 ceph-mon[343697]: 2022-03-14T00:42:16.104-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:21 node900 ceph-mon[343697]: 2022-03-14T00:42:21.104-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:26 node900 ceph-mon[343697]: 2022-03-14T00:42:26.104-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:31 node900 ceph-mon[343697]: 2022-03-14T00:42:31.105-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:36 node900 ceph-mon[343697]: 2022-03-14T00:42:36.105-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:41 node900 ceph-mon[343697]: 2022-03-14T00:42:41.109-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:46 node900 ceph-mon[343697]: 2022-03-14T00:42:46.109-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:51 node900 ceph-mon[343697]: 2022-03-14T00:42:51.109-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:56 node900 ceph-mon[343697]: 2022-03-14T00:42:56.109-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
root@node900:/etc/ceph#
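To answer the asok question below: since ceph -s hangs cluster-wide, I am going to query this stuck (probing) mon directly over its admin socket instead, assuming the default socket path under /var/run/ceph:

ceph daemon mon.node900 mon_status
# or, naming the socket explicitly:
ceph --admin-daemon /var/run/ceph/ceph-mon.node900.asok mon_status
ceph --admin-daemon /var/run/ceph/ceph-mon.node900.asok quorum_status

That should at least show which monmap this mon has and which peers it is trying to probe.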
----------

The cluster logs now show some other errors as well, including critical space issues:

2022-02-21T11:23:40.212664-0600 mon.node2 (mon.0) 2000851 : cluster [INF] mon.node2 is new leader, mons node2,stack1,node7 in quorum (ranks 0,1,3)
2022-02-21T11:23:40.219565-0600 mon.node2 (mon.0) 2000852 : cluster [DBG] monmap e14: 4 mons at {node2=[v2:10.0.1.2:3300/0,v1:10.0.1.2:6789/0],node7=[v2:10.0.1.7:3300/0,v1:10.0.1.7:6789/0],node900=[v2:10.0.90.0:3300/0,v1:10.0.90.0:6789/0],stack1=[v2:10.0.1.1:3300/0,v1:10.0.1.1:6789/0]}
2022-02-21T11:23:40.219633-0600 mon.node2 (mon.0) 2000853 : cluster [DBG] fsmap cephfs:1 {0=node2=up:active} 2 up:standby
2022-02-21T11:23:40.219653-0600 mon.node2 (mon.0) 2000854 : cluster [DBG] osdmap e967678: 14 total, 4 up, 10 in
2022-02-21T11:23:40.220140-0600 mon.node2 (mon.0) 2000855 : cluster [DBG] mgrmap e649: stack1(active, since 5d), standbys: node2, node7
2022-02-21T11:23:40.228388-0600 mon.node2 (mon.0) 2000856 : cluster [ERR] Health detail: HEALTH_ERR 1 MDSs report slow metadata IOs; mon node7 is very low on available space; mon stack1 is low on available space; 1/4 mons down, quorum node2,stack1,node7; 6 osds down; 1 host (7 osds) down; Reduced data availability: 169 pgs inactive, 45 pgs down, 124 pgs peering, 388 pgs stale; 138 slow ops, oldest one blocked for 61680 sec, osd.0 has slow ops
2022-02-21T11:23:40.228404-0600 mon.node2 (mon.0) 2000857 : cluster [ERR] [WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
2022-02-21T11:23:40.228409-0600 mon.node2 (mon.0) 2000858 : cluster [ERR] mds.node2(mds.0): 7 slow metadata IOs are blocked > 30 secs, oldest blocked for 78670 secs
2022-02-21T11:23:40.228413-0600 mon.node2 (mon.0) 2000859 : cluster [ERR] [ERR] MON_DISK_CRIT: mon node7 is very low on available space
2022-02-21T11:23:40.228416-0600 mon.node2 (mon.0) 2000860 : cluster [ERR] mon.node7 has 1% avail
2022-02-21T11:23:40.228422-0600 mon.node2 (mon.0) 2000861 : cluster [ERR] [WRN] MON_DISK_LOW: mon stack1 is low on available space
2022-02-21T11:23:40.228428-0600 mon.node2 (mon.0) 2000862 : cluster [ERR] mon.stack1 has 8% avail
2022-02-21T11:23:40.228432-0600 mon.node2 (mon.0) 2000863 : cluster [ERR] [WRN] MON_DOWN: 1/4 mons down, quorum node2,stack1,node7
2022-02-21T11:23:40.228437-0600 mon.node2 (mon.0) 2000864 : cluster [ERR] mon.node900 (rank 2) addr [v2:10.0.90.0:3300/0,v1:10.0.90.0:6789/0] is down (out of quorum)
2022-02-21T11:23:40.228443-0600 mon.node2 (mon.0) 2000865 : cluster [ERR] [WRN] OSD_DOWN: 6 osds down
2022-02-21T11:23:40.228449-0600 mon.node2 (mon.0) 2000866 : cluster [ERR] osd.8 (root=default,host=node900) is down
2022-02-21T11:23:40.228454-0600 mon.node2 (mon.0) 2000867 : cluster [ERR] osd.9 (root=default,host=node900) is down
2022-02-21T11:23:40.228460-0600 mon.node2 (mon.0) 2000868 : cluster [ERR] osd.10 (root=default,host=node900) is down
2022-02-21T11:23:40.228466-0600 mon.node2 (mon.0) 2000869 : cluster [ERR] osd.11 (root=default,host=node900) is down
2022-02-21T11:23:40.228471-0600 mon.node2 (mon.0) 2000870 : cluster [ERR] osd.12 (root=default,host=node900) is down
2022-02-21T11:23:40.228477-0600 mon.node2 (mon.0) 2000871 : cluster [ERR] osd.13 (root=default,host=node900) is down
2022-02-21T11:23:40.228483-0600 mon.node2 (mon.0) 2000872 : cluster [ERR] [WRN] OSD_HOST_DOWN: 1 host (7 osds) down
2022-02-21T11:23:40.228488-0600 mon.node2 (mon.0) 2000873 : cluster [ERR] host node900 (root=default) (7 osds) is down
2022-02-21T11:23:40.228527-0600 mon.node2 (mon.0) 2000874 : cluster [ERR] [WRN] PG_AVAILABILITY: Reduced data availability: 169 pgs inactive, 45 pgs down, 124 pgs peering, 388 pgs stale
2022-02-21T11:23:40.228534-0600 mon.node2 (mon.0) 2000875 : cluster [ERR] pg 7.cd is stuck inactive for 21h, current state stale+down, last acting [0]
2022-02-21T11:23:40.228539-0600 mon.node2 (mon.0) 2000876 : cluster [ERR] pg 7.ce is stuck peering for 21h, current state peering, last acting [0,7]
2022-02-21T11:23:40.228544-0600 mon.node2 (mon.0) 2000877 : cluster [ERR] pg 7.cf is stuck stale for 21h, current state stale+active+clean, last acting [6,3,8]
2022-02-21T11:23:40.228550-0600 mon.node2 (mon.0) 2000878 : cluster [ERR] pg 7.d0 is stuck stale for 21h, current state stale+active+clean, last acting [12,2,6]
2022-02-21T11:23:40.228555-0600 mon.node2 (mon.0) 2000879 : cluster [ERR] pg 7.d1 is stuck stale for 21h, current state stale+active+clean, last acting [9,1,2]
2022-02-21T11:23:40.228561-0600 mon.node2 (mon.0) 2000880 : cluster [ERR] pg 7.d2 is stuck stale for 21h, current state stale+active+clean, last acting [3,9,2]
2022-02-21T11:23:40.228567-0600 mon.node2 (mon.0) 2000881 : cluster [ERR] pg 7.d3 is stuck peering for 21h, current state peering, last acting [0,6]
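Given the MON_DISK_CRIT / MON_DISK_LOW lines above (mon.node7 at 1% avail, mon.stack1 at 8%), I want to rule out full disks first, since the mons will not stay healthy without free space for their store. A quick check I plan to run on each mon host, assuming the default data and log paths:

df -h /var/lib/ceph/mon /var/log/ceph
du -sh /var/lib/ceph/mon/ceph-*/store.db
du -sh /var/log/ceph

If the mon store.db or the logs have ballooned, I will free up space there before restarting anything else.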
On Sun, Mar 13, 2022 at 1:56 PM Alvaro Soto <alsotoes@xxxxxxxxx> wrote:

> Is your network up/up on all nodes? What about your mgr daemon?
>
> Try to recover the daemons first. Also, can you run a status using the
> asok file?
>
> On Sun, Mar 13, 2022, 11:39 AM GoZippy <gotadvantage@xxxxxxxxx> wrote:
>
>> Using a 9-node cluster on Proxmox.
>> Ceph was updated automatically with a system update.
>> When I rebooted the nodes I did not set noout and just rebooted them -
>> that might be the root cause of all the rebalancing and lost
>> connectivity...
>>
>> I see monitors active (running) with systemctl status
>> ceph-mon@<nodename>.service
>>
>> I see my HDD physical disks still assigned as OSDs 1-14 or whatever I
>> have.
>>
>> ceph -s hangs.
>> Ceph commands pretty much all hang. Using systemctl I can turn the
>> monitors off and stop all services - I think...
>>
>> sudo systemctl stop ceph\*.service ceph\*.target
>>
>> seems to have stopped the services on all nodes... restarting does not
>> help.
>>
>> /var/log/ceph is full of logs as expected (not sure where to start with
>> them to look at issues).
>>
>> Pools and MDS do not show up anymore - maybe they got purged or deleted...
>>
>> Wondering if there is a way to undelete maps and pool data from the Debian
>> host, or use Ceph tools to create a filesystem with the same name, rebuild
>> the map, and not lose the data on the OSD stores...
>>
>> See the following post where I was posting screenshots and logs and trying
>> to sort it out but got nowhere... I came here to the list as a last resort
>> before I nuke it all and start over.. I would like to save some VMs I have
>> on those OSDs.
>>
>> https://forum.proxmox.com/threads/ceph-not-working-monitors-and-managers-lost.100672/page-2
>>
>> I read on
>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/K5X6DPSUHMPZ3P7ADV64B4YLPQPWQS5J/
>> that it might be possible to just overwrite the name and let the system
>> "heal", but that seems risky and I am not exactly sure what command to use
>> to create a filesystem with the same name and set up the same metadata
>> info too... where can I look - in the logs, in Proxmox, or otherwise - to
>> make sure I use the correct names if this is an option? Or is there any
>> better documented way to recover a lost filesystem and restore quorum?
>>
>> Home-brew and learning as I go.. so forgive the lack of expertise here - I
>> am learning.
>>
>> Respectfully,
>>
>> William Henderson
>> 316-518-9350

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx