I ran sudo systemctl status ceph\*.service ceph\*.target on all monitor nodes from the CLI. All of them showed output like this:

root@node7:~# sudo systemctl status ceph\*.service ceph\*.target
● ceph-mds.target - ceph target allowing to start/stop all ceph-mds@.service instances at once
     Loaded: loaded (/lib/systemd/system/ceph-mds.target; enabled; vendor preset: enabled)
     Active: active since Mon 2022-03-14 00:34:34 CDT; 1min 58s ago

Mar 14 00:34:34 node7 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mds@.service instances at once.

● ceph-mgr.target - ceph target allowing to start/stop all ceph-mgr@.service instances at once
     Loaded: loaded (/lib/systemd/system/ceph-mgr.target; enabled; vendor preset: enabled)
     Active: active since Mon 2022-03-14 00:34:34 CDT; 1min 59s ago

Mar 14 00:34:34 node7 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mgr@.service instances at once.

● ceph-mds@node7.service - Ceph metadata server daemon
     Loaded: loaded (/lib/systemd/system/ceph-mds@.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mds@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Mon 2022-03-14 00:34:34 CDT; 1min 59s ago
   Main PID: 864832 (ceph-mds)
      Tasks: 9
     Memory: 9.7M
        CPU: 96ms
     CGroup: /system.slice/system-ceph\x2dmds.slice/ceph-mds@node7.service
             └─864832 /usr/bin/ceph-mds -f --cluster ceph --id node7 --setuser ceph --setgroup ceph

Mar 14 00:34:34 node7 systemd[1]: Started Ceph metadata server daemon.

● ceph-mgr@node7.service - Ceph cluster manager daemon
     Loaded: loaded (/lib/systemd/system/ceph-mgr@.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mgr@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Mon 2022-03-14 00:34:34 CDT; 1min 59s ago
   Main PID: 864833 (ceph-mgr)
      Tasks: 9 (limit: 19118)
     Memory: 10.2M
        CPU: 99ms
     CGroup: /system.slice/system-ceph\x2dmgr.slice/ceph-mgr@node7.service
             └─864833 /usr/bin/ceph-mgr -f --cluster ceph --id node7 --setuser ceph --setgroup ceph

Mar 14 00:34:34 node7 systemd[1]: Started Ceph cluster manager daemon.

● ceph-osd@6.service - Ceph object storage daemon osd.6
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Mon 2022-03-14 00:34:34 CDT; 1min 59s ago
   Main PID: 864839 (ceph-osd)
      Tasks: 9
     Memory: 10.1M
        CPU: 100ms
     CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@6.service
             └─864839 /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph

Mar 14 00:34:34 node7 systemd[1]: Starting Ceph object storage daemon osd.6...
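(Side note: rather than eyeballing the full status wall on every host, I figure a quick summary loop like this should work from one node, assuming root SSH between the cluster nodes; the host names are just the mon hosts from my monmap, adjust as needed:

for n in node2 node7 node900 stack1; do
    echo "== $n =="
    ssh root@"$n" "systemctl list-units 'ceph*' --all --no-pager --no-legend"
done

That just lists every ceph unit and its loaded/active/failed state per node.)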
Then I found node900 had its mon down and most of its ceph-osd daemons down. The ceph crash dump collector was still working though, lol, and osd.11 was active, but I am not seeing the other OSDs, so I am wondering what is going on:

root@node900:/etc/ceph# sudo systemctl status ceph\*.service ceph\*.target
● ceph-mgr.target - ceph target allowing to start/stop all ceph-mgr@.service instances at once
     Loaded: loaded (/lib/systemd/system/ceph-mgr.target; enabled; vendor preset: enabled)
     Active: active since Mon 2022-03-14 00:34:25 CDT; 3min 37s ago

Mar 14 00:34:25 node900 systemd[1]: Reached target ceph target allowing to start/stop all ceph-mgr@.service instances at once.

● ceph-osd@11.service - Ceph object storage daemon osd.11
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; enabled-runtime; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Mon 2022-03-14 00:34:25 CDT; 3min 37s ago
   Main PID: 343715 (ceph-osd)
      Tasks: 9
     Memory: 10.8M
        CPU: 258ms
     CGroup: /system.slice/system-ceph\x2dosd.slice/ceph-osd@11.service
             └─343715 /usr/bin/ceph-osd -f --cluster ceph --id 11 --setuser ceph --setgroup ceph

Mar 14 00:34:25 node900 systemd[1]: Starting Ceph object storage daemon osd.11...
Mar 14 00:34:25 node900 systemd[1]: Started Ceph object storage daemon osd.11.

● ceph-mon@stack1.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: exit-code) since Mon 2022-03-14 00:35:17 CDT; 2min 46s ago
   Main PID: 344012 (code=exited, status=1/FAILURE)
        CPU: 109ms

Mar 14 00:35:17 node900 systemd[1]: ceph-mon@stack1.service: Scheduled restart job, restart counter is at 5.
Mar 14 00:35:17 node900 systemd[1]: Stopped Ceph cluster monitor daemon.
Mar 14 00:35:17 node900 systemd[1]: ceph-mon@stack1.service: Start request repeated too quickly.
Mar 14 00:35:17 node900 systemd[1]: ceph-mon@stack1.service: Failed with result 'exit-code'.
Mar 14 00:35:17 node900 systemd[1]: Failed to start Ceph cluster monitor daemon.

● ceph-osd@1410.service - Ceph object storage daemon osd.1410
     Loaded: loaded (/lib/systemd/system/ceph-osd@.service; disabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-osd@.service.d
             └─ceph-after-pve-cluster.conf
     Active: failed (Result: exit-code) since Fri 2022-03-04 23:17:05 CST; 1 weeks 2 days ago
        CPU: 38ms

Mar 04 23:17:05 node900 systemd[1]: ceph-osd@1410.service: Scheduled restart job, restart counter is at 6.
Mar 04 23:17:05 node900 systemd[1]: Stopped Ceph object storage daemon osd.1410.
Mar 04 23:17:05 node900 systemd[1]: ceph-osd@1410.service: Start request repeated too quickly.
Mar 04 23:17:05 node900 systemd[1]: ceph-osd@1410.service: Failed with result 'exit-code'.
Mar 04 23:17:05 node900 systemd[1]: Failed to start Ceph object storage daemon osd.1410.

● ceph-crash.service - Ceph crash dump collector
     Loaded: loaded (/lib/systemd/system/ceph-crash.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-03-14 00:34:25 CDT; 3min 37s ago
   Main PID: 343694 (ceph-crash)
      Tasks: 1 (limit: 77203)
     Memory: 5.6M
        CPU: 92ms
     CGroup: /system.slice/ceph-crash.service
             └─343694 /usr/bin/python3.9 /usr/bin/ceph-crash
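Both failed units on node900 died with "Start request repeated too quickly", so systemd has given up retrying and the status above only shows the start limit, not the real error. My plan (unit names taken straight from the output above) is to pull the earlier journal entries first, and only then clear the start limit and retry:

journalctl -u ceph-mon@stack1.service -n 100 --no-pager
journalctl -u ceph-osd@1410.service -n 100 --no-pager
# once the underlying error is understood/fixed, reset the start limit and retry
systemctl reset-failed ceph-mon@stack1.service ceph-osd@1410.service
systemctl start ceph-mon@stack1.service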
root@node900:/etc/ceph# systemctl status ceph-mon@node900
● ceph-mon@node900.service - Ceph cluster monitor daemon
     Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
    Drop-In: /usr/lib/systemd/system/ceph-mon@.service.d
             └─ceph-after-pve-cluster.conf
     Active: active (running) since Mon 2022-03-14 00:34:25 CDT; 8min ago
   Main PID: 343697 (ceph-mon)
      Tasks: 26
     Memory: 22.1M
        CPU: 1.292s
     CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@node900.service
             └─343697 /usr/bin/ceph-mon -f --cluster ceph --id node900 --setuser ceph --setgroup ceph

Mar 14 00:42:11 node900 ceph-mon[343697]: 2022-03-14T00:42:11.104-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:16 node900 ceph-mon[343697]: 2022-03-14T00:42:16.104-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:21 node900 ceph-mon[343697]: 2022-03-14T00:42:21.104-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:26 node900 ceph-mon[343697]: 2022-03-14T00:42:26.104-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:31 node900 ceph-mon[343697]: 2022-03-14T00:42:31.105-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:36 node900 ceph-mon[343697]: 2022-03-14T00:42:36.105-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:41 node900 ceph-mon[343697]: 2022-03-14T00:42:41.109-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:46 node900 ceph-mon[343697]: 2022-03-14T00:42:46.109-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:51 node900 ceph-mon[343697]: 2022-03-14T00:42:51.109-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
Mar 14 00:42:56 node900 ceph-mon[343697]: 2022-03-14T00:42:56.109-0500 7fa89da40700 -1 mon.node900@2(probing) e14 get_health_metrics reporting 3 slow ops, oldest is auth(proto 0 32 bytes epoch 0)
root@node900:/etc/ceph#
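To answer the asok question below: since ceph -s hangs cluster-wide, I am going to query this stuck (probing) mon directly over its admin socket instead, assuming the default socket path under /var/run/ceph:

ceph daemon mon.node900 mon_status
# or, naming the socket explicitly:
ceph --admin-daemon /var/run/ceph/ceph-mon.node900.asok mon_status
ceph --admin-daemon /var/run/ceph/ceph-mon.node900.asok quorum_status

That should at least show which monmap this mon has and which peers it is trying to probe.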
----------

The cluster logs now show some other errors as well, including critical space issues:

2022-02-21T11:23:40.212664-0600 mon.node2 (mon.0) 2000851 : cluster [INF] mon.node2 is new leader, mons node2,stack1,node7 in quorum (ranks 0,1,3)
2022-02-21T11:23:40.219565-0600 mon.node2 (mon.0) 2000852 : cluster [DBG] monmap e14: 4 mons at {node2=[v2:10.0.1.2:3300/0,v1:10.0.1.2:6789/0],node7=[v2:10.0.1.7:3300/0,v1:10.0.1.7:6789/0],node900=[v2:10.0.90.0:3300/0,v1:10.0.90.0:6789/0],stack1=[v2:10.0.1.1:3300/0,v1:10.0.1.1:6789/0]}
2022-02-21T11:23:40.219633-0600 mon.node2 (mon.0) 2000853 : cluster [DBG] fsmap cephfs:1 {0=node2=up:active} 2 up:standby
2022-02-21T11:23:40.219653-0600 mon.node2 (mon.0) 2000854 : cluster [DBG] osdmap e967678: 14 total, 4 up, 10 in
2022-02-21T11:23:40.220140-0600 mon.node2 (mon.0) 2000855 : cluster [DBG] mgrmap e649: stack1(active, since 5d), standbys: node2, node7
2022-02-21T11:23:40.228388-0600 mon.node2 (mon.0) 2000856 : cluster [ERR] Health detail: HEALTH_ERR 1 MDSs report slow metadata IOs; mon node7 is very low on available space; mon stack1 is low on available space; 1/4 mons down, quorum node2,stack1,node7; 6 osds down; 1 host (7 osds) down; Reduced data availability: 169 pgs inactive, 45 pgs down, 124 pgs peering, 388 pgs stale; 138 slow ops, oldest one blocked for 61680 sec, osd.0 has slow ops
2022-02-21T11:23:40.228404-0600 mon.node2 (mon.0) 2000857 : cluster [ERR] [WRN] MDS_SLOW_METADATA_IO: 1 MDSs report slow metadata IOs
2022-02-21T11:23:40.228409-0600 mon.node2 (mon.0) 2000858 : cluster [ERR] mds.node2(mds.0): 7 slow metadata IOs are blocked > 30 secs, oldest blocked for 78670 secs
2022-02-21T11:23:40.228413-0600 mon.node2 (mon.0) 2000859 : cluster [ERR] [ERR] MON_DISK_CRIT: mon node7 is very low on available space
2022-02-21T11:23:40.228416-0600 mon.node2 (mon.0) 2000860 : cluster [ERR] mon.node7 has 1% avail
2022-02-21T11:23:40.228422-0600 mon.node2 (mon.0) 2000861 : cluster [ERR] [WRN] MON_DISK_LOW: mon stack1 is low on available space
2022-02-21T11:23:40.228428-0600 mon.node2 (mon.0) 2000862 : cluster [ERR] mon.stack1 has 8% avail
2022-02-21T11:23:40.228432-0600 mon.node2 (mon.0) 2000863 : cluster [ERR] [WRN] MON_DOWN: 1/4 mons down, quorum node2,stack1,node7
2022-02-21T11:23:40.228437-0600 mon.node2 (mon.0) 2000864 : cluster [ERR] mon.node900 (rank 2) addr [v2:10.0.90.0:3300/0,v1:10.0.90.0:6789/0] is down (out of quorum)
2022-02-21T11:23:40.228443-0600 mon.node2 (mon.0) 2000865 : cluster [ERR] [WRN] OSD_DOWN: 6 osds down
2022-02-21T11:23:40.228449-0600 mon.node2 (mon.0) 2000866 : cluster [ERR] osd.8 (root=default,host=node900) is down
2022-02-21T11:23:40.228454-0600 mon.node2 (mon.0) 2000867 : cluster [ERR] osd.9 (root=default,host=node900) is down
2022-02-21T11:23:40.228460-0600 mon.node2 (mon.0) 2000868 : cluster [ERR] osd.10 (root=default,host=node900) is down
2022-02-21T11:23:40.228466-0600 mon.node2 (mon.0) 2000869 : cluster [ERR] osd.11 (root=default,host=node900) is down
2022-02-21T11:23:40.228471-0600 mon.node2 (mon.0) 2000870 : cluster [ERR] osd.12 (root=default,host=node900) is down
2022-02-21T11:23:40.228477-0600 mon.node2 (mon.0) 2000871 : cluster [ERR] osd.13 (root=default,host=node900) is down
2022-02-21T11:23:40.228483-0600 mon.node2 (mon.0) 2000872 : cluster [ERR] [WRN] OSD_HOST_DOWN: 1 host (7 osds) down
2022-02-21T11:23:40.228488-0600 mon.node2 (mon.0) 2000873 : cluster [ERR] host node900 (root=default) (7 osds) is down
2022-02-21T11:23:40.228527-0600 mon.node2 (mon.0) 2000874 : cluster [ERR] [WRN] PG_AVAILABILITY: Reduced data availability: 169 pgs inactive, 45 pgs down, 124 pgs peering, 388 pgs stale
2022-02-21T11:23:40.228534-0600 mon.node2 (mon.0) 2000875 : cluster [ERR] pg 7.cd is stuck inactive for 21h, current state stale+down, last acting [0]
2022-02-21T11:23:40.228539-0600 mon.node2 (mon.0) 2000876 : cluster [ERR] pg 7.ce is stuck peering for 21h, current state peering, last acting [0,7]
2022-02-21T11:23:40.228544-0600 mon.node2 (mon.0) 2000877 : cluster [ERR] pg 7.cf is stuck stale for 21h, current state stale+active+clean, last acting [6,3,8]
2022-02-21T11:23:40.228550-0600 mon.node2 (mon.0) 2000878 : cluster [ERR] pg 7.d0 is stuck stale for 21h, current state stale+active+clean, last acting [12,2,6]
2022-02-21T11:23:40.228555-0600 mon.node2 (mon.0) 2000879 : cluster [ERR] pg 7.d1 is stuck stale for 21h, current state stale+active+clean, last acting [9,1,2]
2022-02-21T11:23:40.228561-0600 mon.node2 (mon.0) 2000880 : cluster [ERR] pg 7.d2 is stuck stale for 21h, current state stale+active+clean, last acting [3,9,2]
2022-02-21T11:23:40.228567-0600 mon.node2 (mon.0) 2000881 : cluster [ERR] pg 7.d3 is stuck peering for 21h, current state peering, last acting [0,6]
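Given the MON_DISK_CRIT / MON_DISK_LOW lines above (mon.node7 at 1% avail, mon.stack1 at 8%), I want to rule out full disks first, since the mons will not stay healthy without free space for their store. A quick check I plan to run on each mon host, assuming the default data and log paths:

df -h /var/lib/ceph/mon /var/log/ceph
du -sh /var/lib/ceph/mon/ceph-*/store.db
du -sh /var/log/ceph

If the mon store.db or the logs have ballooned, I will free up space there before restarting anything else.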
On Sun, Mar 13, 2022 at 1:56 PM Alvaro Soto <alsotoes@xxxxxxxxx> wrote:

> Is your network up/up on all nodes? What about your mgr daemon?
>
> Try to recover the daemons first. Also, can you run a status using the
> asok file?
>
> On Sun, Mar 13, 2022, 11:39 AM GoZippy <gotadvantage@xxxxxxxxx> wrote:
>
>> Using a 9-node cluster on Proxmox.
>> Ceph was updated automatically with a system update.
>> When I rebooted the nodes I did not set noout and just rebooted them -
>> that might be the root cause of all the rebalancing and lost
>> connectivity...
>>
>> I see monitors active (running) with systemctl status
>> ceph-mon@<nodename>.service
>>
>> I see my HDD physical disks still assigned as OSDs 1-14 or whatever I
>> have.
>>
>> ceph -s hangs.
>> Ceph commands pretty much all hang. Using systemctl I can turn the
>> monitors off and stop all services - I think...
>>
>> sudo systemctl stop ceph\*.service ceph\*.target
>>
>> seems to have stopped the services on all nodes... restarting does not
>> help.
>>
>> /var/log/ceph is full of logs as expected (not sure where to start with
>> them to look at issues).
>>
>> Pools and MDS do not show up anymore - maybe they got purged or deleted...
>>
>> Wondering if there is a way to undelete maps and pool data from the Debian
>> host, or use Ceph tools to create a filesystem with the same name, rebuild
>> the map, and not lose the data on the OSD stores...
>>
>> See the following post where I was posting screenshots and logs and trying
>> to sort it out but got nowhere... I came here to the list as a last resort
>> before I nuke it all and start over.. I would like to save some VMs I have
>> on those OSDs.
>>
>> https://forum.proxmox.com/threads/ceph-not-working-monitors-and-managers-lost.100672/page-2
>>
>> I read on
>> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/message/K5X6DPSUHMPZ3P7ADV64B4YLPQPWQS5J/
>> that it might be possible to just overwrite the name and let the system
>> "heal", but that seems risky and I am not exactly sure what command to use
>> to create a filesystem with the same name and set up the same metadata
>> info too... where can I look - in the logs, in Proxmox, or otherwise - to
>> make sure I use the correct names if this is an option? Or is there any
>> better documented way to recover a lost filesystem and restore quorum?
>>
>> Home-brew and learning as I go.. so forgive the lack of expertise here - I
>> am learning.
>>
>> Respectfully,
>>
>> William Henderson
>> 316-518-9350

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx