Hi,

I'm really lost with my Ceph cluster. I built a small cluster for home use that serves two purposes: it replaces an old NAS, and it gives me hands-on experience with Ceph. We use Ceph in our company, but I need some real-life experience without risking any company or customer data. That's my preferred way of learning.

The cluster consists of 3 Raspberry Pis plus a few VMs running on Proxmox. I'm not using Proxmox's built-in Ceph because I want to focus on Ceph itself and not just use it as a preconfigured tool. All hosts run Fedora (x86_64 and arm64). During an upgrade from F36 to F37, my cluster suddenly showed all PGs as unavailable. I worked nearly a week to get it back online, and I learned a lot about Ceph management and recovery in the process. The cluster is back up, but I still can't access my data. Maybe you can help me?

Here are my versions:

[ceph: root@ceph04 /]# ceph versions
{
    "mon": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
    },
    "osd": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 5
    },
    "mds": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 4
    },
    "overall": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 15
    }
}

Here's the status output of one MDS:

[ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status
2023-01-14T15:30:28.607+0000 7fb9e17fa700  0 client.60986454 ms_handle_reset on v2:192.168.23.65:6800/2680651694
2023-01-14T15:30:28.640+0000 7fb9e17fa700  0 client.60986460 ms_handle_reset on v2:192.168.23.65:6800/2680651694
{
    "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
    "whoami": 0,
    "id": 60984167,
    "want_state": "up:replay",
    "state": "up:replay",
    "fs_name": "cephfs",
    "replay_status": {
        "journal_read_pos": 0,
        "journal_write_pos": 0,
        "journal_expire_pos": 0,
        "num_events": 0,
        "num_segments": 0
    },
    "rank_uptime": 1127.54018615,
    "mdsmap_epoch": 98056,
    "osdmap_epoch": 12362,
    "osdmap_epoch_barrier": 0,
    "uptime": 1127.957307273
}

It has been stuck like that for days now. If a counter were moving, I would just wait, but nothing changes, and all stats say the MDS daemons aren't working at all.

The symptom I have is that the dashboard and all the other tools I use say the cluster is more or less OK (some old messages about failed daemons and scrubbing aside), but I can't mount anything. When I try to start a VM whose disk is on RBD, I just get a timeout. And when I try to mount a CephFS, mount hangs forever. Whatever command I give the MDS or the journal, it just hangs.

The only thing I could do was take all CephFS filesystems offline, kill the MDS daemons, and run "ceph fs reset <fs name> --yes-i-really-mean-it". After that I rebooted all nodes, just to be sure, but I still have no access to my data.

Could you please help me? I'm kind of desperate. If you need any more information, just let me know.

Cheers,
Thomas

--
Thomas Widhalm
Lead Systems Engineer

NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
Tel: +49 911 92885-0 | Fax: +49 911 92885-77
CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
https://www.netways.de | thomas.widhalm@xxxxxxxxxx

** stackconf 2023 - September - https://stackconf.eu **
** OSMC 2023 - November - https://osmc.de **
** New at NWS: Managed Database - https://nws.netways.de/managed-database **
** NETWAYS Web Services - https://nws.netways.de **

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
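
P.S.: For completeness, the offline/reset sequence I ran was roughly the following. This is reconstructed from memory, so the exact commands and the orch service name ("mds.mds01" is from my setup) may be slightly off:

[ceph: root@ceph04 /]# ceph fs fail cephfs                           # take the fs offline, fail all ranks
[ceph: root@ceph04 /]# ceph orch stop mds.mds01                      # stop the MDS daemons (service name from my setup)
[ceph: root@ceph04 /]# ceph fs reset cephfs --yes-i-really-mean-it   # reset the fs map to a single rank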