Hi Thomas,

Sorry, I misread the MDS state as being stuck in 'up:resolve'. The MDS is
actually stuck in 'up:replay', which means it is taking over a failed rank
and is still recovering its journal and other metadata. I notice that there
are two filesystems, 'cephfs' and 'cephfs_insecure', and that the active
MDS of each filesystem is stuck in 'up:replay'. The MDS logs shared so far
don't provide enough information to infer anything. Could you please enable
the debug logs (example commands in the P.S. at the very end of this mail)
and pass on the MDS logs?

Thanks,
Kotresh H R

On Mon, Jan 16, 2023 at 2:38 PM Thomas Widhalm <thomas.widhalm@xxxxxxxxxx> wrote:

> Hi Kotresh,
>
> Thanks for your reply!
>
> I only have one rank. Here's the output of all MDS I have:
>
> ###################
>
> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph05.pqxmvt status
> 2023-01-16T08:55:26.055+0000 7f3412ffd700  0 client.61249926
> ms_handle_reset on v2:192.168.23.65:6800/2680651694
> 2023-01-16T08:55:26.084+0000 7f3412ffd700  0 client.61299199
> ms_handle_reset on v2:192.168.23.65:6800/2680651694
> {
>     "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>     "whoami": 0,
>     "id": 60984167,
>     "want_state": "up:replay",
>     "state": "up:replay",
>     "fs_name": "cephfs",
>     "replay_status": {
>         "journal_read_pos": 0,
>         "journal_write_pos": 0,
>         "journal_expire_pos": 0,
>         "num_events": 0,
>         "num_segments": 0
>     },
>     "rank_uptime": 150224.982558844,
>     "mdsmap_epoch": 143757,
>     "osdmap_epoch": 12395,
>     "osdmap_epoch_barrier": 0,
>     "uptime": 150225.39968057699
> }
>
> ########################
>
> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph04.cvdhsx status
> 2023-01-16T08:59:05.434+0000 7fdb82ff5700  0 client.61299598
> ms_handle_reset on v2:192.168.23.64:6800/3930607515
> 2023-01-16T08:59:05.466+0000 7fdb82ff5700  0 client.61299604
> ms_handle_reset on v2:192.168.23.64:6800/3930607515
> {
>     "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>     "whoami": 0,
>     "id": 60984134,
>     "want_state": "up:replay",
>     "state": "up:replay",
>     "fs_name": "cephfs_insecure",
>     "replay_status": {
>         "journal_read_pos": 0,
>         "journal_write_pos": 0,
>         "journal_expire_pos": 0,
>         "num_events": 0,
>         "num_segments": 0
>     },
>     "rank_uptime": 150450.96934037199,
>     "mdsmap_epoch": 143815,
>     "osdmap_epoch": 12395,
>     "osdmap_epoch_barrier": 0,
>     "uptime": 150451.93533502301
> }
>
> ###########################
>
> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph06.wcfdom status
> 2023-01-16T08:59:28.572+0000 7f16538c0b80 -1 client.61250376
> resolve_mds: no MDS daemons found by name `mds01.ceph06.wcfdom'
> 2023-01-16T08:59:28.583+0000 7f16538c0b80 -1 client.61250376 FSMap:
> cephfs:1/1 cephfs_insecure:1/1
>
> {cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay}
> 2 up:standby
> Error ENOENT: problem getting command descriptions from
> mds.mds01.ceph06.wcfdom
>
> ############################
>
> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph07.omdisd status
> 2023-01-16T09:00:02.802+0000 7fb7affff700  0 client.61250454
> ms_handle_reset on v2:192.168.23.67:6800/942898192
> 2023-01-16T09:00:02.831+0000 7fb7affff700  0 client.61299751
> ms_handle_reset on v2:192.168.23.67:6800/942898192
> {
>     "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>     "whoami": -1,
>     "id": 60984161,
>     "want_state": "up:standby",
>     "state": "up:standby",
>     "mdsmap_epoch": 97687,
>     "osdmap_epoch": 0,
>     "osdmap_epoch_barrier": 0,
>     "uptime": 150508.29091721401
> }
>
> The error message from ceph06 is new to me. That didn't happen the last
> times.
>
> [ceph: root@ceph06 /]# ceph fs dump
> e143850
> enable_multiple, ever_enabled_multiple: 1,1
> default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> anchor table,9=file layout v2,10=snaprealm v2}
> legacy client fscid: 2
>
> Filesystem 'cephfs' (2)
> fs_name cephfs
> epoch 143850
> flags 12 joinable allow_snaps allow_multimds_snaps
> created 2023-01-14T14:30:05.723421+0000
> modified 2023-01-16T09:00:53.663007+0000
> tableserver 0
> root 0
> session_timeout 60
> session_autoclose 300
> max_file_size 1099511627776
> required_client_features {}
> last_failure 0
> last_failure_osd_epoch 12321
> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds
> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 1
> in 0
> up {0=60984167}
> failed
> damaged
> stopped
> data_pools [4]
> metadata_pool 5
> inline_data disabled
> balancer
> standby_count_wanted 1
> [mds.mds01.ceph05.pqxmvt{0:60984167} state up:replay seq 37637 addr
> [v2:192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694]
> compat {c=[1],r=[1],i=[7ff]}]
>
>
> Filesystem 'cephfs_insecure' (3)
> fs_name cephfs_insecure
> epoch 143849
> flags 12 joinable allow_snaps allow_multimds_snaps
> created 2023-01-14T14:22:46.360062+0000
> modified 2023-01-16T09:00:52.632163+0000
> tableserver 0
> root 0
> session_timeout 60
> session_autoclose 300
> max_file_size 1099511627776
> required_client_features {}
> last_failure 0
> last_failure_osd_epoch 12319
> compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds
> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> max_mds 1
> in 0
> up {0=60984134}
> failed
> damaged
> stopped
> data_pools [7]
> metadata_pool 6
> inline_data disabled
> balancer
> standby_count_wanted 1
> [mds.mds01.ceph04.cvdhsx{0:60984134} state up:replay seq 37639 addr
> [v2:192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515]
> compat {c=[1],r=[1],i=[7ff]}]
>
>
> Standby daemons:
>
> [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr
> [v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192] compat
> {c=[1],r=[1],i=[7ff]}]
> [mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq 1 addr
> [v2:192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518]
> compat {c=[1],r=[1],i=[7ff]}]
> dumped fsmap epoch 143850
>
> #############################
>
> [ceph: root@ceph06 /]# ceph fs status
>
> (doesn't come back)
>
> #############################
>
> All MDS show log lines similar to this one:
>
> Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating
> MDS map to version 143927 from mon.1
> Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating
> MDS map to version 143929 from mon.1
> Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating
> MDS map to version 143930 from mon.1
> Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating
> MDS map to version 143931 from mon.1
> Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating
> MDS map to version 143933 from mon.1
> Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating
> MDS map to version 143935 from mon.1
> Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating
> MDS map to version 143936 from mon.1
> Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating
> MDS map to version 143937 from mon.1
> Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating
> MDS map to version 143939 from mon.1
> Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating
> MDS map to version 143941 from mon.1
> Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx Updating
> MDS map to version 143942 from mon.1
>
> Anything else, I can provide?
>
> Cheers and thanks again!
> Thomas
>
> On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote:
> > Hi Thomas,
> >
> > As the documentation says, the MDS enters up:resolve from |up:replay| if
> > the Ceph file system has multiple ranks (including this one), i.e. it’s
> > not a single active MDS cluster.
> > The MDS is resolving any uncommitted inter-MDS operations. All ranks in
> > the file system must be in this state or later for progress to be made,
> > i.e. no rank can be failed/damaged or |up:replay|.
> >
> > So please check the status of the other active mds if it's failed.
> >
> > Also please share the mds logs and the output of 'ceph fs dump' and
> > 'ceph fs status'
> >
> > Thanks,
> > Kotresh H R
> >
> > On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm
> > <thomas.widhalm@xxxxxxxxxx> wrote:
> >
> >     Hi,
> >
> >     I'm really lost with my Ceph system. I built a small cluster for home
> >     usage which has two uses for me: I want to replace an old NAS and I want
> >     to learn about Ceph so that I have hands-on experience. We're using it
> >     in our company but I need some real-life experience without risking any
> >     company or customers data. That's my preferred way of learning.
> >
> >     The cluster consists of 3 Raspberry Pis plus a few VMs running on
> >     Proxmox. I'm not using Proxmox' built in Ceph because I want to focus on
> >     Ceph and not just use it as a preconfigured tool.
> >
> >     All hosts are running Fedora (x86_64 and arm64) and during an Upgrade
> >     from F36 to F37 my cluster suddenly showed all PGs as unavailable. I
> >     worked nearly a week to get it back online and I learned a lot about
> >     Ceph management and recovery. The cluster is back but I still can't
> >     access my data. Maybe you can help me?
> >
> >     Here are my versions:
> >
> >     [ceph: root@ceph04 /]# ceph versions
> >     {
> >         "mon": {
> >             "ceph version 17.2.5
> >     (98318ae89f1a893a6ded3a640405cdbb33e08757)
> >     quincy (stable)": 3
> >         },
> >         "mgr": {
> >             "ceph version 17.2.5
> >     (98318ae89f1a893a6ded3a640405cdbb33e08757)
> >     quincy (stable)": 3
> >         },
> >         "osd": {
> >             "ceph version 17.2.5
> >     (98318ae89f1a893a6ded3a640405cdbb33e08757)
> >     quincy (stable)": 5
> >         },
> >         "mds": {
> >             "ceph version 17.2.5
> >     (98318ae89f1a893a6ded3a640405cdbb33e08757)
> >     quincy (stable)": 4
> >         },
> >         "overall": {
> >             "ceph version 17.2.5
> >     (98318ae89f1a893a6ded3a640405cdbb33e08757)
> >     quincy (stable)": 15
> >         }
> >     }
> >
> >
> >     Here's MDS status output of one MDS:
> >     [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status
> >     2023-01-14T15:30:28.607+0000 7fb9e17fa700  0 client.60986454
> >     ms_handle_reset on v2:192.168.23.65:6800/2680651694
> >     2023-01-14T15:30:28.640+0000 7fb9e17fa700  0 client.60986460
> >     ms_handle_reset on v2:192.168.23.65:6800/2680651694
> >     {
> >         "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
> >         "whoami": 0,
> >         "id": 60984167,
> >         "want_state": "up:replay",
> >         "state": "up:replay",
> >         "fs_name": "cephfs",
> >         "replay_status": {
> >             "journal_read_pos": 0,
> >             "journal_write_pos": 0,
> >             "journal_expire_pos": 0,
> >             "num_events": 0,
> >             "num_segments": 0
> >         },
> >         "rank_uptime": 1127.54018615,
> >         "mdsmap_epoch": 98056,
> >         "osdmap_epoch": 12362,
> >         "osdmap_epoch_barrier": 0,
> >         "uptime": 1127.957307273
> >     }
> >
> >     It's staying like that for days now. If there was a counter moving, I
> >     just would wait but it doesn't change anything and alle stats says, the
> >     MDS aren't working at all.
> >
> >     The symptom I have is that Dashboard and all other tools I use say, it's
> >     more or less ok. (Some old messages about failed daemons and scrubbing
> >     aside). But I can't mount anything. When I try to start a VM that's on
> >     RDS I just get a timeout. And when I try to mount a CephFS, mount just
> >     hangs forever.
> >
> >     Whatever command I give MDS or journal, it just hangs. The only thing I
> >     could do, was take all CephFS offline, kill the MDS's and do a "ceph fs
> >     reset <fs name> --yes-i-really-mean-it". After that I rebooted all
> >     nodes, just to be sure but I still have no access to data.
> >
> >     Could you please help me? I'm kinda desperate. If you need any more
> >     information, just let me know.
> >
> >     Cheers,
> >     Thomas
> >
> >     --
> >     Thomas Widhalm
> >     Lead Systems Engineer
> >
> >     NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 |
> >     D-90429 Nuernberg
> >     Tel: +49 911 92885-0 | Fax: +49 911 92885-77
> >     CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
> >     https://www.netways.de |
> >     thomas.widhalm@xxxxxxxxxx
> >
> >     ** stackconf 2023 - September - https://stackconf.eu **
> >     ** OSMC 2023 - November - https://osmc.de **
> >     ** New at NWS: Managed Database -
> >     https://nws.netways.de/managed-database **
> >     ** NETWAYS Web Services - https://nws.netways.de **
> >     _______________________________________________
> >     ceph-users mailing list -- ceph-users@xxxxxxx
> >     To unsubscribe send an email to ceph-users-leave@xxxxxxx
> >
>
> --
> Thomas Widhalm
> Lead Systems Engineer
>
> NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429
> Nuernberg
> Tel: +49 911 92885-0 | Fax: +49 911 92885-77
> CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
> https://www.netways.de | thomas.widhalm@xxxxxxxxxx
>
> ** stackconf 2023 - September - https://stackconf.eu **
> ** OSMC 2023 - November - https://osmc.de **
> ** New at NWS: Managed Database - https://nws.netways.de/managed-database **
> ** NETWAYS Web Services - https://nws.netways.de **
>

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
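P.S. Regarding enabling the debug logs: below is a minimal sketch of the config
commands I have in mind, run from the cephadm shell you are already using. The
daemon names are the ones from your status output; 20 and 1 are just commonly
used debug levels for troubleshooting, so adjust them as you see fit.

    # Raise MDS and messenger debug verbosity for all MDS daemons
    ceph config set mds debug_mds 20
    ceph config set mds debug_ms 1

    # Alternatively, only for the two daemons stuck in up:replay
    ceph tell mds.mds01.ceph05.pqxmvt config set debug_mds 20
    ceph tell mds.mds01.ceph04.cvdhsx config set debug_mds 20

    # Revert once the logs have been captured
    ceph config rm mds debug_mds
    ceph config rm mds debug_ms

With cephadm the MDS logs usually end up in journald on the MDS hosts, or under
/var/log/ceph/<fsid>/ if file logging is enabled.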