Re: MDS stuck in "up:replay"


 



Hi Thomas,

As the documentation says, the MDS enters up:resolve from up:replay if the
Ceph file system has multiple ranks (including this one), i.e. it’s not a
single active MDS cluster.
The MDS is resolving any uncommitted inter-MDS operations. All ranks in the
file system must be in this state or later for progress to be made, i.e. no
rank can be failed/damaged or up:replay.

So please check the status of the other active MDS ranks and see whether
any of them has failed.
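
For example, something like this should show whether any rank is failed or
damaged (your file system seems to be named "cephfs", going by the status
output below; adjust the name if that's not the case):

  ceph fs status cephfs
  ceph mds stat
  ceph health detail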

Also please share the MDS logs and the output of 'ceph fs dump' and 'ceph
fs status'.
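
For the logs, assuming this is a cephadm deployment (the "[ceph: root@ceph04 /]#"
prompt in your mail suggests it is), something like this, run on the host where
the MDS is running, should work. The daemon name below is just the one from your
status command; substitute the names of your own MDS daemons:

  ceph fs dump > fs_dump.txt
  ceph fs status > fs_status.txt
  cephadm logs --name mds.mds01.ceph05.pqxmvt > mds.log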

Thanks,
Kotresh H R

On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm <thomas.widhalm@xxxxxxxxxx>
wrote:

> Hi,
>
> I'm really lost with my Ceph system. I built a small cluster for home
> usage which has two uses for me: I want to replace an old NAS and I want
> to learn about Ceph so that I have hands-on experience. We're using it
> in our company, but I need some real-life experience without risking any
> company or customer data. That's my preferred way of learning.
>
> The cluster consists of 3 Raspberry Pis plus a few VMs running on
> Proxmox. I'm not using Proxmox's built-in Ceph because I want to focus on
> Ceph and not just use it as a preconfigured tool.
>
> All hosts are running Fedora (x86_64 and arm64), and during an upgrade
> from F36 to F37 my cluster suddenly showed all PGs as unavailable. I
> worked nearly a week to get it back online and I learned a lot about
> Ceph management and recovery. The cluster is back but I still can't
> access my data. Maybe you can help me?
>
> Here are my versions:
>
> [ceph: root@ceph04 /]# ceph versions
> {
>      "mon": {
>          "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757)
> quincy (stable)": 3
>      },
>      "mgr": {
>          "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757)
> quincy (stable)": 3
>      },
>      "osd": {
>          "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757)
> quincy (stable)": 5
>      },
>      "mds": {
>          "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757)
> quincy (stable)": 4
>      },
>      "overall": {
>          "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757)
> quincy (stable)": 15
>      }
> }
>
>
> Here's the status output of one of the MDS daemons:
> [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status
> 2023-01-14T15:30:28.607+0000 7fb9e17fa700  0 client.60986454
> ms_handle_reset on v2:192.168.23.65:6800/2680651694
> 2023-01-14T15:30:28.640+0000 7fb9e17fa700  0 client.60986460
> ms_handle_reset on v2:192.168.23.65:6800/2680651694
> {
>      "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
>      "whoami": 0,
>      "id": 60984167,
>      "want_state": "up:replay",
>      "state": "up:replay",
>      "fs_name": "cephfs",
>      "replay_status": {
>          "journal_read_pos": 0,
>          "journal_write_pos": 0,
>          "journal_expire_pos": 0,
>          "num_events": 0,
>          "num_segments": 0
>      },
>      "rank_uptime": 1127.54018615,
>      "mdsmap_epoch": 98056,
>      "osdmap_epoch": 12362,
>      "osdmap_epoch_barrier": 0,
>      "uptime": 1127.957307273
> }
>
> It's been staying like that for days now. If there were a counter moving, I
> would just wait, but nothing changes and all the stats say the
> MDSs aren't working at all.
>
> The symptom I have is that the Dashboard and all other tools I use say it's
> more or less ok (some old messages about failed daemons and scrubbing
> aside). But I can't mount anything. When I try to start a VM that's on
> RBD, I just get a timeout. And when I try to mount a CephFS, mount just
> hangs forever.
>
> Whatever command I give the MDS or the journal, it just hangs. The only thing I
> could do was take all CephFS file systems offline, kill the MDSs and do a "ceph fs
> reset <fs name> --yes-i-really-mean-it". After that I rebooted all
> nodes, just to be sure, but I still have no access to my data.
>
> Could you please help me? I'm kinda desperate. If you need any more
> information, just let me know.
>
> Cheers,
> Thomas
>
> --
> Thomas Widhalm
> Lead Systems Engineer
>
> NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429
> Nuernberg
> Tel: +49 911 92885-0 | Fax: +49 911 92885-77
> CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
> https://www.netways.de | thomas.widhalm@xxxxxxxxxx
>
> ** stackconf 2023 - September - https://stackconf.eu **
> ** OSMC 2023 - November - https://osmc.de **
> ** New at NWS: Managed Database - https://nws.netways.de/managed-database
> **
> ** NETWAYS Web Services - https://nws.netways.de **
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



