Hi,

I'm really lost with my Ceph cluster. I built a small cluster for home use that serves two purposes: it replaces an old NAS, and it gives me hands-on experience with Ceph. We use Ceph in our company, but I need some real-life experience without risking any company or customer data. That's my preferred way of learning.

The cluster consists of 3 Raspberry Pis plus a few VMs running on Proxmox. I'm not using Proxmox's built-in Ceph because I want to focus on Ceph itself and not just use it as a preconfigured tool. All hosts run Fedora (x86_64 and arm64). During an upgrade from F36 to F37, my cluster suddenly showed all PGs as unavailable. I worked nearly a week to get it back online, and I learned a lot about Ceph management and recovery in the process. The cluster is back up, but I still can't access my data. Maybe you can help me?

Here are my versions:

[ceph: root@ceph04 /]# ceph versions
{
    "mon": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
    },
    "mgr": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 3
    },
    "osd": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 5
    },
    "mds": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 4
    },
    "overall": {
        "ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy (stable)": 15
    }
}

Here's the status output of one MDS:

[ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status
2023-01-14T15:30:28.607+0000 7fb9e17fa700  0 client.60986454 ms_handle_reset on v2:192.168.23.65:6800/2680651694
2023-01-14T15:30:28.640+0000 7fb9e17fa700  0 client.60986460 ms_handle_reset on v2:192.168.23.65:6800/2680651694
{
    "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
    "whoami": 0,
    "id": 60984167,
    "want_state": "up:replay",
    "state": "up:replay",
    "fs_name": "cephfs",
    "replay_status": {
        "journal_read_pos": 0,
        "journal_write_pos": 0,
        "journal_expire_pos": 0,
        "num_events": 0,
        "num_segments": 0
    },
    "rank_uptime": 1127.54018615,
    "mdsmap_epoch": 98056,
    "osdmap_epoch": 12362,
    "osdmap_epoch_barrier": 0,
    "uptime": 1127.957307273
}

It has been stuck like that for days now. If a counter were moving, I would just wait, but nothing changes, and all stats say the MDS daemons aren't working at all.

The symptom I have is that the dashboard and all the other tools I use say the cluster is more or less OK (some old messages about failed daemons and scrubbing aside), but I can't mount anything. When I try to start a VM whose disk is on RBD, I just get a timeout. And when I try to mount a CephFS, mount hangs forever. Whatever command I give the MDS or the journal, it just hangs.

The only thing I could do was take all CephFS filesystems offline, kill the MDS daemons, and run "ceph fs reset <fs name> --yes-i-really-mean-it". After that I rebooted all nodes, just to be sure, but I still have no access to my data.

Could you please help me? I'm kind of desperate. If you need any more information, just let me know.

Cheers,
Thomas

--
Thomas Widhalm
Lead Systems Engineer

NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429 Nuernberg
Tel: +49 911 92885-0 | Fax: +49 911 92885-77
CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
https://www.netways.de | thomas.widhalm@xxxxxxxxxx

** stackconf 2023 - September - https://stackconf.eu **
** OSMC 2023 - November - https://osmc.de **
** New at NWS: Managed Database - https://nws.netways.de/managed-database **
** NETWAYS Web Services - https://nws.netways.de **

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
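
P.S.: For completeness, the offline/reset sequence I ran was roughly the following. This is reconstructed from memory, so the exact commands and the orch service name ("mds.mds01" is from my setup) may be slightly off:

[ceph: root@ceph04 /]# ceph fs fail cephfs                           # take the fs offline, fail all ranks
[ceph: root@ceph04 /]# ceph orch stop mds.mds01                      # stop the MDS daemons (service name from my setup)
[ceph: root@ceph04 /]# ceph fs reset cephfs --yes-i-really-mean-it   # reset the fs map to a single rank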