Hi Patrick, Xiubo and List,
Finally we managed to get the filesystem repaired and running again! YEAH, I'm so happy!!
Big thanks for your support, Patrick and Xiubo! (Would love to invite you for a beer!)
Please see some comments and (important?) questions below:
On 6/25/24 03:14, Patrick Donnelly wrote:
On Mon, Jun 24, 2024 at 5:22 PM Dietmar Rieder
<dietmar.rieder@xxxxxxxxxxx> wrote:
(Resending this; the original message seems not to have made it through amid all the SPAM recently sent to the list. My apologies if it shows up twice at some point.)
Hi List,
We are still struggling to get our CephFS back online. This is an update on what we have done so far, and we kindly ask for any input to get an idea on how to proceed:
After resetting the journals, Xiubo suggested (in a PM) going on with the disaster recovery procedure:
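For context, and only as a rough reference for others reading along: by "resetting the journals" I mean the steps from the documented disaster-recovery procedure, roughly along these lines, assuming our fs name "cephfs" and rank 0; not necessarily the exact commands we ran:
# back up the journal first
cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
# salvage what can be salvaged from the journal, then reset it
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
cephfs-journal-tool --rank=cephfs:0 journal reset
# reset the session table
cephfs-table-tool cephfs:0 reset session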
cephfs-data-scan init skipped creating the inodes 0x0x1 and 0x0x100
[root@ceph01-b ~]# cephfs-data-scan init
Inode 0x0x1 already exists, skipping create. Use --force-init to overwrite the existing object.
Inode 0x0x100 already exists, skipping create. Use --force-init to overwrite the existing object.
We did not use --force-init and proceeded with scan_extents using a single worker, which was indeed very slow.
After ~24h we interrupted scan_extents and restarted it with 32 workers, which went through in about 2h15min without any issue.
Then I started scan_inodes with 32 workers; this also finished after ~50min with no output on stderr or stdout.
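In case it helps others, the 32-worker runs looked roughly like this; my understanding is that --worker_n is the index of a worker and --worker_m the total number of workers, and the pool arguments are our data pools (the exact invocation may differ slightly):
# scan_extents across 32 workers (data pools as in our cluster)
for i in $(seq 0 31); do
  cephfs-data-scan scan_extents --worker_n $i --worker_m 32 sdd-rep-data-pool hdd-ec-data-pool &
done
wait
# scan_inodes was parallelized the same way against the primary data pool
for i in $(seq 0 31); do
  cephfs-data-scan scan_inodes --worker_n $i --worker_m 32 sdd-rep-data-pool &
done
wait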
I went on with scan_links, which after ~45 minutes threw the following error:
# cephfs-data-scan scan_links
Error ((2) No such file or directory)
Not sure what this indicates necessarily. You can try to get more
debug information using:
[client]
debug mds = 20
debug ms = 1
debug client = 20
in the local ceph.conf for the node running cephfs-data-scan.
I did that and restarted "cephfs-data-scan scan_links".
It didn't produce any additional debug output; however, this time it just went through without error (~50 min).
We then reran "cephfs-data-scan cleanup" and it also finished without error after about 10h.
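For completeness, the cleanup step is just the documented call pointed at a data pool; roughly, and from memory rather than the exact invocation:
cephfs-data-scan cleanup sdd-rep-data-pool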
We then marked the fs rank as repaired and all seems to work fine again:
[root@ceph01-b ~]# ceph mds repaired 0
repaired: restoring rank 1:0
[root@ceph01-b ~]# ceph -s
cluster:
id: aae23c5c-a98b-11ee-b44d-00620b05cac4
health: HEALTH_OK
services:
mon: 3 daemons, quorum cephmon-01,cephmon-03,cephmon-02 (age 6d)
mgr: cephmon-01.dsxcho(active, since 6d), standbys: cephmon-02.nssigg, cephmon-03.rgefle
mds: 1/1 daemons up, 5 standby
osd: 336 osds: 336 up (since 2M), 336 in (since 4M)
data:
volumes: 1/1 healthy
pools: 4 pools, 6401 pgs
objects: 284.68M objects, 623 TiB
usage: 890 TiB used, 3.1 PiB / 3.9 PiB avail
pgs: 6206 active+clean
140 active+clean+scrubbing
55 active+clean+scrubbing+deep
io:
client: 3.9 MiB/s rd, 84 B/s wr, 482 op/s rd, 1.11k op/s wr
[root@ceph01-b ~]# ceph fs status
cephfs - 0 clients
======
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active default.cephmon-03.xcujhz Reqs: 0 /s 124k 60.3k 1993 0
POOL TYPE USED AVAIL
ssd-rep-metadata-pool metadata 298G 63.4T
sdd-rep-data-pool data 10.2T 84.5T
hdd-ec-data-pool data 808T 1929T
STANDBY MDS
default.cephmon-01.cepqjp
default.cephmon-01.pvnqad
default.cephmon-02.duujba
default.cephmon-02.nyfook
default.cephmon-03.chjusj
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
The MDS log, however, shows some "bad backtrace on directory inode" messages:
2024-06-25T18:45:36.575+0000 7f8594659700 1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8082 from mon.1
2024-06-25T18:45:36.575+0000 7f8594659700 1 mds.0.8082 handle_mds_map i am now mds.0.8082
2024-06-25T18:45:36.575+0000 7f8594659700 1 mds.0.8082 handle_mds_map state change up:standby --> up:replay
2024-06-25T18:45:36.575+0000 7f8594659700 1 mds.0.8082 replay_start
2024-06-25T18:45:36.575+0000 7f8594659700 1 mds.0.8082 waiting for osdmap 34331 (which blocklists prior instance)
2024-06-25T18:45:36.581+0000 7f858de4c700 0 mds.0.cache creating system inode with ino:0x100
2024-06-25T18:45:36.581+0000 7f858de4c700 0 mds.0.cache creating system inode with ino:0x1
2024-06-25T18:45:36.589+0000 7f858ce4a700 1 mds.0.journal EResetJournal
2024-06-25T18:45:36.589+0000 7f858ce4a700 1 mds.0.sessionmap wipe start
2024-06-25T18:45:36.589+0000 7f858ce4a700 1 mds.0.sessionmap wipe result
2024-06-25T18:45:36.589+0000 7f858ce4a700 1 mds.0.sessionmap wipe done
2024-06-25T18:45:36.589+0000 7f858ce4a700 1 mds.0.8082 Finished replaying journal
2024-06-25T18:45:36.589+0000 7f858ce4a700 1 mds.0.8082 making mds journal writeable
2024-06-25T18:45:37.578+0000 7f8594659700 1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8083 from mon.1
2024-06-25T18:45:37.578+0000 7f8594659700 1 mds.0.8082 handle_mds_map i am now mds.0.8082
2024-06-25T18:45:37.578+0000 7f8594659700 1 mds.0.8082 handle_mds_map state change up:replay --> up:reconnect
2024-06-25T18:45:37.578+0000 7f8594659700 1 mds.0.8082 reconnect_start
2024-06-25T18:45:37.578+0000 7f8594659700 1 mds.0.8082 reopen_log
2024-06-25T18:45:37.578+0000 7f8594659700 1 mds.0.8082 reconnect_done
2024-06-25T18:45:38.579+0000 7f8594659700 1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8084 from mon.1
2024-06-25T18:45:38.579+0000 7f8594659700 1 mds.0.8082 handle_mds_map i am now mds.0.8082
2024-06-25T18:45:38.579+0000 7f8594659700 1 mds.0.8082 handle_mds_map state change up:reconnect --> up:rejoin
2024-06-25T18:45:38.579+0000 7f8594659700 1 mds.0.8082 rejoin_start
2024-06-25T18:45:38.583+0000 7f8594659700 1 mds.0.8082 rejoin_joint_start
2024-06-25T18:45:38.592+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e42340
2024-06-25T18:45:38.680+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d8b
2024-06-25T18:45:38.754+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d90
2024-06-25T18:45:38.754+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d9f
2024-06-25T18:45:38.785+0000 7f858fe50700 1 mds.0.8082 rejoin_done
2024-06-25T18:45:39.582+0000 7f8594659700 1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8085 from mon.1
2024-06-25T18:45:39.582+0000 7f8594659700 1 mds.0.8082 handle_mds_map i am now mds.0.8082
2024-06-25T18:45:39.582+0000 7f8594659700 1 mds.0.8082 handle_mds_map state change up:rejoin --> up:active
2024-06-25T18:45:39.582+0000 7f8594659700 1 mds.0.8082 recovery_done -- successful recovery!
2024-06-25T18:45:39.584+0000 7f8594659700 1 mds.0.8082 active_start
2024-06-25T18:45:39.585+0000 7f8594659700 1 mds.0.8082 cluster recovered.
2024-06-25T18:45:42.409+0000 7f8591e54700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2024-06-25T18:57:28.213+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x4
Is there anything we can do about this, to get rid of the "bad backtrace on directory inode" errors?
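Would a forward scrub with repair be the right way to fix these up? If so, I assume it would be something along these lines (please correct me if this is not the recommended approach):
ceph tell mds.cephfs:0 scrub start / recursive,repair
# and then watch it with
ceph tell mds.cephfs:0 scrub status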
Some more questions:
1.
As Xiubo suggested, we now tried to mount the filesystem with the "nowsync" option <https://tracker.ceph.com/issues/61009#note-26>:
[root@ceph01-b ~]# mount -t ceph cephfs_user@.cephfs=/ /mnt/cephfs -o secretfile=/etc/ceph/ceph.client.cephfs_user.secret,nowsync
However, the option does not seem to show up in /proc/mounts:
[root@ceph01-b ~]# grep ceph /proc/mounts
cephfs_user@aae23c5c-a98b-11ee-b44d-00620b05cac4.cephfs=/ /mnt/cephfs ceph rw,relatime,name=cephfs_user,secret=<hidden>,ms_mode=prefer-crc,acl,mon_addr=10.1.3.21:3300/10.1.3.22:3300/10.1.3.23:3300 0 0
The kernel version is 5.14.0 (from Rocky 9.3)
[root@ceph01-b ~]# uname -a
Linux ceph01-b 5.14.0-362.24.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 13 17:33:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Is this expected? How can we make sure that the filesystem uses 'nowsync', so that we do not hit the bug <https://tracker.ceph.com/issues/61009> again?
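One idea I had as a cross-check, purely an assumption on my part: mount once with the opposite "wsync" option and compare the /proc/mounts output, e.g.:
mount -t ceph cephfs_user@.cephfs=/ /mnt/cephfs -o secretfile=/etc/ceph/ceph.client.cephfs_user.secret,wsync
grep ceph /proc/mounts
If "wsync" then shows up there while "nowsync" never does, I would read that as nowsync simply being the default that is not displayed, but I'd appreciate confirmation.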