Hi Patrick, Xiubo and List,
Finally we managed to get the filesystem repaired and running again! YEAH, I'm so happy!!
Big thanks for your support, Patrick and Xiubo! (Would love to invite you for a beer!)
Please see some comments and (important?) questions below:
On 6/25/24 03:14, Patrick Donnelly wrote:
On Mon, Jun 24, 2024 at 5:22 PM Dietmar Rieder
<dietmar.rieder@xxxxxxxxxxx> wrote:
(Resending this; the original message seems not to have made it through amid all the SPAM recently sent to the list. My apologies if it shows up twice at some point.)
Hi List,
We are still struggling to get our CephFS back online. This is an update on what we have done so far, and we kindly ask for any input to get an idea on how to proceed:
After resetting the journals, Xiubo suggested (in a PM) going on with the disaster recovery procedure:
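For context, and only as a rough reference for others reading along: by "resetting the journals" I mean the steps from the documented disaster-recovery procedure, roughly along these lines, assuming our fs name "cephfs" and rank 0; not necessarily the exact commands we ran:
# back up the journal first
cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
# salvage what can be salvaged from the journal, then reset it
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
cephfs-journal-tool --rank=cephfs:0 journal reset
# reset the session table
cephfs-table-tool cephfs:0 reset session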
cephfs-data-scan init skipped creating the inodes 0x0x1 and 0x0x100
[root@ceph01-b ~]# cephfs-data-scan init
Inode 0x0x1 already exists, skipping create. Use --force-init to overwrite the existing object.
Inode 0x0x100 already exists, skipping create. Use --force-init to overwrite the existing object.
We did not use --force-init and proceeded with scan_extents using a single worker, which was indeed very slow.
After ~24h we interrupted scan_extents and restarted it with 32 workers, which went through in about 2h15min without any issue.
Then I started scan_inodes with 32 workers; this also finished after ~50min with no output on stderr or stdout.
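In case it helps others, the 32-worker runs looked roughly like this; my understanding is that --worker_n is the index of a worker and --worker_m the total number of workers, and the pool arguments are our data pools (the exact invocation may differ slightly):
# scan_extents across 32 workers (data pools as in our cluster)
for i in $(seq 0 31); do
  cephfs-data-scan scan_extents --worker_n $i --worker_m 32 sdd-rep-data-pool hdd-ec-data-pool &
done
wait
# scan_inodes was parallelized the same way against the primary data pool
for i in $(seq 0 31); do
  cephfs-data-scan scan_inodes --worker_n $i --worker_m 32 sdd-rep-data-pool &
done
wait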
I went on with scan_links, which after ~45 minutes threw the following error:
# cephfs-data-scan scan_links
Error ((2) No such file or directory)
Not sure what this indicates necessarily. You can try to get more
debug information using:
[client]
debug mds = 20
debug ms = 1
debug client = 20
in the local ceph.conf for the node running cephfs-data-scan.
I did that and restarted "cephfs-data-scan scan_links".
It didn't produce any additional debug output; however, this time it just went through without error (~50 min).
We then reran "cephfs-data-scan cleanup" and it also finished without error after about 10h.
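For completeness, the cleanup step is just the documented call pointed at a data pool; roughly, and from memory rather than the exact invocation:
cephfs-data-scan cleanup sdd-rep-data-pool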
We then marked the fs rank as repaired and all seems to work fine again:
[root@ceph01-b ~]# ceph mds repaired 0
repaired: restoring rank 1:0
[root@ceph01-b ~]# ceph -s
cluster:
id: aae23c5c-a98b-11ee-b44d-00620b05cac4
health: HEALTH_OK
services:
mon: 3 daemons, quorum cephmon-01,cephmon-03,cephmon-02 (age 6d)
mgr: cephmon-01.dsxcho(active, since 6d), standbys: cephmon-02.nssigg, cephmon-03.rgefle
mds: 1/1 daemons up, 5 standby
osd: 336 osds: 336 up (since 2M), 336 in (since 4M)
data:
volumes: 1/1 healthy
pools: 4 pools, 6401 pgs
objects: 284.68M objects, 623 TiB
usage: 890 TiB used, 3.1 PiB / 3.9 PiB avail
pgs: 6206 active+clean
140 active+clean+scrubbing
55 active+clean+scrubbing+deep
io:
client: 3.9 MiB/s rd, 84 B/s wr, 482 op/s rd, 1.11k op/s wr
[root@ceph01-b ~]# ceph fs status
cephfs - 0 clients
======
RANK STATE MDS ACTIVITY DNS INOS DIRS CAPS
0 active default.cephmon-03.xcujhz Reqs: 0 /s 124k 60.3k 1993 0
POOL TYPE USED AVAIL
ssd-rep-metadata-pool metadata 298G 63.4T
sdd-rep-data-pool data 10.2T 84.5T
hdd-ec-data-pool data 808T 1929T
STANDBY MDS
default.cephmon-01.cepqjp
default.cephmon-01.pvnqad
default.cephmon-02.duujba
default.cephmon-02.nyfook
default.cephmon-03.chjusj
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
The MDS log, however, shows some "bad backtrace on directory inode" messages:
2024-06-25T18:45:36.575+0000 7f8594659700 1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8082 from mon.1
2024-06-25T18:45:36.575+0000 7f8594659700 1 mds.0.8082 handle_mds_map i am now mds.0.8082
2024-06-25T18:45:36.575+0000 7f8594659700 1 mds.0.8082 handle_mds_map state change up:standby --> up:replay
2024-06-25T18:45:36.575+0000 7f8594659700 1 mds.0.8082 replay_start
2024-06-25T18:45:36.575+0000 7f8594659700 1 mds.0.8082 waiting for osdmap 34331 (which blocklists prior instance)
2024-06-25T18:45:36.581+0000 7f858de4c700 0 mds.0.cache creating system inode with ino:0x100
2024-06-25T18:45:36.581+0000 7f858de4c700 0 mds.0.cache creating system inode with ino:0x1
2024-06-25T18:45:36.589+0000 7f858ce4a700 1 mds.0.journal EResetJournal
2024-06-25T18:45:36.589+0000 7f858ce4a700 1 mds.0.sessionmap wipe start
2024-06-25T18:45:36.589+0000 7f858ce4a700 1 mds.0.sessionmap wipe result
2024-06-25T18:45:36.589+0000 7f858ce4a700 1 mds.0.sessionmap wipe done
2024-06-25T18:45:36.589+0000 7f858ce4a700 1 mds.0.8082 Finished replaying journal
2024-06-25T18:45:36.589+0000 7f858ce4a700 1 mds.0.8082 making mds journal writeable
2024-06-25T18:45:37.578+0000 7f8594659700 1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8083 from mon.1
2024-06-25T18:45:37.578+0000 7f8594659700 1 mds.0.8082 handle_mds_map i am now mds.0.8082
2024-06-25T18:45:37.578+0000 7f8594659700 1 mds.0.8082 handle_mds_map state change up:replay --> up:reconnect
2024-06-25T18:45:37.578+0000 7f8594659700 1 mds.0.8082 reconnect_start
2024-06-25T18:45:37.578+0000 7f8594659700 1 mds.0.8082 reopen_log
2024-06-25T18:45:37.578+0000 7f8594659700 1 mds.0.8082 reconnect_done
2024-06-25T18:45:38.579+0000 7f8594659700 1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8084 from mon.1
2024-06-25T18:45:38.579+0000 7f8594659700 1 mds.0.8082 handle_mds_map i am now mds.0.8082
2024-06-25T18:45:38.579+0000 7f8594659700 1 mds.0.8082 handle_mds_map state change up:reconnect --> up:rejoin
2024-06-25T18:45:38.579+0000 7f8594659700 1 mds.0.8082 rejoin_start
2024-06-25T18:45:38.583+0000 7f8594659700 1 mds.0.8082 rejoin_joint_start
2024-06-25T18:45:38.592+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e42340
2024-06-25T18:45:38.680+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d8b
2024-06-25T18:45:38.754+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d90
2024-06-25T18:45:38.754+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d9f
2024-06-25T18:45:38.785+0000 7f858fe50700 1 mds.0.8082 rejoin_done
2024-06-25T18:45:39.582+0000 7f8594659700 1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8085 from mon.1
2024-06-25T18:45:39.582+0000 7f8594659700 1 mds.0.8082 handle_mds_map i am now mds.0.8082
2024-06-25T18:45:39.582+0000 7f8594659700 1 mds.0.8082 handle_mds_map state change up:rejoin --> up:active
2024-06-25T18:45:39.582+0000 7f8594659700 1 mds.0.8082 recovery_done -- successful recovery!
2024-06-25T18:45:39.584+0000 7f8594659700 1 mds.0.8082 active_start
2024-06-25T18:45:39.585+0000 7f8594659700 1 mds.0.8082 cluster recovered.
2024-06-25T18:45:42.409+0000 7f8591e54700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
2024-06-25T18:57:28.213+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x4
Is there anything we can do about this, to get rid of the "bad backtrace on directory inode" errors?
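Would a forward scrub with repair be the right way to fix these up? If so, I assume it would be something along these lines (please correct me if this is not the recommended approach):
ceph tell mds.cephfs:0 scrub start / recursive,repair
# and then watch it with
ceph tell mds.cephfs:0 scrub status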
Some more questions:
1.
As Xiubo suggested, we now tried to mount the filesystem with the "nowsync" option <https://tracker.ceph.com/issues/61009#note-26>:
[root@ceph01-b ~]# mount -t ceph cephfs_user@.cephfs=/ /mnt/cephfs -o secretfile=/etc/ceph/ceph.client.cephfs_user.secret,nowsync
However, the option does not seem to show up in /proc/mounts:
[root@ceph01-b ~]# grep ceph /proc/mounts
cephfs_user@aae23c5c-a98b-11ee-b44d-00620b05cac4.cephfs=/ /mnt/cephfs ceph rw,relatime,name=cephfs_user,secret=<hidden>,ms_mode=prefer-crc,acl,mon_addr=10.1.3.21:3300/10.1.3.22:3300/10.1.3.23:3300 0 0
The kernel version is 5.14.0 (from Rocky 9.3)
[root@ceph01-b ~]# uname -a
Linux ceph01-b 5.14.0-362.24.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 13 17:33:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
Is this expected? How can we make sure that the filesystem uses 'nowsync', so that we do not hit the bug <https://tracker.ceph.com/issues/61009> again?
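One idea I had as a cross-check, purely an assumption on my part: mount once with the opposite "wsync" option and compare the /proc/mounts output, e.g.:
mount -t ceph cephfs_user@.cephfs=/ /mnt/cephfs -o secretfile=/etc/ceph/ceph.client.cephfs_user.secret,wsync
grep ceph /proc/mounts
If "wsync" then shows up there while "nowsync" never does, I would read that as nowsync simply being the default that is not displayed, but I'd appreciate confirmation.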