Can anybody comment on my questions below? Thanks so much in advance....

On 26 June 2024 08:08:39 MESZ, Dietmar Rieder <dietmar.rieder@xxxxxxxxxxx> wrote:
>...sending this also to the list and Xiubo (they were accidentally removed from the recipients)...
>
>On 6/25/24 21:28, Dietmar Rieder wrote:
>> Hi Patrick, Xiubo and List,
>>
>> finally we managed to get the filesystem repaired and running again! YEAH, I'm so happy!!
>>
>> Big thanks for your support, Patrick and Xiubo! (Would love to invite you for a beer!)
>>
>>
>> Please see some comments and (important?) questions below:
>>
>> On 6/25/24 03:14, Patrick Donnelly wrote:
>>> On Mon, Jun 24, 2024 at 5:22 PM Dietmar Rieder
>>> <dietmar.rieder@xxxxxxxxxxx> wrote:
>>>>
>>>> (resending this; the original message doesn't seem to have made it through among all the SPAM recently sent to the list, my apologies if it shows up twice at some point)
>>>>
>>>> Hi List,
>>>>
>>>> we are still struggling to get our cephfs back online again. This is an update to inform you what we did so far, and we kindly ask for any input on this to get an idea of how to proceed:
>>>>
>>>> After resetting the journals, Xiubo suggested (in a PM) to go on with the disaster recovery procedure:
>>>>
>>>> cephfs-data-scan init skipped creating the inodes 0x0x1 and 0x0x100:
>>>>
>>>> [root@ceph01-b ~]# cephfs-data-scan init
>>>> Inode 0x0x1 already exists, skipping create.  Use --force-init to overwrite the existing object.
>>>> Inode 0x0x100 already exists, skipping create.  Use --force-init to overwrite the existing object.
>>>>
>>>> We did not use --force-init and proceeded with scan_extents using a single worker, which was indeed very slow.
>>>>
>>>> After ~24h we interrupted the scan_extents run and restarted it with 32 workers, which went through in about 2h 15min without any issue.
>>>>
>>>> Then I started scan_inodes with 32 workers; this also finished after ~50 min with no output on stderr or stdout.
>>>>
>>>> I went on with scan_links, which after ~45 minutes threw the following error:
>>>>
>>>> # cephfs-data-scan scan_links
>>>> Error ((2) No such file or directory)
>>>
>>> Not sure what this indicates necessarily. You can try to get more
>>> debug information using:
>>>
>>> [client]
>>> debug mds = 20
>>> debug ms = 1
>>> debug client = 20
>>>
>>> in the local ceph.conf for the node running cephfs-data-scan.
>>
>> I did that and restarted "cephfs-data-scan scan_links".
>>
>> It didn't produce any additional debug output, however this time it just went through without error (~50 min).
>>
>> We then reran "cephfs-data-scan cleanup" and it also finished without error after about 10h.
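>>
>> (For completeness: by "32 workers" I mean 32 parallel cephfs-data-scan processes, roughly along the lines of the upstream disaster-recovery docs; the <data pool> argument below is just a placeholder for our pool names:
>>
>> # worker 0 of 32
>> cephfs-data-scan scan_extents --worker_n 0 --worker_m 32 <data pool>
>> # ...
>> # worker 31 of 32
>> cephfs-data-scan scan_extents --worker_n 31 --worker_m 32 <data pool>
>>
>> and the same pattern for scan_inodes.)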
>>
>> We then set the fs as repaired and all seems to work fine again:
>>
>> [root@ceph01-b ~]# ceph mds repaired 0
>> repaired: restoring rank 1:0
>>
>> [root@ceph01-b ~]# ceph -s
>>   cluster:
>>     id:     aae23c5c-a98b-11ee-b44d-00620b05cac4
>>     health: HEALTH_OK
>>
>>   services:
>>     mon: 3 daemons, quorum cephmon-01,cephmon-03,cephmon-02 (age 6d)
>>     mgr: cephmon-01.dsxcho(active, since 6d), standbys: cephmon-02.nssigg, cephmon-03.rgefle
>>     mds: 1/1 daemons up, 5 standby
>>     osd: 336 osds: 336 up (since 2M), 336 in (since 4M)
>>
>>   data:
>>     volumes: 1/1 healthy
>>     pools:   4 pools, 6401 pgs
>>     objects: 284.68M objects, 623 TiB
>>     usage:   890 TiB used, 3.1 PiB / 3.9 PiB avail
>>     pgs:     6206 active+clean
>>              140  active+clean+scrubbing
>>              55   active+clean+scrubbing+deep
>>
>>   io:
>>     client:   3.9 MiB/s rd, 84 B/s wr, 482 op/s rd, 1.11k op/s wr
>>
>>
>> [root@ceph01-b ~]# ceph fs status
>> cephfs - 0 clients
>> ======
>> RANK  STATE            MDS                ACTIVITY     DNS    INOS   DIRS   CAPS
>>  0    active  default.cephmon-03.xcujhz   Reqs:    0 /s  124k  60.3k  1993      0
>>          POOL            TYPE     USED  AVAIL
>> ssd-rep-metadata-pool  metadata   298G  63.4T
>>   sdd-rep-data-pool      data    10.2T  84.5T
>>   hdd-ec-data-pool       data     808T  1929T
>>        STANDBY MDS
>> default.cephmon-01.cepqjp
>> default.cephmon-01.pvnqad
>> default.cephmon-02.duujba
>> default.cephmon-02.nyfook
>> default.cephmon-03.chjusj
>> MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
>>
>>
>> The mds log however shows some "bad backtrace on directory inode" messages:
>>
>> 2024-06-25T18:45:36.575+0000 7f8594659700  1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8082 from mon.1
>> 2024-06-25T18:45:36.575+0000 7f8594659700  1 mds.0.8082 handle_mds_map i am now mds.0.8082
>> 2024-06-25T18:45:36.575+0000 7f8594659700  1 mds.0.8082 handle_mds_map state change up:standby --> up:replay
>> 2024-06-25T18:45:36.575+0000 7f8594659700  1 mds.0.8082 replay_start
>> 2024-06-25T18:45:36.575+0000 7f8594659700  1 mds.0.8082  waiting for osdmap 34331 (which blocklists prior instance)
>> 2024-06-25T18:45:36.581+0000 7f858de4c700  0 mds.0.cache creating system inode with ino:0x100
>> 2024-06-25T18:45:36.581+0000 7f858de4c700  0 mds.0.cache creating system inode with ino:0x1
>> 2024-06-25T18:45:36.589+0000 7f858ce4a700  1 mds.0.journal EResetJournal
>> 2024-06-25T18:45:36.589+0000 7f858ce4a700  1 mds.0.sessionmap wipe start
>> 2024-06-25T18:45:36.589+0000 7f858ce4a700  1 mds.0.sessionmap wipe result
>> 2024-06-25T18:45:36.589+0000 7f858ce4a700  1 mds.0.sessionmap wipe done
>> 2024-06-25T18:45:36.589+0000 7f858ce4a700  1 mds.0.8082 Finished replaying journal
>> 2024-06-25T18:45:36.589+0000 7f858ce4a700  1 mds.0.8082 making mds journal writeable
>> 2024-06-25T18:45:37.578+0000 7f8594659700  1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8083 from mon.1
>> 2024-06-25T18:45:37.578+0000 7f8594659700  1 mds.0.8082 handle_mds_map i am now mds.0.8082
>> 2024-06-25T18:45:37.578+0000 7f8594659700  1 mds.0.8082 handle_mds_map state change up:replay --> up:reconnect
>> 2024-06-25T18:45:37.578+0000 7f8594659700  1 mds.0.8082 reconnect_start
>> 2024-06-25T18:45:37.578+0000 7f8594659700  1 mds.0.8082 reopen_log
>> 2024-06-25T18:45:37.578+0000 7f8594659700  1 mds.0.8082 reconnect_done
>> 2024-06-25T18:45:38.579+0000 7f8594659700  1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8084 from mon.1
>> 2024-06-25T18:45:38.579+0000 7f8594659700  1 mds.0.8082 handle_mds_map i am now mds.0.8082
>> 2024-06-25T18:45:38.579+0000 7f8594659700  1 mds.0.8082 handle_mds_map state change up:reconnect --> up:rejoin
>> 2024-06-25T18:45:38.579+0000 7f8594659700  1 mds.0.8082 rejoin_start
>> 2024-06-25T18:45:38.583+0000 7f8594659700  1 mds.0.8082 rejoin_joint_start
>> 2024-06-25T18:45:38.592+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e42340
>> 2024-06-25T18:45:38.680+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d8b
>> 2024-06-25T18:45:38.754+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d90
>> 2024-06-25T18:45:38.754+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d9f
>> 2024-06-25T18:45:38.785+0000 7f858fe50700  1 mds.0.8082 rejoin_done
>> 2024-06-25T18:45:39.582+0000 7f8594659700  1 mds.default.cephmon-03.xcujhz Updating MDS map to version 8085 from mon.1
>> 2024-06-25T18:45:39.582+0000 7f8594659700  1 mds.0.8082 handle_mds_map i am now mds.0.8082
>> 2024-06-25T18:45:39.582+0000 7f8594659700  1 mds.0.8082 handle_mds_map state change up:rejoin --> up:active
>> 2024-06-25T18:45:39.582+0000 7f8594659700  1 mds.0.8082 recovery_done -- successful recovery!
>> 2024-06-25T18:45:39.584+0000 7f8594659700  1 mds.0.8082 active_start
>> 2024-06-25T18:45:39.585+0000 7f8594659700  1 mds.0.8082 cluster recovered.
>> 2024-06-25T18:45:42.409+0000 7f8591e54700 -1 mds.pinger is_rank_lagging: rank=0 was never sent ping request.
>> 2024-06-25T18:57:28.213+0000 7f858e64d700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x4
>>
>>
>> Is there anything that we can do about this, to get rid of the "bad backtrace on directory inode" errors?
>>
>>
>> Some more questions:
>>
>> 1.
>> As Xiubo suggested, we now tried to mount the filesystem with the "nowsync" option <https://tracker.ceph.com/issues/61009#note-26>:
>>
>> [root@ceph01-b ~]# mount -t ceph cephfs_user@.cephfs=/ /mnt/cephfs -o secretfile=/etc/ceph/ceph.client.cephfs_user.secret,nowsync
>>
>> however the option does not seem to show up in /proc/mounts:
>>
>> [root@ceph01-b ~]# grep ceph /proc/mounts
>> cephfs_user@aae23c5c-a98b-11ee-b44d-00620b05cac4.cephfs=/ /mnt/cephfs ceph rw,relatime,name=cephfs_user,secret=<hidden>,ms_mode=prefer-crc,acl,mon_addr=10.1.3.21:3300/10.1.3.22:3300/10.1.3.23:3300 0 0
>>
>> The kernel version is 5.14.0 (from Rocky 9.3):
>>
>> [root@ceph01-b ~]# uname -a
>> Linux ceph01-b 5.14.0-362.24.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 13 17:33:16 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
>>
>> Is this expected? How can we make sure that the filesystem uses 'nowsync', so that we do not hit the bug <https://tracker.ceph.com/issues/61009> again?
>>
>
>Oh, I think I misunderstood the suggested workaround. I guess we need to disable "nowsync", which is set by default, right?
>
>So: -o wsync
>
>should be the workaround, right?
>
>> 2.
>> There are two empty files in lost+found now. Is it safe to remove them?
>>
>> [root@ceph01-b lost+found]# ls -la
>> total 0
>> drwxr-xr-x 2 root root 1 Jan  1  1970 .
>> drwxr-xr-x 4 root root 2 Mar 13 21:22 ..
>> -r-x------ 1 root root 0 Jun 20 23:50 100037a50e2
>> -r-x------ 1 root root 0 Jun 20 19:05 200049612e5
>>
>> 3.
>> Are there any specific steps that we should perform now (scrub or similar things) before we put the filesystem into production again?
>>
>
>
>Dietmar
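
P.S.: To make questions 1 and 3 a bit more concrete, this is what I currently have in mind (just a sketch, using "cephfs" as our fs name and the same secretfile as above), please correct me if this is the wrong way to go:

# remount with wsync forced explicitly (i.e. disable the default nowsync):
mount -t ceph cephfs_user@.cephfs=/ /mnt/cephfs -o secretfile=/etc/ceph/ceph.client.cephfs_user.secret,wsync

# recursive forward scrub with repair on rank 0, then check its progress:
ceph tell mds.cephfs:0 scrub start / recursive,repair
ceph tell mds.cephfs:0 scrub status

Would that be the right approach before going back into production, or is there a better way?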