Hi all,

we were finally able to repair the filesystem, and it seems that we did not lose any data. Thanks for all the suggestions and comments.
Here is a short summary of our journey:

1. At some point, all 6 of our MDS daemons went into error state, one after another.
2. We tried to restart them, but they kept crashing.
3. We learned that we had unfortunately hit a known bug: <https://tracker.ceph.com/issues/61009>
4. We set the filesystem down ("ceph fs set cephfs down true") and unmounted it from all clients.
5. We started the disaster recovery procedure: <https://docs.ceph.com/en/reef/cephfs/disaster-recovery-experts/>

   I. Backup the journal (a per-rank variant is sketched after this list)

      cephfs-journal-tool --rank=cephfs:all journal export /mnt/backup/backup.bin

   II. Dentry recovery from journal (we have 3 active MDS)

      cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
      cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary
      cephfs-journal-tool --rank=cephfs:2 event recover_dentries summary

      cephfs-journal-tool --rank=cephfs:all journal inspect
        Overall journal integrity: OK
        Overall journal integrity: DAMAGED
          Corrupt regions:
          0xd9a84f243c-ffffffffffffffff
        Overall journal integrity: OK

      The journal of rank 1 still showed damage.

   III. Journal truncation

      cephfs-journal-tool --rank=cephfs:0 journal reset
      cephfs-journal-tool --rank=cephfs:1 journal reset
      cephfs-journal-tool --rank=cephfs:2 journal reset

   IV. MDS table wipes

      cephfs-table-tool all reset session

      cephfs-journal-tool --rank=cephfs:1 journal inspect
        Overall journal integrity: OK

   V. MDS map reset

      ceph fs reset cephfs --yes-i-really-mean-it

After these steps to reset and trim the journal, we tried to restart the MDS daemons; however, they were still dying shortly after starting.
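(A side note on the journal backup in step I: since we had three active ranks, one could also keep a separate backup file per rank. A minimal sketch, assuming ranks 0-2 and that /mnt/backup is simply a mounted backup location:

      for r in 0 1 2; do
          cephfs-journal-tool --rank=cephfs:$r journal export /mnt/backup/backup.rank$r.bin
      done

That way a later per-rank "journal import" would remain possible if it were ever needed.)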
So, as Xiubo suggested, we went on with the disaster recovery procedure...

   VI. Recovery from missing metadata objects

      cephfs-table-tool 0 reset session
      cephfs-table-tool 0 reset snap
      cephfs-table-tool 0 reset inode
      cephfs-journal-tool --rank=cephfs:0 journal reset
      cephfs-data-scan init

The "cephfs-data-scan init" gave us warnings about already existing inodes:

      Inode 0x0x1 already exists, skipping create.  Use --force-init to overwrite the existing object.
      Inode 0x0x100 already exists, skipping create.  Use --force-init to overwrite the existing object.
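One way to double-check that the objects behind inode 0x1 (the root) and 0x100 (the mds dir) really do exist is to look for them directly in the metadata pool. A rough sketch (the pool name is from our setup; listing a large metadata pool can take a while):

      rados -p ssd-rep-metadata-pool ls | grep -E '^(1|100)\.'

The warning itself says that --force-init would overwrite the existing objects, so seeing that they are present is a good argument for leaving them alone.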
We decided not to use --force-init and went on with:

      cephfs-data-scan scan_extents sdd-rep-data-pool hdd-ec-data-pool

The docs say this can take a "very long time"; unfortunately, the tool does not produce any ETA. After ~24 hours we interrupted the process and restarted it with 32 workers (see the sketch just below).
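For reference, a sketch of how the 32 parallel workers can be launched, assuming the worker flags behave as documented (--worker_n is the index of a worker, --worker_m the total number of workers):

      for n in $(seq 0 31); do
          cephfs-data-scan scan_extents --worker_n $n --worker_m 32 \
              sdd-rep-data-pool hdd-ec-data-pool &
      done
      wait

"cephfs-data-scan scan_inodes" accepts the same --worker_n/--worker_m flags, so the next step can be parallelized the same way.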
The parallel scan_extents took about 2 h 15 min and did not generate any output on stdout or stderr.

We then went on with a parallel (32 workers) scan_inodes, which also completed without any output after ~50 min.
We then ran "cephfs-data-scan scan_links", however the tool was stopping afer ~ 45 min. with an error message: Error ((2) No such file or directory)
We tried to go on anyway with "cephfs-data-scan cleanup". The cleanup was running for about 9h and 20min and did not produce any output.
So we tried to start up the MDS again; however, they still kept crashing:

2024-06-23T08:21:50.197+0000 7feeb5177700  1 mds.0.8075 rejoin_start
2024-06-23T08:21:50.201+0000 7feeb5177700  1 mds.0.8075 rejoin_joint_start
2024-06-23T08:21:50.204+0000 7feeaf16b700  1 mds.0.cache.den(0x10000000000 groups) loaded already corrupt dentry: [dentry #0x1/data/groups [bf,head] rep@0.0 NULL (dversion lock) pv=0 v=7910497 ino=(nil) state=0 0x55aa27a19180]
[....]
2024-06-23T08:21:50.228+0000 7feeaf16b700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e42340
[...]
2024-06-23T08:21:50.345+0000 7feeaf16b700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d8b
[....]
    -6> 2024-06-23T08:21:50.351+0000 7feeaf16b700 10 log_client  will send 2024-06-23T08:21:50.229835+0000 mds.default.cephmon-03.xcujhz (mds.0) 1 : cluster [ERR] bad backtrace on directory inode 0x10003e42340
    -5> 2024-06-23T08:21:50.351+0000 7feeaf16b700 10 log_client  will send 2024-06-23T08:21:50.347085+0000 mds.default.cephmon-03.xcujhz (mds.0) 2 : cluster [ERR] bad backtrace on directory inode 0x10003e45d8b
    -4> 2024-06-23T08:21:50.351+0000 7feeaf16b700 10 monclient: _send_mon_message to mon.cephmon-03 at v2:10.1.3.23:3300/0
    -3> 2024-06-23T08:21:50.351+0000 7feeaf16b700  5 mds.beacon.default.cephmon-03.xcujhz Sending beacon down:damaged seq 90
    -2> 2024-06-23T08:21:50.351+0000 7feeaf16b700 10 monclient: _send_mon_message to mon.cephmon-03 at v2:10.1.3.23:3300/0
    -1> 2024-06-23T08:21:50.371+0000 7feeb817d700  5 mds.beacon.default.cephmon-03.xcujhz received beacon reply down:damaged seq 90 rtt 0.0200002
     0> 2024-06-23T08:21:50.371+0000 7feeaf16b700  1 mds.default.cephmon-03.xcujhz respawn!
So we decided to retry the "scan_links" and "cleanup" steps:

      cephfs-data-scan scan_links
      (took about 50 min, no error this time)

      cephfs-data-scan cleanup
      (took about 10 h, no error)

Now we again tried to fire up the MDS: we set "ceph mds repaired 0" and started the MDS (roughly as sketched below). And now the cluster status was: HEALTH_OK
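For completeness, marking the rank repaired and then restarting the cephadm-managed MDS daemons looks roughly like this; take it as a sketch rather than a literal transcript of what we typed (daemon names can be listed with "ceph orch ps --daemon-type mds"):

      ceph mds repaired 0
      ceph orch daemon restart mds.default.cephmon-03.xcujhz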
However, in the MDS logs we still saw some error messages: "bad backtrace on directory inode"

   VII. Filesystem scrub

We decided to run a filesystem scrub on one of the directories that showed these errors, which we identified by:
rados --cluster ceph -p ssd-rep-metadata-pool listomapvals 10003e45d9f.00000000
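(As an aside, for anyone needing to map such an inode back to a path: the metadata-pool object name is just the inode number in hex plus the fragment suffix, e.g. 0x10003e42340 -> 10003e42340.00000000, and for directories the backtrace is stored in that object's "parent" xattr. A sketch of decoding it, assuming inode_backtrace_t is registered with ceph-dencoder on your build:

      rados -p ssd-rep-metadata-pool getxattr 10003e42340.00000000 parent > /tmp/bt.bin
      ceph-dencoder type inode_backtrace_t import /tmp/bt.bin decode dump_json

The decoded "ancestors" list then gives the dentry names up to the root.)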
We then started the scrub on that directory:

      ceph tell mds.cephfs:0 scrub start /directory/with/bad/backtrace recursive

After this, the cluster jumped to HEALTH_ERR, saying that there is journal damage.
We then decided to run a full filesystem scrub:

      ceph tell mds.cephfs:0 scrub start / recursive,repair,force

It took about 4 h to complete, and from the logs we found about 192 "bad backtrace" inodes that did not have a corresponding "Scrub repaired inode" log message (see the log-grepping sketch below).
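For anyone wanting to do this kind of counting, grepping the MDS log is one way; a rough sketch (the log path is only an example, and the exact message wording may differ on your version, so adjust the patterns to what your log actually shows):

      LOG=/var/log/ceph/ceph-mds.default.cephmon-03.xcujhz.log   # example path
      grep -c 'bad backtrace on directory inode' "$LOG"
      grep -c 'Scrub repaired inode' "$LOG"
      # list the flagged inode numbers:
      grep -o 'bad backtrace on directory inode 0x[0-9a-f]*' "$LOG" | awk '{print $NF}' | sort -u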
We identified 2 home directories that were affected by this. We reran the scrub (recursive,repair,force) on these two directories, and

      ceph tell mds.cephfs:0 damage ls

showed that the "bad backtrace" inodes were now reduced to 68 ("damage_type": "backtrace").
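"damage ls" returns JSON, so the IDs of the remaining backtrace entries can be pulled out with something like this (a sketch, assuming jq is available):

      ceph tell mds.cephfs:0 damage ls |
          jq -r '.[] | select(.damage_type == "backtrace") | .id'

Those IDs are what goes into the "damage rm" step described next.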
We then ran "damage rm" on all these 68 IDs. ceph tell mds.cephfs:0 damage rm <ID> [...] After this, the cluster status went back to HEALTH_OK. VIII. final checkup Before we set the fileystem online again we ran a final checkup: ceph tell mds.cephfs:0 scrub start / recursive,repair,forceThe MDS log showed no more errors, so the filesystem and journal was consistent again.
   IX. New mount option for the clients and MDS settings to mitigate the bug

In order not to be hit by the bug (#61009) again, we set the "-o wsync" mount option for our kernel clients and mds_client_delegate_inos_pct = 0 on our MDS (a sketch follows below).
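Roughly, that amounts to the following; the mount source, mount point and secret file below are placeholders from our setup rather than values to copy verbatim:

      # kernel client, e.g. via /etc/fstab:
      cephmon-01,cephmon-02,cephmon-03:/  /mnt/cephfs  ceph  name=cephfs_user,secretfile=/etc/ceph/cephfs_user.secret,wsync,_netdev  0  0

      # MDS side:
      ceph config set mds mds_client_delegate_inos_pct 0

As far as we understand it, wsync makes namespace operations (create/unlink/rename) synchronous again, avoiding the async-dirop path involved in the bug, and the second setting stops the MDS from delegating preallocated inode ranges to clients.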
Now everything is running fine again, and we were lucky that no data was lost.

   X. Conclusion

Had we been aware of the bug and its mitigation earlier, we would have saved a lot of downtime and some nerves.
Is there an obvious place that I missed where such known issues are prominently made public? (The bug tracker, maybe, but I think it is easy to miss the important ones among all the others.)
Thanks again for all the help,
Dietmar

On 6/19/24 09:43, Dietmar Rieder wrote:
Hello cephers,

we have a degraded filesystem on our ceph 18.2.2 cluster and I need to get it up again.

We have 6 MDS daemons (3 active, each pinned to a subtree, 3 standby).

It started this night; I got the first HEALTH_WARN emails saying:

HEALTH_WARN

--- New ---
[WARN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
    mds.default.cephmon-02.duujba(mds.1): Client apollo-10:cephfs_user failing to respond to cache pressure client_id: 1962074

=== Full health status ===
[WARN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
    mds.default.cephmon-02.duujba(mds.1): Client apollo-10:cephfs_user failing to respond to cache pressure client_id: 1962074

Then it went on with:

HEALTH_WARN

--- New ---
[WARN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded

--- Cleared ---
[WARN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
    mds.default.cephmon-02.duujba(mds.1): Client apollo-10:cephfs_user failing to respond to cache pressure client_id: 1962074

=== Full health status ===
[WARN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded

Then one after another MDS went into error state:

HEALTH_WARN

--- Updated ---
[WARN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
    daemon mds.default.cephmon-01.cepqjp on cephmon-01 is in error state
    daemon mds.default.cephmon-02.duujba on cephmon-02 is in error state
    daemon mds.default.cephmon-03.chjusj on cephmon-03 is in error state
    daemon mds.default.cephmon-03.xcujhz on cephmon-03 is in error state

=== Full health status ===
[WARN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
    daemon mds.default.cephmon-01.cepqjp on cephmon-01 is in error state
    daemon mds.default.cephmon-02.duujba on cephmon-02 is in error state
    daemon mds.default.cephmon-03.chjusj on cephmon-03 is in error state
    daemon mds.default.cephmon-03.xcujhz on cephmon-03 is in error state
[WARN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[WARN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more

In the morning I then tried to restart the MDS daemons in error state, but they kept failing.
I then reduced the number of active MDS to 1:

ceph fs set cephfs max_mds 1

And set the filesystem down:

ceph fs set cephfs down true

I tried to restart the MDS again, but now I'm stuck at the following status:

[root@ceph01-b ~]# ceph -s
  cluster:
    id:     aae23c5c-a98b-11ee-b44d-00620b05cac4
    health: HEALTH_WARN
            4 failed cephadm daemon(s)
            1 filesystem is degraded
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum cephmon-01,cephmon-03,cephmon-02 (age 2w)
    mgr: cephmon-01.dsxcho(active, since 11w), standbys: cephmon-02.nssigg, cephmon-03.rgefle
    mds: 3/3 daemons up
    osd: 336 osds: 336 up (since 11w), 336 in (since 3M)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   4 pools, 6401 pgs
    objects: 284.69M objects, 623 TiB
    usage:   889 TiB used, 3.1 PiB / 3.9 PiB avail
    pgs:     6186 active+clean
             156  active+clean+scrubbing
             59   active+clean+scrubbing+deep

[root@ceph01-b ~]# ceph health detail
HEALTH_WARN 4 failed cephadm daemon(s); 1 filesystem is degraded; insufficient standby MDS daemons available
[WRN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
    daemon mds.default.cephmon-01.cepqjp on cephmon-01 is in error state
    daemon mds.default.cephmon-02.duujba on cephmon-02 is in unknown state
    daemon mds.default.cephmon-03.chjusj on cephmon-03 is in error state
    daemon mds.default.cephmon-03.xcujhz on cephmon-03 is in error state
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more

[root@ceph01-b ~]# ceph fs status
cephfs - 40 clients
======
RANK      STATE                  MDS              ACTIVITY   DNS    INOS   DIRS   CAPS
 0        resolve        default.cephmon-02.nyfook            12.3k  11.8k  3228      0
 1     replay(laggy)     default.cephmon-02.duujba                0      0      0      0
 2        resolve        default.cephmon-01.pvnqad            15.8k   3541   1409      0
         POOL            TYPE     USED  AVAIL
ssd-rep-metadata-pool  metadata   295G  63.5T
  sdd-rep-data-pool      data    10.2T  84.6T
   hdd-ec-data-pool      data     808T  1929T
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)

The end of the log file of the replay(laggy) default.cephmon-02.duujba shows:

[...]
   -11> 2024-06-19T07:12:38.980+0000 7f90fd117700  1 mds.1.journaler.pq(ro) _finish_probe_end write_pos = 8673820672 (header had 8623488918). recovered.
   -10> 2024-06-19T07:12:38.980+0000 7f90fd117700  4 mds.1.purge_queue operator(): open complete
    -9> 2024-06-19T07:12:38.980+0000 7f90fd117700  4 mds.1.purge_queue operator(): recovering write_pos
    -8> 2024-06-19T07:12:39.015+0000 7f9104926700 10 monclient: get_auth_request con 0x55a93ef42c00 auth_method 0
    -7> 2024-06-19T07:12:39.025+0000 7f9105928700 10 monclient: get_auth_request con 0x55a93ef43400 auth_method 0
    -6> 2024-06-19T07:12:39.038+0000 7f90fd117700  4 mds.1.purge_queue _recover: write_pos recovered
    -5> 2024-06-19T07:12:39.038+0000 7f90fd117700  1 mds.1.journaler.pq(ro) set_writeable
    -4> 2024-06-19T07:12:39.044+0000 7f9105127700 10 monclient: get_auth_request con 0x55a93ef43c00 auth_method 0
    -3> 2024-06-19T07:12:39.113+0000 7f9104926700 10 monclient: get_auth_request con 0x55a93ed97000 auth_method 0
    -2> 2024-06-19T07:12:39.123+0000 7f9105928700 10 monclient: get_auth_request con 0x55a93e903c00 auth_method 0
    -1> 2024-06-19T07:12:39.236+0000 7f90fa912700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f90fa912700 time 2024-06-19T07:12:39.235633+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)

 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f910c722e15]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f910c722fdb]
 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x55a93c0de9a5]
 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4207) [0x55a93c3e76e7]
 5: (EUpdate::replay(MDSRank*)+0x61) [0x55a93c3e9f81]
 6: (MDLog::_replay_thread()+0x6c9) [0x55a93c3701d9]
 7: (MDLog::ReplayThread::entry()+0x11) [0x55a93c01e2d1]
 8: /lib64/libpthread.so.0(+0x81ca) [0x7f910b4c81ca]
 9: clone()

     0> 2024-06-19T07:12:39.236+0000 7f90fa912700 -1 *** Caught signal (Aborted) **
 in thread 7f90fa912700 thread_name:md_log_replay

 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
 1: /lib64/libpthread.so.0(+0x12d20) [0x7f910b4d2d20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f910c722e6f]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f910c722fdb]
 6: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x55a93c0de9a5]
 7: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4207) [0x55a93c3e76e7]
 8: (EUpdate::replay(MDSRank*)+0x61) [0x55a93c3e9f81]
 9: (MDLog::_replay_thread()+0x6c9) [0x55a93c3701d9]
 10: (MDLog::ReplayThread::entry()+0x11) [0x55a93c01e2d1]
 11: /lib64/libpthread.so.0(+0x81ca) [0x7f910b4c81ca]
 12: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
  0/ 5 none   0/ 1 lockdep   0/ 1 context   1/ 1 crush   1/ 5 mds   1/ 5 mds_balancer
  1/ 5 mds_locker   1/ 5 mds_log   1/ 5 mds_log_expire   1/ 5 mds_migrator   0/ 1 buffer   0/ 1 timer
  0/ 1 filer   0/ 1 striper   0/ 1 objecter   0/ 5 rados   0/ 5 rbd   0/ 5 rbd_mirror
  0/ 5 rbd_replay   0/ 5 rbd_pwl   0/ 5 journaler   0/ 5 objectcacher   0/ 5 immutable_obj_cache   0/ 5 client
  1/ 5 osd   0/ 5 optracker   0/ 5 objclass   1/ 3 filestore   1/ 3 journal   0/ 0 ms
  1/ 5 mon   0/10 monc   1/ 5 paxos   0/ 5 tp   1/ 5 auth   1/ 5 crypto
  1/ 1 finisher   1/ 1 reserver   1/ 5 heartbeatmap   1/ 5 perfcounter   1/ 5 rgw   1/ 5 rgw_sync
  1/ 5 rgw_datacache   1/ 5 rgw_access   1/ 5 rgw_dbstore   1/ 5 rgw_flight   1/ 5 javaclient   1/ 5 asok
  1/ 1 throttle   0/ 0 refs   1/ 5 compressor   1/ 5 bluestore   1/ 5 bluefs   1/ 3 bdev
  1/ 5 kstore   4/ 5 rocksdb   4/ 5 leveldb   1/ 5 fuse   2/ 5 mgr   1/ 5 mgrc
  1/ 5 dpdk   1/ 5 eventtrace   1/ 5 prioritycache   0/ 5 test   0/ 5 cephfs_mirror   0/ 5 cephsqlite
  0/ 5 seastore   0/ 5 seastore_onode   0/ 5 seastore_odata   0/ 5 seastore_omap   0/ 5 seastore_tm   0/ 5 seastore_t
  0/ 5 seastore_cleaner   0/ 5 seastore_epm   0/ 5 seastore_lba   0/ 5 seastore_fixedkv_tree   0/ 5 seastore_cache   0/ 5 seastore_journal
  0/ 5 seastore_device   0/ 5 seastore_backref   0/ 5 alienstore   1/ 5 mclock   0/ 5 cyanstore   1/ 5 ceph_exporter
  1/ 5 memstore
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f90fa912700 / md_log_replay
  7f90fb914700 /
  7f90fc115700 / MR_Finisher
  7f90fd117700 / PQ_Finisher
  7f90fe119700 / ms_dispatch
  7f910011d700 / ceph-mds
  7f9102121700 / ms_dispatch
  7f9103123700 / io_context_pool
  7f9104125700 / admin_socket
  7f9104926700 / msgr-worker-2
  7f9105127700 / msgr-worker-1
  7f9105928700 / msgr-worker-0
  7f910d8eab00 / ceph-mds
  max_recent     10000
  max_new        1000
  log_file /var/log/ceph/ceph-mds.default.cephmon-02.duujba.log
--- end dump of recent events ---

I have no idea how to resolve this and would be grateful for any help.

Dietmar
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx