Re: [EXTERN] Urgent help with degraded filesystem needed

Hi all,

Finally, we were able to repair the filesystem, and it seems that we did not lose any data. Thanks for all the suggestions and comments.

Here is a short summary of our journey:


1. At some point, all 6 of our MDS daemons went into error state, one after another

2. We tried to restart them, but they kept crashing

3. We learned that we had unfortunately hit a known bug: <https://tracker.ceph.com/issues/61009>

4. We took the filesystem down ("ceph fs set cephfs down true") and unmounted it from all clients.

5. We started with the disaster recovery procedure:

<https://docs.ceph.com/en/reef/cephfs/disaster-recovery-experts/>

I. Back up the journal
cephfs-journal-tool --rank=cephfs:all journal export /mnt/backup/backup.bin
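
For reference, the same backup can also be taken per rank into separate files (the file names here are just examples):

cephfs-journal-tool --rank=cephfs:0 journal export /mnt/backup/backup.0.bin
cephfs-journal-tool --rank=cephfs:1 journal export /mnt/backup/backup.1.bin
cephfs-journal-tool --rank=cephfs:2 journal export /mnt/backup/backup.2.bin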

II. DENTRY recovery from journal

(We have 3 active MDS)
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary
cephfs-journal-tool --rank=cephfs:2 event recover_dentries summary

cephfs-journal-tool --rank=cephfs:all  journal inspect
Overall journal integrity: OK
Overall journal integrity: DAMAGED
Corrupt regions:
  0xd9a84f243c-ffffffffffffffff
Overall journal integrity: OK

The journal for rank 1 still showed damage.

III. Journal truncation

cephfs-journal-tool --rank=cephfs:0 journal reset
cephfs-journal-tool --rank=cephfs:1 journal reset
cephfs-journal-tool --rank=cephfs:2 journal reset

IV. MDS table wipes

cephfs-table-tool all reset session

cephfs-journal-tool --rank=cephfs:1  journal inspect
Overall journal integrity: OK

V. MDS MAP reset

ceph fs reset cephfs --yes-i-really-mean-it

After these steps to reset and trim the journal, we tried to restart the MDS daemons; however, they were still dying shortly after starting.

So, as Xiubo suggested, we went on with the disaster recovery procedure...

VI. Recovery from missing metadata objects

cephfs-table-tool 0 reset session
cephfs-table-tool 0 reset snap
cephfs-table-tool 0 reset inode

cephfs-journal-tool --rank=cephfs:0 journal reset

cephfs-data-scan init

The "cephfs-data-scan init" gave us warnings about already existing inodes:
Inode 0x0x1 already exists, skipping create. Use --force-init to overwrite the existing object.
Inode 0x0x100 already exists, skipping create. Use --force-init to overwrite the existing object.

We decided not to use --force-init and went on with

cephfs-data-scan scan_extents sdd-rep-data-pool hdd-ec-data-pool

The docs say it can take a "very long time"; unfortunately, the tool does not produce any ETA or progress output. After ~24 hrs we interrupted the process and restarted it with 32 workers (see the sketch below).

The parallel scan_extents took about 2 h 15 min and did not generate any output on stdout or stderr.

We then went on with a parallel (32 workers) scan_inodes, which also completed without any output after ~50 min.
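
For anyone repeating this: the parallel runs use the --worker_n/--worker_m options described in the disaster-recovery docs to split the scan across processes. A minimal sketch of a 32-worker scan_extents run (pool names are from our cluster; how exactly you launch the 32 processes is up to you):

# launch 32 scan_extents workers in parallel, one process per worker_n, then wait for all of them
for n in $(seq 0 31); do
    cephfs-data-scan scan_extents --worker_n $n --worker_m 32 sdd-rep-data-pool hdd-ec-data-pool &
done
wait
# scan_inodes can be parallelised the same way (same --worker_n/--worker_m pattern)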

We then ran "cephfs-data-scan scan_links", however the tool was stopping afer ~ 45 min. with an error message: Error ((2) No such file or directory)

We tried to go on anyway with "cephfs-data-scan cleanup". The cleanup ran for about 9 h 20 min and did not produce any output.

So we tried to start up the MDS again; however, they still kept crashing:

2024-06-23T08:21:50.197+0000 7feeb5177700  1 mds.0.8075 rejoin_start
2024-06-23T08:21:50.201+0000 7feeb5177700  1 mds.0.8075 rejoin_joint_start
2024-06-23T08:21:50.204+0000 7feeaf16b700 1 mds.0.cache.den(0x10000000000 groups) loaded already corrupt dentry: [dentry #0x1/data/groups [bf,head] rep@0.0 NULL (dversion lock) pv=0 v=7910497 ino=(nil) state=0 0x55aa27a19180]
[....]
2024-06-23T08:21:50.228+0000 7feeaf16b700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e42340
[...]
2024-06-23T08:21:50.345+0000 7feeaf16b700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d8b
[....]
-6> 2024-06-23T08:21:50.351+0000 7feeaf16b700 10 log_client will send 2024-06-23T08:21:50.229835+0000 mds.default.cephmon-03.xcujhz (mds.0) 1 : cluster [ERR] bad backtrace on directory inode 0x10003e42340
-5> 2024-06-23T08:21:50.351+0000 7feeaf16b700 10 log_client will send 2024-06-23T08:21:50.347085+0000 mds.default.cephmon-03.xcujhz (mds.0) 2 : cluster [ERR] bad backtrace on directory inode 0x10003e45d8b
-4> 2024-06-23T08:21:50.351+0000 7feeaf16b700 10 monclient: _send_mon_message to mon.cephmon-03 at v2:10.1.3.23:3300/0
-3> 2024-06-23T08:21:50.351+0000 7feeaf16b700 5 mds.beacon.default.cephmon-03.xcujhz Sending beacon down:damaged seq 90
-2> 2024-06-23T08:21:50.351+0000 7feeaf16b700 10 monclient: _send_mon_message to mon.cephmon-03 at v2:10.1.3.23:3300/0
-1> 2024-06-23T08:21:50.371+0000 7feeb817d700 5 mds.beacon.default.cephmon-03.xcujhz received beacon reply down:damaged seq 90 rtt 0.0200002
 0> 2024-06-23T08:21:50.371+0000 7feeaf16b700 1 mds.default.cephmon-03.xcujhz respawn!

So we decided to retry the "scan_links" and "cleanup" steps:

cephfs-data-scan scan_links
Took about 50min., no error this time.

cephfs-data-scan cleanup
Took about 10h, no error

Now we again tried to fire up the MDS:

we set "ceph mds repaired 0" and started the MDS. And now the cluster status was: HEALTH_OK

However, in the MDS logs we still saw some error messages:

"bad backtrace on directory inode"


VII. Filesystem scrub


We decided to run a filesystem scrub on one of the directories that showed these errors, which we identified by:

rados --cluster ceph -p ssd-rep-metadata-pool listomapvals 10003e45d9f.00000000

ceph tell mds.cephfs:0 scrub start /directory/with/bad/backtrace recursive

After this, the cluster jumped to HEALTH_ERR state, saying that there was journal damage.

We then decided to run a full filesystem scrub:

ceph tell mds.cephfs:0 scrub start / recursive,repair,force
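
Progress of such a long-running scrub can be followed with the scrub status command:

ceph tell mds.cephfs:0 scrub status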

It took about 4 h to complete, and in the logs we found about 192 "bad backtrace" inodes that did not have a corresponding "Scrub repaired inode" log message.
We identified 2 home directories that were affected by this.

We reran the scrub (recursive,repair,force) on these two directories, and

ceph tell mds.cephfs:0 damage ls

showed that the "bad backtrace" inodes were now reduced to 68 ("damage_type": "backtrace").

We then ran "damage rm" on all these 68 IDs.

ceph tell mds.cephfs:0 damage rm <ID>
[...]
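
Since "damage ls" prints JSON, this can be scripted; a minimal sketch, assuming jq is available (adjust the damage_type filter as needed):

# remove all "backtrace" damage entries reported by rank 0
for id in $(ceph tell mds.cephfs:0 damage ls | jq -r '.[] | select(.damage_type == "backtrace") | .id'); do
    ceph tell mds.cephfs:0 damage rm "$id"
done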

After this, the cluster status went back to HEALTH_OK.

VIII. Final checkup

Before we set the filesystem online again, we ran a final checkup:

ceph tell mds.cephfs:0 scrub start / recursive,repair,force

The MDS log showed no more errors, so the filesystem and journal were consistent again.
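
A rough sketch of what bringing the filesystem back online then looks like (assuming the original max_mds of 3; we are not quoting our exact commands here):

ceph fs set cephfs down false     # allow MDS daemons to join again and bring the ranks back up
ceph fs set cephfs max_mds 3      # restore the original number of active MDS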

IX. New mount option for the clients and MDS settings to mitigate the bug

In order not to be hit by the bug (#61009) again, we set the -o wsync mount option for our kernel clients and set mds_client_delegate_inos_pct to 0 on our MDS.
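
Concretely, that means something like the following (the mount point and exact mount syntax are just an example; cephfs_user is one of our client names):

# kernel client: force synchronous namespace operations (disable async create/unlink)
mount -t ceph :/ /mnt/cephfs -o name=cephfs_user,wsync

# MDS side: do not delegate preallocated inode ranges to clients
ceph config set mds mds_client_delegate_inos_pct 0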

Now everything is running fine again, and we were lucky that no data was lost.

X. Conclusion:

Had we been aware of the bug and its mitigation, we would have saved a lot of downtime and some nerves.

Is there an obvious place that I missed where such known issues are prominently published? (The bug tracker, maybe, but I think it is easy to miss the important ones among all the others.)

Thanks again for all the help
   Dietmar

On 6/19/24 09:43, Dietmar Rieder wrote:
Hello cephers,

we have a degraded filesystem on our Ceph 18.2.2 cluster, and I need to get it up again.

We have 6 MDS daemons (3 active, each pinned to a subtree, and 3 standby).

It started during the night; I got the first HEALTH_WARN emails saying:

HEALTH_WARN

--- New ---
[WARN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
        mds.default.cephmon-02.duujba(mds.1): Client apollo-10:cephfs_user failing to respond to cache pressure client_id: 1962074


=== Full health status ===
[WARN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
        mds.default.cephmon-02.duujba(mds.1): Client apollo-10:cephfs_user failing to respond to cache pressure client_id: 1962074


then it went on with:

HEALTH_WARN

--- New ---
[WARN] FS_DEGRADED: 1 filesystem is degraded
         fs cephfs is degraded

--- Cleared ---
[WARN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
        mds.default.cephmon-02.duujba(mds.1): Client apollo-10:cephfs_user failing to respond to cache pressure client_id: 1962074


=== Full health status ===
[WARN] FS_DEGRADED: 1 filesystem is degraded
         fs cephfs is degraded



Then one after another MDS was going to error state:

HEALTH_WARN

--- Updated ---
[WARN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
        daemon mds.default.cephmon-01.cepqjp on cephmon-01 is in error state
        daemon mds.default.cephmon-02.duujba on cephmon-02 is in error state
        daemon mds.default.cephmon-03.chjusj on cephmon-03 is in error state
        daemon mds.default.cephmon-03.xcujhz on cephmon-03 is in error state


=== Full health status ===
[WARN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
        daemon mds.default.cephmon-01.cepqjp on cephmon-01 is in error state
        daemon mds.default.cephmon-02.duujba on cephmon-02 is in error state
        daemon mds.default.cephmon-03.chjusj on cephmon-03 is in error state
        daemon mds.default.cephmon-03.xcujhz on cephmon-03 is in error state
[WARN] FS_DEGRADED: 1 filesystem is degraded
         fs cephfs is degraded
[WARN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
         have 0; want 1 more


In the morning I then tried to restart the MDS daemons in error state, but they kept failing. I then reduced the number of active MDS to 1

ceph fs set cephfs max_mds 1

And set the filesystem down

ceph fs set cephfs down true

I tried to restart the MDS again but now I'm stuck at the following status:


[root@ceph01-b ~]# ceph -s
   cluster:
     id:     aae23c5c-a98b-11ee-b44d-00620b05cac4
     health: HEALTH_WARN
             4 failed cephadm daemon(s)
             1 filesystem is degraded
             insufficient standby MDS daemons available

   services:
     mon: 3 daemons, quorum cephmon-01,cephmon-03,cephmon-02 (age 2w)
    mgr: cephmon-01.dsxcho(active, since 11w), standbys: cephmon-02.nssigg, cephmon-03.rgefle
     mds: 3/3 daemons up
     osd: 336 osds: 336 up (since 11w), 336 in (since 3M)

   data:
     volumes: 0/1 healthy, 1 recovering
     pools:   4 pools, 6401 pgs
     objects: 284.69M objects, 623 TiB
     usage:   889 TiB used, 3.1 PiB / 3.9 PiB avail
     pgs:     6186 active+clean
              156  active+clean+scrubbing
              59   active+clean+scrubbing+deep

[root@ceph01-b ~]# ceph health detail
HEALTH_WARN 4 failed cephadm daemon(s); 1 filesystem is degraded; insufficient standby MDS daemons available
[WRN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
     daemon mds.default.cephmon-01.cepqjp on cephmon-01 is in error state
     daemon mds.default.cephmon-02.duujba on cephmon-02 is in unknown state
     daemon mds.default.cephmon-03.chjusj on cephmon-03 is in error state
     daemon mds.default.cephmon-03.xcujhz on cephmon-03 is in error state
[WRN] FS_DEGRADED: 1 filesystem is degraded
     fs cephfs is degraded
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
     have 0; want 1 more
[root@ceph01-b ~]#
[root@ceph01-b ~]# ceph fs status
cephfs - 40 clients
======
RANK      STATE                    MDS                ACTIVITY   DNS    INOS   DIRS   CAPS
 0        resolve          default.cephmon-02.nyfook            12.3k  11.8k   3228      0
 1     replay(laggy)       default.cephmon-02.duujba                0      0      0      0
 2        resolve          default.cephmon-01.pvnqad            15.8k   3541   1409      0
          POOL            TYPE     USED  AVAIL
ssd-rep-metadata-pool  metadata   295G  63.5T
   sdd-rep-data-pool      data    10.2T  84.6T
    hdd-ec-data-pool      data     808T  1929T
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)


The end of the log file of the replay(laggy) MDS default.cephmon-02.duujba shows:

[...]
   -11> 2024-06-19T07:12:38.980+0000 7f90fd117700  1 mds.1.journaler.pq(ro) _finish_probe_end write_pos = 8673820672 (header had 8623488918). recovered.
   -10> 2024-06-19T07:12:38.980+0000 7f90fd117700  4 mds.1.purge_queue operator(): open complete
    -9> 2024-06-19T07:12:38.980+0000 7f90fd117700  4 mds.1.purge_queue operator(): recovering write_pos
    -8> 2024-06-19T07:12:39.015+0000 7f9104926700 10 monclient: get_auth_request con 0x55a93ef42c00 auth_method 0
    -7> 2024-06-19T07:12:39.025+0000 7f9105928700 10 monclient: get_auth_request con 0x55a93ef43400 auth_method 0
    -6> 2024-06-19T07:12:39.038+0000 7f90fd117700  4 mds.1.purge_queue _recover: write_pos recovered
    -5> 2024-06-19T07:12:39.038+0000 7f90fd117700  1 mds.1.journaler.pq(ro) set_writeable
    -4> 2024-06-19T07:12:39.044+0000 7f9105127700 10 monclient: get_auth_request con 0x55a93ef43c00 auth_method 0
    -3> 2024-06-19T07:12:39.113+0000 7f9104926700 10 monclient: get_auth_request con 0x55a93ed97000 auth_method 0
    -2> 2024-06-19T07:12:39.123+0000 7f9105928700 10 monclient: get_auth_request con 0x55a93e903c00 auth_method 0
    -1> 2024-06-19T07:12:39.236+0000 7f90fa912700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f90fa912700 time 2024-06-19T07:12:39.235633+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)

 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f910c722e15]
  2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f910c722fdb]
  3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x55a93c0de9a5]
  4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4207) [0x55a93c3e76e7]
  5: (EUpdate::replay(MDSRank*)+0x61) [0x55a93c3e9f81]
  6: (MDLog::_replay_thread()+0x6c9) [0x55a93c3701d9]
  7: (MDLog::ReplayThread::entry()+0x11) [0x55a93c01e2d1]
  8: /lib64/libpthread.so.0(+0x81ca) [0x7f910b4c81ca]
  9: clone()

     0> 2024-06-19T07:12:39.236+0000 7f90fa912700 -1 *** Caught signal (Aborted) **
  in thread 7f90fa912700 thread_name:md_log_replay

 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
  1: /lib64/libpthread.so.0(+0x12d20) [0x7f910b4d2d20]
  2: gsignal()
  3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f910c722e6f]
  5: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f910c722fdb]
  6: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x55a93c0de9a5]
  7: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4207) [0x55a93c3e76e7]
  8: (EUpdate::replay(MDSRank*)+0x61) [0x55a93c3e9f81]
  9: (MDLog::_replay_thread()+0x6c9) [0x55a93c3701d9]
  10: (MDLog::ReplayThread::entry()+0x11) [0x55a93c01e2d1]
  11: /lib64/libpthread.so.0(+0x81ca) [0x7f910b4c81ca]
  12: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
    0/ 5 none
    0/ 1 lockdep
    0/ 1 context
    1/ 1 crush
    1/ 5 mds
    1/ 5 mds_balancer
    1/ 5 mds_locker
    1/ 5 mds_log
    1/ 5 mds_log_expire
    1/ 5 mds_migrator
    0/ 1 buffer
    0/ 1 timer
    0/ 1 filer
    0/ 1 striper
    0/ 1 objecter
    0/ 5 rados
    0/ 5 rbd
    0/ 5 rbd_mirror
    0/ 5 rbd_replay
    0/ 5 rbd_pwl
    0/ 5 journaler
    0/ 5 objectcacher
    0/ 5 immutable_obj_cache
    0/ 5 client
    1/ 5 osd
    0/ 5 optracker
    0/ 5 objclass
    1/ 3 filestore
    1/ 3 journal
    0/ 0 ms
    1/ 5 mon
    0/10 monc
    1/ 5 paxos
    0/ 5 tp
    1/ 5 auth
    1/ 5 crypto
    1/ 1 finisher
    1/ 1 reserver
    1/ 5 heartbeatmap
    1/ 5 perfcounter
    1/ 5 rgw
    1/ 5 rgw_sync
    1/ 5 rgw_datacache
    1/ 5 rgw_access
    1/ 5 rgw_dbstore
    1/ 5 rgw_flight
    1/ 5 javaclient
    1/ 5 asok
    1/ 1 throttle
    0/ 0 refs
    1/ 5 compressor
    1/ 5 bluestore
    1/ 5 bluefs
    1/ 3 bdev
    1/ 5 kstore
    4/ 5 rocksdb
    4/ 5 leveldb
    1/ 5 fuse
    2/ 5 mgr
    1/ 5 mgrc
    1/ 5 dpdk
    1/ 5 eventtrace
    1/ 5 prioritycache
    0/ 5 test
    0/ 5 cephfs_mirror
    0/ 5 cephsqlite
    0/ 5 seastore
    0/ 5 seastore_onode
    0/ 5 seastore_odata
    0/ 5 seastore_omap
    0/ 5 seastore_tm
    0/ 5 seastore_t
    0/ 5 seastore_cleaner
    0/ 5 seastore_epm
    0/ 5 seastore_lba
    0/ 5 seastore_fixedkv_tree
    0/ 5 seastore_cache
    0/ 5 seastore_journal
    0/ 5 seastore_device
    0/ 5 seastore_backref
    0/ 5 alienstore
    1/ 5 mclock
    0/ 5 cyanstore
    1/ 5 ceph_exporter
    1/ 5 memstore
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
   7f90fa912700 / md_log_replay
   7f90fb914700 /
   7f90fc115700 / MR_Finisher
   7f90fd117700 / PQ_Finisher
   7f90fe119700 / ms_dispatch
   7f910011d700 / ceph-mds
   7f9102121700 / ms_dispatch
   7f9103123700 / io_context_pool
   7f9104125700 / admin_socket
   7f9104926700 / msgr-worker-2
   7f9105127700 / msgr-worker-1
   7f9105928700 / msgr-worker-0
   7f910d8eab00 / ceph-mds
   max_recent     10000
   max_new         1000
   log_file /var/log/ceph/ceph-mds.default.cephmon-02.duujba.log
--- end dump of recent events ---


I have no idea how to resolve this and would be grateful for any help.

Dietmar

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

