Hi all,

we were finally able to repair the filesystem, and it seems that we did not lose any data. Thanks for all the suggestions and comments.
Here is a short summary of our journey:

1. At some point, all 6 of our MDS daemons went into error state, one after another.
2. We tried to restart them, but they kept crashing.
3. We learned that we had unfortunately hit a known bug: <https://tracker.ceph.com/issues/61009>
4. We set the filesystem down ("ceph fs set cephfs down true") and unmounted it from all clients.
5. We started the disaster recovery procedure: <https://docs.ceph.com/en/reef/cephfs/disaster-recovery-experts/>

   I. Backup the journal (a per-rank variant is sketched after this list)

      cephfs-journal-tool --rank=cephfs:all journal export /mnt/backup/backup.bin

   II. Dentry recovery from journal (we have 3 active MDS)

      cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
      cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary
      cephfs-journal-tool --rank=cephfs:2 event recover_dentries summary

      cephfs-journal-tool --rank=cephfs:all journal inspect
        Overall journal integrity: OK
        Overall journal integrity: DAMAGED
          Corrupt regions:
          0xd9a84f243c-ffffffffffffffff
        Overall journal integrity: OK

      The journal of rank 1 still showed damage.

   III. Journal truncation

      cephfs-journal-tool --rank=cephfs:0 journal reset
      cephfs-journal-tool --rank=cephfs:1 journal reset
      cephfs-journal-tool --rank=cephfs:2 journal reset

   IV. MDS table wipes

      cephfs-table-tool all reset session

      cephfs-journal-tool --rank=cephfs:1 journal inspect
        Overall journal integrity: OK

   V. MDS map reset

      ceph fs reset cephfs --yes-i-really-mean-it

After these steps to reset and trim the journal, we tried to restart the MDS daemons; however, they were still dying shortly after starting.
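(A side note on the journal backup in step I: since we had three active ranks, one could also keep a separate backup file per rank. A minimal sketch, assuming ranks 0-2 and that /mnt/backup is simply a mounted backup location:

      for r in 0 1 2; do
          cephfs-journal-tool --rank=cephfs:$r journal export /mnt/backup/backup.rank$r.bin
      done

That way a later per-rank "journal import" would remain possible if it were ever needed.)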
So, as Xiubo suggested, we went on with the disaster recovery procedure...

   VI. Recovery from missing metadata objects

      cephfs-table-tool 0 reset session
      cephfs-table-tool 0 reset snap
      cephfs-table-tool 0 reset inode
      cephfs-journal-tool --rank=cephfs:0 journal reset
      cephfs-data-scan init

The "cephfs-data-scan init" gave us warnings about already existing inodes:

      Inode 0x0x1 already exists, skipping create.  Use --force-init to overwrite the existing object.
      Inode 0x0x100 already exists, skipping create.  Use --force-init to overwrite the existing object.
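One way to double-check that the objects behind inode 0x1 (the root) and 0x100 (the mds dir) really do exist is to look for them directly in the metadata pool. A rough sketch (the pool name is from our setup; listing a large metadata pool can take a while):

      rados -p ssd-rep-metadata-pool ls | grep -E '^(1|100)\.'

The warning itself says that --force-init would overwrite the existing objects, so seeing that they are present is a good argument for leaving them alone.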
We decided not to use --force-init and went on with:

      cephfs-data-scan scan_extents sdd-rep-data-pool hdd-ec-data-pool

The docs say this can take a "very long time"; unfortunately, the tool does not produce any ETA. After ~24 hours we interrupted the process and restarted it with 32 workers (see the sketch just below).
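For reference, a sketch of how the 32 parallel workers can be launched, assuming the worker flags behave as documented (--worker_n is the index of a worker, --worker_m the total number of workers):

      for n in $(seq 0 31); do
          cephfs-data-scan scan_extents --worker_n $n --worker_m 32 \
              sdd-rep-data-pool hdd-ec-data-pool &
      done
      wait

"cephfs-data-scan scan_inodes" accepts the same --worker_n/--worker_m flags, so the next step can be parallelized the same way.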
The parallel scan_extents took about 2 h 15 min and did not generate any output on stdout or stderr.

We then went on with a parallel (32 workers) scan_inodes, which also completed without any output after ~50 min.
We then ran "cephfs-data-scan scan_links", however the tool was stopping afer ~ 45 min. with an error message: Error ((2) No such file or directory)
We tried to go on anyway with "cephfs-data-scan cleanup". The cleanup was running for about 9h and 20min and did not produce any output.
So we tried to start up the MDS again; however, they still kept crashing:

2024-06-23T08:21:50.197+0000 7feeb5177700  1 mds.0.8075 rejoin_start
2024-06-23T08:21:50.201+0000 7feeb5177700  1 mds.0.8075 rejoin_joint_start
2024-06-23T08:21:50.204+0000 7feeaf16b700  1 mds.0.cache.den(0x10000000000 groups) loaded already corrupt dentry: [dentry #0x1/data/groups [bf,head] rep@0.0 NULL (dversion lock) pv=0 v=7910497 ino=(nil) state=0 0x55aa27a19180]
[....]
2024-06-23T08:21:50.228+0000 7feeaf16b700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e42340
[...]
2024-06-23T08:21:50.345+0000 7feeaf16b700 -1 log_channel(cluster) log [ERR] : bad backtrace on directory inode 0x10003e45d8b
[....]
    -6> 2024-06-23T08:21:50.351+0000 7feeaf16b700 10 log_client  will send 2024-06-23T08:21:50.229835+0000 mds.default.cephmon-03.xcujhz (mds.0) 1 : cluster [ERR] bad backtrace on directory inode 0x10003e42340
    -5> 2024-06-23T08:21:50.351+0000 7feeaf16b700 10 log_client  will send 2024-06-23T08:21:50.347085+0000 mds.default.cephmon-03.xcujhz (mds.0) 2 : cluster [ERR] bad backtrace on directory inode 0x10003e45d8b
    -4> 2024-06-23T08:21:50.351+0000 7feeaf16b700 10 monclient: _send_mon_message to mon.cephmon-03 at v2:10.1.3.23:3300/0
    -3> 2024-06-23T08:21:50.351+0000 7feeaf16b700  5 mds.beacon.default.cephmon-03.xcujhz Sending beacon down:damaged seq 90
    -2> 2024-06-23T08:21:50.351+0000 7feeaf16b700 10 monclient: _send_mon_message to mon.cephmon-03 at v2:10.1.3.23:3300/0
    -1> 2024-06-23T08:21:50.371+0000 7feeb817d700  5 mds.beacon.default.cephmon-03.xcujhz received beacon reply down:damaged seq 90 rtt 0.0200002
     0> 2024-06-23T08:21:50.371+0000 7feeaf16b700  1 mds.default.cephmon-03.xcujhz respawn!
So we decided to retry the "scan_links" and "cleanup" steps:

      cephfs-data-scan scan_links
      (took about 50 min, no error this time)

      cephfs-data-scan cleanup
      (took about 10 h, no error)

Now we again tried to fire up the MDS: we set "ceph mds repaired 0" and started the MDS (roughly as sketched below). And now the cluster status was: HEALTH_OK
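For completeness, marking the rank repaired and then restarting the cephadm-managed MDS daemons looks roughly like this; take it as a sketch rather than a literal transcript of what we typed (daemon names can be listed with "ceph orch ps --daemon-type mds"):

      ceph mds repaired 0
      ceph orch daemon restart mds.default.cephmon-03.xcujhz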
However, in the MDS logs we still saw some error messages: "bad backtrace on directory inode"

   VII. Filesystem scrub

We decided to run a filesystem scrub on one of the directories that showed these errors, which we identified by:
rados --cluster ceph -p ssd-rep-metadata-pool listomapvals 10003e45d9f.00000000
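(As an aside, for anyone needing to map such an inode back to a path: the metadata-pool object name is just the inode number in hex plus the fragment suffix, e.g. 0x10003e42340 -> 10003e42340.00000000, and for directories the backtrace is stored in that object's "parent" xattr. A sketch of decoding it, assuming inode_backtrace_t is registered with ceph-dencoder on your build:

      rados -p ssd-rep-metadata-pool getxattr 10003e42340.00000000 parent > /tmp/bt.bin
      ceph-dencoder type inode_backtrace_t import /tmp/bt.bin decode dump_json

The decoded "ancestors" list then gives the dentry names up to the root.)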
We then started the scrub on that directory:

      ceph tell mds.cephfs:0 scrub start /directory/with/bad/backtrace recursive

After this, the cluster jumped to HEALTH_ERR, saying that there is journal damage.
We then decided to run a full filesystem scrub:

      ceph tell mds.cephfs:0 scrub start / recursive,repair,force

It took about 4 h to complete, and from the logs we found about 192 "bad backtrace" inodes that did not have a corresponding "Scrub repaired inode" log message (see the log-grepping sketch below).
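For anyone wanting to do this kind of counting, grepping the MDS log is one way; a rough sketch (the log path is only an example, and the exact message wording may differ on your version, so adjust the patterns to what your log actually shows):

      LOG=/var/log/ceph/ceph-mds.default.cephmon-03.xcujhz.log   # example path
      grep -c 'bad backtrace on directory inode' "$LOG"
      grep -c 'Scrub repaired inode' "$LOG"
      # list the flagged inode numbers:
      grep -o 'bad backtrace on directory inode 0x[0-9a-f]*' "$LOG" | awk '{print $NF}' | sort -u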
We identified 2 home directories that were affected by this. We reran the scrub (recursive,repair,force) on these two directories, and

      ceph tell mds.cephfs:0 damage ls

showed that the "bad backtrace" inodes were now reduced to 68 ("damage_type": "backtrace").
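"damage ls" returns JSON, so the IDs of the remaining backtrace entries can be pulled out with something like this (a sketch, assuming jq is available):

      ceph tell mds.cephfs:0 damage ls |
          jq -r '.[] | select(.damage_type == "backtrace") | .id'

Those IDs are what goes into the "damage rm" step described next.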
We then ran "damage rm" on all these 68 IDs. ceph tell mds.cephfs:0 damage rm <ID> [...] After this, the cluster status went back to HEALTH_OK. VIII. final checkup Before we set the fileystem online again we ran a final checkup: ceph tell mds.cephfs:0 scrub start / recursive,repair,forceThe MDS log showed no more errors, so the filesystem and journal was consistent again.
   IX. New mount option for the clients and MDS settings to mitigate the bug

In order not to be hit by the bug (#61009) again, we set the "-o wsync" mount option for our kernel clients and mds_client_delegate_inos_pct = 0 on our MDS (a sketch follows below).
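Roughly, that amounts to the following; the mount source, mount point and secret file below are placeholders from our setup rather than values to copy verbatim:

      # kernel client, e.g. via /etc/fstab:
      cephmon-01,cephmon-02,cephmon-03:/  /mnt/cephfs  ceph  name=cephfs_user,secretfile=/etc/ceph/cephfs_user.secret,wsync,_netdev  0  0

      # MDS side:
      ceph config set mds mds_client_delegate_inos_pct 0

As far as we understand it, wsync makes namespace operations (create/unlink/rename) synchronous again, avoiding the async-dirop path involved in the bug, and the second setting stops the MDS from delegating preallocated inode ranges to clients.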
Now everything is running fine again, and we were lucky that no data was lost.

   X. Conclusion

Had we been aware of the bug and its mitigation earlier, we would have saved a lot of downtime and some nerves.
Is there an obvious place that I missed where such known issues are prominently made public? (The bug tracker, maybe, but I think it is easy to miss the important ones among all the others.)
Thanks again for all the help,
Dietmar

On 6/19/24 09:43, Dietmar Rieder wrote:
Hello cephers,

we have a degraded filesystem on our ceph 18.2.2 cluster and I need to get it up again.

We have 6 MDS daemons (3 active, each pinned to a subtree, 3 standby).

It started this night; I got the first HEALTH_WARN emails saying:

HEALTH_WARN

--- New ---
[WARN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
    mds.default.cephmon-02.duujba(mds.1): Client apollo-10:cephfs_user failing to respond to cache pressure client_id: 1962074

=== Full health status ===
[WARN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
    mds.default.cephmon-02.duujba(mds.1): Client apollo-10:cephfs_user failing to respond to cache pressure client_id: 1962074

Then it went on with:

HEALTH_WARN

--- New ---
[WARN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded

--- Cleared ---
[WARN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
    mds.default.cephmon-02.duujba(mds.1): Client apollo-10:cephfs_user failing to respond to cache pressure client_id: 1962074

=== Full health status ===
[WARN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded

Then one after another MDS went into error state:

HEALTH_WARN

--- Updated ---
[WARN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
    daemon mds.default.cephmon-01.cepqjp on cephmon-01 is in error state
    daemon mds.default.cephmon-02.duujba on cephmon-02 is in error state
    daemon mds.default.cephmon-03.chjusj on cephmon-03 is in error state
    daemon mds.default.cephmon-03.xcujhz on cephmon-03 is in error state

=== Full health status ===
[WARN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
    daemon mds.default.cephmon-01.cepqjp on cephmon-01 is in error state
    daemon mds.default.cephmon-02.duujba on cephmon-02 is in error state
    daemon mds.default.cephmon-03.chjusj on cephmon-03 is in error state
    daemon mds.default.cephmon-03.xcujhz on cephmon-03 is in error state
[WARN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[WARN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more

In the morning I then tried to restart the MDS daemons in error state, but they kept failing.
I then reduced the number of active MDS to 1:

ceph fs set cephfs max_mds 1

And set the filesystem down:

ceph fs set cephfs down true

I tried to restart the MDS again, but now I'm stuck at the following status:

[root@ceph01-b ~]# ceph -s
  cluster:
    id:     aae23c5c-a98b-11ee-b44d-00620b05cac4
    health: HEALTH_WARN
            4 failed cephadm daemon(s)
            1 filesystem is degraded
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum cephmon-01,cephmon-03,cephmon-02 (age 2w)
    mgr: cephmon-01.dsxcho(active, since 11w), standbys: cephmon-02.nssigg, cephmon-03.rgefle
    mds: 3/3 daemons up
    osd: 336 osds: 336 up (since 11w), 336 in (since 3M)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   4 pools, 6401 pgs
    objects: 284.69M objects, 623 TiB
    usage:   889 TiB used, 3.1 PiB / 3.9 PiB avail
    pgs:     6186 active+clean
             156  active+clean+scrubbing
             59   active+clean+scrubbing+deep

[root@ceph01-b ~]# ceph health detail
HEALTH_WARN 4 failed cephadm daemon(s); 1 filesystem is degraded; insufficient standby MDS daemons available
[WRN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
    daemon mds.default.cephmon-01.cepqjp on cephmon-01 is in error state
    daemon mds.default.cephmon-02.duujba on cephmon-02 is in unknown state
    daemon mds.default.cephmon-03.chjusj on cephmon-03 is in error state
    daemon mds.default.cephmon-03.xcujhz on cephmon-03 is in error state
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more

[root@ceph01-b ~]# ceph fs status
cephfs - 40 clients
======
RANK      STATE                  MDS              ACTIVITY   DNS    INOS   DIRS   CAPS
 0        resolve        default.cephmon-02.nyfook            12.3k  11.8k  3228      0
 1     replay(laggy)     default.cephmon-02.duujba                0      0      0      0
 2        resolve        default.cephmon-01.pvnqad            15.8k   3541   1409      0
         POOL            TYPE     USED  AVAIL
ssd-rep-metadata-pool  metadata   295G  63.5T
  sdd-rep-data-pool      data    10.2T  84.6T
   hdd-ec-data-pool      data     808T  1929T
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)

The end of the log file of the replay(laggy) default.cephmon-02.duujba shows:

[...]
   -11> 2024-06-19T07:12:38.980+0000 7f90fd117700  1 mds.1.journaler.pq(ro) _finish_probe_end write_pos = 8673820672 (header had 8623488918). recovered.
   -10> 2024-06-19T07:12:38.980+0000 7f90fd117700  4 mds.1.purge_queue operator(): open complete
    -9> 2024-06-19T07:12:38.980+0000 7f90fd117700  4 mds.1.purge_queue operator(): recovering write_pos
    -8> 2024-06-19T07:12:39.015+0000 7f9104926700 10 monclient: get_auth_request con 0x55a93ef42c00 auth_method 0
    -7> 2024-06-19T07:12:39.025+0000 7f9105928700 10 monclient: get_auth_request con 0x55a93ef43400 auth_method 0
    -6> 2024-06-19T07:12:39.038+0000 7f90fd117700  4 mds.1.purge_queue _recover: write_pos recovered
    -5> 2024-06-19T07:12:39.038+0000 7f90fd117700  1 mds.1.journaler.pq(ro) set_writeable
    -4> 2024-06-19T07:12:39.044+0000 7f9105127700 10 monclient: get_auth_request con 0x55a93ef43c00 auth_method 0
    -3> 2024-06-19T07:12:39.113+0000 7f9104926700 10 monclient: get_auth_request con 0x55a93ed97000 auth_method 0
    -2> 2024-06-19T07:12:39.123+0000 7f9105928700 10 monclient: get_auth_request con 0x55a93e903c00 auth_method 0
    -1> 2024-06-19T07:12:39.236+0000 7f90fa912700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f90fa912700 time 2024-06-19T07:12:39.235633+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)

 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f910c722e15]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f910c722fdb]
 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x55a93c0de9a5]
 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4207) [0x55a93c3e76e7]
 5: (EUpdate::replay(MDSRank*)+0x61) [0x55a93c3e9f81]
 6: (MDLog::_replay_thread()+0x6c9) [0x55a93c3701d9]
 7: (MDLog::ReplayThread::entry()+0x11) [0x55a93c01e2d1]
 8: /lib64/libpthread.so.0(+0x81ca) [0x7f910b4c81ca]
 9: clone()

     0> 2024-06-19T07:12:39.236+0000 7f90fa912700 -1 *** Caught signal (Aborted) **
 in thread 7f90fa912700 thread_name:md_log_replay

 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
 1: /lib64/libpthread.so.0(+0x12d20) [0x7f910b4d2d20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f910c722e6f]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f910c722fdb]
 6: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x55a93c0de9a5]
 7: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4207) [0x55a93c3e76e7]
 8: (EUpdate::replay(MDSRank*)+0x61) [0x55a93c3e9f81]
 9: (MDLog::_replay_thread()+0x6c9) [0x55a93c3701d9]
 10: (MDLog::ReplayThread::entry()+0x11) [0x55a93c01e2d1]
 11: /lib64/libpthread.so.0(+0x81ca) [0x7f910b4c81ca]
 12: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- logging levels ---
  0/ 5 none   0/ 1 lockdep   0/ 1 context   1/ 1 crush   1/ 5 mds   1/ 5 mds_balancer
  1/ 5 mds_locker   1/ 5 mds_log   1/ 5 mds_log_expire   1/ 5 mds_migrator   0/ 1 buffer   0/ 1 timer
  0/ 1 filer   0/ 1 striper   0/ 1 objecter   0/ 5 rados   0/ 5 rbd   0/ 5 rbd_mirror
  0/ 5 rbd_replay   0/ 5 rbd_pwl   0/ 5 journaler   0/ 5 objectcacher   0/ 5 immutable_obj_cache   0/ 5 client
  1/ 5 osd   0/ 5 optracker   0/ 5 objclass   1/ 3 filestore   1/ 3 journal   0/ 0 ms
  1/ 5 mon   0/10 monc   1/ 5 paxos   0/ 5 tp   1/ 5 auth   1/ 5 crypto
  1/ 1 finisher   1/ 1 reserver   1/ 5 heartbeatmap   1/ 5 perfcounter   1/ 5 rgw   1/ 5 rgw_sync
  1/ 5 rgw_datacache   1/ 5 rgw_access   1/ 5 rgw_dbstore   1/ 5 rgw_flight   1/ 5 javaclient   1/ 5 asok
  1/ 1 throttle   0/ 0 refs   1/ 5 compressor   1/ 5 bluestore   1/ 5 bluefs   1/ 3 bdev
  1/ 5 kstore   4/ 5 rocksdb   4/ 5 leveldb   1/ 5 fuse   2/ 5 mgr   1/ 5 mgrc
  1/ 5 dpdk   1/ 5 eventtrace   1/ 5 prioritycache   0/ 5 test   0/ 5 cephfs_mirror   0/ 5 cephsqlite
  0/ 5 seastore   0/ 5 seastore_onode   0/ 5 seastore_odata   0/ 5 seastore_omap   0/ 5 seastore_tm   0/ 5 seastore_t
  0/ 5 seastore_cleaner   0/ 5 seastore_epm   0/ 5 seastore_lba   0/ 5 seastore_fixedkv_tree   0/ 5 seastore_cache   0/ 5 seastore_journal
  0/ 5 seastore_device   0/ 5 seastore_backref   0/ 5 alienstore   1/ 5 mclock   0/ 5 cyanstore   1/ 5 ceph_exporter
  1/ 5 memstore
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f90fa912700 / md_log_replay
  7f90fb914700 /
  7f90fc115700 / MR_Finisher
  7f90fd117700 / PQ_Finisher
  7f90fe119700 / ms_dispatch
  7f910011d700 / ceph-mds
  7f9102121700 / ms_dispatch
  7f9103123700 / io_context_pool
  7f9104125700 / admin_socket
  7f9104926700 / msgr-worker-2
  7f9105127700 / msgr-worker-1
  7f9105928700 / msgr-worker-0
  7f910d8eab00 / ceph-mds
  max_recent     10000
  max_new        1000
  log_file /var/log/ceph/ceph-mds.default.cephmon-02.duujba.log
--- end dump of recent events ---

I have no idea how to resolve this and would be grateful for any help.

Dietmar
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx