Re: [EXTERN] Re: Urgent help with degraded filesystem needed

Dietmar Rieder <dietmar.rieder@xxxxxxxxxxx> · Wed, 19 Jun 2024 10:13:12 +0200

Hi Xiubo,

On 6/19/24 09:55, Xiubo Li wrote:
Hi Dietmar,

On 6/19/24 15:43, Dietmar Rieder wrote:
Hello cephers,

we have a degraded filesystem on our Ceph 18.2.2 cluster, and I need 
to get it up again.

We have 6 MDS daemons (3 active, each pinned to a subtree, and 3 standby).

It started last night, when I got the first HEALTH_WARN emails saying:

HEALTH_WARN

--- New ---
[WARN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
        mds.default.cephmon-02.duujba(mds.1): Client apollo-10:cephfs_user failing to respond to cache pressure client_id: 1962074

=== Full health status ===
[WARN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
        mds.default.cephmon-02.duujba(mds.1): Client apollo-10:cephfs_user failing to respond to cache pressure client_id: 1962074

then it went on with:

HEALTH_WARN

--- New ---
[WARN] FS_DEGRADED: 1 filesystem is degraded
        fs cephfs is degraded

--- Cleared ---
[WARN] MDS_CLIENT_RECALL: 1 clients failing to respond to cache pressure
        mds.default.cephmon-02.duujba(mds.1): Client apollo-10:cephfs_user failing to respond to cache pressure client_id: 1962074

=== Full health status ===
[WARN] FS_DEGRADED: 1 filesystem is degraded
        fs cephfs is degraded

Then the MDS daemons went into error state one after another:

HEALTH_WARN

--- Updated ---
[WARN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
        daemon mds.default.cephmon-01.cepqjp on cephmon-01 is in error state
        daemon mds.default.cephmon-02.duujba on cephmon-02 is in error state
        daemon mds.default.cephmon-03.chjusj on cephmon-03 is in error state
        daemon mds.default.cephmon-03.xcujhz on cephmon-03 is in error state

=== Full health status ===
[WARN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
        daemon mds.default.cephmon-01.cepqjp on cephmon-01 is in error state
        daemon mds.default.cephmon-02.duujba on cephmon-02 is in error state
        daemon mds.default.cephmon-03.chjusj on cephmon-03 is in error state
        daemon mds.default.cephmon-03.xcujhz on cephmon-03 is in error state
[WARN] FS_DEGRADED: 1 filesystem is degraded
        fs cephfs is degraded
[WARN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
        have 0; want 1 more

In the morning I tried to restart the MDS daemons in error state, but 
they kept failing. I then reduced the number of active MDS daemons to 1:

ceph fs set cephfs max_mds 1

And set the filesystem down

ceph fs set cephfs down true

I tried to restart the MDS again but now I'm stuck at the following 
status:

[root@ceph01-b ~]# ceph -s
  cluster:
    id:     aae23c5c-a98b-11ee-b44d-00620b05cac4
    health: HEALTH_WARN
            4 failed cephadm daemon(s)
            1 filesystem is degraded
            insufficient standby MDS daemons available

  services:
    mon: 3 daemons, quorum cephmon-01,cephmon-03,cephmon-02 (age 2w)
    mgr: cephmon-01.dsxcho(active, since 11w), standbys: cephmon-02.nssigg, cephmon-03.rgefle
    mds: 3/3 daemons up
    osd: 336 osds: 336 up (since 11w), 336 in (since 3M)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   4 pools, 6401 pgs
    objects: 284.69M objects, 623 TiB
    usage:   889 TiB used, 3.1 PiB / 3.9 PiB avail
    pgs:     6186 active+clean
             156  active+clean+scrubbing
             59   active+clean+scrubbing+deep

[root@ceph01-b ~]# ceph health detail
HEALTH_WARN 4 failed cephadm daemon(s); 1 filesystem is degraded; insufficient standby MDS daemons available
[WRN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
    daemon mds.default.cephmon-01.cepqjp on cephmon-01 is in error state
    daemon mds.default.cephmon-02.duujba on cephmon-02 is in unknown state
    daemon mds.default.cephmon-03.chjusj on cephmon-03 is in error state
    daemon mds.default.cephmon-03.xcujhz on cephmon-03 is in error state
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
[root@ceph01-b ~]#
[root@ceph01-b ~]# ceph fs status
cephfs - 40 clients
======
RANK      STATE                 MDS             ACTIVITY    DNS   INOS   DIRS   CAPS
 0       resolve     default.cephmon-02.nyfook            12.3k  11.8k   3228      0
 1    replay(laggy)  default.cephmon-02.duujba                0      0      0      0
 2       resolve     default.cephmon-01.pvnqad            15.8k   3541   1409      0
         POOL            TYPE     USED  AVAIL
ssd-rep-metadata-pool  metadata   295G  63.5T
  sdd-rep-data-pool      data    10.2T  84.6T
   hdd-ec-data-pool      data     808T  1929T
MDS version: ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)

The end of the log file of the replay(laggy) MDS default.cephmon-02.duujba shows:

[...]
   -11> 2024-06-19T07:12:38.980+0000 7f90fd117700  1 mds.1.journaler.pq(ro) _finish_probe_end write_pos = 8673820672 (header had 8623488918). recovered.
   -10> 2024-06-19T07:12:38.980+0000 7f90fd117700  4 mds.1.purge_queue operator(): open complete
    -9> 2024-06-19T07:12:38.980+0000 7f90fd117700  4 mds.1.purge_queue operator(): recovering write_pos
    -8> 2024-06-19T07:12:39.015+0000 7f9104926700 10 monclient: get_auth_request con 0x55a93ef42c00 auth_method 0
    -7> 2024-06-19T07:12:39.025+0000 7f9105928700 10 monclient: get_auth_request con 0x55a93ef43400 auth_method 0
    -6> 2024-06-19T07:12:39.038+0000 7f90fd117700  4 mds.1.purge_queue _recover: write_pos recovered
    -5> 2024-06-19T07:12:39.038+0000 7f90fd117700  1 mds.1.journaler.pq(ro) set_writeable
    -4> 2024-06-19T07:12:39.044+0000 7f9105127700 10 monclient: get_auth_request con 0x55a93ef43c00 auth_method 0
    -3> 2024-06-19T07:12:39.113+0000 7f9104926700 10 monclient: get_auth_request con 0x55a93ed97000 auth_method 0
    -2> 2024-06-19T07:12:39.123+0000 7f9105928700 10 monclient: get_auth_request con 0x55a93e903c00 auth_method 0
    -1> 2024-06-19T07:12:39.236+0000 7f90fa912700 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f90fa912700 time 2024-06-19T07:12:39.235633+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/18.2.2/rpm/el8/BUILD/ceph-18.2.2/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)

 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f910c722e15]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f910c722fdb]
 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x55a93c0de9a5]
 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4207) [0x55a93c3e76e7]
 5: (EUpdate::replay(MDSRank*)+0x61) [0x55a93c3e9f81]
 6: (MDLog::_replay_thread()+0x6c9) [0x55a93c3701d9]
 7: (MDLog::ReplayThread::entry()+0x11) [0x55a93c01e2d1]
 8: /lib64/libpthread.so.0(+0x81ca) [0x7f910b4c81ca]
 9: clone()

     0> 2024-06-19T07:12:39.236+0000 7f90fa912700 -1 *** Caught signal (Aborted) **
 in thread 7f90fa912700 thread_name:md_log_replay

 ceph version 18.2.2 (531c0d11a1c5d39fbfe6aa8a521f023abf3bf3e2) reef (stable)
 1: /lib64/libpthread.so.0(+0x12d20) [0x7f910b4d2d20]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f910c722e6f]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x2a9fdb) [0x7f910c722fdb]
 6: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x55a93c0de9a5]
 7: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4207) [0x55a93c3e76e7]
 8: (EUpdate::replay(MDSRank*)+0x61) [0x55a93c3e9f81]
 9: (MDLog::_replay_thread()+0x6c9) [0x55a93c3701d9]
 10: (MDLog::ReplayThread::entry()+0x11) [0x55a93c01e2d1]
 11: /lib64/libpthread.so.0(+0x81ca) [0x7f910b4c81ca]
 12: clone()
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
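
For readers unfamiliar with the failing check: the backtrace shows
interval_set<inodeno_t>::erase() aborting on ceph_assert(p->first <= start)
while the journal is replayed. The toy Python model below is only a sketch
under stated assumptions, not the Ceph implementation; the class
ToyIntervalSet and its method names are hypothetical. It illustrates how
erasing a range that starts outside every stored interval can trip a check
of this shape.

```python
import bisect


class ToyIntervalSet:
    """Simplified stand-in for an interval set: disjoint (start, length) ranges.

    Assumption: inserted ranges never overlap; adjacent ranges are not merged.
    """

    def __init__(self):
        self._starts = []  # sorted list of interval start values
        self._lens = {}    # start -> length

    def insert(self, start, length):
        bisect.insort(self._starts, start)
        self._lens[start] = length

    def _find_inc(self, start):
        """Return the first interval that contains `start` or begins after it."""
        i = bisect.bisect_right(self._starts, start) - 1
        if i >= 0 and self._starts[i] + self._lens[self._starts[i]] > start:
            return self._starts[i]
        i += 1
        return self._starts[i] if i < len(self._starts) else None

    def erase(self, start, length):
        p = self._find_inc(start)
        assert p is not None, "no interval at or after start"
        # Analogue of the check seen in the log: the interval found must
        # begin at or before the first value to erase.
        assert p <= start, "p->first <= start"
        plen = self._lens[p]
        before = start - p
        assert plen >= before + length, "range extends past the interval"
        after = plen - before - length
        self._starts.remove(p)
        del self._lens[p]
        if before:
            self.insert(p, before)
        if after:
            self.insert(start + length, after)

    def intervals(self):
        return [(s, self._lens[s]) for s in self._starts]


if __name__ == "__main__":
    s = ToyIntervalSet()
    s.insert(100, 10)
    s.insert(200, 10)
    s.erase(102, 3)      # fully inside [100, 110): splits the interval
    print(s.intervals())
    try:
        s.erase(150, 5)  # 150 lies in no interval: the check fails
    except AssertionError as exc:
        print("assertion failed:", exc)
```

In this model the second erase fails because the nearest candidate interval
starts at 200, after the requested start of 150. Whether the same situation
applies to the inode ranges in this journal is not something the log alone
confirms.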

This is a known bug, please see https://tracker.ceph.com/issues/61009.

As a workaround, I am afraid you need to trim the journal logs first and 
then try to restart the MDS daemons. At the same time, please follow 
the workaround in https://tracker.ceph.com/issues/61009#note-26
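
Independently of the workaround in the tracker note (whose steps are not
reproduced here), the CephFS disaster-recovery documentation advises
exporting a copy of the journal before any operation that modifies it. A
minimal dry-run sketch in Python, assuming the filesystem name cephfs and
ranks 0 to 2 as shown in the status output in this thread; the backup
directory /root/journal-backup and the helper name
journal_backup_commands are hypothetical. It only prints
cephfs-journal-tool command lines and executes nothing.

```python
import shlex


def journal_backup_commands(fs_name, ranks, backup_dir):
    """Build (but do not run) inspect and export commands for each MDS rank.

    backup_dir is a hypothetical location; pick one with enough free space.
    """
    commands = []
    for rank in ranks:
        target = f"--rank={fs_name}:{rank}"
        # Read-only integrity check of the rank's journal.
        commands.append(["cephfs-journal-tool", target, "journal", "inspect"])
        # Export a binary copy of the journal to a local file.
        backup_file = f"{backup_dir}/{fs_name}-rank{rank}-journal.bin"
        commands.append(
            ["cephfs-journal-tool", target, "journal", "export", backup_file]
        )
    return commands


if __name__ == "__main__":
    for cmd in journal_backup_commands("cephfs", [0, 1, 2], "/root/journal-backup"):
        # Dry run: print each command so it can be reviewed before use.
        print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the sketch safe to try
anywhere; the actual trimming steps should come from the tracker note and
the developers' advice.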

I see, I'll try to do this. Are there any caveats or issues to expect when 
trimming the journal logs?

Is there a step-by-step guide on how to perform the trimming? Should all 
MDS daemons be stopped beforehand?

Sorry for the many (naive) questions, but I do not want to make any 
mistakes here.

Thanks for your support,

Dietmar

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_mirror
   0/ 5 rbd_replay
   0/ 5 rbd_pwl
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 immutable_obj_cache
   0/ 5 client
   1/ 5 osd
   0/ 5 optracker
   0/ 5 objclass
   1/ 3 filestore
   1/ 3 journal
   0/ 0 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 1 reserver
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/ 5 rgw_sync
   1/ 5 rgw_datacache
   1/ 5 rgw_access
   1/ 5 rgw_dbstore
   1/ 5 rgw_flight
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 compressor
   1/ 5 bluestore
   1/ 5 bluefs
   1/ 3 bdev
   1/ 5 kstore
   4/ 5 rocksdb
   4/ 5 leveldb
   1/ 5 fuse
   2/ 5 mgr
   1/ 5 mgrc
   1/ 5 dpdk
   1/ 5 eventtrace
   1/ 5 prioritycache
   0/ 5 test
   0/ 5 cephfs_mirror
   0/ 5 cephsqlite
   0/ 5 seastore
   0/ 5 seastore_onode
   0/ 5 seastore_odata
   0/ 5 seastore_omap
   0/ 5 seastore_tm
   0/ 5 seastore_t
   0/ 5 seastore_cleaner
   0/ 5 seastore_epm
   0/ 5 seastore_lba
   0/ 5 seastore_fixedkv_tree
   0/ 5 seastore_cache
   0/ 5 seastore_journal
   0/ 5 seastore_device
   0/ 5 seastore_backref
   0/ 5 alienstore
   1/ 5 mclock
   0/ 5 cyanstore
   1/ 5 ceph_exporter
   1/ 5 memstore
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
  7f90fa912700 / md_log_replay
  7f90fb914700 /
  7f90fc115700 / MR_Finisher
  7f90fd117700 / PQ_Finisher
  7f90fe119700 / ms_dispatch
  7f910011d700 / ceph-mds
  7f9102121700 / ms_dispatch
  7f9103123700 / io_context_pool
  7f9104125700 / admin_socket
  7f9104926700 / msgr-worker-2
  7f9105127700 / msgr-worker-1
  7f9105928700 / msgr-worker-0
  7f910d8eab00 / ceph-mds
  max_recent     10000
  max_new         1000
  log_file /var/log/ceph/ceph-mds.default.cephmon-02.duujba.log
--- end dump of recent events ---

I have no idea how to resolve this and would be grateful for any help.

Dietmar

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
