Time Estimation for cephfs-data-scan scan_links

Hello,

I've encountered an issue where a corrupted cache inode in the metadata pool causes an MDS rank to abort in the 'rejoin' state. To address this, I'm following the "USING AN ALTERNATE METADATA POOL FOR RECOVERY" section of the documentation [1].
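
For context, the sequence I'm running is essentially the one from [1] (from memory; the pool and filesystem names below are the placeholders used in the docs, not my actual names):

    ceph osd pool create cephfs_recovery_meta
    ceph fs new cephfs_recovery cephfs_recovery_meta <data_pool> --recover --allow-dangerous-metadata-overlay
    cephfs-table-tool cephfs_recovery:0 reset session
    cephfs-table-tool cephfs_recovery:0 reset snap
    cephfs-table-tool cephfs_recovery:0 reset inode
    cephfs-journal-tool --rank cephfs_recovery:0 journal reset --force
    cephfs-data-scan init --force-init --filesystem cephfs_recovery --alternate-pool cephfs_recovery_meta
    cephfs-data-scan scan_extents --alternate-pool cephfs_recovery_meta --filesystem <original_fs> <data_pool>
    cephfs-data-scan scan_inodes --alternate-pool cephfs_recovery_meta --filesystem <original_fs> --force-corrupt <data_pool>
    cephfs-data-scan scan_links --filesystem cephfs_recovery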

However, the cephfs-data-scan scan_links step has now been running for over 24 hours. The filesystem holds about 35 TB of data, stored with 3x replication, so more than 100 TB of raw data. Does anyone have an estimate of how long this step should take?
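
If it helps frame the question: per [1], the scan_extents and scan_inodes phases can be sharded across parallel workers with --worker_n/--worker_m, e.g. four workers, each in its own shell:

    cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 <data_pool>
    cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 <data_pool>
    cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 <data_pool>
    cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 <data_pool>

I don't see an equivalent worker option for scan_links, so I assume it runs as a single process, which may explain the long runtime. The only rough progress indicator I can think of (my assumption, not something from the docs) is watching the object count of the recovery metadata pool grow via 'ceph df' or 'rados df'.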

Additional detail: the MDS log from the crash:

-9> 2023-10-11T10:13:22.254-0300 7ff901f75700 10 monclient: get_auth_request con 0x559bf41e4400 auth_method 0
-8> 2023-10-11T10:13:22.254-0300 7ff8ff770700 5 mds.barril12 handle_mds_map old map epoch 472481 <= 472481, discarding
-7> 2023-10-11T10:13:22.254-0300 7ff8ff770700 0 mds.0.cache missing dir for * (which maps to *) on [inode 0x10021afaf90 [...392,head] /dbteamvenv/ auth v98534854 snaprealm=0x559bf427ce00 f(v60 m2023-10-06T15:35:03.278089-0300 9=0+9) n(v141971 rc2023-10-09T18:41:19.742089-0300 b1424948533453 139810=131460+8350) (iversion lock) 0x559bf4298580]
-6> 2023-10-11T10:13:22.254-0300 7ff8ff770700 0 mds.0.cache missing dir ino 0x20005dd786b
-5> 2023-10-11T10:13:22.254-0300 7ff902776700 10 monclient: get_auth_request con 0x559bf4142c00 auth_method 0
-4> 2023-10-11T10:13:22.258-0300 7ff902f77700 5 mds.beacon.barril12 received beacon reply up:rejoin seq 4 rtt 1.09601
-3> 2023-10-11T10:13:22.258-0300 7ff8ff770700 -1 ./src/mds/MDCache.cc: In function 'void MDCache::handle_cache_rejoin_weak(ceph::cref_t<MMDSCacheRejoin>&)' thread 7ff8ff770700 time 2023-10-11T10:13:22.259535-0300
./src/mds/MDCache.cc: 4462: FAILED ceph_assert(diri)

ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x124) [0x7ff904a5b282]
 2: /usr/lib/ceph/libceph-common.so.2(+0x25b420) [0x7ff904a5b420]
 3: (MDCache::handle_cache_rejoin_weak(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0x20de) [0x559bf0a9da6e]
 4: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x424) [0x559bf0aa2a64]
 5: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5c0) [0x559bf0930580]
 6: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x58) [0x559bf0930b78]
 7: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x559bf090b5df]
 8: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x468) [0x7ff904ca71d8]
 9: (DispatchQueue::entry()+0x5ef) [0x7ff904ca48df]
 10: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff904d681cd]
 11: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7ff905680ea7]
 12: clone()

-2> 2023-10-11T10:13:22.258-0300 7ff902f77700 10 monclient: get_auth_request con 0x559bf41e4c00 auth_method 0
-1> 2023-10-11T10:13:22.258-0300 7ff902f77700 10 monclient: get_auth_request con 0x559bf41e5400 auth_method 0
 0> 2023-10-11T10:13:22.262-0300 7ff8ff770700 -1 *** Caught signal (Aborted) **
 in thread 7ff8ff770700 thread_name:ms_dispatch

ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
 1: /lib/x86_64-linux-gnu/libpthread.so.0(+0x13140) [0x7ff90568c140]
 2: gsignal()
 3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x17e) [0x7ff904a5b2dc]
 5: /usr/lib/ceph/libceph-common.so.2(+0x25b420) [0x7ff904a5b420]
 6: (MDCache::handle_cache_rejoin_weak(boost::intrusive_ptr<MMDSCacheRejoin const> const&)+0x20de) [0x559bf0a9da6e]
 7: (MDCache::dispatch(boost::intrusive_ptr<Message const> const&)+0x424) [0x559bf0aa2a64]
 8: (MDSRank::_dispatch(boost::intrusive_ptr<Message const> const&, bool)+0x5c0) [0x559bf0930580]
 9: (MDSRankDispatcher::ms_dispatch(boost::intrusive_ptr<Message const> const&)+0x58) [0x559bf0930b78]
 10: (MDSDaemon::ms_dispatch2(boost::intrusive_ptr<Message> const&)+0x1bf) [0x559bf090b5df]
 11: (Messenger::ms_deliver_dispatch(boost::intrusive_ptr<Message> const&)+0x468) [0x7ff904ca71d8]
 12: (DispatchQueue::entry()+0x5ef) [0x7ff904ca48df]
 13: (DispatchQueue::DispatchThread::entry()+0xd) [0x7ff904d681cd]
 14: /lib/x86_64-linux-gnu/libpthread.so.0(+0x7ea7) [0x7ff905680ea7]
 15: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.


Ceph Cluster status:

barril1:~# ceph status
  cluster:
    id:     c30ecc8d-440e-4608-b3fe-5020337ae11d
    health: HEALTH_ERR
            2 filesystems are degraded
            2 filesystems are offline

  services:
    mon: 5 daemons, quorum barril4,barril3,barril2,barril1,urquell (age 32h)
    mgr: barril2(active, since 32h), standbys: barril3, barril4, urquell, barril1
    mds: 0/10 daemons up (10 failed), 9 standby
    osd: 48 osds: 48 up (since 32h), 48 in (since 2M); 22 remapped pgs
    rgw: 4 daemons active (4 hosts, 1 zones)

  data:
    volumes: 0/2 healthy, 2 failed
    pools:   12 pools, 1475 pgs
    objects: 50.89M objects, 72 TiB
    usage:   207 TiB used, 148 TiB / 355 TiB avail
    pgs:     579358/152674596 objects misplaced (0.379%)
             1449 active+clean
             22   active+remapped+backfilling
             4    active+clean+scrubbing+deep

  io:
    client:   7.2 MiB/s rd, 1.2 MiB/s wr, 342 op/s rd, 367 op/s wr
    recovery: 26 MiB/s, 13 keys/s, 26 objects/s

  progress:
    Global Recovery Event (19h)
      [===========================.] (remaining: 17m)



Ceph fs status:

barril1:~# ceph fs status
cephfs - 0 clients
======
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
 1    failed
 2    failed
 3    failed
 4    failed
 5    failed
 6    failed
 7    failed
 8    failed
      POOL          TYPE     USED  AVAIL
cephfs_metadata   metadata  1045G  35.6T
cephfs.c3sl.data    data     114T  35.6T
c3sl - 0 clients
====
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
      POOL          TYPE     USED  AVAIL
cephfs.c3sl.meta  metadata  28.2G  35.6T
cephfs.c3sl.data    data     114T  35.6T
STANDBY MDS
  barril2
  barril4
  barril42
  barril33
  barril13
  barril23
  barril43
  barril1
  barril12
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

ceph health detail:

barril1:~# ceph health detail
HEALTH_ERR 2 filesystems are degraded; 2 filesystems are offline
[WRN] FS_DEGRADED: 2 filesystems are degraded
    fs cephfs is degraded
    fs c3sl is degraded
[ERR] MDS_ALL_DOWN: 2 filesystems are offline
    fs cephfs is offline because no MDS is active for it.
    fs c3sl is offline because no MDS is active for it.


[1]: https://docs.ceph.com/en/reef/cephfs/disaster-recovery-experts/#using-an-alternate-metadata-pool-for-recovery

Best regards,

Odair M. Ditkun Jr


