This is a known issue; please see https://tracker.ceph.com/issues/60986.
If you can reproduce it, please enable the MDS debug logs, which should
help us debug it quickly:
debug_mds = 25
debug_ms = 1
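One possible way to apply these two settings at runtime is sketched below. It assumes the cluster uses the centralized configuration database and that the commands are run from a node with an admin keyring; neither is stated in this thread, so adjust as needed (for example, by putting the same two options in the [mds] section of ceph.conf instead).

```shell
# Assumption: executed on a host with an admin keyring.
# Raise MDS and messenger debug levels for all MDS daemons.
ceph config set mds debug_mds 25
ceph config set mds debug_ms 1

# After reproducing the crash and saving the MDS logs,
# remove the overrides so the logs do not keep growing.
ceph config rm mds debug_mds
ceph config rm mds debug_ms
```

Logging at these levels is very verbose, so it is probably best to keep it enabled only for as long as it takes to capture the crash.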
Thanks
- Xiubo
On 5/7/24 00:26, Robert Sander wrote:
Hi,
a 17.2.7 cluster with two filesystems suddenly has non-working MDSs:
# ceph -s
  cluster:
    id:     f54eea86-265a-11eb-a5d0-457857ba5742
    health: HEALTH_ERR
            22 failed cephadm daemon(s)
            2 filesystems are degraded
            1 mds daemon damaged
            insufficient standby MDS daemons available
  services:
    mon: 5 daemons, quorum ceph00,ceph03,ceph04,ceph01,ceph02 (age 4h)
    mgr: ceph03.odfupq(active, since 4h), standbys: ppc721.vsincn, ceph00.lvbddp, ceph02.zhyxjg, ceph06.eifppc
    mds: 4/5 daemons up
    osd: 145 osds: 145 up (since 20h), 145 in (since 2d)
    rgw: 12 daemons active (4 hosts, 1 zones)
  data:
    volumes: 0/2 healthy, 2 recovering; 1 damaged
    pools:   15 pools, 4897 pgs
    objects: 195.64M objects, 195 TiB
    usage:   617 TiB used, 527 TiB / 1.1 PiB avail
    pgs:     4892 active+clean
             5    active+clean+scrubbing+deep
  io:
    client: 2.5 MiB/s rd, 20 MiB/s wr, 665 op/s rd, 938 op/s wr
# ceph fs status
ABC - 4 clients
===========
RANK  STATE    MDS                ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
 1    resolve  ABC.ceph04.lzlkdu            0    3     1     0
 2    resolve  ABC.ppc721.rzfmyi            0    3     1     0
 3    resolve  ABC.ceph04.jiepaw            249  252   13    0
POOL             TYPE      USED   AVAIL
cephfs.ABC.meta  metadata  33.0G  104T
cephfs.ABC.data  data      390T   104T
DEF - 154 clients
===========
RANK  STATE          MDS                ACTIVITY  DNS    INOS   DIRS  CAPS
 0    rejoin(laggy)  DEF.ceph06.etthum            30.9k  30.8k  5084  0
POOL             TYPE      USED  AVAIL
cephfs.DEF.meta  metadata  190G  104T
cephfs.DEF.data  data      118T  104T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
The first filesystem will not get an MDS in rank 0;
we already tried setting max_mds to 1, but to no avail.
The second filesystem's MDS shows "replay" for a while and then
it crashes in the rejoin phase with:
-92> 2024-05-06T16:07:15.514+0000 7f1927e9d700 1 mds.0.501522 handle_mds_map i am now mds.0.501522
-91> 2024-05-06T16:07:15.514+0000 7f1927e9d700 1 mds.0.501522 handle_mds_map state change up:reconnect --> up:rejoin
-90> 2024-05-06T16:07:15.514+0000 7f1927e9d700 1 mds.0.501522 rejoin_start
-89> 2024-05-06T16:07:15.514+0000 7f1927e9d700 1 mds.0.501522 rejoin_joint_start
-88> 2024-05-06T16:07:15.514+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005bfece err -22/0
-87> 2024-05-06T16:07:15.514+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x30000671eb5 err -22/0
-86> 2024-05-06T16:07:15.514+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005bfed3 err -22/0
-85> 2024-05-06T16:07:15.514+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
-84> 2024-05-06T16:07:15.514+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005b0274 err -22/0
-83> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x30000671eb5 err -22/0
-82> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
-81> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x30000671ebd err -22/-22
-80> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x30000671ecd err -22/-22
-79> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc9ea err -22/-22
-78> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005bfed3 err -22/0
-77> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc9c3 err -22/-22
-76> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc978 err -22/-22
-75> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc99d err -22/-22
-74> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc95b err -22/-22
-73> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc980 err -22/-22
-72> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005b0274 err -22/0
-71> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x20001dc7a7e err -22/-22
-70> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012be364 err -22/-22
-69> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005b2e32 err -22/-22
-68> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x30000671eb5 err -22/0
-67> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
-66> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x30000671ebd err -22/-22
-65> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x30000671ecd err -22/-22
-64> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc9ea err -22/-22
-63> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc978 err -22/-22
-62> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x3000069373a err -22/-22
-61> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012dc5d8 err -22/-22
-60> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a32e8 err -22/-22
-59> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x30000696952 err -22/-22
-58> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
-57> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005bfed3 err -22/0
-56> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc99d err -22/-22
-55> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc980 err -22/-22
-54> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x20001dc7a7e err -22/-22
-53> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a58cf err -22/-22
-52> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012c3a0e err -22/-22
-51> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
-50> 2024-05-06T16:07:15.518+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005b0274 err -22/0
-49> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x20001dc7a7f err -22/-22
-48> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc95b err -22/-22
-47> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005b2e32 err -22/-22
-46> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012be364 err -22/-22
-45> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005b388d err -22/-22
-44> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x10007185ac2 err -22/-22
-43> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
-42> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc980 err -22/-22
-41> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc99d err -22/-22
-40> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a58cf err -22/-22
-39> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012c3a0e err -22/-22
-38> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a58db err -22/-22
-37> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x40000d63bff err -22/-22
-36> 2024-05-06T16:07:15.522+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
-35> 2024-05-06T16:07:15.526+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc95b err -22/-22
-34> 2024-05-06T16:07:15.526+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012be364 err -22/-22
-33> 2024-05-06T16:07:15.526+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x10007185ac2 err -22/-22
-32> 2024-05-06T16:07:15.526+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
-31> 2024-05-06T16:07:15.526+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x10007185ac4 err -22/-22
-30> 2024-05-06T16:07:15.526+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc980 err -22/-22
-29> 2024-05-06T16:07:15.526+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a58cf err -22/-22
-28> 2024-05-06T16:07:15.526+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a58db err -22/-22
-27> 2024-05-06T16:07:15.526+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a58f4 err -22/-22
-26> 2024-05-06T16:07:15.530+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
-25> 2024-05-06T16:07:15.534+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc980 err -22/-22
-24> 2024-05-06T16:07:15.534+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a58cf err -22/-22
-23> 2024-05-06T16:07:15.534+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a58db err -22/-22
-22> 2024-05-06T16:07:15.534+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a58f4 err -22/-22
-21> 2024-05-06T16:07:15.534+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a5fc4 err -22/-22
-20> 2024-05-06T16:07:15.534+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a634d err -22/-22
-19> 2024-05-06T16:07:15.534+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a63bf err -22/-22
-18> 2024-05-06T16:07:15.542+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc94c err -22/0
-17> 2024-05-06T16:07:15.542+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bc980 err -22/-22
-16> 2024-05-06T16:07:15.546+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a58cf err -22/-22
-15> 2024-05-06T16:07:15.550+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a58db err -22/-22
-14> 2024-05-06T16:07:15.550+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a58f4 err -22/-22
-13> 2024-05-06T16:07:15.550+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a5fc4 err -22/-22
-12> 2024-05-06T16:07:15.550+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a634d err -22/-22
-11> 2024-05-06T16:07:15.550+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x400000a63bf err -22/-22
-10> 2024-05-06T16:07:15.550+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bfd9c err -22/-22
-9> 2024-05-06T16:07:15.550+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x200012bfb78 err -22/-22
-8> 2024-05-06T16:07:15.554+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005b0274 err -22/0
-7> 2024-05-06T16:07:15.554+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005b2e32 err -22/-22
-6> 2024-05-06T16:07:15.554+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005b388d err -22/-22
-5> 2024-05-06T16:07:15.562+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005b0274 err -22/0
-4> 2024-05-06T16:07:15.562+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005b2e32 err -22/-22
-3> 2024-05-06T16:07:15.562+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x300005b388d err -22/-22
-2> 2024-05-06T16:07:15.562+0000 7f1921e91700 0 mds.0.cache failed to open ino 0x40000d5a226 err -22/-22
-1> 2024-05-06T16:07:15.634+0000 7f1921e91700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc: In function 'void MDCache::rejoin_send_rejoins()' thread 7f1921e91700 time 2024-05-06T16:07:15.635683+0000
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDCache.cc: 4086: FAILED ceph_assert(auth >= 0)
ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f1930ad94a3]
2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f1930ad9669]
3: (MDCache::rejoin_send_rejoins()+0x216b) [0x5614ac8747eb]
4: (MDCache::process_imported_caps()+0x1993) [0x5614ac872353]
5: (Context::complete(int)+0xd) [0x5614ac6e182d]
6: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
7: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(ceph::common::CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x8d) [0x5614ac6e6f5d]
8: (OpenFileTable::_open_ino_finish(inodeno_t, int)+0x156) [0x5614aca765a6]
9: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
10: (void finish_contexts<std::vector<MDSContext*, std::allocator<MDSContext*> > >(ceph::common::CephContext*, std::vector<MDSContext*, std::allocator<MDSContext*> >&, int)+0x8d) [0x5614ac6e6f5d]
11: (MDCache::open_ino_finish(inodeno_t, MDCache::open_ino_info_t&, int)+0x138) [0x5614ac867168]
12: (MDCache::_open_ino_backtrace_fetched(inodeno_t, ceph::buffer::v15_2_0::list&, int)+0x290) [0x5614ac87ff90]
13: (MDSContext::complete(int)+0x5f) [0x5614aca41f4f]
14: (MDSIOContextBase::complete(int)+0x534) [0x5614aca426e4]
15: (Finisher::finisher_thread_entry()+0x18d) [0x7f1930b7884d]
16: /lib64/libpthread.so.0(+0x81ca) [0x7f192fac81ca]
17: clone()
How do we solve this issue?
Regards