Hi,
I would try to finish the upgrade first and bring all daemons to the
same Ceph version before attempting any recovery. Was this a failed
upgrade attempt?
Can you please share the output of 'ceph -s', 'ceph versions' and
'ceph orch upgrade status'?
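For reference, assuming a cephadm-managed cluster (which the orch
commands imply), these would be:

   ceph -s                    # overall cluster and recovery state
   ceph versions              # running versions, grouped by daemon type
   ceph orch upgrade status   # whether an upgrade is still in progress

If the upgrade stalled, it can usually be resumed by starting it again
with the intended target, e.g. (the version below is a placeholder):

   ceph orch upgrade start --ceph-version 16.2.13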
Quoting Sasha BALLET <balletn@xxxxxxxx>:
Hi,
I'm struggling with a Ceph cluster that doesn't want to finish
recovery. I suspect there are multiple issues at the same time.
So let's start with the most obvious: the MDS daemons are crashing,
which prevents me from mounting my CephFS.
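For context, the filesystem and MDS state can be checked with the
standard commands:

   ceph fs status        # filesystem ranks and which MDS holds them
   ceph health detail    # full health detail, including MDS warnings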
The cluster was deployed on Debian Buster and later upgraded to
Bullseye.
There are 4 nodes, each with 64 OSDs of 12 TB.
I can't tell which Ceph version the cluster is on, because the
services are running different versions (another clue about what
could be wrong). There are some 16.2.4, 16.2.5, 16.2.13 and even
some 17.2.6, and I think I saw some 15.x.x on the OSDs.
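For reference, the per-daemon versions can be listed with (assuming a
cephadm deployment):

   ceph versions   # versions grouped by daemon type
   ceph orch ps    # per-daemon listing; see the VERSION column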
I have this log from the service:
journalctl -xe -u
ceph-938952d4-7775-11eb-9f42-bc97e19a216a@xxxxxxxxxxxxxxx-storage-02.oaelhv.service
[...]
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.4/rpm/el8/BUIL>
ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0)
pacific (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x158) [0x7f850b1ed59c]
2: /usr/lib64/ceph/libceph-common.so.2(+0x2767b6) [0x7f850b1ed7b6]
3: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*,
CDentry*, snapid_t, CInode**, CDentry::linkage_t*)+0x1215)
[0x55768a6ad6b5]
4: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*,
CInode*, snapid_t)+0x105) [0x55768a6ad965]
5: (Locker::check_inode_max_size(CInode*, bool, unsigned long,
unsigned long, utime_t)+0xc39) [0x55768a78fb09]
6: (RecoveryQueue::_recovered(CInode*, int, unsigned long,
utime_t)+0x92d) [0x55768a76377d]
7: (MDSContext::complete(int)+0x56) [0x55768a8c0d46]
8: (MDSIOContextBase::complete(int)+0xa3) [0x55768a8c1073]
9: (Filer::C_Probe::finish(int)+0xb5) [0x55768a974af5]
10: (Context::complete(int)+0xd) [0x55768a5b4b6d]
11: (Finisher::finisher_thread_entry()+0x1a5) [0x7f850b28c9d5]
12: /lib64/libpthread.so.0(+0x814a) [0x7f8509f8f14a]
13: clone()
debug 0> 2023-09-11T12:55:25.765+0000 7f84fc580700 -1 ***
Caught signal (Aborted) **
in thread 7f84fc580700 thread_name:MR_Finisher
ceph version 16.2.4 (3cbe25cde3cfa028984618ad32de9edc4c1eaed0)
pacific (stable)
1: /lib64/libpthread.so.0(+0x12b20) [0x7f8509f99b20]
2: gsignal()
3: abort()
4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x1a9) [0x7f850b1ed5ed]
5: /usr/lib64/ceph/libceph-common.so.2(+0x2767b6) [0x7f850b1ed7b6]
6: (MDCache::journal_cow_dentry(MutationImpl*, EMetaBlob*,
CDentry*, snapid_t, CInode**, CDentry::linkage_t*)+0x1215)
[0x55768a6ad6b5]
7: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*,
CInode*, snapid_t)+0x105) [0x55768a6ad965]
8: (Locker::check_inode_max_size(CInode*, bool, unsigned long,
unsigned long, utime_t)+0xc39) [0x55768a78fb09]
9: (RecoveryQueue::_recovered(CInode*, int, unsigned long,
utime_t)+0x92d) [0x55768a76377d]
10: (MDSContext::complete(int)+0x56) [0x55768a8c0d46]
11: (MDSIOContextBase::complete(int)+0xa3) [0x55768a8c1073]
12: (Filer::C_Probe::finish(int)+0xb5) [0x55768a974af5]
13: (Context::complete(int)+0xd) [0x55768a5b4b6d]
14: (Finisher::finisher_thread_entry()+0x1a5) [0x7f850b28c9d5]
15: /lib64/libpthread.so.0(+0x814a) [0x7f8509f8f14a]
16: clone()
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
-2/-2 (syslog threshold)
99/99 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
140209120859904 /
140209129252608 / md_submit
140209137645312 /
140209146038016 / MR_Finisher
140209162823424 / PQ_Finisher
140209213179648 / ceph-mds
140209229965056 / safe_timer
140209246750464 / ms_dispatch
140209263535872 / io_context_pool
140209280321280 / admin_socket
140209288713984 / msgr-worker-2
140209297106688 / msgr-worker-1
140209305499392 / msgr-worker-0
140209549580160 / ceph-mds
max_recent 10000
max_new 10000
log_file
/var/lib/ceph/crash/2023-09-11T12:55:25.769044Z_5229ea26-9bea-4c49-960a-4501b227e545/log
--- end dump of recent events ---
ceph-938952d4-7775-11eb-9f42-bc97e19a216a@xxxxxxxxxxxxxxx-storage-02.oaelhv.service: Main process exited, code=exited,
status=134/n/a
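The same crash is also recorded by the crash module; the crash ID
matches the directory name in the log path above:

   ceph crash ls
   ceph crash info 2023-09-11T12:55:25.769044Z_5229ea26-9bea-4c49-960a-4501b227e545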
I tried the Advanced: Metadata Repair Tools:
https://docs.ceph.com/en/nautilus/cephfs/disaster-recovery-experts/
The MDS logs "recovery successful", then "has been put in standby",
and then crashes again a few seconds or minutes later.
I don't know what to do now without breaking everything.
Thank you for your help.
Let me know if you need any other information.
Have a nice day,
Sasha
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx