Hello everyone,
We're having an issue with our backup filesystem running Reef (18.2.4)
on an AlmaLinux 9.4 (5.14 kernel) cluster, where our filesystem
"ceph_spare" is constantly in a degraded state:
root@pebbles-n4 11:49 [~]: ceph fs status
ceph_spare - 360 clients
==========
RANK  STATE    MDS         ACTIVITY  DNS    INOS   DIRS   CAPS
 0    replay   pebbles-s2            14.1M  6858k  240k    0
            POOL                TYPE      USED   AVAIL
        mds_spare_fs           metadata  1451G  3216G
ec82_spare_primary_fs_data       data       0   3216G
       ec82pool_spare            data     8820T  4626T
STANDBY MDS
 pebbles-s1
 pebbles-s2
MDS version: ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d) reef (stable)
root@pebbles-n4 11:49 [~]: ceph health detail
HEALTH_WARN 1 filesystem is degraded; 1 large omap objects; 1 MDSs
report oversized cache; 1 MDSs behind on trimming
[WRN] FS_DEGRADED: 1 filesystem is degraded
fs ceph_spare is degraded
[WRN] LARGE_OMAP_OBJECTS: 1 large omap objects
1 large objects found in pool 'mds_spare_fs'
Search the cluster log for 'Large omap object found' for more
details.
[WRN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
mds.pebbles-s2(mds.0): MDS cache is too large (24GB/4GB); 0
inodes in use by clients, 0 stray files
[WRN] MDS_TRIM: 1 MDSs behind on trimming
mds.pebbles-s2(mds.0): Behind on trimming (69293/128)
max_segments: 128, num_segments: 69293
Our MDS is constantly stuck in replay and crashes before finishing,
despite a restart, a reboot, and wiping the client sessions ("ceph
config set mds mds_wipe_sessions true"). When following the replay,
we've noticed that the MDS always crashes just before
"journal_read_pos" reaches "journal_write_pos", at which point the MDS
(with debug_mds set to 10/20) logs the following:
...
-14> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal EMetaBlob.replay dir 0x1002a7c9f3a.010001*
-13> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal EMetaBlob.replay updated dir [dir 0x1002a7c9f3a.010001* /cephfs2-users/crusso/cjrLabArchives/knayde/Data2020/2020March04_LHC2C7_SmallHole_F4MSM_Pu_Process/Polish/job030/MotionCorrectedMovies/ [2,head] auth v=283167 cv=0/0 state=1073741824 f(v0 m2024-09-30T02:20:11.052706+0100 4262=4262+0) n(v37 rc2024-09-30T02:20:11.052706+0100 b4168080204 4262=4262+0) hs=877+853,ss=3+0 dirty=1733 | child=1 0x5642e8d5d680]
-12> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 20 mds.0.cache.dir(0x1002a7c9f3a.010001*) lookup_exact_snap (head, 'FoilHole_4616489_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks_plot_0012.ef980630.partial')
-11> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 20 mds.0.cache.dir(0x1002a7c9f3a.010001*) lookup_exact_snap (head, 'FoilHole_4616489_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks_plot_0012.ef980630.partial')
-10> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 12 mds.0.cache.dir(0x1002a7c9f3a.010001*) add_null_dentry [dentry #0x1/cephfs2-users/crusso/cjrLabArchives/knayde/Data2020/2020March04_LHC2C7_SmallHole_F4MSM_Pu_Process/Polish/job030/MotionCorrectedMovies/FoilHole_4616489_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks_plot_0012.ef980630.partial [126,head] auth NULL (dversion lock) pv=0 v=283167 ino=(nil) state=1073741824 0x564f219fca00]
-9> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal EMetaBlob.replay added (full) [dentry #0x1/cephfs2-users/crusso/cjrLabArchives/knayde/Data2020/2020March04_LHC2C7_SmallHole_F4MSM_Pu_Process/Polish/job030/MotionCorrectedMovies/FoilHole_4616489_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks_plot_0012.ef980630.partial [126,head] auth NULL (dversion lock) v=283166 ino=(nil) state=1610612736 | dirty=1 0x564f219fca00]
-8> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 12 mds.0.cache.dir(0x1002a7c9f3a.010001*) link_primary_inode [dentry #0x1/cephfs2-users/crusso/cjrLabArchives/knayde/Data2020/2020March04_LHC2C7_SmallHole_F4MSM_Pu_Process/Polish/job030/MotionCorrectedMovies/FoilHole_4616489_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks_plot_0012.ef980630.partial [126,head] auth NULL (dversion lock) v=283166 ino=(nil) state=1610612736 | dirty=1 0x564f219fca00] [inode 0x1002c5f1d10 [126,head] #1002c5f1d10 auth v283166 s=0 n(v0 1=1+0) (iversion lock) cr={159821641=0-4194304@125} 0x564f21a00000]
-7> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal
EMetaBlob.replay added [inode 0x1002c5f1d10 [126,head]
/cephfs2-users/crusso/cjrLabArchives/knayde/Data2020/2020March04_LHC2C7_SmallHole_F4MSM_Pu_Process/Polish/job030/MotionCorrectedMovies/FoilHole_4616489_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks_plot_0012.ef980630.partial
auth v283166 s=0 n(v0 1=1+0) (iversion lock)
cr={159821641=0-4194304@125} 0x564f21a00000]
-6> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10
mds.0.cache.ino(0x1002c5f1d10) mark_dirty_parent
-5> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal
EMetaBlob.replay noting opened inode [inode 0x1002c5f1d10 [126,head]
/cephfs2-users/crusso/cjrLabArchives/knayde/Data2020/2020March04_LHC2C7_SmallHole_F4MSM_Pu_Process/Polish/job030/MotionCorrectedMovies/FoilHole_4616489_Data_3448109_3448111_20200306_014931_Fractions_cor_stack_tracks_plot_0012.ef980630.partial
auth v283166 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock)
cr={159821641=0-4194304@125} | dirtyparent=1 dirty=1 0x564f21a00000]
-4> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal
EMetaBlob.replay inotable tablev 1481899 <= table 1481899
-3> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 10 mds.0.journal
EMetaBlob.replay sessionmap v 746010300, table 746010299 prealloc []
used 0x1002c5f1d10
-2> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 20 mds.0.journal
(session prealloc
[0x1002bdf9f74~0x22,0x1002bdfb51d~0x7c,0x1002c59def8~0xfb,0x1002c5eec0c~0x86,0x1002c5eee87~0x1f5])
-1> 2024-10-02T15:20:52.494+0100 7fa8b6d95640 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.4/rpm/el9/BUILD/ceph-18.2.4/src/include/interval_set.h:
In function 'void interval_set<T, C>::erase(T, T,
std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]'
thread 7fa8b6d95640 time 2024-10-02T15:20:52.495403+0100
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/18.2.4/rpm/el9/BUILD/ceph-18.2.4/src/include/interval_set.h:
568: FAILED ceph_assert(p->first <= start)
ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d)
reef (stable)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x12e) [0x7fa8c416b04d]
2: /usr/lib64/ceph/libceph-common.so.2(+0x16b20b) [0x7fa8c416b20b]
3: /usr/bin/ceph-mds(+0x1ef0fe) [0x5642daf490fe]
4: /usr/bin/ceph-mds(+0x1ef145) [0x5642daf49145]
5: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
MDPeerUpdate*)+0x4bad) [0x5642db17b67d]
6: (EUpdate::replay(MDSRank*)+0x5d) [0x5642db18477d]
7: (MDLog::_replay_thread()+0x75e) [0x5642db12d00e]
8: /usr/bin/ceph-mds(+0x13c561) [0x5642dae96561]
9: /lib64/libc.so.6(+0x89c02) [0x7fa8c3889c02]
10: /lib64/libc.so.6(+0x10ec40) [0x7fa8c390ec40]
0> 2024-10-02T15:20:52.495+0100 7fa8b6d95640 -1 *** Caught
signal (Aborted) **
in thread 7fa8b6d95640 thread_name:md_log_replay
ceph version 18.2.4 (e7ad5345525c7aa95470c26863873b581076945d)
reef (stable)
1: /lib64/libc.so.6(+0x3e6f0) [0x7fa8c383e6f0]
2: /lib64/libc.so.6(+0x8b94c) [0x7fa8c388b94c]
3: raise()
4: abort()
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x188) [0x7fa8c416b0a7]
6: /usr/lib64/ceph/libceph-common.so.2(+0x16b20b) [0x7fa8c416b20b]
7: /usr/bin/ceph-mds(+0x1ef0fe) [0x5642daf490fe]
8: /usr/bin/ceph-mds(+0x1ef145) [0x5642daf49145]
9: (EMetaBlob::replay(MDSRank*, LogSegment*, int,
MDPeerUpdate*)+0x4bad) [0x5642db17b67d]
10: (EUpdate::replay(MDSRank*)+0x5d) [0x5642db18477d]
11: (MDLog::_replay_thread()+0x75e) [0x5642db12d00e]
12: /usr/bin/ceph-mds(+0x13c561) [0x5642dae96561]
13: /lib64/libc.so.6(+0x89c02) [0x7fa8c3889c02]
14: /lib64/libc.so.6(+0x10ec40) [0x7fa8c390ec40]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
10/20 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 rbd_pwl
0/ 5 journaler
0/ 5 objectcacher
0/ 5 immutable_obj_cache
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 0 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 rgw_sync
1/ 5 rgw_datacache
1/ 5 rgw_access
1/ 5 rgw_dbstore
1/ 5 rgw_flight
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 fuse
2/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
1/ 5 prioritycache
0/ 5 test
0/ 5 cephfs_mirror
0/ 5 cephsqlite
0/ 5 seastore
0/ 5 seastore_onode
0/ 5 seastore_odata
0/ 5 seastore_omap
0/ 5 seastore_tm
0/ 5 seastore_t
0/ 5 seastore_cleaner
0/ 5 seastore_epm
0/ 5 seastore_lba
0/ 5 seastore_fixedkv_tree
0/ 5 seastore_cache
0/ 5 seastore_journal
0/ 5 seastore_device
0/ 5 seastore_backref
0/ 5 alienstore
1/ 5 mclock
0/ 5 cyanstore
1/ 5 ceph_exporter
1/ 5 memstore
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
--- pthread ID / name mapping for recent threads ---
7fa8b6d95640 / md_log_replay
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-mds.pebbles-s3.log
--- end dump of recent events ---
Our MDS then starts at the beginning of the replay process and
continually re-replays the journal until it crashes again at the same point.
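
For reference, the debug level and the replay positions mentioned above
were gathered with commands along these lines (run on the active MDS
host; the daemon and filesystem names are ours, so adjust as needed):

# raise MDS debugging cluster-wide so the EMetaBlob.replay lines show up
ceph config set mds debug_mds "10/20"

# poll the replaying rank via its admin socket; the replay_status section
# of the output contains journal_read_pos / journal_write_pos
watch -n 5 'ceph daemon mds.pebbles-s2 status | grep -E "journal_(read|write)_pos"'

# the journal positions are also visible in the mds_log perf counters
ceph daemon mds.pebbles-s2 perf dump mds_log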
From what I understand, it looks like our journal has become corrupted
at this file, and the journal is (worryingly) exceptionally large: we've
had to use a machine with 2 TiB of storage just to try to export it.
What is causing this issue? Can we make small modifications to the
journal (or something similar) to rectify it, or move the faulty object
backing the journal out of the object store so that the transaction
fails and is skipped? We really do not want to go through the
disaster-recovery procedure again
(https://docs.ceph.com/en/reef/cephfs/disaster-recovery-experts/#disaster-recovery-experts),
as this is the second time this has happened to this cluster in the last
four months, and it took over a month to recover the data last time.
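
For context, the only things we have done to the journal so far are
read-only cephfs-journal-tool operations, roughly along these lines (the
export is the step that needed the ~2 TiB of scratch space; the output
path is just an example):

# sanity-check the journal without modifying it
cephfs-journal-tool --rank=ceph_spare:0 journal inspect

# take a raw backup of the rank 0 journal before anything invasive
cephfs-journal-tool --rank=ceph_spare:0 journal export /scratch/ceph_spare-rank0.journal

# summarise the events in the journal by type
cephfs-journal-tool --rank=ceph_spare:0 event get summary

What we are wondering about, but have deliberately NOT run, is whether
something like "cephfs-journal-tool event splice" with a selector for
just the faulty events (rather than a full journal reset) would be a
sane way to skip the offending transaction.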
Kindest regards,
Ivan
--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH