Ugh, something went horribly wrong. I've downloaded the MDS logs
that contain the assertion failure, and it looks relevant to this [0].
Do you have client logs for this?
The other log that you shared is downloading right now; once
that's done and I've been through it, I'll update you.
[0] https://tracker.ceph.com/issues/54546
On Mon, Jul 8, 2024 at 4:49 PM Ivan Clayson
<ivan@xxxxxxxxxxxxxxxxx> wrote:
Hi Dhairya,
Sorry to resurrect this thread again, but we still
unfortunately have an issue with our filesystem after we
attempted to write new backups to it.
We finished the scrub of the filesystem on Friday and ran a
repair scrub on the 1 directory which had metadata damage.
After doing so and rebooting, the cluster reported no issues
and data was accessible again.
We re-started the backups to run over the weekend and
unfortunately the filesystem crashed again where the log of
the failure is here:
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s2.log-20240708.gz.
We ran the backups on kernel mounts of the filesystem without
the nowsync option this time to avoid the out-of-sync write
problems.
I've tried resetting the journal again after recovering the
dentries but unfortunately the filesystem is still in a
failed state despite setting joinable to true. The log of
this crash is here:
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s4.log-20240708.
I'm not sure how to proceed as I can't seem to get any MDS to
take over the first rank. I would like to do a scrub of the
filesystem and preferably overwrite the troublesome files
with the originals on the live filesystem. Do you have any
advice on how to make the filesystem leave its failed state?
I have a backup of the journal before I reset it so I can
roll back if necessary.
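For reference, the journal backup and a rollback would be done with
cephfs-journal-tool's export/import along these lines (the file path
here is only an example):
$ cephfs-journal-tool --rank=ceph_backup:0 journal export /root/ceph_backup.rank0.journal.bin   # binary dump taken before the reset
$ cephfs-journal-tool --rank=ceph_backup:0 journal import /root/ceph_backup.rank0.journal.bin   # re-import the dump to roll back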
Here are some details about the filesystem at present:
root@pebbles-s2 11:49 [~]: ceph -s; ceph fs status
  cluster:
    id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_ERR
            1 filesystem is degraded
            1 large omap objects
            1 filesystem is offline
            1 mds daemon damaged
            nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim flag(s) set
            1750 pgs not deep-scrubbed in time
            1612 pgs not scrubbed in time
  services:
    mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 50m)
    mgr: pebbles-s2(active, since 77m), standbys: pebbles-s1, pebbles-s3, pebbles-s4
    mds: 1/2 daemons up, 3 standby
    osd: 1380 osds: 1380 up (since 76m), 1379 in (since 10d); 10 remapped pgs
         flags nobackfill,norebalance,norecover,noscrub,nodeep-scrub,nosnaptrim
  data:
    volumes: 1/2 healthy, 1 recovering; 1 damaged
    pools:   7 pools, 2177 pgs
    objects: 3.24G objects, 6.7 PiB
    usage:   8.6 PiB used, 14 PiB / 23 PiB avail
    pgs:     11785954/27384310061 objects misplaced (0.043%)
             2167 active+clean
             6    active+remapped+backfilling
             4    active+remapped+backfill_wait
ceph_backup - 0 clients
===========
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
        POOL            TYPE     USED   AVAIL
    mds_backup_fs      metadata  1174G  3071G
ec82_primary_fs_data     data       0   3071G
      ec82pool           data    8085T  4738T
ceph_archive - 2 clients
============
RANK  STATE     MDS       ACTIVITY     DNS    INOS  DIRS  CAPS
 0    active  pebbles-s4  Reqs: 0 /s  13.4k   7105   118     2
        POOL            TYPE     USED   AVAIL
   mds_archive_fs      metadata  5184M  3071G
ec83_primary_fs_data     data       0   3071G
      ec83pool           data     138T  4307T
STANDBY MDS
 pebbles-s2
 pebbles-s3
 pebbles-s1
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
root@pebbles-s2 11:55 [~]: ceph fs dump
e2643889
enable_multiple, ever_enabled_multiple: 1,1
default compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
legacy client fscid: 1
Filesystem 'ceph_backup' (1)
fs_name ceph_backup
epoch 2643888
flags 12 joinable allow_snaps allow_multimds_snaps
created 2023-05-19T12:52:36.302135+0100
modified 2024-07-08T11:17:55.437861+0100
tableserver 0
root 0
session_timeout 60
session_autoclose 300
max_file_size 109934182400000
required_client_features {}
last_failure 0
last_failure_osd_epoch 494515
compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
max_mds 1
in 0
up {}
failed
damaged 0
stopped
data_pools [6,3]
metadata_pool 2
inline_data disabled
balancer
standby_count_wanted 1
Kindest regards,
Ivan
On 28/06/2024 15:17, Dhairya Parmar wrote:
On Fri, Jun 28, 2024 at 6:02 PM Ivan Clayson
<ivan@xxxxxxxxxxxxxxxxx> wrote:
Hi Dhairya,
I would be more than happy to share our corrupted
journal. Has the host key changed for drop.ceph.com?
The fingerprint I'm being sent is
7T6dSMcUUa5refV147WEZR99UgW8Y1qYEXZr8ppvog4, which is
different from the one in our
/usr/share/ceph/known_hosts_drop.ceph.com.
Ah, strange. Let me get in touch with folks who might know
about this and will get back to you ASAP.
Thank you for your advice as well. We've reset our MDS'
journal and are currently in the process of a full
filesystem scrub which understandably is taking quite a
bit of time but seems to be progressing through the
objects fine.
YAY!
Thank you ever so much for all your help and please do
feel free to follow up with us if you would like any
further details about our crash!
Glad to hear it went well, this bug is being worked on with
high priority and once the patch is ready, it will be
backported.
The root cause of this issue is `nowsync` (async dirops)
being enabled by default in the kernel client [0]. This feature
allows asynchronous creation and deletion of files,
optimizing performance by avoiding round-trip latency for
these system calls. However, in very rare cases (like yours
:D), it can affect the system's consistency and stability.
Hence, if this kind of optimization is not a priority for
your workload, I recommend turning it off by switching
the mount points to `wsync` and also setting the MDS config
`mds_client_delegate_inos_pct` to `0`, so that you don't end
up in this situation again (until the bug fix arrives :)).
[0]
https://github.com/ceph/ceph-client/commit/f7a67b463fb83a4b9b11ceaa8ec4950b8fb7f902
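A rough sketch of those two changes (the monitor address, mount
point and client name below are placeholders; adapt them to your
environment):
$ mount -t ceph <mon-host>:/ /mnt/cephfs -o name=<client>,wsync   # kernel mount with synchronous dirops instead of the default nowsync
$ ceph config set mds mds_client_delegate_inos_pct 0              # stop delegating preallocated inode ranges to clients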
Kindest regards,
Ivan
On 27/06/2024 12:39, Dhairya Parmar wrote:
Hi Ivan,
The solution (which has been successful for us in the
past) is to reset the journal. This would bring the fs
back online and return the MDSes to a stable state, but
some data would be lost—the data in the journal that
hasn't been flushed to the backing store would be gone.
Therefore, you should try to flush out as much journal
data as possible before resetting the journal.
Here are the steps for this entire process:
1) Bring the FS offline
   $ ceph fs fail <fs_name>
2) Recover dentries from the journal (run it for every MDS rank)
   $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> event recover_dentries summary
3) Reset the journal (again for every MDS rank)
   $ cephfs-journal-tool --rank=<fs_name>:<mds-rank> journal reset
4) Bring the FS online
   $ ceph fs set <fs_name> joinable true
5) Restart the MDSes
6) Perform a scrub to ensure consistency of the fs
   $ ceph tell mds.<fs_name>:0 scrub start <path> [scrubopts] [tag]
   # you could try a recursive scrub maybe `ceph tell mds.<fs_name>:0 scrub start / recursive`
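Purely as an illustration, a run of the above against a single-rank
filesystem named ceph_backup would look roughly like this (the
restart step depends on how your MDS daemons are deployed):
$ ceph fs fail ceph_backup
$ cephfs-journal-tool --rank=ceph_backup:0 event recover_dentries summary
$ cephfs-journal-tool --rank=ceph_backup:0 journal reset
$ ceph fs set ceph_backup joinable true
# restart the MDS daemons, e.g. `systemctl restart ceph-mds.target` on each MDS host for a package-based install
$ ceph tell mds.ceph_backup:0 scrub start / recursive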
Some important notes to keep in mind:
* Recovering dentries will take time (generally, rank 0
is the most time-consuming, but the rest should be quick).
* cephfs-journal-tool and the metadata OSDs are bound to
use a significant amount of CPU. This is because
cephfs-journal-tool has to read through the journal data and
flush it out to the backing store, which also drives a burst
of metadata operations, resulting in the OSDs consuming a
significant percentage of CPU.
Do let me know how this goes.
On Thu, Jun 27, 2024 at 3:44 PM Ivan Clayson
<ivan@xxxxxxxxxxxxxxxxx> wrote:
Hi Dhairya,
We can induce the crash by simply restarting the
MDS, and the crash seems to happen when an MDS goes
from up:standby to up:replay. The MDS works through
a few files in the log before eventually crashing;
I've included the logs for this here (this is
after I imported the backed-up journal, which I hope
was successful, but please let me know if you
suspect it wasn't!):
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.mds_restart_crash.log
With respect to the client logs, are you referring
to the clients that are writing to the filesystem?
We don't typically run them in any sort of debug
mode, and we have quite a few machines running our
backup system, but we can look an hour or so before
the first MDS crash (though I don't know if this is
when the de-sync occurred). Here are some MDS logs
relating to the initial crash on Saturday morning,
though, which may be helpful:
-59> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: tick
-58> 2024-06-22T05:41:43.090+0100 7f184ce82700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-06-22T05:41:13.091556+0100)
-57> 2024-06-22T05:41:43.208+0100 7f184de84700  1 mds.pebbles-s2 Updating MDS map to version 2529650 from mon.3
-56> 2024-06-22T05:41:43.208+0100 7f184de84700  4 mds.0.purge_queue operator(): data pool 6 not found in OSDMap
-55> 2024-06-22T05:41:43.208+0100 7f184de84700  4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap
-54> 2024-06-22T05:41:43.209+0100 7f184de84700  5 asok(0x5592e7968000) register_command objecter_requests hook 0x5592e78f8800
-53> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: _renew_subs
-52> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
-51> 2024-06-22T05:41:43.209+0100 7f184de84700 10 log_channel(cluster) update_config to_monitors: true to_syslog: false syslog_facility: prio: info to_graylog: false graylog_host: 127.0.0.1 graylog_port: 12201)
-50> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.purge_queue operator(): data pool 6 not found in OSDMap
-49> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.purge_queue operator(): data pool 3 not found in OSDMap
-48> 2024-06-22T05:41:43.209+0100 7f184de84700  4 mds.0.0 apply_blocklist: killed 0, blocklisted sessions (0 blocklist entries, 0)
-47> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650 handle_mds_map i am now mds.0.2529650
-46> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650 handle_mds_map state change up:standby --> up:replay
-45> 2024-06-22T05:41:43.209+0100 7f184de84700  5 mds.beacon.pebbles-s2 set_want_state: up:standby -> up:replay
-44> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650 replay_start
-43> 2024-06-22T05:41:43.209+0100 7f184de84700  1 mds.0.2529650 waiting for osdmap 473739 (which blocklists prior instance)
-42> 2024-06-22T05:41:43.209+0100 7f184de84700 10 monclient: _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
-41> 2024-06-22T05:41:43.209+0100 7f1849e7c700  2 mds.0.cache Memory usage: total 299012, rss 37624, heap 182556, baseline 182556, 0 / 0 inodes have caps, 0 caps, 0 caps per inode
-40> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: _renew_subs
-39> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: _send_mon_message to mon.pebbles-s4 at v2:10.1.5.134:3300/0
-38> 2024-06-22T05:41:43.224+0100 7f184de84700 10 monclient: handle_get_version_reply finishing 1 version 473739
-37> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650 Booting: 0: opening inotable
-36> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650 Booting: 0: opening sessionmap
-35> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650 Booting: 0: opening mds log
-34> 2024-06-22T05:41:43.224+0100 7f1847e78700  5 mds.0.log open discovering log bounds
-33> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650 Booting: 0: opening purge queue (async)
-32> 2024-06-22T05:41:43.224+0100 7f1847e78700  4 mds.0.purge_queue open: opening
-31> 2024-06-22T05:41:43.224+0100 7f1847e78700  1 mds.0.journaler.pq(ro) recover start
-30> 2024-06-22T05:41:43.224+0100 7f1847e78700  1 mds.0.journaler.pq(ro) read_head
-29> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650 Booting: 0: loading open file table (async)
-28> 2024-06-22T05:41:43.224+0100 7f1847e78700  2 mds.0.2529650 Booting: 0: opening snap table
-27> 2024-06-22T05:41:43.224+0100 7f1847677700  4 mds.0.journalpointer Reading journal pointer '400.00000000'
-26> 2024-06-22T05:41:43.224+0100 7f1850689700 10 monclient: get_auth_request con 0x5592e8987000 auth_method 0
-25> 2024-06-22T05:41:43.225+0100 7f1850e8a700 10 monclient: get_auth_request con 0x5592e8987c00 auth_method 0
-24> 2024-06-22T05:41:43.252+0100 7f1848e7a700  1 mds.0.journaler.pq(ro) _finish_read_head loghead(trim 231160676352, expire 231163662875, write 231163662875, stream_format 1). probing for end of log (from 231163662875)...
-23> 2024-06-22T05:41:43.252+0100 7f1848e7a700  1 mds.0.journaler.pq(ro) probing for end of the log
-22> 2024-06-22T05:41:43.252+0100 7f1847677700  1 mds.0.journaler.mdlog(ro) recover start
-21> 2024-06-22T05:41:43.252+0100 7f1847677700  1 mds.0.journaler.mdlog(ro) read_head
-20> 2024-06-22T05:41:43.252+0100 7f1847677700  4 mds.0.log Waiting for journal 0x200 to recover...
-19> 2024-06-22T05:41:43.252+0100 7f1850689700 10 monclient: get_auth_request con 0x5592e8bc6000 auth_method 0
-18> 2024-06-22T05:41:43.253+0100 7f185168b700 10 monclient: get_auth_request con 0x5592e8bc6800 auth_method 0
-17> 2024-06-22T05:41:43.257+0100 7f1847e78700  1 mds.0.journaler.mdlog(ro) _finish_read_head loghead(trim 90131453181952, expire 90131465778558, write 90132009715463, stream_format 1). probing for end of log (from 90132009715463)...
-16> 2024-06-22T05:41:43.257+0100 7f1847e78700  1 mds.0.journaler.mdlog(ro) probing for end of the log
-15> 2024-06-22T05:41:43.257+0100 7f1847e78700  1 mds.0.journaler.mdlog(ro) _finish_probe_end write_pos = 90132019384791 (header had 90132009715463). recovered.
-14> 2024-06-22T05:41:43.257+0100 7f1847677700  4 mds.0.log Journal 0x200 recovered.
-13> 2024-06-22T05:41:43.257+0100 7f1847677700  4 mds.0.log Recovered journal 0x200 in format 1
-12> 2024-06-22T05:41:43.273+0100 7f1848e7a700  1 mds.0.journaler.pq(ro) _finish_probe_end write_pos = 231163662875 (header had 231163662875). recovered.
-11> 2024-06-22T05:41:43.273+0100 7f1848e7a700  4 mds.0.purge_queue operator(): open complete
-10> 2024-06-22T05:41:43.273+0100 7f1848e7a700  1 mds.0.journaler.pq(ro) set_writeable
 -9> 2024-06-22T05:41:43.441+0100 7f1847e78700  2 mds.0.2529650 Booting: 1: loading/discovering base inodes
 -8> 2024-06-22T05:41:43.441+0100 7f1847e78700  0 mds.0.cache creating system inode with ino:0x100
 -7> 2024-06-22T05:41:43.442+0100 7f1847e78700  0 mds.0.cache creating system inode with ino:0x1
 -6> 2024-06-22T05:41:43.442+0100 7f1847e78700  2 mds.0.2529650 Booting: 2: replaying mds log
 -5> 2024-06-22T05:41:43.442+0100 7f1847e78700  2 mds.0.2529650 Booting: 2: waiting for purge queue recovered
 -4> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: tick
 -3> 2024-06-22T05:41:44.090+0100 7f184ce82700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2024-06-22T05:41:14.091638+0100)
 -2> 2024-06-22T05:41:44.210+0100 7f1849e7c700  2 mds.0.cache Memory usage: total 588368, rss 308304, heap 207132, baseline 182556, 0 / 15149 inodes have caps, 0 caps, 0 caps per inode
 -1> 2024-06-22T05:41:44.642+0100 7f1846675700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f1846675700 time 2024-06-22T05:41:44.643146+0100
 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f18568b64a3]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669]
 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885]
 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
 5: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
 6: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
 7: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
 8: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
 9: clone()
  0> 2024-06-22T05:41:44.643+0100 7f1846675700 -1 *** Caught signal (Aborted) **
 in thread 7f1846675700 thread_name:md_log_replay
 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: /lib64/libpthread.so.0(+0x12cf0) [0x7f18558aecf0]
 2: gsignal()
 3: abort()
 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x18f) [0x7f18568b64fd]
 5: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f18568b6669]
 6: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5592e5027885]
 7: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) [0x5592e532c7b7]
 8: (EUpdate::replay(MDSRank*)+0x61) [0x5592e5330bd1]
 9: (MDLog::_replay_thread()+0x7bb) [0x5592e52b754b]
 10: (MDLog::ReplayThread::entry()+0x11) [0x5592e4f6a041]
 11: /lib64/libpthread.so.0(+0x81ca) [0x7f18558a41ca]
 12: clone()
We normally run with a relatively low debug setting, so
I don't think many details of the initial crash
were captured unfortunately, and the MDS logs before
the above (i.e. "-60" and older) are just beacon
messages and _check_auth_rotating checks.
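For any future occurrence we could raise the MDS debug level
beforehand; a sketch of how that is usually done with the ceph
CLI (assuming the standard debug_mds/debug_ms options and their
defaults):
$ ceph config set mds debug_mds 20   # verbose MDS logging
$ ceph config set mds debug_ms 1     # messenger-level logging
# ...capture the crash, then drop back to the defaults:
$ ceph config set mds debug_mds 1/5
$ ceph config set mds debug_ms 0/0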
I was wondering whether you have any
recommendations in terms of what actions we could
take to bring our filesystem back into a working
state short of rebuilding the entire metadata pool?
We are quite keen to bring our backup back into
service urgently as we currently do not have any
accessible backups for our Ceph clusters.
Kindest regards,
Ivan
On 25/06/2024 19:18, Dhairya Parmar wrote:
On Tue, Jun 25, 2024 at 6:38 PM Ivan Clayson
<ivan@xxxxxxxxxxxxxxxxx> wrote:
Hi Dhairya,
Thank you for your rapid reply. I tried
recovering the dentries for the file just
before the crash I mentioned before and then
splicing those transactions from the journal,
which seemed to remove the issue for that
inode but resulted in the MDS crashing on the
next inode in the journal when performing replay.
The MDS delegates a range of preallocated inodes
(in the form of a set, interval_set<inodeno_t>
preallocated_inos) to the clients, so it could be
one untracked inode, some inodes from the range,
or in the worst-case scenario ALL of them, and
this is something that even `cephfs-journal-tool`
would not be able to tell (since we're talking
about MDS internals which aren't exposed to such
tools). That is the reason why you see the "MDS
crashing on the next inode in the journal when
performing replay".
An option could be to expose the inode set to some
tool or asok command to identify such inode ranges,
which needs to be discussed. For now, we're trying
to address this in [0]; you can follow the
discussion there.
[0] https://tracker.ceph.com/issues/66251
Removing all the transactions involving the
directory housing the files that seemed to
cause these crashes from the journal only
caused the MDS to fail to even start replay.
I've rolled back our journal to the original
version from when the crash first happened, and
the entire MDS log for the crash can be found
here:
https://www.mrc-lmb.cam.ac.uk/scicomp/data/uploads/ceph/ceph-mds.pebbles-s3.flush_journal.log-25-06-24
Awesome, this would help us a ton. Apart from
this, would it be possible to send us client logs?
Please let us know if you would like any other
logs file as we can easily induce this crash.
Since you can easily induce the crash, can you
please share the reproducer, i.e. what actions
you take in order to hit this?
Kindest regards,
Ivan
On 25/06/2024 09:58, Dhairya Parmar wrote:
Hi Ivan,
This looks to be similar to the issue [0]
that we're already addressing in [1]. Basically,
some out-of-sync event led the client to make use
of inodes that the MDS wasn't aware of/isn't
tracking, hence the crash. It'd be really helpful
if you could provide us with more logs.
CC @Rishabh Dave @Venky Shankar @Patrick Donnelly @Xiubo Li
[0] https://tracker.ceph.com/issues/61009
[1] https://tracker.ceph.com/issues/66251
--
Dhairya Parmar
Associate Software Engineer, CephFS
IBM, Inc.
On Mon, Jun 24, 2024 at 8:54 PM Ivan Clayson
<ivan@xxxxxxxxxxxxxxxxx> wrote:
Hello,
We have been experiencing a serious issue
with our CephFS backup cluster
running quincy (version 17.2.7) on a
RHEL8-derivative Linux kernel
(Alma8.9, 4.18.0-513.9.1 kernel) where
our MDSes for our filesystem are
constantly in a "replay" or
"replay(laggy)" state and keep crashing.
We have a single MDS filesystem called
"ceph_backup" with 2 standby
MDSes along with a 2nd unused filesystem
"ceph_archive" (this holds
little to no data) where we are using our
"ceph_backup" filesystem to
backup our data and this is the one which
is currently broken. The Ceph
health outputs currently are:
root@pebbles-s1 14:05 [~]: ceph -s
  cluster:
    id:     e3f7535e-d35f-4a5d-88f0-a1e97abcd631
    health: HEALTH_WARN
            1 filesystem is degraded
            insufficient standby MDS daemons available
            1319 pgs not deep-scrubbed in time
            1054 pgs not scrubbed in time
  services:
    mon: 4 daemons, quorum pebbles-s1,pebbles-s2,pebbles-s3,pebbles-s4 (age 36m)
    mgr: pebbles-s2(active, since 36m), standbys: pebbles-s4, pebbles-s3, pebbles-s1
    mds: 2/2 daemons up
    osd: 1380 osds: 1380 up (since 29m), 1379 in (since 3d); 37 remapped pgs
  data:
    volumes: 1/2 healthy, 1 recovering
    pools:   7 pools, 2177 pgs
    objects: 3.55G objects, 7.0 PiB
    usage:   8.9 PiB used, 14 PiB / 23 PiB avail
    pgs:     83133528/30006841533 objects misplaced (0.277%)
             2090 active+clean
             47   active+clean+scrubbing+deep
             29   active+remapped+backfilling
             8    active+remapped+backfill_wait
             2    active+clean+scrubbing
             1    active+clean+snaptrim
  io:
    recovery: 1.9 GiB/s, 719 objects/s
root@pebbles-s1 14:09 [~]: ceph fs status
ceph_backup - 0 clients
===========
RANK      STATE         MDS      ACTIVITY  DNS  INOS  DIRS  CAPS
 0    replay(laggy)  pebbles-s3              0     0     0     0
        POOL            TYPE     USED   AVAIL
   mds_backup_fs       metadata  1255G  2780G
ec82_primary_fs_data     data       0   2780G
      ec82pool           data    8442T  3044T
ceph_archive - 2 clients
============
RANK  STATE     MDS       ACTIVITY     DNS    INOS  DIRS  CAPS
 0    active  pebbles-s2  Reqs: 0 /s  13.4k   7105   118     2
        POOL            TYPE     USED   AVAIL
   mds_archive_fs      metadata  5184M  2780G
ec83_primary_fs_data     data       0   2780G
      ec83pool           data     138T  2767T
MDS version: ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
root@pebbles-s1 14:09 [~]: ceph health detail | head
HEALTH_WARN 1 filesystem is degraded; insufficient standby MDS daemons available; 1319 pgs not deep-scrubbed in time; 1054 pgs not scrubbed in time
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs ceph_backup is degraded
[WRN] MDS_INSUFFICIENT_STANDBY: insufficient standby MDS daemons available
    have 0; want 1 more
When our cluster first ran after a
reboot, Ceph ran through the 2
standby MDSes, crashing them all, until
it reached the final MDS and is
now stuck in this "replay(laggy)" state.
Putting our MDSes into
debugging mode, we can see that this MDS
crashed when replaying the
journal for a particular inode (this is
the same for all the MDSes and
they all crash on the same object):
...
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay for [521,head] had [inode 0x1005ba89481 [...539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ auth fragtree_t(*^2 00*^3 00000*^4 00001*^3 00010*^4 00011*^4 00100*^4 00101*^4 00110*^4 00111*^4 01*^3 01000*^4 01001*^3 01010*^4 01011*^3 01100*^4 01101*^4 01110*^4 01111*^4 10*^3 10000*^4 10001*^4 10010*^4 10011*^4 10100*^4 10101*^3 10110*^4 10111*^4 11*^6) v10880645 f(v0 m2024-06-22T05:41:10.213700+0100 1281276=1281276+0) n(v12 rc2024-06-22T05:41:10.213700+0100 b1348251683896 1281277=1281276+1) old_inodes=8 (iversion lock) | dirfrag=416 dirty=1 0x55770a2bdb80]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay dir 0x1005ba89481.011011000*
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay updated dir [dir 0x1005ba89481.011011000* /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/ [2,head] auth v=436385 cv=0/0 state=1073741824 f(v0 m2024-06-22T05:41:10.213700+0100 2502=2502+0) n(v12 rc2024-06-22T05:41:10.213700+0100 b2120744220 2502=2502+0) hs=32+33,ss=0+0 dirty=65 | child=1 0x55770ebcda80]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay added (full) [dentry #0x1/cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial [539,head] auth NULL (dversion lock) v=436384 ino=(nil) state=1610612800|bottomlru | dirty=1 0x557710444500]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay added [inode 0x1005cd4fe35 [539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 s=0 n(v0 1=1+0) (iversion lock) cr={99995144=0-4194304@538} 0x557710438680]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.cache.ino(0x1005cd4fe35) mark_dirty_parent
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay noting opened inode [inode 0x1005cd4fe35 [539,head] /cephfs-users/afellows/Ferdos/20210625_real_DDFHFKLMT_KriosIII_K3/cryolo/test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial auth v436384 DIRTYPARENT s=0 n(v0 1=1+0) (iversion lock) cr={99995144=0-4194304@538} | dirtyparent=1 dirty=1 0x557710438680]
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay inotable tablev 3112837 <= table 3112837
2024-06-24T13:44:55.563+0100 7f8811c40700 10 mds.0.journal EMetaBlob.replay sessionmap v 1560540883, table 1560540882 prealloc [] used 0x1005cd4fe35
2024-06-24T13:44:55.563+0100 7f8811c40700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: In function 'void interval_set<T, C>::erase(T, T, std::function<bool(T, T)>) [with T = inodeno_t; C = std::map]' thread 7f8811c40700 time 2024-06-24T13:44:55.564315+0100
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/include/interval_set.h: 568: FAILED ceph_assert(p->first <= start)
 ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x135) [0x7f8821e814a3]
 2: /usr/lib64/ceph/libceph-common.so.2(+0x269669) [0x7f8821e81669]
 3: (interval_set<inodeno_t, std::map>::erase(inodeno_t, inodeno_t, std::function<bool (inodeno_t, inodeno_t)>)+0x2e5) [0x5576f9bb2885]
 4: (EMetaBlob::replay(MDSRank*, LogSegment*, int, MDPeerUpdate*)+0x4377) [0x5576f9eb77b7]
 5: (EUpdate::replay(MDSRank*)+0x61) [0x5576f9ebbbd1]
 6: (MDLog::_replay_thread()+0x7bb) [0x5576f9e4254b]
 7: (MDLog::ReplayThread::entry()+0x11) [0x5576f9af5041]
 8: /lib64/libpthread.so.0(+0x81ca) [0x7f8820e6f1ca]
 9: clone()
I've only included a short section of the crash (this
is the first trace in the log relating to the crash,
with a 10/20 debug_mds option). We tried deleting the
0x1005cd4fe35 object from the object store using the
"rados" command, but this did not allow our MDS to
successfully replay.
From my understanding the journal seems okay, as we
didn't run out of space on our metadata pool, for
example, and "cephfs-journal-tool journal inspect"
doesn't seem to think there is any damage:
root@pebbles-s1 13:58 [~]: cephfs-journal-tool --rank=ceph_backup:0 journal inspect
Overall journal integrity: OK
root@pebbles-s1 14:04 [~]: cephfs-journal-tool --rank=ceph_backup:0 event get --inode 1101069090357 summary
Events by type:
  OPEN: 1
  UPDATE: 3
Errors: 0
root@pebbles-s1 14:05 [~]: cephfs-journal-tool --rank=ceph_backup:0 event get --inode 1101069090357 list
2024-06-22T05:41:10.214635+0100 0x51f97d4cfe35 UPDATE: (openc)
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
2024-06-22T05:41:11.203312+0100 0x51f97d59c848 UPDATE: (check_inode_max_size)
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
2024-06-22T05:41:15.484871+0100 0x51f97e7344cc OPEN: ()
  FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
2024-06-22T05:41:15.484921+0100 0x51f97e73493b UPDATE: (rename)
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc.teberet7.partial
  test_micrographs/FoilHole_27649821_Data_27626128_27626130_20210628_005006_fractions_ave_Z124.mrc
I was wondering whether anyone had any advice for us
on how we should proceed? We were thinking about
manually applying these events (via "event apply"),
and failing that we could erase this problematic
event with "cephfs-journal-tool --rank=ceph_backup:0
event splice --inode 1101069090357". Is this a good
idea? We would rather not rebuild the entire metadata
pool if we can avoid it (once was enough for us), as
this cluster has ~9 PB of data on it.
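For concreteness, the two alternatives we are weighing would be
invoked roughly as follows (the trailing "summary" output mode is
an assumption based on the tool's usage):
$ cephfs-journal-tool --rank=ceph_backup:0 event apply --inode 1101069090357 summary    # replay this inode's events into the backing store
$ cephfs-journal-tool --rank=ceph_backup:0 event splice --inode 1101069090357 summary   # or cut this inode's events out of the journal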
Kindest regards,
Ivan Clayson
--
Ivan Clayson
-----------------
Scientific Computing Officer
Room 2N249
Structural Studies
MRC Laboratory of Molecular Biology
Francis Crick Ave, Cambridge
CB2 0QH