Hi Erich,
There's no simple answer to your question; as always, it depends.
Every now and then there are threads about clients misbehaving,
especially with the "flush tid" messages. For example, the docs [1]
state:
The CephFS client-MDS protocol uses a field called the oldest tid to
inform the MDS of which client requests are fully complete and may
therefore be forgotten about by the MDS. If a buggy client is
failing to advance this field, then the MDS may be prevented from
properly cleaning up resources used by client requests.
So it might be worth looking into the clients as well.
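If you want to look at it from the MDS side, dumping the sessions of the reporting rank is a start (just a sketch, the exact fields in the output vary a bit between releases):

# ceph tell mds.slugfs.pr-md-01.xdtppo session ls

The client_id from 'ceph health detail' tells you which entry to look at; the client's kernel or ceph-fuse version in the session metadata is usually the interesting part, since old clients are the usual suspects for the oldest-tid warning.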
Looking at your output, the number of requests is really low, so it
doesn't look like a load issue to me. MDS daemons are effectively
single-threaded; do you see one of them with high CPU usage, for
example in 'top'?
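A quick way to check that on the MDS hosts is plain shell, e.g.:

# top -p $(pgrep -d',' -f ceph-mds)

If one ceph-mds sits near 100% of a single core while the warnings appear, the daemon itself is the bottleneck; if it's mostly idle, more MDS daemons probably won't buy you much.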
There's also the internal MDS balancer that can cause issues with
multiple active MDS daemons; in such a case pinning can help (see the
example below). But you would see it in the logs if the two MDSs were
playing ping-pong with the requests. And again, your CephFS load
doesn't look very high, so maybe it's a client issue after all.
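For reference, pinning is just an extended attribute on the directories you want to tie to a rank (the paths below are made up, adjust to your mount point):

# setfattr -n ceph.dir.pin -v 0 /mnt/slugfs/projectA
# setfattr -n ceph.dir.pin -v 1 /mnt/slugfs/projectB

A value of -1 removes the pin again; there's also ephemeral pinning (ceph.dir.pin.distributed) if you have many sibling directories.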
If you switch to one active MDS, does the behaviour change?
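That's just:

# ceph fs set slugfs max_mds 1

(and max_mds 2 to go back). Reducing max_mds stops rank 1 and migrates its subtrees to rank 0, so expect a short hiccup while that happens.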
Regards,
Eugen
[1]
https://docs.ceph.com/en/quincy/cephfs/health-messages/#mds-client-oldest-tid-mds-client-oldest-tid-many
Quoting Erich Weiler <weiler@xxxxxxxxxxxx>:
Hi All,
We have a slurm cluster with 25 clients, each with 256 cores, each
mounting a cephfs filesystem as their main storage target. The
workload can be heavy at times.
We have two active MDS daemons and one standby. A lot of the time
everything is healthy, but we sometimes get warnings about MDS
daemons being slow on requests, behind on trimming, etc. I realize
there may be a bug in play, but I was also wondering if we simply
didn't have enough MDS daemons to handle the load. Is there a way
to know if adding an MDS daemon would help? We could add a third
active MDS if needed, but I don't want to start adding a bunch of
MDSs if that won't help.
The OSD servers seem fine. It's mainly the MDS instances that are
complaining.
We are running reef 18.2.1.
For reference, when things look healthy:
# ceph fs status slugfs
slugfs - 34 clients
======
RANK  STATE            MDS              ACTIVITY      DNS    INOS   DIRS   CAPS
 0    active  slugfs.pr-md-03.mclckv  Reqs:  273 /s  2759k  2636k   362k  1079k
 1    active  slugfs.pr-md-01.xdtppo  Reqs:  194 /s   868k   674k  67.3k   351k
        POOL            TYPE     USED  AVAIL
  cephfs_metadata     metadata   127G  3281G
cephfs_md_and_data      data        0  98.3T
    cephfs_data         data     740T   196T
      STANDBY MDS
slugfs.pr-md-02.sbblqq
MDS version: ceph version 18.2.1 (7fe91d5d5842e04be3b4f514d6dd990c54b29c76) reef (stable)
# ceph -s
  cluster:
    id:     58bde08a-d7ed-11ee-9098-506b4b4da440
    health: HEALTH_OK

  services:
    mon: 5 daemons, quorum pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)
    mgr: pr-md-01.jemmdf(active, since 5w), standbys: pr-md-02.emffhz
    mds: 2/2 daemons up, 1 standby
    osd: 46 osds: 46 up (since 8d), 46 in (since 4w)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 1313 pgs
    objects: 271.17M objects, 493 TiB
    usage:   744 TiB used, 384 TiB / 1.1 PiB avail
    pgs:     1307 active+clean
             4    active+clean+scrubbing
             2    active+clean+scrubbing+deep

  io:
    client:   39 MiB/s rd, 108 MiB/s wr, 1.96k op/s rd, 54 op/s wr
But when things are in "warning" mode, it looks like this:
# ceph -s
  cluster:
    id:     58bde08a-d7ed-11ee-9098-506b4b4da440
    health: HEALTH_WARN
            1 filesystem is degraded
            1 clients failing to advance oldest client/flush tid
            1 MDSs report slow requests
            1 MDSs behind on trimming

  services:
    mon: 5 daemons, quorum pr-md-01,pr-md-02,pr-store-01,pr-store-02,pr-md-03 (age 5d)
    mgr: pr-md-01.jemmdf(active, since 5w), standbys: pr-md-02.emffhz
    mds: 2/2 daemons up, 1 standby
    osd: 46 osds: 46 up (since 8d), 46 in (since 4w)

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 1313 pgs
    objects: 271.28M objects, 494 TiB
    usage:   746 TiB used, 382 TiB / 1.1 PiB avail
    pgs:     1307 active+clean
             5    active+clean+scrubbing
             1    active+clean+scrubbing+deep

  io:
    client:   55 MiB/s rd, 2.6 MiB/s wr, 15 op/s rd, 46 op/s wr
And this:
# ceph health detail
HEALTH_WARN 2 clients failing to advance oldest client/flush tid; 2 MDSs report slow requests; 1 MDSs behind on trimming
[WRN] MDS_CLIENT_OLDEST_TID: 2 clients failing to advance oldest client/flush tid
    mds.slugfs.pr-md-01.xdtppo(mds.0): Client phoenix-06.prism failing to advance its oldest client/flush tid.  client_id: 125780
    mds.slugfs.pr-md-02.sbblqq(mds.1): Client phoenix-00.prism failing to advance its oldest client/flush tid.  client_id: 99385
[WRN] MDS_SLOW_REQUEST: 2 MDSs report slow requests
    mds.slugfs.pr-md-01.xdtppo(mds.0): 4 slow requests are blocked > 30 secs
    mds.slugfs.pr-md-02.sbblqq(mds.1): 67 slow requests are blocked > 30 secs
[WRN] MDS_TRIM: 1 MDSs behind on trimming
    mds.slugfs.pr-md-02.sbblqq(mds.1): Behind on trimming (109410/250) max_segments: 250, num_segments: 109410
The "cure" is the restart the active MDS daemons, one at a time.
Then everything becomes healthy again, for a time.
We also have the following MDS config items in play:
mds_cache_memory_limit = 8589934592
mds_cache_trim_decay_rate = .6
mds_log_max_segments = 250
Thanks for any pointers!
cheers,
erich
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx