Hi,
I am running a Ceph Octopus (15.2.8) cluster primarily for CephFS.
Metadata is stored on SSD, data is stored in three different pools on
HDD. Currently, I use 22 subvolumes.
I am rotating snapshots on 16 subvolumes, all in the same pool, which is
the primary data pool for CephFS. Currently I have 41 snapshots per
subvolume. The goal is 50 snapshots (see bottom of mail for details).
Snapshots are only placed in the root directory of each subvolume, i.e.
/volumes/_nogroup/subvolname/hex-id/.snap
I create and delete the snapshots from one of the cluster nodes: the
complete CephFS is mounted, a mkdir or rmdir in the .snap directory is
performed for each relevant subvolume, then CephFS is unmounted again.
All PGs are active+clean most of the time; only a few are in snaptrim
for 1-2 minutes after snapshot deletion. I therefore assume that
snaptrim is not a limiting factor.
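For clarity, the snapshot step is essentially the following (sketch;
monitor address, credentials and snapshot names are placeholders):

$ mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
$ mkdir /mnt/cephfs/volumes/_nogroup/subvolname/hex-id/.snap/snap-2021-02-01-1200
$ rmdir /mnt/cephfs/volumes/_nogroup/subvolname/hex-id/.snap/snap-2021-01-18-1200
$ umount /mnt/cephfs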
Obviously, the total number of snapshots exceeds the limits of 400 and
100 that I have seen mentioned in some of the documentation. I am unsure
whether that is an issue here, as the snapshots are all in disjoint
subvolumes.
When mounting the subvolumes with the kernel client (ranging from the
CentOS 7 supplied 3.10 up to 5.4.93), after some time and for some
subvolumes the kworker process begins to hog 100% CPU and stat
operations become very slow (even slower than with the fuse client). I
can mostly reproduce this by starting specific rsync operations (with
many small files, e.g. CTAN, CentOS, Debian mirrors) and by running a
bareos backup. The kworker process seems to stay stuck even after
terminating the causing operation, i.e. rsync or bareos-fd.
Interestingly, I can even trigger these issues on a host that has only a
single CephFS subvolume mounted, one without any snapshots, as long as
that subvolume is in the same pool as other subvolumes that do have
snapshots.
I don't see any abnormal behaviour on the cluster nodes or on other
clients during these kworker hanging phases.
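If it helps with debugging, I can capture the client-side state during
such a hanging phase, roughly like this (sketch; assumes debugfs is
mounted and an actual MDS name has to be substituted):

$ cat /sys/kernel/debug/ceph/*/mdsc               # pending MDS requests of the kernel client
$ cat /sys/kernel/debug/ceph/*/osdc               # pending OSD requests of the kernel client
$ ceph daemon mds.<name> dump_ops_in_flight       # view from the MDS side, run on the MDS host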
With the fuse client, stat calls in normal operation are about 10-20x
slower than with the kernel client, but I don't encounter the extreme
slowdown behaviour. I am therefore currently mounting the
known-problematic subvolumes with fuse and the non-problematic ones
with the kernel client.
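For reference, the two mount variants look roughly like this (sketch;
host, client name and paths are placeholders):

$ mount -t ceph mon1:6789:/volumes/_nogroup/subvolname/hex-id /mnt/subvol -o name=backup,secretfile=/etc/ceph/backup.secret
$ ceph-fuse -n client.backup -r /volumes/_nogroup/subvolname/hex-id /mnt/subvol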
My questions are:
- Is this known or expected behaviour?
- I could move the subvolumes with snapshots into a subvolumegroup and
snapshot the whole group instead of each individual subvolume (see the
sketch after the questions). Is this likely to solve the issues?
- What is the current recommendation regarding CephFS and max number of
snapshots?
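What I have in mind for the second question is roughly this (sketch; the
group name is a placeholder, and I am not sure yet what the best way is
to migrate the existing subvolumes into the group; once they live under
/volumes/snapgroup/, a single mkdir would snapshot them all):

$ ceph fs subvolumegroup create cephfs snapgroup
$ mkdir /mnt/cephfs/volumes/snapgroup/.snap/snap-2021-02-01-1200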
Cluster setup:
5 nodes with a total of 56 OSDs
Each node has a Xeon Silver 4208 and 128 GB RAM
Each node has two 480 GB Samsung PM883 SSDs used for the CephFS metadata pool
HDDs range from 8 TB to 14 TB, the majority being 14 TB
10 GbE internal network and 10 GbE client network, no jumbo frames
$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    520 TiB  141 TiB  378 TiB   379 TiB      72.88
ssd    3.9 TiB  3.8 TiB  1.7 GiB    97 GiB       2.46
TOTAL  524 TiB  145 TiB  378 TiB   379 TiB      72.36

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1     1   66 MiB       57  198 MiB      0     23 TiB
cephfs.cephfs.meta      2  1024   26 GiB    2.29M   77 GiB   2.06    1.2 TiB
cephfs.cephfs.data      3  1024   70 TiB   54.95M  213 TiB  75.19     23 TiB
lofar                   4   512   77 TiB   21.41M  154 TiB  68.68     35 TiB
proxmox                 6    64  526 GiB  158.60k  1.6 TiB   2.16     23 TiB
archive                 7    32  7.3 TiB    5.42M   10 TiB  12.57     56 TiB
Snapshots exist only in the cephfs.cephfs.data pool.
Intended snapshot rotation:
4 quarter-hourly snapshots
24 hourly snapshots
14 daily snapshots
8 weekly snapshots
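One way to drive this schedule would be a cron table along these lines
(sketch; the rotation script name and exact times of day are
hypothetical, the script would perform the mkdir/rmdir shown above and
prune down to the given count):

*/15 * * * *  /usr/local/sbin/cephfs-snap-rotate quarter-hourly 4
0 *  * * *    /usr/local/sbin/cephfs-snap-rotate hourly 24
0 3  * * *    /usr/local/sbin/cephfs-snap-rotate daily 14
0 4  * * 0    /usr/local/sbin/cephfs-snap-rotate weekly 8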
Cheers
Sebastian