Hi,
We have a CephFS filesystem holding 70 TiB of data in ~ 300 M files and
~ 900 M subdirectories. We currently have 180 OSDs in this cluster.
POOL             ID  PGS  STORED   (DATA)   (OMAP)   OBJECTS  USED     (DATA)   (OMAP)   %USED  MAX AVAIL
cephfs_metadata   6  512  984 GiB  243 MiB  984 GiB  903.98M  2.9 TiB  728 MiB  2.9 TiB    3.06     30 TiB
The PGs in this pool (ID 6; replicated, size=3, min_size=2) are giving us
a hard time (again). When PGs get remapped to other OSDs this introduces
(tons of) slow ops and MDS slow requests. Remapping more than 10 PGs at
a time results in OSDs getting marked down (iothread timeout). Scrubbing
(with default settings) triggers slow ops too. Half of the cluster runs
on SSDs (SAMSUNG MZ7LM3T8HMLP-00005 / INTEL SSDSC2KB03) with the cache
mode set to write-through, the other half on NVMe (SAMSUNG
MZQLB3T8HALS-00007). No separate WAL/DB devices. The SSD nodes are Intel
based (14 cores / 128 GB RAM), the NVMe nodes AMD EPYC gen 1 / 2 (16
cores / 128 GB RAM). osd_memory_target=11G. The load on the pool (and
the cluster in general) is modest, with plenty of CPU power available
(mostly idling really): on the order of ~6 K MDS requests and ~1.5 K
metadata ops (ballpark figures).
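For context, the recovery/scrub throttles we would be tuning here are
roughly the following (a sketch with example values only, assuming a
release with the centralised config store):

  # Throttle backfill/recovery so remapped PGs trickle in slowly.
  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_max_active 1
  ceph config set osd osd_recovery_sleep_ssd 0.1

  # Spread out (deep-)scrub impact.
  ceph config set osd osd_max_scrubs 1
  ceph config set osd osd_scrub_sleep 0.1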
We currently have 512 PGs allocated to this pool. The autoscaler
suggests reducing this to 32 PGs, which would leave only a fraction of
the OSDs holding *all* of the metadata. I can tell you, based on
experience, that this is not good advice (the longer story is here [1]).
At the very least you want to spread all OMAP data over as many (fast)
disks as possible, so in this case it should advise 256.
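If we keep 256 or 512 PGs, I assume we would also have to tell the
autoscaler to back off for this pool, along these lines (sketch,
example values):

  # Only warn instead of auto-shrinking the metadata pool ...
  ceph osd pool set cephfs_metadata pg_autoscale_mode warn
  # ... or bias the autoscaler towards more PGs for this OMAP-heavy pool.
  ceph osd pool set cephfs_metadata pg_autoscale_bias 4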
As the PGs merely act as a "placeholder" for the (OMAP) data residing in
RocksDB, I wonder whether it would improve performance if we split the
PGs to, say, 2048. The amount of OMAP per PG would go down dramatically:
currently each PG holds ~ 1 GiB of OMAP bytes and ~ 2.3 M keys. Are
these numbers crazy high, causing the issues we see?
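For reference, the per-PG OMAP figures above can be read from the PG
listing, and the split itself would look roughly like this (sketch; on
Nautilus and later pgp_num follows pg_num automatically, and the mgr's
target_max_misplaced_ratio caps how many PGs move at once):

  # Per-PG OMAP_BYTES* / OMAP_KEYS* for the metadata pool.
  ceph pg ls-by-pool cephfs_metadata

  # The split itself; pg_num is walked up gradually by the mgr.
  ceph osd pool set cephfs_metadata pg_num 2048

  # Cap the fraction of misplaced objects during the split (example value).
  ceph config set mgr target_max_misplaced_ratio 0.01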
I guess upgrading to Pacific and sharding RocksDB would help a lot as
well. But is there anything we can do to improve the current situation?
Apart from throwing more OSDs at the problem ...
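(For completeness, my understanding is that once on Pacific the
resharding would be an offline, per-OSD operation with
ceph-bluestore-tool, roughly as below; the sharding spec should be taken
from that release's default bluestore_rocksdb_cfs, the string here is
only an example.)

  systemctl stop ceph-osd@${ID}

  # Reshard RocksDB into per-prefix column families; use the release's
  # default bluestore_rocksdb_cfs as the sharding definition.
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-${ID} \
      --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
      reshard

  systemctl start ceph-osd@${ID}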
Thanks,
Gr. Stefan
[1]:
https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/SDFJECHHVGVP3RTL3U5SG4NNYZOV5ALT/