Help needed picking the right number of PGs for (CephFS) metadata pool

Hi,

We have a CephFS filesystem holding 70 TiB of data in ~300 M files and ~900 M subdirectories. We currently have 180 OSDs in this cluster.

POOL             ID  PGS  STORED   (DATA)   (OMAP)   OBJECTS  USED     (DATA)   (OMAP)   %USED  MAX AVAIL
cephfs_metadata   6  512  984 GiB  243 MiB  984 GiB  903.98M  2.9 TiB  728 MiB  2.9 TiB   3.06  30 TiB

The PGs in this pool (pool 6, replicated, size=3, min_size=2) are giving us a hard time (again). When PGs get remapped to other OSDs it introduces (tons of) slow ops and MDS slow requests. Remapping more than 10 PGs at a time results in OSDs being marked as dead (iothread timeout). Scrubbing (with default settings) triggers slow ops too.

Half of the cluster runs on SSDs (SAMSUNG MZ7LM3T8HMLP-00005 / INTEL SSDSC2KB03) with cache mode set to write-through, the other half on NVMe (SAMSUNG MZQLB3T8HALS-00007). No separate WAL/DB devices. The SSD nodes are Intel (14 cores / 128 GB RAM), the NVMe nodes AMD EPYC gen 1 / 2 (16 cores / 128 GB RAM). osd_memory_target=11G. The load on the pool (and the cluster in general) is modest, with plenty of CPU power available (mostly idling really): in the order of ~6 K MDS requests and ~1.5 K metadata ops (ballpark figures).
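
For what it is worth, this is the kind of throttling I have been looking at to soften the impact of remapping and scrubbing; the option names are the standard OSD config knobs, but the values are just my guesses, nothing we have validated:

  # move remapped PGs more gently
  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_max_active 1
  # small pause between recovery ops on SSD/NVMe OSDs
  ceph config set osd osd_recovery_sleep_ssd 0.1
  # pace scrubbing
  ceph config set osd osd_max_scrubs 1
  ceph config set osd osd_scrub_sleep 0.2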

We currently have 512 PGs allocated to this pool. The autoscaler suggests reducing this to 32 PGs, which would leave only a fraction of the OSDs holding *all* of the metadata. I can tell you, based on experience, that this is not good advice (the longer story here [1]). At the very least you want to spread all OMAP data over as many (fast) disks as possible, so in this case it should advise 256.
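
In the meantime I am inclined to pin the pool so the autoscaler cannot shrink it, along these lines (the pg_num_min value of 256 is just the floor I have in mind):

  # keep the autoscaler from acting on this pool ...
  ceph osd pool set cephfs_metadata pg_autoscale_mode off
  # ... or give it a floor it may not go below
  ceph osd pool set cephfs_metadata pg_num_min 256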

As the PGs merely act as a "placeholder" for the (OMAP) data residing in the RocksDB database, I wonder whether it would improve performance if we split the pool to, let's say, 2048 PGs. The amount of OMAP per PG would go down dramatically: currently each PG holds ~1 GiB of OMAP bytes and ~2.3 M keys. Are these numbers crazy high, and could they be causing the issues we see?
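
If splitting is the way to go, my understanding is that on recent releases this can be done gradually, roughly like this (the target_max_misplaced_ratio of 0.01 is a conservative value I picked, not something we have tested here):

  # request the split; pgp_num follows and the mgr paces the data movement
  ceph osd pool set cephfs_metadata pg_num 2048
  # limit how much data may be misplaced at any one time
  ceph config set mgr target_max_misplaced_ratio 0.01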

I guess upgrading to Pacific and sharding RocksDB would help a lot as well. But is there anything we can do to improve the current situation, apart from throwing more OSDs at the problem?
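
As far as I understand, after an upgrade to Pacific the existing OSDs would need an offline reshard with ceph-bluestore-tool to pick up the sharded column families, roughly like this, with the sharding spec taken from the Pacific default for bluestore_rocksdb_cfs (quoting the procedure from memory, so please correct me if this is off):

  # with the OSD stopped; $OSDID and the sharding string are placeholders,
  # take the exact spec from the Pacific docs / bluestore_rocksdb_cfs default
  systemctl stop ceph-osd@$OSDID
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-$OSDID \
      --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
      reshard
  systemctl start ceph-osd@$OSDID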

Thanks,

Gr. Stefan

[1]: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/SDFJECHHVGVP3RTL3U5SG4NNYZOV5ALT/


