Re: Help needed picking the right amount of PGs for (Cephfs) metadata pool


On Thu, Jun 2, 2022 at 11:40 AM Stefan Kooman <stefan@xxxxxx> wrote:
>
> Hi,
>
> We have a CephFS filesystem holding 70 TiB of data in ~ 300 M files and
> ~ 900 M sub directories. We currently have 180 OSDs in this cluster.
>
> POOL             ID  PGS  STORED   (DATA)   (OMAP)   OBJECTS  USED     (DATA)   (OMAP)   %USED  MAX AVAIL
> cephfs_metadata   6  512  984 GiB  243 MiB  984 GiB  903.98M  2.9 TiB  728 MiB  2.9 TiB   3.06     30 TiB
>
> The PGs in this pool (ID 6; replicated, size=3, min_size=2) are giving us
> a hard time (again). When PGs get remapped to other OSDs it introduces
> (tons of) slow ops and mds slow requests. Remapping more than 10 PGs at
> a time will result in OSDs marked as dead (iothread timeout). Scrubbing
> (with default settings) triggers slow ops too. Half of the cluster is
> running on SSDs (SAMSUNG MZ7LM3T8HMLP-00005 / INTEL SSDSC2KB03) with
> cache mode in write through, the other half is NVMe (SAMSUNG
> MZQLB3T8HALS-00007). No separate WAL/DB devices. SSDs run on Intel (14
> cores / 128 GB RAM), NVMe on AMD EPYC gen 1 / 2 (16 cores / 128 GB
> RAM). OSD_MEMORY_TARGET=11G. The load on the pool (and cluster in
> general) is modest. Plenty of CPU power available (mostly idling
> really). In the order of ~6 K MDS requests, ~ 1.5 K metadata ops
> (ballpark figure).
>
> We currently have 512 PGs allocated to this pool. The autoscaler
> suggests reducing this to "32" PGs. That would leave only a fraction
> of the OSDs holding *all* of the metadata. I can tell you, based on
> experience, that this is not good advice (the longer story here [1]).
> At the very least you want to spread all OMAP data over as many (fast)
> disks as possible. So in this case it should advise 256.
>
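
To put rough numbers on that concern, here is a quick back-of-the-envelope
sketch (the figures come from your mail; the arithmetic is mine):

```shell
# Back-of-the-envelope: what 32 PGs would mean for this pool.
# Figures from the mail above: size=3, 180 OSDs, ~984 GiB of OMAP.
PGS=32; SIZE=3; OSDS=180; OMAP_GIB=984

PLACEMENTS=$((PGS * SIZE))
echo "PG replicas to place: $PLACEMENTS (across $OSDS OSDs)"
echo "OMAP behind each PG:  $((OMAP_GIB / PGS)) GiB"
```

With only 96 placements for 180 OSDs, roughly half the OSDs would hold no
metadata at all, and each remaining PG would sit in front of ~30 GiB of OMAP.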

Curious, how many PGs do you have in total in all the pools of your
Ceph cluster? What are the other pools (e.g., data pools) and each of
their PG counts?

What version of Ceph are you using?
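
In case it helps, the usual commands for pulling that information together
(this assumes a reasonably recent release with the pg_autoscaler module
enabled; the guard is only there so the snippet is safe to paste anywhere):

```shell
# Gather version and pool/PG layout info in one go.
if command -v ceph >/dev/null 2>&1; then
    ceph versions                    # release running per daemon type
    ceph osd pool ls detail          # every pool with pg_num / pgp_num
    ceph osd pool autoscale-status   # autoscaler targets and ratios
else
    echo "ceph CLI not available on this host"
fi
```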

> As the PGs merely act as a "placeholder" for the (OMAP) data residing in
> the RocksDB database, I wonder if it would help performance if we
> split the PGs to, let's say, 2048. The amount of OMAP per PG would
> go down dramatically. Currently each PG carries ~ 1 GiB of OMAP bytes
> and ~ 2.3 M keys. Are these numbers crazy high, causing the issues we
> see?
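
As for what a 512 -> 2048 split would do to those per-PG numbers, a quick
projection (assuming the ~1 GiB / ~2.3 M keys per PG you quote spread
evenly over the child PGs, which CRUSH should roughly do):

```shell
# Project per-PG OMAP load after splitting 512 -> 2048 PGs.
# Current figures from the mail: ~1 GiB OMAP bytes, ~2.3 M keys per PG.
CUR_PGS=512; NEW_PGS=2048
KEYS_PER_PG=2300000; MIB_PER_PG=1024

FACTOR=$((NEW_PGS / CUR_PGS))
echo "split factor:            $FACTOR"
echo "keys per PG after split: $((KEYS_PER_PG / FACTOR))"
echo "MiB per PG after split:  $((MIB_PER_PG / FACTOR))"
```

That lands around 575 K keys and ~256 MiB of OMAP per PG, at the cost of
four times as many PGs for the OSDs to peer and scrub.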
>
> I guess upgrading to Pacific and sharding RocksDB would help a lot as
> well. But is there anything we can do to improve the current situation?
> Apart from throwing more OSDs at the problem ...
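
On the Pacific point: once upgraded, the RocksDB reshard is done offline,
per OSD, with ceph-bluestore-tool. A sketch for a single OSD; the OSD id
is a placeholder, and the sharding spec below is (to my knowledge)
Pacific's default, so double-check it against `bluestore_rocksdb_cfs` on
your build before running anything:

```shell
# Offline RocksDB reshard for one OSD (Pacific+). OSD id 12 is a placeholder.
OSD=12
if command -v ceph-bluestore-tool >/dev/null 2>&1; then
    systemctl stop ceph-osd@"$OSD"
    ceph-bluestore-tool \
        --path /var/lib/ceph/osd/ceph-"$OSD" \
        --sharding "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
        reshard
    systemctl start ceph-osd@"$OSD"
else
    echo "ceph-bluestore-tool not available on this host"
fi
```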
>
> Thanks,
>
> Gr. Stefan
>
> [1]:
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/SDFJECHHVGVP3RTL3U5SG4NNYZOV5ALT/
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>

Regards,
Ramana



