Help needed picking the right number of PGs for (CephFS) metadata pool

Hi,

We have a CephFS filesystem holding 70 TiB of data in ~300 M files and ~900 M subdirectories. We currently have 180 OSDs in this cluster.

POOL             ID  PGS  STORED   (DATA)   (OMAP)   OBJECTS  USED     (DATA)   (OMAP)   %USED  MAX AVAIL
cephfs_metadata   6  512  984 GiB  243 MiB  984 GiB  903.98M  2.9 TiB  728 MiB  2.9 TiB   3.06  30 TiB

The PGs in this pool (pool 6, replicated, size=3, min_size=2) are giving us a hard time (again). When PGs get remapped to other OSDs it introduces (tons of) slow ops and MDS slow requests. Remapping more than 10 PGs at a time results in OSDs being marked as dead (iothread timeout). Scrubbing (with default settings) triggers slow ops too.

Half of the cluster runs on SSDs (SAMSUNG MZ7LM3T8HMLP-00005 / INTEL SSDSC2KB03) with cache mode set to write-through, the other half on NVMe (SAMSUNG MZQLB3T8HALS-00007). No separate WAL/DB devices. The SSD nodes are Intel (14 cores / 128 GB RAM), the NVMe nodes AMD EPYC gen 1 / 2 (16 cores / 128 GB RAM). osd_memory_target=11G. The load on the pool (and the cluster in general) is modest, with plenty of CPU power available (mostly idling really): in the order of ~6 K MDS requests and ~1.5 K metadata ops (ballpark figures).
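
For what it is worth, this is the kind of throttling I have been looking at to soften the impact of remapping and scrubbing; the option names are the standard OSD config knobs, but the values are just my guesses, nothing we have validated:

  # move remapped PGs more gently
  ceph config set osd osd_max_backfills 1
  ceph config set osd osd_recovery_max_active 1
  # small pause between recovery ops on SSD/NVMe OSDs
  ceph config set osd osd_recovery_sleep_ssd 0.1
  # pace scrubbing
  ceph config set osd osd_max_scrubs 1
  ceph config set osd osd_scrub_sleep 0.2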

We currently have 512 PGs allocated to this pool. The autoscaler suggests reducing this to 32 PGs, which would leave only a fraction of the OSDs holding *all* of the metadata. I can tell you, based on experience, that this is not good advice (the longer story here [1]). At the very least you want to spread all OMAP data over as many (fast) disks as possible, so in this case it should advise 256.
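
In the meantime I am inclined to pin the pool so the autoscaler cannot shrink it, along these lines (the pg_num_min value of 256 is just the floor I have in mind):

  # keep the autoscaler from acting on this pool ...
  ceph osd pool set cephfs_metadata pg_autoscale_mode off
  # ... or give it a floor it may not go below
  ceph osd pool set cephfs_metadata pg_num_min 256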

As the PGs merely act as a "placeholder" for the (OMAP) data residing in the RocksDB database, I wonder whether it would improve performance if we split the pool to, let's say, 2048 PGs. The amount of OMAP per PG would go down dramatically: currently each PG holds ~1 GiB of OMAP bytes and ~2.3 M keys. Are these numbers crazy high, and could they be causing the issues we see?
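
If splitting is the way to go, my understanding is that on recent releases this can be done gradually, roughly like this (the target_max_misplaced_ratio of 0.01 is a conservative value I picked, not something we have tested here):

  # request the split; pgp_num follows and the mgr paces the data movement
  ceph osd pool set cephfs_metadata pg_num 2048
  # limit how much data may be misplaced at any one time
  ceph config set mgr target_max_misplaced_ratio 0.01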

I guess upgrading to Pacific and sharding RocksDB would help a lot as well. But is there anything we can do to improve the current situation, apart from throwing more OSDs at the problem?
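
As far as I understand, after an upgrade to Pacific the existing OSDs would need an offline reshard with ceph-bluestore-tool to pick up the sharded column families, roughly like this, with the sharding spec taken from the Pacific default for bluestore_rocksdb_cfs (quoting the procedure from memory, so please correct me if this is off):

  # with the OSD stopped; $OSDID and the sharding string are placeholders,
  # take the exact spec from the Pacific docs / bluestore_rocksdb_cfs default
  systemctl stop ceph-osd@$OSDID
  ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-$OSDID \
      --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" \
      reshard
  systemctl start ceph-osd@$OSDID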

Thanks,

Gr. Stefan

[1]: https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/SDFJECHHVGVP3RTL3U5SG4NNYZOV5ALT/


