On Thu, Jun 2, 2022 at 11:40 AM Stefan Kooman <stefan@xxxxxx> wrote:
>
> Hi,
>
> We have a CephFS filesystem holding 70 TiB of data in ~ 300 M files and
> ~ 900 M sub directories. We currently have 180 OSDs in this cluster.
>
> POOL             ID  PGS  STORED   (DATA)   (OMAP)   OBJECTS  USED     (DATA)   (OMAP)   %USED  MAX AVAIL
> cephfs_metadata   6  512  984 GiB  243 MiB  984 GiB  903.98M  2.9 TiB  728 MiB  2.9 TiB   3.06     30 TiB
>
> The PGs in this pool (pool ID 6; replicated, size=3, min_size=2) are
> giving us a hard time (again). When PGs get remapped to other OSDs it
> introduces (tons of) slow ops and mds slow requests. Remapping more than
> 10 PGs at a time will result in OSDs marked as dead (iothread timeout).
> Scrubbing (with default settings) triggers slow ops too. Half of the
> cluster is running on SSDs (SAMSUNG MZ7LM3T8HMLP-00005 / INTEL
> SSDSC2KB03) with cache mode in write through, the other half is NVMe
> (SAMSUNG MZQLB3T8HALS-00007). No separate WAL/DB devices. SSDs run on
> Intel (14 cores / 128 GB RAM), NVMe on AMD EPYC gen 1 / 2 (16 cores /
> 128 GB RAM). OSD_MEMORY_TARGET=11G. The load on the pool (and cluster in
> general) is modest. Plenty of CPU power available (mostly idling
> really). In the order of ~6 K MDS requests, ~ 1.5 K metadata ops
> (ballpark figure).
>
> We currently have 512 PGs allocated to this pool. The autoscaler suggests
> reducing this amount to "32" PGs. This would result in only a fraction
> of the OSDs having *all* of the metadata. I can tell you, based on
> experience, that is not good advice (the longer story here [1]). At
> least you want to spread out all OMAP data over as many (fast) disks as
> possible. So in this case it should advise 256.
>

Curious, how many PGs do you have in total in all the pools of your
Ceph cluster? What are the other pools (e.g., data pools) and each of
their PG counts? What version of Ceph are you using?
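
In case it helps to pull those numbers together, here is a rough sketch
that totals PGs across pools and estimates the average number of PG
replicas per OSD. It assumes the pool list was saved as JSON (for
example from `ceph osd pool ls detail --format json`) and that each
entry carries `pool_name`, `pg_num` and `size` fields. The sample below
is illustrative only: the cephfs_metadata values come from your mail,
while the `cephfs_data` pool and its PG count are made-up placeholders.

```python
import json


def summarize_pools(pools_json: str, num_osds: int) -> dict:
    """Total pg_num over all pools and average PG replicas per OSD."""
    pools = json.loads(pools_json)
    per_pool = {p["pool_name"]: p["pg_num"] for p in pools}
    total_pgs = sum(per_pool.values())
    # Every PG is stored `size` times (replica count, or k+m for EC pools),
    # so this is the number of PG instances the OSDs have to host.
    pg_replicas = sum(p["pg_num"] * p["size"] for p in pools)
    return {
        "per_pool": per_pool,
        "total_pgs": total_pgs,
        "pg_replicas_per_osd": pg_replicas / num_osds,
    }


# Hypothetical sample: only cephfs_metadata (512 PGs, size=3) is from the
# original mail; cephfs_data and its pg_num are placeholders.
SAMPLE = """[
  {"pool_name": "cephfs_metadata", "pg_num": 512, "size": 3},
  {"pool_name": "cephfs_data", "pg_num": 2048, "size": 3}
]"""

summary = summarize_pools(SAMPLE, num_osds=180)
print(summary)
```

The per-OSD figure is only an average over all 180 OSDs; it says nothing
about how the PGs of one pool are spread over the SSD and NVMe halves.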

> As the PGs merely act as a "placeholder" for the (OMAP) data residing in
> the RocksDB database I wonder if it would help improve performance if we
> would split the PGs to, let's say, 2048 PGs. The amount of OMAP per PG
> would go down dramatically. Currently the amount of OMAP bytes per PG is
> ~ 1 GiB and # keys is ~ 2.3 M. Are these numbers crazy high causing the
> issues we see?
>
> I guess upgrading to Pacific and sharding RocksDB would help a lot as
> well. But is there anything we can do to improve the current situation?
> Apart from throwing more OSDs at the problem ...
>
> Thanks,
>
> Gr. Stefan
>
> [1]:
> https://lists.ceph.io/hyperkitty/list/ceph-users@xxxxxxx/thread/SDFJECHHVGVP3RTL3U5SG4NNYZOV5ALT/
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>

Regards,
Ramana