Hello Patrick,
Unfortunately, increasing the number of PGs did not help much in the end; my cluster is still in trouble...
Here is the current state of my cluster: https://pastebin.com/Avw5ybgd
Is 256 a good value in our case? We have 80TB of data with more than 300M files.
You want at least enough PGs that each of the OSDs hosts a portion of the OMAP data. You want to spread the OMAP data across as many _fast_ OSDs as possible.
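A quick way to check how the OMAP data is currently spread, and to keep the metadata pool on fast OSDs, could look roughly like this (the CRUSH rule name "fast-ssd" and the device class are just examples, adjust to your setup):

# per-OSD OMAP usage is shown in the OMAP column
ceph osd df tree

# restrict the metadata pool to SSD OSDs via a dedicated CRUSH rule
ceph osd crush rule create-replicated fast-ssd default host ssd
ceph osd pool set cephfs_metadata crush_rule fast-ssd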
I have tried to find an answer to your question: are more metadata PGs better? I haven't found a definitive answer. This would ideally be tested in a non-prod / pre-prod environment and
tuned to individual requirements (type of workload). For now, I would not blindly trust the PG autoscaler. I have seen it advise settings that would definitely not be OK. You can skew
things in the autoscaler with the "bias" parameter to compensate for this. But as far as I know, the current heuristics for determining a good value do not take into account the
importance of spreading OMAP (RocksDB) across OSDs. See a blog post about autoscaler tuning [1].
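For example, something along these lines (the bias value 4 is only an illustration, not a recommendation):

# check what the autoscaler intends to do before trusting it
ceph osd pool autoscale-status

# give the metadata pool more weight in the autoscaler's heuristics
ceph osd pool set cephfs_metadata pg_autoscale_bias 4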
It would be great if tuning metadata PGs for CephFS / RGW could be covered in the "large scale tests" the devs are planning to perform in the future, with use cases that compare
"a lot of small files / objects" against "loads of large files / objects", to get a feeling for how this tuning impacts performance for different workloads.
Gr. Stefan
[1]: https://ceph.io/en/news/blog/2022/autoscaler_tuning/
Thanks for the information. I agree that the autoscaler does not seem to be able to handle my use case.
(thanks to icepic.dz@xxxxxxxxx too)
By the way, since I set pg_num to 256, I have seen far fewer SLOW requests than before; I still get some, but the impact on my users has been reduced a lot.
# zgrep -c -E 'WRN.*(SLOW_OPS|SLOW_REQUEST|MDS_SLOW_METADATA_IO)' floki.log.4.gz floki.log.3.gz floki.log.2.gz floki.log.1.gz floki.log
floki.log.4.gz:6883
floki.log.3.gz:11794
floki.log.2.gz:3391
floki.log.1.gz:1180
floki.log:122
If I have the opportunity, I will try to run some benchmarks with multiple values of pg_num on the cephfs_metadata pool.
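A rough sketch of what I have in mind, assuming mdtest is installed and the filesystem is mounted at /mnt/cephfs (both are assumptions on my side, and the numbers are arbitrary):

#!/bin/bash
# try a few pg_num values on the metadata pool and run a small
# metadata-heavy benchmark for each one
for pg in 128 256 512; do
    ceph osd pool set cephfs_metadata pg_num "$pg"
    # wait until the PG split/merge and backfill have settled
    while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
    # lots of small files -> metadata-bound workload
    mdtest -n 10000 -F -d /mnt/cephfs/bench-pg"$pg"
done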
256 sounds like a good number to me. Maybe even 128. If you do some
experiments, please do share the results.
Yes, of course.
Also, you mentioned you're using 7 active MDS. How's that working out
for you? Do you use pinning?
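In case it helps, pinning is done with extended attributes on directories; a rough sketch (the paths, the rank, and the mount point /mnt/cephfs are only examples):

# pin one subtree to MDS rank 2
setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/volumes/pv-example

# or, in newer releases, spread the immediate children of a directory
# over the active MDS ranks (ephemeral distributed pinning)
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/volumes

# read back a pin
getfattr -n ceph.dir.pin /mnt/cephfs/volumes/pv-example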
I don't really know how to do that. I have 55 worker nodes in my K8s cluster, each of which can run pods that have access to a CephFS PVC, and we have 28 CephFS persistent volumes. The pods are
ML/DL/AI workloads, each of which can be started and stopped whenever our researchers need it. The workloads are unpredictable.
Thanks for your help.
Best regards,
--
Yoann Moulin
EPFL IC-IT
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx