Quoting Stefan Kooman (stefan@xxxxxx):
> Hi,
>
> Like I said in an earlier mail to this list, we re-balanced ~ 60% of the
> CephFS metadata pool to NVMe-backed devices: roughly 422 M objects (1.2
> billion replicated), with 512 PGs allocated to the pool. While
> rebalancing we suffered from quite a few SLOW_OPS. Memory, CPU and
> device IOPS capacity were not a limiting factor as far as we can see
> (plenty available ... nowhere near max capacity). We saw quite a few
> slow ops with the following events:
>
>             "time": "2019-12-19 09:41:02.712010",
>             "event": "reached_pg"
>         },
>         {
>             "time": "2019-12-19 09:41:02.712014",
>             "event": "waiting for rw locks"
>         },
>         {
>             "time": "2019-12-19 09:41:02.881939",
>             "event": "reached_pg"
>
> ... and this repeated hundreds of times, taking ~ 30 seconds to complete.
>
> Does this indicate PG lock contention?
>
> If so, would we need to provide more PGs to the metadata pool to avoid
> this?
>
> The metadata pool is only ~ 166 MiB big ... but with loads of OMAPs ...
>
> Most advice on PG planning is concerned with the _amount_ of data, but
> the metadata pool (and this might also be true for RGW index pools)
> seems to be a special case.

This does seem to be the case. We had moved the metadata to a subset of
the cluster, which turned out not to be a good idea: those OSDs suffered
badly. Spreading the workload across all OSDs (reverting the change)
fixed the issues.

If you have *lots* of small files and/or directories in your cluster,
scale your metadata PGs accordingly.
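For concreteness, the scaling itself is roughly the following (a sketch,
not what we ran verbatim: the pool name cephfs_metadata is an assumption,
substitute your own; on Nautilus the mgr ramps up pgp_num for you, on
older releases you have to raise pgp_num yourself as well):

    # See how objects and OMAP data are spread over the PGs of the pool;
    # the OMAP_BYTES*/OMAP_KEYS* columns are what matter for metadata
    ceph pg ls-by-pool cephfs_metadata

    # Raise the PG count, e.g. from our 512 to 1024
    ceph osd pool set cephfs_metadata pg_num 1024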
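And if you want to check whether you are hitting the same "waiting for
rw locks" pattern, you can pull this kind of event dump from the OSD
admin socket. Something like this (osd.0 is just a placeholder;
dump_historic_slow_ops exists on Nautilus and newer, older releases only
have dump_historic_ops):

    # Ops currently in flight, with their per-event timelines
    ceph daemon osd.0 dump_ops_in_flight

    # Recently completed ops that were flagged as slow
    ceph daemon osd.0 dump_historic_slow_ops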
Gr. Stefan

--
| BIT BV  https://www.bit.nl/        Kamer van Koophandel 09090351
| GPG: 0xD14839C6                    +31 318 648 688 / info@xxxxxx