Re: MDS Performance and PG/PGP value

Hi Yoann,

I'm not using pacific yet, but this looks very strange to me:

  cephfs_data      data     243T  19.7T
    usage:   245 TiB used, 89 TiB / 334 TiB avail

I'm not sure if there is a mix of raw vs. stored here. Assuming the cephfs_data allocation is right, I'm wondering what your osd [near] full ratios are. The PG counts look very good. The slow ops can have two causes: a bad disk or full OSDs. Looking at 19.7/(243+19.7) ≈ 7.5% free, I wonder why there are no osd [near] full warnings all over the place. Even at 20% free, performance can degrade dramatically according to benchmarks we made on octopus.
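
A quick way to check both (a minimal sketch, the grep is just a convenience):

  ceph osd dump | grep -i ratio    # full_ratio, backfillfull_ratio, nearfull_ratio
  ceph osd df                      # per-OSD utilisation; watch the %USE column for outliers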

I think you need to provide a lot more details here. Of interest are:

ceph df detail
ceph osd df tree

and possibly a few others. I don't think the multi-MDS mode is the problem, but you should check. We have seen degraded performance on mimic caused by excessive export_dir operations between the MDSes. However, I can't see such operations reported as stuck. You might want to check on your MDSes with ceph daemon mds.<name> ops | grep -e dirfrag -e export and/or similar commands. You should also report what kind of operations tend to be stuck the longest.
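
Something along these lines, run on the host where each MDS daemon lives (a sketch; replace the daemon name, and assuming your release still exposes dump_historic_ops on the MDS admin socket):

  ceph daemon mds.<name> ops | grep -e dirfrag -e export
  ceph daemon mds.<name> dump_historic_ops | grep -e '"description"' -e '"duration"'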

I also remember that there used to be problems with having a kclient CephFS mount on OSD nodes. Not sure if this could play a role here.
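
Checking for such mounts is quick:

  mount -t ceph    # run on each OSD host; lists kernel CephFS mounts, if any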

You have basically zero IO going on:

    client:   6.2 MiB/s rd, 12 MiB/s wr, 10 op/s rd, 366 op/s wr

yet PGs are laggy. The problem could lie with a non-Ceph component.

With the hardware you have, there is something very weird going on. You might also want to check that you have the correct MTU on all devices on every single host and that the negotiated speed is the same everywhere. I have seen problems like these caused by a single host with a wrong MTU and by LACP bonds with a broken transceiver.
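
A quick sanity check, to be run on every host (interface name and peer IP are just placeholders):

  ip link show | grep mtu               # MTU per interface, must match across hosts
  ethtool <iface> | grep -i speed       # negotiated link speed per NIC / bond member
  ping -M do -s 8972 -c 3 <peer-ip>     # path MTU test if you expect 9000-byte frames (8972 = 9000 - 28)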

Something else to check is flaky controller/PCIe connections. We had a case where a controller was behaving oddly and we saw a huge number of device resets in the logs. On the host with the broken controller, IO wait was way above average (as shown by top). Something similar might happen with NVMes. A painful but systematic way to locate a bad host is to out the OSDs on a single host manually and wait for PGs to peer and become active. If you have a bad host, IO should recover to good levels at that moment. Do this host by host. I know it will take a day or two, but it might locate something.
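
Roughly like this, assuming your CRUSH host buckets are named after the hosts (sketch only, double-check before running):

  ceph osd set norebalance                      # optional: avoid backfill traffic during the test
  ceph osd out $(ceph osd ls-tree <hostname>)   # mark all OSDs of one host out
  # wait for peering, watch whether client IO recovers, then undo:
  ceph osd in $(ceph osd ls-tree <hostname>)
  ceph osd unset norebalance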

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: 13 October 2022 13:56:45
To: Yoann Moulin; Patrick Donnelly
Cc: ceph-users@xxxxxxx
Subject:  Re: MDS Performance and PG/PGP value

On 10/13/22 13:47, Yoann Moulin wrote:
>> Also, you mentioned you're using 7 active MDS. How's that working out
>> for you? Do you use pinning?
>
> I don't really know how to do that. I have 55 worker nodes in my K8s
> cluster, each one can run pods that have access to a cephfs PVC. We have
> 28 cephfs persistent volumes. Pods are ML/DL/AI workloads, each can be
> started and stopped whenever our researchers need it. The workloads are
> unpredictable.

See [1] and [2].

Gr. Stefan

[1]:
https://docs.ceph.com/en/quincy/cephfs/multimds/#manually-pinning-directory-trees-to-a-particular-rank
[2]:
https://docs.ceph.com/en/quincy/cephfs/multimds/#setting-subtree-partitioning-policies
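
For example, the manual pinning in [1] boils down to setting an extended attribute on a directory of the mounted file system (path and rank below are placeholders):

  setfattr -n ceph.dir.pin -v 2 /mnt/cephfs/project-a          # pin this subtree to MDS rank 2
  setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home   # ephemeral distributed pinning, see [2]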

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



