Edit: screenshot removed due to message size constraints on the mailing list.
Hey Patrick,
I understand your skepticism! I'm also confident that this is some kind of configuration issue. I'm not very familiar with Ceph's various configuration options, as Rook generally abstracts them away, so I appreciate you taking the time to look into this.
Types of devices:
We run our Ceph pods on 3 AWS i3.2xlarge nodes. We're running 3 OSDs, 3 Mons, and 2 MDS pods (1 active, 1 standby-replay). Currently, each pod runs with the following resources:
- OSDs: 2 CPU, 6Gi RAM, 1.7Ti NVMe disk
- MDS: 3 CPU, 24Gi RAM
- Mons: 500m (0.5) CPU, 1Gi RAM
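As a quick sanity check on how these numbers relate to the cache limit you mentioned (a rough sketch; the 1.5x headroom factor is just a rule of thumb I've seen for MDS RSS above the cache limit, not an official Ceph figure):

```python
# Compare mds_cache_memory_limit against the MDS pod's memory limit.
# The headroom factor below is an assumption, not a Ceph-documented value.
GIB = 2**30

mds_cache_memory_limit = 17179869184   # 16 GiB, the value from this thread
pod_memory_limit = 24 * GIB            # MDS pod: 24Gi RAM

assert mds_cache_memory_limit == 16 * GIB

headroom_factor = 1.5                  # assumed MDS RSS overhead above the cache limit
expected_peak = mds_cache_memory_limit * headroom_factor
print(f"cache limit: {mds_cache_memory_limit / GIB:.0f} GiB, "
      f"expected peak RSS ~{expected_peak / GIB:.0f} GiB, "
      f"pod limit: {pod_memory_limit / GIB:.0f} GiB")
```

With that assumed factor, the expected peak RSS lands right at the 24Gi pod limit, i.e. there is little slack if the MDS overshoots its cache target.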
`ceph osd tree`:
```
ID CLASS WEIGHT  TYPE NAME                            STATUS REWEIGHT PRI-AFF
-1       5.18399 root default
-5       1.72800     host ip-10-0-28-88-ec2-internal
 0   ssd 1.72800         osd.0                        up      1.00000 1.00000
-3       1.72800     host ip-10-0-7-200-ec2-internal
 1   ssd 1.72800         osd.1                        up      1.00000 1.00000
-7       1.72800     host ip-10-0-9-172-ec2-internal
 2   ssd 1.72800         osd.2                        up      1.00000 1.00000
```
`ceph osd df`:
```
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE VAR  PGS
 0   ssd 1.72800  1.00000 1.7 TiB 1.9 GiB 1.7 TiB 0.11 1.00 200
 1   ssd 1.72800  1.00000 1.7 TiB 1.9 GiB 1.7 TiB 0.11 1.00 200
 2   ssd 1.72800  1.00000 1.7 TiB 1.9 GiB 1.7 TiB 0.11 1.00 200
                    TOTAL 5.2 TiB 5.6 GiB 5.2 TiB 0.11
MIN/MAX VAR: 1.00/1.00  STDDEV: 0
```
`ceph osd lspools`:
```
1 myfs-metadata
2 myfs-data0
```
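For reference, this is roughly how my load test samples `mds_co.bytes` between iterations (a sketch; the daemon name `mds.myfs-a` comes from our Rook deployment, and the commented live call assumes `ceph` is on PATH with access to the MDS admin socket):

```python
import json
import subprocess

def mds_co_bytes(dump_mempools_json: str) -> int:
    """Extract mds_co.bytes from `ceph daemon mds.<id> dump_mempools` output."""
    pools = json.loads(dump_mempools_json)
    return pools["mempool"]["by_pool"]["mds_co"]["bytes"]

# Offline check against a trimmed sample of the JSON shape the command emits
# (same path as the jq filter `.mempool.by_pool.mds_co`):
sample = '{"mempool": {"by_pool": {"mds_co": {"items": 12345, "bytes": 3221225472}}}}'
assert mds_co_bytes(sample) == 3221225472

# Against a live daemon:
# out = subprocess.check_output(["ceph", "daemon", "mds.myfs-a", "dump_mempools"])
# print(mds_co_bytes(out.decode()))
```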
Let me know if there's any other information I can provide that would be helpful.
Thanks,
Zack
On Wed, Mar 6, 2019 at 9:49 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
Hello Zack,
On Wed, Mar 6, 2019 at 1:18 PM Zack Brenton <zack@xxxxxxxxxxxx> wrote:
>
> Hello,
>
> We're running Ceph on Kubernetes 1.12 using the Rook operator (https://rook.io), but we've been struggling to scale applications mounting CephFS volumes above 600 pods / 300 nodes. All our instances use the kernel client and run kernel `4.19.23-coreos-r1`.
>
> We've tried increasing the MDS memory limits, running multiple active MDS pods, and running different versions of Ceph (up to the latest Luminous and Mimic releases), but we run into MDS_SLOW_REQUEST errors at the same scale regardless of the memory limits we set. See this GitHub issue for more info on what we've tried up to this point: https://github.com/rook/rook/issues/2590
>
> I've written a simple load test that reads all the files in a given directory on an interval. While running this test, I've noticed that the `mds_co.bytes` value (from `ceph daemon mds.myfs-a dump_mempools | jq -c '.mempool.by_pool.mds_co'`) increases each time files are read. Why is this number increasing after the first iteration? If the same client is reading the same cached files, why would the data in the cache change at all? What is `mds_co.bytes` actually reporting?
>
> My most important question is this: How do I configure Ceph to be able to scale to large numbers of clients?
Please post more information about your cluster: types of devices,
`ceph osd tree`, `ceph osd df`, and `ceph osd lspools`.
There's no reason why CephFS shouldn't be able to scale to that number
of clients. The issue is probably related to the configuration of the
pools/MDS. From your ticket, I have a *lot* of trouble believing the
MDS is still at 3GB memory usage with that number of clients and
mds_cache_memory_limit=17179869184 (16GiB).
--
Patrick Donnelly
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com