Edit: screenshot removed due to message size constraints on the mailing list.
Hey Patrick,
I understand your skepticism! I'm also confident that this is some kind of configuration issue. I'm not very familiar with Ceph's various configuration options, as Rook generally abstracts them away, so I appreciate you taking the time to look into this.
Types of devices:
We run our Ceph pods on 3 AWS i3.2xlarge nodes. We're running 3 OSDs, 3 Mons, and 2 MDS pods (1 active, 1 standby-replay). Currently, each pod runs with the following resources:
- OSDs: 2 CPU, 6Gi RAM, 1.7Ti NVMe disk
- MDS: 3 CPU, 24Gi RAM
- Mons: 500m (0.5) CPU, 1Gi RAM
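As a quick sanity check on how these numbers relate to the cache limit you mentioned (a rough sketch; the 1.5x headroom factor is just a rule of thumb I've seen for MDS RSS above the cache limit, not an official Ceph figure):

```python
# Compare mds_cache_memory_limit against the MDS pod's memory limit.
# The headroom factor below is an assumption, not a Ceph-documented value.
GIB = 2**30

mds_cache_memory_limit = 17179869184   # 16 GiB, the value from this thread
pod_memory_limit = 24 * GIB            # MDS pod: 24Gi RAM

assert mds_cache_memory_limit == 16 * GIB

headroom_factor = 1.5                  # assumed MDS RSS overhead above the cache limit
expected_peak = mds_cache_memory_limit * headroom_factor
print(f"cache limit: {mds_cache_memory_limit / GIB:.0f} GiB, "
      f"expected peak RSS ~{expected_peak / GIB:.0f} GiB, "
      f"pod limit: {pod_memory_limit / GIB:.0f} GiB")
```

With that assumed factor, the expected peak RSS lands right at the 24Gi pod limit, i.e. there is little slack if the MDS overshoots its cache target.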
`ceph osd tree`:
```
ID CLASS WEIGHT  TYPE NAME                            STATUS REWEIGHT PRI-AFF
-1       5.18399 root default
-5       1.72800     host ip-10-0-28-88-ec2-internal
 0   ssd 1.72800         osd.0                        up      1.00000 1.00000
-3       1.72800     host ip-10-0-7-200-ec2-internal
 1   ssd 1.72800         osd.1                        up      1.00000 1.00000
-7       1.72800     host ip-10-0-9-172-ec2-internal
 2   ssd 1.72800         osd.2                        up      1.00000 1.00000
```
`ceph osd df`:
```
ID CLASS WEIGHT  REWEIGHT SIZE    USE     AVAIL   %USE VAR  PGS
 0   ssd 1.72800  1.00000 1.7 TiB 1.9 GiB 1.7 TiB 0.11 1.00 200
 1   ssd 1.72800  1.00000 1.7 TiB 1.9 GiB 1.7 TiB 0.11 1.00 200
 2   ssd 1.72800  1.00000 1.7 TiB 1.9 GiB 1.7 TiB 0.11 1.00 200
                    TOTAL 5.2 TiB 5.6 GiB 5.2 TiB 0.11
MIN/MAX VAR: 1.00/1.00  STDDEV: 0
```
`ceph osd lspools`:
```
1 myfs-metadata
2 myfs-data0
```
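For reference, this is roughly how my load test samples `mds_co.bytes` between iterations (a sketch; the daemon name `mds.myfs-a` comes from our Rook deployment, and the commented live call assumes `ceph` is on PATH with access to the MDS admin socket):

```python
import json
import subprocess

def mds_co_bytes(dump_mempools_json: str) -> int:
    """Extract mds_co.bytes from `ceph daemon mds.<id> dump_mempools` output."""
    pools = json.loads(dump_mempools_json)
    return pools["mempool"]["by_pool"]["mds_co"]["bytes"]

# Offline check against a trimmed sample of the JSON shape the command emits
# (same path as the jq filter `.mempool.by_pool.mds_co`):
sample = '{"mempool": {"by_pool": {"mds_co": {"items": 12345, "bytes": 3221225472}}}}'
assert mds_co_bytes(sample) == 3221225472

# Against a live daemon:
# out = subprocess.check_output(["ceph", "daemon", "mds.myfs-a", "dump_mempools"])
# print(mds_co_bytes(out.decode()))
```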
Let me know if there's any other information I can provide that would be helpful.
Thanks,
Zack
On Wed, Mar 6, 2019 at 9:49 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
Hello Zack,
On Wed, Mar 6, 2019 at 1:18 PM Zack Brenton <zack@xxxxxxxxxxxx> wrote:
>
> Hello,
>
> We're running Ceph on Kubernetes 1.12 using the Rook operator (https://rook.io), but we've been struggling to scale applications mounting CephFS volumes above 600 pods / 300 nodes. All our instances use the kernel client and run kernel `4.19.23-coreos-r1`.
>
> We've tried increasing the MDS memory limits, running multiple active MDS pods, and running different versions of Ceph (up to the latest Luminous and Mimic releases), but we run into MDS_SLOW_REQUEST errors at the same scale regardless of the memory limits we set. See this GitHub issue for more info on what we've tried up to this point: https://github.com/rook/rook/issues/2590
>
> I've written a simple load test that reads all the files in a given directory on an interval. While running this test, I've noticed that the `mds_co.bytes` value (from `ceph daemon mds.myfs-a dump_mempools | jq -c '.mempool.by_pool.mds_co'`) increases each time files are read. Why is this number increasing after the first iteration? If the same client is reading the same cached files, why would the data in the cache change at all? What is `mds_co.bytes` actually reporting?
>
> My most important question is this: How do I configure Ceph to be able to scale to large numbers of clients?
Please post more information about your cluster: types of devices,
`ceph osd tree`, `ceph osd df`, and `ceph osd lspools`.
There's no reason why CephFS shouldn't be able to scale to that number
of clients. The issue is probably related to the configuration of the
pools/MDS. From your ticket, I have a *lot* of trouble believing the
MDS is still at 3GB memory usage with that number of clients and
mds_cache_memory_limit=17179869184 (16GiB).
--
Patrick Donnelly
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com