Re: How To Scale Ceph for Large Numbers of Clients?


 



On Thu, Mar 7, 2019 at 8:24 AM Zack Brenton <zack@xxxxxxxxxxxx> wrote:
>
> Hey Patrick,
>
> I understand your skepticism! I'm also confident that this is some kind of a configuration issue; I'm not very familiar with all of Ceph's various configuration options as Rook generally abstracts those away, so I appreciate you taking the time to look into this.
>
> I've attached a screenshot of our internal Ceph MDS dashboard that includes some data from one of my older load tests showing the memory and CPU usage of each MDS pod, as well as the session count, handled client request rate, and object r/w op rates. I'm confident that the `mds_cache_memory_limit` was 16GB for this test, although I've been testing with different values and unfortunately I don't have a historical record of those like I do for the metrics included on our dashboard.
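As a sanity check, the value actually in effect on a running daemon can be
read back from the admin socket, or on Mimic and later from the centralized
config store. The daemon name below is only a placeholder:

    # run inside the MDS pod; substitute your real MDS daemon name
    ceph daemon mds.<name> config get mds_cache_memory_limit
    # or, on Mimic and later, via the mon config store:
    ceph config get mds mds_cache_memory_limit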

Is this with one active MDS and one standby-replay? The graph is odd
to me because the session count shows sessions on fs-b and fs-d but
not fs-c. Or maybe max_mds=2 and fs-d has no activity and fs-c is
standby-replay?
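
If you're not sure, `ceph fs status` will show which daemons hold an active
rank and which are standby-replay, and `ceph fs get` shows max_mds. Something
like (fs name is whatever Rook called your filesystem):

    ceph fs status
    ceph fs get <fs_name> | grep max_mds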

> Types of devices:
> We run our Ceph pods on 3 AWS i3.2xlarge nodes. We're running 3 OSDs, 3 Mons, and 2 MDS pods (1 active, 1 standby-replay). Currently, each pod runs with the following resources:
> - osds: 2 CPU, 6Gi RAM, 1.7Ti NVMe disk
> - mds:  3 CPU, 24Gi RAM
> - mons: 500m (.5) CPU, 1Gi RAM
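
For reference, in Rook that MDS sizing and the active/standby-replay split are
normally expressed on the CephFilesystem CR, roughly like the sketch below
(the name "myfs" and the exact values are placeholders; pool definitions
omitted):

    apiVersion: ceph.rook.io/v1
    kind: CephFilesystem
    metadata:
      name: myfs
    spec:
      metadataServer:
        activeCount: 1        # one active MDS rank
        activeStandby: true   # run the standby as standby-replay
        resources:
          limits:
            cpu: "3"
            memory: 24Gi
          requests:
            cpu: "3"
            memory: 24Gi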

Three OSDs are going to really struggle with the client load you're
putting on them. It doesn't surprise me that you are getting slow request
warnings on the MDS for this reason. When you were running Luminous
12.2.9+ or Mimic 13.2.2+, were you seeing slow metadata I/O warnings?
Even if you did not, it's possible that the MDS is delayed in issuing caps
to clients because it's waiting for another client to flush writes and
release conflicting caps.
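
Both of those show up in `ceph health detail` (as MDS_SLOW_REQUEST and
MDS_SLOW_METADATA_IO on recent releases), and `ceph osd perf` gives a rough
per-OSD latency view; both are worth watching while you run the load tests:

    ceph health detail   # look for slow request / slow metadata IO warnings
    ceph osd perf        # per-OSD commit/apply latency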

Generally we recommend that the metadata pool be located on OSDs with
fast devices separate from the data pool. This avoids priority
inversion of MDS metadata I/O with data I/O. See [1] to configure the
metadata pool on a separate set of OSDs.
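
A minimal sketch of that, assuming the metadata pool is named
cephfs-metadata and osd.3-5 are the OSDs you want to reserve for metadata
(adjust the names and IDs to your cluster):

    # give the metadata OSDs their own device class
    ceph osd crush rm-device-class osd.3 osd.4 osd.5
    ceph osd crush set-device-class metadata osd.3 osd.4 osd.5
    # create a CRUSH rule limited to that class and point the metadata pool at it
    ceph osd crush rule create-replicated metadata-rule default host metadata
    ceph osd pool set cephfs-metadata crush_rule metadata-rule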

Also, you're not going to saturate a 1.9TB NVMe SSD with one OSD. You
must partition it and set up multiple OSDs per device. That ends up
working in your favor, because it lets you put the metadata pool on its
own set of OSDs.
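
Outside of Rook that is typically done with ceph-volume; with Rook the same
thing is expressed in the CephCluster storage config. The device path and
count below are just an example:

    # plain ceph-volume: carve one NVMe device into two OSDs
    ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1

    # Rook CephCluster CR equivalent:
    #   storage:
    #     config:
    #       osdsPerDevice: "2"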

[1] https://ceph.com/community/new-luminous-crush-device-classes/

-- 
Patrick Donnelly
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


