Re: How To Scale Ceph for Large Numbers of Clients?


 



On Thu, Mar 7, 2019 at 2:38 PM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
Is this with one active MDS and one standby-replay? The graph is odd
to me because the session count shows sessions on fs-b and fs-d but
not fs-c. Or maybe max_mds=2 and fs-d has no activity and fs-c is
standby-replay?

The graphs were taken when we were running with 2 active MDS and 2 standby-replay. Currently we're running with 1 active and 1 standby-replay.
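
In case it's useful, a quick way to double-check that layout is something like the sketch below, which just shells out to "ceph fs dump". The JSON field names are from memory and may differ a bit between releases, so treat it as a rough outline rather than anything polished ("ceph fs status" shows much the same picture interactively).

#!/usr/bin/env python3
"""Sketch: summarize MDS ranks and standbys from 'ceph fs dump'.

Assumes the ceph CLI is on the path; the JSON field names below are
from memory and may differ slightly between releases.
"""
import json
import subprocess


def ceph(*args):
    out = subprocess.check_output(["ceph"] + list(args) + ["--format", "json"])
    return json.loads(out)


fsmap = ceph("fs", "dump")
for fs in fsmap.get("filesystems", []):
    mdsmap = fs["mdsmap"]
    print("%s: max_mds=%s" % (mdsmap.get("fs_name"), mdsmap.get("max_mds")))
    for info in mdsmap.get("info", {}).values():
        # state is e.g. "up:active" or "up:standby-replay"
        print("  %s rank=%s %s" % (info.get("name"), info.get("rank"), info.get("state")))
# Plain standbys are tracked outside the per-filesystem map.
for sb in fsmap.get("standbys", []):
    print("standby: %s" % sb.get("name"))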
 
Three OSDs are going to really struggle with the client load you're
putting on them. It doesn't surprise me that you are getting slow request
warnings on the MDS for this reason. When you were running Luminous
12.2.9+ or Mimic 13.2.2+, were you seeing slow metadata I/O warnings?
Even if you did not, it's possible that the MDS is delayed issuing caps
to clients because it's waiting for another client to flush writes and
release conflicting caps.

We didn't see any slow metadata I/O warnings, but this is the sort of thing I've been suspecting is the underlying issue. One thing to note, though: my current test only walks a directory and reads all the files in it, and it runs on a dev cluster that only I'm using, so I'm not sure which client would be generating writes that the MDS would be waiting on.
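
To confirm or rule out the caps theory, something like the sketch below should show whether requests are actually stuck waiting on caps while the walk runs, by polling the MDS admin socket (dump_ops_in_flight and session ls). The daemon name is a placeholder and the JSON fields are from memory, so it's only a rough outline.

#!/usr/bin/env python3
"""Sketch: poll the MDS admin socket for stuck ops and per-client caps.

Assumes this runs on the MDS host so that "ceph daemon mds.<name> ..."
works; the daemon name below is a placeholder and the JSON field names
are from memory.
"""
import json
import subprocess
import sys

MDS_NAME = sys.argv[1] if len(sys.argv) > 1 else "mds-a"  # placeholder name


def daemon(*args):
    out = subprocess.check_output(["ceph", "daemon", "mds." + MDS_NAME] + list(args))
    return json.loads(out)


ops = daemon("dump_ops_in_flight")
print("%d ops in flight" % len(ops.get("ops", [])))
for op in ops.get("ops", []):
    # flag_point usually says what the op is blocked on (locks, caps, journal, ...)
    flag = op.get("type_data", {}).get("flag_point", "?")
    print("  age=%s %s: %s" % (op.get("age"), flag, op.get("description")))

# Per-client session info, to spot a client sitting on a pile of caps.
for s in daemon("session", "ls"):
    print("client.%s num_caps=%s" % (s.get("id"), s.get("num_caps")))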
 
Generally we recommend that the metadata pool be located on OSDs with
fast devices separate from the data pool. This avoids priority
inversion of MDS metadata I/O with data I/O. See [1] to configure the
metadata pool on a separate set of OSDs.

Also, you're not going to saturate a 1.9TB NVMe SSD with one OSD. You
must partition it and set up multiple OSDs. This works out in your favor,
since it lets you put the metadata pool on its own set of OSDs.

[1] https://ceph.com/community/new-luminous-crush-device-classes/
 
This is excellent info, especially the second point, as it lets me split the metadata pool out onto separate OSDs without spinning up any additional resources or using EBS volumes. I will look into partitioning the NVMe drives, splitting out the metadata pool, and checking how this impacts performance.
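
The rough sequence I'm thinking of is sketched below, mostly so I keep the order straight. The pool, rule, class, and device names are placeholders for whatever our cluster actually uses, and the CRUSH steps follow the device-class approach from the linked post; it prints the commands rather than running them unless DRY_RUN is flipped.

#!/usr/bin/env python3
"""Sketch of the plan: carve each NVMe drive into several OSDs, give the
OSDs reserved for metadata their own device class, and point the CephFS
metadata pool at a CRUSH rule restricted to that class.

Pool name, rule/class names, device path, and OSD ids are placeholders;
leave DRY_RUN = True to just print the commands.
"""
import subprocess

DRY_RUN = True

NVME_DEVICE = "/dev/nvme0n1"        # placeholder: the 1.9TB NVMe drive
OSDS_PER_DEVICE = 4                 # placeholder split per drive
METADATA_OSDS = ["osd.3"]           # placeholder: OSDs to reserve for metadata
META_CLASS = "nvme-meta"            # placeholder device class for those OSDs
RULE_NAME = "metadata_rule"         # placeholder CRUSH rule name
METADATA_POOL = "cephfs_metadata"   # placeholder metadata pool name

commands = [
    # Split the drive into several OSDs instead of one big one.
    ["ceph-volume", "lvm", "batch", "--osds-per-device", str(OSDS_PER_DEVICE), NVME_DEVICE],
]
for osd in METADATA_OSDS:
    # ceph-volume auto-assigns a class, so clear it before setting the custom one.
    commands += [
        ["ceph", "osd", "crush", "rm-device-class", osd],
        ["ceph", "osd", "crush", "set-device-class", META_CLASS, osd],
    ]
commands += [
    # Replicated rule restricted to the metadata class, then move the metadata
    # pool onto it. The data pool keeps its own rule, which would likewise need
    # to exclude these OSDs to keep them dedicated to metadata.
    ["ceph", "osd", "crush", "rule", "create-replicated", RULE_NAME, "default", "host", META_CLASS],
    ["ceph", "osd", "pool", "set", METADATA_POOL, "crush_rule", RULE_NAME],
]

for cmd in commands:
    print(" ".join(cmd))
    if not DRY_RUN:
        subprocess.check_call(cmd)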

Thanks,
Zack
