On Thu, Mar 7, 2019 at 8:24 AM Zack Brenton <zack@xxxxxxxxxxxx> wrote:
>
> Hey Patrick,
>
> I understand your skepticism! I'm also confident that this is some kind of configuration issue; I'm not very familiar with all of Ceph's various configuration options, as Rook generally abstracts those away, so I appreciate you taking the time to look into this.
>
> I've attached a screenshot of our internal Ceph MDS dashboard that includes some data from one of my older load tests, showing the memory and CPU usage of each MDS pod, as well as the session count, handled client request rate, and object r/w op rates. I'm confident that the `mds_cache_memory_limit` was 16GB for this test, although I've been testing with different values and unfortunately I don't have a historical record of those like I do for the metrics included on our dashboard.

Is this with one active MDS and one standby-replay? The graph is odd to me because the session count shows sessions on fs-b and fs-d but not fs-c. Or maybe max_mds=2, fs-d has no activity, and fs-c is standby-replay?

> Types of devices:
> We run our Ceph pods on 3 AWS i3.2xlarge nodes. We're running 3 OSDs, 3 Mons, and 2 MDS pods (1 active, 1 standby-replay). Currently, each pod runs with the following resources:
> - osds: 2 CPU, 6Gi RAM, 1.7Ti NVMe disk
> - mds: 3 CPU, 24Gi RAM
> - mons: 500m (.5) CPU, 1Gi RAM

Three OSDs are going to really struggle with the client load you're putting on them. It doesn't surprise me that you are getting slow request warnings on the MDS for this reason. When you were running Luminous 12.2.9+ or Mimic 13.2.2+, were you seeing slow metadata I/O warnings? Even if you did not, it's possible that the MDS is delayed issuing caps to clients because it's waiting for another client to flush writes and release conflicting caps.

Generally we recommend that the metadata pool be located on OSDs with fast devices, separate from the data pool. This avoids priority inversion of MDS metadata I/O with data I/O. See [1] to configure the metadata pool on a separate set of OSDs.

Also, you're not going to saturate a 1.9TB NVMe SSD with one OSD. You must partition it and set up multiple OSDs. This ends up being a positive for you, since you can then put the metadata pool on its own set of OSDs.

[1] https://ceph.com/community/new-luminous-crush-device-classes/

--
Patrick Donnelly
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
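
A minimal sketch of the device-class approach described in [1]. The names here are assumptions, not taken from the thread: it assumes the metadata OSDs report (or are assigned) device class "nvme", that the CRUSH rule is called "metadata-rule", that the CephFS metadata pool is named "cephfs_metadata", and that the failure domain is "host"; the OSD IDs are placeholders. Adjust everything to match the actual cluster:

    # Assign a device class to the OSDs that should carry CephFS metadata.
    # (Luminous+ usually auto-detects hdd/ssd/nvme; an existing class must be
    # removed before it can be overridden.)
    ceph osd crush rm-device-class osd.3 osd.4 osd.5
    ceph osd crush set-device-class nvme osd.3 osd.4 osd.5

    # Create a replicated CRUSH rule restricted to that device class,
    # using "host" as the failure domain.
    ceph osd crush rule create-replicated metadata-rule default host nvme

    # Point the CephFS metadata pool at the new rule.
    ceph osd pool set cephfs_metadata crush_rule metadata-rule

After the pool is switched to the new rule, Ceph rebalances the metadata PGs onto the OSDs in that device class, so the MDS's metadata I/O no longer competes with bulk data I/O on the same devices.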