Re: Recommended hardware for MDS server

Hi,


On 08/22/2016 07:27 PM, Wido den Hollander wrote:
On 22 August 2016 at 15:52, Christian Balzer <chibi@xxxxxxx> wrote:



Hello,

first off, not a CephFS user, just installed it on a lab setup for fun.
That being said, I tend to read most posts here.

And I do remember participating in similar discussions.

On Mon, 22 Aug 2016 14:47:38 +0200 Burkhard Linke wrote:

Hi,

we are running CephFS with about 70 TB of data, > 5 million files and about
100 clients. The MDS is currently colocated on a storage box with 14 OSDs
(12 HDD, 2 SSD). The box has two E5-2680 v3 CPUs and 128 GB RAM. CephFS
runs fine, but it feels like the metadata operations could be faster.

Firstly, I wouldn't share the MDS with a storage/OSD node; a MON
node would make a more "natural" co-location spot.
Indeed. I always try to avoid co-locating anything with the OSDs.
The MONs are also colocated with other OSD hosts, but this is also subject to change in the near future.

That being said, CPU-wise that machine feels vastly overpowered; I don't see
more than half of the cores ever utilized for OSD purposes, even in the
most contrived test cases.

Have you monitored that node with something like atop to get a feel for which
tasks are using how much (of a specific) CPU?
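
(Besides atop itself, a quick way to get that picture is per-thread CPU usage of the ceph-mds process with standard Linux tools; the commands below are just an illustration:)

    # show per-thread CPU usage of the running ceph-mds process
    top -H -p $(pidof ceph-mds)

    # or sample per-thread CPU statistics once per second (pidstat is part of sysstat)
    pidstat -t -p $(pidof ceph-mds) 1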

Excerpt of MDS perf dump:
"mds": {
          "request": 73389282,
          "reply": 73389282,
          "reply_latency": {
              "avgcount": 73389282,
              "sum": 259696.749971457
          },
          "forward": 0,
          "dir_fetch": 4094842,
          "dir_commit": 720085,
          "dir_split": 0,
          "inode_max": 5000000,
          "inodes": 5000065,
          "inodes_top": 320979,
          "inodes_bottom": 530518,
          "inodes_pin_tail": 4148568,
          "inodes_pinned": 4469666,
          "inodes_expired": 60001276,
          "inodes_with_caps": 4468714,
          "caps": 4850520,
          "subtrees": 2,
          "traverse": 92378836,
          "traverse_hit": 75743822,
          "traverse_forward": 0,
          "traverse_discover": 0,
          "traverse_dir_fetch": 1719440,
          "traverse_remote_ino": 33,
          "traverse_lock": 3952,
          "load_cent": 7339063064,
          "q": 0,
          "exported": 0,
          "exported_inodes": 0,
          "imported": 0,
          "imported_inodes": 0
      },....
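
For reference, the reply_latency counters above translate into an average reply latency of sum / avgcount:

    259696.749971457 s / 73389282 replies ≈ 0.0035 s, i.e. about 3.5 ms per request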

The setup is expected to grow, both in the amount of stored data
and in the number of clients. The MDS process currently consumes about 36
TB RAM, with 22 TB resident. Since a large part of the MDS runs single-threaded,
a CPU with fewer cores and a higher clock frequency might be a better
choice in this setup.

I suppose you mean GB up there. ^o^

If memory serves me well, there are knobs to control MDS memory usage, so
tuning them upwards may help.

You probably mean mds_cache_size. That's the maximum number of inodes the MDS will cache.

Keep in mind that a single inode uses about 4 kB of memory, so the default of 100k inodes will consume about 400 MB of memory.

You can increase this to 16,777,216 so it will use about 64 GB at most. I would still advise putting 128 GB of memory in that machine, since the MDS might have a leak at some point and you want to give it some headroom.

Source: http://docs.ceph.com/docs/master/cephfs/mds-config-ref/
mds_cache_size is already set to 5,000,000 and will need to be raised again, since there are already cache pressure messages in the ceph logs. 128 GB RAM will definitely be a good idea.
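
As a rough cross-check of the ~4 kB per inode figure: 5,000,000 cached inodes come out at about 20 GB, which is in the same ballpark as the ~22 GB resident size mentioned above, and 16,777,216 inodes would indeed be about 64 GB. For anyone following along, the limit can be raised on a running MDS and persisted in ceph.conf roughly like this (daemon name and value are just examples):

    # bump the inode cache limit on a running MDS (name and value are examples)
    ceph tell mds.0 injectargs '--mds_cache_size=8000000'

    # make it persistent in ceph.conf
    [mds]
    mds cache size = 8000000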

And yes to the fewer cores, more speed rationale. Up to a point, of course.
Indeed. An E5 with faster single-core performance is better for the MDS than a slower one with more cores.
So I'll have a closer look at configurations with E5-1XXX.

Again, checking with atop should give you a better insight there.

Also, up there you said the metadata stuff feels sluggish; have you considered
moving that pool to SSDs?
I recall from recent benchmarks that there was no benefit in having the metadata on SSD. Sure, it might help a bit with journal replay, but I think that regular disks with a proper journal do just fine.
Most of the metadata is read by the MDS upon start and cached in memory (that's why the process consumes several GB of RAM...). Given a suitable cache size, only journal updates should result in I/O to the metadata pool; client requests should be served from memory.

Thanks for the hints, I'll go for a single-socket setup with an E5-1XXX and 128 GB RAM.

Regards,
Burkhard
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


