I meant to chime in earlier here, but then the weekend happened; comments inline.

On Sun, Nov 30, 2014 at 7:20 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> Why would you want all CephFS metadata in memory? With any filesystem
> that will be a problem.

The latency associated with a cache miss (a RADOS OMAP dirfrag read) is fairly high, so the goal when sizing will be to allow the MDSs to keep a very large proportion of the metadata in RAM. In a local FS, the filesystem metadata in RAM is relatively small and the speed to disk is relatively high. In CephFS that is reversed: we want to compensate for the cache miss latency by giving the MDS lots of RAM and a big cache.

Hot-standby MDSs are another manifestation of the expected large cache: we expect these caches to be big, to the point where refilling from the backing store after a failure would be annoyingly slow, so it's worth keeping that hot standby cache.

Also, remember that because we embed inodes in dentries, when we load a directory fragment we are also loading all the inodes in that directory fragment -- if you have only one file open, but it has an ancestor with lots of files, then you'll have more files in cache than you might have expected.

> We do however need a good rule of thumb of how much memory is used for
> each inode.

Yes -- and ideally some practical measurements too :-)

One important point that I don't think anyone has mentioned so far: the memory consumption per inode depends on how many clients have capabilities on the inode. So if many clients hold a read capability on a file, more memory will be used MDS-side for that file. If designing a benchmark for this, the client count and the level of overlap in the client workloads would be important dimensions.

The number of *open* files on clients strongly affects the ability of the MDS to trim its cache, since the MDS pins in cache any inode which is in use by a client. We recently added health checks so that the MDS can complain about clients that are failing to respond to requests to trim their caches, and the way we test this is to have a client obstinately keep some number of files open.

We also allocate memory for pending metadata updates (so-called 'projected inodes') while they are in the journal, so memory usage will also depend on the journal size and the number of writes in flight.

It would be really useful to come up with a test script that monitors MDS memory consumption as a function of the number of files in cache, the number of files opened by clients, and the number of clients opening the same files. I feel a 3D plot coming on :-)

Cheers,
John
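
P.S. In case it helps anyone get started, here is a rough, untested sketch in Python of the kind of monitoring loop I mean. It assumes the MDS admin socket is reachable via "ceph daemon mds.<name> perf dump", that inode and cap counts appear under the "mds" section of that output (counter names vary between versions), and that resident memory can be read from /proc/<pid>/status. Driving the client side (number of files held open, number of overlapping clients) would still need to be scripted separately.

#!/usr/bin/env python
#
# Rough sketch (untested): periodically sample MDS cache counters and
# process RSS, and emit CSV suitable for plotting.
#
# Assumptions (adjust for your cluster/version):
#  * "ceph daemon mds.<name> perf dump" works on this host (admin socket)
#  * inode/cap counts live under the "mds" section of perf dump
#    (counter names differ between Ceph versions)
#  * the MDS pid is passed in by the caller and /proc is available

import json
import subprocess
import sys
import time


def perf_dump(mds_name):
    # Query the MDS admin socket for its performance counters
    out = subprocess.check_output(
        ["ceph", "daemon", "mds.{0}".format(mds_name), "perf", "dump"])
    return json.loads(out.decode("utf-8"))


def rss_kb(pid):
    # Resident set size of the MDS process, in kB, from /proc
    with open("/proc/{0}/status".format(pid)) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0


def main():
    mds_name, pid = sys.argv[1], sys.argv[2]
    print("time,inodes,inodes_pinned,caps,rss_kb")
    while True:
        mds = perf_dump(mds_name).get("mds", {})
        print("{0},{1},{2},{3},{4}".format(
            int(time.time()),
            mds.get("inodes", 0),
            mds.get("inodes_pinned", 0),
            mds.get("caps", 0),
            rss_kb(pid)))
        sys.stdout.flush()
        time.sleep(10)


if __name__ == "__main__":
    main()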