Re: Revisiting MDS memory footprint

On Mon, Dec 1, 2014 at 8:06 AM, John Spray <john.spray@xxxxxxxxxx> wrote:
> I meant to chime in earlier here but then the weekend happened, comments inline
>
> On Sun, Nov 30, 2014 at 7:20 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> Why would you want all CephFS metadata in memory? With any filesystem
>> that will be a problem.
>
> The latency associated with a cache miss (RADOS OMAP dirfrag read) is
> fairly high, so the goal when sizing will be to allow the MDSs to keep a
> very large proportion of the metadata in RAM.  In a local FS, the
> filesystem metadata in RAM is relatively small, and the speed to disk
> is relatively high.  In Ceph FS, that is reversed: we want to
> compensate for the cache miss latency by having lots of RAM in the MDS
> and a big cache.
>
> hot-standby MDSs are another manifestation of the expected large
> cache: we expect these caches to be big, to the point where refilling
> from the backing store on a failure would be annoyingly slow, and it's
> worth keeping that hot standby cache.

I actually don't think the cache misses should be *dramatically* more
expensive than local FS misses. They'll cost more since they're remote
and a leveldb lookup is a bit slower than hitting the right spot on
disk, but everything's nicely streamed in and such, so it's not too
bad.
But I'm also guessing as much as you are about the rest of it, which
looks good to me. :)

The one thing I'd also bring up is to be a bit more explicit that the
CephFS in-memory inode size has nothing to do with that of a local FS.
We don't need to keep track of things like block locations, but we do
keep track of file "capabilities" (leases) and a whole bunch of other
state: the scrubbing/fsck status of the inode (coming soon!), the
clean/dirty status in a lot more detail than the kernel does, any old
versions of the inode that have been snapshotted, and so on.
Once upon a time Sage did have some numbers indicating that a cached
dentry took about 1KB, but things change in both directions pretty
frequently. Memory use will likely be something we look at closely
around the time we're deciding whether to declare CephFS ready for
production previews.
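As a back-of-envelope only, taking that dated ~1KB-per-dentry figure at
face value and a made-up cache size, the arithmetic looks something like
this (a sketch, not a measurement):

    # Rough sizing sketch.  1 KB/dentry is the dated estimate above, not
    # a measurement; 4M cached dentries is a made-up example (note that
    # the "mds cache size" option is an inode count, not bytes).
    BYTES_PER_DENTRY = 1024
    cached_dentries = 4 * 1000 * 1000
    print("approx MDS cache RAM: %.1f GB"
          % (BYTES_PER_DENTRY * cached_dentries / float(2 ** 30)))
    # -> about 3.8 GB, before caps, snapshots and projected inodes

so a few million cached dentries already implies several GB of RAM on
the MDS.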
-Greg

>
> Also, remember that because we embed inodes in dentries, when we load
> a directory fragment we are also loading all the inodes in that
> directory fragment -- if you have only one file open, but it has an
> ancestor directory with lots of files, then you'll have more files in cache than
> you might have expected.
>
>> We do however need a good rule of thumb of how much memory is used for
>> each inode.
>
> Yes -- and ideally some practical measurements too :-)
>
> One important point that I don't think anyone mentioned so far: the
> memory consumption per inode depends on how many clients have
> capabilities on the inode.  So if many clients hold a read capability
> on a file, more memory will be used MDS-side for that file.  If
> designing a benchmark for this, the client count and the level of
> overlap in the client workloads would be important dimensions.
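
A quick way to get a feel for that overlap on a live system is to sum
the caps held by each client session. A sketch, assuming the MDS admin
socket's "session ls" command reports a num_caps field per session
(check your version), run on the MDS host:

    #!/usr/bin/env python
    # Sketch: show capabilities held per client session on one MDS.
    # "mds.a" is a placeholder daemon name; the num_caps field is an
    # assumption about the "session ls" output format.
    import json, subprocess

    out = subprocess.check_output(["ceph", "daemon", "mds.a",
                                   "session", "ls"])
    sessions = json.loads(out.decode("utf-8"))
    for s in sessions:
        print("client.%s holds %s caps" % (s.get("id"), s.get("num_caps", 0)))
    print("total caps: %d" % sum(s.get("num_caps", 0) for s in sessions))
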
>
> The number of *open* files on clients strongly affects the ability of
> the MDS to trim its cache, since the MDS pins in cache any inode which
> is in use by a client.  We recently added health checks so that the
> MDS can complain about clients that are failing to respond to requests
> to trim their caches, and the way we test this is to have a client
> obstinately keep some number of files open.
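
The "obstinate client" half of that test is easy to sketch; this assumes
a CephFS mount at a placeholder path and a file-descriptor ulimit raised
above the file count:

    #!/usr/bin/env python
    # Sketch: hold N files open on a CephFS mount so the MDS must keep
    # their inodes (and our caps) pinned in cache.  Path and count are
    # placeholders; raise "ulimit -n" above num_files before running.
    import os, time

    mountpoint = "/mnt/cephfs/pin_test"   # hypothetical CephFS mount
    num_files = 100000

    if not os.path.isdir(mountpoint):
        os.makedirs(mountpoint)

    handles = [open(os.path.join(mountpoint, "f%d" % i), "w")
               for i in range(num_files)]
    time.sleep(3600)   # sit on the open handles; the MDS can't trim them
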
>
> We also allocate memory for pending metadata updates (so-called
> 'projected inodes') while they are in the journal, so the memory usage
> will depend on the journal size and the number of writes in flight.
>
> It would be really useful to come up with a test script that monitors
> MDS memory consumption as a function of number of files in cache,
> number of files opened by clients, number of clients opening the same
> files.  I feel a 3D plot coming on :-)
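
A minimal sketch of such a monitor, assuming "perf dump" on the MDS
admin socket exposes an mds_mem section with ino/dn/cap/rss counters
(names may differ between versions):

    #!/usr/bin/env python
    # Sketch: sample MDS cache contents vs. memory every 10s and emit
    # CSV for plotting.  The counter names under "mds_mem" are
    # assumptions; adjust to whatever your version's perf dump reports.
    import json, subprocess, time

    MDS = "mds.a"   # placeholder daemon name; run on the MDS host

    print("time,inodes,dentries,caps,rss")
    while True:
        out = subprocess.check_output(["ceph", "daemon", MDS,
                                       "perf", "dump"])
        mem = json.loads(out.decode("utf-8")).get("mds_mem", {})
        print("%d,%s,%s,%s,%s" % (time.time(),
                                  mem.get("ino"), mem.get("dn"),
                                  mem.get("cap"), mem.get("rss")))
        time.sleep(10)

Varying the number of cached files, open files, and clients while that
runs would give the data for John's 3D plot.
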
>
> Cheers,
> John
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



