I meant to chime in earlier here, but then the weekend happened; comments inline.

On Sun, Nov 30, 2014 at 7:20 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> Why would you want all CephFS metadata in memory? With any filesystem
> that will be a problem.

The latency associated with a cache miss (a RADOS OMAP dirfrag read) is fairly high, so the goal when sizing will be to allow the MDSs to keep a very large proportion of the metadata in RAM. In a local FS, the filesystem metadata in RAM is relatively small and the speed to disk is relatively high. In CephFS that is reversed: we want to compensate for the cache miss latency by giving the MDS lots of RAM and a big cache.

Hot-standby MDSs are another manifestation of the expected large cache: we expect these caches to be big, to the point where refilling from the backing store after a failure would be annoyingly slow, so it's worth keeping that hot standby cache.

Also, remember that because we embed inodes in dentries, when we load a directory fragment we are also loading all the inodes in that directory fragment -- if you have only one file open, but it has an ancestor with lots of files, then you'll have more files in cache than you might have expected.

> We do however need a good rule of thumb of how much memory is used for
> each inode.

Yes -- and ideally some practical measurements too :-)

One important point that I don't think anyone has mentioned so far: the memory consumption per inode depends on how many clients have capabilities on the inode. So if many clients hold a read capability on a file, more memory will be used MDS-side for that file. If designing a benchmark for this, the client count and the level of overlap in the client workloads would be important dimensions.

The number of *open* files on clients strongly affects the ability of the MDS to trim its cache, since the MDS pins in cache any inode which is in use by a client. We recently added health checks so that the MDS can complain about clients that are failing to respond to requests to trim their caches, and the way we test this is to have a client obstinately keep some number of files open.

We also allocate memory for pending metadata updates (so-called 'projected inodes') while they are in the journal, so memory usage will also depend on the journal size and the number of writes in flight.

It would be really useful to come up with a test script that monitors MDS memory consumption as a function of the number of files in cache, the number of files opened by clients, and the number of clients opening the same files. I feel a 3D plot coming on :-)

Cheers,
John
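
P.S. In case it helps anyone get started, here is a rough, untested sketch in Python of the kind of monitoring loop I mean. It assumes the MDS admin socket is reachable via "ceph daemon mds.<name> perf dump", that inode and cap counts appear under the "mds" section of that output (counter names vary between versions), and that resident memory can be read from /proc/<pid>/status. Driving the client side (number of files held open, number of overlapping clients) would still need to be scripted separately.

#!/usr/bin/env python
#
# Rough sketch (untested): periodically sample MDS cache counters and
# process RSS, and emit CSV suitable for plotting.
#
# Assumptions (adjust for your cluster/version):
#  * "ceph daemon mds.<name> perf dump" works on this host (admin socket)
#  * inode/cap counts live under the "mds" section of perf dump
#    (counter names differ between Ceph versions)
#  * the MDS pid is passed in by the caller and /proc is available

import json
import subprocess
import sys
import time


def perf_dump(mds_name):
    # Query the MDS admin socket for its performance counters
    out = subprocess.check_output(
        ["ceph", "daemon", "mds.{0}".format(mds_name), "perf", "dump"])
    return json.loads(out.decode("utf-8"))


def rss_kb(pid):
    # Resident set size of the MDS process, in kB, from /proc
    with open("/proc/{0}/status".format(pid)) as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0


def main():
    mds_name, pid = sys.argv[1], sys.argv[2]
    print("time,inodes,inodes_pinned,caps,rss_kb")
    while True:
        mds = perf_dump(mds_name).get("mds", {})
        print("{0},{1},{2},{3},{4}".format(
            int(time.time()),
            mds.get("inodes", 0),
            mds.get("inodes_pinned", 0),
            mds.get("caps", 0),
            rss_kb(pid)))
        sys.stdout.flush()
        time.sleep(10)


if __name__ == "__main__":
    main()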