On 11/28/2014 03:36 PM, Florian Haas wrote:
> On Fri, Nov 28, 2014 at 3:29 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> On 11/28/2014 03:22 PM, Florian Haas wrote:
>>> On Fri, Nov 28, 2014 at 3:14 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>> On 11/28/2014 01:04 PM, Florian Haas wrote:
>>>>> Hi everyone,
>>>>>
>>>>> I'd like to come back to a discussion from 2012 (thread at
>>>>> http://marc.info/?l=ceph-devel&m=134808745719233) to estimate the
>>>>> expected MDS memory consumption from file metadata caching. I am
>>>>> certain the following is full of untested assumptions, some of which
>>>>> are probably inaccurate, so please shoot those down as needed.
>>>>>
>>>>> I did an entirely unscientific study of a real data set (my laptop,
>>>>> in case you care to know) which currently holds about 70G worth of
>>>>> data in a huge variety of file sizes across several file systems,
>>>>> and currently lists about 944,000 inodes as being in use. Going
>>>>> purely by order of magnitude and doing a wild approximation, I'll
>>>>> assume a ratio of 1 million files per 100G, or 10,000 files per
>>>>> gigabyte, which means an average file size of about 100KB -- again,
>>>>> approximating and forgetting about the difference between 10^3 and
>>>>> 2^10, and using a stupid arithmetic mean rather than a median, which
>>>>> would probably be much more useful.
>>>>>
>>>>> If I were to assume that all those files were in CephFS, and that
>>>>> they were all somehow regularly in use (or at least one file in each
>>>>> directory), then the Ceph MDS would have to keep the metadata of all
>>>>> those files in cache. Suppose further that the stat struct for each
>>>>> of those files is anywhere between 1 and 2KB and we go by an average
>>>>> of 1.5KB of metadata per file including some overhead; that would
>>>>> mean the metadata per file is about 1.5% of the average file size.
>>>>> So for my 100G of data, the MDS would use about 1.5G of RAM for
>>>>> caching.
>>>>>
>>>>> If you scale that up to a filestore of, say, a petabyte, all your
>>>>> Ceph MDSs together would consume a relatively whopping 15TB of RAM
>>>>> for metadata caching, again assuming that *all* the data is actually
>>>>> used by clients.
>>>>>
>>>>
>>>> Why do you assume that ALL MDSs keep ALL metadata in memory? Isn't
>>>> the whole point of directory fragmentation that they each keep a part
>>>> of the inodes in memory to spread the load?
>>>
>>> Directory subtree partitioning is considered neither stable nor
>>> supported, which is why it's important to understand what a single
>>> active MDS will hold.
>>>
>>
>> Understood. So it's about sizing your MDS right now, not in the future
>> when the subtree partitioning works :)
>
> Correct.
>
>> Isn't the memory consumption also influenced by mds_cache_size? That
>> is the number of inodes the MDS will cache in memory.
>>
>> Something that is not in cache will be read from RADOS afaik, so there
>> will be a limit to how much memory the MDS will consume.
>
> I am acutely aware of that, but this is not about *limiting* MDS
> memory consumption. It's about "if I wanted to make sure that all my
> metadata fits in the cache, how much memory would I need for that?"
>

Understood. I had hoped someone else with more knowledge about this
would chime in here. But let's draw an analogy with ZFS, for example:
there you size your (L2)ARC based on your hot data. Why would you want
all CephFS metadata in memory? With any filesystem that would be a
problem.

We do, however, need a good rule of thumb for how much memory is used
per inode.
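
Purely as a sketch, and taking the numbers in this thread at face value
(an assumed ~100KB average file size and an assumed ~1.5KB of metadata
per cached inode, neither of which is a measured Ceph figure), the
arithmetic behind the 1.5% figure (the 0.015 factor in the proposed
rule of thumb below) would look something like this in Python:

#!/usr/bin/env python3
# Back-of-the-envelope MDS cache sizing. Every constant here is an
# assumption taken from this thread, not a measured Ceph figure:
#   - average file size:         ~100KB
#   - metadata per cached inode: ~1.5KB (the 1-2KB guess, averaged)
# 1.5KB / 100KB is where the 0.015 ("1.5%") factor comes from.

AVG_FILE_SIZE = 100 * 1024     # bytes, assumed
META_PER_INODE = 1.5 * 1024    # bytes, assumed

GiB = 1024 ** 3
TiB = 1024 ** 4
PiB = 1024 ** 5

def mds_cache_estimate(total_used_bytes, hot_fraction):
    """Return (hot inode count, estimated cache RAM in bytes).

    The inode count is roughly what you would feed into mds_cache_size
    (which, as noted above, counts inodes); the RAM figure is what the
    0.015 rule of thumb predicts.
    """
    hot_inodes = total_used_bytes * hot_fraction / AVG_FILE_SIZE
    return hot_inodes, hot_inodes * META_PER_INODE

# ~100G with everything hot: the ~1.5G from the laptop example above.
inodes, ram = mds_cache_estimate(100 * GiB, 1.0)
print("100G, 100%% hot: %.0f inodes, %.1f GiB" % (inodes, ram / GiB))

# A petabyte with only 10% of the data in regular use.
inodes, ram = mds_cache_estimate(1 * PiB, 0.1)
print("1P,   10%% hot: %.0f inodes, %.1f TiB" % (inodes, ram / TiB))

Whether 1.5KB per cached inode is anywhere near reality is exactly the
open question; if the real figure turned out to be, say, 4KB, the
factor scales linearly and every estimate above roughly triples.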

> Also, as a corollary to this discussion, I'm not sure if anyone has
> actually run any stats on CephFS performance (read/write throughput,
> latency, and IOPS) as a function of cache hit/miss ratio. In other
> words I don't know, and I'm not sure anyone knows, what the actual
> impact of MDS cache misses is -- I am just assuming it would be quite
> significant, otherwise I can't imagine why Sage would have come up
> with the idea of a metadata-caching MDS in the first place. :)
>
>>>>> Now of course it's entirely unrealistic that in a production system
>>>>> the data is ever actually used across the board, but are the above
>>>>> considerations "close enough" for a rule-of-thumb approximation of
>>>>> the MDS memory footprint? As in,
>>>>>
>>>>> Total MDS RAM = (Total used storage) * (fraction of data in regular
>>>>> use) * 0.015
>>>>>
>>>>> If CephFS users could use a rule of thumb like that, it would help
>>>>> them answer questions like "given a filesystem of size X, will a
>>>>> single MDS be enough to hold my metadata cache if Y is the maximum
>>>>> amount of memory I can afford for budget Z?"
>>>>>
>>>>> All thoughts and comments much appreciated. Thank you!
>>>>>
>>>>> Cheers,
>>>>> Florian

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on