On Fri, Nov 28, 2014 at 3:29 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> On 11/28/2014 03:22 PM, Florian Haas wrote:
>> On Fri, Nov 28, 2014 at 3:14 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>> On 11/28/2014 01:04 PM, Florian Haas wrote:
>>>> Hi everyone,
>>>>
>>>> I'd like to come back to a discussion from 2012 (thread at
>>>> http://marc.info/?l=ceph-devel&m=134808745719233) to estimate the
>>>> expected MDS memory consumption from file metadata caching. I am certain
>>>> the following is full of untested assumptions, some of which are
>>>> probably inaccurate, so please shoot those down as needed.
>>>>
>>>> I did an entirely unscientific study of a real data set (my laptop, in
>>>> case you care to know) which currently holds about 70G worth of data in
>>>> a huge variety of file sizes and several file systems, and currently
>>>> lists about 944,000 inodes as being in use. So going purely by order of
>>>> magnitude and doing a wild approximation, I'll assume a ratio of 1
>>>> million files in 100G, or 10,000 files per gigabyte, which means an
>>>> average file size of about 100KB -- again, approximating and forgetting
>>>> about the difference between 10^3 and 2^10, and using a stupid
>>>> arithmetic mean rather than a median, which would probably be much more
>>>> useful.
>>>>
>>>> If I were to assume that all those files were in CephFS, and they were
>>>> all somehow regularly in use (or at least one file in each directory),
>>>> then the Ceph MDS would have to keep the metadata of all those files in
>>>> cache. Suppose further that the stat struct for each of those files is
>>>> anywhere between 1 and 2KB, and we go by an average of 1.5KB metadata
>>>> per file including some overhead; that would mean the average metadata
>>>> per file is about 1.5% of the average file size. So for my 100G of
>>>> data, the MDS would use about 1.5G of RAM for caching.
>>>>
>>>> If you scale that up to a filestore of, say, a petabyte, that means all
>>>> your Ceph MDSs would consume a relatively whopping 15TB in total RAM for
>>>> metadata caching, again assuming that *all* the data is actually used by
>>>> clients.
>>>>
>>>
>>> Why do you assume that ALL MDSs keep ALL metadata in memory? Isn't the
>>> whole point of directory fragmentation that they all keep a bit of the
>>> inodes in memory to spread the load?
>>
>> Directory subtree partitioning is considered neither stable nor
>> supported. Hence it's important to understand what a single active
>> MDS will hold.
>>
>
> Understood. So it's about sizing your MDS right now, not in the future
> when the subtree partitioning works :)

Correct.

> Isn't the memory consumption also influenced by mds_cache_size? That is
> the number of inodes the MDS will cache in memory.
>
> Something that is not in cache will be read from RADOS afaik, so there
> will be a limit to how much memory the MDS will consume.

I am acutely aware of that, but this is not about *limiting* MDS memory
consumption. It's about "if I wanted to make sure that all my metadata
fits in the cache, how much memory would I need for that?"
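To make that question concrete, here is the kind of back-of-the-envelope
arithmetic I have in mind, as a quick Python sketch. It only encodes the
guesses from my original post -- roughly 10,000 files per gigabyte of
used storage and about 1.5 KB of cached metadata per inode -- so treat
the constants as assumptions, not measured values:

# Rough MDS cache sizing, using the assumptions from my original post:
# ~10,000 files per GB of used storage and ~1.5 KB of cached metadata
# per inode. Both numbers are guesses, not measurements.

FILES_PER_GB = 10_000            # ~1 million files per 100 GB
METADATA_BYTES_PER_INODE = 1500  # ~1.5 KB per inode, including overhead

def mds_cache_estimate(used_storage_gb, fraction_in_use):
    # Returns (inodes that would need caching, estimated cache RAM in GB).
    inodes = int(used_storage_gb * FILES_PER_GB * fraction_in_use)
    ram_gb = inodes * METADATA_BYTES_PER_INODE / 10**9
    return inodes, ram_gb

# My 100G laptop-derived example, with everything "in regular use":
print(mds_cache_estimate(100, 1.0))         # (1000000, 1.5)      -> ~1.5 GB

# A petabyte filestore with 10% of the data in regular use:
print(mds_cache_estimate(1_000_000, 0.10))  # (1000000000, 1500.0) -> ~1.5 TB

If I understand your point about mds_cache_size correctly, the inode
count is also more or less the value you would have to set it to if you
wanted the MDS to be able to keep all of that metadata in its cache.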
Also, as a corollary to this discussion, I'm not sure if anyone has
actually run any stats on CephFS performance (read/write throughput,
latency, and IOPS) as a function of cache hit/miss ratio. In other words,
I don't know, and I'm not sure anyone knows, what the actual impact of
MDS cache misses is -- I am just assuming it would be quite significant,
otherwise I can't imagine why Sage would have come up with the idea of a
metadata-caching MDS in the first place. :)

>>>> Now of course it's entirely unrealistic that in a production system data
>>>> is actually ever used across the board, but are the above considerations
>>>> "close enough" for a rule-of-thumb approximation of MDS memory
>>>> footprint? As in,
>>>>
>>>> Total MDS RAM = (Total used storage) * (fraction of data in regular use)
>>>> * 0.015
>>>>
>>>> If CephFS users could use a rule of thumb like that, it would help them
>>>> answer questions like "given a filesystem of size X, will a single MDS
>>>> be enough to hold my metadata caches if Y is the maximum amount of
>>>> memory I can afford for budget Z".
>>>>
>>>> All thoughts and comments much appreciated. Thank you!
>>>>
>>>> Cheers,
>>>> Florian
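For that "filesystem of size X, memory budget Y" question, the same rule
of thumb can simply be turned around. Again, this is only a sketch built
on the untested 1.5% metadata-to-data ratio from above, not on any real
measurements:

# Inverse of the rule of thumb: given the RAM you can afford for a single
# active MDS, roughly how much "in regular use" data could it keep fully
# cached? The 1.5% ratio is the same untested assumption as above.

METADATA_TO_DATA_RATIO = 0.015   # ~1.5 KB of metadata per ~100 KB file

def cacheable_data_gb(mds_ram_gb, fraction_in_use):
    # Total used storage (in GB) whose metadata would still fit in cache.
    return mds_ram_gb / (METADATA_TO_DATA_RATIO * fraction_in_use)

# Example: 64 GB of MDS RAM, with 20% of the data touched regularly
print(cacheable_data_gb(64, 0.20))   # ~21333 GB, i.e. roughly 21 TB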