On 11/28/2014 03:22 PM, Florian Haas wrote:
> On Fri, Nov 28, 2014 at 3:14 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> On 11/28/2014 01:04 PM, Florian Haas wrote:
>>> Hi everyone,
>>>
>>> I'd like to come back to a discussion from 2012 (thread at
>>> http://marc.info/?l=ceph-devel&m=134808745719233) to estimate the
>>> expected MDS memory consumption from file metadata caching. I am certain
>>> the following is full of untested assumptions, some of which are
>>> probably inaccurate, so please shoot those down as needed.
>>>
>>> I did an entirely unscientific study of a real data set (my laptop, in
>>> case you care to know) which currently holds about 70G worth of data in
>>> a huge variety of file sizes and several file systems, and currently
>>> lists about 944,000 inodes as being in use. So going purely by order of
>>> magnitude and doing a wild approximation, I'll assume a ratio of 1
>>> million files in 100G, or 10,000 files per gigabyte, which means an
>>> average file size of about 100KB -- again, approximating and forgetting
>>> about the difference between 10^3 and 2^10, and using a stupid
>>> arithmetic mean rather than a median which would probably be much more
>>> useful.
>>>
>>> If I were to assume that all those files were in CephFS, and they were
>>> all somehow regularly in use (or at least one file in each directory),
>>> then the Ceph MDS would have to keep the metadata of all those files in
>>> cache. Suppose further that the stat struct for all those files is
>>> anywhere between 1 and 2KB, and we go by an average of 1.5KB metadata
>>> per file including some overhead, then that would mean the average
>>> metadata per file is about 1.5% of the average file size. So for my 100G
>>> of data, the MDS would use about 1.5G of RAM for caching.
>>>
>>> If you scale that up for a filestore of say a petabyte, that means all
>>> your Ceph MDSs would consume a relatively whopping 15TB in total RAM for
>>> metadata caching, again assuming that *all* the data is actually used by
>>> clients.
>>>
>>
>> Why do you assume that ALL MDSs keep ALL metadata in memory? Isn't the
>> whole point of directory fragmentation that they all keep a bit of the
>> inodes in memory to spread the load?
>
> Directory subtree partitioning is considered neither stable nor
> supported. Hence why it's important to understand what a single active
> MDS will hold.
>

Understood. So it's about sizing your MDS right now, not in the future
when subtree partitioning works :)

Isn't the memory consumption also influenced by mds_cache_size? That is
the number of inodes the MDS will cache in memory.

Anything that is not in cache will be read from RADOS afaik, so there
will be a limit to how much memory the MDS will consume.

>>> Now of course it's entirely unrealistic that in a production system data
>>> is actually ever used across the board, but are the above considerations
>>> "close enough" for a rule-of-thumb approximation of MDS memory
>>> footprint? As in,
>>>
>>> Total MDS RAM = (Total used storage) * (fraction of data in regular use)
>>> * 0.015
>>>
>>> If CephFS users could use a rule of thumb like that, it would help them
>>> answer questions like "given a filesystem of size X, will a single MDS
>>> be enough to hold my metadata caches if Y is the maximum amount of
>>> memory I can afford for budget Z".
>>>
>>> All thoughts and comments much appreciated. Thank you!
>>>
>>> Cheers,
>>> Florian

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
_______________________________________________
ceph-users mailing list
ceph-users@xxxxxxxxxxxxxx
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
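
For anyone who wants to play with the numbers, here is a rough sketch of
the rule of thumb from this thread, with the mds_cache_size limit applied
as a cap on the number of cached inodes. It is not a Ceph API. The
function name, the constants and the 100,000-inode value in the usage
lines are illustrative assumptions; the 100KB average file size and the
1.5KB of metadata per file are the untested guesses from the quoted
message, in decimal units.

```python
# Back-of-the-envelope estimate only; nothing here talks to Ceph.
# Both constants are the thread's own guesses, not measured values.
AVG_FILE_SIZE_BYTES = 100_000      # about 10,000 files per GB (assumed)
METADATA_PER_FILE_BYTES = 1_500    # 1 to 2 KB per cached inode (assumed)


def estimate_mds_ram_bytes(used_storage_bytes, fraction_in_use=1.0,
                           cache_size_inodes=None):
    """Estimate MDS cache RAM in bytes for a given amount of stored data.

    cache_size_inodes stands in for mds_cache_size: if given, the number
    of inodes counted as cached is capped at that value.
    """
    # Number of files assumed to be in regular use.
    inodes = used_storage_bytes * fraction_in_use / AVG_FILE_SIZE_BYTES
    # The MDS will not hold more inodes than its cache limit allows.
    if cache_size_inodes is not None:
        inodes = min(inodes, cache_size_inodes)
    return inodes * METADATA_PER_FILE_BYTES


if __name__ == "__main__":
    GB = 10**9
    PB = 10**15
    print(estimate_mds_ram_bytes(100 * GB) / GB, "GB for 100 GB, no cap")
    print(estimate_mds_ram_bytes(1 * PB) / 10**12, "TB for 1 PB, no cap")
    # Hypothetical cache limit of 100,000 inodes.
    print(estimate_mds_ram_bytes(1 * PB, cache_size_inodes=100_000) / 10**6,
          "MB for 1 PB with the cap")
```

Without a cap this is just (used storage) * (fraction in use) * 0.015.
With a cap, the estimate stops growing once the in-use file count exceeds
the cache limit, which is the point made in the reply above.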