Re: Revisiting MDS memory footprint

Wido den Hollander <wido@xxxxxxxx> · Fri, 28 Nov 2014 15:29:18 +0100

On 11/28/2014 03:22 PM, Florian Haas wrote:
> On Fri, Nov 28, 2014 at 3:14 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> On 11/28/2014 01:04 PM, Florian Haas wrote:
>>> Hi everyone,
>>>
>>> I'd like to come back to a discussion from 2012 (thread at
>>> http://marc.info/?l=ceph-devel&m=134808745719233) to estimate the
>>> expected MDS memory consumption from file metadata caching. I am certain
>>> the following is full of untested assumptions, some of which are
>>> probably inaccurate, so please shoot those down as needed.
>>>
>>> I did an entirely unscientific study of a real data set (my laptop, in
>>> case you care to know) which currently holds about 70G worth of data in
>>> a huge variety of file sizes and several file systems, and currently
>>> lists about 944,000 inodes as being in use. So going purely by order of
>>> magnitude and doing a wild approximation, I'll assume a ratio of 1
>>> million files in 100G, or 10,000 files per gigabyte, which means an
>>> average file size of about 100KB -- again, approximating and forgetting
>>> about the difference between 10^3 and 2^10, and using a stupid
>>> arithmetic mean rather than a median which would probably be much more
>>> useful.
>>>
>>> If I were to assume that all those files were in CephFS, and they were
>>> all somehow regularly in use (or at least one file in each directory),
>>> then the Ceph MDS would have to keep the metadata of all those files in
>>> cache. Suppose further that the stat struct for all those files is
>>> anywhere between 1 and 2KB, and we go by an average of 1.5KB metadata
>>> per file including some overhead, then that would mean the average
>>> metadata per file is about 1.5% of the average file size. So for my 100G
>>> of data, the MDS would use about 1.5G of RAM for caching.
>>>
>>> If you scale that up for a filestore of say a petabyte, that means all
>>> your Ceph MDSs would consume a relatively whopping 15TB in total RAM for
>>> metadata caching, again assuming that *all* the data is actually used by
>>> clients.
>>>
>>
>> Why do you assume that ALL MDSs keep ALL metadata in memory? Isn't the
>> whole point of directory fragmentation that they all keep a bit of the
>> inodes in memory to spread the load?
> 
> Directory subtree partitioning is considered neither stable nor
> supported. Hence why it's important to understand what a single active
> MDS will hold.
> 

Understood. So it's about sizing your MDS right now, not in the future
when subtree partitioning works :)

Isn't the memory consumption also influenced by mds_cache_size?
That is the number of inodes the MDS will cache in memory.

Anything that is not in cache will be read from RADOS afaik, so there
will be a limit to how much memory the MDS will consume.
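
To make that limit concrete, here is a rough sketch of the upper bound
implied by mds_cache_size. It borrows the 1.5KB-per-inode guess from
Florian's mail below (an assumption, not a measured value), and the
100000 passed in is only an example setting, so check your own config.
The function name is made up for illustration.

```python
# Rough upper bound on MDS cache memory implied by mds_cache_size.
# ASSUMPTION: ~1.5KB of metadata per cached inode (Florian's estimate,
# using 1KB = 1000 bytes as in his mail), not a measured figure.
BYTES_PER_INODE = 1_500


def mds_cache_ram_bytes(mds_cache_size, bytes_per_inode=BYTES_PER_INODE):
    """Return the approximate RAM needed to hold mds_cache_size inodes."""
    return mds_cache_size * bytes_per_inode


if __name__ == "__main__":
    # Example value only; use whatever mds_cache_size you actually run with.
    cap = mds_cache_ram_bytes(100_000)
    print(f"{cap / 1e6:.0f} MB")
```

With those numbers the cache would top out at about 150 MB, no matter
how much data sits in the filesystem.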

>>> Now of course it's entirely unrealistic that in a production system data
>>> is actually ever used across the board, but are the above considerations
>>> "close enough" for a rule-of-thumb approximation of MDS memory
>>> footprint? As in,
>>>
>>> Total MDS RAM = (Total used storage) * (fraction of data in regular use)
>>> * 0.015
>>>
>>> If CephFS users could use a rule of thumb like that, it would help them
>>> answer questions like "given a filesystem of size X, will a single MDS
>>> be enough to hold my metadata caches if Y is the maximum amount of
>>> memory I can afford for budget Z".
>>>
>>> All thoughts and comments much appreciated. Thank you!
>>>
>>> Cheers,
>>> Florian
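
For what it's worth, the rule of thumb above is easy to put into a few
lines of Python. This is only a sketch of Florian's arithmetic, using
his assumptions (100KB average file size, 1.5KB metadata per file,
10^3-based units); the names are invented for illustration and none of
the numbers are measurements.

```python
# Sketch of the rule of thumb quoted above:
#   Total MDS RAM = used storage * fraction in regular use * 0.015
# ASSUMPTIONS (from the quoted mail, not measured):
AVG_FILE_BYTES = 100_000       # ~100KB average file size
META_BYTES_PER_FILE = 1_500    # ~1.5KB of cached metadata per file


def estimate_mds_ram_bytes(used_bytes, active_fraction=1.0):
    """Estimate MDS cache RAM for a filesystem holding used_bytes of data."""
    files = used_bytes // AVG_FILE_BYTES
    active_files = round(files * active_fraction)
    return active_files * META_BYTES_PER_FILE


def single_mds_is_enough(used_bytes, active_fraction, ram_budget_bytes):
    """True if the estimated cache fits into the given RAM budget."""
    return estimate_mds_ram_bytes(used_bytes, active_fraction) <= ram_budget_bytes


if __name__ == "__main__":
    GB, TB, PB = 10**9, 10**12, 10**15
    print(estimate_mds_ram_bytes(100 * GB) / GB, "GB")  # the laptop example
    print(estimate_mds_ram_bytes(1 * PB) / TB, "TB")    # the petabyte example
```

Integer file counts are used instead of multiplying by 0.015 directly,
which avoids floating-point noise in the result. The two calls reproduce
the 1.5G and 15TB figures from the mail.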

-- 
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on