On Fri, Nov 28, 2014 at 3:29 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
> On 11/28/2014 03:22 PM, Florian Haas wrote:
>> On Fri, Nov 28, 2014 at 3:14 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>> On 11/28/2014 01:04 PM, Florian Haas wrote:
>>>> Hi everyone,
>>>>
>>>> I'd like to come back to a discussion from 2012 (thread at
>>>> http://marc.info/?l=ceph-devel&m=134808745719233) to estimate the
>>>> expected MDS memory consumption from file metadata caching. I am certain
>>>> the following is full of untested assumptions, some of which are
>>>> probably inaccurate, so please shoot those down as needed.
>>>>
>>>> I did an entirely unscientific study of a real data set (my laptop, in
>>>> case you care to know) which currently holds about 70G worth of data in
>>>> a huge variety of file sizes and several file systems, and currently
>>>> lists about 944,000 inodes as being in use. So going purely by order of
>>>> magnitude and doing a wild approximation, I'll assume a ratio of 1
>>>> million files in 100G, or 10,000 files per gigabyte, which means an
>>>> average file size of about 100KB -- again, approximating and forgetting
>>>> about the difference between 10^3 and 2^10, and using a stupid
>>>> arithmetic mean rather than a median, which would probably be much more
>>>> useful.
>>>>
>>>> If I were to assume that all those files were in CephFS, and they were
>>>> all somehow regularly in use (or at least one file in each directory),
>>>> then the Ceph MDS would have to keep the metadata of all those files in
>>>> cache. Suppose further that the stat struct for each of those files is
>>>> anywhere between 1 and 2KB, and we go by an average of 1.5KB metadata
>>>> per file including some overhead; that would mean the average metadata
>>>> per file is about 1.5% of the average file size. So for my 100G of
>>>> data, the MDS would use about 1.5G of RAM for caching.
>>>>
>>>> If you scale that up to a filestore of, say, a petabyte, that means all
>>>> your Ceph MDSs would consume a relatively whopping 15TB in total RAM for
>>>> metadata caching, again assuming that *all* the data is actually used by
>>>> clients.
>>>>
>>>
>>> Why do you assume that ALL MDSs keep ALL metadata in memory? Isn't the
>>> whole point of directory fragmentation that they all keep a bit of the
>>> inodes in memory to spread the load?
>>
>> Directory subtree partitioning is considered neither stable nor
>> supported. Hence it's important to understand what a single active
>> MDS will hold.
>>
>
> Understood. So it's about sizing your MDS right now, not in the future
> when the subtree partitioning works :)

Correct.

> Isn't the memory consumption also influenced by mds_cache_size? That is
> the number of inodes the MDS will cache in memory.
>
> Something that is not in cache will be read from RADOS afaik, so there
> will be a limit to how much memory the MDS will consume.

I am acutely aware of that, but this is not about *limiting* MDS memory
consumption. It's about "if I wanted to make sure that all my metadata
fits in the cache, how much memory would I need for that?"
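To make that question concrete, here is the kind of back-of-the-envelope
arithmetic I have in mind, as a quick Python sketch. It only encodes the
guesses from my original post -- roughly 10,000 files per gigabyte of
used storage and about 1.5 KB of cached metadata per inode -- so treat
the constants as assumptions, not measured values:

# Rough MDS cache sizing, using the assumptions from my original post:
# ~10,000 files per GB of used storage and ~1.5 KB of cached metadata
# per inode. Both numbers are guesses, not measurements.

FILES_PER_GB = 10_000            # ~1 million files per 100 GB
METADATA_BYTES_PER_INODE = 1500  # ~1.5 KB per inode, including overhead

def mds_cache_estimate(used_storage_gb, fraction_in_use):
    # Returns (inodes that would need caching, estimated cache RAM in GB).
    inodes = int(used_storage_gb * FILES_PER_GB * fraction_in_use)
    ram_gb = inodes * METADATA_BYTES_PER_INODE / 10**9
    return inodes, ram_gb

# My 100G laptop-derived example, with everything "in regular use":
print(mds_cache_estimate(100, 1.0))         # (1000000, 1.5)      -> ~1.5 GB

# A petabyte filestore with 10% of the data in regular use:
print(mds_cache_estimate(1_000_000, 0.10))  # (1000000000, 1500.0) -> ~1.5 TB

If I understand your point about mds_cache_size correctly, the inode
count is also more or less the value you would have to set it to if you
wanted the MDS to be able to keep all of that metadata in its cache.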
Also, as a corollary to this discussion, I'm not sure if anyone has
actually run any stats on CephFS performance (read/write throughput,
latency, and IOPS) as a function of cache hit/miss ratio. In other words,
I don't know, and I'm not sure anyone knows, what the actual impact of
MDS cache misses is -- I am just assuming it would be quite significant,
otherwise I can't imagine why Sage would have come up with the idea of a
metadata-caching MDS in the first place. :)

>>>> Now of course it's entirely unrealistic that in a production system data
>>>> is actually ever used across the board, but are the above considerations
>>>> "close enough" for a rule-of-thumb approximation of MDS memory
>>>> footprint? As in,
>>>>
>>>> Total MDS RAM = (Total used storage) * (fraction of data in regular use)
>>>> * 0.015
>>>>
>>>> If CephFS users could use a rule of thumb like that, it would help them
>>>> answer questions like "given a filesystem of size X, will a single MDS
>>>> be enough to hold my metadata caches if Y is the maximum amount of
>>>> memory I can afford for budget Z".
>>>>
>>>> All thoughts and comments much appreciated. Thank you!
>>>>
>>>> Cheers,
>>>> Florian
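For that "filesystem of size X, memory budget Y" question, the same rule
of thumb can simply be turned around. Again, this is only a sketch built
on the untested 1.5% metadata-to-data ratio from above, not on any real
measurements:

# Inverse of the rule of thumb: given the RAM you can afford for a single
# active MDS, roughly how much "in regular use" data could it keep fully
# cached? The 1.5% ratio is the same untested assumption as above.

METADATA_TO_DATA_RATIO = 0.015   # ~1.5 KB of metadata per ~100 KB file

def cacheable_data_gb(mds_ram_gb, fraction_in_use):
    # Total used storage (in GB) whose metadata would still fit in cache.
    return mds_ram_gb / (METADATA_TO_DATA_RATIO * fraction_in_use)

# Example: 64 GB of MDS RAM, with 20% of the data touched regularly
print(cacheable_data_gb(64, 0.20))   # ~21333 GB, i.e. roughly 21 TB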