On 11/28/2014 03:36 PM, Florian Haas wrote:
> On Fri, Nov 28, 2014 at 3:29 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>> On 11/28/2014 03:22 PM, Florian Haas wrote:
>>> On Fri, Nov 28, 2014 at 3:14 PM, Wido den Hollander <wido@xxxxxxxx> wrote:
>>>> On 11/28/2014 01:04 PM, Florian Haas wrote:
>>>>> Hi everyone,
>>>>>
>>>>> I'd like to come back to a discussion from 2012 (thread at
>>>>> http://marc.info/?l=ceph-devel&m=134808745719233) to estimate the
>>>>> expected MDS memory consumption from file metadata caching. I am
>>>>> certain the following is full of untested assumptions, some of which
>>>>> are probably inaccurate, so please shoot those down as needed.
>>>>>
>>>>> I did an entirely unscientific study of a real data set (my laptop,
>>>>> in case you care to know) which currently holds about 70G worth of
>>>>> data in a huge variety of file sizes across several file systems,
>>>>> and currently lists about 944,000 inodes as being in use. Going
>>>>> purely by order of magnitude and doing a wild approximation, I'll
>>>>> assume a ratio of 1 million files per 100G, or 10,000 files per
>>>>> gigabyte, which means an average file size of about 100KB -- again,
>>>>> approximating and forgetting about the difference between 10^3 and
>>>>> 2^10, and using a stupid arithmetic mean rather than a median, which
>>>>> would probably be much more useful.
>>>>>
>>>>> If I were to assume that all those files were in CephFS, and that
>>>>> they were all somehow regularly in use (or at least one file in each
>>>>> directory), then the Ceph MDS would have to keep the metadata of all
>>>>> those files in cache. Suppose further that the stat struct for each
>>>>> of those files is anywhere between 1 and 2KB and we go by an average
>>>>> of 1.5KB of metadata per file including some overhead; that would
>>>>> mean the metadata per file is about 1.5% of the average file size.
>>>>> So for my 100G of data, the MDS would use about 1.5G of RAM for
>>>>> caching.
>>>>>
>>>>> If you scale that up to a filestore of, say, a petabyte, all your
>>>>> Ceph MDSs together would consume a relatively whopping 15TB of RAM
>>>>> for metadata caching, again assuming that *all* the data is actually
>>>>> used by clients.
>>>>>
>>>>
>>>> Why do you assume that ALL MDSs keep ALL metadata in memory? Isn't
>>>> the whole point of directory fragmentation that they each keep a part
>>>> of the inodes in memory to spread the load?
>>>
>>> Directory subtree partitioning is considered neither stable nor
>>> supported, which is why it's important to understand what a single
>>> active MDS will hold.
>>>
>>
>> Understood. So it's about sizing your MDS right now, not in the future
>> when the subtree partitioning works :)
>
> Correct.
>
>> Isn't the memory consumption also influenced by mds_cache_size? That
>> is the number of inodes the MDS will cache in memory.
>>
>> Something that is not in cache will be read from RADOS afaik, so there
>> will be a limit to how much memory the MDS will consume.
>
> I am acutely aware of that, but this is not about *limiting* MDS
> memory consumption. It's about "if I wanted to make sure that all my
> metadata fits in the cache, how much memory would I need for that?"
>

Understood. I had hoped someone else with more knowledge about this
would chime in here. But let's draw an analogy with ZFS, for example:
there you size your (L2)ARC based on your hot data. Why would you want
all CephFS metadata in memory? With any filesystem that would be a
problem.

We do, however, need a good rule of thumb for how much memory is used
per inode.
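
Purely as a sketch, and taking the numbers in this thread at face value
(an assumed ~100KB average file size and an assumed ~1.5KB of metadata
per cached inode, neither of which is a measured Ceph figure), the
arithmetic behind the 1.5% figure (the 0.015 factor in the proposed
rule of thumb below) would look something like this in Python:

#!/usr/bin/env python3
# Back-of-the-envelope MDS cache sizing. Every constant here is an
# assumption taken from this thread, not a measured Ceph figure:
#   - average file size:         ~100KB
#   - metadata per cached inode: ~1.5KB (the 1-2KB guess, averaged)
# 1.5KB / 100KB is where the 0.015 ("1.5%") factor comes from.

AVG_FILE_SIZE = 100 * 1024     # bytes, assumed
META_PER_INODE = 1.5 * 1024    # bytes, assumed

GiB = 1024 ** 3
TiB = 1024 ** 4
PiB = 1024 ** 5

def mds_cache_estimate(total_used_bytes, hot_fraction):
    """Return (hot inode count, estimated cache RAM in bytes).

    The inode count is roughly what you would feed into mds_cache_size
    (which, as noted above, counts inodes); the RAM figure is what the
    0.015 rule of thumb predicts.
    """
    hot_inodes = total_used_bytes * hot_fraction / AVG_FILE_SIZE
    return hot_inodes, hot_inodes * META_PER_INODE

# ~100G with everything hot: the ~1.5G from the laptop example above.
inodes, ram = mds_cache_estimate(100 * GiB, 1.0)
print("100G, 100%% hot: %.0f inodes, %.1f GiB" % (inodes, ram / GiB))

# A petabyte with only 10% of the data in regular use.
inodes, ram = mds_cache_estimate(1 * PiB, 0.1)
print("1P,   10%% hot: %.0f inodes, %.1f TiB" % (inodes, ram / TiB))

Whether 1.5KB per cached inode is anywhere near reality is exactly the
open question; if the real figure turned out to be, say, 4KB, the
factor scales linearly and every estimate above roughly triples.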

> Also, as a corollary to this discussion, I'm not sure if anyone has
> actually run any stats on CephFS performance (read/write throughput,
> latency, and IOPS) as a function of cache hit/miss ratio. In other
> words I don't know, and I'm not sure anyone knows, what the actual
> impact of MDS cache misses is -- I am just assuming it would be quite
> significant, otherwise I can't imagine why Sage would have come up
> with the idea of a metadata-caching MDS in the first place. :)
>
>>>>> Now of course it's entirely unrealistic that in a production system
>>>>> the data is ever actually used across the board, but are the above
>>>>> considerations "close enough" for a rule-of-thumb approximation of
>>>>> the MDS memory footprint? As in,
>>>>>
>>>>> Total MDS RAM = (Total used storage) * (fraction of data in regular
>>>>> use) * 0.015
>>>>>
>>>>> If CephFS users could use a rule of thumb like that, it would help
>>>>> them answer questions like "given a filesystem of size X, will a
>>>>> single MDS be enough to hold my metadata cache if Y is the maximum
>>>>> amount of memory I can afford for budget Z?"
>>>>>
>>>>> All thoughts and comments much appreciated. Thank you!
>>>>>
>>>>> Cheers,
>>>>> Florian

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on