Re: How many MDS servers

Robert LeBlanc <robert@xxxxxxxxxxxxx> · Thu, 7 May 2020 09:55:08 -0700

On Thu, May 7, 2020 at 9:41 AM Robert LeBlanc <robert@xxxxxxxxxxxxx> wrote:

> On Thu, May 7, 2020 at 6:22 AM Yan, Zheng <ukernel@xxxxxxxxx> wrote:
>
>> On Thu, May 7, 2020 at 1:27 AM Patrick Donnelly <pdonnell@xxxxxxxxxx> wrote:
>> >
>> > Hello Robert,
>> >
>> > On Mon, Mar 9, 2020 at 7:55 PM Robert Ruge <robert.ruge@xxxxxxxxxxxxx> wrote:
>> > > For a 1.1PB raw cephfs system currently storing 191TB of data and 390
>> > > million objects (mostly small Python, ML training files etc.) how many
>> > > MDS servers should I be running?
>> > >
>> > > System is Nautilus 14.2.8.
>> > >
>> > >
>> > >
>> > > I ask because up to now I have run one MDS with one standby-replay,
>> > > and occasionally it blows up with large memory consumption, 60GB+,
>> > > even though I have mds_cache_memory_limit = 32G (it was 16G until
>> > > recently). It of course tries to restart on another MDS node, fails
>> > > again, and after several attempts usually comes back up. Today I
>> > > increased to two active MDSs, but the question is: what is the
>> > > optimal number for a pretty active system? The single MDS seemed to
>> > > regularly run around 1400 req/s, and I often get up to six clients
>> > > failing to respond to cache pressure.
>> >
>> > Ideally, the only reason you should add more active MDS (increase
>> > max_mds) is because you want to increase request throughput.
>> >
>> > 60GB RSS is not completely unexpected. A 32GB cache size would use
>> > approximately 48GB (150%) RSS in a steady state situation. You may
>> > have hit some kind of bug as others have reported which is causing the
>> > cache size / anonymous memory to continually increase. You will need
>> > to post more information about the client type/version, cache usage,
>> > perf dumps, and workload to help diagnose.
>> >
>> >
>> https://github.com/ceph/ceph/pull/34571 may help if "ceph daemon mds.a
>> dump_mempools" shows buffer_anon uses lots of memory.
>>
>
> We struggle with cache management as well and I just thought it was due to
> our really old kernel clients (I'm sure that doesn't help). I'll keep an
> eye on the buffer_anon. Looks like that is dumped in perf dump so I can go
> back and look through graphite data for it.
>
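
For anyone wanting to check the same thing, here is a minimal sketch of
pulling the buffer_anon numbers out of "ceph daemon mds.a dump_mempools"
output. It assumes the JSON nests the pools under "mempool" -> "by_pool"
with "items" and "bytes" per pool, and falls back to a flat layout if that
wrapper is missing. The layout, the function name and the sample values are
assumptions for illustration, so check them against what your MDS prints.

```python
import json

def buffer_anon_usage(dump):
    """Return (items, bytes) for the buffer_anon pool from parsed dump_mempools JSON."""
    # Assumed layout: {"mempool": {"by_pool": {"buffer_anon": {...}}}}.
    # Fall back to a flat {"buffer_anon": {...}} layout if the wrapper is absent.
    pools = dump.get("mempool", {}).get("by_pool", dump)
    pool = pools["buffer_anon"]
    return pool["items"], pool["bytes"]

# Made-up sample, not real output from any cluster.
SAMPLE = '{"mempool": {"by_pool": {"buffer_anon": {"items": 1200, "bytes": 2147483648}}}}'

items, nbytes = buffer_anon_usage(json.loads(SAMPLE))
print(f"buffer_anon: {items} items, {nbytes / 2**30:.2f} GiB")
```

In practice the command output could be redirected to a file and loaded
with json.load() instead of using the inline sample.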

So, when we have trouble with our MDS, there are spikes in both
buffer_anon_bytes (sometimes GiBs) and buffer_anon_items (sometimes
millions). At other times bytes are high (~2GiB) while items are low
(~1k), and we don't seem to have big issues then. Only when the items
value goes high do we seem to get into the danger zone.
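
As a rough illustration of that pattern, a check over the two counters
could weight the item count more heavily than the byte count. This is only
a sketch: the function name, the labels and both thresholds (one million
items, 1 GiB) are hypothetical values picked to match the numbers above,
not anything Ceph defines, and the samples are made up.

```python
GIB = 2**30

def classify(items, nbytes, items_limit=1_000_000, bytes_limit=GIB):
    """Label one (buffer_anon_items, buffer_anon_bytes) sample."""
    # A high item count is what lines up with the MDS trouble we see.
    if items >= items_limit:
        return "danger"
    # High bytes with few items has not caused big issues so far.
    if nbytes >= bytes_limit:
        return "watch"
    return "ok"

# Made-up samples: (items, bytes)
samples = [(1_000, 2 * GIB), (3_000_000, 4 * GIB), (500, 10_000_000)]
for items, nbytes in samples:
    print(items, nbytes, classify(items, nbytes))
```

Both limits would need tuning against real graphite history before
alerting on them.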

Will this PR apply cleanly to Nautilus?

Thanks,
----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx