Re: Useful MDS configuration for heavily used Cephfs

Hi, we are using CephFS for data on an HPC cluster and, looking at your file size distribution, I doubt that MDS performance is the bottleneck. Your limiting factors are the super-small files and the IOP/s budget of the fs data pool. On our system, we moved these workloads to an all-flash BeeGFS. Ceph is very good for cold data, but this kind of hot data on HDD is a pain.

Problems to consider: What is the replication factor of your data pool? The IOP/s budget of a pool is approximately 100 * #OSDs / replication factor. If you use 3x replication, the budget is 1/3rd of what your drives can do in aggregate. If you use something like 8+2 or 8+3 EC, it's 1/10th or 1/11th of the aggregate. Having WAL/DB on SSD might give you 100% of the HDD IOP/s for data, because the BlueStore admin IO goes to the SSD.

All in all, this is not much. Example: 100 HDD OSDs, aggregate raw IOP/s = 10000; an 8+2 EC pool gives about 1000 aggregated. If you try to process 10000 small files, this will feel very slow compared to even a crappy desktop SSD. On the other hand, large-file IO (streaming of data) will fly at up to 1.2 GB/s.
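
If it helps, the rule of thumb works out roughly like this in Python; the ~100 random IOP/s per HDD OSD and the pool sizes are just the assumed numbers from the example above, not measurements:

# Rough IOP/s budget of a CephFS data pool, per the rule of thumb above.
def pool_iops_budget(num_osds, iops_per_osd, write_cost):
    # write_cost: replication factor (e.g. 3) or k+m for an EC pool (e.g. 10 for 8+2)
    return num_osds * iops_per_osd / write_cost

print(pool_iops_budget(100, 100, 1))    # 10000 IOP/s raw (100 HDD OSDs)
print(pool_iops_budget(100, 100, 10))   # ~1000 IOP/s on an 8+2 EC pool
print(pool_iops_budget(100, 100, 3))    # ~3333 IOP/s on a 3x replicated pool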

The second issue is allocation amplification. Even with min_alloc_size=4K on HDD, your file size distribution will give you dramatic allocation amplification. Take EC 8+2 again: for this pool, the minimum allocation size is 8*4K=32K. All files from 0 to 32K allocate 32K, all files from 32K to 64K allocate 64K, and so on. A similar but less pessimistic calculation applies to replicated pools.
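
A minimal sketch of that calculation, assuming the data part of a file is striped over the k data chunks and each chunk allocates in multiples of min_alloc_size (the numbers are purely illustrative):

import math

def allocated_bytes(file_size, k=8, min_alloc=4096):
    # Smallest on-disk data allocation (parity not counted) on an EC k+m pool.
    stripe_alloc = k * min_alloc                 # 8 * 4K = 32K for EC 8+2
    return max(1, math.ceil(file_size / stripe_alloc)) * stripe_alloc

for size in (1_000, 20_000, 40_000, 70_000):
    alloc = allocated_bytes(size)
    print(f"{size:>7} B file -> {alloc:>6} B allocated ({alloc / size:.1f}x)")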

If you have mostly small-file IO and only the occasional large file, you might consider having two data pools on HDD: one 3x-replicated pool as the default and an 8+2 or 8+3 EC pool for large files. You can give every user a folder on the large-file pool (say, "archives" under their home). For additional data folders, users should tell you what they will be used for.
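
Pointing such a folder at another data pool is just a virtual xattr (ceph.dir.layout.pool), provided the pool has been added to the file system first. A small sketch; the path and pool name are made-up examples:

import os

archive_dir = "/cephfs/home/alice/archives"      # hypothetical mount point and user
os.makedirs(archive_dir, exist_ok=True)
# New files created under this directory will go to the (hypothetical) EC pool.
os.setxattr(archive_dir, "ceph.dir.layout.pool", b"cephfs_data_ec82")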

If you use a replicated HDD pool for all the small files up to 64K, you will probably get good performance out of it. Still, solid-state storage would be much preferable. Looking at your small-file count (0-64K), 2 million x 64K x 3 is about 400G. If you increase that by a factor of 10, we are talking about 4T of solid-state storage. That is really not much and should be affordable if money is not the limit. I would consider a small SSD pool for all the small files and use the HDDs for the large ones and possibly compressed archives.
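
The back-of-the-envelope numbers, in case anyone wants to redo them with their own counts (the file count and worst-case size are just the assumptions from above):

small_files = 2_000_000           # roughly all files <= 64K in the list below
worst_case  = 64 * 1024           # assume every one of them allocates 64K
replication = 3

today  = small_files * worst_case * replication   # ~0.4 TB ("about 400G")
future = 10 * today                               # 10x growth -> ~4 TB
print(f"{today / 1e12:.1f} TB now, {future / 1e12:.1f} TB at 10x")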

You can steer users towards the right place by setting appropriate quotas on the sub-dirs that sit on different pools. You could offer a hierarchy of rep-HDD as the default, EC-HDD on "archives", and rep-SSD on demand.
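
Quotas are also just virtual xattrs on the directories. A small sketch with arbitrary example limits and paths:

import os

def set_quota(path, max_bytes=None, max_files=None):
    # CephFS quota xattrs take the limit as a decimal string; 0 removes the quota.
    if max_bytes is not None:
        os.setxattr(path, "ceph.quota.max_bytes", str(max_bytes).encode())
    if max_files is not None:
        os.setxattr(path, "ceph.quota.max_files", str(max_files).encode())

set_quota("/cephfs/home/alice",          max_bytes=50 * 10**9)   # tight quota on the rep-HDD default
set_quota("/cephfs/home/alice/archives", max_bytes=2 * 10**12)   # roomy quota on the EC large-file pool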

For the MDS config, my experience is that human users do not actually allocate large amounts of metadata. We have 8 active MDSes with a 24G memory limit each, and I don't see human users using all that cache. What does happen, though, is that backup daemons allocate a lot of cache but obviously never reuse it (it's a read-once workload). Our MDSes hold 4M inodes and 4M dentries in 12G (I have set the cache mid-point to 0.5), and that seems more than enough. What is really important is to pin directories. We use manual pinning over all ranks and it works like a charm. If you don't pin, the MDSes will not work very well. I had a thread on that 2-3 months ago.
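
For anyone who has not done this before, manual pinning is just setting the ceph.dir.pin xattr per directory. A sketch of one possible scheme that spreads top-level home directories round-robin over the active ranks; the mount point and rank count are assumptions, adapt to your setup:

import os

MOUNT = "/cephfs"        # hypothetical CephFS mount
ACTIVE_RANKS = 8         # number of active MDS daemons (max_mds)

top_dirs = sorted(e.path for e in os.scandir(os.path.join(MOUNT, "home")) if e.is_dir())
for i, d in enumerate(top_dirs):
    os.setxattr(d, "ceph.dir.pin", str(i % ACTIVE_RANKS).encode())   # pin subtree to one rank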

Hope that is helpful.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: E Taka <0etaka0@xxxxxxxxx>
Sent: 15 January 2023 17:45:25
To: Darren Soothill
Cc: ceph-users@xxxxxxx
Subject:  Re: Useful MDS configuration for heavily used Cephfs

Thanks for the detailed inquiry. We use HDD with WAL/DB on SSD. The Ceph
servers have lots of RAM and many CPU cores. We are looking for a
general-purpose approach – running reasonably well in most cases is
better than a perfect solution for one use case.

This is the existing file size distribution. For the future, let's multiply
the number of files by 10 (that's more than 100 TB, I know):

  1k: 532069
  2k:  54458
  4k:  36613
  8k:  37139
 16k: 726302
 32k: 286573
 64k:  55841
128k:  30510
256k:  37386
512k:  48462
  1M:   9461
  2M:   4707
  4M:   9233
  8M:   4816
 16M:   3059
 32M:   2268
 64M:   4314
128M:  17017
256M:   7263
512M:   1917
  1G:   1561
  2G:   1342
  4G:    670
  8G:    493
 16G:    238
 32G:    121
 64G:     15
128G:     10
256G:      5
512G:      4
  1T:      3
  2T:      2

There will be home directories for a few hundred users, and dozens of data
dirs with thousands of files between 10 and 100 kB, which will be processed
one by one or in parallel. In this process some small files are written, but
the main usage is reading many files of rather small size.
(A database would be better suited for this, but I have no influence on
that.) I would prefer not to create separate pools on SSD, but to use the
RAM (some spare servers with 128-512 GB) for metadata caching.

Thank you for your encouragement!
Erich

On Sun, 15 Jan 2023 at 16:29, Darren Soothill <
darren.soothill@xxxxxxxx> wrote:

> There are a few details missing to allow people to provide you with advice.
>
> How many files are you expecting to be in this 100TB of capacity?
> This really dictates what you are looking for. It could be full of 4K
> files, which is a very different proposition to it being full of 100M files.
>
> What sort of media is this file system made up of?
> If you have 10s of millions of files on HDD, then you are going to want a
> separate metadata pool for CephFS on some much faster storage.
>
> What is the sort of use case that you are expecting for this storage?
> You say it is heavily used but what does that really mean?
> Do you have 1000 HPC nodes all trying to access millions of 4K files?
> Or are you using it as a more general-purpose file system for, say, home
> directories?
>
>
>
> Darren Soothill
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io/
>
> croit GmbH, Freseniusstr. 31h, 81247 Munich
> CEO: Martin Verges - VAT-ID: DE310638492
> Com. register: Amtsgericht Munich HRB 231263
> Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
>
>
>
> On 15 Jan 2023, at 09:26, E Taka <0etaka0@xxxxxxxxx> wrote:
>
> Ceph 17.2.5:
>
> Hi,
>
> I'm looking for a reasonable and useful MDS configuration for a CephFS
> (~100 TB) that will be heavily used in the future (no experience with it
> so far). For example, does it make a difference to increase
> mds_cache_memory_limit or the number of MDS instances?
>
> The hardware does not set any limits; I just want to know where the default
> values can usefully be optimized before problems occur.
>
> Thanks,
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx
>
>
>
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



