Thanks, Frank, for these detailed insights! I really appreciate your help.

On Mon, 16 Jan 2023 at 09:49, Frank Schilder <frans@xxxxxx> wrote:

> Hi, we are using ceph fs for data on an HPC cluster and, looking at your
> file size distribution, I doubt that MDS performance is a bottleneck. Your
> limiting factors are the super-small files and the IOP/s budget of the fs
> data pool. On our system, we moved these workloads to an all-flash beegfs.
> Ceph is very good for cold data, but this kind of hot data on HDD is a pain.
>
> Problems to consider: What is the replication factor of your data pool?
> The IOP/s budget of a pool is approximately 100*#OSDs/replication factor.
> If you use 3x replication, the budget is 1/3rd of what your drives can do
> aggregated. If you use something like 8+2 or 8+3, it's 1/10 or 1/11 of the
> aggregate. Having WAL/DB on SSD might give you 100% of the HDD IOP/s,
> because the bluestore admin IO goes to SSD.
>
> All in all, this is not much. Example: 100 HDD OSDs, aggregated raw
> IOP/s = 10000; an 8+2 EC pool gives 1000 aggregated. If you try to process
> 10000 small files, this will feel very slow compared to a crappy desktop
> SSD. On the other hand, large file IO (streaming of data) will fly at up
> to 1.2GB/s.
>
> The second issue is allocation amplification. Even with min_alloc_size=4K
> on HDD, with your file size distribution you will have dramatic allocation
> amplification. Take again EC 8+2. For this pool, the minimum allocation
> size is 8*4K=32K. All files from 0-32K will allocate 32K, all files from
> 32K to 64K will allocate 64K, and so on. A similar but less pessimistic
> calculation applies to replicated pools.
>
> If you have mostly small-file IO and only the occasional large file, you
> might consider having 2 data pools on HDD: one 3x-replicated pool as the
> default and an 8+2 or 8+3 EC pool for large files. You can give every user
> a folder on the large-file pool (say "archives" under home). For data
> folders, users should tell you what they are for.
>
> If you use a replicated HDD pool for all the small files up to 64K, you
> will probably get good performance out of it. Still, solid state storage
> would be much preferable. When I look at your small file count (0-64K),
> 2 million x 64K x 3 is about 400G. If you increase by a factor of 10, we
> are talking about 4T of solid state storage. That is really not much and
> should be affordable if money is not the limit. I would consider a small
> SSD pool for all the small files and use the HDDs for the large ones and
> possibly compressed archives.
>
> You can force users to use the right place by setting appropriate quotas
> on the sub-dirs on the different pools. You could offer a hierarchy of
> rep-HDD as default, EC-HDD on "archives" and rep-SSD on demand.
>
> For MDS config, my experience is that human users do not actually allocate
> large amounts of meta-data. We have 8 active MDSes with a 24G memory limit
> each and I don't see human users using all that cache. What does happen,
> though, is that backup daemons allocate a lot but obviously don't reuse it
> (it's a read-once workload). Our MDSes hold 4M inodes and 4M dentries in
> 12G (I have set the mid-point to 0.5) and that seems more than enough.
> What is really important is to pin directories. We use manual pinning over
> all ranks and it works like a charm. If you don't pin, the MDSes will not
> work very well. I had a thread on that 2-3 months ago.
>
> Hope that is helpful.
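
Just to make sure I understand the arithmetic: below is a quick plain-Python
sketch of the IOP/s budget and allocation amplification you describe. It only
uses the rule-of-thumb numbers from this mail (about 100 IOP/s per HDD OSD,
min_alloc_size=4K, an 8+2 EC pool); nothing in it is specific to our cluster
yet, so corrections are welcome if I misread something.

    #!/usr/bin/env python3
    # Back-of-the-envelope sketch of the two calculations from this thread.
    # Assumptions: ~100 IOP/s per HDD OSD, BlueStore min_alloc_size = 4K on
    # HDD, k=8 data chunks for an 8+2 EC pool.

    def iops_budget(num_osds: int, replication_factor: float,
                    iops_per_osd: int = 100) -> float:
        """Approximate IOP/s budget of a pool: 100 * #OSDs / replication
        factor. For EC k+m the factor in this sense is k+m (every write
        touches k+m shards), e.g. 10 for an 8+2 pool."""
        return iops_per_osd * num_osds / replication_factor

    def allocated_bytes(file_size: int, k: int = 8,
                        min_alloc: int = 4 * 1024) -> int:
        """Space one file occupies on an EC k+m pool (data shards only;
        parity adds another m * min_alloc_size of raw space per stripe).
        The allocation unit is k * min_alloc_size and every file is
        rounded up to a multiple of it."""
        unit = k * min_alloc                      # 8 * 4K = 32K for 8+2
        stripes = max(1, -(-file_size // unit))   # ceil division, min. 1 unit
        return stripes * unit

    if __name__ == "__main__":
        # Frank's example: 100 HDD OSDs, ~10000 raw IOP/s aggregated.
        print(f"8+2 EC pool budget : {iops_budget(100, 10):.0f} IOP/s")  # ~1000
        print(f"3x replicated pool : {iops_budget(100, 3):.0f} IOP/s")   # ~3333

        # Allocation amplification for a few small file sizes on 8+2 EC:
        for size in (1 * 1024, 16 * 1024, 40 * 1024):
            alloc = allocated_bytes(size)
            print(f"{size // 1024:>3}K file -> {alloc // 1024}K allocated "
                  f"({alloc / size:.1f}x amplification)")

Feeding the file-size histogram from further down in the quoted thread into
allocated_bytes() makes it obvious that the sub-32K buckets suffer the most
from this, which matches your point about keeping them on a replicated
(ideally SSD) pool.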
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: E Taka <0etaka0@xxxxxxxxx>
> Sent: 15 January 2023 17:45:25
> To: Darren Soothill
> Cc: ceph-users@xxxxxxx
> Subject: Re: Useful MDS configuration for heavily used Cephfs
>
> Thanks for the detailed inquiry. We use HDD with WAL/DB on SSD. The Ceph
> servers have lots of RAM and many CPU cores. We are looking for a general
> purpose approach – running reasonably well in most cases is better than a
> perfect solution for one use case.
>
> This is the existing file size distribution. For the future, let's
> multiply the number of files by 10 (that's more than 100 TB, I know):
>
> 1k:   532069
> 2k:    54458
> 4k:    36613
> 8k:    37139
> 16k:  726302
> 32k:  286573
> 64k:   55841
> 128k:  30510
> 256k:  37386
> 512k:  48462
> 1M:     9461
> 2M:     4707
> 4M:     9233
> 8M:     4816
> 16M:    3059
> 32M:    2268
> 64M:    4314
> 128M:  17017
> 256M:   7263
> 512M:   1917
> 1G:     1561
> 2G:     1342
> 4G:      670
> 8G:      493
> 16G:     238
> 32G:     121
> 64G:      15
> 128G:     10
> 256G:      5
> 512G:      4
> 1T:        3
> 2T:        2
>
> There will be home directories for a few hundred users, and dozens of data
> dirs with thousands of files between 10-100 kB, which will be processed
> one by one or in parallel. In this process some small files are written,
> but the main usage is reading many rather small files. (A database would
> be better suited for this, but I have no influence on that.) I would
> prefer not to create separate pools on SSD, but to use the RAM (some spare
> servers with 128 GB - 512 GB) for metadata caching.
>
> Thank you for your encouragement!
> Erich
>
> On Sun, 15 Jan 2023 at 16:29, Darren Soothill <darren.soothill@xxxxxxxx> wrote:
>
> > There are a few details missing to allow people to provide you with
> > advice.
> >
> > How many files are you expecting to be in this 100TB of capacity?
> > This really dictates what you are looking for. It could be full of 4K
> > files, which is a very different proposition to it being full of 100M
> > files.
> >
> > What sort of media is this file system made up of?
> > If you have 10s of millions of files on HDD, then you are going to want
> > a separate metadata pool for CephFS on some much faster storage.
> >
> > What sort of use case are you expecting for this storage?
> > You say it is heavily used, but what does that really mean?
> > Do you have 1000 HPC nodes all trying to access millions of 4K files?
> > Or are you using it as a more general purpose file system for, say, home
> > directories?
> >
> >
> > Darren Soothill
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io/
> >
> > croit GmbH, Freseniusstr. 31h, 81247 Munich
> > CEO: Martin Verges - VAT-ID: DE310638492
> > Com. register: Amtsgericht Munich HRB 231263
> > Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
> >
> >
> > On 15 Jan 2023, at 09:26, E Taka <0etaka0@xxxxxxxxx> wrote:
> >
> > Ceph 17.2.5:
> >
> > Hi,
> >
> > I'm looking for a reasonable and useful MDS configuration for a heavily
> > used CephFS (~100TB). "Heavily used" refers to the future; we have no
> > experience with it yet. For example, does it make a difference to
> > increase mds_cache_memory_limit or the number of MDS instances?
> >
> > The hardware does not set any limits, I just want to know where the
> > default values can be usefully optimized before problems occur.
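
Coming back to Frank's pool-layout suggestion (replicated default data pool,
an EC pool behind an "archives" directory, quotas to steer users, and manual
subtree pinning): for the archive, this is roughly how I intend to wire it up.
It is only a sketch, not a tested recipe. The mount point, pool name, quota
sizes and the number of active MDS ranks are my own placeholders; the xattr
names are the standard CephFS virtual xattrs, which can just as well be set
with setfattr from the shell; and the extra data pool has to be attached to
the file system first (ceph fs add_data_pool).

    #!/usr/bin/env python3
    # Sketch: steer data placement and MDS load on a mounted CephFS client
    # via the CephFS virtual xattrs. Pool name, paths, quota sizes and pin
    # ranks are placeholders for our setup, not recommendations.
    import os

    MOUNT = "/mnt/cephfs"            # assumed client mount point
    EC_POOL = "cephfs_data_ec82"     # assumed 8+2 EC data pool, already
                                     # added via 'ceph fs add_data_pool'

    def set_vxattr(path: str, name: str, value: str) -> None:
        """Same effect as: setfattr -n <name> -v <value> <path>."""
        os.setxattr(path, name, value.encode())

    def setup_user(home: str, mds_rank: int) -> None:
        archives = os.path.join(home, "archives")
        os.makedirs(archives, exist_ok=True)

        # archives/ goes to the EC pool; newly created files inherit the layout.
        set_vxattr(archives, "ceph.dir.layout.pool", EC_POOL)

        # CephFS quotas apply to the whole subtree, so the cap on the home
        # directory is the user's total, and the cap on archives/ limits how
        # much of that total may land on the EC pool.
        set_vxattr(home, "ceph.quota.max_bytes", str(2 * 1024**4))        # 2T total
        set_vxattr(archives, "ceph.quota.max_bytes", str(1536 * 1024**3)) # 1.5T on EC

        # Manual subtree pinning: tie this home directory to one MDS rank.
        set_vxattr(home, "ceph.dir.pin", str(mds_rank))

    if __name__ == "__main__":
        # Example: distribute user homes round-robin over 4 active MDS ranks.
        users = sorted(os.listdir(os.path.join(MOUNT, "home")))
        for i, user in enumerate(users):
            setup_user(os.path.join(MOUNT, "home", user), mds_rank=i % 4)

The pinning part follows Frank's advice to pin manually across all ranks
rather than relying on the default balancer.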
> >
> > Thanks,
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx