Thanks, Frank, for these detailed insights! I really appreciate your help.

On Mon, 16 Jan 2023 at 09:49, Frank Schilder <frans@xxxxxx> wrote:

> Hi, we are using ceph fs for data on an HPC cluster and, looking at your
> file size distribution, I doubt that MDS performance is a bottleneck. Your
> limiting factors are the super-small files and the IOP/s budget of the fs
> data pool. On our system, we moved these workloads to an all-flash beegfs.
> Ceph is very good for cold data, but this kind of hot data on HDD is a pain.
>
> Problems to consider: What is the replication factor of your data pool?
> The IOP/s budget of a pool is approximately 100*#OSDs/replication factor.
> If you use 3x replication, the budget is 1/3rd of what your drives can do
> aggregated. If you use something like 8+2 or 8+3, it's 1/10 or 1/11 of the
> aggregate. Having WAL/DB on SSD might give you 100% of the HDD IOP/s,
> because the bluestore admin IO goes to SSD.
>
> All in all, this is not much. Example: 100 HDD OSDs, aggregated raw
> IOP/s = 10000; an 8+2 EC pool gives 1000 aggregated. If you try to process
> 10000 small files, this will feel very slow compared to a crappy desktop
> SSD. On the other hand, large file IO (streaming of data) will fly at up
> to 1.2GB/s.
>
> The second issue is allocation amplification. Even with min_alloc_size=4K
> on HDD, with your file size distribution you will have dramatic allocation
> amplification. Take again EC 8+2. For this pool, the minimum allocation
> size is 8*4K=32K. All files from 0-32K will allocate 32K, all files from
> 32K to 64K will allocate 64K, and so on. A similar but less pessimistic
> calculation applies to replicated pools.
>
> If you have mostly small-file IO and only the occasional large file, you
> might consider having 2 data pools on HDD: one 3x-replicated pool as the
> default and an 8+2 or 8+3 EC pool for large files. You can give every user
> a folder on the large-file pool (say "archives" under home). For data
> folders, users should tell you what they are for.
>
> If you use a replicated HDD pool for all the small files up to 64K, you
> will probably get good performance out of it. Still, solid state storage
> would be much preferable. When I look at your small file count (0-64K),
> 2 million x 64K x 3 is about 400G. If you increase by a factor of 10, we
> are talking about 4T of solid state storage. That is really not much and
> should be affordable if money is not the limit. I would consider a small
> SSD pool for all the small files and use the HDDs for the large ones and
> possibly compressed archives.
>
> You can force users to use the right place by setting appropriate quotas
> on the sub-dirs on the different pools. You could offer a hierarchy of
> rep-HDD as default, EC-HDD on "archives" and rep-SSD on demand.
>
> For MDS config, my experience is that human users do not actually allocate
> large amounts of meta-data. We have 8 active MDSes with a 24G memory limit
> each and I don't see human users using all that cache. What does happen,
> though, is that backup daemons allocate a lot but obviously don't reuse it
> (it's a read-once workload). Our MDSes hold 4M inodes and 4M dentries in
> 12G (I have set the mid-point to 0.5) and that seems more than enough.
> What is really important is to pin directories. We use manual pinning over
> all ranks and it works like a charm. If you don't pin, the MDSes will not
> work very well. I had a thread on that 2-3 months ago.
>
> Hope that is helpful.
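
Just to make sure I understand the arithmetic: below is a quick plain-Python
sketch of the IOP/s budget and allocation amplification you describe. It only
uses the rule-of-thumb numbers from this mail (about 100 IOP/s per HDD OSD,
min_alloc_size=4K, an 8+2 EC pool); nothing in it is specific to our cluster
yet, so corrections are welcome if I misread something.

    #!/usr/bin/env python3
    # Back-of-the-envelope sketch of the two calculations from this thread.
    # Assumptions: ~100 IOP/s per HDD OSD, BlueStore min_alloc_size = 4K on
    # HDD, k=8 data chunks for an 8+2 EC pool.

    def iops_budget(num_osds: int, replication_factor: float,
                    iops_per_osd: int = 100) -> float:
        """Approximate IOP/s budget of a pool: 100 * #OSDs / replication
        factor. For EC k+m the factor in this sense is k+m (every write
        touches k+m shards), e.g. 10 for an 8+2 pool."""
        return iops_per_osd * num_osds / replication_factor

    def allocated_bytes(file_size: int, k: int = 8,
                        min_alloc: int = 4 * 1024) -> int:
        """Space one file occupies on an EC k+m pool (data shards only;
        parity adds another m * min_alloc_size of raw space per stripe).
        The allocation unit is k * min_alloc_size and every file is
        rounded up to a multiple of it."""
        unit = k * min_alloc                      # 8 * 4K = 32K for 8+2
        stripes = max(1, -(-file_size // unit))   # ceil division, min. 1 unit
        return stripes * unit

    if __name__ == "__main__":
        # Frank's example: 100 HDD OSDs, ~10000 raw IOP/s aggregated.
        print(f"8+2 EC pool budget : {iops_budget(100, 10):.0f} IOP/s")  # ~1000
        print(f"3x replicated pool : {iops_budget(100, 3):.0f} IOP/s")   # ~3333

        # Allocation amplification for a few small file sizes on 8+2 EC:
        for size in (1 * 1024, 16 * 1024, 40 * 1024):
            alloc = allocated_bytes(size)
            print(f"{size // 1024:>3}K file -> {alloc // 1024}K allocated "
                  f"({alloc / size:.1f}x amplification)")

Feeding the file-size histogram from further down in the quoted thread into
allocated_bytes() makes it obvious that the sub-32K buckets suffer the most
from this, which matches your point about keeping them on a replicated
(ideally SSD) pool.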
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: E Taka <0etaka0@xxxxxxxxx>
> Sent: 15 January 2023 17:45:25
> To: Darren Soothill
> Cc: ceph-users@xxxxxxx
> Subject: Re: Useful MDS configuration for heavily used Cephfs
>
> Thanks for the detailed inquiry. We use HDD with WAL/DB on SSD. The Ceph
> servers have lots of RAM and many CPU cores. We are looking for a general
> purpose approach – running reasonably well in most cases is better than a
> perfect solution for one use case.
>
> This is the existing file size distribution. For the future, let's
> multiply the number of files by 10 (that's more than 100 TB, I know):
>
> 1k:   532069
> 2k:    54458
> 4k:    36613
> 8k:    37139
> 16k:  726302
> 32k:  286573
> 64k:   55841
> 128k:  30510
> 256k:  37386
> 512k:  48462
> 1M:     9461
> 2M:     4707
> 4M:     9233
> 8M:     4816
> 16M:    3059
> 32M:    2268
> 64M:    4314
> 128M:  17017
> 256M:   7263
> 512M:   1917
> 1G:     1561
> 2G:     1342
> 4G:      670
> 8G:      493
> 16G:     238
> 32G:     121
> 64G:      15
> 128G:     10
> 256G:      5
> 512G:      4
> 1T:        3
> 2T:        2
>
> There will be home directories for a few hundred users, and dozens of data
> dirs with thousands of files between 10-100 kB, which will be processed
> one by one or in parallel. In this process some small files are written,
> but the main usage is reading many rather small files. (A database would
> be better suited for this, but I have no influence on that.) I would
> prefer not to create separate pools on SSD, but to use the RAM (some spare
> servers with 128 GB - 512 GB) for metadata caching.
>
> Thank you for your encouragement!
> Erich
>
> On Sun, 15 Jan 2023 at 16:29, Darren Soothill <darren.soothill@xxxxxxxx> wrote:
>
> > There are a few details missing to allow people to provide you with
> > advice.
> >
> > How many files are you expecting to be in this 100TB of capacity?
> > This really dictates what you are looking for. It could be full of 4K
> > files, which is a very different proposition to it being full of 100M
> > files.
> >
> > What sort of media is this file system made up of?
> > If you have 10s of millions of files on HDD, then you are going to want
> > a separate metadata pool for CephFS on some much faster storage.
> >
> > What sort of use case are you expecting for this storage?
> > You say it is heavily used, but what does that really mean?
> > Do you have 1000 HPC nodes all trying to access millions of 4K files?
> > Or are you using it as a more general purpose file system for, say, home
> > directories?
> >
> >
> > Darren Soothill
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io/
> >
> > croit GmbH, Freseniusstr. 31h, 81247 Munich
> > CEO: Martin Verges - VAT-ID: DE310638492
> > Com. register: Amtsgericht Munich HRB 231263
> > Web: https://croit.io/ | YouTube: https://goo.gl/PGE1Bx
> >
> >
> > On 15 Jan 2023, at 09:26, E Taka <0etaka0@xxxxxxxxx> wrote:
> >
> > Ceph 17.2.5:
> >
> > Hi,
> >
> > I'm looking for a reasonable and useful MDS configuration for a heavily
> > used CephFS (~100TB). "Heavily used" refers to the future; we have no
> > experience with it yet. For example, does it make a difference to
> > increase mds_cache_memory_limit or the number of MDS instances?
> >
> > The hardware does not set any limits, I just want to know where the
> > default values can be usefully optimized before problems occur.
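
Coming back to Frank's pool-layout suggestion (replicated default data pool,
an EC pool behind an "archives" directory, quotas to steer users, and manual
subtree pinning): for the archive, this is roughly how I intend to wire it up.
It is only a sketch, not a tested recipe. The mount point, pool name, quota
sizes and the number of active MDS ranks are my own placeholders; the xattr
names are the standard CephFS virtual xattrs, which can just as well be set
with setfattr from the shell; and the extra data pool has to be attached to
the file system first (ceph fs add_data_pool).

    #!/usr/bin/env python3
    # Sketch: steer data placement and MDS load on a mounted CephFS client
    # via the CephFS virtual xattrs. Pool name, paths, quota sizes and pin
    # ranks are placeholders for our setup, not recommendations.
    import os

    MOUNT = "/mnt/cephfs"            # assumed client mount point
    EC_POOL = "cephfs_data_ec82"     # assumed 8+2 EC data pool, already
                                     # added via 'ceph fs add_data_pool'

    def set_vxattr(path: str, name: str, value: str) -> None:
        """Same effect as: setfattr -n <name> -v <value> <path>."""
        os.setxattr(path, name, value.encode())

    def setup_user(home: str, mds_rank: int) -> None:
        archives = os.path.join(home, "archives")
        os.makedirs(archives, exist_ok=True)

        # archives/ goes to the EC pool; newly created files inherit the layout.
        set_vxattr(archives, "ceph.dir.layout.pool", EC_POOL)

        # CephFS quotas apply to the whole subtree, so the cap on the home
        # directory is the user's total, and the cap on archives/ limits how
        # much of that total may land on the EC pool.
        set_vxattr(home, "ceph.quota.max_bytes", str(2 * 1024**4))        # 2T total
        set_vxattr(archives, "ceph.quota.max_bytes", str(1536 * 1024**3)) # 1.5T on EC

        # Manual subtree pinning: tie this home directory to one MDS rank.
        set_vxattr(home, "ceph.dir.pin", str(mds_rank))

    if __name__ == "__main__":
        # Example: distribute user homes round-robin over 4 active MDS ranks.
        users = sorted(os.listdir(os.path.join(MOUNT, "home")))
        for i, user in enumerate(users):
            setup_user(os.path.join(MOUNT, "home", user), mds_rank=i % 4)

The pinning part follows Frank's advice to pin manually across all ranks
rather than relying on the default balancer.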
> >
> > Thanks,
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx