Hi Stefan,

I think the answer depends on whether you have the 2-pool or the 3-pool ceph-fs layout. I converted ours to the recommended 3-pool layout:

- 1 4x-replicated meta-data pool on SSD
- 1 4x-replicated pool for object backtraces on SSD (the primary fs data pool)
- 1 secondary 8+3 EC data pool on HDD, set as default at the root of the file system

A ceph df for these main file system pools looks like:

POOLS:
    NAME              ID     USED        %USED     MAX AVAIL     OBJECTS
    con-fs2-meta1     12     464 MiB      0.05       925 GiB      37519078
    con-fs2-meta2     13         0 B      0          925 GiB     351929999
    con-fs2-data2     19     1.2 PiB     21.46       4.6 PiB     519602711

Note that the second meta-data pool (the primary data pool) con-fs2-meta2 holds 10 times as many objects as the actual MDS meta-data pool con-fs2-meta1. The data in this pool is a total pain during recovery.

After quite some experimentation, I found that this strategy works best: pick the total PG count as the largest power of 2 that still keeps every OSD between 100 and 200 PGs. I never found a good reason to use fewer PGs than an OSD can support. In my experience, aiming for 100 to 200 PGs per OSD gives the best performance and security. With fewer PGs, IO can be distributed very unevenly and introduce artificial bottlenecks due to unreasonably high load on the busiest OSD(s).

The next step is to decide how many OSDs per SSD, assuming you use SSDs for the two meta-data pools. I ended up stuffing as much as I could afford onto the SSD drives. We have TOSHIBA PX05SMB040Y 372.61GB drives for the meta-data pools, and these drives have extremely good performance, which means they should be provisioned with more than 1 OSD. I ended up deploying 4 OSDs per disk with a bit over 100 PGs per OSD, meaning almost 500 PGs per disk. Yes, each OSD is only <100G in size, but the performance utilisation per disk is very, very good. Recovery of the meta-data pools as well as snap_trim finishes in a couple of minutes without users noticing any client performance impact.

The last step is to work out how to distribute the maximum available PGs between the pools that reside on the same drives, in my case con-fs2-meta1 and con-fs2-meta2. Here is the allocation of PGs that works best for me:

pool 12 'con-fs2-meta1' replicated size 4 min_size 2 crush_rule 3 object_hash rjenkins pg_num 256 pgp_num 256
pool 13 'con-fs2-meta2' replicated size 4 min_size 2 crush_rule 3 object_hash rjenkins pg_num 2048 pgp_num 2048
pool 19 'con-fs2-data2' erasure size 11 min_size 9 crush_rule 13 object_hash rjenkins pg_num 8192 pgp_num 8192

Remember to multiply pg_num by the replication factor (the pool's size) to get the real number of PG shards distributed over all disks. (Unfortunately, the ceph documentation uses the same term for "placement group" and "membership in a placement group (PG shard)", even though these are completely different things and numbers.)

Note the large PG allocation for the primary data pool con-fs2-meta2 (in fact a secondary meta-data pool). This was absolutely necessary to get acceptable recovery times. With smaller PG numbers, recovery of this pool can take longer than recovery of the HDD pools even though it is on SSD! If you have this data sitting on an HDD pool, I would see potential for trouble.

Lastly, I still use mimic and plan to upgrade to octopus during the summer.
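For completeness, here is a rough sketch of how such a 3-pool layout and PG allocation could be set up with the ceph CLI. The pool names and PG counts are taken from above; the crush rule names (rep-ssd, ec-hdd), the EC profile name (ec-8-3), the file system name con-fs2, the device /dev/sdX and the mount point /mnt/cephfs are placeholders for illustration, not necessarily the exact commands or settings we used:

# Replicated rule on SSD, EC profile/rule on HDD (placeholder names)
ceph osd crush rule create-replicated rep-ssd default host ssd
ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-device-class=hdd crush-failure-domain=host
ceph osd crush rule create-erasure ec-hdd ec-8-3

# Meta-data pool and primary data pool (object backtraces) on SSD, 4x replicated
ceph osd pool create con-fs2-meta1 256 256 replicated rep-ssd
ceph osd pool create con-fs2-meta2 2048 2048 replicated rep-ssd
ceph osd pool set con-fs2-meta1 size 4
ceph osd pool set con-fs2-meta2 size 4

# Secondary 8+3 EC data pool on HDD; EC overwrites are required for CephFS data
ceph osd pool create con-fs2-data2 8192 8192 erasure ec-8-3 ec-hdd
ceph osd pool set con-fs2-data2 allow_ec_overwrites true

# Create the fs with the replicated pool as primary data pool, then add the EC
# pool and make it the default at the root (requires the fs mounted, here /mnt/cephfs)
ceph fs new con-fs2 con-fs2-meta1 con-fs2-meta2
ceph fs add_data_pool con-fs2 con-fs2-data2
setfattr -n ceph.dir.layout.pool -v con-fs2-data2 /mnt/cephfs

# SSDs carved into 4 OSDs per device; afterwards check the PGS column of
# "ceph osd df", which should land roughly between 100 and 200 per OSD
ceph-volume lvm batch --osds-per-device 4 /dev/sdX
ceph osd df

# PG shards actually placed on disks = pg_num * size, e.g. for the pools above:
#   con-fs2-meta1:  256 * 4  =  1024 shards
#   con-fs2-meta2: 2048 * 4  =  8192 shards
#   con-fs2-data2: 8192 * 11 = 90112 shards

On an existing cluster the same PG counts can be applied after the fact with "ceph osd pool set <pool> pg_num <N>" (and pgp_num <N>).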
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: 02 June 2022 22:03:39
To: Ramana Venkatesh Raja
Cc: ceph-users@xxxxxxx
Subject: Re: Help needed picking the right amount of PGs for (Cephfs) metadata pool

On 6/2/22 20:46, Ramana Venkatesh Raja wrote:

<snip>

>>
>> We currently have 512 PGs allocated to this pool. The autoscaler suggests
>> reducing this amount to "32" PGs. This would result in only a fraction
>> of the OSDs having *all* of the metadata. I can tell you, based on
>> experience, that is not good advice (the longer story here [1]). At
>> least you want to spread out all OMAP data over as many (fast) disks as
>> possible. So in this case it should advise 256.
>>
>
> Curious, how many PGs do you have in total in all the pools of your
> Ceph cluster? What are the other pools (e.g., data pools) and each of
> their PG counts?

POOL                   SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
REDACTEDXXXXXXXXXXXX   7724G                3.0   628.7T        0.0360                                 1.0   512                 off        < rbd
REDACTEDXXXXXXXXXXXX   31806M               3.0   628.7T        0.0001                                 1.0   128     32          off        < rbd pool
REDACTEDXXXXXXXXXXXX   53914G               3.0   628.7T        0.2512                                 1.0   4096    1024        off        < rbd pool
REDACTEDXXXXXXXXXXXX   5729G                3.0   628.7T        0.0267                                 1.0   256                 off        < rbd pool
REDACTEDXXXXXXXXXXXX   72411G               3.0   628.7T        0.3374                                 1.0   2048                off        < cephfs data pool
REDACTEDXXXXXXXXXXXX   999.4G               3.0   628.7T        0.0047                                 1.0   512     32          off        < rbd pool
REDACTEDXXXXXXXXXXXX   355.7k               3.0   628.7T        0.0000                                 1.0   8       32          off        < librados, used for locking (samba ctdb)
REDACTEDXXXXXXXXXXXX   19                   3.0   628.7T        0.0000                                 1.0   256     32          off        < rbd, test volume
REDACTEDXXXXXXXXXXXX   0                    3.0   628.7T        0.0000                                 1.0   128     32          off        < rbd, to be removed
REDACTEDXXXXXXXXXXXX   3316G                3.0   628.7T        0.0155                                 1.0   128                 off        < rbd
REDACTEDXXXXXXXXXXXX   98.61M               3.0   628.7T        0.0000                                 1.0   1                   off        < device metrics

>
> What version of Ceph are you using?

15.2.16

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx