Hi Stefan,

I think the answer depends on whether you have the 2-pool or the 3-pool ceph-fs layout. I converted ours to the recommended 3-pool layout:

- 1 4x-replicated meta-data pool on SSD
- 1 4x-replicated pool for object backtraces on SSD (the primary fs data pool)
- 1 secondary 8+3 EC data pool on HDD, set as default at the root of the file system

A ceph df for these main file system pools looks like:

POOLS:
    NAME              ID     USED        %USED     MAX AVAIL     OBJECTS
    con-fs2-meta1     12     464 MiB      0.05       925 GiB      37519078
    con-fs2-meta2     13         0 B      0          925 GiB     351929999
    con-fs2-data2     19     1.2 PiB     21.46       4.6 PiB     519602711

Note that the second meta-data pool (the primary data pool) con-fs2-meta2 holds 10 times as many objects as the actual MDS meta-data pool con-fs2-meta1. The data in this pool is a total pain during recovery.

After quite some experimentation, I found that this strategy works best: pick the total PG count as the largest power of 2 that still keeps every OSD between 100 and 200 PGs. I never found a good reason to use fewer PGs than an OSD can support. In my experience, aiming for 100 to 200 PGs per OSD gives the best performance and security. With fewer PGs, IO can be distributed very unevenly and introduce artificial bottlenecks due to unreasonably high load on the busiest OSD(s).

The next step is to decide how many OSDs per SSD, assuming you use SSDs for the two meta-data pools. I ended up stuffing as much as I could afford onto the SSD drives. We have TOSHIBA PX05SMB040Y 372.61GB drives for the meta-data pools, and these drives have extremely good performance, which means they should be provisioned with more than 1 OSD. I ended up deploying 4 OSDs per disk with a bit over 100 PGs per OSD, meaning almost 500 PGs per disk. Yes, each OSD is only <100G in size, but the performance utilisation per disk is very, very good. Recovery of the meta-data pools as well as snap_trim finishes in a couple of minutes without users noticing any client performance impact.

The last step is to work out how to distribute the maximum available PGs between the pools that reside on the same drives, in my case con-fs2-meta1 and con-fs2-meta2. Here is the allocation of PGs that works best for me:

pool 12 'con-fs2-meta1' replicated size 4 min_size 2 crush_rule 3 object_hash rjenkins pg_num 256 pgp_num 256
pool 13 'con-fs2-meta2' replicated size 4 min_size 2 crush_rule 3 object_hash rjenkins pg_num 2048 pgp_num 2048
pool 19 'con-fs2-data2' erasure size 11 min_size 9 crush_rule 13 object_hash rjenkins pg_num 8192 pgp_num 8192

Remember to multiply pg_num by the replication factor (the pool's size) to get the real number of PG shards distributed over all disks. (Unfortunately, the ceph documentation uses the same term for "placement group" and "membership in a placement group (PG shard)", even though these are completely different things and numbers.)

Note the large PG allocation for the primary data pool con-fs2-meta2 (in fact a secondary meta-data pool). This was absolutely necessary to get acceptable recovery times. With smaller PG numbers, recovery of this pool can take longer than recovery of the HDD pools even though it is on SSD! If you have this data sitting on an HDD pool, I would see potential for trouble.

Lastly, I still use mimic and plan to upgrade to octopus during the summer.
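For completeness, here is a rough sketch of how such a 3-pool layout and PG allocation could be set up with the ceph CLI. The pool names and PG counts are taken from above; the crush rule names (rep-ssd, ec-hdd), the EC profile name (ec-8-3), the file system name con-fs2, the device /dev/sdX and the mount point /mnt/cephfs are placeholders for illustration, not necessarily the exact commands or settings we used:

# Replicated rule on SSD, EC profile/rule on HDD (placeholder names)
ceph osd crush rule create-replicated rep-ssd default host ssd
ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-device-class=hdd crush-failure-domain=host
ceph osd crush rule create-erasure ec-hdd ec-8-3

# Meta-data pool and primary data pool (object backtraces) on SSD, 4x replicated
ceph osd pool create con-fs2-meta1 256 256 replicated rep-ssd
ceph osd pool create con-fs2-meta2 2048 2048 replicated rep-ssd
ceph osd pool set con-fs2-meta1 size 4
ceph osd pool set con-fs2-meta2 size 4

# Secondary 8+3 EC data pool on HDD; EC overwrites are required for CephFS data
ceph osd pool create con-fs2-data2 8192 8192 erasure ec-8-3 ec-hdd
ceph osd pool set con-fs2-data2 allow_ec_overwrites true

# Create the fs with the replicated pool as primary data pool, then add the EC
# pool and make it the default at the root (requires the fs mounted, here /mnt/cephfs)
ceph fs new con-fs2 con-fs2-meta1 con-fs2-meta2
ceph fs add_data_pool con-fs2 con-fs2-data2
setfattr -n ceph.dir.layout.pool -v con-fs2-data2 /mnt/cephfs

# SSDs carved into 4 OSDs per device; afterwards check the PGS column of
# "ceph osd df", which should land roughly between 100 and 200 per OSD
ceph-volume lvm batch --osds-per-device 4 /dev/sdX
ceph osd df

# PG shards actually placed on disks = pg_num * size, e.g. for the pools above:
#   con-fs2-meta1:  256 * 4  =  1024 shards
#   con-fs2-meta2: 2048 * 4  =  8192 shards
#   con-fs2-data2: 8192 * 11 = 90112 shards

On an existing cluster the same PG counts can be applied after the fact with "ceph osd pool set <pool> pg_num <N>" (and pgp_num <N>).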
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <stefan@xxxxxx>
Sent: 02 June 2022 22:03:39
To: Ramana Venkatesh Raja
Cc: ceph-users@xxxxxxx
Subject: Re: Help needed picking the right amount of PGs for (Cephfs) metadata pool

On 6/2/22 20:46, Ramana Venkatesh Raja wrote:

<snip>

>>
>> We currently have 512 PGs allocated to this pool. The autoscaler suggests
>> reducing this amount to "32" PGs. This would result in only a fraction
>> of the OSDs having *all* of the metadata. I can tell you, based on
>> experience, that is not good advice (the longer story here [1]). At
>> least you want to spread out all OMAP data over as many (fast) disks as
>> possible. So in this case it should advise 256.
>>
>
> Curious, how many PGs do you have in total in all the pools of your
> Ceph cluster? What are the other pools (e.g., data pools) and each of
> their PG counts?

POOL                   SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE
REDACTEDXXXXXXXXXXXX   7724G                3.0   628.7T        0.0360                                 1.0   512                 off        < rbd
REDACTEDXXXXXXXXXXXX   31806M               3.0   628.7T        0.0001                                 1.0   128     32          off        < rbd pool
REDACTEDXXXXXXXXXXXX   53914G               3.0   628.7T        0.2512                                 1.0   4096    1024        off        < rbd pool
REDACTEDXXXXXXXXXXXX   5729G                3.0   628.7T        0.0267                                 1.0   256                 off        < rbd pool
REDACTEDXXXXXXXXXXXX   72411G               3.0   628.7T        0.3374                                 1.0   2048                off        < cephfs data pool
REDACTEDXXXXXXXXXXXX   999.4G               3.0   628.7T        0.0047                                 1.0   512     32          off        < rbd pool
REDACTEDXXXXXXXXXXXX   355.7k               3.0   628.7T        0.0000                                 1.0   8       32          off        < librados, used for locking (samba ctdb)
REDACTEDXXXXXXXXXXXX   19                   3.0   628.7T        0.0000                                 1.0   256     32          off        < rbd, test volume
REDACTEDXXXXXXXXXXXX   0                    3.0   628.7T        0.0000                                 1.0   128     32          off        < rbd, to be removed
REDACTEDXXXXXXXXXXXX   3316G                3.0   628.7T        0.0155                                 1.0   128                 off        < rbd
REDACTEDXXXXXXXXXXXX   98.61M               3.0   628.7T        0.0000                                 1.0   1                   off        < device metrics

>
> What version of Ceph are you using?

15.2.16

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx