> They seem quite even

Indeed. Assuming that your failure domain is host, that shouldn't be a factor in stranded capacity. We mostly see that happen with, say, a cluster with a rack failure domain, only 3 racks, and replicated pools, or with your 6,2 EC pool and 6 racks. Having more failure domains than the replication / EC width eases those concerns.

> About CRUSH rules: I don't know exactly what to search for, so if you believe it's important then I'd need some advice.

`ceph osd crush rule dump`

> I have a single user producing lots of small files (currently about 4.7M with a mean size of 3 MB). The total number of files is about 7M.

That could contribute to the stored vs. used disparity, since Ceph (currently) writes full stripes. The data pool at EC 6,2 will allocate underlying storage in multiples of 8*4 = 32 KB (8 shards times a 4 KB min_alloc_size). So if there is a substantial number of objects smaller than, say, 128 KB, they will strand some percentage of capacity. This visualization shows that as the object size increases, the potential space amplification quickly falls into the background noise:

https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI/edit?gid=358760253#gid=358760253
(Bluestore Space Amplification Cheat Sheet)

Now, I didn't think to ask before which Ceph release you're running, and more importantly which release was running when the OSDs were built. Around the Octopus / Pacific timeframe the default min_alloc_size for HDD OSDs was reduced from 64 KB to 4 KB to minimize space amplification, especially for small RGW objects. If you're running a recent release, `ceph osd metadata` will show you the value baked into each OSD.
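For example, here's a rough sketch of how to check that (it assumes jq is installed and that your release exposes a bluestore_min_alloc_size key in the OSD metadata; on older builds that key may be absent):

    # Print each OSD's id and the min_alloc_size it was built with.
    # OSDs whose metadata lacks the key are reported as "unknown".
    ceph osd metadata | jq -r '.[] | "osd.\(.id)  min_alloc_size=\(.bluestore_min_alloc_size // "unknown")"' | sort -V

Anything still reporting 65536 was created with the old 64 KB default; since min_alloc_size is fixed when the OSD is created, picking up the 4 KB value means redeploying those OSDs.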
> About the occupancy: in 1.8 TiB disks I see the PG count ranging from 27 (-> 38% occupancy) to 20 (-> 27% occupancy) at the same OSD weight (1.819). I guess these fluctuations of the number of PGs are due to the small number of PGs

I think we haven't seen your PG count. `ceph osd df` please, and for completeness `ceph osd dump | grep pool`.

> coupled to the inefficiency of the balancer, do you agree?

If the balancer were working it would do better than a 27-38% spread.

> If it's correct then I see only two ways: a manual rebalancing (tried in the past with much effort and little results)

Did you use `reweight-by-utilization`? If `ceph osd tree` shows OSDs with a REWEIGHT value < 1.00000 that could be a factor. Mixing old-style override reweights with new-style pg-upmap can confuse the balancer. If you do have OSDs with REWEIGHT values set, try resetting them to 1.00000.

> or an increase in PG count (risky because of old hardware)

I don't think increasing PGs would intersect with old hardware, unless perhaps you're riiiight at the edge with respect to RAM. More PGs will use a bit more RAM within OSD processes, but at your scale I doubt that will be significant. Send the above and we'll be able to pass judgement on your pg_nums.

> do you see any other possibility?

https://www.syfy.com/sites/syfy/files/styles/hero_image__large__computer__alt/public/wire/legacy/itsaliens.jpg

>
> Cheers,
>
> Nicola
>
> On 02/01/25 5:30 PM, Anthony D'Atri wrote:
>>> On Jan 2, 2025, at 11:18 AM, Nicola Mori <mori@xxxxxxxxxx> wrote:
>>>
>>> Hi Anthony, thanks for your insights.
>>> I actually used df -h from the bash shell of a machine mounting the CephFS with the kernel module, and here's the current result:
>>>
>>> wizardfs_rootsquash@b1029256-7bb3-11ec-a8ce-ac1f6b627b45.wizardfs=/  217T  78T  139T  36% /wizard/ceph
>>>
>>> So it seems the fs size is 217 TiB, which is about 66% of the total amount of raw disk space (320 TiB), as I wrote before.
>>>
>>> Then I tried the command you suggested:
>>>
>>> # ceph df
>>> --- RAW STORAGE ---
>>> CLASS    SIZE     AVAIL    USED     RAW USED  %RAW USED
>>> hdd      320 TiB  216 TiB  104 TiB   104 TiB      32.56
>>> TOTAL    320 TiB  216 TiB  104 TiB   104 TiB      32.56
>>>
>>> --- POOLS ---
>>> POOL             ID  PGS  STORED   OBJECTS  USED     %USED  MAX AVAIL
>>> .mgr              1    1  242 MiB       62  726 MiB      0     62 TiB
>>> wizard_metadata   2   16  1.2 GiB   85.75k  3.5 GiB      0     62 TiB
>>> wizard_data       3  512   78 TiB   27.03M  104 TiB  36.06    138 TiB
>>>
>>> In order to find the total size of the data pool I don't understand how to interpret the "MAX AVAIL" column: should it be summed with "STORED" or with "USED"?

>> Do you have a lot of small files?

>>> In the first case I'd get 216 TiB, which corresponds to what df -h says and thus to 66%; in the second case I'd get 242 TiB, which is very close to 75%... But I guess the first option is the right one.
>>>
>>> Then I looked at the weights of my failure domain (host):
>>>
>>> # ceph osd tree | grep host
>>>  -7    25.51636    host aka
>>>  -3    25.51636    host balin
>>> -13    29.10950    host bifur
>>> -17    29.10950    host bofur
>>> -21    29.10371    host dwalin
>>> -23    21.83276    host fili
>>> -25    29.10950    host kili
>>>  -9    25.51636    host ogion
>>> -19    25.51636    host prestno
>>> -15    29.10522    host remolo
>>>  -5    25.51636    host rokanan
>>> -11    27.29063    host romolo
>>>
>>> They seem quite even and quite reflective of the actual total size of each host:
>>>
>>> # ceph orch host ls --detail
>>> HOST     . . .  HDD
>>> aka             9/28.3TB
>>> balin           9/28.3TB
>>> bifur           9/32.5TB
>>> bofur           8/32.0TB
>>> dwalin          16/32.0TB
>>> fili            12/24.0TB
>>> kili            8/32.0TB
>>> ogion           8/28.0TB
>>> prestno         9/28.3TB
>>> remolo          16/32.0TB
>>> rokanan         9/28.5TB
>>> romolo          16/30.0TB
>>>
>>> so I see no problem here (in fact, making these even is the idea behind the disk upgrade strategy I am pursuing).
>>>
>>> About the OSD outlier: there seems to be no such OSD; the maximum OSD occupancy is 38% and it smoothly decreases down to a minimum of 27% with no jumps.

>> That's a very high variance. If the balancer is working it should be like +/- 1-2%. Available space in the cluster will be reported as though all OSDs are at 38%.

>>> About PGs: I have 512 PGs in the data pool and 124 OSDs in total. Maybe the count is too low, but I'm hesitant to increase it since my cluster is very low-spec and I fear running out of memory on the oldest machines.
>>>
>>> About CRUSH rules: I don't know exactly what to search for, so if you believe it's important then I'd need some advice.
>>>
>>> Thank you again for your precious help,
>>>
>>> Nicola

> --
> Nicola Mori, Ph.D.
> INFN sezione di Firenze
> Via Bruno Rossi 1, 50019 Sesto F.no (Italy)
> +390554572660
> mori@xxxxxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx