Re: Understanding filesystem size

I have a single user producing lots of small files (currently about 4.7M of them, with a mean size of 3 MB); the total number of files in the filesystem is about 7M.
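
(The CephFS recursive statistics xattrs give these counts directly from the kernel mount, in case anyone wants to reproduce them; the subdirectory below is just a placeholder for the user's top directory:)

# getfattr -n ceph.dir.rfiles /wizard/ceph/<user_dir>
# getfattr -n ceph.dir.rbytes /wizard/ceph/<user_dir>

rfiles is the recursive file count and rbytes the recursive logical size, so the mean file size is simply rbytes/rfiles.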

About the occupancy: on the 1.8 TiB disks I see per-OSD PG counts ranging from 27 (-> 38% occupancy) down to 20 (-> 27% occupancy) at the same OSD weight (1.819). I guess this spread in PG counts is due to the small total number of PGs combined with the inefficiency of the balancer, do you agree? If that's correct then I see only two ways out: a manual rebalancing (tried in the past, with much effort and little result) or an increase in the PG count (risky because of the old hardware). Do you see any other possibility?
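
(For completeness, these are the balancer knobs I would double-check before resorting to either option; upmap mode needs all clients to be at least luminous-capable, which I have not verified here, and the set-require-min-compat-client step is only needed if it is not already set:)

# ceph balancer status
# ceph osd set-require-min-compat-client luminous
# ceph balancer mode upmap
# ceph balancer on
# ceph config set mgr mgr/balancer/upmap_max_deviation 1

The last setting asks the balancer to tolerate at most 1 PG of deviation per OSD instead of the default (5 in recent releases), which matters exactly when the per-OSD PG count is as low as 20-27.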

Cheers,

Nicola

On 02/01/25 5:30 PM, Anthony D'Atri wrote:


On Jan 2, 2025, at 11:18 AM, Nicola Mori <mori@xxxxxxxxxx> wrote:

Hi Anthony, thanks for your insights. I actually used df -h from the bash shell of a machine mounting the CephFS with the kernel module, and here's the current result:

wizardfs_rootsquash@b1029256-7bb3-11ec-a8ce-ac1f6b627b45.wizardfs=/ 217T   78T  139T  36% /wizard/ceph

So it seems the fs size is 217 TiB, which is about 66% of the total amount of raw disk space (320 TiB) as I wrote before.

Then I tried the command you suggested:

# ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    320 TiB  216 TiB  104 TiB   104 TiB      32.56
TOTAL  320 TiB  216 TiB  104 TiB   104 TiB      32.56

--- POOLS ---
POOL             ID  PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr              1    1  242 MiB       62  726 MiB      0     62 TiB
wizard_metadata   2   16  1.2 GiB   85.75k  3.5 GiB      0     62 TiB
wizard_data       3  512   78 TiB   27.03M  104 TiB  36.06    138 TiB

In order to find the total size of the data pool I don't understand how to interpret the "MAX AVAIL" column: should it be added to "STORED" or to "USED"?

Do you have a lot of small files?

In the first case I'd get 216 TiB, which corresponds to what df -h says and thus to the ~66% figure; in the second case I'd get 242 TiB, which is very close to 75%... But I guess the first option is the right one.
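
Writing out the arithmetic for my own benefit: STORED + MAX AVAIL = 78 TiB + 138 TiB = 216 TiB, i.e. the client-visible capacity that df reports, while USED = 104 TiB already includes the replication/EC overhead (104/78 is roughly 1.33x raw bytes per logical byte). As far as I understand, MAX AVAIL is also an estimate based on the fullest OSD reachable by the pool's CRUSH rule, which would explain why an unbalanced cluster reports less available space than the raw numbers alone would suggest.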

Then I looked at the weights of my failure domains (hosts):

#    ceph osd tree | grep host

-7          25.51636      host aka
-3          25.51636      host balin
-13          29.10950      host bifur
-17          29.10950      host bofur
-21          29.10371      host dwalin
-23          21.83276      host fili
-25          29.10950      host kili
-9          25.51636      host ogion
-19          25.51636      host prestno
-15          29.10522      host remolo
-5          25.51636      host rokanan
-11          27.29063      host romolo

They seem quite even, and they roughly reflect the actual total disk size of each host:

# ceph orch host ls --detail
HOST     . . .  HDD
aka              9/28.3TB
balin            9/28.3TB
bifur            9/32.5TB
bofur            8/32.0TB
dwalin          16/32.0TB
fili            12/24.0TB
kili             8/32.0TB
ogion            8/28.0TB
prestno          9/28.3TB
remolo          16/32.0TB
rokanan          9/28.5TB
romolo          16/30.0TB

so I see no problem here (in fact, making these even is the idea behind the disk upgrade strategy I am pursuing).
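
(To double-check the per-host picture one can also look at the aggregated host rows of the OSD utilization tree, which fold weight and actual usage together:)

# ceph osd df tree | grep -E 'CLASS|host'

Each host row shows the summed SIZE, RAW USE and %USE for that failure domain, so an uneven host would show up there directly.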

About the OSD outlier: there seems to be no such OSD; the maximum OSD occupancy is 38% and it decreases smoothly down to a minimum of 27%, with no jumps.

That’s a very high variance.  If the balancer is working it should be within about +/- 1-2%.  Available space in the cluster will be reported as though all OSDs were at 38%.
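
A quick way to quantify that spread: the summary at the bottom of the ceph osd df output reports MIN/MAX VAR and STDDEV across all OSDs, and the per-OSD VAR column shows how far each one sits from the cluster mean; with a working balancer the STDDEV (in %USE points) should be down around 1-2.

# ceph osd df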


About PGs: I have 512 PGs in the data pool and 124 OSDs in total. Maybe the count is too low, but I'm hesitant to increase it since my cluster has very low specs and I fear running out of memory on the oldest machines.
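
(If I do go down that road, what I would check first is the per-OSD memory target and the throttle on data movement; the pg_num value in the last command is only an illustrative target, not something I have decided on:)

# ceph config get osd osd_memory_target
# ceph config get mgr target_max_misplaced_ratio
# ceph osd pool set wizard_data pg_num 1024

As far as I understand, since Nautilus a pg_num increase is applied gradually by the mgr, with the fraction of misplaced data capped by target_max_misplaced_ratio (5% by default), while the steady-state memory footprint per OSD is governed by osd_memory_target (4 GiB by default), which can be lowered on the weakest hosts.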

About CRUSH rules: I don't know exactly what to look for, so if you believe it's important then I'd need some advice.
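
(In the meantime I can at least dump what the data pool is actually using, in case you spot something off; I guess the relevant bits are the failure-domain type in the chooseleaf step and the device class, if any:)

# ceph osd pool get wizard_data crush_rule
# ceph osd crush rule dump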

Thank you again for your valuable help,

Nicola


--
Nicola Mori, Ph.D.
INFN sezione di Firenze
Via Bruno Rossi 1, 50019 Sesto F.no (Italy)
+390554572660
mori@xxxxxxxxxx


_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
