Ceph Allocation - used space is unreasonably higher than stored space

Motahare S <motaharesdq@xxxxxxxxx> · Mon, 13 Nov 2023 15:14:07 +0330

Hello everyone,

Recently we noticed that the stored and used space reported by "ceph df"
do not match: the stored amount multiplied by 1.5 (the EC factor) is still
about 5 TB short of the used amount:

POOL                      ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
default.rgw.buckets.data  12  1024  144 TiB  70.60M   221 TiB  18.68  643 TiB
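
As a rough sketch of the arithmetic behind that gap (the variable names are
only illustrative, and "ceph df" rounds its figures, so the result is
approximate):

```python
# Figures copied from the "ceph df" output above (rounded by ceph, so approximate).
stored_tib = 144      # STORED column
used_tib = 221        # USED column
ec_factor = 1.5       # EC overhead quoted above

expected_used_tib = stored_tib * ec_factor   # raw space the data alone should need
gap_tib = used_tib - expected_used_tib       # space not explained by EC overhead

print(f"expected USED: {expected_used_tib:.0f} TiB, unexplained: {gap_tib:.0f} TiB")
```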

The blob and alloc configs are as below:
bluestore_min_alloc_size_hdd : 65536
bluestore_min_alloc_size_ssd : 4096
bluestore_max_blob_size_hdd  : 524288
bluestore_max_blob_size_ssd  : 65536
bluefs_shared_alloc_size     : 65536

From sources across the web about how Ceph actually writes to disk, I
presumed that it zero-pads the extents of an object to match the 4 KB
bdev_block_size and then writes them into a blob that matches
min_alloc_size; however, it can later reuse the blob's unwritten space
(allocated only because of min_alloc_size) for another extent.

The problem, though, is that we tested different configs on a minimal Ceph
Octopus cluster with a 2 GB OSD and bluestore_min_alloc_size_hdd = 65536.
When we uploaded a 1 KB file with the AWS S3 client, the used/stored space
was 64 KB/1 KB. We then uploaded another 1 KB file and it went to
128 KB/2 KB. We kept doing this until 100% of the pool was used, with only
32 MB stored. I expected Ceph to start writing new 1 KB files into the
wasted 63 KB (60 KB after 4 KB padding) of each min_alloc_size block, but
the cluster behaved as completely full and could no longer accept any new
objects. Is this behaviour expected for S3? Does Ceph really use 64x the
space if the dataset is made of 1 KB files, and should all object sizes be
a multiple of 64 KB? Note that 5 TB / (70.6M * 1.5) ~ 50 KB, so about 50 KB
is wasted per RADOS object on average.

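
The numbers above can be reproduced with a small Python sketch. It assumes
the simplest possible model, in which every 1 KB object occupies one whole
allocation unit and nothing else (BlueFS, metadata) uses the OSD; that is
what the test suggested, not a description of BlueStore internals. All
variable names are illustrative.

```python
KIB, MIB, GIB, TIB = 2**10, 2**20, 2**30, 2**40

# Test cluster: assume each 1 KiB object consumes one full allocation unit.
min_alloc = 64 * KIB          # bluestore_min_alloc_size_hdd = 65536
osd_size = 2 * GIB            # the 2G test OSD, ignoring any other overhead
object_size = 1 * KIB

objects_until_full = osd_size // min_alloc
stored_when_full = objects_until_full * object_size
print(f"{objects_until_full} objects, {stored_when_full // MIB} MiB stored when full")

# Production pool: average unexplained space per RADOS object.
gap_bytes = 5 * TIB           # gap between USED and STORED * 1.5
objects = 70.60e6             # OBJECTS column of "ceph df"
ec_factor = 1.5
per_object_kib = gap_bytes / (objects * ec_factor) / KIB
print(f"about {per_object_kib:.1f} KiB unexplained per object")
```

Under that assumption the OSD fills up at 32 MiB stored, which matches what
we saw in the test.
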
We didn't observe this problem in RBD pools, probably because RBD cuts all
data into 4 MB objects.

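
To illustrate why object size would matter, here is a minimal sketch of the
space amplification under the same simplified assumption (each object is
rounded up to a whole number of 64 KB allocation units, with no sharing
between objects). The allocated() helper is hypothetical and only models
the rounding.

```python
def allocated(size: int, min_alloc: int) -> int:
    """Round size up to a whole number of allocation units (simplified model)."""
    return -(-size // min_alloc) * min_alloc


KIB, MIB = 2**10, 2**20
min_alloc = 64 * KIB  # bluestore_min_alloc_size_hdd = 65536

# Compare a tiny S3 object, a mid-sized one, and a 4 MiB RBD-style object.
for size in (1 * KIB, 50 * KIB, 4 * MIB):
    alloc = allocated(size, min_alloc)
    print(f"{size:>8} B -> {alloc:>8} B allocated, x{alloc / size:.2f}")
```

In this model a 4 MB object is an exact multiple of 64 KB and wastes
nothing, while a 1 KB object is amplified 64 times.
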
I know that bluestore_min_alloc_size_hdd was changed to 4 KB in Pacific,
but I'm still curious how allocation really works and why it doesn't behave
as I expected. Also, re-deploying OSDs is a headache.

Sincerely
Motahare