Hi Motahare,
On 13/11/2023 14:44, Motahare S wrote:
Hello everyone,
Recently we noticed that the STORED and USED values reported by "ceph df"
do not match: the amount of stored data *1.5 (the EC factor) is still
about 5TB short of the used amount:
POOL                      ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
default.rgw.buckets.data  12  1024  144 TiB  70.60M   221 TiB  18.68  643 TiB
blob and alloc configs are as below:
bluestore_min_alloc_size_hdd : 65536
bluestore_min_alloc_size_ssd : 4096
bluestore_max_blob_size_hdd : 524288
bluestore_max_blob_size_ssd : 65536
bluefs_shared_alloc_size : 65536
From sources across the web about how Ceph actually writes to disk, I
presumed that it zero-pads the extents of an object to match the
4KB bdev_block_size and then writes them into a blob that matches
min_alloc_size; however, it can reuse parts of the blob's unwritten (but
allocated, because of min_alloc_size) space for another extent later.
The problem, though, is what we saw when we tested different configs in a
minimal Ceph Octopus cluster with a 2G OSD and
bluestore_min_alloc_size_hdd = 65536. When we uploaded a 1KB file with the
aws s3 client, the used/stored space was 64KB/1KB. We then uploaded another
1KB file, and it went to 128K/2K. We kept doing this until 100% of the pool
was used, with only 32MB stored. I expected Ceph to start writing new 1KB
files into the wasted 63KB (60KB) of each min_alloc_size block, but the
cluster behaved as a completely full cluster and could no longer accept any
new object. Is this behaviour expected for S3? Does Ceph really use 64x the
space if your dataset is made of 1KB files? Should all object sizes be a
multiple of 64KB? Note that 5TB / (70.6M*1.5) ~ 50, so for every rados
object about 50KB is wasted on average. We didn't observe this problem in
RBD pools, probably because RBD splits all data into 4MB objects.
The above analysis is correct: BlueStore will indeed waste up to 64K for
every object that is not aligned to 64K (i.e. both 1K and 65K objects will
waste 63K).
Hence n 1K objects take n*64K bytes.
And since S3 object sizes are unaligned, they tend to waste 32K bytes on
average per object (assuming their sizes are uniformly distributed).
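
To make the rounding concrete, here is a minimal sketch in Python. It is
only a model of the arithmetic described above (each object rounded up to a
whole number of 64K allocation units), not BlueStore's actual allocator
code. The function names are made up for illustration, and the 2 GiB
estimate assumes the whole device is usable for data and ignores metadata.

```python
# Simplified model: every object is rounded up to a whole number of
# allocation units. This is an illustration, not BlueStore code.
KiB = 1024
MiB = 1024 * KiB
GiB = 1024 * MiB
MIN_ALLOC_SIZE = 65536  # bluestore_min_alloc_size_hdd from this thread


def allocated_bytes(object_size: int, alloc_unit: int = MIN_ALLOC_SIZE) -> int:
    """Disk space consumed: object size rounded up to the allocation unit."""
    units = -(-object_size // alloc_unit)  # ceiling division
    return units * alloc_unit


def wasted_bytes(object_size: int, alloc_unit: int = MIN_ALLOC_SIZE) -> int:
    """Allocated but unused bytes for a single object."""
    return allocated_bytes(object_size, alloc_unit) - object_size


if __name__ == "__main__":
    for size in (1 * KiB, 64 * KiB, 65 * KiB):
        print(f"{size // KiB}K object: "
              f"allocated {allocated_bytes(size) // KiB}K, "
              f"wasted {wasted_bytes(size) // KiB}K")

    # Average waste when sizes are spread uniformly over one allocation unit.
    avg = sum(wasted_bytes(s) for s in range(1, MIN_ALLOC_SIZE + 1)) / MIN_ALLOC_SIZE
    print(f"average waste: {avg / KiB:.2f}K")

    # Rough check against the 2G test OSD (assumes all 2 GiB hold data).
    objects_until_full = (2 * GiB) // allocated_bytes(1 * KiB)
    print(f"1K objects until full: {objects_until_full}, "
          f"stored: {objects_until_full * KiB // MiB} MiB")
```

Under these assumptions the model gives about 32 MiB stored when a 2 GiB
device fills up with 1K objects, which is in line with the test result
reported above.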
The only correction to the above math comes from the actual k+m EC layout.
E.g. for 2+1 EC each rados object is stored as 3 chunks, so the object
count multiplier would be 3, not 1.5.
Hence the overhead per stored chunk is noticeably less than 50K in your
case.
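
As a rough worked version of that correction, the sketch below redoes the
arithmetic in Python with the figures from the "ceph df" output quoted
above. The 2+1 profile is only the example used here, since the actual EC
profile is not stated in this thread, and the "ceph df" values are rounded,
so the results are estimates. The function names are illustrative.

```python
# Back-of-the-envelope estimate from the "ceph df" figures in this thread.
# Assumption: a 2+1 EC profile (not confirmed by the original post).
KiB = 1024
TiB = 2 ** 40

STORED = 144 * TiB   # STORED column
USED = 221 * TiB     # USED column
OBJECTS = 70.60e6    # OBJECTS column (rounded by ceph df)


def unexplained_bytes(stored: float, used: float, k: int, m: int) -> float:
    """USED minus what pure k+m erasure coding of STORED would take."""
    return used - stored * (k + m) / k


def overhead_per_unit(gap: float, objects: float, multiplier: float) -> float:
    """Spread the gap over objects * multiplier allocation-rounded units."""
    return gap / (objects * multiplier)


if __name__ == "__main__":
    k, m = 2, 1
    gap = unexplained_bytes(STORED, USED, k, m)
    print(f"unexplained space: {gap / TiB:.1f} TiB")

    # Original estimate: multiplier 1.5 (the EC space factor).
    print(f"multiplier 1.5: {overhead_per_unit(gap, OBJECTS, 1.5) / KiB:.1f} KiB")

    # Corrected estimate: one rounded chunk per shard, k + m = 3 per object.
    print(f"multiplier {k + m}: {overhead_per_unit(gap, OBJECTS, k + m) / KiB:.1f} KiB")
```

With a multiplier of 3 the estimate is roughly half of the 50K figure, which
is closer to the 32K average expected for uniformly distributed sizes.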
I know that min_alloc_hdd was changed to 4KB in Pacific, but I'm still
curious how allocation really works and why it doesn't behave as expected.
Also, re-deploying OSDs is a headache.
Sincerely
Motahare
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx