Igor Fedotov wrote:
> Hi Motahare,
>
> On 13/11/2023 14:44, Motahare S wrote:
> > Hello everyone,
> >
> > Recently we have noticed that the "stored" and "used" space reported by
> > "ceph df" do not match; the stored amount * 1.5 (the EC factor) is still
> > about 5 TiB short of the used amount:
> >
> > POOL                      ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> > default.rgw.buckets.data  12  1024  144 TiB  70.60M   221 TiB  18.68  643 TiB
> >
> > The blob and alloc configs are as follows:
> > bluestore_min_alloc_size_hdd : 65536
> > bluestore_min_alloc_size_ssd : 4096
> > bluestore_max_blob_size_hdd : 524288
> > bluestore_max_blob_size_ssd : 65536
> > bluefs_shared_alloc_size : 65536
> >
> > From sources across the web about how Ceph actually writes to disk, I
> > presumed that it zero-pads the extents of an object to match the 4 KB
> > bdev_block_size and then writes them into a blob that matches
> > min_alloc_size, but that it can later re-use the unwritten (yet
> > allocated, because of min_alloc_size) parts of the blob for another
> > extent. The problem, though, is that we tested different configs in a
> > minimal Ceph Octopus cluster with a 2 GB OSD and
> > bluestore_min_alloc_size_hdd = 65536. When we uploaded a 1 KB file with
> > the aws s3 client, used/stored was 64 KB/1 KB. We uploaded another 1 KB
> > file and it went to 128 KB/2 KB; we kept going until 100% of the pool
> > was used but only 32 MB was stored. I expected Ceph to start writing new
> > 1 KB files into the wasted 63 KB (60 KB) tails of the min_alloc_size
> > blocks, but the cluster behaved as a completely full cluster and could
> > no longer accept any new object. Is this behaviour expected for S3? Does
> > Ceph really use 64x the space if your dataset is made of 1 KB files, and
> > should all your object sizes be a multiple of 64 KB? Note that
> > 5 TB / (70.6M * 1.5) ~ 50, so about 50 KB is wasted per rados object on
> > average. We didn't observe this problem in RBD pools, probably because
> > RBD cuts everything into 4 MB objects.
>
> The above analysis is correct; indeed BlueStore will waste up to 64K for
> every object not aligned to 64K (i.e. both 1K and 65K objects will waste
> 63K).
>
> Hence n*1K objects take n*64K bytes.
>
> And since S3 objects are unaligned, they tend to waste 32K bytes on
> average per object (assuming their sizes are distributed evenly).
>
> The only correction to the above math would be due to the actual m+n EC
> layout. E.g. for 2+1 EC the object count multiplier would be 3, not 1.5.
> Hence the overhead per rados object is rather less than 50K in your case.
>
> > I know that min_alloc_hdd was changed to 4 KB in Pacific, but I'm still
> > curious how allocation really works and why it doesn't behave as
> > expected. Also, re-deploying OSDs is a headache.
> >
> > Sincerely,
> > Motahare

Thank you Igor,

Yeah, the ~25 KB waste per rados object seems reasonable.
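Just to double-check that number against the ceph df output above, this is
the back-of-the-envelope sketch I used. The 2+1 profile is taken from
Igor's example and is an assumption on my part, as is the even
distribution of object sizes modulo min_alloc_size:

  # Rough sanity check of the ~25 KB figure. Assumed (not confirmed for
  # this cluster): a 2+1 EC profile, per Igor's example, and object sizes
  # evenly distributed modulo min_alloc_size.
  TiB = 2**40
  KiB = 2**10

  stored    = 144 * TiB     # STORED from the ceph df output above
  used      = 221 * TiB     # USED
  objects   = 70.60e6       # OBJECTS
  k, m      = 2, 1          # assumed EC profile (space factor 1.5)
  min_alloc = 64 * KiB      # bluestore_min_alloc_size_hdd

  overhead = used - stored * (k + m) / k   # space not explained by EC alone, ~5 TiB
  shards   = objects * (k + m)             # each rados object stores k+m shards

  print("total overhead:           %.1f TiB" % (overhead / TiB))
  print("observed waste per shard: %.1f KiB" % (overhead / shards / KiB))
  print("expected waste per shard: %.1f KiB" % (min_alloc / 2 / KiB))

That lands at roughly 25 KiB observed versus the ~32 KiB one would expect
per shard for purely unaligned sizes, so it is in the same ballpark as
what you describe.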
A couple of questions, though:

1. Is the whole flow of blobs re-using already-allocated space (the empty
sub-sections of blocks that were already rounded up to min_alloc_size)
limited to RBD/CephFS? I have read some blog posts (e.g.
https://blog.51cto.com/u_15265005/2888373) about the
onode->extent->blob->min_alloc->pextent chain re-using space for small
writes, and I expected that behaviour across RADOS in general. Is my
assumption simply wrong, or does it just not apply to S3 (maybe because
the objects are immutable)?

2. We have a cluster that was upgraded to Pacific, but its OSDs were
created on Octopus and were upgraded in place rather than re-deployed. We
are hesitant to re-deploy them with bluestore_min_alloc_size_hdd = 4 KB
because the smaller allocation unit means more blocks and therefore more
read/write operations, which might hurt I/O performance (a rough sketch of
the trade-off I have in mind is below). Do you have any views on how this
would affect our cluster?

Many thanks
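P.S. To make the concern in (2) concrete, here is a toy model of the
space-amplification side of the trade-off. The 2+1 EC profile and the
sample object sizes are assumptions for illustration only; the model
ignores EC stripe padding and BlueStore blob details, and it says nothing
about the metadata/IOPS cost of the smaller allocation unit, which is the
part we are actually unsure about:

  # Toy model of space amplification at the two allocation sizes we are
  # weighing (64 KiB on the legacy OSDs vs 4 KiB after a re-deploy).
  # Assumptions for illustration only: a 2+1 EC profile and three sample
  # object sizes; EC stripe padding and BlueStore blob details are ignored.
  KiB = 2**10

  def alloc_round_up(size, min_alloc):
      # Bytes a shard of `size` bytes occupies after rounding up to the
      # allocation unit (ceiling division).
      return -(-size // min_alloc) * min_alloc

  def amplification(obj_size, min_alloc, k=2, m=1):
      shard = -(-obj_size // k)                  # approximate bytes per data shard
      on_disk = (k + m) * alloc_round_up(shard, min_alloc)
      return on_disk / obj_size

  for min_alloc in (64 * KiB, 4 * KiB):
      for obj_size in (1 * KiB, 64 * KiB, 4096 * KiB):
          print("min_alloc=%2d KiB, object=%5d KiB -> amplification %.2fx"
                % (min_alloc // KiB, obj_size // KiB,
                   amplification(obj_size, min_alloc)))

At 4 KiB the small-object waste largely disappears, but a large shard is
split across 16x more allocation units, which is the read/write overhead
we are asking about.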