Igor Fedotov wrote:
> Hi Motahare,
>
> On 13/11/2023 14:44, Motahare S wrote:
> > Hello everyone,
> >
> > Recently we have noticed that the "stored" and "used" space reported by
> > "ceph df" do not match; the stored amount * 1.5 (the EC factor) is still
> > about 5 TiB short of the used amount:
> >
> > POOL                      ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> > default.rgw.buckets.data  12  1024  144 TiB  70.60M   221 TiB  18.68  643 TiB
> >
> > The blob and alloc configs are as follows:
> > bluestore_min_alloc_size_hdd : 65536
> > bluestore_min_alloc_size_ssd : 4096
> > bluestore_max_blob_size_hdd : 524288
> > bluestore_max_blob_size_ssd : 65536
> > bluefs_shared_alloc_size : 65536
> >
> > From sources across the web about how Ceph actually writes to disk, I
> > presumed that it zero-pads the extents of an object to match the 4 KB
> > bdev_block_size and then writes them into a blob that matches
> > min_alloc_size, but that it can later re-use the unwritten (yet
> > allocated, because of min_alloc_size) parts of the blob for another
> > extent. The problem, though, is that we tested different configs in a
> > minimal Ceph Octopus cluster with a 2 GB OSD and
> > bluestore_min_alloc_size_hdd = 65536. When we uploaded a 1 KB file with
> > the aws s3 client, used/stored was 64 KB/1 KB. We uploaded another 1 KB
> > file and it went to 128 KB/2 KB; we kept going until 100% of the pool
> > was used but only 32 MB was stored. I expected Ceph to start writing new
> > 1 KB files into the wasted 63 KB (60 KB) tails of the min_alloc_size
> > blocks, but the cluster behaved as a completely full cluster and could
> > no longer accept any new object. Is this behaviour expected for S3? Does
> > Ceph really use 64x the space if your dataset is made of 1 KB files, and
> > should all your object sizes be a multiple of 64 KB? Note that
> > 5 TB / (70.6M * 1.5) ~ 50, so about 50 KB is wasted per rados object on
> > average. We didn't observe this problem in RBD pools, probably because
> > RBD cuts everything into 4 MB objects.
>
> The above analysis is correct; indeed BlueStore will waste up to 64K for
> every object not aligned to 64K (i.e. both 1K and 65K objects will waste
> 63K).
>
> Hence n*1K objects take n*64K bytes.
>
> And since S3 objects are unaligned, they tend to waste 32K bytes on
> average per object (assuming their sizes are distributed evenly).
>
> The only correction to the above math would be due to the actual m+n EC
> layout. E.g. for 2+1 EC the object count multiplier would be 3, not 1.5.
> Hence the overhead per rados object is rather less than 50K in your case.
>
> > I know that min_alloc_hdd was changed to 4 KB in Pacific, but I'm still
> > curious how allocation really works and why it doesn't behave as
> > expected. Also, re-deploying OSDs is a headache.
> >
> > Sincerely,
> > Motahare

Thank you Igor,

Yeah, the ~25 KB waste per rados object seems reasonable.
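Just to double-check that number against the ceph df output above, this is
the back-of-the-envelope sketch I used. The 2+1 profile is taken from
Igor's example and is an assumption on my part, as is the even
distribution of object sizes modulo min_alloc_size:

  # Rough sanity check of the ~25 KB figure. Assumed (not confirmed for
  # this cluster): a 2+1 EC profile, per Igor's example, and object sizes
  # evenly distributed modulo min_alloc_size.
  TiB = 2**40
  KiB = 2**10

  stored    = 144 * TiB     # STORED from the ceph df output above
  used      = 221 * TiB     # USED
  objects   = 70.60e6       # OBJECTS
  k, m      = 2, 1          # assumed EC profile (space factor 1.5)
  min_alloc = 64 * KiB      # bluestore_min_alloc_size_hdd

  overhead = used - stored * (k + m) / k   # space not explained by EC alone, ~5 TiB
  shards   = objects * (k + m)             # each rados object stores k+m shards

  print("total overhead:           %.1f TiB" % (overhead / TiB))
  print("observed waste per shard: %.1f KiB" % (overhead / shards / KiB))
  print("expected waste per shard: %.1f KiB" % (min_alloc / 2 / KiB))

That lands at roughly 25 KiB observed versus the ~32 KiB one would expect
per shard for purely unaligned sizes, so it is in the same ballpark as
what you describe.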
A couple of questions, though:

1. Is the whole flow of blobs re-using already-allocated space (the empty
sub-sections of blocks that were already rounded up to min_alloc_size)
limited to RBD/CephFS? I have read some blog posts (e.g.
https://blog.51cto.com/u_15265005/2888373) about the
onode->extent->blob->min_alloc->pextent chain re-using space for small
writes, and I expected that behaviour across RADOS in general. Is my
assumption simply wrong, or does it just not apply to S3 (maybe because
the objects are immutable)?

2. We have a cluster that was upgraded to Pacific, but its OSDs were
created on Octopus and were upgraded in place rather than re-deployed. We
are hesitant to re-deploy them with bluestore_min_alloc_size_hdd = 4 KB
because the smaller allocation unit means more blocks and therefore more
read/write operations, which might hurt I/O performance (a rough sketch of
the trade-off I have in mind is below). Do you have any views on how this
would affect our cluster?

Many thanks
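P.S. To make the concern in (2) concrete, here is a toy model of the
space-amplification side of the trade-off. The 2+1 EC profile and the
sample object sizes are assumptions for illustration only; the model
ignores EC stripe padding and BlueStore blob details, and it says nothing
about the metadata/IOPS cost of the smaller allocation unit, which is the
part we are actually unsure about:

  # Toy model of space amplification at the two allocation sizes we are
  # weighing (64 KiB on the legacy OSDs vs 4 KiB after a re-deploy).
  # Assumptions for illustration only: a 2+1 EC profile and three sample
  # object sizes; EC stripe padding and BlueStore blob details are ignored.
  KiB = 2**10

  def alloc_round_up(size, min_alloc):
      # Bytes a shard of `size` bytes occupies after rounding up to the
      # allocation unit (ceiling division).
      return -(-size // min_alloc) * min_alloc

  def amplification(obj_size, min_alloc, k=2, m=1):
      shard = -(-obj_size // k)                  # approximate bytes per data shard
      on_disk = (k + m) * alloc_round_up(shard, min_alloc)
      return on_disk / obj_size

  for min_alloc in (64 * KiB, 4 * KiB):
      for obj_size in (1 * KiB, 64 * KiB, 4096 * KiB):
          print("min_alloc=%2d KiB, object=%5d KiB -> amplification %.2fx"
                % (min_alloc // KiB, obj_size // KiB,
                   amplification(obj_size, min_alloc)))

At 4 KiB the small-object waste largely disappears, but a large shard is
split across 16x more allocation units, which is the read/write overhead
we are asking about.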