Re: RGW Pool uses way more space than it should be

Thanks for all this information.

We are running version 16.2.7, but we also had this issue before upgrading to Pacific.

We are using the default value for bluestore_min_alloc_size(_hdd) and are currently redeploying every OSD with a new 4 TB HDD.
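
For anyone who wants to double-check the setting, something along these lines should show the configured default (osd.0 is just a placeholder ID, and the daemon command has to run on that OSD's host). As noted further down, the value that actually applies is the one baked in when the OSD was created, so OSDs built before the Pacific upgrade keep 64 KiB until they are rebuilt:

    ceph config get osd bluestore_min_alloc_size_hdd
    ceph daemon osd.0 config get bluestore_min_alloc_size_hdd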

The confusing part is that the space was already used before we actively started using the S3 service.

> * If you've ever run `rados bench` against any of your pools, there may be a bunch of leftover RADOS objects lying around taking up space.  By default something like `rados ls -p mypool | egrep '^bench.*$'` will show these.  Note that this may take a long time to run, and if the `rados bench` invocation specified a non-default job name the pattern may be different.


I did run rados bench in the past, but I cannot find any leftovers. Back then I changed many things while playing around with the cluster.
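
For reference, a small loop along these lines scans every pool for the pattern suggested above (assuming the default benchmark_data_* object prefix that rados bench uses; a custom run name would need a different pattern):

    for p in $(ceph osd pool ls); do
        echo "== $p =="
        rados ls -p "$p" | egrep '^bench' | head
    done

Anything found that way should be removable with `rados -p <pool> cleanup` (optionally with --prefix).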

Wouldn't all of the issues described show up as pool usage in `ceph df`? As of now I have 20 TiB used, but all pools combined account for only a little more than 16 TiB.
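
For reference, this is the kind of gap I am looking at; something like the following should help narrow down where it sits (column names as of Pacific):

    ceph df detail     # raw storage section vs. per-pool STORED/USED
    ceph osd df tree   # per-OSD DATA, OMAP and META columns

The META column covers BlueStore metadata (RocksDB DB/WAL), which shows up in the raw usage but is not attributed to any pool.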

Thanks,

Hendrik

> On 10. Apr 2022, at 10:17, Anthony D'Atri <anthony.datri@xxxxxxxxx> wrote:
> 
>>> 
>>> Which version of Ceph was this deployed on? Did you use the default
>>> value for bluestore_min_alloc_size(_hdd)? If it's before Pacific and
>>> you used the default, then the min alloc size is 64KiB for HDDs, which
>>> could be causing quite a bit of usage inflation depending on the sizes
>>> of objects involved.
>>> 
>> 
>> Is it recommended that, if you have a pre-Pacific cluster, you change this now before upgrading?
> 
> It's baked into a given OSD at creation time.  Changing it after the fact should have no effect unless you rebuild the affected OSDs.
> 
> As noted above, RGW can see significant space amplification when a large fraction of the stored objects are relatively small.
> 
> This sheet quantifies and visualizes this phenomenon nicely: 
> 
> https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI
> 
> If your OSDs were deployed with bluestore_min_alloc_size=16KB, S3/Swift objects that aren't an even multiple of 16KB in size will allocate unused space.  Think of the remainder in a modulus operation.  E.g., if you write a 1KB object, BlueStore will allocate 16KB and you'll waste 15KB.  If you write a 15KB object, the percentage wasted is much lower.  If you write a 17KB object, two allocation units are used: 17 mod 16 leaves a 1KB remainder in the second unit, so 15KB is again stranded, but since you've also stored a full 17KB of payload the _percentage_ of stranded space is lower.  This rapidly becomes insignificant as S3 object size increases.
> 
> Note that this is multiplied by replication.  With 3R, the total stranded space will be 3x the per-copy waste.  With EC, depending on K and M, the total is potentially much larger, since the client object is sharded over a larger number of RADOS objects and thus OSDs.
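>
> To put rough numbers on that, using the 16KB min_alloc_size example and 3R from above:
>
>     1KB object:  16KB allocated per copy -> 15KB stranded x 3 copies = 45KB stranded for 1KB of data
>     17KB object: 32KB allocated per copy -> 15KB stranded x 3 copies = 45KB stranded for 17KB of data
>     4MB object:  at most 15KB stranded per copy, negligible relative to the payload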
> 
> There is a doc PR already in progress that explains this phenomenon.
> 
> If your population / distribution of objects is rich in relatively small objects, you can reclaim space by iteratively destroying and redeploying OSDs that were created with the larger value.
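>
> A rough sketch of that cycle for a single OSD, assuming a ceph-volume based deployment (osd id 12 and /dev/sdX are placeholders; under cephadm, `ceph orch osd rm` with --replace does roughly the same):
>
>     ceph osd out 12
>     while ! ceph osd safe-to-destroy osd.12; do sleep 60; done
>     systemctl stop ceph-osd@12                          # on the OSD's host
>     ceph osd destroy 12 --yes-i-really-mean-it
>     ceph-volume lvm zap --destroy /dev/sdX              # on the OSD's host
>     ceph-volume lvm create --osd-id 12 --data /dev/sdX  # rebuilt with the current min_alloc_size default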
> 
> The RADOS objects backing RBD volumes are much larger than min_alloc_size (RBD uses 4MB objects by default), so this phenomenon is generally not significant for RBD pools.
> 
> 
> Other factors that may be at play here:
> 
> * Your OSDs at 600MB are small by Ceph standards; we've seen in the past that this can result in a relatively large ratio of overhead to raw / payload capacity.
> 
> * ISTR having read that versioned objects / buckets and resharding operations can in some situations leave orphaned RADOS objects behind.
> 
> * If you've ever run `rados bench` against any of your pools, there may be a bunch of leftover RADOS objects lying around taking up space.  By default something like `rados ls -p mypool | egrep '^bench.*$'` will show these.  Note that this may take a long time to run, and if the `rados bench` invocation specified a non-default job name the pattern may be different.
> 
> — aad
> 
> _______________________________________________
> ceph-users mailing list -- ceph-users@xxxxxxx
> To unsubscribe send an email to ceph-users-leave@xxxxxxx

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



