Re: RGW Pool uses way more space than it should be

>> 
>> Which version of Ceph was this deployed on? Did you use the default
>> value for bluestore_min_alloc_size(_hdd)? If it's before Pacific and
>> you used the default, then the min alloc size is 64KiB for HDDs, which
>> could be causing quite a bit of usage inflation depending on the sizes
>> of objects involved.
>> 
> 
> Is it recommended that if you have a pre-pacific cluster you change this now before upgrading?

It’s baked into a given OSD at creation time.  Changing the setting after the fact should have no effect unless you rebuild the affected OSDs; only OSDs created after the change will pick up the new value.

As noted above, RGW can see significant space amplification when a large fraction of the objects it stores is relatively small.

This sheet quantifies and visualizes this phenomenon nicely: 

https://docs.google.com/spreadsheets/d/1rpGfScgG-GLoIGMJWDixEkqs-On9w8nAUToPQjN8bDI

If your OSDs were deployed with bluestore_min_alloc_size=16KB, S3/Swift objects that aren’t roughly an even multiple of 16KB in size will strand unused space.  Think of the remainder in a modulus operation.  E.g., if you write a 1KB object, BlueStore will allocate 16KB and you’ll waste 15KB.  If you write a 15KB object, the wasted percentage is much lower.  If you write a 17KB object, BlueStore allocates two 16KB units (32KB) and again strands 15KB, but since you’ve also stored a full 17KB of data the _percentage_ of stranded space is lower than in the 1KB case.  This rapidly becomes insignificant as S3 object size increases.
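To put rough numbers on the above, here’s a quick back-of-the-envelope sketch (plain Python, nothing cluster-specific; the 16KB min_alloc_size and the object sizes are just the examples from the previous paragraph):

def stranded_bytes(obj_size, min_alloc=16 * 1024):
    # BlueStore allocates whole min_alloc_size units, so the stranded space
    # is whatever is left unused in the last unit.
    remainder = obj_size % min_alloc
    return 0 if remainder == 0 else min_alloc - remainder

for size in (1 * 1024, 15 * 1024, 16 * 1024, 17 * 1024, 1024 * 1024):
    waste = stranded_bytes(size)
    allocated = size + waste
    print(f"{size:>8} B object -> {allocated:>8} B allocated, "
          f"{waste:>6} B stranded ({100 * waste / allocated:.0f}%)")

That prints roughly 94% stranded for the 1KB case, 6% for 15KB, 0% for 16KB, and 47% for 17KB.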

Note that this is multiplied by replication.  With 3R, the total stranded space will be 3x the remainder.  With EC, depending on K and M, the total is potentially much larger, since the client object is sharded over a larger number of RADOS objects and thus OSDs.
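Again just a sketch, extending the function above (the 3-replica and 4+2 EC profiles are only illustrative, and this ignores EC stripe_unit padding and RGW head/tail striping, so treat the numbers as a rough lower bound):

def stranded_replicated(obj_size, replicas=3, min_alloc=16 * 1024):
    # Each replica strands the same remainder on its own OSD.
    r = obj_size % min_alloc
    return replicas * ((min_alloc - r) % min_alloc)

def stranded_ec(obj_size, k=4, m=2, min_alloc=16 * 1024):
    # The object is sharded into k data + m coding shards, one per OSD,
    # and each shard gets rounded up to min_alloc_size locally.
    shard = -(-obj_size // k)  # ceiling division
    r = shard % min_alloc
    return (k + m) * ((min_alloc - r) % min_alloc)

obj = 1 * 1024  # the 1KB example above
print("3R :", stranded_replicated(obj), "bytes stranded")  # 3 x 15KB
print("4+2:", stranded_ec(obj), "bytes stranded")          # 6 x 15.75KB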

There is a doc PR already in progress that explains this phenomenon.

If your population / distribution of objects is rich in relatively small objects, you can reclaim space by iteratively destroying and redeploying OSDs that were created with the larger value.
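If you go down that road, something like this sketch can help identify which OSDs still carry the old value.  It assumes your release exposes bluestore_min_alloc_size in `ceph osd metadata` (recent releases do), and the 4KB target is just an example:

import json
import subprocess

TARGET = 4 * 1024  # the min_alloc_size you want rebuilt OSDs to end up with

# `ceph osd metadata` dumps per-OSD metadata; the values come back as strings.
raw = subprocess.check_output(["ceph", "osd", "metadata", "--format", "json"])
for osd in json.loads(raw):
    alloc = osd.get("bluestore_min_alloc_size")
    if alloc is None:
        print(f"osd.{osd['id']}: min_alloc_size not reported by this release")
    elif int(alloc) > TARGET:
        print(f"osd.{osd['id']}: min_alloc_size={alloc} -> rebuild candidate")

Remember that a rebuilt OSD only picks up bluestore_min_alloc_size(_hdd/_ssd) at creation time, so set the value you want before redeploying.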

RBD volumes are striped into RADOS objects (4MB by default) that are much larger than any min_alloc_size value, so this phenomenon is generally not significant for RBD pools.


Other factors that may be at play here:

* Your OSDs at 600MB are small by Ceph standards; we’ve seen in the past that this can result in a relatively large ratio of overhead to raw / payload capacity.

* ISTR having read that versioned objects / buckets and resharding operations can in some situations leave orphaned RADOS objects behind.

* If you’ve ever run `rados bench` against any of your pools, there may be a bunch of leftover RADOS objects lying around taking up space.  By default something like `rados ls -p mypool | egrep '^bench.*$'` will show these.  Note that this may take a long time to run, and if the `rados bench` invocation specified a non-default job name the pattern may be different.

— aad

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



