Re: OSD read latency grows over time

Chiming in here, just so that it's indexed in the archives.

We've had a lot of issues with tombstones when running RGW usage logging: when we trim it,
the OSD hosting that usage.X object basically has its performance killed because there are
so many tombstones; restarting the OSD solves it.

We are not yet on Quincy, but when we are we will look into tuning rocksdb_cf_compact_on_deletion_trigger
so that we don't have to locate the objects, trim them, and restart OSDs every time we want to clean them up.
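
A rough sketch of what I'd expect that to look like on Quincy (the 4096 value just mirrors
what is discussed below and is not a tested recommendation, and osd.<id> is a placeholder):

  # set in the centralized config; takes effect after an OSD restart, as noted below
  ceph config set osd rocksdb_cf_compact_on_deletion_trigger 4096
  # restart the affected OSDs, e.g. per daemon via the orchestrator
  ceph orch daemon restart osd.<id>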

Unfortunately the message on Ceph Slack where I wrote up more details on that investigation
is lost, since it was a while back, but IIRC the issue is that "radosgw-admin usage trim" does SingleDelete() in the RocksDB layer
when deleting entries that could have been bulk deleted (DeleteRange?) because they share the same prefix (name + date).
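
For illustration, roughly the difference at the RocksDB API level. This is a standalone
sketch only, not RGW's actual code path; the key names, DB path, and date range are made up:

#include <rocksdb/db.h>

#include <cassert>
#include <string>
#include <vector>

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/usage-trim-demo", &db);
  assert(s.ok());

  // What trim appears to do (per my recollection above): one SingleDelete()
  // per usage entry, i.e. one tombstone per key, which later reads and
  // iterators have to step over until compaction removes them.
  std::vector<std::string> keys = {"usage.1_2024-01-01", "usage.1_2024-01-02"};
  for (const auto& k : keys) {
    db->SingleDelete(rocksdb::WriteOptions(), k);
  }

  // A range delete instead drops a single range tombstone covering every key
  // in the half-open range, so there is no per-entry tombstone pile-up.
  db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                  "usage.1_2024-01-01", "usage.1_2024-12-31");

  delete db;
  return 0;
}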

Best regards

> On 26 Jan 2024, at 23:18, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:
> 
> On 1/26/24 11:26, Roman Pashin wrote:
> 
>>> Unfortunately they cannot. You'll want to set them in centralized conf
>>> and then restart OSDs for them to take effect.
>>> 
>> Got it. Thank you Josh! Will put it in the config of the affected OSDs and restart
>> them.
>> 
>> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger from 16384 to
>> 4096 hurt the performance of HDD OSDs in any way? I see no growing latency on the
>> HDD OSDs, where the data is stored, but it would be easier to set it in the [osd]
>> section for all OSDs at once rather than cherry-picking only the SSD/NVMe ones.
> 
> 
> Potentially if you set the trigger too low, you could force constant compactions.  Say if you set it to trigger compaction every time a tombstone is encountered.  You really want to find the sweet spot where iterating over tombstones (possibly multiple times) is more expensive than doing a compaction.  The defaults are basically just tuned to avoid the worst case scenario where OSDs become laggy or even go into heartbeat timeout (and we're not 100% sure we got those right).  I believe we've got a couple of big users that tune it more aggressively, though I'll let them speak up if they are able.
> 
> 
> Mark
> 
> 
>> --
>> Thank you,
>> Roman
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



