Re: OSD read latency grows over time

Tobias Urdin <tobias.urdin@xxxxxxxxxx> · Fri, 2 Feb 2024 11:41:25 +0100

I found the internal note I made about it, see below.

	When we trim thousands of OMAP keys, the RocksDBStore in Ceph calls SingleDelete() for each key, which leaves one tombstone per key in
	the RocksDB database.

	Each of these thousands of tombstones has to be iterated over when, for example, reading data from the database, which makes the latency
	very high. If the OSD is restarted the issue disappears; I assume this is because RocksDB or the RocksDBStore in Ceph creates
	a new iterator or does compaction internally upon startup.
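
To illustrate the effect described above, here is a minimal Python sketch. It is a toy model only, not RocksDB or Ceph code; the `ToyKV` class, its methods and the key names are made up for the example. It shows a seek having to step over every tombstone until a compaction removes them.

```python
import bisect


class ToyKV:
    """Toy sorted key space; a deleted key stays behind as a tombstone."""

    def __init__(self, keys):
        self.keys = sorted(keys)
        self.dead = set()  # keys hidden by a point tombstone

    def single_delete(self, key):
        # The entry is only marked as deleted, it is not removed yet.
        self.dead.add(key)

    def seek(self, start):
        """Return (first live key >= start, number of tombstones skipped)."""
        i = bisect.bisect_left(self.keys, start)
        skipped = 0
        while i < len(self.keys) and self.keys[i] in self.dead:
            skipped += 1
            i += 1
        live = self.keys[i] if i < len(self.keys) else None
        return live, skipped

    def compact(self):
        """Drop deleted keys together with their tombstones."""
        self.keys = [k for k in self.keys if k not in self.dead]
        self.dead.clear()


db = ToyKV(f"usage.{n:06d}" for n in range(10_000))
for n in range(9_000):
    db.single_delete(f"usage.{n:06d}")

before = db.seek("usage.")  # has to walk over every tombstone first
db.compact()
after = db.seek("usage.")   # lands on the first live key directly
```

In this toy the same seek does 9000 extra steps before the compaction and none after it, which matches the behaviour we see when restarting the OSD, assuming a compaction or fresh iterator is what the restart gives us.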

	I don't see any straightforward solution that avoids reworking the internal logic of the usage trim code. More specifically, that would mean changing
	the usage trim code to use `cls_cxx_map_remove_range()`, which would call `RocksDBStore::RocksDBTransactionImpl::rm_range_keys()` internally
	when doing a usage trim for an epoch (--start-date and --end-date only, and no user or bucket).

	The problem there, though, is that the `rocksdb_delete_range_threshold` config option defaults to 1_M, which is far more than the number of keys we are
	deleting when we hit the issue. Above that threshold the function calls `DeleteRange()` instead of `SingleDelete()` in RocksDB, which leaves one
	tombstone for all entries instead of one tombstone for every single OMAP key.
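
A rough Python sketch of how I read that threshold behaviour follows. It is a simplified model and not the Ceph implementation: the function below only borrows the name `rm_range_keys`, the operation tuples are made up, and the exact boundary (>= versus >) is an assumption.

```python
def rm_range_keys(keys, start, end, threshold):
    """Return the delete operations a range removal would issue.

    keys is a sorted list of existing keys; [start, end) is the range to
    remove. Below the threshold every key gets its own delete (and its
    own tombstone); at the threshold the per-key deletes are dropped in
    favour of one range delete.
    """
    ops = []
    for key in keys:
        if key < start:
            continue
        if key >= end:
            break
        ops.append(("delete", key))
        if len(ops) >= threshold:
            # Too many keys: one range tombstone replaces all of them.
            return [("delete_range", start, end)]
    return ops


keys = [f"k{n:04d}" for n in range(5000)]
default_ops = rm_range_keys(keys, "k0000", "k5000", threshold=1_000_000)
lowered_ops = rm_range_keys(keys, "k0000", "k5000", threshold=1000)
```

With the 1_M default the 5000 keys here still produce 5000 individual deletes; only with a threshold lower than the number of keys does the model collapse them into a single range delete.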

	Even better would be calling `rmkeys_by_prefix()` and not having to specify start and end, but there is no OSD op in PrimaryLogPG for that,
	which means even more work that might not be backportable.

	Our best bet right now, without touching radosgw-admin, is upgrading to >=16.2.14, which introduces https://github.com/ceph/ceph/pull/50894; that change
	triggers a compaction if a threshold of tombstones is hit within a sliding window during iteration.
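
The sliding-window idea can be sketched in Python as below. This is only an illustration of the counting, not the RocksDB or Ceph code: `should_compact` is a made-up name, the trigger values 16384 and 4096 are the ones discussed in this thread, and the window size of 32768 is an assumption.

```python
from collections import deque


def should_compact(entries, window, trigger):
    """Return True if any `window` consecutive entries hold >= `trigger` tombstones.

    entries is an iterable of booleans, True meaning the entry is a tombstone.
    """
    recent = deque()
    tombstones = 0
    for is_tombstone in entries:
        recent.append(is_tombstone)
        tombstones += is_tombstone
        if len(recent) > window:
            # Slide the window: forget the oldest entry.
            tombstones -= recent.popleft()
        if tombstones >= trigger:
            return True
    return False


# 5000 trimmed keys in a row, followed by live data.
dense = [True] * 5000 + [False] * 5000
hit_default = should_compact(dense, window=32768, trigger=16384)
hit_lowered = should_compact(dense, window=32768, trigger=4096)

# The same density of tombstones spread thinly does not trip the lower trigger.
sparse = [i % 10 == 0 for i in range(100_000)]
hit_sparse = should_compact(sparse, window=32768, trigger=4096)
```

In this model a trim of 5000 consecutive keys stays under a trigger of 16384 but trips a trigger of 4096, which is why tuning the trigger matters for trims of the size we do.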

Best regards

> On 2 Feb 2024, at 11:29, Tobias Urdin <tobias.urdin@xxxxxxxxxx> wrote:
> 
> Chiming in here, just so that it’s indexed in archives.
> 
> We’ve had a lot of issues with tombstones when running RGW usage logging: when we
> trim those logs, the many resulting tombstones basically kill the performance of the
> OSD hosting that usage.X object. Restarting the OSD solves it.
> 
> We are not yet on Quincy, but when we are we will look into tuning rocksdb_cf_compact_on_deletion_trigger
> so that we don’t have to locate the objects, trim, and restart OSDs every time we want to clean them.
> 
> Unfortunately the message on Ceph Slack where I wrote more details on that investigation is lost, since it was
> a while back, but IIRC the issue is that "radosgw-admin usage trim" does SingleDelete() in the RocksDB layer
> when deleting objects that could be bulk deleted (RangeDelete?) due to them having the same prefix (name + date).
> 
> Best regards
> 
>> On 26 Jan 2024, at 23:18, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:
>> 
>> On 1/26/24 11:26, Roman Pashin wrote:
>> 
>>>> Unfortunately they cannot. You'll want to set them in centralized conf
>>>> and then restart OSDs for them to take effect.
>>>> 
>>> Got it. Thank you Josh! Will put it in the config of affected OSDs and restart
>>> them.
>>> 
>>> Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger from 16384 to
>>> 4096 hurt performance of HDD OSDs in any way? I have no growing latency on the
>>> HDD OSDs, where data is stored, but it would be easier to set it in the [osd]
>>> section for all OSDs at once, without cherry-picking only the SSD/NVMe OSDs.
>> 
>> 
>> Potentially. If you set the trigger too low, you could force constant compactions, say if you set it to trigger compaction every time a tombstone is encountered.  You really want to find the sweet spot where iterating over tombstones (possibly multiple times) is more expensive than doing a compaction.  The defaults are basically just tuned to avoid the worst-case scenario where OSDs become laggy or even go into heartbeat timeout (and we're not 100% sure we got those right).  I believe we've got a couple of big users that tune it more aggressively, though I'll let them speak up if they are able.
>> 
>> 
>> Mark
>> 
>> 
>>> --
>>> Thank you,
>>> Roman
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@xxxxxxx
>>> To unsubscribe send an email to ceph-users-leave@xxxxxxx