Re: OSD read latency grows over time

Hi Cory,


Thanks for the excellent information here!  I'm super curious how much the kv cache is using in this case.  If you happen to have a dump from the perf counters that includes the prioritycache subsystem, that would be ideal.  By default, onode (meta) and rocksdb (kv, excluding onodes stored in rocksdb) each get a first shot at 45% of the available cache memory at high priority, but how much they actually request depends on the relative ages of the items in each cache.  The age bins are defined in seconds.  By default:


kv: "1 2 6 24 120 720 0 0 0 0"

kv_onode: "0 0 0 0 0 0 0 0 0 720"

meta: "1 2 6 24 120 720 0 0 0 0"

data: "1 2 6 24 120 720 0 0 0 0"


and the ratios:

kv: 45%

kv_onode: 4%

meta: 45%

data: 6% (implicit)


This means that items from the kv, meta, and data caches that are less than 1 second old will all be competing with each other for memory during the first round.  The kv and meta caches can each get up to 45% of the available memory, while kv_onode and data get up to 4% and 6% respectively.  Since kv_onode doesn't actually compete at the first priority level, though, it won't request any memory in that round.  Whatever memory is left after the first round (assuming there is any) is divided up, based on the ratios, among the remaining caches that are still requesting memory, until either there are no more requests or no memory left.  After that, the PriorityCache proceeds to the next round and does the same thing, this time for cache items that are between 1 and 2 seconds old, then between 2 and 6 seconds old, etc.

This approach lets us have different caches compete at different intervals.  For instance, we could have the first age bin be 0-1 seconds for onodes but 0-5 seconds for kv.  We could also make the ratios different, i.e. the first bin might be for onodes that are 0-1 seconds old, but we give them a first shot at 60% of the memory.  kv entries that are 0-5 seconds old might all be put in the first priority bin with the 0-1 second onodes, but we could give them, say, only a 30% initial shot at the available memory (they would still all be cached with higher priority than onodes that are 1-2 seconds old).

Ultimately, we might find that there are better defaults for the bins and ratios when the index gets big.  Typically, though, we really want to cache onodes, so if we are seeing that the kv cache is fully utilizing its default ratio, increasing the amount of memory may indeed be warranted.
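
If it helps, something along these lines should pull just those counters out of a perf dump (the osd id is a placeholder and the exact counter layout can vary a bit between releases, so treat it as a sketch):

   # dump the perf counters from one of the affected OSDs and keep only the
   # prioritycache sections (overall cache state plus the per-cache breakdowns)
   ceph tell osd.0 perf dump | jq 'with_entries(select(.key | startswith("prioritycache")))'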


Mark


On 2/2/24 12:50, Cory Snyder wrote:
We've seen issues with high index OSD latencies in multiple scenarios over the past couple of years. The issues related to rocksdb tombstones could certainly be relevant, but compact on deletion has been very effective for us in that regard. Recently, we experienced a similar issue at a higher level with the RGW bucket index deletion markers on versioned buckets. Do you happen to have versioned buckets in your cluster? If you do and the clients of those buckets are doing a bunch of deletes that leave behind S3 delete markers, the CLS code may be doing a lot of work to filter relevant entries during bucket listing ops.

Another thing that we've found is that rocksdb can become quite slow if it doesn't have enough memory for its internal caches. As our cluster usage has grown, we've needed to increase OSD memory in accordance with bucket index pool usage. On one cluster, we found that increasing OSD memory improved rocksdb latencies by over 10x.
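
In case it's useful, the change itself is just a config bump; something like the following (the value and the class:nvme mask are only examples, pick whatever matches your index OSDs):

   # give the bucket index OSDs more memory, e.g. ~8 GiB each
   ceph config set osd/class:nvme osd_memory_target 8589934592
   # or target a single OSD while experimenting
   ceph config set osd.12 osd_memory_target 8589934592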

Hope this helps!

Cory Snyder


From: Tobias Urdin <tobias.urdin@xxxxxxxxxx>
Sent: Friday, February 2, 2024 5:41 AM
To: ceph-users <ceph-users@xxxxxxx>
Subject:  Re: OSD read latency grows over time
I found the internal note I made about it, see below.

	When we trim thousands of OMAP keys in RocksDB this calls SingleDelete() in the RocksDBStore in Ceph, which creates tombstones in
	the RocksDB database.

	These thousands of tombstones each need to be iterated over when, for example, reading data from the database, which causes the latency
	to become very high.  If the OSD is restarted the issue disappears; I assume this is because RocksDB or the RocksDBStore in Ceph creates
	a new iterator or does compaction internally upon startup.

	I don't see any straightforward solution without rebuilding internal logic in the usage trim code. More specifically, that would mean changing
	the usage trim code to use `cls_cxx_map_remove_range()`, which calls `RocksDBStore::RocksDBTransactionImpl::rm_range_keys()` internally,
	when doing a usage trim for an epoch (--start-date and --end-date only, and no user or bucket).

	The problem there, though, is that the `rocksdb_delete_range_threshold` config option defaults to 1_M, which is far more than the amount we are
	deleting and still causing issues; only above that threshold does the function call `DeleteRange()` instead of `SingleDelete()` in RocksDB, which
	would create one tombstone for all entries instead of one tombstone for every single OMAP key.
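
	As a side note, the effective value on a running OSD can be checked over the admin socket; the osd id below is just an example:

	# on the host running the OSD, show the effective threshold
	ceph daemon osd.0 config get rocksdb_delete_range_threshold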

	Even better than the above would be calling `rmkeys_by_prefix()` and not having to specify start and end, but there is no OSD op in PrimaryLogPG
	for that, which means even more work that might not be backportable.

	Our best bet right now, without touching radosgw-admin, is upgrading to >= 16.2.14, which introduces https://github.com/ceph/ceph/pull/50894 and
	will do compaction if a threshold of tombstones is hit within a sliding window during iteration.

Best regards

On 2 Feb 2024, at 11:29, Tobias Urdin <tobias.urdin@xxxxxxxxxx> wrote:

Chiming in here, just so that it’s indexed in the archives.

We’ve had a lot of issues with tombstones when running RGW usage logging: when we
trim those, the Ceph OSD hosting that usage.X object basically has its performance killed
by the sheer number of tombstones; restarting the OSD solves it.

We are not yet on Quincy, but when we are we will look into tuning rocksdb_cf_compact_on_deletion_trigger
so that we don’t have to locate the objects, trim, and restart OSDs every time we want to clean them.
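
In the meantime, a manual compaction of the affected OSD should clear the tombstones without a full
restart; something like the following (the osd id is a placeholder):

   # ask the OSD to compact its RocksDB store online
   ceph tell osd.0 compact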

Unfortunately the message on Ceph Slack is lost, since it was a while back that I wrote up more details
on that investigation, but IIRC the issue is that "radosgw-admin usage trim" does SingleDelete() in the RocksDB layer
when deleting objects that could be bulk deleted (DeleteRange?) due to them having the same prefix (name + date).

Best regards

On 26 Jan 2024, at 23:18, Mark Nelson <mark.a.nelson@xxxxxxxxx> wrote:

On 1/26/24 11:26, Roman Pashin wrote:

Unfortunately they cannot. You'll want to set them in centralized conf
and then restart OSDs for them to take effect.

Got it. Thank you, Josh! Will put it in the config of the affected OSDs and restart
them.

Just curious, can decreasing rocksdb_cf_compact_on_deletion_trigger from 16384 to
4096 hurt the performance of HDD OSDs in any way? I see no growing latency on the
HDD OSDs, where the data is stored, but it would be easier to set it in the [osd]
section for all OSDs at once rather than cherry-picking only the SSD/NVMe OSDs.

Potentially, if you set the trigger too low you could force constant compactions, say if you set it to trigger a compaction every time a tombstone is encountered.  You really want to find the sweet spot where iterating over tombstones (possibly multiple times) is more expensive than doing a compaction.  The defaults are basically just tuned to avoid the worst case scenario where OSDs become laggy or even go into heartbeat timeout (and we're not 100% sure we got those right).  I believe we've got a couple of big users that tune it more aggressively, though I'll let them speak up if they are able.
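
For what it's worth, if your SSD/NVMe OSDs share a CRUSH device class you could scope the change with a config mask instead of the plain [osd] section; something like the following (the class name and value are just examples, and as Josh noted the OSDs still need a restart to pick it up):

   # only OSDs whose device class is "ssd" get the lower trigger
   ceph config set osd/class:ssd rocksdb_cf_compact_on_deletion_trigger 4096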


Mark


--
Thank you,
Roman
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx

--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nelson@xxxxxxxxx

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



