Re: OSD read latency grows over time

Hi Roman,

The fact that changing the pg_num for the index pool drops the latency back down might be a clue.  Do you have a lot of deletes happening on this cluster?  If you have a lot of deletes and long pauses between writes, you could be accumulating tombstones that you have to keep iterating over during bucket listing.  Those get cleaned up during compaction.  If there are no writes, you might not be compacting the tombstones away enough.  Just a theory, but when you rearrange the PG counts, Ceph does a bunch of writes to move the data around, triggering compaction, and deleting the tombstones.
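One quick way to test that theory is to compact one of the index-pool OSDs by hand and see whether its read latency falls the same way it does after the pg_num change. A rough sketch (osd.12 is just a placeholder id, the perf counter name assumed here is the usual op_r_latency, and the daemon command has to run on the host that OSD lives on):

  # trigger a manual RocksDB compaction on one index-pool OSD
  ceph tell osd.12 compact

  # compare that OSD's average read latency before and after
  ceph daemon osd.12 perf dump | grep -A 3 '"op_r_latency"'

If the latency resets after a manual compaction the same way it does after the PG shuffle, that would point pretty strongly at tombstones.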

In v17.2.7 we enabled a feature that automatically performs a compaction if too many tombstones are present during iteration in RocksDB.  It might be worth upgrading to see if it helps (you might have to try tweaking the settings if the defaults aren't helping enough).  The PR is here:

https://github.com/ceph/ceph/pull/50893
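For reference, the knobs that feature exposes look roughly like the commands below. I'm writing the option names and values from memory as an illustration only, so please verify them with `ceph config help` on a v17.2.7+ cluster before setting anything:

  # assumed option names for tombstone-triggered compaction (v17.2.7+);
  # the numeric values are examples, not necessarily the defaults
  ceph config set osd rocksdb_cf_compact_on_deletion true
  # size of the sliding window of keys examined during iteration, and how
  # many tombstones inside that window trigger a compaction
  ceph config set osd rocksdb_cf_compact_on_deletion_sliding_window 32768
  ceph config set osd rocksdb_cf_compact_on_deletion_trigger 16384

A smaller trigger (or a larger window) makes the automatic compaction fire more aggressively.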


Mark


On 1/16/24 04:22, Roman Pashin wrote:
Hello Ceph users,

we are seeing a strange issue on a recent Ceph installation, v17.2.6. We store
data on an HDD pool; the index pool is on SSD. Each OSD stores its WAL on an
NVMe partition. Benchmarks didn't expose any issues with the cluster, but since
we placed production load on it we see constantly growing OSD read latency
(osd_read_latency) on the SSD disks (where the index pool is located). Latency
grows day by day, yet the disks are not even 50% utilized.
Interestingly, when we move the index pool from SSD to NVMe disks (disk space
allows it for now), OSD latency drops to zero and starts increasing again from
the ground up. We also noticed that any change of pg_num for the index pool
(from 256 to 128, for instance) drops latency to zero as well, and then it
starts growing again (https://postimg.cc/5YHk9bby).
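To be explicit, the change that resets the latency is nothing more exotic than the following (the pool name here is just the default RGW index pool name; yours may differ):

  # current pg_num of the RGW index pool
  ceph osd pool get default.rgw.buckets.index pg_num
  # shrink (or grow) it -- either direction drops the OSD read latency to zero
  ceph osd pool set default.rgw.buckets.index pg_num 128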

From the client's perspective it looks like each operation takes longer and
longer every day, and operation time drops whenever we make some change to the
index pool. I've enabled debug_optracker 10/0 and it shows that the OSD spends
most of its time in the `queued_for_pg` state, even though physical disk
utilization is only about 10-20%. The logs also show that the longest operation
is ListBucket, and it is strange that listing a bucket with fewer than 100,000
items takes 3-40 seconds even with 'max_keys=1'.
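The slow listing is easy to reproduce outside of Flink as well, for example with the AWS CLI pointed at RGW (the endpoint and bucket name below are placeholders):

  # even asking for a single key takes seconds to tens of seconds
  time aws --endpoint-url https://rgw.example.internal s3api list-objects-v2 \
      --bucket flink-checkpoints --max-keys 1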

If it matters, the client is Apache Flink doing checkpoints via the S3 protocol.
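In case it helps anyone reproduce this, the optracker data below can be gathered roughly like this (osd.12 stands in for one of the index-pool OSD ids):

  # raise optracker logging on one OSD (log level 10, memory level 0)
  ceph tell osd.12 config set debug_optracker 10/0
  # recent slow ops with their per-event timestamps (run on the OSD's host)
  ceph daemon osd.12 dump_historic_ops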

Here is an example of one operation from the debug_optracker logs:

2023-12-29T16:24:28.873353+0300, event: throttled, op: osd_op(client.1227774.0:22575820 7.19 7.a84f1a59 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.861549+0300, event: header_read, op: osd_op(client.1227774.0:22575820 7.19 7.a84f1a59 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873358+0300, event: all_read, op: osd_op(client.1227774.0:22575820 7.19 7.a84f1a59 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873359+0300, event: dispatched, op: osd_op(client.1227774.0:22575820 7.19 7.a84f1a59 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:28.873389+0300, event: queued_for_pg, op: osd_op(client.1227774.0:22575820 7.19 7.a84f1a59 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077528+0300, event: reached_pg, op: osd_op(client.1227774.0:22575820 7.19 7.a84f1a59 (undecoded) ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077561+0300, event: started, op: osd_op(client.1227774.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.077714+0300, event: waiting for subops from 59,494, op: osd_op(client.1227774.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.146157+0300, event: sub_op_commit_rec, op: osd_op(client.1227774.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.146166+0300, event: op_commit, op: osd_op(client.1227774.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.146191+0300, event: sub_op_commit_rec, op: osd_op(client.1227774.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.146204+0300, event: commit_sent, op: osd_op(client.1227774.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e83833)
2023-12-29T16:24:38.146216+0300, event: done, op: osd_op(client.1227774.0:22575820 7.19 7:9a58f215:::.dir.68960da3-1c98-45c1-a87a-9e6c39253d27.9780498.1.83:head [stat,call rgw.guard_bucket_resharding in=36b,call rgw.bucket_prepare_op in=331b] snapc 0=[] ondisk+write+known_if_redirected+supports_pool_eio e83833)

Has anybody faced the same issue? I'd be very grateful for any ideas, because
at this point I'm stuck on what to tune and where to look.

Cluster setup: 3x replication, 15 servers in 3 datacenters with datacenter as
the failure domain; 7x HDD (data), 2x SSD (index), 1x NVMe (WAL + OS).
ceph config - https://pastebin.com/pCqxXhT3
OSD read latency graph - https://postimg.cc/5YHk9bby

--
Best Regards,
Mark Nelson
Head of Research and Development

Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nelson@xxxxxxxxx

We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx



