On 3/4/24 08:40, Maged Mokhtar wrote:
On 04/03/2024 15:37, Frank Schilder wrote:
Fast write enabled would mean that the primary OSD sends #size copies to
the entire acting set (including itself) in parallel and sends an ACK to
the client as soon as min_size ACKs have been received from the peers
(including itself). In this way, one can tolerate (size - min_size)
slow(er) OSDs (slow for whatever reason) without suffering performance
penalties immediately (only after too many requests have piled up, which
will show up as a slow-requests warning).
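To make the acknowledgement policy concrete, here is a minimal sketch in
Python/asyncio (not actual OSD code; SIZE, MIN_SIZE, write_to_replica and
fast_write are made-up names for illustration): the primary dispatches the
write to the whole acting set in parallel and ACKs the client once
min_size replicas have committed.

import asyncio
import random

SIZE = 3        # replicas in the acting set, including the primary
MIN_SIZE = 2    # ACK the client once this many replicas have committed

async def write_to_replica(osd_id, data):
    # Stand-in for a replicated write; the randomized delay mimics one
    # OSD being temporarily slow (compaction, scrubbing, ...).
    await asyncio.sleep(random.uniform(0.001, 0.050))
    return osd_id

async def fast_write(data):
    # Dispatch to the whole acting set in parallel.
    pending = {asyncio.create_task(write_to_replica(osd, data))
               for osd in range(SIZE)}
    acked = 0
    while pending:
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED)
        acked += len(done)
        if acked >= MIN_SIZE:
            print("client ACK after %d/%d replica commits" % (acked, SIZE))
            break
    # The remaining writes still have to complete (or be repaired later);
    # the client just no longer waits for the slowest ones.
    await asyncio.gather(*pending)

asyncio.run(fast_write(b"object payload"))

The point of the sketch is only the acknowledgement condition; how the
remaining in-flight replicas are tracked and repaired on failure is
exactly the open question below.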
What happens if an error occurs on the slowest OSD after the min_size
ACK has already been sent to the client?
This should not be different from what exists today... unless of course
the error happens on the local/primary OSD.
Can this be addressed with reasonable effort? I don't expect this to be
a quick fix, and it should be tested. However, beating the tail-latency
statistics with the extra redundancy should be worth it.
I observe latency fluctuations: OSDs become randomly slow for whatever
reason for short time intervals and then return to normal. A reason for
this could be DB compaction; I think latency tends to spike during
compaction.
A fast-write option would effectively remove the impact of this.
Best regards and thanks for considering this!
I think this is something the RADOS devs need to weigh in on. It does
sound worth investigating. It is not just for cases with DB compaction;
more importantly, the normal (happy) IO path is where it will have the
most impact.
Typically, an L0->L1 compaction will have two primary effects:
1) It will cause large read/write IO traffic to the disk, potentially
impacting other IO taking place if the disk is already saturated.
2) It will block memtable flushes until the compaction finishes. This
means that more and more data will accumulate in the memtables/WAL which
can trigger throttling and eventually stalls if you run out of buffer
space. By default, we allow up to 1GB of writes to WAL/memtables before
writes are fully stalled, but RocksDB will typically throttle writes
before you get to that point. It's possible a larger buffer may allow
you to absorb traffic spikes for longer at the expense of more disk and
memory usage. Ultimately though, if you are hitting throttling, it
means that the DB can't keep up with the WAL ingestion rate.
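As an aside, the 1GB figure is consistent with the stock BlueStore
RocksDB tuning, assuming write_buffer_size=256MiB and
max_write_buffer_number=4 in bluestore_rocksdb_options (check your own
release; the exact defaults can vary). A quick back-of-the-envelope
sketch in Python of that arithmetic, parsing an options-style string:

def memtable_budget(opts):
    # Roughly the total bytes RocksDB buffers in memtables/WAL before
    # writes stall outright, given a bluestore_rocksdb_options-style
    # string (a simplification: L0 file count and pending compaction
    # bytes also factor into real stall decisions).
    kv = dict(item.split("=", 1) for item in opts.split(",") if "=" in item)
    write_buffer_size = int(kv.get("write_buffer_size", 64 * 1024 * 1024))
    max_buffers = int(kv.get("max_write_buffer_number", 2))
    return write_buffer_size * max_buffers

default_opts = "write_buffer_size=268435456,max_write_buffer_number=4"
bigger_opts = "write_buffer_size=268435456,max_write_buffer_number=8"

print(memtable_budget(default_opts) / 2**30)  # 1.0 GiB
print(memtable_budget(bigger_opts) / 2**30)   # 2.0 GiB

Raising these only buys time during a burst: as noted above, if you are
hitting throttling persistently, the DB simply can't keep up with the
WAL ingestion rate.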
Mark
--
Best Regards,
Mark Nelson
Head of Research and Development
Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: mark.nelson@xxxxxxxxx
We are hiring: https://www.clyso.com/jobs/
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx