Re: Performance improvement suggestion

Hi all, coming late to the party, but I want to chip in as well with some experience.

The problem of tail latencies of individual OSDs is a real pain for any redundant storage system. However, when large replication factors are in use, there is an elegant way to deal with it. The idea is to start from the "fast read" option that already exists for EC pools and:

1) make this option available to replicated pools as well (this is on the roadmap as far as I know), but also
2) implement an option "fast write" for all pool types.

With fast write enabled, the primary OSD would send the #size copies to the entire active set (including itself) in parallel and ACK to the client as soon as min_size ACKs have been received from the peers (including itself). In this way, one can tolerate (size-min_size) slow(er) OSDs (slow for whatever reason) without an immediate performance penalty (the penalty only shows up after too many requests have piled up, which manifests as a slow-requests warning).
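To make the proposal concrete, here is a minimal sketch of the ack policy in Python. This is not Ceph code; send_write, fast_write and the parameters are invented purely for illustration:

    import concurrent.futures
    import random
    import time

    # Sketch of the proposed "fast write" ack policy (all names made up):
    # the primary dispatches the write to every member of the active set in
    # parallel and acknowledges the client once min_size commits have arrived;
    # the remaining (size - min_size) copies are allowed to finish late.

    def send_write(osd_id, data):
        """Stand-in for replicating to one OSD and waiting for its commit."""
        time.sleep(random.uniform(0.01, 0.2))  # pretend some OSDs are slow
        return osd_id

    def fast_write(data, active_set, min_size):
        pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(active_set))
        futures = [pool.submit(send_write, osd, data) for osd in active_set]
        committed = []
        for fut in concurrent.futures.as_completed(futures):
            committed.append(fut.result())
            if len(committed) >= min_size:
                break                          # ack the client at this point
        pool.shutdown(wait=False)              # stragglers finish in the background
        return committed

    # size=4, min_size=2: up to two slow OSDs do not delay the client ack.
    print("acked after commits from OSDs:", fast_write(b"object A", [0, 1, 2, 3], 2))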

I have fast read enabled on all EC pools. This does increase the cluster-internal network traffic, which nowadays is absolutely no problem (in the good old 1G days it potentially would have been). In return, the read latencies on the client side are lower and much more predictable. In effect, the user experience has improved dramatically.
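For completeness, fast read is a per-pool flag; if I remember correctly it is enabled with

    ceph osd pool set <pool-name> fast_read 1

(please check the documentation for your release).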

I would really wish for such an option to be added. We use wide replication profiles (rep-(4,2) and EC(8+3), each with 2 "spare" OSDs), and being able to exploit large replication factors (more precisely, a large (size-min_size)) to mitigate the impact of slow OSDs would be awesome. It would also add an incentive to stop the ridiculous size=2, min_size=1 habit, because one would get an extra gain from replication on top of redundancy.

In the long run, the Ceph write path should try to deal with connections whose latency differences are known a priori (fast local ACK with asynchronous remote completion, which has been asked for a couple of times), for example in stretched clusters, where one has an internal connection for the local part and external connections for the remote parts. It would be great to have similar ways of mitigating some of the penalties of the slow write paths to remote sites.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Peter Grandi <pg@xxxxxxxxxxxxxxxxxxxx>
Sent: Wednesday, February 21, 2024 1:10 PM
To: list Linux fs Ceph
Subject:  Re: Performance improvement suggestion

> 1. Write object A from client.
> 2. Fsync to primary device completes.
> 3. Ack to client.
> 4. Writes sent to replicas.
[...]

As mentioned in the discussion, this proposal is the opposite
of the current policy, which is to wait for all replicas to be
written before the write is acknowledged to the client:

https://github.com/ceph/ceph/blob/main/doc/architecture.rst

   "After identifying the target placement group, the client
   writes the object to the identified placement group's primary
   OSD. The primary OSD then [...] confirms that the object was
   stored successfully in the secondary and tertiary OSDs, and
   reports to the client that the object was stored
   successfully."

A more revolutionary option would be for 'librados' to write in
parallel to all the "active set" OSDs and report this to the
primary, but that would greatly increase client-Ceph traffic,
while the current logic increases traffic only among OSDs.

> So I think that to maintain any semblance of reliability,
> you'd need to at least wait for a commit ack from the first
> replica (i.e. min_size=2).

Perhaps it could be similar to 'k'+'m' for EC, that is, 'k'
synchronous (the write completes to the client only when at
least 'k' replicas, including the primary, have been committed)
and 'm' asynchronous, instead of 'k' being just 1 or 2.
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx