Hi,

I just want to echo what the others are saying. Keep in mind that RADOS needs to guarantee read-after-write consistency for the higher-level apps to work (RBD, RGW, CephFS). If you corrupt VM block devices, S3 objects or bucket metadata/indexes, or CephFS metadata, you're going to suffer some long days and nights recovering.

Anyway, I think that what you proposed has at best a similar reliability to min_size=1. And note that min_size=1 is strongly discouraged because of the very high likelihood that a device/network/power failure turns into a visible outage. In short: your idea would turn every OSD into a SPoF.

How would you handle this very common scenario: a power outage followed by at least one device failing to start afterwards?

1. Write object A from the client.
2. Fsync to the primary device completes.
3. Ack to the client.
4. Writes are sent to the replicas.
5. Cluster-wide power outage (before the replicas committed).
6. Power is restored, but the primary OSD does not start (e.g. a permanent HDD failure).
7. The client tries to read object A.

(A toy sketch of this sequence is appended at the end of this message, after the quoted text.)

Today, with min_size=1 such a scenario manifests as data loss: you get either a down PG (with many, many objects offline and IO blocked until you manually decide which data-loss mode to accept) or unfound objects (with IO blocked until you accept data loss). With min_size=2 the likelihood of data loss is dramatically reduced.

Another thing about that power-loss scenario is that all dirty PGs would need to be recovered when the cluster reboots. You'd lose all the writes in transit and have to replay them from the primary's pg_log, or backfill if the pg_log was too short. Again, any failure during that recovery would lead to data loss.

So I think that to maintain any semblance of reliability, you'd need to at least wait for a commit ack from the first replica (i.e. min_size=2). But since the replica writes are dispatched in parallel, your speedup would evaporate.

Another thing: I suspect this idea would result in many inconsistencies from transient issues. You'd need to ramp up the number of parallel deep-scrubs to look for those inconsistencies quickly, which would also work against any potential speedup.

Cheers,
Dan

--
Dan van der Ster
CTO, Clyso GmbH
w: https://clyso.com | e: dan.vanderster@xxxxxxxxx
Try our Ceph Analyzer!: https://analyzer.clyso.com/
We are hiring: https://www.clyso.com/jobs/


On Wed, Jan 31, 2024, 11:49 quaglio@xxxxxxxxxx <quaglio@xxxxxxxxxx> wrote:

> Hello everybody,
>      I would like to make a suggestion for improving performance in the Ceph architecture.
>      I don't know if this group is the best place for it, or whether my proposal is sound.
>
>      My suggestion concerns https://docs.ceph.com/en/latest/architecture/, at the end of the topic "Smart Daemons Enable Hyperscale".
>
>      The client needs to "wait" for the configured number of replicas to be written (so that the client receives an ok and continues). This way, if any of the disks holding the PG being updated is slow, the client is left waiting.
>
>      It would be possible to:
>
>      1) Write only on the primary OSD.
>      2) Write the other replicas in the background (the same way as when an OSD fails: "degraded").
>
>      This way, the client gets a faster response when writing to storage, improving latency and performance (throughput and IOPS).
>
>      I would find it acceptable to have a period of time (seconds) until all replicas are written (asynchronously), in exchange for the improved performance.
>
>      Could you evaluate this scenario?
>
>
>      Rafael.
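
Appendix (referenced above): a minimal, purely illustrative Python sketch of steps 1-7. None of this is Ceph code; the OSD ids, the object name and the dictionaries standing in for durable storage are made-up stand-ins. It only demonstrates the ordering argument, namely that an ack sent before any replica has committed can point at data that no surviving OSD holds.

# Toy walk-through (not Ceph code) of the power-outage scenario above.
# The OSD numbering, object name and payload are hypothetical stand-ins.

durable = {0: {}, 1: {}, 2: {}}    # what each OSD has actually fsync'd
up = {0: True, 1: True, 2: True}   # which OSDs come back after the outage

# 1-2. The client writes object A; the primary (osd.0) fsyncs it.
durable[0]["A"] = b"client data"

# 3. Ack to the client: from its point of view, A is now durable.
acked = True

# 4-5. The replica writes to osd.1/osd.2 are still in flight when the
#      cluster-wide power outage hits, so they never reach stable media.
#      (Nothing to model here: durable[1] and durable[2] simply stay empty.)

# 6. Power comes back, but the primary's HDD is dead.
up[0] = False

# 7. The client tries to read A from whatever is left of the acting set.
surviving = [osd for osd in (0, 1, 2) if up[osd] and "A" in durable[osd]]
print("write was acked:", acked)            # True
print("surviving copies of A:", surviving)  # [] -> acknowledged data is gone

If the ack is instead deferred until at least one replica has also committed, then in this scenario the client simply never receives the ack (and can retry), or, if a replica did commit in time, a surviving copy exists. Either way no acknowledged write is silently lost, which is the min_size=2-style guarantee argued for above.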
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx