Hi Bill, Sebastian and rados developers,

this seems a good opportunity to raise awareness of the discussion around this comment: https://www.mail-archive.com/ceph-users@xxxxxxx/msg24115.html .

Many of the optimizations discussed in the design draft target individual OSDs with the aim of improving performance and reducing load. Assuming all OSDs are working equally well, this is likely to have the desired effect. However, when not all OSDs are operating equally well, that is, when tail latencies occur because of a few (temporarily) slow disks, the following write path might beat a path that is limited to a minimal set of OSDs and therefore has to wait directly on a slow one:

- assuming we use the "fast write" option proposed in the e-mail referenced above, we send new shards to all OSDs
- we ack to the client when min_size OSDs reply with ack
- this approach explicitly trades network and CPU amplification for latency reduction

With the simple "let the fastest win" strategy we had great success using the already existing "fast read" option, which made client IO latencies more predictable and eliminated the effect of tail latencies on reads; in our cluster design we explicitly take this into account when sizing network and CPU per disk.

This effectively makes the write to the slowest disks asynchronous from client IO. In my experience with 6+2, 8+2 and 8+3 EC profiles, this would be a significant improvement for reducing overall IO latencies with both spinning and solid-state drives. We have enterprise SSDs in our cluster that sometimes stall for up to a few seconds and then catch up again. It would be awesome if clients did not have to wait for these temporarily slow drives.

I would be most grateful if you could consider both options for optimization: (1) reducing the IO path to the absolute minimum number of disks and shards, as proposed in the draft document, and (2) going the opposite way and "just send it to everyone and let the fastest win", as proposed in the comment linked above. On realistic clusters with a small fraction of mildly broken hardware I would expect option (2) to win. For the highly optimized option (1) I would expect that hardware health needs to be maintained at very high levels for it to be worth the effort, as tail latencies will likely kill its benefits. Might it also be possible to combine both ideas and get the best of both worlds?

I would also like to point to message https://www.mail-archive.com/ceph-users@xxxxxxx/msg24126.html , which collects an a-priori discussion of how much effort it would be to implement the "fast write" option. It does not sound that difficult to me.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Sebastian Wagner <sebastian.wagner@xxxxxxxx>
Sent: Monday, July 1, 2024 10:31 AM
To: Bill Scales; dev@xxxxxxx
Subject: Re: Erasure coding enhancements - design for review

Hi Bill,

Is the Partial Reads section the same as https://github.com/ceph/ceph/pull/55196 ?

Best,
Sebastian

Am 01.07.24 um 10:08 schrieb Bill Scales:

Hi,

We are planning to enhance the performance of erasure coding, in particular for use with block and file. We've got a design document https://github.com/bill-scales/CephErasureCodingDesign that sets out what we are hoping to achieve. We welcome your feedback, either by posting your comments in Slack on #ceph-devel, raising issues on GitHub, or getting in contact with me.

Cheers,
Bill.
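
To put rough numbers on the tail-latency argument behind the "fast write" / "let the fastest win" proposal above, here is a minimal Monte Carlo sketch. The EC profile, stall probability and latencies are illustrative assumptions, not measurements from the thread; it only compares "client waits for all k+m shard acks" against "client is acked once min_size shards have completed".

import random

# Sketch of the "ack at min_size" idea from the thread above.
# Assumptions (not from the original mails): an 8+3 EC pool (n = 11 shards,
# min_size = 9), a baseline shard write latency of ~2 ms, and a 2% chance
# that any single shard write hits a stalled OSD for ~1 s.

N_SHARDS = 11       # k + m for an 8+3 profile
MIN_SIZE = 9        # k + 1
STALL_PROB = 0.02   # fraction of shard writes that land on a stalled OSD
TRIALS = 100_000

def shard_latency_ms():
    """Latency of one shard write: fast normally, very slow when the OSD stalls."""
    base = random.expovariate(1 / 2.0)          # ~2 ms average
    stall = 1000.0 if random.random() < STALL_PROB else 0.0
    return base + stall

def percentile(samples, p):
    samples = sorted(samples)
    return samples[int(p / 100 * (len(samples) - 1))]

wait_for_all, ack_at_min_size = [], []
for _ in range(TRIALS):
    lat = sorted(shard_latency_ms() for _ in range(N_SHARDS))
    wait_for_all.append(lat[-1])                # client waits for the slowest shard
    ack_at_min_size.append(lat[MIN_SIZE - 1])   # client acked once min_size shards done

for name, data in [("wait for all shards", wait_for_all),
                   ("ack at min_size", ack_at_min_size)]:
    print(f"{name:22s} p50={percentile(data, 50):8.2f} ms  "
          f"p99={percentile(data, 99):8.2f} ms")

With these assumed numbers, roughly one in five writes includes at least one stalled OSD, so the "wait for all shards" p99 sits near the stall duration, while the min_size path almost never has to wait for a stalled shard. The trade-off is the one named in the mail: the remaining shard writes complete asynchronously, at the cost of extra network and CPU work.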