Hi Bill, Sebastian and rados developers,

this seems a good opportunity to raise awareness of the discussion around this comment: https://www.mail-archive.com/ceph-users@xxxxxxx/msg24115.html .

Many of the optimizations discussed in the design draft target individual OSDs with the aim of improving performance and reducing load. Assuming all OSDs are working equally well, this is likely to have the desired effect. However, when not all OSDs are operating equally well, that is, when tail latencies occur because of a few (temporarily) slow disks, the following write path might beat a path that is limited to a minimal set of OSDs and therefore has to wait directly on a slow one:

- assuming we use the "fast write" option proposed in the e-mail referenced above, we send new shards to all OSDs
- we ack to the client when min_size OSDs reply with ack
- this approach explicitly trades network and CPU amplification for latency reduction

With the simple "let the fastest win" strategy we had great success using the already existing "fast read" option, which made client IO latencies more predictable and eliminated the effect of tail latencies on reads; in our cluster design we explicitly take this into account when sizing network and CPU per disk.

This effectively makes the write to the slowest disks asynchronous from client IO. In my experience with 6+2, 8+2 and 8+3 EC profiles, this would be a significant improvement for reducing overall IO latencies with both spinning and solid-state drives. We have enterprise SSDs in our cluster that sometimes stall for up to a few seconds and then catch up again. It would be awesome if clients did not have to wait for these temporarily slow drives.

I would be most grateful if you could consider both options for optimization: (1) reducing the IO path to the absolute minimum number of disks and shards, as proposed in the draft document, and (2) going the opposite way and "just send it to everyone and let the fastest win", as proposed in the comment linked above. On realistic clusters with a small fraction of mildly broken hardware I would expect option (2) to win. For the highly optimized option (1) I would expect that hardware health needs to be maintained at very high levels for it to be worth the effort, as tail latencies will likely kill its benefits. Might it also be possible to combine both ideas and get the best of both worlds?

I would also like to point to message https://www.mail-archive.com/ceph-users@xxxxxxx/msg24126.html , which collects an a-priori discussion of how much effort it would be to implement the "fast write" option. It does not sound that difficult to me.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Sebastian Wagner <sebastian.wagner@xxxxxxxx>
Sent: Monday, July 1, 2024 10:31 AM
To: Bill Scales; dev@xxxxxxx
Subject: Re: Erasure coding enhancements - design for review

Hi Bill,

Is the Partial Reads section the same as https://github.com/ceph/ceph/pull/55196 ?

Best,
Sebastian

Am 01.07.24 um 10:08 schrieb Bill Scales:

Hi,

We are planning to enhance the performance of erasure coding, in particular for use with block and file. We've got a design document https://github.com/bill-scales/CephErasureCodingDesign that sets out what we are hoping to achieve. We welcome your feedback, either by posting your comments in Slack on #ceph-devel, raising issues on GitHub, or getting in contact with me.

Cheers,
Bill.
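
To put rough numbers on the tail-latency argument behind the "fast write" / "let the fastest win" proposal above, here is a minimal Monte Carlo sketch. The EC profile, stall probability and latencies are illustrative assumptions, not measurements from the thread; it only compares "client waits for all k+m shard acks" against "client is acked once min_size shards have completed".

import random

# Sketch of the "ack at min_size" idea from the thread above.
# Assumptions (not from the original mails): an 8+3 EC pool (n = 11 shards,
# min_size = 9), a baseline shard write latency of ~2 ms, and a 2% chance
# that any single shard write hits a stalled OSD for ~1 s.

N_SHARDS = 11       # k + m for an 8+3 profile
MIN_SIZE = 9        # k + 1
STALL_PROB = 0.02   # fraction of shard writes that land on a stalled OSD
TRIALS = 100_000

def shard_latency_ms():
    """Latency of one shard write: fast normally, very slow when the OSD stalls."""
    base = random.expovariate(1 / 2.0)          # ~2 ms average
    stall = 1000.0 if random.random() < STALL_PROB else 0.0
    return base + stall

def percentile(samples, p):
    samples = sorted(samples)
    return samples[int(p / 100 * (len(samples) - 1))]

wait_for_all, ack_at_min_size = [], []
for _ in range(TRIALS):
    lat = sorted(shard_latency_ms() for _ in range(N_SHARDS))
    wait_for_all.append(lat[-1])                # client waits for the slowest shard
    ack_at_min_size.append(lat[MIN_SIZE - 1])   # client acked once min_size shards done

for name, data in [("wait for all shards", wait_for_all),
                   ("ack at min_size", ack_at_min_size)]:
    print(f"{name:22s} p50={percentile(data, 50):8.2f} ms  "
          f"p99={percentile(data, 99):8.2f} ms")

With these assumed numbers, roughly one in five writes includes at least one stalled OSD, so the "wait for all shards" p99 sits near the stall duration, while the min_size path almost never has to wait for a stalled shard. The trade-off is the one named in the mail: the remaining shard writes complete asynchronously, at the cost of extra network and CPU work.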