Re: Erasure coding enhancements - design for review

Hi,

Firstly, let’s assure everyone that we are not taking the “fast read” option away; it will still be available and will work as it does today.

Personally, I’m not convinced that fast read works as well in practice as people might think. The problem is that each delayed read ends up blocking a thread. If a delay impacts more than one or two I/Os, the cluster will quickly run out of threads, and that will then stop any I/O from progressing. I can believe it improves tail latencies when you get one-off glitches, but not when you have a more pervasive problem such as a sick but not yet dead HDD.
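To make the thread-exhaustion argument concrete, here is a rough back-of-envelope sketch in Python. All of the numbers (worker count, arrival rate, stall time) are made up purely for illustration and do not come from any real deployment:

# Rough illustration of why blocked reads can exhaust an OSD's worker pool.
# All figures below are assumptions chosen for the example only.
worker_threads = 16          # threads available to service client I/O
delayed_read_rate = 50       # delayed reads arriving per second (sick HDD)
stall_seconds = 2.0          # how long each delayed read blocks its thread

# By Little's law, the average number of threads tied up waiting is
# arrival rate x stall time. Once that exceeds the pool size, every
# thread is blocked and even healthy I/O stops making progress.
blocked_threads = delayed_read_rate * stall_seconds
print(f"threads blocked on average: {blocked_threads:.0f} of {worker_threads}")
if blocked_threads >= worker_threads:
    print("worker pool saturated: no threads left for healthy I/O")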

Fast writes are a trickier problem – if you start completing writes before all the updates have been applied then you are reducing redundancy. For example, if you have a 6+2 erasure code and you allow a write to complete with one delayed update, then you will only have +1 redundancy for the whole placement group until the delayed update completes. If you allow two writes to complete, each with a delayed update to a different OSD, then you might have lost all your redundancy. Therefore, you probably want the cluster to coordinate which OSDs, and how many, it allows to fall behind.
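To sketch what I mean by that coordination, here is a minimal, purely hypothetical Python example of a per-placement-group budget for lagging OSDs. The class and method names are invented for illustration and do not correspond to anything in Ceph:

# Hypothetical sketch: limit how many distinct OSDs in a placement group
# may lag behind on writes before fast-write completion is disallowed.
# Names and structure are illustrative only, not Ceph internals.
class LaggingWriteBudget:
    def __init__(self, parity_shards: int, max_lagging: int):
        # For a 6+2 erasure code, parity_shards = 2; allowing
        # max_lagging = 1 preserves at least +1 redundancy at all times.
        assert max_lagging < parity_shards
        self.max_lagging = max_lagging
        self.lagging_osds = set()

    def may_complete_early(self, slow_osd: int) -> bool:
        """Return True if a write may ack before this OSD has applied it."""
        if slow_osd in self.lagging_osds:
            return True   # already counted against the budget
        if len(self.lagging_osds) < self.max_lagging:
            self.lagging_osds.add(slow_osd)
            return True
        return False      # budget exhausted: wait for all shards instead

    def osd_caught_up(self, osd: int) -> None:
        self.lagging_osds.discard(osd)

budget = LaggingWriteBudget(parity_shards=2, max_lagging=1)
print(budget.may_complete_early(slow_osd=5))   # True: one OSD may lag
print(budget.may_complete_early(slow_osd=7))   # False: a second lagging OSD
                                               # would leave no redundancy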

It might be easier to implement both fast reads and fast writes by having a very aggressive I/O timeout and taking OSDs down when they fail to complete reads or writes in a timely fashion. Ideally you want the OSD to stay down until the storage device completes the slow I/Os, giving some indication that the device has recovered. Reads and writes could then progress without the OSD, and recovery and backfill would deal with bringing it back up to date, assuming it does recover. You would probably want a configuration parameter defining how much redundancy you are willing to trade off for better performance, and to disable the aggressive timeouts once that threshold is reached. There is probably also work to do to reduce the delays to in-flight I/Os when an OSD is taken down, to make this approach viable.
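As a rough, hypothetical sketch of that policy (the parameter names, thresholds and the callback are all invented for illustration; nothing here reflects actual OSD code):

# Hypothetical sketch: fail an OSD on an aggressive I/O timeout, but only
# while the placement group still has more redundancy than a configured
# floor. All names and thresholds are illustrative assumptions.
AGGRESSIVE_TIMEOUT_S = 0.5   # assumed fast-fail I/O timeout
MIN_REDUNDANCY_FLOOR = 1     # config: never trade below +1 redundancy

def handle_slow_io(elapsed_s, up_to_date_shards, k, mark_osd_down):
    """Decide whether a slow I/O should take its OSD down.

    up_to_date_shards: shards currently current in the placement group
    k: number of data shards (e.g. 6 for a 6+2 code)
    mark_osd_down: callback standing in for whatever takes the OSD out
    """
    current_redundancy = up_to_date_shards - k   # extra shards beyond data
    if elapsed_s < AGGRESSIVE_TIMEOUT_S:
        return   # I/O is still within the fast-fail window
    if current_redundancy <= MIN_REDUNDANCY_FLOOR:
        return   # threshold reached: fall back to waiting for the OSD
    # Otherwise trade some redundancy for latency: take the OSD out and
    # let recovery and backfill bring it back once the device recovers.
    mark_osd_down()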

Cheers,

Bill.
bill_scales@xxxxxxxxxx
IBM Distinguished Engineer, IBM Storage

From: Frank Schilder <frans@xxxxxx>
Date: Tuesday, 2 July 2024 at 08:59
To: Sebastian Wagner <sebastian.wagner@xxxxxxxx>, Bill Scales <bill_scales@xxxxxxxxxx>, dev@xxxxxxx <dev@xxxxxxx>
Subject: [EXTERNAL] Re: Erasure coding enhancements - design for review

Hi Bill, Sebastian and rados developers,

this seems a good opportunity to raise awareness of the discussion around this comment: https://www.mail-archive.com/ceph-users@xxxxxxx/msg24115.html . Many of the optimizations discussed in the design draft target individual OSDs with the aim of improving performance and reducing load. Assuming all OSDs are working equally well, this is likely to have the desired effect.

However, when not all OSDs are operating equally well – that is, when tail latencies occur because of a few (temporarily) slow disks – the following write path might beat a path that is limited to, and directly addresses, a slow OSD:

- assuming we use the "fast write" option proposed in the e-mail referenced above, we send new shards to all OSDs
- we ack to the client when min_size OSDs reply with ack
- this approach explicitly trades network and CPU amplification for latency reduction with a simple "let the fastest win" strategy; we had great success with the already existing "fast read" option, which made client IO latencies more predictable and eliminated the effect of tail latencies on reads, and in our cluster design we explicitly take this into account when sizing network and CPU per disk

This effectively makes the writes to the slowest disks asynchronous with respect to client IO. In my experience with 6+2, 8+2 and 8+3 EC profiles, this would be a significant improvement for reducing overall IO latencies with both spinning and solid-state drives. We have enterprise SSDs in our cluster that sometimes stall for up to a few seconds and then catch up again. It would be awesome if clients did not have to wait for these temporarily slow drives.
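For illustration, here is a rough asyncio sketch of the "ack at min_size" idea; send_shard() is a hypothetical coroutine standing in for the real OSD messaging, and nothing here mirrors how this would actually be implemented inside Ceph:

# Illustrative sketch of "send to everyone, ack once min_size reply".
# send_shard() is a hypothetical stand-in for the real OSD messaging.
import asyncio

async def fast_write(shards, osds, send_shard, min_size):
    """Send every shard, return once min_size OSDs have acknowledged.

    The remaining (slow) shard writes keep running in the background, so
    the slowest OSDs no longer gate the latency seen by the client; the
    caller is responsible for tracking the still-pending tasks.
    """
    tasks = [asyncio.create_task(send_shard(osd, shard))
             for osd, shard in zip(osds, shards)]
    acked = 0
    for finished in asyncio.as_completed(tasks):
        await finished
        acked += 1
        if acked >= min_size:
            break          # enough shards are durable: ack the client now
    return tasks           # still-running writes to the slowest OSDs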

I would be most grateful if you could consider both options for optimization: (1) reducing the IO path to the absolute minimum number of disks and shards, as proposed in the draft document, and (2) going the opposite way and "just send it to everyone and let the fastest win", as proposed in the comment linked above. On realistic clusters with a small fraction of mildly broken hardware I would expect option (2) to win. For the highly optimized option (1), I would expect that hardware health needs to be maintained at very high levels for it to be worth the effort, as tail latencies will likely kill its benefits. It might also be possible to combine both ideas and get the best of both worlds?

I would also like to point to the message https://www.mail-archive.com/ceph-users@xxxxxxx/msg24126.html , which collects an a priori discussion of how much effort it would be to implement the "fast write" option. It does not sound that difficult to me.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Sebastian Wagner <sebastian.wagner@xxxxxxxx>
Sent: Monday, July 1, 2024 10:31 AM
To: Bill Scales; dev@xxxxxxx
Subject: Re: Erasure coding enhancements - design for review

Hi Bill,

Is the Partial Reads section the same as https://github.com/ceph/ceph/pull/55196?

Best,
Sebastian

On 01.07.24 at 10:08, Bill Scales wrote:
Hi,

We are planning to enhance the performance of erasure coding, in particular for use with block and file. We've got a design document, https://github.com/bill-scales/CephErasureCodingDesign , that sets out what we are hoping to achieve. We welcome your feedback, whether by posting your comments in Slack on #ceph-devel, raising issues in GitHub, or getting in contact with me directly.

Cheers,

Bill.
bill_scales@xxxxxxxxxx<mailto:bill_scales@xxxxxxxxxx>
IBM Distinguished Engineer, IBM Storage

Unless otherwise stated above:

IBM United Kingdom Limited
Registered in England and Wales with number 741598
Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU





--
Head of Software Development
E-Mail: sebastian.wagner@xxxxxxxx<mailto:sebastian.wagner@xxxxxxxx>

croit GmbH, Freseniusstr. 31h, 81247 Munich
CEO: Martin Verges, Andy Muthmann - VAT-ID: DE310638492
Com. register: Amtsgericht Munich HRB 231263

Web<https://croit.io/> | LinkedIn<http://linkedin.com/company/croit> | Youtube<https://www.youtube.com/channel/UCIJJSKVdcSLGLBtwSFx_epw> | Twitter<https://twitter.com/croit_io>

TOP 100 Innovator Award Winner<https://croit.io/blog/croit-receives-top-100-seal> by compamedia
Technology Fast50 Award<https://croit.io/blog/deloitte-technology-fast-50-award> Winner by Deloitte

_______________________________________________
Dev mailing list -- dev@xxxxxxx
To unsubscribe send an email to dev-leave@xxxxxxx
