Re: Stickyness of writing vs full network storage writing

Joachim Kraftmayer - ceph ambassador <joachim.kraftmayer@xxxxxxxxx> · Sat, 28 Oct 2023 10:22:23 +0200

Hi,

I am familiar with requirements like these, and with the motivation and need behind them.
We have chosen a clear approach that also keeps the whole setup from
becoming too complicated to operate.
1.) Everything that doesn't require strong consistency we do with other
tools, especially when it comes to NVMe, PCIe 5.0 and newer technologies
with high IOPS and low latencies.

2.) Everything that requires high data security, strong consistency and
failure domains larger than a host we do with Ceph.

Joachim

___________________________________
ceph ambassador DACH
ceph consultant since 2012

Clyso GmbH - Premier Ceph Foundation Member

https://www.clyso.com/

On 27.10.23 at 17:58, Anthony D'Atri wrote:
Ceph is all about strong consistency and data durability.  There can also be a distinction between performance of the cluster in aggregate vs a single client, especially in a virtualization scenario where to avoid the noisy-neighbor dynamic you deliberately throttle iops and bandwidth per client.

For my discussion I am assuming current PCIe-based NVMe drives, which are capable of writing about 8 GiB/s, which is about 64 Gbit/s.
Written how, though?  Benchmarks sometimes are written with 100% sequential workloads, top-SKU CPUs that mortals can't afford, and especially with a queue depth of like 256.
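
For reference, the unit conversion behind the quoted 8 GiB/s figure can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function name is made up for illustration, and it assumes binary gibibytes on the drive side and decimal gigabits on the network side. Under that assumption the result comes out slightly above the rounded 64 Gbit/s used in this thread.

```python
GIB = 2**30  # bytes in one gibibyte (binary unit)


def gib_per_s_to_gbit_per_s(gib_per_s: float) -> float:
    """Convert a rate in GiB/s to decimal Gbit/s (1 Gbit = 10**9 bits)."""
    return gib_per_s * GIB * 8 / 1e9


# The drive write rate assumed in the thread.
print(round(gib_per_s_to_gbit_per_s(8), 1))  # 68.7
```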

With most Ceph deployments, the IO a given drive experiences is often pretty much random and with lower QD.  And depending on the drive, significant read traffic may impact write bandwidth to a degree.  At ..... Mountpoint (Vancouver BC 2018) someone gave a presentation about the difficulties saturating NVMe bandwidth.

Now consider the situation where you have 5 nodes, each with 4 of those drives:
buying the corresponding network switches alone will make all small and mid-sized companies go bankrupt ;-)
Depending where you get your components...

* You probably don't need "mixed-use" (~3 DWPD) drives, for most purposes "read intensive" (~1DWPD) (or less, sometimes) are plenty.  But please please please stick with real enterprise-class drives.

* Chassis brands mark up their storage (and RAM) quite a bit.  You can often get SSDs elsewhere for half of what they cost from your chassis manufacturer.

But the server hardware is still simple commodity hardware, which can easily saturate any given commodity network hardware.
If I want to be able to use the full 64 Gbit/s, I would need at least 100 Gbit/s networking, or tons of trunked ports and cabling with lower-bandwidth switches.
Throughput and latency are different things, though.  Also, are you assuming here the traditional topology of separate public and cluster/private/replication networks?  With modern networking (and Ceph releases) that is often overkill and you can leave out the replication network.
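
As a rough sizing aid for the numbers in this exchange, the sketch below estimates how much traffic a given client write rate puts on the wire for a replicated pool. It assumes the usual primary-copy flow (the client sends the data once to the primary OSD, which forwards it to each of the other replicas) and that no copy stays on the same host; the function name is hypothetical, and 64 Gbit/s is simply the figure used above.

```python
def total_wire_traffic_gbit_s(client_write_gbit_s: float, replicas: int) -> float:
    """Total network traffic caused by client writes to a replicated pool.

    Assumes one transfer from client to primary plus one transfer from the
    primary to each of the remaining (replicas - 1) copies.
    """
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    client_to_primary = client_write_gbit_s
    primary_to_replicas = client_write_gbit_s * (replicas - 1)
    return client_to_primary + primary_to_replicas


# 64 Gbit/s of client writes with 3 copies.
print(total_wire_traffic_gbit_s(64, 3))  # 192
```

Whether that traffic shares one network with the clients or is split onto a separate replication network only changes where the load lands, not its total.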

Also, would your clients have the same networking provisioned?  If you're

If we now also consider distributing the nodes over racks, buildings at the same location, or distributed datacenters, the costs will be even more painful.
Don't you already have multiple racks?  They don't need to be dedicated only to Ceph.

The Ceph commit requirement will be 2 copies on different OSDs (comparable to a mirrored drive) and in total 3 or 4 copies in the cluster (comparable to a RAID with multiple-disk redundancy).
Not entirely comparable, but the distinctions mostly don't matter here.

In all our tests so far, we could not control how Ceph persists these 2 copies. It will always try to persist them somehow over the network.
Q1: Is this behavior mandatory?
It's a question of how important the data is, and how bad it would be to lose some.

Our common workload, and afaik that of nearly all webservice-based applications, is:
- a short burst of high bandwidth (e.g. multiple MiB/s or even GiB/s)
- and probably mostly a 1:4 or even 1:6 write-to-read ratio when utilizing the cluster
QLC might help your costs, look into the D5-P5430, D5-P5366, etc.  Though these days if you shop smart you can get TLC for close to the same cost.  Won't always be true though, and you can't get a 60TB TLC SKU ;)

Hope I could explain the situation here well enough.
Now assuming my ideal world with Ceph:
if Ceph would:
1. commit 2 copies to local drives on the node where the Ceph client is connected
2. after the commit, sync (optimized/queued) the data over the network to fulfill the common needs of Ceph storage with 4 copies
You could I think craft a CRUSH rule to do that.  Default for replicated pools FWIW is 3 copies not 4.

3. maybe optionally move 1 copy away from the initial node, which still holds the 2 local copies...
I don't know of an elegant way to change placement after the fact.

This behaviour would ensure that:
- the performance perceived by the OSD clients will be the full bandwidth of the local NVMes, since 2 copies are delivered to the local NVMes at 64 Gbit/s, and the latency would be comparable to writing locally
- we would have 2 copies nearly "immediately" reported to any ceph client
I was once told that writes return to the client when min_size copies are written; later I was told that it's actually not until all copies are written.
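
To make the difference between the two acknowledgement policies mentioned here concrete, the following minimal sketch models the client-visible write latency as the time until a required number of copies is durable. It assumes all copies are written in parallel; the function name and the per-copy latencies are hypothetical illustration values, not measurements and not Ceph's actual code path.

```python
def ack_latency_ms(copy_latencies_ms: list[float], copies_required: int) -> float:
    """Time until `copies_required` copies are durable.

    Assumes every copy is written in parallel, so the answer is the
    k-th smallest per-copy latency.
    """
    if not 1 <= copies_required <= len(copy_latencies_ms):
        raise ValueError("copies_required must be between 1 and the number of copies")
    return sorted(copy_latencies_ms)[copies_required - 1]


# Hypothetical latencies in milliseconds: one fast local copy, two remote copies.
latencies = [0.2, 0.9, 1.4]

print(ack_latency_ms(latencies, 2))               # ack after 2 copies: 0.9
print(ack_latency_ms(latencies, len(latencies)))  # ack after all copies: 1.4
```

Under this model, waiting for every copy means the slowest replica sets the write latency, which is why faster local drives alone do not shorten it.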

But say we could do this.  Think about what happens if one of those two local drives -- or the entire server -- dies.  Before any copies are persisted to other servers, or if only one copy is persisted to another server.  You risk data loss.

- bandwidth utilization will be optimized, since we do not duplicate the stored data transfers on the network immediately; we defer them from the initial write of the Ceph client and can thus make better use of a queueing mechanism
Unless you have an unusually random io pattern, I'm not sure if that would affect bandwidth much.

- IMHO the scalability with commodity network would be far easier to implement, since the networking requirements are factors lower
How so?  I would think you'd still need the same networking.  Also remember that having your PCI-e lanes and keeping them full are very different things.

Maybe I have a totally wrong understanding of Ceph clusters and how the copies are distributed.
Q2: If so plz let me know where I may read more about this?
https://www.amazon.com/Learning-Ceph-scalable-reliable-solution-ebook/dp/B01NBP2D9I

;)

You might be able to achieve parts of what you envision here with commercial NVMeoF solutions.  When I researched them they tended to have low latency, but some required proprietary hardware.  Mostly they defaulted to only 2 replicas and had significant scaling and flexibility limitations.  All depends on what you're solving for.

So, to sum it up quickly:
Q3: is it possible to configure Ceph to behave as described above in my ideal world?
    That means first writing the n minimal copies to local drives and deferring the syncing of the other copies over the network
Q4: if not, are there any plans in this direction?
Q5: if possible, is there a good documentation for it?
Q6: we would still like to be able to distribute over racks, enclosures and datacenters
   best wishes
Hans
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx