On Tue, Oct 19, 2021 at 9:11 AM Matthew Vernon <mvernon@xxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 18/10/2021 23:34, Gregory Farnum wrote:
> > On Fri, Oct 15, 2021 at 8:22 AM Matthew Vernon <mvernon@xxxxxxxxxxxxx> wrote:
>
> >> Also, if I'm using RGWs, will they do the right thing location-wise?
> >> i.e. DC A RGWs will talk to DC A OSDs wherever possible?
> >
> > Stretch clusters are entirely a feature of the RADOS layer at this
> > point; setting up RGW/RBD/CephFS to use them efficiently is left as an
> > exercise to the user. Sorry. :/
> >
> > That said, I don't think it's too complicated: you want your CRUSH
> > rule to specify a single site as the primary and to run your active
> > RGWs on that side, or else to configure read-from-replica and local
> > reads if your workloads support them. But so far the expectation is
> > definitely that anybody deploying this will have their own
> > orchestration systems around it (you can't really do HA from just the
> > storage layer), whether it's home-brewed or Rook in Kubernetes, so we
> > haven't discussed pushing it out more within Ceph itself.
>
> We do have existing HA infrastructure which can e.g. make sure our S3
> clients in DC A talk to our RGWs in DC A.
>
> But I think I understand you to be saying that in a stretch cluster
> (other than in stretch degraded mode) each pg will still have 1 primary
> which will serve all reads - so ~50% of our RGWs in DC B will end up
> reading from DC A (and vice versa). And that there's no way round this.
> Is that correct?

Well, kind of. The stretch mode logic itself is concerned with sorting
monitors and OSDs into "stretch buckets" (e.g. your datacenters; "bucket"
as in the CRUSH map buckets), making sure PGs have members in both
buckets when peering, and detecting when a whole bucket is gone so the
cluster can go degraded and keep running on a single site. So stretch
mode doesn't directly try to do anything about this.
But it runs with a user-provided CRUSH rule, and this rule can be shaped
to do things like guarantee all primaries are in a single site: you just
"take" your first two replicas from the primary site, and then "take"
again from the failover location. That sort of thing is covered in the
CRUSH rule documentation, and it is the format used in the suggested
CRUSH rule in the stretch mode doc.

> Relatedly, I infer this means that the inter-DC link will continue to be
> a bottleneck for write latency as if I were just running a "normal"
> cluster that happens to be in two DCs? [because the primary OSD will
> only ACK the write once all four replicas are complete]

Well, yeah, it's all synchronous. If you want async replication you can
make use of RBD mirroring (rbd-mirror) and RGW multi-site functionality.

-Greg
_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
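
For reference, the two-"take" rule shape described above could look roughly like the fragment below. This is only a sketch in decompiled CRUSH map syntax, modelled on the rule suggested in the stretch mode documentation. The bucket names site1 and site2, the rule name, and the rule id are placeholders, not values from this thread. It assumes a pool of size 4 with hosts under each datacenter bucket, and that primary affinity is left at its default, so the first OSD the rule emits normally acts as primary.

```text
rule stretch_rule {
        id 1                                # placeholder id, pick an unused one
        type replicated
        min_size 1                          # these two lines appear in older
        max_size 10                         # releases; newer ones may omit them
        step take site1                     # assumed primary-site bucket
        step chooseleaf firstn 2 type host  # first two replicas, one per host
        step emit
        step take site2                     # assumed failover-site bucket
        step chooseleaf firstn 2 type host  # remaining two replicas
        step emit
}
```

With a rule of this shape, active RGWs would run next to site1 so that reads stay local. Writes still wait on the replicas in site2, as noted above.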