Re: Stretch cluster experiences in production?

On Tue, Oct 19, 2021 at 9:11 AM Matthew Vernon <mvernon@xxxxxxxxxxxxx> wrote:
>
> Hi,
>
> On 18/10/2021 23:34, Gregory Farnum wrote:
> > On Fri, Oct 15, 2021 at 8:22 AM Matthew Vernon <mvernon@xxxxxxxxxxxxx> wrote:
>
> >> Also, if I'm using RGWs, will they do the right thing location-wise?
> >> i.e. DC A RGWs will talk to DC A OSDs wherever possible?
> >
> > Stretch clusters are entirely a feature of the RADOS layer at this
> > point; setting up RGW/RBD/CephFS to use them efficiently is left as an
> > exercise to the user. Sorry. :/
> >
> > That said, I don't think it's too complicated — you want your CRUSH
> > rule to specify a single site as the primary and to run your active
> > RGWs on that side, or else to configure read-from-replica and local
> > reads if your workloads support them. But so far the expectation is
> > definitely that anybody deploying this will have their own
> > orchestration systems around it (you can't really do HA from just the
> > storage layer), whether it's home-brewed or Rook in Kubernetes, so we
> > haven't discussed pushing it out more within Ceph itself.
>
> We do have existing HA infrastructure which can e.g. make sure our S3
> clients in DC A talk to our RGWs in DC A.
>
> But I think I understand you to be saying that in a stretch cluster
> (other than in stretch degraded mode) each pg will still have 1 primary
> which will serve all reads - so ~50% of our RGWs in DC B will end up
> reading from DC A (and vice versa). And that there's no way round this.
> Is that correct?

Well, kind of. The stretch mode logic itself is concerned with sorting
monitors and OSDs into "stretch buckets" (e.g., your datacenters;
"bucket" in the CRUSH map sense), making sure PGs have members in
both buckets when peering, and detecting when a whole bucket is gone
so the cluster can go degraded and keep running on a single site. So
stretch mode doesn't directly try to do anything about read locality.
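To give a feel for what that looks like operationally, a rough sketch
(assuming datacenter buckets named site1/site2, a tiebreaker monitor
"e" in a third location, and a CRUSH rule named stretch_rule; your
names will differ):

    # tell the monitors which datacenter they live in
    ceph mon set_location a datacenter=site1
    ceph mon set_location b datacenter=site1
    ceph mon set_location c datacenter=site2
    ceph mon set_location d datacenter=site2
    ceph mon set_location e datacenter=site3   # tiebreaker
    # hosts/OSDs get placed under the datacenter buckets in the CRUSH
    # map (e.g. with "ceph osd crush move"), then stretch mode is
    # switched on with the tiebreaker, the rule, and the dividing bucket
    ceph mon enable_stretch_mode e stretch_rule datacenter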

But it runs with a user-provided CRUSH rule, and that rule can be
shaped to do things like guarantee all primaries are in a single site:
you "take" your first two replicas from the primary site, then "take"
again from the failover site. That sort of thing is covered in the
CRUSH rule documentation, and it's the format used by the suggested
CRUSH rule in the stretch mode doc.
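A rule along those lines might look roughly like this (a sketch
assuming datacenter buckets named site1 and site2, with site1 as the
preferred primary site; the rule in the stretch mode doc is the
authoritative version):

    rule stretch_rule {
            id 1
            type replicated
            step take site1
            step chooseleaf firstn 2 type host
            step emit
            step take site2
            step chooseleaf firstn 2 type host
            step emit
    }

Because site1 is taken first, the first OSD chosen for each PG (the
primary) always lands in site1, so reads are served from that side.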

> Relatedly, I infer this means that the inter-DC link will continue to be
> a bottleneck for write latency as if I were just running a "normal"
> cluster that happens to be in two DCs? [because the primary OSD will
> only ACK the write once all four replicas are complete]

Well, yeah, it's all synchronous. If you want async replication you
can use rbd-mirror and RGW multi-site functionality instead.
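For rbd-mirror, the per-pool setup is roughly the following (a sketch
assuming a pool named rbd, an image named myimage, and snapshot-based
mirroring; the peer bootstrap between the two clusters and running the
rbd-mirror daemon on the receiving side are covered in the rbd-mirror
docs):

    # enable mirroring on the pool (on both clusters)
    rbd mirror pool enable rbd image
    # enable mirroring for a specific image in snapshot mode
    rbd mirror image enable rbd/myimage snapshot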
-Greg

_______________________________________________
ceph-users mailing list -- ceph-users@xxxxxxx
To unsubscribe send an email to ceph-users-leave@xxxxxxx
