Re: reading from local replica?

On 06/09/2015 10:37 AM, Jeff Darcy wrote:
Am I misunderstanding cluster.read-subvolume/cluster.read-subvolume-index?

I have two regions, "A" and "B", with servers "a" and "b" in each
region respectively.  I have clients in both regions.  Intra-region
communication is fast, but the pipe between the regions is terrible.
I'd like to limit inter-region communication to little more than
glusterfs write operations and have reads go to the server in the
same region as the client.

I have created a replica volume as:
gluster volume create gv0 replica 2 a:/data/brick1/gv0
b:/data/brick1/gv0 force

As a baseline, if I use scp to copy from the brick directly, I get --
for a 100M file -- times of about 6s if the client scps from the server
in the same region and anywhere from 3 to 5 minutes if the client scps
from the server in the other region.

I was under the impression (from something I read but can't now find)
that glusterfs automatically picks the fastest replica, but that has not
been my experience; glusterfs seems to generally prefer the server in
the other region over the "local" one, with times usually in excess of 4
minutes.

The choice of which replica to read from has become rather complicated
over time.  The first parameter that matters is cluster.read-hash-mode,
which selects between dynamic and (two forms of) static selection.  For
the default mode, we try to spread the read load across replicas based
on both the file's ID and the client's.  For read-hash-mode=0 *only*,
we do this:

  * If "choose-local" is set (as it is by default) and there's a local
    replica, use that.

  * Otherwise, select a replica based on fastest *initial* response.

Note that these are both a bit prone to hot spots, which is why this
method is not the default.  Also, re-evaluating response times is as
likely to lead to "mobile hotspot" behavior as anything else -
clients keep following each other around to previously idle but now
overloaded replicas, moving the congestion around but never resolving
it.  Thus, we only tend to re-evaluate in response to brick up/down
events.  Probably some room for improvement here.
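
For reference, these are ordinary volume options, so something along
these lines should switch a volume to mode 0 (a sketch using your gv0
volume, not something I've tested against your setup):

  # select the read child by choose-local / first-response instead of hashing
  gluster volume set gv0 cluster.read-hash-mode 0
  # choose-local is on by default, but it can be set explicitly
  gluster volume set gv0 cluster.choose-local on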

That brings us to read-subvolume and read-subvolume-index.  The
difference between them is that read-subvolume takes a translator
*name* (which you'd have to get from the volfile) and only applies
to one replica set within a volume.  It's really only useful for
testing and debugging.  By contrast, read-subvolume-index applies
to all replica sets in a volume and doesn't require any knowledge
of translator names.  Either one is used *before* read-hash-mode;
if it's set, and if the corresponding replica is up, it will be
chosen.
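
As a sketch of how the two look in practice (assuming the standard
translator naming for a volume called gv0, i.e. gv0-client-0 and
gv0-client-1):

  # applies to every replica set: always read from the first brick in each set
  gluster volume set gv0 cluster.read-subvolume-index 0
  # applies to one replica set, by translator name (testing/debugging only)
  gluster volume set gv0 cluster.read-subvolume gv0-client-0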

Yes, it's a bit of a mess.  However, as you've clearly guessed,
this is a pretty critical decision so it's nice to have many
different ways to control it.

I've also tried having clients mount the volume using the "xlator"
options cluster.read-subvolume and cluster.read-subvolume-index, but
neither seems to have any impact.  Here are sample mount commands to show
what I'm attempting:

mount -t glusterfs -o xlator-option=cluster.read-subvolume=gv0-client-<0
or 1> a:/gv0 /mnt/glusterfs
mount -t glusterfs -o xlator-option=cluster.read-subvolume-index=<0 or
1> a:/gv0 /mnt/glusterfs

I would guess that the translator options are somehow not being passed
all the way through to the translator that actually makes the decision.
If it is being passed, it definitely should "force the decision" as
described above.  There might be a bug here, or perhaps I'm just
misunderstanding code I haven't read in a while.
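
One thing that might be worth trying (I haven't checked whether
mount.glusterfs forwards it properly) is qualifying the option with the
replicate translator's full name, which for gv0 would normally be
gv0-replicate-0:

  mount -t glusterfs \
    -o xlator-option=gv0-replicate-0.read-subvolume-index=0 \
    a:/gv0 /mnt/glusterfs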

Also, please note that synchronous replication (AFR) isn't really
intended or expected to work over long distances.  Anything over 5ms
RTT is risky territory; that's why we have separate geo-replication.
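
If you want to experiment with it, the setup is roughly as follows
(assuming a separate slave volume, here called gv0-slave, has already
been created and started on b; details vary by version, so treat this
as a sketch):

  # generate the common pem keys used by geo-replication sessions
  gluster system:: execute gsec_create
  # create and start a session from master volume gv0 to the slave volume
  gluster volume geo-replication gv0 b::gv0-slave create push-pem
  gluster volume geo-replication gv0 b::gv0-slave start
  gluster volume geo-replication gv0 b::gv0-slave status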


Thanks for pointing me in the direction of geo-replication. I was unaware that AFR was not recommended for long distances. I'm unsure, though, whether you were simply pointing out that geo-replication is what's intended for long distances, or whether it is specifically a solution for situations like mine, where both regions have active clients and want optimal performance.

Tinkering with geo-replication, I see I'm able to mount either the "master" volume or the geo-replicated "slave" volume, which gives me the impression that I can point each client at the appropriate volume for its region as a way of getting the performance I want.
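
Concretely, what I have in mind (with a hypothetical slave volume
gv0-slave served from b) is something like:

  # clients in region A mount the master volume
  mount -t glusterfs a:/gv0 /mnt/glusterfs
  # clients in region B mount the geo-replicated slave volume
  mount -t glusterfs b:/gv0-slave /mnt/glusterfs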

The result loses some transparency, not only in that the client needs to choose (there is no longer a single volume everyone mounts), but also in that if it chooses to mount the slave volume, it needs to be careful with writes to it.

I'd almost expected glusterfs to prevent writes to the slave volume; this appears to not be the case. Rather, slave files created in the slave volume are not propagated to the master (makes sense). Slave modifications to master files or deletions of master files are respected -- glusterfs seems to "disown" slave-modified content (deleted files aren't resurrected, even if the master version gets modified, and modified files aren't replaced even if the master version changes or is deleted).

In short, it would seem that were I to use geo-replication, whether or not it's recommended for this kind of usage, I'd need to manage both which volume each client mounts and what to do with writes when a client has mounted the slave.

Finally, given that ping times between regions are typically in excess of 200 ms in my case, would you strongly discourage AFR usage?