Re: reading from local replica?

On 06/09/2015 09:21 AM, Ted Miller wrote:
On 6/8/2015 5:55 PM, Brian Ericson wrote:
Am I misunderstanding
cluster.read-subvolume/cluster.read-subvolume-index?

I have two regions, "A" and "B", with servers "a" and "b" in each
region, respectively.  I have clients in both regions.
Intra-region communication is fast, but the pipe between the regions
is terrible.  I'd like to limit inter-region communication to little
more than glusterfs write operations and have reads go to the server
in the same region as the client.

I have created a replica volume as:
gluster volume create gv0 replica 2 a:/data/brick1/gv0
b:/data/brick1/gv0 force

As a baseline, if I use scp to copy from the brick directly, I get --
for a 100M file -- times of about 6s if the client scps from the
server in the same region and anywhere from 3 to 5 minutes if the
client scps from the server in the other region.
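
For reference, the baseline timings come from commands along these
lines (paths are illustrative; one_hundred_mb_file stands in for the
100M test file):

# same-region baseline: client in region A copying from server "a"
time scp a:/data/brick1/gv0/one_hundred_mb_file /tmp/
# cross-region baseline: the same client copying from server "b" in region B
time scp b:/data/brick1/gv0/one_hundred_mb_file /tmp/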

I was under the impression (from something I read but can't now find)
that glusterfs automatically picks the fastest replica, but that has
not been my experience; glusterfs seems to generally prefer the server
in the other region over the "local" one, with times usually in excess
of 4 minutes.

I've also tried having clients mount the volume using the "xlator"
options cluster.read-subvolume and cluster.read-subvolume-index, but
neither seems to have any impact.  Here are sample mount commands to
show what I'm attempting:

mount -t glusterfs -o
xlator-option=cluster.read-subvolume=gv0-client-<0 or 1> a:/gv0
/mnt/glusterfs
mount -t glusterfs -o xlator-option=cluster.read-subvolume-index=<0 or
1> a:/gv0 /mnt/glusterfs

Am I misunderstanding how glusterfs works, particularly when trying to
"read locally"?  Is it possible to configure glusterfs to use a local
replica (or the "fastest replica") for reads?
I am not a developer, nor intimately familiar with the internals of
glusterfs, but here is how I understand glusterfs-fuse file reads to
work.
First, all replica bricks are read, to make sure they are consistent.
(If not, gluster tries to make them consistent before proceeding).
After consistency is established, the actual read occurs from the
brick with the shortest response time.  I don't know when or how the
response time is measured, but it seems to work for most people most of
the time.  (If the client is on one of the brick hosts, it will almost
always read from the local brick.)
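
One sanity check that may be worth doing is making sure the subvolume
name you pass to cluster.read-subvolume actually matches a client
translator in the generated volfile.  On the servers the volfiles
should live under /var/lib/glusterd/vols/gv0/ (exact file names vary
by version), so something like this ought to list the gv0-client-N
names:

grep -n 'volume gv0-client-' /var/lib/glusterd/vols/gv0/*.vol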

If the file reads involve a lot of small files, the consistency check
may be what is killing your response times, rather than the read of the
file itself.  Over a fast LAN, the consistency checks can take many
times the actual read time of the file.

Hopefully others will chime in with more details, but if you can
supply more information about what you are reading, that will help too.
Are you reading entire files, or just reading in a lot of "snippets" or
what?

Ted Miller
Elkhart, IN, USA

Thanks for the response! Your understanding matches mine after reading documentation and various posts -- this should just work, right?

My test consists of reading a 100M file which has been replicated to both regions by glusterfs. The specific command looks similar to:
time /bin/cp -f /mnt/glusterfs/one_hundred_mb_file /tmp

To avoid local reads, I'm invoking the "cp" on separate hosts in each region. I umount & mount /mnt/glusterfs prior to running the timed copy to avoid measuring a read from the (client-)local cache. The direct-scp timings show that same-region reads can take under 10s while between-region reads take minutes.
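
So the full sequence for one timed run looks roughly like this (the
read-subvolume value and file name are illustrative):

umount /mnt/glusterfs
mount -t glusterfs -o xlator-option=cluster.read-subvolume=gv0-client-0 a:/gv0 /mnt/glusterfs
time /bin/cp -f /mnt/glusterfs/one_hundred_mb_file /tmp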

Almost universally, the first timed "cp" of a 100M file takes minutes. This is true for clients in both regions and regardless of how I mount the volume (with or without read-subvolume/read-subvolume-index). Occasionally, however (maybe once in every 20 first reads), glusterfs will surprise me with fast times (reads of ~5-20s) that align with what I'd expect if it were going to a same-region glusterfs replica. I have never, however, seen this repeated: if a 100M file copies in under 20s and I immediately follow it up with a copy of another 100M file, the second file always takes many minutes.

It appears that cluster.read-subvolume and cluster.read-subvolume-index have no impact when passed as part of the client's mount command. I note that if I set this at the volume level (gluster volume set gv0 cluster.read-subvolume gv0-client-0), the impact is immediate: those lucky clients on the "right side" of the divide get fast times, while those on the "other side" get poor times. Again, however, I see no impact trying to override this as part of the mount command on the client.
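
For completeness, the volume-level change that does take effect was
made roughly like this (run on one of the servers); afterwards the
option should show up under "Options Reconfigured" in the volume info:

gluster volume set gv0 cluster.read-subvolume gv0-client-0
gluster volume info gv0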

So, maybe passing these options as part of the mount command doesn't work / is a no-op, but what I don't understand is why -- given that there is no measure by which glusterfs should conclude that the replica in the "other" region is faster than the replica in the "same" region. In fact, it appears as though glusterfs is *preferring* the slower replica.
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users



