On 06/09/2015 09:21 AM, Ted Miller wrote:
On 6/8/2015 5:55 PM, Brian Ericson wrote:
Am I misunderstanding
cluster.read-subvolume/cluster.read-subvolume-index?
I have two regions, "A" and "B", with servers "a" and "b" in,
respectively, each region. I have clients in both regions.
Intra-region communication is fast, but the pipe between the regions
is terrible. I'd like to limit inter-region communication, as far as
possible, to glusterfs write operations and have reads go to the
server in the region the client is running in.
I have created a replica volume as:
gluster volume create gv0 replica 2 a:/data/brick1/gv0 b:/data/brick1/gv0 force
As a baseline, if I use scp to copy from the brick directly, I get --
for a 100M file -- times of about 6s if the client scps from the
server in the same region and anywhere from 3 to 5 minutes if the
client scps from the server in the other region.
I was under the impression (from something I read but can't now find)
that glusterfs automatically picks the fastest replica, but that has
not been my experience; glusterfs seems to generally prefer the server
in the other region over the "local" one, with times usually in excess
of 4 minutes.
I've also tried having clients mount the volume using the "xlator"
options cluster.read-subvolume and cluster.read-subvolume-index, but
neither seems to have any impact. Here are sample mount commands to
show what I'm attempting:
mount -t glusterfs -o xlator-option=cluster.read-subvolume=gv0-client-<0 or 1> a:/gv0 /mnt/glusterfs
mount -t glusterfs -o xlator-option=cluster.read-subvolume-index=<0 or 1> a:/gv0 /mnt/glusterfs
Am I misunderstanding how glusterfs works, particularly when trying to
"read locally"? Is it possible to configure glusterfs to use a local
replica (or the "fastest replica") for reads?
I am not a developer, nor intimately familiar with the insides of
glusterfs, but here is how I understand that glusterfs-fuse file reads
work.
First, all replica bricks are read, to make sure they are consistent.
(If not, gluster tries to make them consistent before proceeding).
After consistency is established, then the actual read occurs from the
brick with the shortest response time. I don't know when or how the
response time is measured, but it seems to work for most people most of
the time. (If the client is on one of the brick hosts, it will almost
always read from the local brick.)
If the file reads involve a lot of small files, the consistency check
may be what is killing your response times, rather than the read of the
file itself. Over a fast LAN, the consistency checks can take many
times the actual read time of the file.
Hopefully others will chime in with more information, but if you can
supply more information about what you are reading, that will help too.
Are you reading entire files, or just reading in a lot of "snippets" or
what?
Ted Miller
Elkhart, IN, USA
_______________________________________________
Gluster-users mailing list
Gluster-users@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-users
Thanks for the response! Your understanding matches mine after reading
documentation and various posts -- this should just work, right?
My test consists of reading a 100M file which has been replicated to
both regions by glusterfs. The specific command looks similar to:
time /bin/cp -f /mnt/glusterfs/one_hundred_mb_file /tmp
To avoid local reads, I'm invoking the "cp" on separate hosts in each
region. I umount & mount /mnt/glusterfs prior to each timed copy to
avoid measuring a read from the (client-)local cache. The direct-scp
timings show that same-region reads could take under 10s and
between-region reads will take minutes.
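
The test procedure described above can be sketched as a small shell
script. This is only an illustration, assuming the volume (a:/gv0),
mount point, and file name mentioned in this thread; the names
timed_cold_read, run, and DRY_RUN are hypothetical helpers introduced
here, not part of glusterfs or of the original test.

```shell
#!/bin/sh
# Sketch of a cold-cache timed read. VOLUME and MOUNTPOINT default to the
# values used in this thread; override them in the environment if needed.
VOLUME="${VOLUME:-a:/gv0}"
MOUNTPOINT="${MOUNTPOINT:-/mnt/glusterfs}"

# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: remount the volume so the client-side cache is
# empty, then time a single copy of the named file to /tmp.
timed_cold_read() {
    file="$1"
    run umount "$MOUNTPOINT"
    run mount -t glusterfs "$VOLUME" "$MOUNTPOINT"
    start=$(date +%s)
    run /bin/cp -f "$MOUNTPOINT/$file" /tmp
    end=$(date +%s)
    echo "elapsed: $((end - start))s"
}
```

With DRY_RUN=1 set, calling "timed_cold_read one_hundred_mb_file" only
prints the commands it would run; without it, the script needs root
privileges and an already-mounted volume, since the first step is a
umount.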
Almost universally, the first timed "cp" of a 100M file takes minutes.
This is true for clients in both regions and regardless of how I mount
the volume (with/without read-subvolume/read-subvolume-index).
Occasionally, however (maybe once in every 20 first reads), glusterfs
will surprise me and give times (reads of ~5-20s), which align with what
I'd expect if it were going to a same-region glusterfs replica. I have
never, however, seen this repeated: if a 100M file copies in under 20s
and I immediately follow it up with a copy of another 100M file, the
second file will always take many minutes.
It appears that cluster.read-subvolume and cluster.read-subvolume-index
have no impact when passed as part of the client's mount command. I
note that if I set this at the volume level (gluster volume set gv0
cluster.read-subvolume gv0-client-0), the impact is immediate: those
lucky clients on the "right side" of the divide get fast times, while
those on the "other side" get poor times. Again, however, I see no
impact trying to override this as part of the mount command on the client.
So, maybe passing these options in the mount command doesn't work/is a
no-op, but what I don't understand is why reads go to the remote
replica at all, given that there is no measure by which glusterfs
should ever conclude that the replica in the "other" region is faster
than the replica in the "same" region. In fact, it appears as though
glusterfs is *preferring* the slower replica.