> Am I misunderstanding cluster.read-subvolume/cluster.read-subvolume-index?
>
> I have two regions, "A" and "B" with servers "a" and "b" in,
> respectively, each region. I have clients in both regions. Intra-region
> communication is fast, but the pipe between the regions is terrible.
> I'd like to minimize inter-region communication to as close to glusterfs
> write operations only and have reads go to the server in the region the
> client is running in.
>
> I have created a replica volume as:
> gluster volume create gv0 replica 2 a:/data/brick1/gv0
> b:/data/brick1/gv0 force
>
> As a baseline, if I use scp to copy from the brick directly, I get --
> for a 100M file -- times of about 6s if the client scps from the server
> in the same region and anywhere from 3 to 5 minutes if the client scps
> from the server in the other region.
>
> I was under the impression (from something I read but can't now find)
> that glusterfs automatically picks the fastest replica, but that has not
> been my experience; glusterfs seems to generally prefer the server in
> the other region over the "local" one, with times usually in excess of 4
> minutes.

The choice of which replica to read from has become rather complicated
over time. The first parameter that matters is cluster.read-hash-mode,
which selects between dynamic and (two forms of) static selection. For
the default mode, we try to spread the read load across replicas based
on both the file's ID and the client's. For read-hash-mode=0 *only*,
we do this:

 * If "choose-local" is set (as it is by default) and there's a local
   replica, use that.

 * Otherwise, select a replica based on fastest *initial* response.

Note that these are both a bit prone to hot spots, which is why this
method is not the default.
Also, re-evaluating response times is as likely to lead to "mobile
hotspot" behavior as anything else - clients keep following each other
around to previously idle but now overloaded replicas, moving the
congestion around but never resolving it. Thus, we only tend to
re-evaluate in response to brick up/down events. Probably some room
for improvement here.

That brings us to read-subvolume and read-subvolume-index. The
difference between them is that read-subvolume takes a translator
*name* (which you'd have to get from the volfile) and only applies to
one replica set within a volume. It's really only useful for testing
and debugging. By contrast, read-subvolume-index applies to all
replica sets in a volume and doesn't require any knowledge of
translator names. Either one is used *before* read-hash-mode; if it's
set, and if the corresponding replica is up, it will be chosen.

Yes, it's a bit of a mess. However, as you've clearly guessed, this is
a pretty critical decision so it's nice to have many different ways to
control it.

> I've also tried having clients mount the volume using the "xlator"
> options cluster.read-subvolume and cluster.read-subvolume-index, but
> neither seem to have any impact. Here are sample mount commands to show
> what I'm attempting:
>
> mount -t glusterfs -o xlator-option=cluster.read-subvolume=gv0-client-<0
> or 1> a:/gv0 /mnt/glusterfs
> mount -t glusterfs -o xlator-option=cluster.read-subvolume-index=<0 or
> 1> a:/gv0 /mnt/glusterfs

I would guess that the translator options are somehow not being passed
all the way through to the translator that actually makes the
decision. If they are being passed, they definitely should "force the
decision" as described above. There might be a bug here, or perhaps
I'm just misunderstanding code I haven't read in a while.

Also, please note that synchronous replication (AFR) isn't really
intended or expected to work over long distances.
Anything over 5ms RTT is risky territory; that's why we have separate
geo-replication.
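
To make the order of checks concrete, here is a rough Python sketch of
the read-replica decision as described in this message. It is not the
AFR code. The names (Replica, pick_read_replica, first_response_ms) are
made up for illustration, only read-subvolume-index is modeled (not the
name-based read-subvolume), and the hashed branch is a stand-in: mode 0
is the only value spelled out above, so every other value is lumped
into one "spread by file ID and client ID" branch.

```python
import zlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    name: str                       # hypothetical label, e.g. "a" or "b"
    up: bool = True                 # brick currently reachable
    local: bool = False             # whatever choose-local counts as local
    first_response_ms: float = 0.0  # latency of the *initial* response only


def pick_read_replica(
    replicas: List[Replica],
    file_id: str,
    client_id: str,
    read_subvolume_index: Optional[int] = None,
    read_hash_mode: int = 1,        # assumption: any non-zero value = hashed
    choose_local: bool = True,
) -> Optional[Replica]:
    # 1. A forced index is consulted before read-hash-mode, but only
    #    wins if the corresponding replica is up.
    if read_subvolume_index is not None:
        if 0 <= read_subvolume_index < len(replicas):
            forced = replicas[read_subvolume_index]
            if forced.up:
                return forced

    live = [r for r in replicas if r.up]
    if not live:
        return None

    # 2. read-hash-mode=0 only: local replica first, else the replica
    #    with the fastest initial response (not re-measured later).
    if read_hash_mode == 0:
        if choose_local:
            for r in live:
                if r.local:
                    return r
        return min(live, key=lambda r: r.first_response_ms)

    # 3. Otherwise spread reads by hashing file ID and client ID.
    #    crc32 is a placeholder, not the hash the real code uses.
    key = f"{file_id}:{client_id}".encode()
    return live[zlib.crc32(key) % len(live)]
```

With a setup like the one in the question, calling
pick_read_replica(replicas, file_id, client_id, read_subvolume_index=1)
returns the second replica whenever it is up, regardless of the hash
mode. That is the behavior the mount options should produce if they
reach the translator.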