October 14 2015 3:11 PM, "Manoj Pillai" <mpillai@xxxxxxxxxx> wrote:

> E.g. 3x number of bricks could be a problem if workload has
> operations that don't scale well with brick count.

Fortunately we have DHT2 to address that.

> Plus the brick
> configuration guidelines would not exactly be elegant.

And we have Heketi to address that.

> FWIW, if I look at the performance and perf regressions tests
> that are run at my place of work (as these tests stand today), I'd
> expect AFR to significantly outperform this design on reads.

Reads tend to be absorbed by caches above us, *especially* in read-only workloads. See Rosenblum and Ousterhout's 1992 log-structured file system paper, and about a bazillion others ever since. We need to be concerned at least as much about write performance, and NSR's write performance will *far* exceed AFR's because AFR uses neither networks nor disks efficiently. It splits client bandwidth between N replicas, and it sprays writes all over the disk (data blocks plus inode plus index). Most other storage systems designed in the last ten years can turn that into nice sequential journal writes, which can even be on a separate SSD or NVMe device (something AFR can't leverage at all).

Before work on NSR ever started, I had already compared AFR many times to other file systems (e.g. Ceph and MooseFS) that use these same methods and data flows. Consistently, I'd see that the difference was quite a bit more than theoretical. Despite all of the optimization work we've done on it, AFR's write behavior is still a huge millstone around our necks.

OK, let's bring some of these thoughts together. If you've read Hennessy and Patterson, you've probably seen this formula before.

  value (of an optimization) = benefit_when_applicable * probability
                               - penalty_when_inapplicable * (1 - probability)

If NSR's write performance is significantly better than AFR's, and write performance is either dominant or at least highly relevant for most real workloads, what does that mean for performance overall? As prototyping showed long ago, it means a significant improvement. Is it *possible* to construct a read-dominant workload that shows something different? Of course it is. It's even possible that write performance will degrade in certain (increasingly rare) physical configurations. No design is best for every configuration and workload. Some people tried to focus on the outliers when NSR was first proposed. Our competitors will be glad to do the same, for the same reason - to keep their own pet designs from looking too bad. The important question is whether performance improves for *most* real-world configurations and workloads.

NSR is quite deliberately somewhat write-optimized, because that's where we were the furthest behind and because it's the harder problem to solve. Optimizing for read-only workloads leaves users with any other kind of workload in a permanent hole. Also, even for read-heavy workloads where we might see a deficit, we have not one but two workarounds. One (brick splitting) we've just discussed, and it is quite deliberately being paired with other technologies in 4.0 to make it more effective. The other (read from non-leaders) is also perfectly viable. It's not the default because it reduces consistency to AFR levels, which I don't think serves our users very well. However, if somebody's determined to make AFR comparisons, then it's only fair to compare at the same consistency level.
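To put rough numbers on the fan-out and formula arguments above, here's a back-of-envelope sketch (Python, purely illustrative). The link speed, replica count, write fraction, and read-side penalty are numbers I made up for the example, not measurements from either implementation:

    # Back-of-envelope model, not a benchmark.
    # Assumed numbers: 10 Gb/s client link, 3-way replication, 70% writes.
    CLIENT_BW = 10.0   # Gb/s on the client's network link (assumed)
    REPLICAS = 3

    # Client-side replication (AFR-style): the client sends every write to
    # all N replicas itself, so its link is split N ways.
    afr_write_bw = CLIENT_BW / REPLICAS

    # Leader-based replication (NSR-style): the client sends each write
    # once, to the leader, which forwards it over the server-side network.
    nsr_write_bw = CLIENT_BW

    # The Hennessy & Patterson weighing from above:
    #   value = benefit_when_applicable * probability
    #           - penalty_when_inapplicable * (1 - probability)
    p_write = 0.7                          # assumed fraction of write traffic
    benefit = nsr_write_bw - afr_write_bw  # gain when the optimization applies
    penalty = 0.5                          # assumed read-side cost, in Gb/s

    value = benefit * p_write - penalty * (1 - p_write)
    print(f"AFR-style effective write bandwidth: {afr_write_bw:.2f} Gb/s")
    print(f"NSR-style effective write bandwidth: {nsr_write_bw:.2f} Gb/s")
    print(f"net value for this mix: {value:.2f} Gb/s")

Plug in your own mix and penalty; the point is just that for write-heavy mixes the client-side fan-out term dominates, and (with these example numbers) the read-side penalty would have to exceed the link speed itself before the value went negative.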
Giving users the ability to decide on such tradeoffs, instead of forcing one choice on everyone, has been part of NSR's design since day one. I'm not saying your concern is invalid, but NSR's leader-based approach is *essential* to improving write performance - and thus performance overall - for most use cases. It's also essential to improving functional behavior, especially with respect to split brain, and I consider that even more important. Sure, reads don't benefit as much. They might even get worse, though that remains to be seen and is only likely to be true in certain scenarios. As long as we know how to work around that, is there any need to dwell on it further?

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel