> > The situation gets even worse if the bit-stealing is done at other
> > levels than at the bricks, and I haven't seen any such proposals that
> > deal with issues such as needing to renumber when disks are added or
> > removed.  At scale, that's going to happen a lot.  The numbers get
> > worse again if we split bricks ourselves, and I haven't seen any
> > proposals to do things that we need to do any other way.  Also, the
> > failure mode with this approach - infinite looping in readdir,
> > possibly even in our own daemons - is pretty catastrophic.
>
> Any recent Linux client at least should just fail in this case

Why would it just fail?  It's continuing to receive (what appear to be)
valid entries.  Is there code in the Linux NFS client to detect loops or
duplicates?

> and it
> shouldn't be hard to similarly fix any such daemons to detect loops and
> minimize the damage.  (Though there still may be clients you can't fix.)

We can certainly detect loops in our own daemons, at the cost of adding
yet another secondary fix for problems introduced by the primary one.
We can almost as certainly not fix all clients that our users might
deploy.  That includes older Linux clients, BSD clients, Mac clients,
Windows clients, and who-knows-what more exotic beasties.

> > By contrast, the failure mode for the map-caching approach - a simple
> > failure in readdir - is relatively benign.  Such failures are also
> > likely to be less common, even if we adopt the *unprecedented*
> > requirement that the cache be strictly space-limited.  If we relax that
> > requirement, the problem goes away entirely.
>
> Note NFS clients normally expect to be able to survive server reboots,
> so a complete solution requires a persistent cache.

It's not ideal that an NFS server (GlusterFS client) crash would result
in an NFS client's readdir failing.  On the other hand, one might
reasonably expect such events to be very rare, and not to recur every
time somebody tries to access the same directory.
If I were a storage administrator, I'd prefer that scenario to one in
which clients (or daemons) repeatedly spin out of control as long as the
directory is subject to an unpredictable condition (entries hashing to
the same N bits).

> My worry is that the map-caching solution will be more complicated and
> also have some failures in odd corner cases.

Yes, it will add complexity.  It might have odd corner cases.  On the
other hand, the bit-stealing approach also adds complexity and our users
are already suffering from failures in what can no longer be called
corner cases.  64 bits just isn't enough for both a sufficiently large
brick number and a sufficiently collision-resistant hash.  Even if we
could get d_off to expand to 128 bits, we wouldn't be able to rely on
that for years.  Therefore, even if we solve issues like brick
renumbering, we'll be stuck in this infinite loop having this same
conversation every year or so until we change our approach.  However
inconvenient or imperfect an alternative might be, it's our only way
forward.
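
For a rough sense of the 64-bit squeeze, here is a back-of-the-envelope
sketch (Python, not GlusterFS code).  It assumes the 64-bit d_off is
split into brick-number bits and hash bits, and that the hash is
uniform, so the usual birthday approximation applies.  The function
names, the bit splits, and the directory size are all made up for
illustration.

```python
import math


def hash_bits_left(brick_bits: int, total_bits: int = 64) -> int:
    """Bits remaining for the name hash after stealing brick_bits
    from a total_bits-wide d_off (hypothetical layout)."""
    if not 0 <= brick_bits <= total_bits:
        raise ValueError("brick_bits must be between 0 and total_bits")
    return total_bits - brick_bits


def collision_probability(entries: int, hash_bits: int) -> float:
    """Birthday approximation: probability that at least two of
    `entries` names share the same hash_bits-bit hash value,
    assuming a uniformly distributed hash."""
    space = 2.0 ** hash_bits
    # 1 - exp(-n(n-1) / 2N), computed with expm1 to keep precision
    # when the probability is tiny.
    return -math.expm1(-entries * (entries - 1) / (2.0 * space))


if __name__ == "__main__":
    entries = 100_000  # illustrative directory size, not a measurement
    for brick_bits in (0, 8, 16, 24, 32):
        bits = hash_bits_left(brick_bits)
        p = collision_probability(entries, bits)
        print(f"brick bits={brick_bits:2d}  hash bits={bits:2d}  "
              f"P(collision)~{p:.3e}")
```

Running it prints the approximate collision probability for one
directory at each split.  The real numbers depend on the actual hash
and layout, which this does not model, but the trend is the point:
every bit given to the brick number roughly doubles the odds of two
entries landing on the same offset.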