On Mon, Dec 22, 2014 at 12:04:03PM -0500, Jeff Darcy wrote:
> > > The situation gets even worse if the bit-stealing is done at other
> > > levels than at the bricks, and I haven't seen any such proposals
> > > that deal with issues such as needing to renumber when disks are
> > > added or removed. At scale, that's going to happen a lot. The
> > > numbers get worse again if we split bricks ourselves, and I haven't
> > > seen any proposals to do things that we need to do any other way.
> > > Also, the failure mode with this approach - infinite looping in
> > > readdir, possibly even in our own daemons - is pretty catastrophic.
> >
> > Any recent Linux client at least should just fail in this case
>
> Why would it just fail? It's continuing to receive (what appear to be)
> valid entries. Is there code in the Linux NFS client to detect loops
> or duplicates?

Yes, exactly.

> > and it shouldn't be hard to similarly fix any such daemons to detect
> > loops and minimize the damage. (Though there still may be clients
> > you can't fix.)
>
> We can certainly detect loops in our own daemons, at the cost of adding
> yet another secondary fix for problems introduced by the primary one.
> We can almost as certainly not fix all clients that our users might
> deploy. That includes older Linux clients, BSD clients, Mac clients,
> Windows clients, and who-knows-what more exotic beasties.

Agreed. Well, I haven't actually tested any clients, and I'd consider
the failure to handle a loop a (mild) client bug, but I wouldn't be
surprised if it's a common bug.

> > > By contrast, the failure mode for the map-caching approach - a
> > > simple failure in readdir - is relatively benign. Such failures
> > > are also likely to be less common, even if we adopt the
> > > *unprecedented* requirement that the cache be strictly
> > > space-limited. If we relax that requirement, the problem goes away
> > > entirely.
> >
> > Note NFS clients normally expect to be able to survive server
> > reboots, so a complete solution requires a persistent cache.
>
> It's not ideal that an NFS server (GlusterFS client) crash would result
> in an NFS client's readdir failing. On the other hand, one might
> reasonably expect such events to be very rare, and not to recur every
> time somebody tries to access the same directory. If I were a storage
> administrator, I'd prefer that scenario to one in which clients (or
> daemons) repeatedly spin out of control as long as the directory is
> subject to an unpredictable condition (entries hashing to the same N
> bits).
>
> > My worry is that the map-caching solution will be more complicated
> > and also have some failures in odd corner cases.
>
> Yes, it will add complexity. It might have odd corner cases. On the
> other hand, the bit-stealing approach also adds complexity and our
> users are already suffering from failures in what can no longer be
> called corner cases. 64 bits just isn't enough for both a sufficiently
> large brick number and a sufficiently collision-resistant hash. Even
> if we could get d_off to expand to 128 bits, we wouldn't be able to
> rely on that for years. Therefore, even if we solve issues like brick
> renumbering, we'll be stuck in this infinite loop having this same
> conversation every year or so until we change our approach. However
> inconvenient or imperfect an alternative might be, it's our only way
> forward.

Maybe. Could we get a sketch of the design with a good description of
the failure cases?
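As an aside, for anyone following along, here is roughly what I
understand the bit-stealing transform to look like. This is purely an
illustrative sketch, not GlusterFS's actual code: BRICK_BITS, the
doff_* helpers, and the choice of which bits get dropped are all
invented for the example. The only point is that every bit handed to
the brick number is a bit taken away from the per-brick offset/hash,
which is where the skip-or-loop collisions come from:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BRICK_BITS 10                  /* example: room for 1024 bricks */
    #define OFF_BITS   (64 - BRICK_BITS)
    #define OFF_MASK   ((UINT64_C(1) << OFF_BITS) - 1)

    /*
     * Fold a brick id into the top BRICK_BITS of a 64-bit readdir cookie.
     * Whatever part of the backend offset doesn't fit in the remaining
     * OFF_BITS is thrown away, so any two entries whose offsets agree in
     * the surviving bits become indistinguishable - that's the collision
     * that makes a client skip entries or loop.
     */
    static uint64_t doff_encode(uint64_t brick_off, unsigned brick_id)
    {
            return ((uint64_t)brick_id << OFF_BITS) | (brick_off & OFF_MASK);
    }

    static unsigned doff_brick(uint64_t d_off)
    {
            return (unsigned)(d_off >> OFF_BITS);
    }

    static uint64_t doff_offset(uint64_t d_off)
    {
            return d_off & OFF_MASK;
    }

    int main(void)
    {
            /* Two distinct hash-style offsets that agree in their low
             * OFF_BITS bits collapse to the same cookie on brick 3. */
            uint64_t a = (UINT64_C(1) << OFF_BITS) | 0x1234;
            uint64_t b = (UINT64_C(2) << OFF_BITS) | 0x1234;

            printf("a=%016" PRIx64 " -> cookie %016" PRIx64
                   " (brick %u, off %" PRIx64 ")\n",
                   a, doff_encode(a, 3), doff_brick(doff_encode(a, 3)),
                   doff_offset(doff_encode(a, 3)));
            printf("b=%016" PRIx64 " -> cookie %016" PRIx64 "\n",
                   b, doff_encode(b, 3));
            return 0;
    }

With BRICK_BITS at 10 only 54 bits of the per-brick offset survive; if
you assume those bits behave like a uniform hash, the usual birthday
estimate gives a single 16M-entry directory something like a 1-in-128
chance of an internal collision, and a big deployment has a lot of
directories. Raise BRICK_BITS for more bricks and it only gets worse.
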
It'd also be nice to see any proposals for a completely correct
solution, even if it's something that will take a while. All I can
think of is protocol extensions, but that's just what I know.

I don't love the bit-stealing hack either, but in practice keep in mind
this all seems to be about ext4. If you want reliable nfs readdir with
16M-entry directories and all the rest you can get that already with
xfs.

--b.

_______________________________________________
Gluster-devel mailing list
Gluster-devel@xxxxxxxxxxx
http://www.gluster.org/mailman/listinfo/gluster-devel