> The birthday paradox says that with a 44-bit hash we're more likely than
> not to start seeing collisions somewhere around 2^22 directory entries.
> That 16-million-entry-directory would have a lot of collisions.

This is really the key point. The risks of the bit-stealing approach have been understated, and the costs of the map-caching approach overstated. DFS deployments on the order of 20K disks are no longer remarkable, and those numbers are only going to increase. If each disk is a brick, which is the most common approach, we'll need *at least* 16 of those bits for ourselves. That leaves 48 bits, and a high probability of collision at around 2^24, or 16M, files (back-of-the-envelope math at the end of this message). Is a 16M-file directory a good idea? Of course not. Do they exist in the wild? Definitely yes.

The situation gets even worse if the bit-stealing is done at levels other than the bricks, and I haven't seen any such proposal that deals with issues like needing to renumber when disks are added or removed. At scale, that's going to happen a lot. The numbers get worse again if we split bricks ourselves, and I haven't seen any proposal that gets done what we need done without such splitting.

Also, the failure mode of this approach - infinite looping in readdir, possibly even in our own daemons - is pretty catastrophic. By contrast, the failure mode of the map-caching approach - a simple failure of one readdir call - is relatively benign (sketches of both approaches follow at the end of this message). Such failures are also likely to be less common, even if we adopt the *unprecedented* requirement that the cache be strictly space-limited. If we relax that requirement, the problem goes away entirely. The number of concurrent readdirs is orders of magnitude smaller than the number of files per directory, and we should take advantage of that. Map-caching also has no renumbering problem.

The bit-stealing approach seemed clever until the first round of failures. After that first round it seemed less clever. After the second it seems unwise. After a third it will seem irresponsible. That wording might seem harsh, but anyone who has actually had to stand in front of users and explain why this was ever a problem is likely to have heard worse. Some users are reporting these problems *right now*. Do we have any volunteers to ask them whether they'd like us to keep pursuing an approach that rests on shaky assumptions and has already failed twice?
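
For the record, here's the back-of-the-envelope math behind the 2^22 and 2^24 figures. It's just the standard birthday approximation; nothing here is Gluster-specific:

    /* Birthday bound: with d = 2^bits equally likely values, the odds
     * of at least one collision among n entries are approximately
     *   p(n) ~= 1 - exp(-n*(n-1) / (2*d))
     * which crosses 50% at n ~= 1.1774 * sqrt(d), i.e. about 2^(bits/2). */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        const int bits[] = { 44, 48 };
        for (int i = 0; i < 2; i++) {
            double d   = ldexp(1.0, bits[i]);   /* 2^bits */
            double n50 = 1.1774 * sqrt(d);      /* entries at p ~= 0.5 */
            printf("%2d-bit hash: even odds of a collision at ~%.1fM entries\n",
                   bits[i], n50 / 1e6);
        }
        return 0;
    }

That prints roughly 4.9M for 44 bits and 19.8M for 48 bits, so 2^22 and 2^24 are the right order of magnitude. Note also that every two bits we steal cut the threshold in half, which is why stealing bits at multiple levels makes the numbers so much worse.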
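
For anyone who hasn't followed the thread closely, bit-stealing looks roughly like this: pack a brick ID into the high bits of the 64-bit d_off that readdir returns, and mask the rest. This is a minimal sketch under my own naming (BRICK_BITS, doff_encode, etc.), not actual translator code:

    #include <stdint.h>

    #define BRICK_BITS 16                  /* 64K bricks; 20K-disk systems already exist */
    #define OFF_BITS   (64 - BRICK_BITS)   /* 48 bits left for the per-entry offset */
    #define OFF_MASK   ((UINT64_C(1) << OFF_BITS) - 1)

    /* Pack the source brick into the high bits of the returned cookie.
     * The masking silently discards high offset bits -- that loss is
     * exactly where the collisions come from. */
    static inline uint64_t doff_encode(uint32_t brick, uint64_t off)
    {
        return ((uint64_t)brick << OFF_BITS) | (off & OFF_MASK);
    }

    /* On the next readdir, recover which brick to resume from... */
    static inline uint32_t doff_brick(uint64_t d_off)
    {
        return (uint32_t)(d_off >> OFF_BITS);
    }

    /* ...and where within it.  If two entries collide in the low 48
     * bits, resuming can replay entries we already returned -- the
     * infinite readdir loop described above. */
    static inline uint64_t doff_offset(uint64_t d_off)
    {
        return d_off & OFF_MASK;
    }

The brick ID baked into every cookie is also what creates the renumbering problem: add or remove a disk, and every outstanding cookie points at the wrong brick.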
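
The map-caching alternative keeps the real position on our side and hands back a small opaque cookie instead. Again a hypothetical sketch with made-up names; a real version would hang entries off the readdir session and evict by LRU rather than use a fixed global table:

    #include <stdint.h>

    /* One live readdir cursor: which brick we're on and the real
     * offset within it.  No bits are stolen from d_off at all. */
    struct rd_cursor {
        uint32_t brick;
        uint64_t off;
        int      in_use;
    };

    #define MAX_CURSORS 4096   /* sized by concurrent readdirs, not files per directory */
    static struct rd_cursor cursors[MAX_CURSORS];

    /* Stash the real position; return a small cookie to use as d_off. */
    static int64_t cursor_store(uint32_t brick, uint64_t off)
    {
        for (int i = 0; i < MAX_CURSORS; i++) {
            if (!cursors[i].in_use) {
                cursors[i] = (struct rd_cursor){ brick, off, 1 };
                return i;
            }
        }
        return -1;  /* cache full: that one readdir fails cleanly */
    }

    /* Look the cookie back up on the next readdir call. */
    static int cursor_lookup(int64_t cookie, uint32_t *brick, uint64_t *off)
    {
        if (cookie < 0 || cookie >= MAX_CURSORS || !cursors[cookie].in_use)
            return -1;
        *brick = cursors[cookie].brick;
        *off   = cursors[cookie].off;
        return 0;
    }

    /* Forget a cookie (on releasedir, or via LRU eviction -- omitted). */
    static void cursor_drop(int64_t cookie)
    {
        if (cookie >= 0 && cookie < MAX_CURSORS)
            cursors[cookie].in_use = 0;
    }

The point of the sketch is the sizing and the failure mode: the table scales with concurrent readdirs rather than files per directory, and when it's full the worst case is one failed readdir, not an infinite loop. Relax the strict space limit and even that failure disappears.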