Re: regressions due to 64-bit ext4 directory cookies

"J. Bruce Fields" <bfields@xxxxxxxxxxxx> · Wed, 13 Feb 2013 17:18:01 -0500

On Wed, Feb 13, 2013 at 09:17:28AM +0100, Bernd Schubert wrote:
> On 02/12/2013 10:00 PM, J. Bruce Fields wrote:
> >On Tue, Feb 12, 2013 at 09:56:41PM +0100, Bernd Schubert wrote:
> >>On 02/12/2013 09:28 PM, J. Bruce Fields wrote:
> >>>06effdbb49af5f6c "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)"
> >>>and previous patches solved problems with hash collisions in large
> >>>directories by using 64- instead of 32- bit directory hashes in some
> >>>cases.  But it caused problems for users who assume directory offsets
> >>>are "small".  Two cases we've run across:
> >>>
> >>>	- older NFS clients: 64-bit cookies cause applications on many
> >>>	  older clients to fail.
> >>>	- gluster: gluster assumed that it could take the top bits of
> >>>	  the offset for its own use.
> >>>
> >>>In both cases we could argue we're in the right: the nfs protocol
> >>>defines cookies to be 64 bits, so clients should be prepared to handle
> >>>them (remapping to smaller integers if necessary to placate applications
> >>>using older system interfaces).  And gluster was incorrect to assume
> >>>that the "offset" was really an "offset" as opposed to just an opaque
> >>>value.
> >>>
> >>>But in practice things that worked fine for a long time break on a
> >>>kernel upgrade.
> >>>
> >>>So at a minimum I think we owe people a workaround, and turning off
> >>>dir_index may not be practical for everyone.
> >>>
> >>>A "no_64bit_cookies" export option would provide a workaround for NFS
> >>>servers with older NFS clients, but not for applications like gluster.
> >>>
> >>>For that reason I'd rather have a way to turn this off on a given ext4
> >>>filesystem.  Is that practical?
> >>
> >>I think Ted needs to answer if he would accept another mount option. But
> >>before we are going this way, what is gluster doing if there are hash
> >>collions?
> >
> >They probably just haven't tested NFS with large enough directories.
> 
> Is it only related to NFS or generic readdir over gluster?
> 
> >The birthday paradox says you'd need about 2^16 entries to have a 50-50
> >chance of hitting the problem.
> 
> We are frequently running into it with 50000 files per directory.
> 
> >
> >I don't know enough about ext4 directory performance.  But unfortunately
> >I suspect there's a range of directory sizes that are too small to have
> >a significant chance of having directory collisions, but still large
> >enough to need dir_index?
> 
> Here is a link to the initial benchmark:
> http://search.luky.org/linux-kernel.2001/msg00117.html

Hm, so I still don't have a good feeling for when dir_index is likely to
start winning.

For comparison, assuming the probability of seeing a failure due to hash
collisions in an n-entry directory is the probability of a collision
among n numbers chosen uniformly at random from 2^31, that's about:

	 0.0002% for n=  100
	 0.006 % for n=  500
	 0.02  % for n= 1000
	 0.6   % for n= 5000
	 2     % for n=10000

So if we could tell anyone with directories smaller than 10,000 entries:
"hey, you don't need dir_index anyway, just turn it off"--good, the only
people still forced to deal with 64-bit cookies will be the ones that
have probably already found that ext4 isn't reliable for their purposes.

If there are people with only a few hundred entries who still need
dir_index--well, we may be making them unhappy as we're making them
suffer to fix a bug that they've never actually seen.

--b.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html